Why RowCount() and Order By are expensive in Hadoop and other distributed environments

Orderby() and RowCount()

There are numerous reasons to include the RowCount() function in your relational data store, the most common being the addition of a primary key column in your data sets.  Sounds simple, but this can be a challenge for a distributed environment.  Let’s look at an example using hive on a two node Hadoop Cluster.

RowCount()

Say we have the following data in a hive table and we want to add a primary key column as an identity.

 CountryName

Continent

USA

North America

China

Asia

Brazil

South America

South Africa

Africa

Chile

South America

My first thought was to just insert the data into a new table using the RowCount() function but I had the following results

Identity

 CountryName

Continent

1

USA

North America

2

China

Asia

1

Brazil

South America

2

South Africa

Africa

3

Chile

South America

So what happened???  Well remember we have a two node Hadoop cluster which is each processing a portion of our data… This means our data was split between two nodes and the RowCount() occurred on two separate “pieces” of data

Machine generated alternative text:<br /> CountryName<br /> ιιςΑ<br /> China<br /> Bralil<br /> South Africa<br /> Chile<br /> CountryName<br /> China<br /> Continent<br /> North America<br /> South America<br /> Africa<br /> South America<br /> CountryName<br /> South Africa<br /> Chile<br /> Continent<br /> North America<br /> Continent<br /> SOllth<br /> Africa<br /> South America<br /> ldentity CountryName<br /> China<br /> Node 2<br /> ldentitv CountrvName<br /> South Africa<br /> Chile<br /> Continent<br /> North America<br /> ldentity CountryName<br /> China<br /> Bralil<br /> South Africa<br /> Chile<br /> Continent<br /> South America<br /> Africa<br /> South America<br /> Continent<br /> North America<br /> South America<br /> Africa<br /> South America

With a dataset of this size the solution is fairly simple, force all data to a single node for the RowCount(). 

Machine generated alternative text:<br /> Coun me<br /> China<br /> Brazil<br /> South Africa<br /> Chile<br /> China<br /> Brazil<br /> South Africa<br /> Chile<br /> Continent<br /> North America<br /> South America<br /> Africa<br /> South America<br /> Continent<br /> North America<br /> South America<br /> Africa<br /> South America<br /> Ncxie I<br /> Identity CountryName<br /> China<br /> Brazil<br /> South Africa<br /> Continent<br /> North America<br /> South America<br /> Africa<br /> South America<br /> Identity CountryName<br /> USA<br /> China<br /> Brazil<br /> South Africa<br /> Chile<br /> Continent<br /> North America<br /> South America<br /> Africa<br /> South America

This becomes troublesome when the size of the data becomes absolutely monstrous, since this creates a bottleneck within your processing.  Unfortunately, calculations like this are a necessary evil sometimes.

 

Sort by vs Order By

This can also be seen in the hive Sort By and Order By functions.  Order By guarantees the order of the entire output where sort by guarantees the order of each data “piece” before sending it to the reducer.  AKA.  Order By must send all the data to a single node, where Sort By can still be distributed, saving serious amounts of time and compute. 

Now you know how code like this could have massive impact on the query performance and resource utilization of your distributed environment! 

~Andrew

Leave a Reply

Your email address will not be published. Required fields are marked *