Category Archives: HDInsight

Building your first Hadoop Jar with maven and eclipse

This guide walks through creating your first Hadoop program, but it skips over some important details, like how to compile a JAR file… It relies on the assumption that you’ve already compiled all of your Java code. A small assumption, but if you come from a C# programming background this may be confusing. Don’t worry, it’s fairly easy if you use Eclipse as an IDE and Maven to manage the project dependencies.

Continue reading

5 Apache Spark Training Videos with the HDInsight Team

Let’s learn Spark!

After looking around for readiness materials on getting started with Apache Spark, I noticed a lack of videos explaining the main components! Instead of waiting around for them to be created, we took things into our own hands. Today we’re releasing 5 videos, each with an HDInsight product team member, covering the pieces of HDInsight they specifically work on or own. The videos were made by developers, for developers, and therefore contain primarily technical content. It’s advised to have an understanding of Hadoop before jumping in, so make sure you’re up for it! You can access all of the Apache Spark training videos by clicking below.

Continue reading

Setting Hive Variables in Hadoop


Taking Hive from the world of demos into production code almost always involves setting Hive variables within your production script. You can set Hive variables for table names, locations, partitions, etc. For this example we are going to use some sample data that comes on an HDInsight cluster to play with variables.
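To make that concrete, here is a minimal sketch (assuming the stock hivesampletable that ships on HDInsight clusters; the variable names are my own):

```sql
-- Define variables in the hivevar namespace; names here are illustrative
SET hivevar:srctable=hivesampletable;
SET hivevar:rowlimit=10;

-- ${...} is substituted with the variable's value before the query runs
SELECT deviceplatform, devicemake
FROM ${srctable}
LIMIT ${rowlimit};
```

The same variables can also be supplied from the command line, e.g. `hive --hivevar srctable=hivesampletable -f script.hql`, which is what makes them handy for parameterizing production scripts.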

Continue reading

Why RowCount() and Order By are expensive in Hadoop and other distributed environments

OrderBy() and RowCount()

There are numerous reasons to include the RowCount() function in your relational data store, the most common being the addition of a primary key column to your data sets. Sounds simple, but this can be a challenge in a distributed environment. Let’s look at an example using Hive on a two-node Hadoop cluster.
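As a sketch of the problem (my own example, assuming the stock hivesampletable), the standard Hive way to number rows is ROW_NUMBER() over a global ORDER BY, and it is exactly that global ordering that hurts:

```sql
-- A total ordering is required to assign sequential numbers, so every row
-- must funnel through a single reducer -- this step cannot be spread
-- across the cluster's nodes, no matter how many you have.
SELECT ROW_NUMBER() OVER (ORDER BY clientid) AS row_id,
       clientid,
       querytime
FROM hivesampletable;
```

A common mitigation is to add PARTITION BY so numbering happens per group across many reducers, at the cost of losing a single global sequence.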

Continue reading

New HDInsight Visual Studio Tools


Exciting news for any HDInsight users out there. 

Today we released “HDInsight Visual Studio Tools – July Update”, aka a bunch of totally cool updates to our already kick-ass HDInsight Visual Studio toolkit!

The update includes a smattering of new features, but I’m most excited about the following three:

    1. Hive IDE with full intellisense integration
    2. Hive execution engine for jobs utilizing Tez
    3. Templates for Storm with other Azure services, including DocDB, Event Hubs, and SQL Azure

These HDInsight Visual Studio tools bring us another step closer to full cluster management via VS!


Hive IDE with full IntelliSense integration

What is this?  I can actually write Hive statements as I would code in any other language? YUPPPPPP.  So awesome!  For those of you who never know what comes next in the CREATE TABLE syntax (is LOCATION or ROW FORMAT DELIMITED first?), you are saved!  In case you just woke up from a 10-year nap, IntelliSense provides recommendations for what may come next in your code block, keeping track of all of your created variables and methods.  It now supports almost all Hive DML statements, including subqueries, INSERT OVERWRITE statements, CTAS statements, etc.  Also, columns and built-in UDFs are now automatically suggested so you don’t have to remember a bunch of function names!
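For reference (my own example, not from the release notes), a CTAS statement is exactly the sort of thing the editor now completes for you:

```sql
-- CREATE TABLE AS SELECT (CTAS): materialize a query result as a new table
CREATE TABLE device_counts
AS
SELECT deviceplatform, count(*) AS cnt
FROM hivesampletable
GROUP BY deviceplatform;
```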




Hive execution engine for jobs utilizing Tez

Spend a day working in Hive and you’ll quickly learn how hard it can be to figure out exactly how a query is working under the covers.  Add in the complexity of Tez and it can become almost impossible.  Luckily, Tez runs as a directed acyclic graph (DAG), allowing for more specialized workflow models to be implemented.  The new VS tools provide the ability to easily investigate which workflow was implemented after execution of a Hive job.  This is helpful for many things but is mostly used for performance tuning of queries.  This works for both Windows and Linux based clusters too!
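(If a cluster doesn’t default to Tez, a session can opt in itself; the property below is standard Hive, though your cluster’s default configuration may differ.)

```sql
-- Switch this session's queries from classic MapReduce to the Tez engine
SET hive.execution.engine=tez;

-- EXPLAIN prints the plan Hive will execute, complementing the DAG view in VS
EXPLAIN
SELECT deviceplatform, count(*)
FROM hivesampletable
GROUP BY deviceplatform;
```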



Templates for Storm with Azure services, including DocDB, Event Hubs, SQL Azure, and more

Storm has been a first-class citizen on Azure for a few months, and we have now released a number of samples.  Writing Storm applications that utilize Azure services, or services running on Azure, is infinitely easier to spin up now.  The new templates include example code as well as a plethora of comments, explaining not only what the code does, but why it does it and the pitfalls to watch out for.  There are examples of both reading from and writing to most of the Azure services, and it’s possible to get going in only a few steps.


This is just what I’ve found the most valuable in the new release, but there are many more features, including the addition of Pig scripts, so download the HDInsight Visual Studio Tools and get to exploring!

Happy Developing


Continue reading

3 “hacks” for Hadoop and HDInsight Clusters

Here are 3 useful “hacks” I’ve uncovered while developing Hadoop MapReduce jobs on HDInsight clusters.

1. Use Multiple Storage Accounts with your HDInsight Cluster

Chances are, if you are using an HDInsight cluster, you are dealing with lots and lots of data.  With the announcement of Azure Data Lake this won’t be a problem for long, but right now you may be pushing up against the limits on Azure Storage accounts.  This is easily fixed by storing your data across multiple storage accounts, dramatically increasing the maximum IOPS and capacity with each account added.  A common naming convention can be used to make the association between all of the storage accounts easier, substituting the storage account name for each account associated with the cluster.
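As an illustration (the account and container names here are hypothetical), a shared naming scheme makes the wasb:// locations predictable:

```sql
-- Storage accounts follow one naming scheme (mydatastore1, mydatastore2, ...),
-- so pointing external tables at each account's container is mechanical
CREATE EXTERNAL TABLE logs_part1 (line STRING)
LOCATION 'wasb://data@mydatastore1.blob.core.windows.net/logs/';

CREATE EXTERNAL TABLE logs_part2 (line STRING)
LOCATION 'wasb://data@mydatastore2.blob.core.windows.net/logs/';
```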


You may be thinking, “But Andrew, I spin my clusters up to do processing and I don’t want to manage a configuration file with all of this storage account information!”  Well, luckily you don’t have to…

2.  Use an Azure subscription to store storage account state

We can specify an Azure subscription just for the storage that will be associated with our cluster.  We can then use PowerShell to get a reference to each storage account and attach them to our cluster during creation.  I’ll even give you the script to do it 🙂

#Create the configuration for the new cluster
$HDIClusterConfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes |
    Set-AzureHDInsightDefaultStorage -StorageAccountName $defaultStorageAccountName `
        -StorageAccountKey $defaultStorageAccountKey `
        -StorageContainerName $defaultStorageContainerName |
    Add-AzureHDInsightStorage -StorageAccountName $secondStorageAccountName `
        -StorageAccountKey $secondStorageAccountKey |
    Add-AzureHDInsightMetastore -SqlAzureServerName $metastoreServer `
        -DatabaseName $metastoreDatabase -Credential $metastoreCred `
        -MetastoreType HiveMetastore

#Switch to the subscription where the data is stored
Select-AzureSubscription -SubscriptionName $dataStorageSubscriptionName

#Loop over the storage accounts in the subscription just selected, and add each one to the cluster config
Get-AzureStorageAccount |
    ForEach-Object { Get-AzureStorageKey -StorageAccountName $_.StorageAccountName } |
    ForEach-Object {
        $HDIClusterConfig = Add-AzureHDInsightStorage -StorageAccountKey $_.Primary `
            -StorageAccountName $_.StorageAccountName -Config $HDIClusterConfig
    }

#Switch back to the subscription where your cluster is hosted
Select-AzureSubscription -SubscriptionName $subscriptionName

#Spin up the cluster!
New-AzureHDInsightCluster -Name $clusterName -Location $clusterLocation `
    -Credential $clusterCred -Config $HDIClusterConfig -Version 3.1

3. Create Input splits with blob file location, not data

Occasionally there is a use case where the file you are processing cannot be split without losing some of the data’s integrity.  Think of XML or JSON data and the need to keep each file whole.  With these types of files you can create an InputSplit which includes only the location of the file in Azure Storage, not the data itself.  This string can then be passed to the map task, where the logic to read the data will live.  You’ll need a good grasp of MapReduce operations before continuing.  Now on to more code examples!

How to create your list of InputSplits:

ArrayList<InputSplit> ret = new ArrayList<InputSplit>();

/* Do this for each path we receive. Creates a directory of splits in this order,
   s = input path: (s1,1),(s2,1)…(sN,1),(s1,2),(sN,2),(sN,3) etc. */
for (int i = numMinNameHashSplits; i <= Math.min(numMaxNameHashSplits, numNameHashSplits - 1); i++) {
  for (Path inputPath : inputPaths) {
    ret.add(new ParseDirectoryInputSplit(inputPath.toString(), i));
    System.out.println(i + " " + inputPath.toString());
  }
}
return ret;

Once the List<InputSplit> is assembled, each InputSplit is handed to a RecordReader class, where each key/value pair is read and then passed to the map task.  The initialization of the RecordReader class uses the InputSplit, a string representing the location of a “folder” of invoices in blob storage, to return a list of all blobs within the folder (the blobs variable below).  The Java code below demonstrates the creation of the record reader for each hash slot and the resulting list of blobs in that location.

public class ParseDirectoryFileNameRecordReader
    extends RecordReader<IntWritable, Text> {

  private int nameHashSlot;
  private int numNameHashSlots;
  private Path myDir;
  private Path currentPath;
  private Iterator<ListBlobItem> blobs;
  private int currentLocation;

  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    myDir = ((ParseDirectoryInputSplit) split).getDirectoryPath();

    //getNameHashSlot tells us which slot this record reader is responsible for
    nameHashSlot = ((ParseDirectoryInputSplit) split).getNameHashSlot();

    //gets the total number of hash slots
    numNameHashSlots = getNumNameHashSplits(context.getConfiguration());

    //gets the input credentials for the storage account assigned to this record reader
    String inputCreds = getInputCreds(context.getConfiguration());

    //break the directory path apart to get the account name
    String[] authComponents = myDir.toUri().getAuthority().split("@");
    String accountName = authComponents[1].split("\\.")[0];
    String containerName = authComponents[0];
    String accountKey = Utils.returnInputkey(inputCreds, accountName);
    System.out.println("This mapper is assigned the following account: " + accountName);

    StorageCredentials creds = new StorageCredentialsAccountAndKey(accountName, accountKey);
    CloudStorageAccount account = new CloudStorageAccount(creds);
    CloudBlobClient client = account.createCloudBlobClient();
    CloudBlobContainer container = client.getContainerReference(containerName);
    blobs = container.listBlobs(myDir.toUri().getPath().substring(1) + "/",
        true, EnumSet.noneOf(BlobListingDetails.class), null, null).iterator();
    currentLocation = -1;
  }

Once initialized, the record reader is used to pass the next key to the map task.  This is controlled by the nextKeyValue method, which is called each time the map task asks for a new record.  The Java code below demonstrates this.


//This checks whether the next key/value is assigned to this task or to another mapper.  If it is assigned to this task, the location is passed to the mapper; otherwise return false
public boolean nextKeyValue() throws IOException, InterruptedException {
  while (blobs.hasNext()) {
    ListBlobItem currentBlob = blobs.next();

    //doesBlobMatchNameHash returns a number between 1 and the number of hash slots.  If it matches the number assigned to this mapper and its length is greater than 0, return the path to the map function
    if (doesBlobMatchNameHash(currentBlob) && getBlobLength(currentBlob) > 0) {
      String[] pathComponents = currentBlob.getUri().getPath().split("/");

      String pathWithoutContainer =
          currentBlob.getUri().getPath().substring(pathComponents[1].length() + 1);

      currentPath = new Path(myDir.toUri().getScheme(),
          myDir.toUri().getAuthority(), pathWithoutContainer);

      return true;
    }
  }
  return false;
}

The logic in the map function is then simply as follows, with inputStream containing the entire XML string:

Path inputFile = new Path(value.toString());
FileSystem fs = inputFile.getFileSystem(context.getConfiguration());

//The input stream contains all data from the blob at the location provided by the Text value
FSDataInputStream inputStream = fs.open(inputFile);

Thanks to Mostafa for all the help getting this to work!   

Cool, right!?  Now you can scale your HDInsight clusters to unprecedented size, as well as process XML and JSON objects reliably with the Hadoop framework.

Happy Coding


Microsoft Virtual Academy

Many of you probably know about Microsoft Virtual Academy, but in case you don’t, here is a quick reminder.

What is Microsoft Virtual Academy?

Microsoft Virtual Academy is a place to go to learn about Microsoft technology from those who know it best.  Typically, talks are given by people with close ties to the product teams, or even the developers themselves.  You get a classroom setting without ever having to get out of your basketball shorts.  So order in some food, maybe grab a beverage, and get to watching!


Brand new to Hadoop?  Personally, I think modules 1, 2, 5, and 6 are enough to get you started.


Already a Data Scientist?  Or an aspiring Data Scientist?


Have you been hearing a lot about Streaming Analytics and want to learn the concepts? Check out Module 1 and research Azure Stream Analytics to see the newest tech.


Trying to take your Excel skillz to the next level?  PowerBI Baby