Author Archives: AskTheData

About AskTheData

I'm just a guy living off caffeine, protein bars, and presentations. Looking to speak on all things data!

Tips for delivering an amazing technical presentation

We’ve all been here before…

You’re at a conference and a developer takes the stage to talk about how their product is going to change the world. Fast forward an hour and your neighbor shakes you awake because the presentation has ended. You fell asleep and missed the whole thing! Now maybe you didn’t have enough coffee, but my guess is that the speaker didn’t do enough to keep you engaged.


In this video, I interview David Chappell for his tips on how to deliver an amazing technical presentation. From the content within your presentation to the style you use to deliver it, we talk about how you can be successful. The ability to give an amazing technical talk is one thing that separates good developers from GREAT developers.






Halo, Minecraft, and more! How Microsoft Studios Processes Gaming Event Data.

I like data and I like games. When I heard my friend Karan Gulati’s job includes both of these things, I thought it was too good to be true. Obviously we have to get to the bottom of this… We’ll talk with him about his job in Microsoft Studios and learn about his most recent project creating a data pipeline for an undisclosed game 😉

Continue reading

Building your first Hadoop Jar with Maven and Eclipse

This guide walks through creating your first Hadoop program, but it skips over some important details, like how to compile a Jar file… It relies on the assumption that you’ve already compiled all of your Java code. Small assumption, but if you come from a C# programming background this may be confusing. Don’t worry, it’s fairly easy if you use Eclipse as an IDE and Maven to manage the project dependencies.

Continue reading

Teaching Data Science to Middle School Students

Do you think teaching data science to middle school students is possible? Can 8th grade students really learn how to solve data science problems? How about identifying what class of machine learning problem they will need to solve? I had no idea if it was possible, so I decided to try it out! With 4.5 hours over 3 days I attempted to teach middle school students how to predict Titanic survivors on Kaggle.com. How’d it go? Freaking awesome!

Continue reading

5 Apache Spark Training Videos with the HDInsight Team

Let’s learn Spark!

After looking around for some readiness materials on getting started with Apache Spark, I noticed a lack of videos explaining the main components! Instead of waiting around for them to be created, we took things into our own hands. Today we’re releasing 5 videos, each with an HDInsight product team member, covering the pieces of HDInsight they specifically work on or own. The videos were made by developers, for developers, and therefore contain primarily technical content. It’s advised to have an understanding of Hadoop before jumping in, so make sure you’re up for it! You can access all of the Apache Spark training videos by clicking below.

Continue reading

Setting Hive Variables in Hadoop

 

Taking Hive from the world of demos into production code almost always results in setting Hive variables within your production script. You can set Hive variables for table names, locations, partitions, etc. For this example we are going to use some sample data that comes on an HDInsight cluster to play with variables.
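
As a rough illustration of the idea, here is a minimal Python sketch that builds a `hive` command line passing variables to a script with the `--hivevar` flag. The script name `load_sample.hql`, the variable names, and the table name are all placeholders I made up for illustration; inside the script you would reference a value as `${hivevar:table_name}`.

```python
def hive_command(script_path, variables):
    """Build the argument list for running a Hive script with variables.

    Each entry in `variables` becomes one `--hivevar name=value` pair,
    which the script can read back as ${hivevar:name}.
    """
    args = ["hive"]
    for name, value in variables.items():
        args += ["--hivevar", f"{name}={value}"]
    args += ["-f", script_path]
    return args


# Hypothetical script and variable names, for illustration only.
cmd = hive_command(
    "load_sample.hql",
    {"table_name": "hivesampletable", "limit_rows": "10"},
)
print(" ".join(cmd))
```

Keeping the variables in one dictionary means the same script can be pointed at a test table or a production table without editing the HiveQL itself.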

Continue reading

Why RowCount() and Order By are expensive in Hadoop and other distributed environments

Orderby() and RowCount()

There are numerous reasons to use the RowCount() function in your relational data store, the most common being the addition of a primary key column to your data sets. Sounds simple, but this can be a challenge in a distributed environment, where the rows live on different machines. Let’s look at an example using Hive on a two-node Hadoop cluster.
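
To make the challenge concrete, here is a small Python sketch of the general idea, not of how Hive actually implements it. It assumes the rows are split across two made-up partitions, `node_a` and `node_b`. No worker can number its rows until it knows how many rows sit in the partitions ahead of it, so some coordination step is unavoidable.

```python
from itertools import accumulate


def global_row_numbers(partitions):
    """Assign row numbers 1..N across partitions held by different workers."""
    # Step 1: every worker counts its own rows (this part runs in parallel).
    counts = [len(part) for part in partitions]
    # Step 2: one coordinator turns the counts into starting offsets.
    # This is the synchronization point a single machine never pays for.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each worker numbers its rows starting from its own offset.
    return [
        [(offset + i + 1, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Hypothetical data living on two nodes.
node_a = ["alice", "bob", "carol"]
node_b = ["dave", "erin"]
numbered = global_row_numbers([node_a, node_b])
print(numbered)
```

A globally sorted result has the same flavor of problem: every row has to be compared against rows on other nodes before its final position is known.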

Continue reading

New HDInsight Visual Studio Tools

 

Exciting news for any HDInsight users out there. 

Today we released “HDInsight Visual Studio Tools – July Update”, aka a bunch of totally cool updates to our already kick-ass HDInsight Visual Studio tool kit!

The update includes a smattering of new features, but I’m most excited about the following three:

    1. Hive IDE with full IntelliSense integration
    2. Hive execution engine for jobs utilizing Tez
    3. Templates for Storm with other Azure services, including DocDB, Event Hubs, and SQL Azure

These HDInsight Visual Studio tools bring us another step closer to full cluster management via VS!

 

Hive IDE with full IntelliSense integration

What is this? I can actually write Hive statements as I would code in any other language? YUPPPPPP. So awesome! For those of you who never know what comes next in the create table syntax (is LOCATION or ROW FORMAT DELIMITED first?), you are saved! In case you just woke up from a 10-year nap, IntelliSense provides recommendations for what may come next in your code block, keeping track of all of your created variables and methods. It now supports almost all Hive DML statements, including subqueries, INSERT OVERWRITE statements, CTAS statements, etc. Also, columns and built-in UDFs are now automatically suggested so you don’t have to remember a bunch of function names!


 

Hive execution engine for jobs utilizing Tez

Spend a day working in Hive and you’ll quickly learn how hard it can be to figure out exactly how a query is working under the covers. Add in the complexity of Tez and it can become almost impossible. Luckily, Tez runs each job as a directed acyclic graph (DAG), allowing for more specialized workflow models to be implemented. The new VS tools provide the ability to easily investigate which workflow was implemented after execution of a Hive job. This is helpful for many things but is mostly used for performance tuning of queries. This will work for both Windows and Linux based clusters too!


 

Templates for Storm with Azure services, including DocDB, Event Hubs, SQL Azure, and more

Storm has been a first-class citizen on Azure for a few months and we have now released a number of samples. Writing Storm applications that utilize Azure services or services running on Azure is infinitely easier to spin up now. The new applications include example code as well as a plethora of comments, explaining not only what the code does, but why it does it and the pitfalls to watch out for. There are examples of both reading from and writing to most of the Azure services, and it’s possible to get going in only a few steps.


This is just what I’ve found the most valuable in the new release, but there are many more features, including the addition of Pig scripts, so download the HDInsight Visual Studio Tools and get to exploring!

Happy Developing

~Andrew

Continue reading