
Tips for delivering an amazing technical presentation

We’ve all been here before…

You’re at a conference and a developer takes the stage to talk about how their product is going to change the world. Fast forward an hour and your neighbor shakes you awake because the presentation has ended. You fell asleep and missed the whole thing! Now maybe you didn’t have enough coffee, but my guess is that the speaker didn’t do enough to keep you engaged.


In this video, I interview David Chappell to get his tips on how to deliver an amazing technical presentation. From the content of your presentation to the style you use to deliver it, we talk about how you can be successful. The ability to give an amazing technical talk is one thing that separates good developers from GREAT developers.






Halo, Minecraft, and more! How Microsoft Studios Processes Gaming Event Data.

I like data and I like games. When I heard that my friend Karan Gulati’s job includes both of these things, I thought it was too good to be true. Obviously we have to get to the bottom of this… We’ll talk with him about his job in Microsoft Studios and learn about his most recent project creating a data pipeline for an undisclosed game 😉


Teaching Data Science to Middle School Students

Do you think teaching data science to middle school students is possible? Can 8th grade students really learn how to solve data science problems? How about identifying what class of machine learning problem they will need to solve? I had no idea if it was possible, so I decided to try it out! With 4.5 hours over 3 days I attempted to teach middle school students how to predict Titanic survivors on Kaggle.com. How’d it go? Freaking awesome!


Hive Date Format Manipulation

Doing some work at a customer recently, I ran into a tricky problem around handling dates.  Within the data, we had dates expressed as yyyyMMdd.  After trying to create an external table over the file location, it was apparent that Hive didn’t want to play nicely with this format.  Unfortunately, for a DATE column Hive expects the data to be in yyyy-MM-dd format and will import NULL for any other format.  We were getting worried that we might have to build some custom code to reformat the data.

After looking closer at the date functions, it became apparent that it’s possible to convert a date string into a unix timestamp while specifying the expected input format.  Once we have a unix timestamp, it’s possible to convert that back into a date string in the yyyy-MM-dd format!

 

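-- Staging table: read the raw yyyyMMdd values as plain strings
CREATE EXTERNAL TABLE dateExampleStaging (dt STRING, otherData STRING)
LOCATION 'wasb://<container>@<storage>.blob.core.windows.net/files/';

-- Final table with a real DATE column
CREATE TABLE dateExampleFinal (dt DATE, otherData STRING);

-- Parse yyyyMMdd into a unix timestamp, then format it back as yyyy-MM-dd
INSERT INTO TABLE dateExampleFinal
SELECT
  from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd'),
  otherData
FROM dateExampleStaging;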

 

Voilà! Your date worries are no more.  A cool and easy trick.

Using OAuth to connect to Streaming Twitter API

Twitter just updated its APIs to v1.1 and, with that update, now requires Open Authentication (OAuth) to access the streaming API.  OAuth has been required for the REST API for some time, but it is a new requirement for the streaming services.  The first step is to get your authentication token from Twitter.  Log in on the developers page and choose My applications from the dropdown menu.


Choose Create new application and fill out the form.  If you do not have a website or callback URL, just add a placeholder for now.  After creating your application, scroll to the bottom of the page and click Create my access token.


At the top of this page there are tabs. Choose the OAuth tool tab and you will see the details of your token there.


We will use this token in the code snippet below.  This code is from the Twitter streaming class in the demo located at twitterbigdata.codeplex.com.

// This fragment is a method body inside the streaming class (it ends with a return).
// Requires: using System; using System.IO; using System.Net;
//           using System.Security.Cryptography; using System.Text;
var oauth_consumer_key = "Enter your consumer key here";
var oauth_consumer_secret = "Enter your consumer secret key here";
var oauth_token = "Enter your token here";
var oauth_token_secret = "Enter your token secret key here";
var oauth_version = "1.0";
var oauth_signature_method = "HMAC-SHA1";

// unique request details
var oauth_nonce = Convert.ToBase64String(
    new ASCIIEncoding().GetBytes(DateTime.Now.Ticks.ToString()));
var timeSpan = DateTime.UtcNow
    - new DateTime(1970, 1, 1, 0, 0, 0, 0, DateTimeKind.Utc);
var oauth_timestamp = Convert.ToInt64(timeSpan.TotalSeconds).ToString();
var resource_url = "https://stream.twitter.com/1.1/statuses/filter.json";

// create the OAuth signature (the parameter list could be different for the
// normal Twitter API as well as for any other social APIs)
var baseFormat = "oauth_consumer_key={0}&oauth_nonce={1}&oauth_signature_method={2}" +
    "&oauth_timestamp={3}&oauth_token={4}&oauth_version={5}&track={6}";
var baseString = string.Format(baseFormat,
    oauth_consumer_key,
    oauth_nonce,
    oauth_signature_method,
    oauth_timestamp,
    oauth_token,
    oauth_version,
    Uri.EscapeDataString(_config.Parameters)
);
baseString = string.Concat("POST&", Uri.EscapeDataString(resource_url), "&", Uri.EscapeDataString(baseString));

// the signing key is the consumer secret and the token secret joined by "&"
var compositeKey = string.Concat(Uri.EscapeDataString(oauth_consumer_secret),
    "&", Uri.EscapeDataString(oauth_token_secret));
string oauth_signature;
using (HMACSHA1 hasher = new HMACSHA1(ASCIIEncoding.ASCII.GetBytes(compositeKey)))
{
    oauth_signature = Convert.ToBase64String(
        hasher.ComputeHash(ASCIIEncoding.ASCII.GetBytes(baseString)));
}

// create the request header
var headerFormat = "OAuth oauth_nonce=\"{0}\", oauth_signature_method=\"{1}\", " +
    "oauth_timestamp=\"{2}\", oauth_consumer_key=\"{3}\", " +
    "oauth_token=\"{4}\", oauth_signature=\"{5}\", " +
    "oauth_version=\"{6}\"";
var authHeader = string.Format(headerFormat,
    Uri.EscapeDataString(oauth_nonce),
    Uri.EscapeDataString(oauth_signature_method),
    Uri.EscapeDataString(oauth_timestamp),
    Uri.EscapeDataString(oauth_consumer_key),
    Uri.EscapeDataString(oauth_token),
    Uri.EscapeDataString(oauth_signature),
    Uri.EscapeDataString(oauth_version)
);

// make the request; the track value is escaped the same way it was when signing
ServicePointManager.Expect100Continue = false;
var postBody = "track=" + Uri.EscapeDataString(_config.Parameters);
resource_url += "?" + postBody;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(resource_url);
request.Headers.Add("Authorization", authHeader);
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.PreAuthenticate = true;
request.AllowWriteStreamBuffering = true;
WebResponse response = request.GetResponse();
return new StreamReader(response.GetResponseStream());

If you add your token and consumer information to the above code and use this as the Twitter streaming class in the project at twitterbigdata.codeplex.com, you will be able to access Twitter’s streaming API.  This can also be used as a base when using OAuth v1 with any other social sites or APIs.
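
To reuse the signing step with a different parameter list, it may help to pull it into a small helper. The following is only a sketch, not part of the codeplex project: the class and method names (OAuthSignatureSketch, BuildBaseString, Sign) are made up for illustration. It percent-encodes each parameter, sorts the pairs, builds the signature base string, and signs it with HMAC-SHA1 in the same way as the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, for illustration only.
public static class OAuthSignatureSketch
{
    // Builds the OAuth 1.0 signature base string:
    // METHOD & encoded URL & encoded, sorted parameter string.
    public static string BuildBaseString(string method, string url,
        IEnumerable<KeyValuePair<string, string>> parameters)
    {
        // Percent-encode every key and value, then sort by key (value breaks ties).
        var pairs = parameters
            .Select(p => new KeyValuePair<string, string>(
                Uri.EscapeDataString(p.Key), Uri.EscapeDataString(p.Value)))
            .OrderBy(p => p.Key, StringComparer.Ordinal)
            .ThenBy(p => p.Value, StringComparer.Ordinal)
            .Select(p => p.Key + "=" + p.Value);

        var parameterString = string.Join("&", pairs);

        return string.Concat(method.ToUpperInvariant(), "&",
            Uri.EscapeDataString(url), "&",
            Uri.EscapeDataString(parameterString));
    }

    // Signs the base string with HMAC-SHA1 using "consumerSecret&tokenSecret" as the key.
    public static string Sign(string baseString, string consumerSecret, string tokenSecret)
    {
        var key = Uri.EscapeDataString(consumerSecret) + "&" + Uri.EscapeDataString(tokenSecret);
        using (var hasher = new HMACSHA1(Encoding.ASCII.GetBytes(key)))
        {
            return Convert.ToBase64String(
                hasher.ComputeHash(Encoding.ASCII.GetBytes(baseString)));
        }
    }
}
```

You would pass in the six oauth_* values plus track, then place the result in the oauth_signature field of the Authorization header. One caveat: on older .NET Framework versions, Uri.EscapeDataString may leave a few reserved characters (such as ! * ' ( )) unescaped, so values containing them may need extra handling.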

Use StreamInsight to Assign Sentiment Scores to a List of Items in a CSV (Part Two: Assign Sentiment)

In part one of this series, we discussed how to import a CSV file and then read from it using StreamInsight.  This post takes that a step further and shows how to use the Sentiment140 API to assign sentiment to a string.  Within the foreach (activeInterval) loop there are two calls that are crucial to assigning sentiment (shown below).

//Score the current SocialBlurb's CONTENT attribute.
var result = sentiment.Analyze(activeInterval.l.CONTENT);
//Set the active SocialBlurb's Sentiment140_Mood to the sentiment received.
activeInterval.l.Sentiment140_Mood = (int) result.Mood;

Below is the Sentiment140 class that these calls relate to.  The Analyze method takes the content, appends it to a URL, makes an HTTP call, parses the response, and returns the sentiment score.

using System.IO;
using System.Net;
using System.Text;
using System.Web;            // HttpUtility
using Newtonsoft.Json.Linq;  // JObject

public class Sentiment140
{
    private string _jsonURL = @"http://www.sentiment140.com/api/classify";

    public SentimentAnalysisResult Analyze(string textToAnalyze)
    {
        //Default to neutral; this is returned if the URL is too long or the call fails
        SentimentAnalysisResult result = new SentimentAnalysisResult() { Mood = SentimentScore.Neutral, Probability = 100 };

        //Format url to include the text we want to analyze
        string url = string.Format("{0}?text={1}", this._jsonURL,
            HttpUtility.UrlEncode(textToAnalyze, Encoding.UTF8));

        //Maximum url length check and set to neutral if larger than allowed
        if (url.Length > 600)
        {
            return result;
        }

        //Create the HTTP Request
        var request = WebRequest.Create(url);
        try
        {
            //Get the Response from Sentiment140; both usings are disposed even on error
            using (var response = request.GetResponse())
            using (var streamReader = new StreamReader(response.GetResponseStream()))
            {
                // Read from source
                var line = streamReader.ReadLine();
                // Parse
                var jObject = JObject.Parse(line);
                int polarity = jObject.SelectToken("results", true).SelectToken("polarity", true).Value<int>();
                switch (polarity)
                {
                    case 0:
                        result.Mood = SentimentScore.Negative;
                        break;
                    case 4:
                        result.Mood = SentimentScore.Positive;
                        break;
                    default: // 2 or others
                        result.Mood = SentimentScore.Neutral;
                        break;
                }
            }
        }
        catch (System.Exception)
        {
            //Any network or parsing failure falls back to neutral
            result.Mood = SentimentScore.Neutral;
        }
        return result;
    }
}

This specific project uses Sentiment140 to assign the sentiment scores but could just as easily use another sentiment engine that exposes an API.  The framework would stay similar, but the URL and parsing would change slightly to fit what the new API requires.
Combining these two blog posts makes it possible to read from a CSV file using StreamInsight and add sentiment values to the content from the CSV.  This is a solution that fits nicely with customers who are using a service that crawls the web and returns mentions of their enterprise.  Most of these services provide some sentiment scoring capability, but if a customer wants to implement their own engine, a sentiment engine that is not offered through the service, or multiple engines, this solution is one we can offer to them.
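
One way to make the engine swappable is to put the Analyze call behind an interface. The sketch below is an assumption about how that could look, not code from the project: ISentimentEngine and KeywordSentimentEngine are hypothetical names, the toy keyword engine exists only to show a second implementation, and the SentimentScore and SentimentAnalysisResult definitions are guesses that mirror how those types are used above (the enum values copy Sentiment140's 0/2/4 polarity).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shapes for the types used in the post; the real definitions are not shown there.
public enum SentimentScore { Negative = 0, Neutral = 2, Positive = 4 }

public class SentimentAnalysisResult
{
    public SentimentScore Mood { get; set; }
    public double Probability { get; set; }
}

// Hypothetical abstraction: any engine that can score a string.
public interface ISentimentEngine
{
    SentimentAnalysisResult Analyze(string textToAnalyze);
}

// Toy engine, for illustration only: compares counts of positive and negative words.
public class KeywordSentimentEngine : ISentimentEngine
{
    private static readonly HashSet<string> PositiveWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "good", "great", "love" };
    private static readonly HashSet<string> NegativeWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "bad", "awful", "hate" };

    public SentimentAnalysisResult Analyze(string textToAnalyze)
    {
        var words = (textToAnalyze ?? string.Empty)
            .Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);

        int score = words.Count(w => PositiveWords.Contains(w))
                  - words.Count(w => NegativeWords.Contains(w));

        var mood = score > 0 ? SentimentScore.Positive
                 : score < 0 ? SentimentScore.Negative
                 : SentimentScore.Neutral;

        return new SentimentAnalysisResult { Mood = mood, Probability = 100 };
    }
}
```

With something like this in place, the Sentiment140 class would also implement ISentimentEngine, and the processing loop would depend only on the interface. Scoring with multiple engines then becomes a matter of looping over a list of them.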

Use StreamInsight to Assign Sentiment Scores to a List of Items in a CSV (Part One: Reading Input)

There will come a time when you need to read from a CSV file using StreamInsight.  It may be to match reference data with a current stream or utilize the processing power of StreamInsight to do many things at once.  This example uses the observable programming method to add input from a CSV to a list then enumerate through that list creating point events.

First things first, we need to get the data from the CSV file into a list.  I was using a “small” data set and was able to load the entire file into one list.  When working with larger files, simply process the CSV file in chunks.  The CSV parser (KBCsv) that I used can be found here.  The code below takes the CSV file, reads it in, and creates a list of SocialBlurb objects.

if (File.Exists(filepath))
{
    using (var reader = new CsvReader(filepath))
    {
        //Skip the header row, then read one data record per iteration
        reader.ReadHeaderRecord();
        while (reader.HasMoreRecords)
        {
            var columnline = reader.ReadDataRecord();
            myList.Add(new SocialBlurb
            {
                ARTICLE_ID = columnline[0].Trim(),
                HEADLINE = columnline[1].Trim(),
                AUTHOR = columnline[2].Trim(),
                CONTENT = columnline[3].Trim()
            });
        }
    }
}

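
For files too large to hold in a single list, the chunked processing mentioned above could be sketched roughly as follows. This is an illustration under assumptions rather than code from the project: ChunkingSketch and InChunks are hypothetical names, and the helper works on any sequence so it does not depend on the CSV parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class ChunkingSketch
{
    // Lazily groups a sequence into lists of at most chunkSize items.
    public static IEnumerable<List<T>> InChunks<T>(IEnumerable<T> source, int chunkSize)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");
        return Iterate(source, chunkSize);
    }

    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int chunkSize)
    {
        var chunk = new List<T>(chunkSize);
        foreach (var item in source)
        {
            chunk.Add(item);
            if (chunk.Count == chunkSize)
            {
                // Hand the full chunk to the caller and start a new one
                yield return chunk;
                chunk = new List<T>(chunkSize);
            }
        }
        // Emit whatever is left over as a final, smaller chunk
        if (chunk.Count > 0)
        {
            yield return chunk;
        }
    }
}
```

To use it, the read loop would yield each SocialBlurb from an iterator method instead of adding it to myList, and each chunk returned by InChunks would be processed as its own list. Only one chunk needs to be in memory at a time.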
Now that we have a list of SocialBlurb objects, we need to create a StreamInsight application, turn the list into a point stream, add a CTI event, and do the processing.  The code for this is shown below.

using (var server = Server.Create("StreamInsightDefault"))
{
    //Create SI app on Server
    var application = server.CreateApplication("My Application");

    //Query of all Items in the List
    var listquery = from l in myList
                    select l;

    //Add CTI to the point event stream by increasing starttime
    var streamquery = listquery.AsEnumerable().ToPointStream(application,
        l => PointEvent.CreateInsert(DateTime.Now, new { l }),
        AdvanceTimeSettings.IncreasingStartTime);

    //read the results as starttime and payload(SocialBlurb)
    var results = from Pointevent in streamquery.ToPointEnumerable()
                  where Pointevent.EventKind != EventKind.Cti
                  select new
                  {
                      Pointevent.StartTime,
                      Pointevent.Payload.l
                  };

    //Do something with each point event (the Sentiment part will come in a part 2 post 🙂)
    foreach (var activeInterval in results)
    {
        var sentiment = new Sentiment140();
        var result = sentiment.Analyze(activeInterval.l.CONTENT);
        activeInterval.l.Sentiment140_Mood = (int) result.Mood;

        Console.WriteLine(activeInterval.l.AUTHOR);
        Console.WriteLine(activeInterval.l.CONTENT);
        Console.WriteLine(activeInterval.l.Sentiment140_Mood);
    }
}

This post explains how to read lines from a CSV file and turn that input into a point event stream.  My next post will detail the sentiment scoring class shown above and point out how another sentiment engine could be implemented.