Week 4 – Trees, Forests, Bagging and Boosting

It’s Pride and here is my Random Forest – this isn’t really how it works…

This week we started on what I would consider the more exciting part of data science: machine learning.

By using computers, we are able to process vast quantities of data and discover patterns that would otherwise go unnoticed. Broadly speaking, there are two main categories of machine learning: supervised and unsupervised methods. This week we focused on supervised methods, which I will briefly go over here.

Let’s take an entirely made up (and ridiculous) set of data relating ‘eye color’ and ‘sushi preference’ to whether or not a person ‘likes cats’. The ‘likes cats’ column will serve as our ‘label’. With a supervised method, you feed the computer a bunch of observations as well as the labels for those observations. We can then employ a variety of different algorithms to determine which features, if any, relate to the response variable, ‘Likes Cats.’ By crafting our experimental design and analysis well, we might even be able to determine if some of those features (e.g. eye color) CAUSE someone to like cats. More than that, if our model is good, we can take a new person’s eye color and sushi preference and predict whether they’ll like cats with some degree of certainty (think targeted ads).

X1 = Eye Color | X2 = Favorite Sushi | Y = Likes Cats
Brown          | California Roll     | True
Brown          | Yellowtail          | False
Blue           | California Roll     | False
Green          | Cucumber Roll       | True
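
To make the "feed it observations and labels, then predict" idea concrete, here is a minimal sketch of a tiny decision tree trained on the four rows above. It is plain Python with no libraries. The function names (`gini`, `split`, `build_tree`, `predict`) and the choice of Gini impurity as the splitting rule are my own assumptions for illustration, not necessarily what we used in class, and four rows are far too few to learn anything real.

```python
from collections import Counter

# Toy data from the table above. Each row is (eye color, favorite sushi);
# the label is whether that person likes cats.
X = [
    ("Brown", "California Roll"),
    ("Brown", "Yellowtail"),
    ("Blue", "California Roll"),
    ("Green", "Cucumber Roll"),
]
y = [True, False, False, True]


def gini(labels):
    """Gini impurity: 0.0 when every label in the group agrees."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split(X, y, feature):
    """Group rows and labels by the value of one categorical feature."""
    groups = {}
    for row, label in zip(X, y):
        rows, labels = groups.setdefault(row[feature], ([], []))
        rows.append(row)
        labels.append(label)
    return groups


def build_tree(X, y, unused):
    """Recursively split on the unused feature with the lowest weighted impurity."""
    majority = Counter(y).most_common(1)[0][0]
    if gini(y) == 0.0 or not unused:
        return majority  # leaf: just a predicted label

    def weighted_impurity(feature):
        groups = split(X, y, feature)
        return sum(len(ls) / len(y) * gini(ls) for _, ls in groups.values())

    best = min(unused, key=weighted_impurity)
    remaining = [f for f in unused if f != best]
    children = {
        value: build_tree(rows, labels, remaining)
        for value, (rows, labels) in split(X, y, best).items()
    }
    # "default" is the fallback for feature values never seen in training.
    return {"feature": best, "children": children, "default": majority}


def predict(tree, row):
    """Walk from the root to a leaf, following the row's feature values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


# Feature indices: 0 = eye color, 1 = favorite sushi.
tree = build_tree(X, y, [0, 1])
print(predict(tree, ("Blue", "Yellowtail")))
```

The tree memorizes the four training rows perfectly, which is exactly the kind of overfitting that bagging and random forests are meant to tame. For the new blue-eyed Yellowtail fan it predicts False, purely because the only blue-eyed person in the data didn't like cats.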

Now, from my example, this may seem childish and pointless, but imagine you have thousands of predictor variables (X’s) and millions of observations (rows of data). How do you process this? How do you find patterns? Certainly not using an Excel spreadsheet.

This is the type of challenge that biostatisticians face when using genetic data to predict cancers, and that Facebook’s engineers deal with when trying to classify users by their behaviors. These are non-trivial problems, and machine learning is an essential tool for solving them.

Click here for a beautiful example of machine learning at work!

We learned eight different algorithms and techniques this week: Gradient Descent, Stochastic Gradient Descent, Decision Trees, Random Forests, Bagging, Boosting, AdaBoost, and Support Vector Machines. It was definitely an intense week, and I won’t bore you by going over all of the nitty gritty. I will, however, provide links to helpful resources if you are at all interested.

Please keep reading, and ask me about machine learning. It’s awesome.

 

5 Comments

  1. This really makes me want to study data science! I’ve always found epidemiological data and analysis fascinating.

  2. I love the r2d3 visualizations! And the Random Pride Forest (I preferred the cats though…)
    I really like your blog, thank you for keeping it alive 😉

    1. Absolutely! We should collaborate on an entry. I don’t know what I’m going to write about this week.
