Week 6 / 7: Unsupervised Methods, Recommenders and Break Week

K-Means Clustering Animation

Unsupervised Methods

Well, it's the end of break week; that's why there wasn't a post last Sunday. At the beginning of Week 6, we studied what are known as "unsupervised methods". These methods involve using a computer to discover patterns in data and are often used in clustering or classification applications. An example of one specific algorithm, "K-means Clustering", is shown above.

The way K-Means works is you take some set of data and tell the algorithm how many clusters (K) you think might be present. This can be chosen systematically or explicitly. The algorithm then randomly places those K cluster centers in the data and iteratively refines the assignment of the real data points until the clusters are as well optimized as possible. Optimization, in this case, means minimizing the Euclidean distance between classified data points and their respective cluster centers. Below, I've shown the results of K-Means clustering on the famous "Iris" dataset using K-values from 1 to 4. By eye, we can see that there are most probably 2 or 3 clusters present (humans are very good at pattern recognition), but we can tell the computer to find any number of clusters, right up to the number of data points, at which point each data point would be classified as its own class, which is fairly useless.
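The core loop is surprisingly compact. Here's a minimal sketch in plain numpy (the function and variable names are my own, not from the class file on GitHub):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign points to the nearest center,
    re-center on the assigned points, repeat until nothing moves."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Each iteration can only lower the total within-cluster distance, which is why the loop is guaranteed to settle down (though not necessarily on the globally best clustering, hence the random restarts real libraries use).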


K = 1 Clusters


K = 2 Clusters




K = 3 Clusters


K = 4 Clusters


This was a pretty fun little exercise, and I enjoyed building the different visualizations using both Python's matplotlib and a fantastic command-line tool called ImageMagick (thank you, Denis) to make the animations. I've made the class file and documentation for my code available on GitHub if anyone is interested.

Matrices and Recommendation Systems

Apart from unsupervised methods, we again went back to linear algebra to learn a number of matrix factorization and dimensionality reduction techniques. The gist of these methods is that we can use matrix math to discover latent features in data, or to fill in unknown values with good estimates. Ever wonder how Netflix or Spotify recommends items to you? Matrix factorization is what it boils down to. Here's a great and very accessible article on the topic from The Atlantic: How Netflix Reverse Engineered Hollywood.
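To give a flavor of how the "fill in unknown values" part works, here's a toy sketch of latent-factor factorization trained with stochastic gradient descent. It's purely illustrative; the names and hyperparameter values are mine, not a production recommender:

```python
import numpy as np

def factorize(R, k=2, steps=5000, lr=0.01, reg=0.02, seed=0):
    """Factor a ratings matrix R (np.nan = unrated) into user factors U and
    item factors V so that U @ V.T approximates the known ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    known = np.argwhere(~np.isnan(R))
    for _ in range(steps):
        for i, j in known:
            err = R[i, j] - U[i] @ V[j]             # error on one known rating
            u_old = U[i].copy()
            U[i] += lr * (err * V[j] - reg * U[i])  # step both factor vectors
            V[j] += lr * (err * u_old - reg * V[j]) # downhill, with shrinkage
    return U, V
```

Once trained, `U @ V.T` gives an estimate for every cell, including the ones the user never rated, and those estimates are your recommendations. The `reg` term keeps the factors small so the model doesn't just memorize the known cells.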

Case Study: For this week's case study, we built a joke recommender using the Jester Dataset: 150 jokes with about 1.7 million ratings from 60,000 users. Our task was to estimate which jokes new users would score highly. My team used a grid search to cycle through a number of different parameter combinations and optimize our recommender system.
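A grid search is exactly what it sounds like: try every combination of candidate parameters and keep the best. Here's the skeleton of the idea (the parameter names in the grid are illustrative, not the exact ones we tuned):

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are just for illustration.
grid = {"n_factors": [2, 5, 10], "learning_rate": [0.005, 0.01], "reg": [0.02, 0.1]}

def grid_search(score_fn, grid):
    """Evaluate score_fn at every combination of grid values and
    return the best-scoring parameter set."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `score_fn` would train the recommender and return a cross-validated score, which is why grid searches get expensive fast: the cost multiplies with every parameter you add.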

Next Week: Big Data Tools

Week 5 – There’s MongoDB in my Beautiful Soup

Who comes up with these names???

This week we touched on all sorts of topics. We started off studying data science for business, looking at how to translate different models into profit curves. We then moved on to web scraping (yes, that is the proper terminology). This is one of my favorite parts of data science: using just a few lines of code, you are able to grab information automatically from the internet and store it for further use. We used the Beautiful Soup library to scrape thousands of New York Times articles into a MongoDB database. The web scraping part is awesome; I found the MongoDB database pretty confusing, mostly because you interact with it through JavaScript, a language I have very little experience with.
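At its core, scraping is just parsing HTML and keeping the pieces you care about. We used Beautiful Soup in class, but the same idea can be sketched with only Python's standard library (the HTML snippet and tag choice here are made up):

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collect the text inside every <h2> tag, standing in for article headlines."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headlines.append(data.strip())

scraper = HeadlineScraper()
scraper.feed("<html><h2>Article One</h2><p>text</p><h2>Article Two</h2></html>")
```

Beautiful Soup wraps this kind of event-driven parsing in a much friendlier interface (searching by tag, class, attribute, and so on), which is why everyone uses it instead.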

Natural Language Processing

After extracting all this text data, we naturally had to come up with ways of doing text analysis, a field called "Natural Language Processing" (NLP). Like every day, the instructors at Galvanize smashed an insane amount of material into one day. I'm still trying to process it all, but it basically boils down to this: NLP is a difficult field, since words in different arrangements or with different punctuation can have different meanings and connotations.

Let’s eat Grandma. Let’s eat, Grandma.

There are operations you can do to make NLP more successful; one example is stemming, which drops suffixes to simplify words (i.e. car, cars, car's, cars' $\Rightarrow$ car). But it is still a developing field with lots of progress to be made. For our assignment, we were asked to analyze how similar various articles from the New York Times were to each other.
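As a toy illustration of both ideas, here's a crude suffix-stripper and a bag-of-words cosine similarity in plain Python. A real pipeline would use a proper stemmer (like Porter's) and TF-IDF weighting; this is just the shape of the computation:

```python
import math
import re
from collections import Counter

def stem(word):
    """Crude suffix stripping: a stand-in for a real stemmer."""
    for suffix in ("'s", "s'", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Count stemmed words, ignoring order and punctuation."""
    return Counter(stem(w) for w in re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors:
    1.0 = identical word mix, 0.0 = no words in common."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Comparing every pair of articles this way gives a similarity matrix, which is essentially what our assignment boiled down to.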

Time Series


Example of time series data (top) broken down into ‘seasonal’, ‘trend’, and ‘white noise’ components.

Finally, we studied time series. I'll admit that by this point in the week I was pretty much mentally checked out. I liked time series though; it was all about decomposing your signal into multiple 'subsignals'. For example, let's take temperature. At the scale of a day, the temperature goes up and down with the sun. At the scale of a year, the temperature goes up during summer and down in the winter. And if we were to look at an even larger, multi-year scale, we'd see an overall up-trend, because … global warming. If you have a good model, you should be able to capture the trend and the different cyclical components, leaving behind only white noise. In practice, this is quite difficult.
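The decomposition idea can be sketched in a few lines of numpy: estimate the trend with a moving average, average the leftover values at each phase of the cycle to get the seasonal part, and call whatever remains noise. This is a simplified version of classical additive decomposition; real tools like statsmodels' seasonal_decompose handle the series edges more carefully:

```python
import numpy as np

def decompose(y, period):
    """Classical additive decomposition: y = trend + seasonal + residual."""
    # A moving average spanning one full cycle smooths the seasonality away,
    # leaving the trend (edges are distorted by zero-padding).
    kernel = np.ones(period) / period
    trend = np.convolve(y, kernel, mode="same")
    detrended = y - trend
    # Average the detrended values at each phase of the cycle, skipping the
    # distorted edges, to estimate the repeating seasonal pattern.
    core = detrended[period:-period]
    seasonal = np.array([core[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, len(y) // period + 1)[: len(y)]
    residual = y - trend - seasonal
    return trend, seasonal, residual
```

On the temperature example, `period` would be a day (for the solar cycle) or a year (for the seasons); a good model leaves a `residual` that looks like white noise.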

On the final day of the week we did a day-long analysis using data from a ride-sharing company (think Uber or Lyft). Using real data, we were able to predict customer 'churn' (when a customer leaves a company) and make recommendations for how to retain and build a customer base. My team built a logistic regression model and a random forest to capture and analyze these data.

The week ended with a much needed happy hour.

Thanks for hanging in there with me.

Next up: Unsupervised Learning + Capstone Ideas


Week 4 – Trees, Forests, Bagging and Boosting


It’s Pride and here is my Random Forest – this isn’t really how it works…

This week we started on what I would consider the more exciting part of data science — machine learning.

By using computers, we are able to process vast quantities of data and discover patterns that would otherwise go unnoticed. Broadly speaking, there are two main categories of machine learning, supervised and unsupervised methods. This week we focused on supervised methods which I will briefly go over here.

Let's take an entirely made-up (and ridiculous) set of data relating 'eye color' and 'sushi preference' to whether or not a person 'likes cats'. The 'likes cats' column will serve as our 'label'. Using a supervised method, you feed the computer a bunch of observations as well as labels for those observations. We can then employ a variety of different algorithms to determine what features, if any, relate to the response variable, 'likes cats.' By crafting our experimental design and analysis well, we might even be able to determine whether some of those features (i.e. eye color) CAUSE someone to like cats. More than that, if our model is good, we can take a new person's eye color and sushi preference and predict whether they'll like cats, with some degree of certainty (think targeted ads).

X1 = Eye Color    X2 = Favorite Sushi    Y = Likes Cats
Brown             California Roll        True
Brown             Yellowtail             False
Blue              California Roll        False
Green             Cucumber Roll          True
Now, from my example, this may seem childish and pointless, but imagine you have thousands of predictor variables (X's) and millions of observations (rows of data). How do you process this? How do you find patterns? Certainly not using an Excel spreadsheet.
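Just to make the toy table concrete, here's about the simplest possible "supervised method": predict a new person's label from the most similar row in the table. This is a nearest-neighbor sketch of my own, not one of the algorithms we actually covered:

```python
# The made-up cat data from the table above, as (features, label) pairs.
data = [
    (("Brown", "California Roll"), True),
    (("Brown", "Yellowtail"), False),
    (("Blue", "California Roll"), False),
    (("Green", "Cucumber Roll"), True),
]

def predict(features):
    """Return the label of the most similar observation
    (the row with the fewest mismatched features)."""
    def distance(row):
        return sum(a != b for a, b in zip(row[0], features))
    return min(data, key=distance)[1]
```

Real algorithms are far cleverer about weighing features and generalizing, but the contract is the same: labeled observations in, predictions for new observations out.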

This is the type of challenge that biostatisticians face when using genetic data to predict cancers, and that Facebook's engineers deal with when trying to classify users by their behaviors. These are non-trivial problems, and machine learning is an essential tool for solving them.

Click here for a beautiful example of machine learning at work!

We learned 8 different algorithms this week. It was definitely an intense week, and I won't bore you by going over all of the nitty-gritty. I will, however, provide links to helpful resources if you are at all interested: Gradient Descent, Stochastic Gradient Descent, Decision Trees, Random Forests, Bagging, Boosting, AdaBoost, and Support Vector Machines.
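As a small taste of the first item on that list: gradient descent just nudges a model's weights downhill on the error surface until they stop improving. A minimal linear-regression version in numpy looks like this (a sketch, not our class solution):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=1000):
    """Fit linear-regression weights by repeatedly stepping down the
    gradient of the mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of MSE w.r.t. w
        w -= lr * grad                          # step opposite the gradient
    return w
```

The "stochastic" variant estimates that gradient from one (or a few) rows at a time instead of the whole dataset, which is what makes it practical on millions of observations.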

Please keep reading and ask me about Machine Learning, it’s awesome.


Week 3 – Regression and the Dying Computer


My general confidence this week.

This week we studied various methods of linear and logistic regression. We went pretty deep into the mathematical underpinnings of why these techniques work, much further than I had gone in undergraduate statistics or graduate school work. The way we studied these regression techniques was to build our own models from scratch and compare their performance and outputs with those of Python's established libraries: statsmodels and scikit-learn. I am definitely not a fan of the documentation for either of these modules (there is NO reason that technical writing need be inaccessible), but scikit-learn at least seems slightly more intuitive. Both libraries are, however, very powerful, and I was impressed with the speed with which they were able to fit complex models on our datasets.
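For anyone curious what "from scratch" means here, the heart of a bare-bones linear regression is just the normal equations. This is a sketch of the idea; our actual class solutions had more bells and whistles:

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares from scratch: solve the normal equations
    (X'X) beta = X'y for the coefficient vector beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def r_squared(X, y, beta):
    """Fraction of the variance in y explained by the fitted model."""
    residual = y - X @ beta
    return 1 - (residual @ residual) / ((y - y.mean()) @ (y - y.mean()))
```

Comparing the coefficients and R-squared from a function like this against statsmodels' OLS output is exactly the kind of sanity check we did all week.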

Being able to build a model to represent data is of no use to anyone unless you are able to interpret what the model actually means and how statistically significant its results are. We therefore spent a good deal of time this week learning various ways to verify a model's validity. This is something I spent a good deal of time doing during my years of experimental work in labs, so I felt pretty good working through these assignments.

Dying computer…

What made this week particularly hard was that my less-than-four-year-old computer decided it would be a great time to die. It did this right in the middle of a quiz. The screen flashed shades of green and blue, went totally black, and then I heard a series of beeps coming from inside the computer (rather than the speakers). Some sleuthing seemed to indicate that the RAM could be bad. I purchased new RAM and a new solid-state drive on Amazon. Wednesday, after a long day at Galvanize, I spent the entire evening (until roughly 2am) installing the new parts and all the software I needed.

Everything seemed to be running smoothly in the morning, but alas, even with the new parts, the computer continued to crash. So… I am now writing to you from a brand-new MacBook. I have 14 days to see if I can fix my old one and return this new one; otherwise I'll be keeping the new computer. It is a good 5 pounds lighter than my previous one, so that's nice.

The End of the Week

The week ended with a difficult assessment covering math, statistics, python, pandas, numpy, SQL and more. My sister flew in Friday evening and we have been enjoying a weekend of hiking, celebrating my cousin’s PhD and eating far too much food.


Next Week: The real meat of the course begins — Machine Learning.


Galvanize Data Science: Week 2

Today, the high in Seattle hit 92°F. It’s very hot in my apartment, I feel like a normal distribution with an increasing standard deviation…


Animated normal distribution with a changing standard deviation. Made with matplotlib and ImageMagick.

Before entering the Data Science Intensive Program at Galvanize, I reached out to current and previous cohort members to see if they had any thoughts or advice for someone thinking of entering. Nearly every person came back with a glowing review and the same analogy:

“…it’s like trying to drink from a firehose.”

They weren’t joking. There is so much information being thrown at us at one time that it is physically impossible to absorb it all. The best you can do is take good notes and try to take in as much as you can.

This week we covered an incredible amount of material: about a year's worth of lectures (no joke) on probability, statistics (frequentist and Bayesian), A/B testing, hypothesis testing, and bootstrapping, all in one week. Because of Memorial Day, all of that was crammed into four days. Needless to say, I'm exhausted, but satisfied.

Each day's lectures were complemented with programming exercises illustrating the topics of the day. For the A/B testing exercise, for example, we used data from Etsy to determine if changing their homepage would drive additional customer conversion. For the Bayesian lecture, we developed and performed simulations of coin flips and die rolls to illustrate the concepts behind Bayesian probability. This topic was somewhat mind-blowing to me, as Bayesian probability is a way of thinking about statistics that is totally different and foreign to the way most people (including myself) are taught.
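The coin-flip simulations were my favorite part, because the Bayesian idea fits in a few lines of code. Here's a sketch of the brute-force version (my own toy, not the class exercise): draw a coin's bias from a uniform prior, simulate the flips, and keep only the biases that reproduce the observed data. The surviving draws approximate the posterior distribution:

```python
import random

def posterior_mean_bias(n_heads, n_flips, n_sim=100_000, seed=1):
    """Bayesian updating by rejection sampling: the kept biases are
    samples from the posterior given the observed flips."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_sim):
        p = rng.random()                              # prior: any bias equally likely
        heads = sum(rng.random() < p for _ in range(n_flips))
        if heads == n_heads:                          # keep only draws matching the data
            kept.append(p)
    return sum(kept) / len(kept)
```

For a uniform prior, the exact posterior mean is (heads + 1) / (flips + 2), so the simulation has a known answer to check itself against, which is exactly the kind of sanity check we did in class.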

This was also the second week of pair-programming and I’m finding I like it more and more.  Pair-programming is when one person “drives” while the other person “worries”. You switch off every 30 minutes or so. The brilliance of pair-programming is that, in addition to learning to work with other people, by the end of the day, we are often very tired and having a partner helps. Working in groups makes you answerable to your partner. Your mutual success depends on both of you working hard to get the assignment done. We switch partners daily, so each day has a different dynamic. Sometimes I’m the stronger programmer, sometimes I’m not. I’ve found it’s a nice way of humbling oneself.

Thanks for reading.

Next week: It’s Linear and Logistic Regression, sounds fun huh?

Galvanize Data Science: Week 1

Wow, what a week!

If I thought the first week was tough, I was wrong. I haven’t worked this hard in a long time. It’s incredibly exciting though to be working and learning in a place like Galvanize. My fellow classmates come from all walks of life: structural engineers, web developers, business analysts, even a snowboarding instructor. The week started off with a two-hour assessment on Python, Numpy, Pandas (not the bear), SQL, Calculus, Linear Algebra, Probability, and Statistics. I’m very glad that I took Week 0 because I know for a fact that if I hadn’t, the test would have been much harder. This test will serve as a baseline for our progress going through the program.

After the initial assessment, we went through all the nitty-gritty boilerplate at Galvanize: we got our keys, set up wifi, and learned the rules. It turns out that after the program ends in August, I get six months of access to all the facilities that Galvanize has: conference rooms, desks, the roof deck, social events, networking, etc. That's an awesome bonus I was unaware of.

Going forward, the schedule more or less follows the table below.

Program Schedule

8:30 – 9:00 am       Daily Quiz
9:00 – 11ish         Morning Lecture
11ish – 12ish        Individual Programming Assignment
12ish – 1:15 pm      Lunch
1:15 – 3ish          Afternoon Lecture
3ish – 5 pm (± 30)   Pair Programming Assignment

EDIT: The reality seems to be that I leave Galvanize around 6pm or later.

We covered far too much information this week in lectures to go over on this blog, but here are the highlights.

Git + GitHub

One of the biggest focuses of this week has been getting familiar with Git and GitHub. These two tools are fast becoming the industry standard for version control. They allow scientists, engineers, hobbyists and the like to coordinate projects from all over the world without writing over each other's changes. In addition, if you were to, say, write a line of code that breaks everything, Git keeps a history of what are called "commits": you can revert to a previous commit and get back to your working version. Git is the program which runs version control. GitHub is an online service, similar to Dropbox, that allows you to host projects and collaborate with others. Here's a link to mine. There isn't much there yet, but it will be filling up fast.

SQL (it’s just a puzzle to get stuff)

In the era of big data, sometimes the biggest problem is just accessing the information you need and leaving the rest behind. SQL (Structured Query Language) is a language used by many industries to access their data. Here's a little example. Let's say I have a table called "my_table" and it contains a "favorite_cheese" column.

SELECT * FROM my_table WHERE favorite_cheese='camembert';

This query would return every row of 'my_table' where the 'favorite_cheese' column equals 'camembert'. Seems simple enough, but by getting creative you can perform incredibly complex operations to pull out exactly the results you are looking for.
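If you want to play with queries like this without installing a database server, Python's built-in sqlite3 module is enough. The table contents here are made up to match the example:

```python
import sqlite3

# Build a throwaway in-memory database with the table from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, favorite_cheese TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("Alice", "camembert"), ("Bob", "cheddar"), ("Carol", "camembert")],
)

# The same WHERE-clause query as in the post.
rows = conn.execute(
    "SELECT * FROM my_table WHERE favorite_cheese='camembert'"
).fetchall()
```

Only Alice's and Carol's rows come back, since Bob prefers cheddar. Swapping the connection string for a real database is mostly all it takes to scale the same queries up.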

We also covered Bash, Object-Oriented Programming, Pandas, AWS and more, but I'll try to address those in future posts. The one thing I will mention is that if you type

ack --cathy

into your shell, you’ll see an ASCII version of Cathy saying “Chocolate, Chocolate, Chocolate, AACK!”.  How useful is that?!


The week ended with a happy hour hosted specifically for Galvanize’s Data Science students past and present. We were able to meet students from the previous cohort and learn about their experiences during and after Galvanize. We’ve got a great group and I’m happy to be learning and working with these people.

Next Week: In Week 2, the focus will be on statistics and probability. Stay tuned!


Galvanize Data Science: Week 0


View of Pioneer Square from Galvanize's headquarters in downtown Seattle

In the data science program at Galvanize, you sign up for a 13-week intensive course in Python, machine learning, statistics and more. It is meant to be a highly efficient means of transitioning into the data science and analytics field, a transition I've been excited to make for some time now.

It turns out that Galvanize also offers Week 0, a voluntary week specifically focused on getting the members of the cohort up to snuff on Python programming and linear algebra.

I knew going into the program that Galvanize was going to be an intense academic challenge. Already on Day 1 of Week 0, I was having to work quite hard, thinking back to my undergraduate days of vector spaces and matrix algebra. Luckily, nothing has been too taxing as of yet.

I've been enjoying playing around with the Atom text editor, which is a very powerful and flexible way of writing in many different languages. In fact, I'm writing this entry in Markdown right now. One of my favorite things about it is that I can use LaTeX math notation right in the editor, meaning I can write out complex equations, arrays and the like quite easily.

The location and setting of Galvanize are both quite awesome. It is located in the heart of Seattle's Pioneer Square in a renovated brick building (which apparently used to be NBBJ's headquarters). Housed in the building, in addition to Galvanize's education programs, are many startups, making the atmosphere busy and exciting. Because this week is voluntary, only part of my future cohort is here, but so far everyone, including the teachers, seems very intelligent, motivated, and friendly.

I’m excited for the next 14-ish weeks of my life and the challenges and opportunities that this fellowship will bring me. My plan is to write a blog entry for each week of the program so people can track my progress, and see what a programming bootcamp is really like.

Analyzing the relationship between retail pot sales and call-center data

For years, the criminalization of marijuana sale and usage has made data collection and research on the topic difficult to perform. In Washington State, recreational marijuana went on sale in local dispensaries starting mid-2014. Whether or not the opening of a dispensary produces a spike in the amount and type of marijuana use is a valid question for legislators, administrators, doctors and more.

As an exploratory exercise, I created the following map using call-center data gathered by the Washington State Poison Center on marijuana use and data scraped from the web on the location and opening date of retail marijuana shops in Washington State. Data ranges from January 2014 to August 2015. Both calls and shops are localized by zip code. By scrolling through we can see where and when shops and cases cropped up.

“Cases” are any calls that went to the Washington Poison Center related to Marijuana usage. This could be anything from “My child got into my weed cookies” to a doctor calling to consult on someone who ingested too much Marijuana.


In this period of time there were only a few hundred cases. This was enough however to see some trends in the data. The highest number of cases occurred in the U District and in Pioneer Square.

Please note that currently only shops in KING COUNTY are shown.

This map was created using R, Leaflet, and Shiny.

[R] A little bit on multidimensional arrays and apply()

The command line can be a little unintuitive when dealing with multidimensional objects, since it is a 2D medium and it is therefore hard to envision objects with more than two dimensions. They exist, however!

An array, in R, is simply a vector (a list of objects) where each element has additional "dimension" attributes. In other words, each vector element is given a dimensional position. This is fairly easy to represent 3-dimensionally (see below), but there is no reason why additional dimensional attributes cannot be applied to each vector element, placing them in the 4th, 5th … nth dimensions.

Using array(), I created a 3-dimensional array object (represented by the box of numbers you see below) populated with values 1 to 4. Each of these is given a dimensional attribute: the 1's are located at [1,1,1] and [1,2,1], the 4's are located at [2,1,2] and [2,2,2], and so on.

Here is the array function:

array(data, dimensions,...)


The first argument of array() is the actual data to be used. The second argument, dimensions, is an integer vector giving the maximum dimensions of the array; for the example above, this is 2 by 2 by 2.

Using apply(), we can perform functions, in this case sum(), on elements which are aligned in certain directions. The apply() function takes the following arguments:

apply(X, margins, FUN)

where X is the array over which apply should be… applied, margins is an integer vector telling R which margins (dimensions) to maintain and which to collapse, and FUN is the function to be applied. Basically, the apply() function is taking the sum over all elements along a certain edge of the cube. The margins argument simply tells R which edges we are summing over. In the examples below, R converts a 3D array object into a 2D object. You can see the effect of changing the margins argument on the final result of the summed arrays shown below.
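For anyone more comfortable in Python, numpy's axis argument plays the same role as apply()'s margins, just inverted: apply() names the dimensions to keep, while numpy's sum() names the dimension to collapse. A quick sketch, rebuilding the same 2 by 2 by 2 array in R's column-major order:

```python
import numpy as np

# Fill a 2x2x2 array in column-major ("Fortran") order, mirroring R's fill
# order, so the 1's sit at R positions [1,1,1] and [1,2,1] and the 4's at
# [2,1,2] and [2,2,2], as described above.
a = np.array([1, 2, 1, 2, 3, 4, 3, 4]).reshape((2, 2, 2), order="F")

# R's apply(a, c(1, 2), sum) keeps dimensions 1 and 2 and collapses
# dimension 3; in numpy, that is summing over axis 2.
summed = a.sum(axis=2)
```

Either way, the 3D box collapses to a 2D sheet whose entries are sums along the collapsed edge.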


Wiki is up!


One of the goals that I've had for a long time is to create the Wikipedia page for freeze-casting. Even though it's a relatively "hot" technique, at least in the ceramics world, there wasn't a single entry for freeze-casting. This meant that when I started my project I actually had to read the scientific literature to figure things out. The closest thing to freeze-casting on Wikipedia was a topic known as freeze-gelation, which is really not freeze-casting. More than four years after I started my thesis, there was still nothing on Wikipedia about freeze-casting. I'd already done all the research and spent literally hundreds of hours creating figures and gathering references for the topic, so I figured, why shouldn't I do it? The article isn't comprehensive, as there are many topics left to cover, but I think it does a good job of explaining the basic concepts of freeze-casting. Wikipedia, however, apparently thinks it still has many problems (it was graded as a C-Class article), but it's been accepted and is available for public viewing.

For reference, a C-Class Wikipedia is defined as follows:

The article is substantial, but is still missing important content or contains much irrelevant material. The article should have some references to reliable sources, but may still have significant problems or require substantial cleanup.

If you’re interested please take a look, make edits if necessary and link to it if desired. Thanks for reading!