Week 5 – There’s MongoDB in my Beautiful Soup

Who comes up with these names???

This week we touched on all sorts of topics. We started off studying data science for business, looking at how to translate different models into profit curves. We then moved on to web scraping (yes, that is the proper terminology). This is one of my favorite parts of data science. Using just a few lines of code, you can automatically grab information from the internet and store it for later use. We used the Beautiful Soup library to scrape thousands of New York Times articles into a MongoDB database. The web scraping part is awesome, but I found the MongoDB database pretty confusing, mostly because its shell language is JavaScript, a language I have very little experience with.
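To give a flavor of what those few lines of code look like, here's a minimal sketch of the scrape-and-store pipeline. The HTML below is a stand-in for a page you'd actually download (the class names and structure are invented, not the real NYT markup), and the pymongo lines are commented out since they need a live Mongo server:

```python
from bs4 import BeautifulSoup

# Stand-in for a page you'd normally fetch (e.g. with the requests library);
# the "story" class and layout here are invented for illustration.
html = """
<div class="story"><h2>Headline One</h2><p>Body text one.</p></div>
<div class="story"><h2>Headline Two</h2><p>Body text two.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each article becomes a plain dict, which maps directly onto a Mongo document.
articles = [
    {"headline": div.h2.get_text(), "body": div.p.get_text()}
    for div in soup.find_all("div", class_="story")
]

# Storing them is one call with pymongo (commented out: it needs a running server):
# from pymongo import MongoClient
# MongoClient()["nyt"]["articles"].insert_many(articles)
```

The real exercise looped this over thousands of article URLs, but the shape is the same: parse, build dicts, insert.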

Natural Language Processing

After extracting all this text data, we naturally had to come up with ways of analyzing it, a field called Natural Language Processing (NLP). Like every day, the instructors at Galvanize smashed an insane amount of material into one day. I'm still trying to process it all, but basically it boils down to this: NLP is a difficult field, since words in different arrangements or with different punctuation can have different meanings and connotations.

Let’s eat Grandma. Let’s eat, Grandma.

There are operations you can do to make NLP more successful; dropping suffixes and simplifying words (e.g. car, cars, car's, cars' $\Rightarrow$ car), a process called stemming, is one example. But it is still a developing field with lots of progress to be made. For our assignment, we were asked to analyze how similar various New York Times articles were to each other.
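Here's a toy, from-scratch sketch of that similarity exercise: a very crude suffix-stripper standing in for a real stemmer, then TF-IDF weighting and cosine similarity. (In practice you'd reach for NLTK or scikit-learn; this just shows the mechanics.)

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, strip edge punctuation, and apply a *very* crude stemmer."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")
        if word.endswith("'s"):                    # car's -> car
            word = word[:-2]
        if word.endswith("s") and len(word) > 3:   # cars -> car (crude!)
            word = word[:-1]
        if word:
            tokens.append(word)
    return tokens

def tf_idf_vectors(docs):
    """One sparse dict-vector per doc: term frequency x inverse document frequency."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()                     # in how many docs does each word appear?
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [
        {w: (c / len(toks)) * math.log(n / df[w]) for w, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "The cat sat on the mat.",
    "A cat sat on a mat.",
    "Stock markets fell sharply today.",
]
vecs = tf_idf_vectors(docs)
```

The first two sentences come out far more similar to each other than either is to the third, which is the gist of what we did with the NYT articles, just at much larger scale and with a proper stemmer.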

Time Series


Example of time series data (top) broken down into ‘seasonal’, ‘trend’, and ‘white noise’ components.

Finally, we studied Time Series. I'll admit that by this point in the week I was pretty much mentally checked out. I liked Time Series, though: it was all about decomposing your signal into multiple 'subsignals'. For example, let's take temperature. At the scale of a day, the temperature goes up and down with the sun. At the scale of a year, the temperature goes up during summer and down in the winter. And if we were to look at an even larger, multi-year scale, we'd see an overall upward trend, because … global warming. If you have a good model, you should be able to capture the trend and the different cyclical components, leaving behind only white noise. In practice, this is quite difficult.
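Here's a rough sketch of that decomposition on a made-up series with a weekly cycle: a centered moving average estimates the trend, per-position averages of the detrended values estimate the seasonal component, and whatever's left is the residual. (Real work would use something like statsmodels' `seasonal_decompose`; the series here is noise-free so the pieces separate cleanly.)

```python
# A made-up series: a slow linear trend plus a repeating weekly pattern,
# with no noise, so the decomposition should recover the pieces near-exactly.
n, period = 70, 7
pattern = [0, 1, 2, 3, 2, 1, 0]
series = [0.01 * i + pattern[i % period] for i in range(n)]

# Trend: centered moving average over one full period (period must be odd here).
half = period // 2
trend = [None] * n
for i in range(half, n - half):
    trend[i] = sum(series[i - half:i + half + 1]) / period

# Seasonal: average the detrended values at each position in the cycle.
buckets = {d: [] for d in range(period)}
for i in range(half, n - half):
    buckets[i % period].append(series[i] - trend[i])
seasonal = [sum(buckets[d]) / len(buckets[d]) for d in range(period)]

# Residual: whatever the trend and seasonal components fail to explain.
residual = [series[i] - trend[i] - seasonal[i % period]
            for i in range(half, n - half)]
```

With real data the residual won't vanish like it does here; the whole game is getting it as close to white noise as you can.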

On the final day of the week we did a day-long analysis using data from a ride-sharing company (think Uber or Lyft). Using real data, we were able to predict customer 'churn' (when a customer leaves a company) and make recommendations for how to retain and build a customer base. My team built a logistic regression model and a random forest to capture and analyze these data.
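For a flavor of the churn model, here's a from-scratch logistic regression on invented data. The features and generating rule are made up for illustration (our actual project used the company's real data and library implementations, and I'm skipping the random forest entirely):

```python
import math
import random

random.seed(0)

# Invented churn data: two features per user (rides in the last month, average
# trip rating); label 1 = churned. The made-up rule assumes low-activity,
# low-rating users churn more often, plus some noise.
data = []
for _ in range(200):
    rides = random.uniform(0, 30)
    rating = random.uniform(1, 5)
    churned = 1 if rides + 5 * rating < 25 + random.gauss(0, 2) else 0
    data.append(([rides, rating], churned))

features = [x for x, _ in data]
labels = [y for _, y in data]

# Standardize each feature so gradient descent converges quickly.
def mean_std(values):
    m = sum(values) / len(values)
    s = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return m, s

stats = [mean_std(col) for col in zip(*features)]
X = [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in features]

def sigmoid(z):
    z = max(min(z, 60.0), -60.0)      # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

# Batch gradient descent on the log loss.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(3000):
    gw, gb = [0.0, 0.0], 0.0
    for x, y in zip(X, labels):
        err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
        gw[0] += err * x[0]
        gw[1] += err * x[1]
        gb += err
    w[0] -= lr * gw[0] / len(X)
    w[1] -= lr * gw[1] / len(X)
    b -= lr * gb / len(X)

predictions = [int(sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5) for x in X]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

Both learned weights come out negative, matching the intuition: more rides and higher ratings mean a customer is less likely to churn.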

The week ended with a much needed happy hour.

Thanks for hanging in there with me.

Next up: Unsupervised Learning + Capstone Ideas



  1. Is it sick of me to love that teaching you how to deal with unmanageable amounts of data is done by throwing at you unmanageable amounts of data?

    I’m sorry. I’ll stop (;

    You’re amazing ❤️

  2. Did you use PyMongo? PyMongo does a pretty good job of isolating you from the JavaScript idiosyncrasies of Mongo, and gives a nice Pythonic interface.

    The easiest way to think about Mongo is to think of each “row” as a Python dictionary: a mapping from column names to the data. It’s a lot more flexible than a typical database because you don’t need to set up a schema. So it gracefully handles situations where different combinations of columns are relevant to different rows, or where you realize later that you want to add more columns to your data. Its indexing and querying are somewhat more limited; specifically, you can’t do much in the way of joining, so it works best for tables that are entirely self-contained.
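    To make that mental model concrete with plain dicts (no Mongo or server involved; pymongo's real query language is much richer, but the shape is the same):

```python
# Plain-Python illustration: documents are just dicts, and rows
# don't all need the same keys.
articles = [
    {"headline": "A", "body": "text", "author": "Jane"},
    {"headline": "B", "body": "text"},                     # no author: fine
    {"headline": "C", "body": "text", "tags": ["tech"]},   # extra field: also fine
]

# A Mongo-style filter like {"author": "Jane"} is essentially key/value matching.
def find(collection, query):
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

matches = find(articles, {"author": "Jane"})
```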

    1. They had us use PyMongo after straight MongoDB. I understand and appreciate the flexibility of MongoDB. I guess I was just annoyed that, after all the cramming we’d already had to do, I also had to wrap my head around a seemingly cryptic language.

      1. JavaScript is indeed a pretty sucky language. As a data scientist, odds are good that you’ll end up preparing web visualizations of your data using D3, so it’s worth knowing.

        I think the Mongo developers probably chose JavaScript because it provides a lightweight way to express objects, and a really convenient way of sending objects and functions “over the wire” between a client application and the database (JSON notation originated in JavaScript).

        Python actually meshes pretty well with Mongo, so your experience will probably be more positive once you have the freedom to interact with the database solely from Python.

        1. Lol, you don’t have to defend MongoDB. Also, I love what D3 can do, I’ve been trying to find time to do some tutorials for a while.
