The Metis Experience

All about my experience at the Metis Data Science Bootcamp

Onsite in Chicago, Weeks 1-3

Off and Running with Python and Our First Project

Week one of the program was dubbed "The Essentials" and included a quick dive into the Python libraries we would be using throughout the program. For example:

  • Day one we explored Pandas and were given our first project, Benson.
  • Day two we dove into visualizations using Matplotlib and Seaborn.
  • Day three we reviewed Git and worked on our projects.
  • Day four we did more advanced Pandas.
  • Day five we presented our first project, Benson.

It was a whirlwind of a week. I was very thankful I had worked through the Udemy course Python for Data Science and Machine Learning Bootcamp before beginning this program; otherwise I would have been quite overwhelmed. Project Benson was a group project that had us applying the Python skills we were learning to real data right away. I did a quick write-up of our project on my personal site here.

Intro to Machine Learning and Linear Regression

Week two of the program was our introduction to Machine Learning and Linear Regression. We also began our second project, Luther, and learned about web scraping.

  • Monday we learned about scraping and started our second project, Luther.
  • Tuesday and Wednesday we learned about Linear Regression.
  • Wednesday we were introduced to the concept of train and test datasets.
  • We finished the week off looking at probability.

In the prework, we had been required to go through the book Think Stats. This was an important foundation for this week as we quickly cruised through interpreting the model and the underlying assumptions of Linear Regression. We also touched on important concepts such as t-tests, collinearity, normality of the residuals, transformations, and many other topics relevant to creating solid models.

Week three we continued with Linear Regression, taking a deeper dive into the assumptions behind it as well as Regularization. On Friday of this week we presented Luther, our second project.

The Luther Project Take-Aways

For the Luther project I chose to look at Building Energy Benchmarking in Minneapolis. The goal of the project was to create a simple Linear Regression model predicting a property's Energy Star Score. I ambitiously decided to try to combine three different data sources for this project. A large portion of my time was spent gathering the data and really putting my Python skills to work cleaning it up. Each data source was its own challenge.

  • Energy Benchmarking Data from the city of Minneapolis. This data contained ~250 buildings, but was obviously put together by hand as the formatting was not consistent. This became a huge headache as I tried to automate the process of using the address to search for property info.
  • Hennepin County Parcel Data was easily obtainable from the county's website as a shapefile, but it was not straightforward how to extract the table information. Again, with the use of some Python magic, I was able to extract the information I needed.
  • Property Info from the city of Minneapolis website was obtained by scraping the site using Beautiful Soup and Selenium. This was imperfect, as it required automating the address search and handling the variety of responses that came back (a rough sketch of that loop follows this list).
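Here is a rough sketch of that scraping loop, assuming a Selenium-driven browser and BeautifulSoup for parsing. The URL, form field id, and table selector below are placeholders I made up for illustration; the real site has its own markup and its own quirks.

    # Rough sketch of the address-search scraping loop (placeholder URL and selectors).
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def lookup_property(driver, address):
        """Search the property-info site by address and parse the first result table."""
        driver.get("https://example.com/property-search")        # placeholder URL
        box = driver.find_element(By.ID, "address")               # placeholder field id
        box.clear()
        box.send_keys(address)
        box.submit()

        soup = BeautifulSoup(driver.page_source, "html.parser")
        table = soup.find("table", class_="property-details")     # placeholder selector
        if table is None:
            return None  # "no match" / "multiple matches" pages need their own handling
        return {row.th.get_text(strip=True): row.td.get_text(strip=True)
                for row in table.find_all("tr") if row.th and row.td}

    addresses = ["250 S 4TH ST", "350 S 5TH ST"]   # in practice these came from the benchmarking data
    driver = webdriver.Chrome()
    records = [lookup_property(driver, addr) for addr in addresses]
    driver.quit()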

It was a good exercise in gathering data as well as a lesson in how to handle a project where the results are stubborn in coming together due to the messiness of the data -- a very real-world scenario. I did a write-up of the project results here.

Here I wanted to take a few minutes to talk through the major concepts we learned during these weeks and how they all came together in Luther. This project was far from perfect, but it did act as a good learning exercise to better understand Linear Regression.

Scikit-learn and statsmodels are two Python libraries that make it quite easy to implement Linear Regression. The challenging part is interpreting the model and making sure it does not violate the base assumptions of Linear Regression (which are many).
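To make that concrete, here is a minimal sketch of fitting the same model with both libraries. The file and column names are placeholders, not the actual features from my dataset.

    # Minimal sketch: the same OLS fit in scikit-learn and statsmodels (placeholder columns).
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("properties.csv")                  # hypothetical cleaned dataset
    X = df[["floor_area", "year_built", "site_eui"]]    # placeholder feature names
    y = df["energy_star_score"]

    # scikit-learn: quick fit, coefficients, and R-squared
    sk_model = LinearRegression().fit(X, y)
    print(sk_model.coef_, sk_model.score(X, y))

    # statsmodels: same model, plus the full diagnostic summary table
    sm_model = sm.OLS(y, sm.add_constant(X)).fit()
    print(sm_model.summary())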

OLS Linear Regression First Pass

After collecting data and doing a fair amount of wrangling to combine the three data sources, I arrived at a dataset of just over 300 properties. I then began the modeling process. As an example, let's take this results summary from an initial pass I did with OLS Linear Regression. I should note that for this modeling exercise I did not leave out a portion of the data to act as a test set, since I was not trying to predict anything; I opted instead to make optimal use of the data I had in order to understand which features had the most influence.
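For reference, the combining step looked roughly like the sketch below. The file names and the join key are illustrative only; in practice the hand-entered addresses needed far more normalization than a single upper-case/strip pass.

    # Illustrative sketch of merging the three sources on a cleaned-up address key.
    import pandas as pd

    benchmarking = pd.read_csv("mpls_energy_benchmarking.csv")   # city benchmarking data
    parcels = pd.read_csv("hennepin_parcels.csv")                # attribute table pulled from the shapefile
    prop_info = pd.read_csv("scraped_property_info.csv")         # output of the scraping step

    # Normalize the hand-entered addresses before using them as a join key
    for frame in (benchmarking, parcels, prop_info):
        frame["address"] = frame["address"].str.upper().str.strip()

    combined = (benchmarking
                .merge(parcels, on="address", how="inner")
                .merge(prop_info, on="address", how="inner"))
    print(combined.shape)   # size of the final property-level dataset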

When creating models, it is quite easy to fixate on the "score" of the model, which for Linear Regression is often the R-squared score. In this case the R-squared was 0.83, which means that about 83 percent of the variation was explained by this model. Before going any further, though, there are some problems with this model that need to be addressed before looking at the coefficients or interpreting the model for conclusions about the features being measured.

The main assumptions of Linear Regression are as follows:

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

Some tests we can use to check these assumptions (a sketch of reading them off a fitted model follows this list):

  • Prob (F-statistic): if the p-value < 0.05, we can reject the null hypothesis that the model as a whole explains nothing (i.e., that all coefficients are zero).
  • P>|t|: if the p-value < 0.05, we can reject the null hypothesis that the coefficient is zero; this variable does contribute to the model.
  • Prob(Omnibus): the p-value for the omnibus normality test. If the p-value < 0.05, we reject the null hypothesis, meaning that the residuals do not follow the normal distribution we assumed.
  • Jarque-Bera: another normality test on the residuals.
  • Prob(JB): the null hypothesis is that ε (the residuals) is normally distributed.
  • Skewness and Kurtosis: the idea is that we are looking for a skewness coefficient of ~0 and kurtosis of ~3; Jarque-Bera tests whether those conditions hold against the alternatives.
  • Condition Number: as the condition number becomes quite large, this implies that the matrix is ill-conditioned (does not have a unique, well-defined solution). This may be due to multicollinear relationships between independent variables.
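Most of these numbers can also be pulled off a fitted statsmodels result programmatically rather than read from the printed summary. A sketch, assuming a fitted OLS result like the sm_model object from the earlier snippet:

    # Pulling the diagnostics above off a fitted statsmodels OLS result (sm_model from earlier).
    from statsmodels.stats.stattools import jarque_bera
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    print("R-squared:         ", sm_model.rsquared)
    print("Prob (F-statistic):", sm_model.f_pvalue)    # overall model significance
    print("P>|t| per feature:\n", sm_model.pvalues)    # which variables contribute

    jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(sm_model.resid)
    print("Prob(JB):", jb_pvalue, "skew:", skew, "kurtosis:", kurtosis)

    print("Condition number:", sm_model.condition_number)

    # Variance inflation factors as a more direct multicollinearity check
    exog = sm_model.model.exog
    for i, name in enumerate(sm_model.model.exog_names):
        print(name, variance_inflation_factor(exog, i))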