All about my experience at the Metis Data Science Bootcamp
Week one of the program was dubbed "The Essentials" and included a quick dive into the Python libraries we would be using throughout the course of the program.
It was a whirlwind of a week. I was very thankful I worked through the Udemy course Python for Data Science and Machine Learning Bootcamp before beginning this program or I would have been quite overwhelmed. Project Benson was a group project that got us using the Python skills we were learning right away with real data. I did a quick write-up of our project on my personal site here.
Week two of the program was our introduction to Machine Learning and Linear Regression. We also began our second project, Luther, and learned about web scraping.
In the prework, we had been required to go through the book Think Stats. This was an important foundation for this week, as we quickly cruised through interpreting the model and the underlying assumptions of Linear Regression. We also touched on important concepts such as t-tests, collinearity, normality of the residuals, transformations, and many other topics relevant to creating solid models.
Week three we continued with Linear Regression, taking a deeper dive into the assumptions behind it as well as Regularization. On Friday of that week we presented Luther, our second project.
For the Luther project I chose to look at Building Energy Benchmarking in Minneapolis. The goal of the project was to create a simple Linear Regression model predicting a property's Energy Star Score. I ambitiously decided to try to combine three different data sources for this project. A large portion of my time was spent gathering the data and really putting my Python skills to work cleaning it up. Each data source was its own challenge.
It was a good exercise in gathering data as well as a lesson in how to handle a project where the results are stubborn in coming together due to the messiness of the data -- a very real world scenario. I did a write-up of the project results here.
Here I wanted to take a few minutes to talk through the major concepts we learned during these weeks and how they all came together in Luther. This project was far from perfect, but it did act as a good learning exercise to better understand Linear Regression.
scikit-learn and statsmodels are two Python libraries that make it quite easy to implement Linear Regression. The challenging part is interpreting the model and making sure it does not violate the base assumptions of Linear Regression (which are many).
After collecting data and doing a fair amount of wrangling to combine the three data sources, I arrived at a dataset of just over 300 properties. I then began the modeling process. As an example, let's take this results summary from an initial pass I did with OLS Linear Regression. I should also note that for this modeling exercise I did not leave out a portion of the data to act as a test set, since I was not trying to predict anything; I opted instead to make optimal use of the data I had in order to understand which features had the most influence.
When creating models, it is quite easy to fixate on the "score" of the model, which for Linear Regression is often the R-squared score. In this case the R-squared was 0.83, meaning that about 83 percent of the variation in the target was explained by this model. Before going any further, though, there are some problematic things about this model that need to be addressed before looking at the coefficients or interpreting the model for conclusions about the features being measured.
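For the intuition behind that "percent of variation explained" reading: R-squared is just one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A toy calculation (the numbers here are invented):

```python
import numpy as np

# Toy values; y_hat stands in for a model's predictions of y.
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_hat = np.array([12.0, 18.0, 33.0, 38.0, 49.0])

ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares: 22.0
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean: 1000.0
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))  # prints 0.978
```

A model that predicted the mean of y every time would score 0; perfect predictions score 1.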
The main assumptions of Linear Regression are as follows:
- Linearity: the relationship between the features and the target is linear.
- Independence: the errors are independent of one another.
- Homoscedasticity: the errors have constant variance.
- Normality: the residuals are normally distributed.
- Little or no multicollinearity among the features.
Some tests we can use to test these assumptions:
- Residual plots (residuals vs. fitted values) for linearity and constant variance.
- Q-Q plots or the Jarque-Bera test for normality of the residuals.
- The Durbin-Watson statistic for autocorrelation in the errors.
- Variance Inflation Factors (VIF) for multicollinearity.