The Metis Experience

All about my experience at the Metis Data Science Bootcamp

Onsite in Chicago, Weeks 4-6

Weeks four through six took us through all of the classification models and a whole grab bag of topics related to working in the cloud and web development.

We covered a majority of the main classification algorithms in use today. For each algorithm we took an in-depth look at its mathematical underpinnings and talked in detail about its strengths and weaknesses. A common review exercise was comparing and contrasting when it would be appropriate to use one algorithm over another, and how you would determine which performed best. I've created a summary 'cheat sheet' and shared it below.
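
Before diving into the cheat sheet, here is a minimal sketch of the comparison workflow itself, assuming scikit-learn: score every candidate model with the same cross-validation folds. The dataset, model list, and parameter values are placeholders, not recommendations.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder dataset; swap in your own features and labels.
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    models = {
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "tree": DecisionTreeClassifier(max_depth=5),
    }

    # Score every candidate with the same 5-fold cross-validation.
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")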

Classification Models

KNN

Key Concept:

  • Non-parametric model that can be used for both regression and classification.
  • Looks for the k closest samples in the training set.
  • Fits fast, predicts slowly; holds the entire training set in memory to make a prediction.

Assumptions:

Scale-variant; data should be standardized.

Steps:

  1. Select k, the number of neighbors that will be used.
  2. To predict a new point, take the majority class (classification) or the average (regression) of its k nearest neighbors.

Hyperparameters:

k, the number of neighbors (see Steps).


Model Characteristics

Interpretability: Highly interpretable; a prediction can be explained by pointing to the nearest neighbors that produced it.
Computation Time: Quick to fit, slow to predict. Requires holding all data in memory to predict, and may run slowly on very large datasets.
Accuracy: Varies.
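
A minimal KNN sketch, assuming scikit-learn and a built-in dataset; the pipeline standardizes features first (KNN is scale-variant), and k=5 is just a placeholder.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Standardize, then classify by majority vote of the k nearest neighbors.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    knn.fit(X_train, y_train)          # "fitting" just stores the training set
    print(knn.score(X_test, y_test))   # prediction is where the real work happens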

Logistic Regression

Key Concept:

  • Models a binomial or multinomial distribution over the classes.
  • Outputs probability.
  • Must set a threshold for a specific output.

Assumptions:

Scale-variant; data should be standardized.

Steps:

  1. Fit coefficients by maximizing the likelihood of the training labels.
  2. Pass each sample's linear combination of features through the sigmoid to get a probability.
  3. Apply a threshold to turn the probability into a class label.

Hyperparameters:

Regularization type (L1 or L2) and strength; the decision threshold.


Model Characteristics

Interpretability: Highly interpretable; returns probabilities and coefficients. Loses interpretability with multiple classes (the one-vs-rest probabilities do not add up to one, unless using a multinomial/softmax formulation).
Computation Time: Linear and relatively fast; scales well. Multi-class is much slower.
Accuracy: Varies.
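
A short sketch of the probability-plus-threshold idea, assuming scikit-learn; the 0.3 threshold is an arbitrary example of trading precision for recall.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # The model outputs probabilities; choosing a threshold is up to you.
    probs = model.predict_proba(X_test)[:, 1]   # P(class == 1)
    labels = (probs >= 0.3).astype(int)         # lower threshold -> more positives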

SVM

Key Concept:

  • Can be used for regression or classification.
  • Non-parametric classification method.
  • The goal is to maximize the margin.
  • Uses the kernel trick to transform the data, then finds an optimal boundary between the possible outputs in the transformed space.

Assumptions:

Scale-variant; data should be standardized. Otherwise makes few assumptions about the data.

Steps:

  1. Identify the support vectors that define the hyperplane.
  2. Maximize the distance (margin) between the two groups.
  3. If using a polynomial or RBF kernel, the kernel trick first transforms the data you give it.

Hyperparameters:

  • Kernel: linear, polynomial, or RBF.
  • C: regularization strength, analogous to lambda in regularized regression but inverted (smaller C means stronger regularization).
  • Gamma: kernel coefficient for the RBF and polynomial kernels; controls how far a single training example's influence reaches.


Model Characteristics

Interpretability: A linear kernel is easily interpretable; regression and the other kernels are less so. The complex data transformations and the resulting boundary plane are very difficult to interpret.
Computation Time: Depends on the kernel; linear is much faster than polynomial or RBF.
Accuracy: Varies.
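
A brief SVC sketch, assuming scikit-learn, wiring up the kernel, C, and gamma hyperparameters discussed above; the values shown are defaults, not tuned settings.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # RBF kernel: C trades margin width against misclassification;
    # gamma controls how far one training point's influence reaches.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    svm.fit(X_train, y_train)
    print(svm.score(X_test, y_test))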

Naive Bayes

Key Concept:

Applies Bayes' theorem to each feature independently, then multiplies the resulting conditional probabilities together.

Assumptions:

Assumes independent features.

Steps:

  1. Compute class priors and per-feature conditional probabilities from counts in the training data.
  2. For a new point, multiply each class's prior by the conditional probabilities of the observed features.
  3. Predict the class with the largest resulting posterior.

Hyperparameters:

Smoothing parameter (additive/Laplace smoothing).


Model Characteristics

Interpretability: Highly interpretable. Calculations consist of simple counting and multiplication.
Computation Time: Performs well.
Accuracy: Varies.
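
A minimal sketch, assuming scikit-learn's GaussianNB (the variant for continuous features; MultinomialNB is the counting variant for count data such as word frequencies).

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Each feature contributes an independent likelihood; a class's score is
    # its prior multiplied by every per-feature likelihood.
    nb = GaussianNB()
    nb.fit(X_train, y_train)
    print(nb.score(X_test, y_test))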

Decision Tree

Key Concept:

  • Can be used for regression or classification.
  • Greedy algorithm: chooses the locally best split at each node.
  • Frequentist methodology.

Assumptions:

Non-parametric, scale invariant.

Steps:

  1. Select the attribute whose split gives the largest reduction in entropy (information gain).
  2. Create child nodes based on the split.
  3. Recurse until we reach a stopping point: all examples in a node share the same class, a minimum number of samples is reached, or the tree becomes too large.

Hyperparameters:

Split criterion: regression uses MSE; classification uses entropy or Gini impurity. Stopping rules such as maximum depth and minimum samples per leaf are also tuned.


Model Characteristics

Interpretability: Highly interpretable.
Computation Time: Scales well (parallelizable, depending on implementation).
Accuracy: Prone to overfitting.
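
A quick sketch with scikit-learn; the entropy criterion and the stopping-rule values are illustrative, and export_text shows off the interpretability by printing the fitted tree as if/else rules.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # criterion picks the split metric; the depth and leaf-size limits are
    # the stopping rules that keep the tree from overfitting.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                                  min_samples_leaf=10)
    tree.fit(X_train, y_train)
    print(export_text(tree))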

Random Forest

Key Concept:

  • Can be used for regression or classification.
  • An ensemble of decision trees, each trained on a bootstrapped sample of the data (bagging), typically with a random subset of features considered at each split.
  • Predictions are aggregated across trees by majority vote or averaging.

Assumptions:

Non-parametric, scale invariant.

Steps:

  1. Draw a bootstrapped sample of the training data.
  2. Fit a decision tree to it, considering a random subset of features at each split.
  3. Repeat for many trees; aggregate their predictions by majority vote (classification) or averaging (regression).

Hyperparameters:

Number of trees, number of features considered per split, plus the usual tree-level settings (maximum depth, minimum samples).


Model Characteristics

Interpretability: Less interpretable than a single tree, though feature importances can still be extracted.
Computation Time: Scales well (highly parallelizable).
Accuracy: Generally strong; averaging many trees reduces the overfitting seen with a single decision tree.
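
A short sketch, assuming scikit-learn; n_estimators and max_features are the hyperparameters noted above, and n_jobs=-1 exercises the parallelism mentioned in the table.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Each tree sees a bootstrapped sample and a random feature subset;
    # n_jobs=-1 builds trees in parallel, which is why this scales well.
    forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                    n_jobs=-1, random_state=42)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))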

Gradient Boost

Key Concept:

  • Gradient boosting = gradient descent + boosting.
  • Can be used for regression or classification.
  • Combines weak "learners" into a single strong learner in an iterative fashion.
  • "Shortcomings" of the current model are identified by gradients of the loss.

Assumptions:

Non-parametric, scale invariant.

Steps:

  1. Build a first learner to predict the values/labels of the samples, and calculate the loss (the difference between the first learner's output and the true values).
  2. Build a second learner to predict the residuals left by the first.
  3. Continue adding a third, fourth, … learner until a threshold or stopping condition is met.

Hyperparameters:

Number of learners, learning rate, and the complexity (e.g., depth) of each weak learner.


Model Characteristics

Interpretability: Low; the ensemble is hard to interpret directly, though feature importances are available.
Computation Time: Learners are built sequentially, so training parallelizes less readily than a random forest.
Accuracy: Often excellent on tabular data, but sensitive to tuning and prone to overfitting if left unchecked.
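
To make the steps concrete, here is a toy from-scratch sketch for squared-error regression, where the gradient of the loss is simply the residual; the data, tree depth, learning rate, and iteration count are all illustrative. In practice you would reach for a library implementation such as scikit-learn's GradientBoostingClassifier.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

    learning_rate = 0.1
    prediction = np.full_like(y, y.mean())  # the first "learner" is a constant
    trees = []

    for _ in range(100):
        residuals = y - prediction           # for squared error, gradient = residual
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)               # the next learner targets the shortcomings
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)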

Ada Boost

Key Concept:

  • A special case of gradient boosting.
  • Can be used for regression or classification.
  • Combines weak "learners" into a single strong learner in an iterative fashion.

Assumptions:

Non-parametric and scale invariant when tree-based weak learners are used (the common case).

Steps:

  1. The user specifies a type of weak learner (alternatively, a set of weak learners is generated before the real learning process begins).
  2. The algorithm then learns the weights for combining these learners into a strong learner. Each learner's weight reflects how accurately it predicts: if a learner mispredicts a sample, its weight is reduced, and mispredicted samples are emphasized in the next round.
  3. This process repeats until convergence.

Hyperparameters:

Number of weak learners and the learning rate.


Model Characteristics

Interpretability: Low; like other ensembles, difficult to interpret directly.
Computation Time: Sequential like gradient boosting, but fast with shallow weak learners such as stumps.
Accuracy: Often strong, though sensitive to noisy data and outliers.
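
A minimal sketch, assuming scikit-learn's AdaBoostClassifier with depth-1 trees ("stumps") as the weak learners; all values are placeholders.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Depth-1 trees ("stumps") are the classic weak learner for AdaBoost.
    # Note: the keyword is `base_estimator` in scikit-learn versions before 1.2.
    stump = DecisionTreeClassifier(max_depth=1)
    ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=42)
    ada.fit(X_train, y_train)
    print(ada.score(X_test, y_test))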

McNulty Project

The big event of week 6 at Metis was the completion of our 3rd project, McNulty. This project focused on the classification models we've covered over the past weeks. I again learned the hard way that real-world problems involving messy data gathered from multiple sources are not a good fit for a two-week learning project.

My topic, the factors that contribute to high schools successfully preparing kids for college, seemed like a straightforward enough project. Alas, it was not. Although the data I collected did not come together how I had hoped, I did manage to get some practice working with a variety of classification models.