All about my experience at the Metis Data Science Bootcamp

Weeks four through six took us through all of the classification models and a whole grab bag of topics related to working in the cloud and web development.

We covered a majority of the main classification algorithms in use today. For each algorithm we took an in-depth look at its mathematical underpinnings and talked in detail about its strengths and weaknesses. A common review topic was comparing and contrasting when it would be appropriate to use one algorithm over another and how you would determine which performed best. I've created a summary 'cheat sheet' and shared it below, with a short code sketch after each algorithm's summary.
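
To give a flavor of that last point, here is a minimal sketch of one way to compare classifiers: cross-validated accuracy on a common dataset. The models and dataset are illustrative stand-ins, not the ones from our projects.

```python
# Compare a few classifiers by cross-validated accuracy.
# Dataset and model choices here are illustrative only.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression()),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```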

**K-Nearest Neighbors (KNN)**

**Key Concept:**

- Non-parametric model that can be used for both regression and classification.
- Looks for the k closest samples in the training set.
- Fits fast, predicts slow. Holds the entire training set in memory to make a prediction.

**Assumptions:**

Scale variant; data should be standardized.

**Steps:**

- Select k, the number of neighbors to use.
- To predict, take the majority class (classification) or the average value (regression) of the k nearest neighbors to the new point.

**Hyperparameters:**

k, the number of neighbors; the distance metric can also be tuned.

| Aspect | Notes |
| --- | --- |
| Interpretability | Highly interpretable; a prediction can be explained by pointing to the k nearest training samples. |
| Computation Time | Quick to fit, slow to predict. Requires holding all training data in memory, so it may run slow on very large datasets. |
| Accuracy | Varies |
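
Here's a minimal KNN sketch in scikit-learn. The dataset and the choice of k are illustrative; note the scaler in the pipeline, since KNN is scale variant.

```python
# Minimal KNN sketch; dataset and k are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardize first: KNN is scale variant.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)  # "fitting" mostly just stores the training set

# Prediction finds the 5 nearest stored samples and takes a majority vote.
print(knn.score(X_test, y_test))
```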

**Logistic Regression**

**Key Concept:**

- Models binomial or multinomial outcome distributions.
- Outputs a probability.
- A threshold must be set to convert probabilities into class labels.

**Assumptions:**

Scale variant; data should be standardized.

| Aspect | Notes |
| --- | --- |
| Interpretability | Highly interpretable; returns probabilities and coefficients. Loses some interpretability with multiple classes (one-vs-rest probabilities do not add up to one, unless using a multinomial model). |
| Computation Time | Linear, relatively fast, scales well. Multi-class is much slower. |
| Accuracy | Varies |
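
A minimal logistic regression sketch, showing the probability-plus-threshold idea from above. The dataset and the 0.5 threshold are illustrative choices.

```python
# Minimal logistic regression sketch; dataset and threshold are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# The model outputs probabilities; a threshold converts them to labels.
proba = model.predict_proba(X_test)[:, 1]
labels = (proba >= 0.5).astype(int)
print((labels == y_test).mean())
```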

**Support Vector Machines (SVM)**

**Key Concept:**

- Can be used for regression or classification.
- Non-parametric classification method.
- Goal is to maximize the margin.
- Uses a technique called the kernel trick to transform the data, then finds an optimal boundary between the possible outputs based on those transformations.

**Assumptions:**

Scale variant; data should be standardized. Makes no distributional assumptions about the data.

**Steps:**

- Identify the support vectors that define the hyperplane.
- Maximize the distance (margin) between the two groups.
- If using a polynomial or RBF kernel, the kernel trick implicitly transforms the data into a higher-dimensional space.

**Hyperparameters:**

- Kernel: linear, polynomial, RBF.
- C: regularization parameter (similar in role to lambda, but inverted: smaller C means stronger regularization).
- Gamma: kernel coefficient for the RBF and polynomial kernels.

| Aspect | Notes |
| --- | --- |
| Interpretability | Linear classification is easily interpretable. Regression and the other kernels less so; the complex data transformations and resulting boundary plane are very difficult to interpret. |
| Computation Time | Depends on the kernel. Linear is much faster than polynomial or RBF. |
| Accuracy | Varies |
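
A minimal SVM sketch. The RBF kernel and the values of C and gamma are illustrative; the data is standardized first.

```python
# Minimal SVM sketch; kernel, C, and gamma are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# C trades margin width against misclassification; gamma sets the RBF kernel width.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```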

**Naive Bayes**

**Key Concept:**

Applies Bayes' Theorem to each feature independently, then multiplies the resulting per-feature probabilities together.

**Assumptions:**

Assumes independent features.

| Aspect | Notes |
| --- | --- |
| Interpretability | Highly interpretable. Calculations consist of simple counting and multiplication. |
| Computation Time | Performs well. |
| Accuracy | Varies |
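
A minimal Naive Bayes sketch. Multinomial Naive Bayes on word counts is one common choice, which keeps the "counting and multiplication" flavor; the toy documents and labels below are made up for illustration.

```python
# Minimal Naive Bayes sketch; toy documents and labels are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting at noon", "win free prize", "lunch meeting today"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

# CountVectorizer turns text into word counts; MultinomialNB multiplies
# per-word probabilities under the independence assumption.
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(docs, labels)
print(nb.predict(["free lunch prize"]))
```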

**Decision Trees**

**Key Concept:**

- Can be used for regression or classification.
- Greedy algorithm: makes the locally best split at each node, without backtracking.
- Frequentist methodology.

**Assumptions:**

Non-parametric, scale invariant.

**Steps:**

- Select the attribute to split on in order of entropy (information gain).
- Create child nodes based on the split.
- Recurse until reaching a stopping point: all examples in a node share the same class, a minimum number of samples is reached, or the tree becomes too large.

**Hyperparameters:**

Split criterion: regression uses MSE; classification uses entropy or Gini impurity.

| Aspect | Notes |
| --- | --- |
| Interpretability | Highly interpretable. |
| Computation Time | Scales well (parallelizable, depending on implementation). |
| Accuracy | Prone to overfitting. |
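
A minimal decision tree sketch. The max_depth cap is an illustrative guard against the overfitting noted above, and export_text prints the fitted tree as readable if/else rules.

```python
# Minimal decision tree sketch; max_depth is an illustrative cap on tree size.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))

# One reason trees are so interpretable: the model is a set of if/else rules.
print(export_text(tree))
```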

**Random Forest**

**Key Concept:**

- Can be used for regression or classification.

**Assumptions:**

Non-parametric, scale invariant.

| Aspect | Notes |
| --- | --- |
| Computation Time | Scales well (highly parallelizable). |
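
A minimal random forest sketch. A random forest is an ensemble of decision trees, each trained on a bootstrapped sample of the data; n_estimators is illustrative, and n_jobs=-1 exploits the parallelizability noted above.

```python
# Minimal random forest sketch; n_estimators is an illustrative choice.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_jobs=-1 uses all cores: the individual trees train independently.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```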

**Gradient Boosting**

**Key Concept:**

- Gradient Boosting = Gradient Descent + Boosting.
- Can be used for regression or classification.
- Combines weak "learners" into a single strong learner in an iterative fashion.
- "Shortcomings" of the current model are identified by gradients.

**Assumptions:**

Non-parametric, scale invariant.

**Steps:**

- Build a first learner to predict the values/labels of the samples, and calculate the loss (the difference between the first learner's output and the real values).
- Build a second learner to predict the residuals left by the first step.
- Continue adding learners (third, fourth, ...) until a threshold or stopping condition is met.

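A minimal gradient boosting sketch following the steps above; n_estimators and learning_rate are illustrative choices.

```python
# Minimal gradient boosting sketch; n_estimators and learning_rate are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new tree fits the gradient of the loss, i.e. the current "shortcomings".
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```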

**AdaBoost**

**Key Concept:**

- A special case of Gradient Boosting.
- Can be used for regression or classification.
- Combines weak "learners" into a single strong learner in an iterative fashion.

**Steps:**

- Specify a set of weak learners (alternatively, a set of weak learners is randomly generated before the real learning process).
- Learn the weights for combining these learners into a strong learner. Each learner's weight depends on whether it predicts samples correctly; if a learner mispredicts a sample, that learner's weight is reduced.
- Repeat this process until convergence.

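A minimal AdaBoost sketch; n_estimators is illustrative. scikit-learn's default base learner here is a depth-1 decision tree (a "stump").

```python
# Minimal AdaBoost sketch; n_estimators is an illustrative choice.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Misclassified samples get more weight in the next round, and each learner
# is weighted by its accuracy when the votes are combined.
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```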

The big event of week 6 at Metis was the completion of our 3rd project, McNulty. This project focused on the different classification models we'd covered over the past weeks. I again learned the hard way that real-world problems involving gathering messy data from multiple sources are not appropriate for a two-week learning project.

My topic, what factors contribute to high schools successfully preparing kids for college, seemed like a straightforward enough project. Alas, it was not. Although the data I collected did not come together how I had hoped, I did manage to get some practice working with a variety of classification models.