The Metis Experience

All about my experience at the Metis Data Science Bootcamp

Onsite in Chicago, Weeks 7-9

In Week 7 we dove into unsupervised learning, covering clustering algorithms and natural language processing methods such as K-means, Word2Vec, and LDA. We also tackled different dimensionality reduction strategies to combat the dreaded 'Curse of Dimensionality'. In Week 8 we covered Neural Nets and Recommendation Systems.

I again put together a 'cheat sheet' of sorts for a handful of the main clustering algorithms we discussed. See below.

Clustering Models

K-Means

Key Concept:

  • Clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares.
  • A good way to determine the number of clusters is to look at silhouette scores.
  • The seeding method can be random or k-means++ (k-means++ generally gives better results).

Assumptions:

Clusters are spherical in shape.

Steps:

  • Choose k initial centroids (randomly or with k-means++ seeding).
  • Assign each sample to its nearest centroid.
  • Recompute each centroid as the mean of the samples assigned to it.
  • Repeat the assignment and update steps until the assignments no longer change (or a maximum number of iterations is reached).

Hyperparameters:

  • Number of clusters (k).
  • Seeding/initialization method (random or k-means++).

Computation Time:

Computation time can be reduced with mini-batch K-Means (e.g. scikit-learn's MiniBatchKMeans).


Model Characteristics:

Pros:

  • Good at well-separated, spherical clusters.
  • Scales well to a large number of samples.

Cons:

  • Requires the number of clusters to be specified.
  • Sensitive to outliers.
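
As a quick reference, here is a minimal K-Means sketch in scikit-learn, assuming scikit-learn is installed; the blob data and the range of k values are made up purely for illustration:

    # K-Means sketch: fit for several k values and compare silhouette scores.
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Synthetic, well-separated blobs just for illustration.
    X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit_predict(X)
        print(k, silhouette_score(X, labels))  # higher silhouette score = better separated clusters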

Agglomerative

Key Concept:

  • General family of clustering algorithms that build nested clusters by merging or splitting them successively.
  • Two types: divisive (top-down) and agglomerative (bottom-up).
  • Have to choose a linkage method: single (distance between the most similar points), average (average distance between all pairs of points), complete (distance between the most dissimilar points), ward (minimize the increase in inertia).
  • Can visualize with dendrograms.

Assumptions:

No particular cluster shape is assumed; results depend on the chosen distance metric and linkage method.

Steps:

  • Start with every sample as its own cluster.
  • Merge the two closest clusters according to the chosen linkage method.
  • Repeat until the stopping condition (desired number of clusters or maximum linkage distance) is reached.

Hyperparameters:

  • Linkage method (single, average, complete, or ward).
  • Stopping condition: number of clusters or maximum linkage distance.

Computation Time:

Can scale to a large number of samples when used jointly with a connectivity matrix, but is computationally expensive when no connectivity constraints are added between samples, since it considers all possible merges at each step.


Model Characteristics:

Pros:

  • Does not assume a spherical cluster shape.
  • Can also include connectivity constraints.

Cons:

  • Must set a stopping condition: a desired number of clusters or a maximum linkage-distance threshold.
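
And a minimal sketch of agglomerative clustering with scikit-learn and SciPy, again on made-up data, using Ward linkage and a fixed number of clusters as the stopping condition:

    # Agglomerative clustering sketch: fit with a chosen linkage, then visualize the
    # bottom-up merge hierarchy with a dendrogram. The data is synthetic.
    from sklearn.datasets import make_blobs
    from sklearn.cluster import AgglomerativeClustering
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

    # Stopping condition: ask for 3 clusters using Ward linkage.
    labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

    # SciPy recomputes the hierarchy here so we can draw the dendrogram.
    dendrogram(linkage(X, method="ward"))
    plt.show()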

DBSCAN

Key Concept:

  • Density Based Spatial Clustering of Applications with Noise
  • Three types of points: core points, border points, noise points.

Assumptions:

Clusters are dense regions of points separated by regions of lower density.

Steps:

  • Select the hyperparameters: the radius (epsilon) and the minimum number of points (min_samples).
  • Label points as core, border, or noise.
  • Form a separate cluster for each core point or connected group of core points.
  • Assign each border point to the cluster of its corresponding core point.

Hyperparameters:

  • Radius of the neighborhood (epsilon / eps).
  • Minimum number of points to form a dense region (min_samples).

Computation Time:

Roughly O(n log n) on average when a spatial index is used for neighbor lookups, but can approach O(n²) in the worst case.


Model Characteristics:

Pros:

  • Does not assume a spherical cluster shape.
  • Can handle uneven cluster sizes.
  • Identifies noise/outliers rather than forcing every point into a cluster.

Cons:

  • Have to set the minimum points and radius.
  • Difficult to tune if there are large density differences inherent in the dataset.
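
A minimal DBSCAN sketch, again with scikit-learn on made-up data; the eps and min_samples values are illustrative, not recommendations:

    # DBSCAN sketch: eps is the neighborhood radius, min_samples the minimum points.
    from sklearn.datasets import make_moons
    from sklearn.cluster import DBSCAN

    # Two interleaving half-moons: a non-spherical shape K-Means would struggle with.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

    # Points labeled -1 are noise; everything else belongs to a discovered cluster.
    print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
    print("noise points:", (labels == -1).sum())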

Week 9 was Big Data week. Though we covered most of these topics only at an introductory depth, we were given the tools and context to easily tackle further study on our own. Many of us put these technologies to work in our final projects for further practice. Technologies we covered included Docker, Hadoop, Hive, and Spark.

Big Data Concepts

Docker

Key Concepts

Docker is an open source platform for packaging and running applications in lightweight containers. Docker makes it possible to 'containerize' an application with all of the needed libraries and dependencies, and ship it all out as one package. Docker is a bit like a virtual machine, but containers share the Linux kernel of the host they run on, so an application only needs to be shipped with things not already present on the host.

Applications

Docker is a tool designed to make it easier to create, deploy, and run applications in a reproducible fashion. The goal is to eliminate issues with different environments and software versions by using a fully reproducible environment.
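
As a small illustration, here is a sketch that starts a container from Python using the Docker SDK; this assumes the docker Python package is installed and a Docker daemon is running locally, and the image and command are arbitrary:

    # Run a throwaway container via the Docker SDK for Python (docker-py).
    import docker

    client = docker.from_env()  # connect to the local Docker daemon

    # Start a container from the official python image, run one command, then remove it.
    output = client.containers.run(
        "python:3.10",
        ["python", "-c", "print('hello from inside a container')"],
        remove=True,
    )
    print(output.decode())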


Hadoop

Key Concepts

Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

Applications

Hadoop services provide for data storage, data processing, data access, data governance, security, and operations. Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.
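
The classic first exercise is a MapReduce word count. The sketch below simulates the map and reduce phases locally in plain Python; on a real cluster the same logic would be split into mapper and reducer scripts and submitted via Hadoop Streaming:

    # Word count in the MapReduce style: a mapper emits (word, 1) pairs and a
    # reducer sums the counts per word. Simulated locally for illustration.
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.strip().split():
                yield word, 1

    def reducer(pairs):
        # Hadoop sorts mapper output by key before the reduce phase, so we sort here too.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    text = ["hadoop stores big data", "spark processes big data"]
    for word, count in reducer(mapper(text)):
        print(word, count)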


Hive

Key Concepts

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis.

Applications

Data analysts use Hive to query, summarize, explore and analyze data, then turn it into actionable business insight.
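
As one possible illustration, a Hive query can be issued from Python with a client library such as PyHive; the host, port, username, and page_views table below are placeholders, not a real setup:

    # Querying Hive from Python with PyHive (assumes a reachable HiveServer2 endpoint).
    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000, username="analyst")
    cursor = conn.cursor()

    # HiveQL looks like ordinary SQL; Hive turns it into jobs over data stored in Hadoop.
    cursor.execute("""
        SELECT country, COUNT(*) AS views
        FROM page_views
        GROUP BY country
        ORDER BY views DESC
        LIMIT 10
    """)
    for row in cursor.fetchall():
        print(row)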


Spark

Key Concepts

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. Spark adds in-memory compute for ETL, machine learning, and data science workloads to Hadoop.

Applications

Additional libraries, built atop the core, allow diverse workloads for streaming, SQL, and machine learning. Spark is designed for data science, and its high-level abstractions make data science easier. Spark's ability to cache datasets in memory greatly speeds up iterative data processing, making Spark an ideal processing engine for implementing iterative algorithms.
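
A small PySpark sketch of the caching idea; the HDFS path and column names are placeholders:

    # PySpark sketch: read a CSV, cache it in memory, and run two aggregations over
    # the cached data. The path and column names are illustrative placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
    df.cache()  # keep the dataset in memory so repeated passes don't re-read from disk

    df.groupBy("event_type").count().show()
    df.agg(F.avg("duration")).show()

    spark.stop()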