Course Description

One of the most common tasks performed by data scientists and data analysts is prediction and machine learning. This course will cover the basic components of building and applying prediction functions, with an emphasis on practical applications. The course will provide a basic grounding in concepts such as training and test sets, overfitting, and error rates. The course will also introduce a range of model-based and algorithmic machine learning methods including regression, classification trees, Naive Bayes, and random forests. The course will cover the complete process of building prediction functions, including data collection, feature creation, algorithms, and evaluation.

Course Content

Articles

Coming soon!

What I’ve learned

Week 1

The first week was great: a dive into what prediction is, its components, and how the question is always the most important thing, followed by selecting the data, features, and algorithm that best fit the problem. Prediction is effectively about trade-offs; you want the right balance between interpretability, accuracy, speed, simplicity, and scalability. Interpretability is an important issue now, especially with biases and discrimination, and so is scalability, as models are all moving to the cloud. I also learned about types of errors, in- and out-of-sample errors, and what ROC curves are (essentially a measure of the quality of a prediction algorithm, with P(FP) on the x-axis and P(TP) on the y-axis). Another crucial concept I learned is cross validation, the groundwork of many algorithms: the main idea is subsampling or bootstrapping the training data, splitting it into train and test sets multiple times.
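
The course does all of this in R, but the core idea of cross validation (repeatedly splitting the training data and averaging the held-out scores) can be sketched in a few lines of Python with scikit-learn. The dataset and model choices here are just illustrative assumptions, not from the course:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# center/scale then fit a simple classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# 5-fold cross validation: 5 train/test splits, one accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Averaging the fold scores gives a less optimistic estimate of out-of-sample error than a single split would.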

Week 2

I was introduced to the caret package. I wanted to learn about tidymodels since it's newer and flashier, but I thought it was still good to know caret. I also covered concepts like data slicing, training options, preprocessing, plotting predictors, covariate creation, and so on, all done through caret. Overall I found caret easy to use, but the math behind the algorithms difficult, which goes to show how anyone can do ML, but not everyone understands it. I will be doing my best to learn more with the ISLR book. I also got to apply caret to regression and multiple regression.
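
For readers who don't use R, the caret workflow of data slicing, preprocessing, and fitting a multiple regression maps closely onto scikit-learn. This is a minimal sketch on synthetic data (everything here is assumed for illustration, not taken from the course):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# data slicing: hold out 25% as a test set (roughly caret's createDataPartition)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# preprocessing (center/scale) plus multiple regression in one pipeline
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(rmse)
```

Keeping the preprocessing inside the pipeline means the test set is scaled with the training set's statistics, which avoids leaking test data into the fit.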

Week 3

This week was about tree-based and ensemble learning. Bagging and boosting are amazing; I'm stunned how old they are but still applicable today. The main idea of bagging is to reduce variance (overfitting), while boosting reduces bias (underfitting). Random forest is also really cool: it's a type of bagging algorithm with an extra step, where each tree also considers only a random subset of the features at each split, and this makes it faster than regular bagging and more accurate as well. Random forest is pretty much all you need when you want a classifier algorithm. I also got introduced to model-based prediction, aka generative learning. This includes Linear Discriminant Analysis and Naive Bayes; they're basically algorithms that learn how the data is generated by estimating P(x|y), and then use it to estimate P(y|x) via Bayes' rule. It's a lot of math, so I barely understand it, but I get the gist that it first assumes a distribution for the data and then applies that to the prediction (which is the fundamental idea of Bayesian statistics, I believe).
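
The feature-subsetting step that separates a random forest from plain bagging is exposed directly in scikit-learn's `max_features` parameter. A minimal sketch, using the Iris dataset purely as an illustrative stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features controls the random subset of features tried at each split;
# this per-split sampling is the step that distinguishes random forests
# from plain bagging (max_features=None would reduce to bagged trees)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

acc = rf.score(X_test, y_test)
print(acc)
```

Because each split only evaluates a handful of features, the trees are cheaper to grow and less correlated with each other, which is where the speed and variance-reduction benefits come from.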

Week 4

Regularized regression is the idea of making regular regression overfit less by penalizing coefficients that are too large. There are three types: LASSO, Ridge, and Elastic Net. Lasso can shrink coefficients to zero and is good for variable selection, Ridge makes coefficients smaller, and Elastic Net balances between small coefficients and variable selection. Next up was combining predictors, the idea behind ensemble learning: bagging, boosting, and the like combine similar classifiers (e.g. trees, then average them), whereas model stacking combines different models (an odd number, so there can be a majority vote). Generally, combining predictors reduces your RMSE, but the drawback is that it is computationally expensive. The lectures also briefly touched on forecasting using the quantmod package; I got to forecast the TSLA stock, and my test error was incredibly high because the stock is literally through the roof. Finally, unsupervised prediction, which is about first predicting labels from the data using an algorithm like K-means, and then using those labels to predict the outcome. This can be seen in recommendation systems, where what you interact with becomes a predictor for your preferences, which is why recommendations are specific to the person.
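
The lasso-vs-ridge contrast (exact zeros vs uniform shrinkage) is easy to see on synthetic data where only some features matter. A small sketch, with the penalty strengths chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# only the first two features actually drive the response
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can zero out coefficients
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks, never exactly zero

print(lasso.coef_)  # irrelevant features get shrunk exactly to zero
print(ridge.coef_)  # all coefficients shrunk toward, but not to, zero
```

This is why lasso doubles as a variable selection tool, while ridge keeps every predictor with a smaller weight; Elastic Net mixes the two penalties.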

Final Project

The final project was fun, but training the model definitely wasn't; my MacBook Air 2015 was super slow in training a random forest model. One thing I noticed: I tried using PCA as preprocessing for my rf model, but my accuracy decreased. I suspect that dropping variables also lost information the model needed, which is why that model performed worse (it predicted one classe variable wrong). I removed the preprocessing, trained on 7 k-folds, and got 100% accuracy. I would love to explore this project more, using SVMs perhaps, but from what I researched, random forest is pretty much the best in accuracy.
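
That PCA-hurts-accuracy observation is easy to reproduce in miniature. This sketch compares a random forest with and without PCA preprocessing under 7-fold cross validation; the dataset and the aggressive two-component PCA are illustrative assumptions, not the project's data:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_wine(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# 7-fold CV, as in the project: once on raw features, once after PCA
plain = cross_val_score(rf, X, y, cv=7).mean()
with_pca = cross_val_score(make_pipeline(PCA(n_components=2), rf), X, y, cv=7).mean()

print(plain, with_pca)
```

PCA keeps the directions of highest variance, which are not necessarily the most discriminative ones, so compressing too hard can throw away exactly the information the classifier needed.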

It was still pretty simple overall. I'm hoping to be able to do projects that are end-to-end, with multiple scripts, instead of just a notebook-type project that is useless by itself. I hope the final capstone project will give me what I want.

Concepts Cheatsheet

Process for prediction

population → probability and sampling to pick set of data → split into training and test set → build prediction function → predict for new data → evaluate

Components of predictor

question → input data → features (extracting variables/characteristics) → algorithm → parameters (estimate) → evaluation

Relative Order of importance

question (concrete/specific) > data (relevant) > features (properly extract) > algorithms

data, feature and algorithm selection

predictions are effectively about trade-offs between interpretability, accuracy, speed, simplicity and scalability

In vs out of sample error

Sample division guide

types of errors

Error measurements

Receiver Operating Characteristic Curves (ROC curve)

cross validation types

Decision Trees Measures of Impurity

CART

Random Forest

Boosting

Model based prediction (Generative Learning)

Model selection

Combining Predictors (model stacking)

Forecasting

Unsupervised prediction

Kaggle notebooks

Notes

quiz

Proof of completion

Certificate for first course

View it online