Course Description

Linear models, as their name implies, relate an outcome to a set of predictors of interest using linear assumptions. Regression models, a subset of linear models, are among the most important statistical analysis tools in a data scientist's toolkit. This course covers regression analysis, least squares, and inference using regression models. Special cases of the regression model, ANOVA and ANCOVA, will be covered as well. Analysis of residuals and variability will be investigated. The course will cover modern thinking on model selection and novel uses of regression models, including scatterplot smoothing.

This class has three main components:

  1. Least squares and linear regression
  2. Multivariable regression
  3. Generalized linear models

Articles

Articles are coming!

What I’ve learned

Week 1

Main Concepts: Simple linear regression, regression through the origin, regression to the mean

The start was good: I learned about notation, ordinary least squares, and regression through the origin. I also delved a bit into regression to the mean.
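As a small sketch of this week's formulas (on made-up numbers, not course data), ordinary least squares and regression through the origin can be computed directly:

```python
import numpy as np

# Made-up example data (not from the course)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Ordinary least squares for y = b0 + b1 * x:
# slope = Cor(x, y) * SD(y) / SD(x), intercept = mean(y) - slope * mean(x)
b1 = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
b0 = y.mean() - b1 * x.mean()

# Regression through the origin drops the intercept:
# slope = sum(x * y) / sum(x^2)
b_origin = np.sum(x * y) / np.sum(x**2)

print(b0, b1, b_origin)
```

Note that the through-the-origin slope differs from the OLS slope whenever the true intercept is not zero, which is why it is usually only applied after centering the data.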

Week 2

Main Concepts: Statistical linear models, residuals, regression inference

I learned the difference between statistical learning and machine learning: in statistical learning you focus more on statistical inference concepts, such as extending estimates to the population, and on explainability, whereas machine learning focuses on prediction only, which makes it more of a black box. In regression inference, you look at values such as residual variation and standard error, then compute a confidence interval and use it to extend your results to the population. The prerequisite for this, however, is that your variables have to be IID (independent and identically distributed) random variables.
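The inference quantities mentioned above can be sketched in a few lines on invented data. This uses the large-sample normal quantile 1.96 instead of the exact t quantile, purely to keep the sketch dependency-free:

```python
import numpy as np

# Invented data, not from the course
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Fit by least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Residual variation: sigma^2 = sum(e_i^2) / (n - 2)
residuals = y - (b0 + b1 * x)
sigma2 = np.sum(residuals**2) / (n - 2)

# Standard error of the slope
se_b1 = np.sqrt(sigma2 / np.sum((x - x.mean())**2))

# Approximate 95% confidence interval for the slope
# (1.96 is the normal quantile; the exact t quantile is slightly wider)
ci = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)
print(sigma2, se_b1, ci)
```

If the interval excludes zero, the slope is significant at roughly the 5% level, which is the bridge from a sample estimate to a population claim.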

Week 3

Main concepts: Dummy variables, interactions, residuals and diagnostics, model selection

This week taught me about multivariable regression, which uses many predictor variables to predict the outcome. I saw how adding variables means each of them adjusts for the effects of the others, and how diagnostics such as DFFITS, hat values, standardized residuals, Cook's distance, VIF, and F-statistics all contribute to choosing good predictor variables and a good model.
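Two of these diagnostics, hat values (leverage) and VIF, can be sketched with plain numpy on invented data. A handy sanity check: the hat values always sum to the number of model coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented design: intercept plus two deliberately correlated predictors
n = 50
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # correlated with x1
X = np.column_stack([np.ones(n), x1, x2])

# Hat values: diagonal of H = X (X'X)^-1 X'; they measure leverage,
# i.e. how much each observation pulls the fit toward itself
H = X @ np.linalg.inv(X.T @ X) @ X.T
hat_values = np.diag(H)

# Variance inflation factor for x2: regress x2 on the other
# predictors and use VIF = 1 / (1 - R^2)
Z = np.column_stack([np.ones(n), x1])
beta = np.linalg.lstsq(Z, x2, rcond=None)[0]
resid = x2 - Z @ beta
r2 = 1 - resid @ resid / np.sum((x2 - x2.mean())**2)
vif_x2 = 1 / (1 - r2)

print(hat_values.sum(), vif_x2)
```

A VIF well above 1 signals that a predictor is largely explained by the others, which inflates the variance of its coefficient estimate.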

Week 4

Main concepts: GLMs, logistic regression, Poisson regression, fitting functions to linear models

This week was pretty difficult, because I was introduced to a new concept, link functions, which I found really interesting. A link function is basically a way around the limitations of simple linear models: when you want to predict outcomes that are, for example, strictly positive or binary, you can use a link function (logit for binary outcomes, log for Poisson counts) to transform the mean of your predictions instead of the data itself. This was a really great idea and I'm glad I was introduced to it.
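The link-function idea can be sketched without any model fitting at all (invented coefficients, just to show the transformation): the linear predictor is unbounded, and the inverse link maps it into the outcome's valid range.

```python
import numpy as np

# Invented coefficients and predictor values
x = np.linspace(-3, 3, 7)
eta = -0.5 + 1.2 * x          # linear predictor: can be any real number

# Poisson regression uses a log link: mu = exp(eta) is always positive,
# so the model can never predict a negative count
mu_poisson = np.exp(eta)

# Logistic regression uses a logit link: the inverse logit squeezes
# eta into (0, 1), so the model always predicts a valid probability
mu_logistic = 1 / (1 + np.exp(-eta))

print(mu_poisson.min(), mu_logistic.min(), mu_logistic.max())
```

This is why GLMs stay "linear": the model is still a linear function of the predictors, only the scale on which the mean lives is transformed.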

Project

The project was pretty simple: all I had to do was quantify the difference in MPG between automatic and manual transmission cars. I used a boxplot to show the difference, fit a simple linear model to quantify it, and included confidence intervals for inference. I also used a stepwise selection method to choose the multivariable linear model that best predicts MPG. The conclusion was that manual transmission cars have better fuel efficiency than automatic cars. The hard part of this project was editing the report and making sure everything looked nice.
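The core of the simple-model step can be sketched as follows (on invented MPG numbers, not the actual project data): coding transmission as a 0/1 dummy variable makes the dummy's coefficient exactly the difference in group means.

```python
import numpy as np

# Invented MPG values for two transmission groups (not the project data)
mpg_auto = np.array([17.3, 15.2, 19.2, 17.8, 16.4])
mpg_manual = np.array([22.8, 21.4, 26.0, 30.4, 21.5])

# Code transmission as a dummy variable: 0 = automatic, 1 = manual
y = np.concatenate([mpg_auto, mpg_manual])
am = np.concatenate([np.zeros(len(mpg_auto)), np.ones(len(mpg_manual))])

# Fit y = b0 + b1 * am by least squares; b1 is exactly the
# difference in group means (manual minus automatic)
X = np.column_stack([np.ones(len(y)), am])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b1, mpg_manual.mean() - mpg_auto.mean())
```

The confidence interval around b1 then says whether the observed MPG gap could plausibly be zero, which is the inference part of the project.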

Key points

  - Assumptions of linear regression
  - Residual diagnostics terms
  - Guidelines for model selection
  - Three conditions for GLMs

Kaggle notebooks

Notes

Quiz

Book

This course is complemented by the book below.

Proof of completion

Certificate for 6th course

View it online