How to implement and select the best Linear Regression Model
Example with R and Python code
Contents
1. About Linear Regression
2. Implementation
3. R² and Adjusted-R²
4. Residual Standard Error
5. Bland-Altman Statistics and Plot
6. Akaike Information Criterion (AIC)
7. Bayesian Information Criterion (BIC)
8. Correlation Coefficient (CC)
9. Z-score
10. Model Choice
0 Loading Packages and Libraries
Before we begin, and assuming you already have the necessary packages and libraries installed, run the following code so that all functions in the following sections work:
R packages:
library(ggplot2)
library(hrbrthemes)
library(tidyverse)
library(blandr)
library(readr)
Python libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import linspace
from sklearn.linear_model import LinearRegression
from sklearn import datasets, linear_model
from sklearn.metrics import *
import statistics
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import gaussian_kde
You can download the data sample used in this tutorial, as well as the R and Python code, from GitHub.
1 About Linear Regression
Linear regression is a mathematical model in the form of a line equation:
y = b + a1x1 + a2x2 + a3x3 + …
where y is the dependent variable, and x1, x2, x3 are the independent variables. As we know from pre-calculus, b is the intercept with the y-axis and a1, a2, a3 are the values that set the slope of the line. In practice, y is the variable we want to predict, x1, x2, x3… are the predictor variables, and b is the value of y when all x (or all a) are equal to zero (however, this value does not always have a real meaning outside the mathematical expression). Linear regression models can be used in the life sciences, for example to predict hip circumference (y) based on weight, height and waist circumference (x1, x2 and x3). The values of a1, a2 and a3 tell us how much each independent variable affects hip circumference: each coefficient is the expected change in y for a one-unit increase in the corresponding x, with the other predictors held constant. A scatter plot with the 3 predictor variables (x-axis) and the predicted variable (y-axis) is shown in Figure 1.
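To make the roles of b and a1, a2, a3 concrete, here is a minimal Python sketch that builds y from known coefficients and then recovers them with ordinary least squares. The data are synthetic and noise-free, generated only for illustration (they are not the data sample used in this tutorial), and the names true_b, true_a and predict are hypothetical helpers, not part of any library.

```python
import numpy as np

# Synthetic predictors (an assumption for illustration, not the tutorial's data):
# columns play the role of x1 = weight, x2 = height, x3 = waist circumference.
rng = np.random.default_rng(0)
X = rng.uniform(low=[50, 150, 60], high=[100, 190, 110], size=(50, 3))

# Hypothetical "true" intercept b and slopes a1, a2, a3.
true_b = 20.0
true_a = np.array([0.5, 0.1, 0.4])

# y = b + a1*x1 + a2*x2 + a3*x3 (no noise, so the fit should be exact).
y = true_b + X @ true_a

# Prepend a column of ones so the intercept is estimated with the slopes.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b_hat, a_hat = coef[0], coef[1:]


def predict(x_new):
    """Return the fitted y for one observation [x1, x2, x3]."""
    return b_hat + np.asarray(x_new) @ a_hat


print("intercept b:", b_hat)
print("slopes a1, a2, a3:", a_hat)
print("prediction when all x are zero:", predict([0, 0, 0]))
```

Because the data contain no noise, the estimated intercept and slopes should match the values used to generate y, and the prediction for x1 = x2 = x3 = 0 is simply the intercept. With real measurements the estimates will only approximate the underlying relationship.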