In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

Practical definition in ML

Given a dataset, we want to predict a numeric (continuous) value. One or several variables of the dataset (the features) are used to predict a numerical outcome (the target), which is usually another column in the data. The essence of the method is to find the line that minimizes the residuals, i.e. the distances between the observed values and the regression line; ordinary least squares minimizes the sum of their squares.
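As a rough illustration of fitting a line by minimizing the distances to it, here is a minimal sketch of simple (one-feature) linear regression using the closed-form least-squares solution. The function name fit_simple_linear_regression and the toy data are assumptions made for this example, and it uses only the Python standard library rather than scikit-learn.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Co-variation of x and y, and variation of x, around their means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical toy data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(xs, ys)

# Residual = observed value minus the value predicted by the fitted line
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

Because the toy points sit exactly on a line, the residuals here come out as zero; with real data they would not, and the fitted line is the one that makes their squared sum as small as possible.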

Linear Regression Metrics

  • Mean Absolute Error (MAE). It’s the easiest to understand: it’s the average of the absolute errors.
  • Mean Squared Error (MSE). More popular than MAE, because squaring the errors “punishes” larger errors more heavily, which is useful when big misses are especially costly.
  • Root Mean Squared Error (RMSE). Even more popular than MSE, because RMSE is expressed in the same units as “y” and is therefore easier to interpret.

In machine learning, all of these metrics can serve as loss functions: each measures the gap between the real and the predicted values, so we want to minimize them.
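The three metrics can be sketched in a few lines of plain Python. The function names mae, mse and rmse and the sample values are assumptions for illustration; scikit-learn ships its own implementations in sklearn.metrics.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of MSE, in the units of y."""
    return sqrt(mse(y_true, y_pred))


# Hypothetical actual and predicted values; the last prediction is off by 4
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]

print(mae(y_true, y_pred))   # 1.5
print(mse(y_true, y_pred))   # 4.5
print(rmse(y_true, y_pred))  # about 2.12
```

The single large miss contributes 4 of the 6 units of total absolute error, but 16 of the 18 units of total squared error, which is the sense in which MSE and RMSE weigh large errors more heavily than MAE.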

Classification vs Regression

Classification is when you assign the result to a group (class) based on the input features. Regression is when you estimate a number. Either way, it’s good to check the distribution of the label during EDA: the spread of the numeric values for regression, or the count per class for classification.
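A quick look at the label during EDA might resemble the sketch below, which uses only the standard library. The prices and labels lists are hypothetical data made up for this example; in practice the same checks are often done with pandas.

```python
from collections import Counter
from statistics import mean, median, stdev

# Regression: the label is a number, so summarize its spread.
# Hypothetical target values with one obvious outlier.
prices = [120.0, 135.5, 150.0, 149.0, 410.0]
summary = {
    "min": min(prices),
    "median": median(prices),
    "mean": mean(prices),
    "max": max(prices),
    "stdev": stdev(prices),
}

# Classification: the label is a category, so count each class.
# Hypothetical class labels.
labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
class_counts = Counter(labels)

print(summary)
print(class_counts.most_common())
```

A mean far above the median, as in this toy regression label, hints at a skewed target or outliers; very uneven class counts hint at an imbalanced classification problem.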


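from sklearn.linear_model import LinearRegression

# X_train, y_train, X_test are assumed to come from an earlier train/test split
lm = LinearRegression()
lm.fit(X_train, y_train)  # learn the coefficients from the training data

predictions = lm.predict(X_test)  # predicted values for the unseen test features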