Understanding Linear Regression: Predicting Continuous Output Values Based on Input Variables

Lukman Aliyu
3 min read · May 21, 2023

Linear regression is a popular statistical and machine learning technique used to predict continuous numerical values from a set of input variables. It is a simple yet powerful algorithm that gives a clear, interpretable picture of the relationship between input and output variables. This article explores the fundamentals of linear regression and its practical application for making predictions.

At its core, linear regression aims to discover the best-fitting line that represents the relationship between an input variable (x) and an output variable (y). The model function, f_w,b(x) = wx + b, consists of a weight or slope (w), a y-intercept (b), and an input feature (x), such as the population of a city. For readers who remember coordinate geometry, this is simply the equation of a straight line: y = mx + c.
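
To make this concrete, here is a minimal Python sketch of the model function (the variable names and numbers are illustrative, not taken from a real dataset):

```python
def predict(x, w, b):
    """Linear model: f_w,b(x) = w * x + b."""
    return w * x + b

# With an assumed slope w = 2.0 and intercept b = 1.0, an input of 3.5
# (say, a city population in tens of thousands) predicts 8.0.
print(predict(3.5, w=2.0, b=1.0))  # 8.0
```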

The objective of training a linear regression model is to determine the optimal values for the parameters (w, b) that minimize the cost function J(w, b). This cost function quantifies the error or difference between predicted and actual values in the training dataset. The most suitable (w, b) values are those that yield the smallest cost J(w, b), effectively fitting the data.
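
The article does not pin down a specific cost, but the standard choice for linear regression is the squared-error cost; here is a sketch assuming that convention (the factor of 1/2 is there only to simplify the gradient later):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost: J(w, b) = (1 / (2m)) * sum((f_w,b(x_i) - y_i)^2)."""
    m = x.shape[0]
    predictions = w * x + b      # f_w,b(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)
```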

To obtain the (w, b) values that minimize the cost J(w, b), gradient descent is employed. Gradient descent is an iterative optimization algorithm that adjusts the parameters (w, b) by taking steps proportional to the negative gradient of the cost function. With each iteration, the parameters move closer to the optimal values that minimize the cost.
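
For the squared-error cost above, the gradients have a simple closed form. A sketch of the gradient computation and one update step, where alpha is the learning rate (a hyperparameter you choose):

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of the squared-error cost J(w, b)."""
    m = x.shape[0]
    error = (w * x + b) - y          # prediction minus actual, per example
    dj_dw = np.sum(error * x) / m    # dJ/dw
    dj_db = np.sum(error) / m        # dJ/db
    return dj_dw, dj_db

def update_step(x, y, w, b, alpha):
    """One gradient descent step: move (w, b) against the gradient."""
    dj_dw, dj_db = compute_gradients(x, y, w, b)
    return w - alpha * dj_dw, b - alpha * dj_db
```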

During training, the algorithm computes predicted values f_w,b(x) for the input features (x) using the current (w, b) values. It then compares these predictions to the actual values (y) in the training dataset and updates the parameters (w, b) to reduce the cost J(w, b).
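
Putting the pieces together, a sketch of the full training loop, reusing the update_step helper from the previous snippet (a fixed iteration count is used here for simplicity; convergence checks are a common refinement):

```python
def gradient_descent(x, y, w, b, alpha=0.01, num_iters=1000):
    """Repeat the update step until (w, b) settle near the cost minimum."""
    for _ in range(num_iters):
        w, b = update_step(x, y, w, b, alpha)
    return w, b
```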

By iteratively updating the parameters with gradient descent, the algorithm gradually converges to the best-fitting line that minimizes the cost function. Once training is complete, the linear regression model can make predictions on new, unseen data.

The trained linear regression model takes an input feature (x) (e.g., city population) and calculates the corresponding predicted output f_w,b(x) (e.g., monthly profit for a restaurant in that city). The model uses the learned values of (w, b) to generate these predictions based on the linear relationship established during training.
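
As an end-to-end illustration with made-up numbers (populations and profits in tens of thousands; none of these figures come from real data), using the helpers sketched above:

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # city populations
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])   # monthly profits

w, b = gradient_descent(x_train, y_train, w=0.0, b=0.0,
                        alpha=0.01, num_iters=10_000)
print(f"Learned w = {w:.3f}, b = {b:.3f}")

# Predicted monthly profit for a city of 60,000 people:
print(f"Prediction for x = 6.0: {predict(6.0, w, b):.3f}")
```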

It’s worth noting that linear regression assumes a linear relationship between the input and output variables. If the relationship is nonlinear, the model may not provide accurate predictions. In such cases, advanced techniques like polynomial regression or other non-linear models should be considered. Additionally, four crucial conditions should be satisfied for the accurate fitting of the linear regression model:

1. Linearity: The relationship between the input variable (x) and the output variable (y) should be linear, which shows up in a scatter plot as points falling roughly along a straight line.

2. Normality: The residuals (the differences between actual values (y) and predicted values f_w,b(x)) should follow a normal distribution, which allows for reliable statistical inference.

3. Homoscedasticity: The variance of the residuals should be constant across all levels of the input variable, meaning the spread of residuals should remain consistent. Scatter plots of residuals against predicted values aid in identifying deviations from homoscedasticity.

4. Independence: The observations used to train the model should be independent of one another; no observation should influence any other. Violating independence can result in biased or inefficient parameter estimates.

It is vital to verify these conditions before relying on the results of a linear regression model. Failure to meet these conditions can lead to inaccurate predictions and unreliable inferences.

In practice, exploratory data analysis techniques like scatter plots, residual plots, and statistical tests are utilized to assess linearity, normality, homoscedasticity, and independence assumptions. If these assumptions are not met, alternative modeling approaches such as nonlinear regression or variable transformations may be more appropriate.
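
A sketch of two such checks, assuming SciPy and Matplotlib are available and continuing with the toy data above (the Shapiro-Wilk test is one common normality test; alternatives such as the Anderson-Darling test work too):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Residuals: actual values minus the model's predictions.
predictions = predict(x_train, w, b)
residuals = y_train - predictions

# Normality check: the Shapiro-Wilk null hypothesis is that the residuals
# are normally distributed; a small p-value casts doubt on that. With only
# five points the test has little power; it is shown purely as a template.
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Linearity and homoscedasticity check: residuals plotted against
# predicted values should scatter randomly with a constant spread.
plt.scatter(predictions, residuals)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual plot")
plt.show()
```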

In conclusion, linear regression offers a fundamental understanding of the relationship between continuous input and output variables. By finding the best-fit line that minimizes the cost function, it enables predictions on new input data. Despite assuming a linear relationship, linear regression remains a valuable and widely used technique across fields such as finance, economics, and the social sciences.

Lukman Aliyu

Pharmacist enthusiastic about Data Science/AI/ML | Fellow, Arewa Data Science Academy