Building a Simple Linear Regression Model in Python

Lukman Aliyu
3 min read · Jan 20, 2024

Introduction

Data science extracts meaningful insights from large and varied datasets by blending statistical analysis, algorithm development, and technology. This tutorial focuses on one of the fundamentals of machine learning: building a simple linear regression model in Python.

Understanding Linear Regression

Linear regression is a starting point in machine learning for many data scientists. It’s a statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to find a linear function that predicts the dependent variable values as accurately as possible.
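
For a single independent variable, the "most accurate" line is usually taken to be the one that minimizes the sum of squared residuals, and its slope and intercept have a closed form. The sketch below illustrates that idea with NumPy; the function name fit_simple_ols and the toy data are hypothetical, chosen only for illustration, and this is not necessarily how scikit-learn computes the fit internally.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (slope, intercept) of the least-squares line for 1D data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariance of x and y divided by the variance of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The least-squares line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical toy data lying exactly on y = 1 + 2x
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_simple_ols(x_toy, y_toy)
print(slope, intercept)  # 2.0 1.0
```

Because the toy points lie exactly on a line, the function recovers that line's slope and intercept; with noisy data it returns the best compromise in the squared-error sense.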

Setting Up the Environment

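To start, ensure that Python is installed on your system. We will use pandas for data manipulation, NumPy for numerical operations, Matplotlib for visualization, and scikit-learn for building the model. You can install these libraries using pip (the leading "!" runs the command from a Jupyter notebook cell; drop it when working in a terminal):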

!pip install numpy pandas matplotlib scikit-learn

Step 1: Importing the Libraries

We begin by importing the necessary libraries. Pandas will help us read and manipulate the data, NumPy will assist in numerical calculations, Matplotlib will be used for plotting, and sklearn provides tools for building the regression model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Step 2: Loading and Understanding the Data

With real data you would typically load a file, for example a dataset of housing prices and their attributes, and predict price from those features. To keep this tutorial self-contained, we instead generate a synthetic dataset with a single independent variable X and a known relationship, y = 2 + 0.3X plus random noise, so we can later check how close the model gets to the true values.

# Sample data loading
# data = pd.read_csv('housing.csv')
# For this tutorial, let's create a simple synthetic dataset
np.random.seed(0)
X = 2.5 * np.random.randn(1000) + 1.5 # Array of 1000 values with mean = 1.5, stddev = 2.5
res = 0.5 * np.random.randn(1000) # Generate 1000 residual terms
y = 2 + 0.3 * X + res # Actual values of Y

# Create a pandas dataframe
df = pd.DataFrame({'X': X, 'y': y})
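
Before modeling, it usually helps to take a quick look at the data. The following sketch rebuilds the same synthetic dataframe so it can run on its own, then prints its shape, summary statistics, and the correlation between X and y. The specific checks are only a suggestion, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Recreate the tutorial's synthetic dataset so this block is self-contained
np.random.seed(0)
X = 2.5 * np.random.randn(1000) + 1.5
res = 0.5 * np.random.randn(1000)
y = 2 + 0.3 * X + res
df = pd.DataFrame({'X': X, 'y': y})

print(df.shape)       # number of rows and columns
print(df.head())      # first five rows
print(df.describe())  # count, mean, std, min, quartiles, max per column

# Pearson correlation between the feature and the target
correlation = df['X'].corr(df['y'])
print('Correlation between X and y:', correlation)
```

A strong positive correlation is a hint that a straight line is a reasonable model for this data.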

Step 3: Data Splitting

Before building the model, split the dataset into training and testing sets. Holding out a test set (here 20% of the rows) lets us measure performance on data the model has not seen during training.

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(df['X'], df['y'], test_size=0.2, random_state=0)
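
As a sanity check on what test_size=0.2 does, the sketch below splits a hypothetical array of 1000 integers (standing in for the tutorial's rows) and confirms the sizes and that no value ends up in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for 1000 rows of data
values = np.arange(1000)

# Same settings as the tutorial: 20% held out, fixed seed for reproducibility
train, test = train_test_split(values, test_size=0.2, random_state=0)

print(len(train), len(test))  # 800 200

# The two sets should not share any element
overlap = set(train) & set(test)
print('Overlapping items:', len(overlap))  # Overlapping items: 0
```

Fixing random_state makes the shuffle repeatable, so the same rows land in the test set on every run.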

Step 4: Building the Linear Regression Model

Now, we use sklearn to build our linear regression model. We fit the model on our training data and then use it to make predictions.

# Model initialization
regression_model = LinearRegression()

# Fit the model on the training data
# (scikit-learn expects a 2D feature array, hence reshape(-1, 1))
regression_model.fit(X_train.values.reshape(-1,1), y_train.values)

# Predict
y_predicted = regression_model.predict(X_test.values.reshape(-1,1))
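
Once fitted, the model can also score inputs it has never seen. The sketch below predicts y for a few hypothetical X values and checks that the result matches intercept + slope * X computed by hand. For brevity it fits on the full synthetic dataset rather than on the training split, so its coefficients will differ slightly from the tutorial's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the tutorial's synthetic data
np.random.seed(0)
X = 2.5 * np.random.randn(1000) + 1.5
y = 2 + 0.3 * X + 0.5 * np.random.randn(1000)

model = LinearRegression()
model.fit(X.reshape(-1, 1), y)  # features must be 2D: (n_samples, 1)

# Hypothetical new inputs, chosen only for illustration
new_X = np.array([0.0, 1.5, 4.0])
new_pred = model.predict(new_X.reshape(-1, 1))

# A linear model's prediction is just intercept + slope * x
manual = model.intercept_ + model.coef_[0] * new_X

print('Predictions:', new_pred)
print('Matches manual formula:', np.allclose(new_pred, manual))
```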

Step 5: Evaluating the Model

After training the model, it is essential to evaluate its performance. We will use the mean squared error (MSE), a common metric for regression models: the average of the squared differences between the actual and predicted values. We will also plot the fitted line over the test data to see how well it fits.

# Model evaluation
mse = mean_squared_error(y_test, y_predicted)

print('Slope:', regression_model.coef_)
print('Intercept:', regression_model.intercept_)
print('Mean squared error:', mse)

# Plotting
plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, y_predicted, color='red', linewidth=2)
plt.show()

Output:

Slope: [0.29511109]
Intercept: 2.0163897624979756
Mean squared error: 0.2449976720185338

Figure 1: Regression Plot for the Model

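The estimated slope (about 0.295) and intercept (about 2.016) are close to the true values of 0.3 and 2 used to generate the data. The mean squared error (about 0.245) is close to the variance of the noise we added (0.5² = 0.25), which is roughly the best a model can achieve on this data. The plot likewise shows the line running through the middle of the test points. The model can therefore predict y for new x values drawn from a similar range.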
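
MSE is expressed in squared units of y, so it is often reported alongside two related metrics: the root mean squared error (RMSE), which is in the same units as y, and R², the share of the variance in y that the model explains. The sketch below repeats the tutorial's pipeline and computes both. These extra metrics are an addition for illustration and were not part of the original evaluation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Recreate the tutorial's synthetic data and split
np.random.seed(0)
X = 2.5 * np.random.randn(1000) + 1.5
y = 2 + 0.3 * X + 0.5 * np.random.randn(1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train.reshape(-1, 1), y_train)
y_predicted = model.predict(X_test.reshape(-1, 1))

mse = mean_squared_error(y_test, y_predicted)
rmse = np.sqrt(mse)                 # typical prediction error, in units of y
r2 = r2_score(y_test, y_predicted)  # 1.0 is a perfect fit, 0.0 matches predicting the mean

print('RMSE:', rmse)
print('R^2:', r2)
```

An RMSE near 0.5 would match the standard deviation of the noise that was added when generating y.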

Conclusion

Linear regression is a powerful yet simple tool for predictive modeling. In this tutorial, we walked through the steps of building and evaluating a linear regression model using Python’s sklearn library. As you delve deeper into data science, you’ll encounter more complex models and techniques, but the foundational understanding of linear regression will always come in handy. Happy coding and exploring the world of data science!

Lukman Aliyu

Pharmacist enthusiastic about Data Science/AI/ML| Fellow, Arewa Data Science Academy