Exploring Breast Cancer Classification: Logistic Regression and Scikit-Learn in Action

May 28, 2023

Introduction

Machine learning has changed how we analyze data, making it practical to build models that predict and classify from examples. Logistic regression is one of the most popular and fundamental classification algorithms: it is widely used for binary classification tasks because it is simple, interpretable, and effective. In this article, I will walk through the principles of logistic regression and implement it on a Scikit-Learn dataset. Understanding how logistic regression works makes it easier to apply it to real-world problems and to judge when its predictions can be trusted.

Understanding Logistic Regression

Logistic regression is a binary classification approach that models the relationship between input features and the probability of a binary outcome. Unlike linear regression, which predicts continuous values, logistic regression estimates the probability that an instance belongs to a particular class. It computes a weighted sum of the input features and passes it through the logistic function (also known as the sigmoid function), which maps that sum to a probability between 0 and 1.
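
To make the sigmoid mapping concrete, here is a minimal sketch in NumPy. The weights, bias, and input values are made up purely for illustration; they are not taken from any trained model in this article.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single two-feature instance (illustrative values only)
weights = np.array([0.8, -1.2])
bias = 0.1
x = np.array([2.0, 1.0])

z = np.dot(weights, x) + bias              # linear score: 1.6 - 1.2 + 0.1 = 0.5
probability = sigmoid(z)                   # estimated probability of the positive class
predicted_class = int(probability >= 0.5)  # default decision threshold of 0.5

print(f"score={z:.2f}, probability={probability:.3f}, class={predicted_class}")
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class at the default threshold.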

The Scikit-Learn library provides a comprehensive implementation of logistic regression that offers a wealth of features and flexibility. By using a suitable dataset from Scikit-Learn, we can demonstrate the practical application of logistic regression.

Exploring the Breast Cancer Dataset

To showcase the power of logistic regression, I will use the widely used breast cancer dataset available in Scikit-Learn. This dataset contains features computed from digitized images of fine needle aspirates of breast masses. The task is to classify tumors as malignant or benign based on these features.

First, I will load the dataset and perform exploratory data analysis (EDA) to gain insights into the data. EDA involves examining the distribution of features, identifying missing values, and analyzing any relationships or patterns that may exist. By visualizing the data, we can better understand its characteristics and make informed decisions during the modeling process. The copy of this dataset that ships with Scikit-Learn is already clean, so it needs very little preparation.

# Import the required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd

# Load the breast cancer dataset as a feature matrix X and a label vector y
X, y = load_breast_cancer(return_X_y=True)
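
The EDA checks described above can be sketched as follows. This is one possible quick look rather than a full analysis; it assumes a Scikit-Learn version that supports the `as_frame=True` option, which returns the data as a pandas DataFrame.

```python
from sklearn.datasets import load_breast_cancer

# Load the same dataset as a DataFrame for easier inspection
data = load_breast_cancer(as_frame=True)
df = data.frame  # 30 feature columns plus a "target" column

# Size of the dataset: (rows, columns)
print(df.shape)

# Total number of missing values across all columns
missing_total = int(df.isna().sum().sum())
print(f"missing values: {missing_total}")

# Class balance; in this dataset label 0 is malignant and label 1 is benign
class_counts = df["target"].value_counts().sort_index()
print(dict(zip(data.target_names, class_counts.tolist())))

# Summary statistics for the first few features
print(df.describe().T[["mean", "std", "min", "max"]].head())
```

The summary statistics show that the features live on very different scales, which is the motivation for standardizing them before fitting the model.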

Pre-processing the Data

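Before training our logistic regression model, it is essential to preprocess the data. In general, this involves handling missing values, scaling numerical features, and encoding categorical variables. Additionally, splitting the dataset into training and testing subsets allows us to evaluate the model’s performance on unseen data. The breast cancer dataset in sklearn has no missing values and no categorical features, so the only steps needed here are the train/test split and feature scaling. Note that the scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training.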

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
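
As an alternative to scaling by hand, the same two steps can be bundled into a Scikit-Learn `Pipeline`. This is a sketch of a different way to organize the code, not the approach used in the rest of this article. The pipeline refits the scaler inside each cross-validation fold, which keeps validation data out of the scaling statistics. The choice of 5 folds is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and the classifier are fitted together as one estimator
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation on the training set only (fold count is illustrative)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"mean CV accuracy: {cv_scores.mean():.3f}")

# Fit on the full training set and score once on the held-out test set
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```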

Model Training and Evaluation

With the pre-processed dataset in hand, I can proceed to train the logistic regression model. Scikit-Learn provides a convenient interface for this: create the estimator, call fit on the training data, and call predict on new data.

To evaluate the trained model, I will employ various evaluation metrics such as accuracy, precision, recall, and F1 score; all these can be seen in the classification report from sklearn.metrics. These metrics provide insights into the model’s ability to correctly classify instances from both classes (benign and malignant).

# Create a logistic regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train_scaled, y_train)

# Make predictions on the test data, to test the model
y_pred = model.predict(X_test_scaled)

# Evaluate the model
# (store the report under a new name so the imported function is not overwritten)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f'accuracy score: {accuracy}')
print(f'classification report: \n{report}')

Output: 

accuracy score: 0.9736842105263158
classification report: 
                precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Visualizing the Results

# Rows are the true labels, columns are the predicted labels
cm = confusion_matrix(y_test, y_pred)
cm

# output

array([[41,  2],
       [ 1, 70]])

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()

Fig 1: Confusion matrix of the Logistic Regression Model

Interpreting the Results

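The model performed very well on the test set, with an accuracy score of 0.97 (97%): 111 of the 114 test predictions (whether benign or malignant) were correct. In addition, the precision, recall, and F1 scores were all 0.95 or higher for both classes. In this dataset, label 0 is malignant and label 1 is benign, so sklearn treats benign as the positive class. Read that way, the confusion matrix shows only 2 false positives (malignant tumors predicted as benign) and 1 false negative (a benign tumor predicted as malignant). The rest were all correct (41 true negatives and 70 true positives). In a diagnostic setting, the two malignant cases predicted as benign would usually be the more serious kind of error, so the per-class recall deserves as much attention as the overall accuracy. These metrics allow for evaluation of the model, and a decision can be made if the model is being considered for deployment. Beyond the metrics, logistic regression offers interpretability, allowing us to understand the contribution of each feature to the classification decision.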
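
The reported figures can be re-derived by hand from the confusion matrix shown earlier. The short sketch below does the arithmetic with label 1 as the positive class, which is how the second row of the classification report is computed.

```python
import numpy as np

# Confusion matrix from the article: rows are true labels, columns are predicted labels
cm = np.array([[41, 2],
               [1, 70]])

# For a 2x2 matrix with labels [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 111 / 114
precision = tp / (tp + fp)        # 70 / 72
recall = tp / (tp + fn)           # 70 / 71
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")
```

Rounded to two decimals, these values match the accuracy line and the class 1 row of the classification report.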

The interpretable nature of logistic regression allows for its use in domains that require a deep understanding of the decision-making process.
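
One way to look at that interpretability is to inspect the fitted coefficients. The sketch below retrains the same model as above and ranks features by the absolute size of their coefficients. Because the features were standardized, the magnitudes are roughly comparable, but many features in this dataset are strongly correlated, so I would treat the ranking as a rough guide rather than a definitive measure of importance.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Same preprocessing and model settings as in the article
scaler = StandardScaler()
model = LogisticRegression().fit(scaler.fit_transform(X_train), y_train)

# One coefficient per feature; positive values push predictions toward class 1 (benign)
coefficients = pd.Series(model.coef_[0], index=data.feature_names)

# Five features with the largest absolute coefficients
order = coefficients.abs().sort_values(ascending=False).index
top_features = coefficients.reindex(order).head(5)
print(top_features)
```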

Conclusion

In this article, I explored logistic regression as a powerful classification algorithm using a Scikit-Learn dataset. Logistic regression provides a simple yet effective approach for binary classification tasks, offering an interpretable and accurate model. With the Scikit-Learn library and a suitable dataset, logistic regression can be implemented in a few lines of code. Additionally, pre-processing the data properly and evaluating the model on held-out data lets us gauge how reliable and accurate the classification results are. Logistic regression continues to be a widely used algorithm in various domains, making it an essential tool for machine learning practitioners seeking to solve classification problems.