A Beginner’s Guide to Packages and Modules in Scikit-Learn: Getting Started with Machine Learning
May 20, 2023
The Scikit-Learn (sklearn) library is a widely used machine learning library in Python. It provides a range of modules and packages for various tasks in machine learning. Here are some important modules and packages in Scikit-Learn and their uses:
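- sklearn.datasets: This module provides datasets for practicing and testing machine learning algorithms. It includes loaders for small bundled datasets such as iris and the 8x8 handwritten digits (load_iris, load_digits), fetchers for larger downloadable ones such as California housing (fetch_california_housing), and fetch_openml for datasets hosted on OpenML, such as MNIST. Note that the Boston housing loader (load_boston) was removed in version 1.2.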
- sklearn.preprocessing: This module provides functions for preprocessing and scaling data. It includes techniques like feature scaling, normalization, label encoding, one-hot encoding, etc.
- sklearn.model_selection: This module contains functions for model selection and validation. It provides utilities for dividing data into training and testing sets (train_test_split), cross-validation splitters and helpers such as cross_val_score, and hyperparameter tuning using grid search or random search (GridSearchCV, RandomizedSearchCV). The evaluation metrics themselves live in sklearn.metrics.
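- sklearn.feature_selection: This module offers tools for selecting a subset of the existing features. It includes techniques like variance thresholding, univariate selection (SelectKBest), recursive feature elimination (RFE), and model-based selection (SelectFromModel). Principal component analysis (PCA) is not part of this module; it builds new features rather than selecting existing ones and lives in sklearn.decomposition.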
- sklearn.linear_model: This module provides various linear models for regression, classification, and other tasks. It includes linear regression, logistic regression, ridge regression, Lasso, ElasticNet, and other linear models.
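- sklearn.tree: This module contains classes for single decision tree models. It includes DecisionTreeClassifier and DecisionTreeRegressor, plus helpers such as plot_tree and export_text for inspecting a fitted tree. Tree-based ensembles like random forests, gradient boosting, and AdaBoost are in sklearn.ensemble.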
- sklearn.cluster: This module provides clustering algorithms for unsupervised learning. It includes k-means clustering, DBSCAN, hierarchical clustering, and more.
- sklearn.metrics: This module includes a wide range of evaluation metrics for assessing the performance of machine learning models. It includes metrics for classification, regression, clustering, and ranking tasks.
- sklearn.pipeline: This module offers utilities for creating and managing machine learning pipelines. It allows you to chain multiple transformers and estimators together and simplify the workflow.
- sklearn.neural_network: This module includes classes for neural network-based models. It includes multi-layer perceptron (MLP) for classification and regression tasks.
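- sklearn.svm: This module contains support vector machine (SVM) algorithms for classification and regression tasks. Its main classes, SVC and SVR, accept a kernel parameter that selects a linear, polynomial, radial basis function (RBF), or sigmoid kernel. LinearSVC and LinearSVR are faster alternatives for the linear case.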
- sklearn.naive_bayes: This module provides implementations of Naive Bayes algorithms for classification tasks. It includes Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes.
- sklearn.ensemble: This module includes ensemble methods for combining multiple models. It includes popular ensemble techniques such as random forests, gradient boosting, and AdaBoost.
- sklearn.decomposition: This module provides functions for matrix decomposition and dimensionality reduction. It includes techniques like principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF).
- sklearn.manifold: This module provides algorithms for high-dimensional data visualization and dimensionality reduction. It includes techniques like t-SNE (t-Distributed Stochastic Neighbor Embedding) and Isomap.
- sklearn.impute: This module offers functions for imputing missing values in datasets. It includes techniques such as mean imputation, median imputation, and most frequent imputation.
- sklearn.neighbors: This module provides algorithms for nearest neighbor-based learning. It includes k-nearest neighbors (KNN) classification and regression, radius-based neighbors, and kernel density estimation.
- sklearn.compose: This module offers utilities for creating complex machine learning pipelines by combining multiple transformers and estimators. It includes functions like make_column_transformer and make_column_selector for column-wise transformations.
- sklearn.experimental: This module contains switches for experimental features that are still under development. Importing one of them, for example enable_iterative_imputer or enable_halving_search_cv, makes the corresponding estimator available; its API may change between releases without the usual deprecation cycle.
- sklearn.inspection: This module provides functions for model inspection and interpretation. It includes tools for permutation feature importance, partial dependence plots, and decision boundary displays. Learning curves are not here; they are provided by sklearn.model_selection.
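- sklearn.externals: This module holds third-party code bundled with scikit-learn and is not meant to be used as a public API. Older tutorials import joblib from sklearn.externals for serialization, but that import was deprecated in version 0.21 and removed in 0.23. Install joblib and use import joblib directly instead.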
- sklearn.calibration: This module contains functions for probability calibration of classifier outputs. It includes techniques such as Platt scaling and isotonic regression.
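- sklearn.utils: This module offers utility functions used throughout scikit-learn. It includes helpers for input validation (check_array, check_X_y), shuffling and resampling (shuffle, resample), class-weight computation, and more. Model persistence is not part of this module; it is usually handled with joblib or pickle.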
- sklearn.semi_supervised: This module provides algorithms for semi-supervised learning, where the training data includes both labeled and unlabeled samples. It includes techniques like label propagation and self-training.
- sklearn.metrics.pairwise: This submodule includes functions for pairwise distance, similarity, and kernel computations between sets of samples. It covers measures like Euclidean distance and cosine similarity, as well as kernels such as the RBF and polynomial kernels.
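To see how several of the modules above fit together, here is a minimal sketch of a typical workflow. It is only an illustration under stated assumptions: the iris dataset, the 25% test split, logistic regression as the model, and random_state=0 are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# sklearn.datasets: load a small bundled dataset as NumPy arrays.
X, y = load_iris(return_X_y=True)

# sklearn.model_selection: hold out part of the data for testing.
# test_size and random_state are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# sklearn.pipeline + sklearn.preprocessing + sklearn.linear_model:
# chain feature scaling and a classifier into a single estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# sklearn.metrics: score predictions on the held-out test set.
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# sklearn.model_selection: 5-fold cross-validation on the full dataset.
# The pipeline is refit on each fold, so scaling never sees validation rows.
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Mean 5-fold CV accuracy: {cv_scores.mean():.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the same preprocessing is applied at fit and predict time, and it keeps cross-validation free of leakage from the validation folds.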
Conclusion
In this article, I gave an overview of just a few of the modules and packages available in the robust Python machine learning library, Scikit-Learn. Check out the documentation for details.
Follow me on Medium for a first-hand account of my amazing machine learning journey.