scikit-learn Cheat Sheet

An essential guide to machine learning with scikit-learn, covering the fundamental workflow and key algorithms.


Core Machine Learning Workflow

| Action | Code Example | Description |
| --- | --- | --- |
| Split Data into Train/Test Sets | `from sklearn import model_selection`<br>`X_train, X_test, y_train, y_test = model_selection.train_test_split(X_all, y_all, train_size=0.5)` | Randomly partitions the data into training and testing subsets to evaluate model performance on unseen data. |
| Create and Train a Model | `from sklearn import linear_model`<br>`model = linear_model.LinearRegression()`<br>`model.fit(X_train, y_train)` | The standard workflow: instantiate a model class and then call the fit method with the training data. |
| Predict with a Trained Model | `y_pred = model.predict(X_test)` | Uses the trained model to make predictions on new data. |
| Evaluate Model Score (R²) | `model.score(X_test, y_test)` | For regression models, this returns the R-squared score, a measure of how well the model fits the data. |
| Get Model Coefficients | `model.coef_` | Accesses the fitted coefficients (parameters) of a linear model. |
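
The workflow steps above can be chained into one runnable script. The following is a minimal sketch, not a definitive recipe: the synthetic data from `make_regression` and the `random_state` values are assumptions added only so the example is self-contained and reproducible.

```python
from sklearn import linear_model, model_selection
from sklearn.datasets import make_regression

# Synthetic regression data (assumption: stands in for your own X_all, y_all)
X_all, y_all = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Hold out half of the data for evaluation, as in the table above
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_all, y_all, train_size=0.5, random_state=0
)

# Instantiate, then fit on the training split only
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# Predict and score on the held-out split
y_pred = model.predict(X_test)
r2 = model.score(X_test, y_test)

print("predictions:", y_pred.shape)
print("coefficients:", model.coef_.shape)
print("R^2 on test data:", round(r2, 3))
```

Scoring on `X_test` rather than `X_train` is what makes the R² value an estimate of performance on unseen data.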

Regularized Regression

| Action | Code Example | Description |
| --- | --- | --- |
| Lasso Regression (L1) | `model = linear_model.Lasso(alpha=1.0)` | A linear model with L1 regularization, which tends to produce sparse coefficients (many zeros). |
| Ridge Regression (L2) | `model = linear_model.Ridge(alpha=2.5)` | A linear model with L2 regularization, which penalizes large coefficient values. |
| Cross-Validated Lasso | `model = linear_model.LassoCV()`<br>`model.fit(X_all, y_all)` | A Lasso model that automatically finds the best alpha parameter using cross-validation. |
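
To make the difference between the two penalties concrete, the sketch below fits both models on the same data and counts coefficients that are exactly zero. The synthetic dataset (20 features, only 3 informative) is an assumption chosen for illustration; results on real data will differ.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import make_regression

# Assumption: synthetic data where most features carry no signal
X_all, y_all = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=5.0, random_state=0
)

lasso = linear_model.Lasso(alpha=1.0).fit(X_all, y_all)
ridge = linear_model.Ridge(alpha=2.5).fit(X_all, y_all)

# L1 can drive coefficients to exactly zero; L2 only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients (Lasso):", n_zero_lasso)
print("zero coefficients (Ridge):", n_zero_ridge)

# LassoCV picks alpha by cross-validation; the choice is stored in alpha_
lasso_cv = linear_model.LassoCV(cv=5).fit(X_all, y_all)
print("alpha chosen by LassoCV:", lasso_cv.alpha_)
```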

Classification Algorithms

| Action | Code Example | Description |
| --- | --- | --- |
| Support Vector Machine (SVM) | `from sklearn import svm`<br>`classifier = svm.SVC()`<br>`classifier.fit(X_train, y_train)` | An example of a classification model. Other classifiers like DecisionTreeClassifier and KNeighborsClassifier follow the same API. |
| Random Forest Classifier | `from sklearn.ensemble import RandomForestClassifier`<br>`rf = RandomForestClassifier(n_estimators=100)`<br>`rf.fit(X_train, y_train)` | Ensemble method using multiple decision trees. |
| Logistic Regression | `from sklearn.linear_model import LogisticRegression`<br>`lr = LogisticRegression()`<br>`lr.fit(X_train, y_train)` | Linear classification algorithm. |
| K-Nearest Neighbors | `from sklearn.neighbors import KNeighborsClassifier`<br>`knn = KNeighborsClassifier(n_neighbors=5)`<br>`knn.fit(X_train, y_train)` | Instance-based learning algorithm. |
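
Because all four classifiers share the same `fit` / `predict` / `score` API, they can be compared in a single loop. This is a rough sketch on synthetic data from `make_classification` (an assumption, as are the `random_state` values); it is not a benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

classifiers = {
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "LogisticRegression": LogisticRegression(),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # For classifiers, score() returns mean accuracy
    accuracies[name] = clf.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```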

Clustering Algorithms

| Action | Code Example | Description |
| --- | --- | --- |
| K-Means Clustering | `from sklearn import cluster`<br>`clustering = cluster.KMeans(n_clusters=3)`<br>`clustering.fit(X)` | An unsupervised learning algorithm that groups data into a specified number of clusters. |
| Hierarchical Clustering | `from sklearn.cluster import AgglomerativeClustering`<br>`hc = AgglomerativeClustering(n_clusters=3)`<br>`hc.fit(X)` | Builds a hierarchy of clusters using agglomerative methods. |
| DBSCAN | `from sklearn.cluster import DBSCAN`<br>`dbscan = DBSCAN(eps=0.3, min_samples=10)`<br>`dbscan.fit(X)` | Density-based clustering algorithm that can find arbitrarily shaped clusters. |
| Gaussian Mixture Model | `from sklearn.mixture import GaussianMixture`<br>`gmm = GaussianMixture(n_components=3)`<br>`gmm.fit(X)` | Probabilistic clustering using Gaussian distributions. |
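
After fitting, the cluster assignments are read differently depending on the estimator. The sketch below, using synthetic blobs (an assumption), shows where the labels live for three of the models above.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumption: three synthetic 2-D blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# KMeans: labels_ holds one cluster index per sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
kmeans_labels = kmeans.labels_
print("KMeans centers:", kmeans.cluster_centers_.shape)

# DBSCAN: labels_ uses -1 for points treated as noise
dbscan = DBSCAN(eps=0.3, min_samples=10).fit(X)
dbscan_labels = dbscan.labels_
print("DBSCAN noise points:", int(np.sum(dbscan_labels == -1)))

# GaussianMixture has no labels_ attribute; use predict / predict_proba
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_labels = gmm.predict(X)
gmm_probs = gmm.predict_proba(X)  # soft assignment, each row sums to 1
```

The number of clusters DBSCAN finds depends heavily on `eps` and `min_samples` relative to the data scale, so those values usually need tuning.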

Model Evaluation

| Action | Code Example | Description |
| --- | --- | --- |
| Classification Metrics | `from sklearn import metrics`<br>`metrics.confusion_matrix(y_test, y_pred)`<br>`print(metrics.classification_report(y_test, y_pred))` | Functions for assessing the performance of a classification model. |
| Cross-Validation Score | `from sklearn.model_selection import cross_val_score`<br>`scores = cross_val_score(model, X, y, cv=5)` | Evaluates the model with 5-fold cross-validation, returning one score per train/test split. |
| Grid Search (Hyperparameter Tuning) | `from sklearn import svm`<br>`from sklearn.model_selection import GridSearchCV`<br>`param_grid = {'C': [0.1, 1, 10]}`<br>`grid_search = GridSearchCV(svm.SVC(), param_grid, cv=5)`<br>`grid_search.fit(X_train, y_train)` | Systematically searches the grid for the best cross-validated hyperparameters. The grid keys must be parameters of the estimator (here `C` for `SVC`). |
| ROC Curve & AUC | `from sklearn.metrics import roc_curve, auc`<br>`y_proba = model.predict_proba(X_test)[:, 1]`<br>`fpr, tpr, thresholds = roc_curve(y_test, y_proba)`<br>`auc_score = auc(fpr, tpr)` | Evaluates binary classification performance from the predicted probabilities of the positive class. |
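
The evaluation calls above can be combined as follows. This is a sketch under stated assumptions: the data is synthetic, and `LogisticRegression` is used only because it provides `predict_proba`; any binary classifier with that method would work the same way.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()

# One accuracy score per fold; report mean and spread
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Fit once on the full training split, then evaluate on held-out data
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)  # rows: true, columns: predicted
print(cm)

# ROC/AUC needs scores or probabilities, not hard class predictions
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_proba)
auc_score = metrics.auc(fpr, tpr)
print("AUC:", round(auc_score, 3))
```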

Data Preprocessing

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalize to [0, 1] range
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
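
When a train/test split is involved, a common practice is to fit the scaler on the training data only and reuse its statistics on the test data. A small sketch with hand-made arrays (assumed values, chosen so the arithmetic is easy to check):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: tiny hand-made arrays for illustration
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

# Learn mean and std from the training data, then apply to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, no refit

# Same pattern for min-max scaling
minmax = MinMaxScaler()
X_train_minmax = minmax.fit_transform(X_train)
X_test_minmax = minmax.transform(X_test)

# Test values beyond the training range land outside [0, 1]
print(X_test_minmax)  # [[1.5 1.5]]
```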

Handling Missing Values

from sklearn.impute import SimpleImputer
import numpy as np

# Replace NaNs with mean value
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Replace NaNs with median
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
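
A small worked example makes the two strategies easy to compare. The array below is made up for illustration; the fill values are computed per column and stored in `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Assumption: small hand-made array with one NaN in each column
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
    [4.0, 9.0],
])

mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)
# Column means of the observed values: (1+7+4)/3 = 4 and (2+4+9)/3 = 5
print(mean_imputer.statistics_)  # [4. 5.]

median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)
# Column medians of the observed values: 4 and 4
print(median_imputer.statistics_)  # [4. 4.]
```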

Categorical Encoding

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encoding for categorical features
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)

# Label encoding for target variables
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_categorical)
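
The sketch below shows what the two encoders produce on toy data (the colour and animal values are made up). `OneHotEncoder` returns a sparse matrix by default, so `.toarray()` is used here for display; `handle_unknown="ignore"` is an optional setting added for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Assumption: toy categorical feature (2-D) and toy target (1-D)
X_categorical = np.array([["red"], ["green"], ["blue"], ["green"]])
y_categorical = np.array(["cat", "dog", "cat", "bird"])

# Categories are sorted alphabetically: blue, green, red
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X_categorical)  # sparse matrix
X_dense = X_encoded.toarray()
print(encoder.categories_[0])  # ['blue' 'green' 'red']
print(X_dense[0])              # [0. 0. 1.]  -> "red"

# Classes are sorted alphabetically: bird=0, cat=1, dog=2
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_categorical)
print(y_encoded)               # [1 2 1 0]

# inverse_transform maps the integers back to the original labels
y_decoded = label_encoder.inverse_transform(y_encoded)
```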

Feature Selection

from sklearn.feature_selection import SelectKBest, f_regression

# Select top k features using statistical tests
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)
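
After fitting, the selector can report which columns it kept. A minimal sketch on synthetic regression data (an assumption); `f_regression` is appropriate for continuous targets, while classification targets would typically use a different score function.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Assumption: synthetic data with 20 features, 5 of them informative
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)

selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)

# Boolean mask over the original columns, and the kept column indices
mask = selector.get_support()
selected_idx = selector.get_support(indices=True)

print("shape after selection:", X_selected.shape)
print("kept columns:", selected_idx)
```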

Essential scikit-learn Imports

# Core imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report

# Linear models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ensemble methods
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Clustering
from sklearn.cluster import KMeans, DBSCAN
from sklearn.cluster import AgglomerativeClustering

Typical ML Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)
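
A pipeline can also be tuned as a single estimator. In the sketch below, parameters of a step are addressed as `<step name>__<parameter>`; the synthetic data and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "classifier__C" targets the C parameter of the "classifier" step
param_grid = {"classifier__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every cross-validation fold, which keeps validation data out of the scaling statistics.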

Common Issues & Solutions

Handling Class Imbalance

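# SMOTE is part of the separate imbalanced-learn package (pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE

# Oversample the minority class (resample training data only, never the test set)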
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
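
When adding a dependency is not an option, a different technique that needs only scikit-learn is class weighting. The sketch below is not SMOTE: it reweights the loss instead of generating synthetic samples. The imbalanced dataset is synthetic (an assumption), and the recall values are printed rather than asserted because they vary with the data.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic data with roughly a 90/10 class split
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
class_counts = Counter(y_train.tolist())
print("training class counts:", class_counts)

# class_weight="balanced" weights classes inversely to their frequency
plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Recall on the minority class (label 1) is often the metric of interest
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print("minority recall, unweighted:", round(recall_plain, 3))
print("minority recall, balanced:  ", round(recall_weighted, 3))
```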

Dealing with Outliers

from sklearn.ensemble import IsolationForest

# Detect outliers
isolation_forest = IsolationForest(contamination=0.1)
outlier_labels = isolation_forest.fit_predict(X)

# Remove outliers (fit_predict labels inliers as 1 and outliers as -1)
X_clean = X[outlier_labels == 1]
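
The following sketch runs the same steps on data with a few planted extreme points. The data is synthetic and the `random_state` values are assumptions; with `contamination=0.1`, roughly 10% of samples are flagged regardless of whether they are true outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumption: 90 points near the origin plus 10 planted far-away points
rng = np.random.RandomState(0)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(90, 2))
X_outliers = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

isolation_forest = IsolationForest(contamination=0.1, random_state=0)
outlier_labels = isolation_forest.fit_predict(X)  # 1 = inlier, -1 = outlier

# Keep only the rows labelled as inliers
X_clean = X[outlier_labels == 1]
n_removed = int(np.sum(outlier_labels == -1))

print("rows removed:", n_removed)
print("rows kept:", X_clean.shape[0])
```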

Feature Engineering

from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

scikit-learn provides a consistent, well-documented API for machine learning in Python. This cheat sheet covers the essential workflow from data preparation to model evaluation.

Updated: January 15, 2025
Author: Danial Pahlavan
Category: Machine Learning & Data Science