scikit-learn Cheat Sheet
Essential guide for machine learning with scikit-learn, covering the fundamental workflow and key algorithms.
Core Machine Learning Workflow
| Action | Code Example | Description |
| --- | --- | --- |
| Split Data into Train/Test Sets | `from sklearn import model_selection`<br>`X_train, X_test, y_train, y_test = model_selection.train_test_split(X_all, y_all, train_size=0.5)` | Randomly partitions the data into training and testing subsets so model performance can be evaluated on unseen data. |
| Create and Train a Model | `from sklearn import linear_model`<br>`model = linear_model.LinearRegression()`<br>`model.fit(X_train, y_train)` | The standard workflow: instantiate a model class, then call its `fit` method with the training data (a complete worked example follows the table). |
| Predict with a Trained Model | `y_pred = model.predict(X_test)` | Uses the trained model to make predictions on new data. |
| Evaluate Model Score (R²) | `model.score(X_test, y_test)` | For regression models, this returns the R² score, a measure of how well the model fits the data. |
| Get Model Coefficients | `model.coef_` | Accesses the fitted coefficients (parameters) of a linear model. |
Regularized Regression
| Action | Code Example | Description |
| --- | --- | --- |
| Lasso Regression (L1) | `model = linear_model.Lasso(alpha=1.0)` | A linear model with L1 regularization, which tends to produce sparse coefficients (many zeros); the sketch below demonstrates this. |
| Ridge Regression (L2) | `model = linear_model.Ridge(alpha=2.5)` | A linear model with L2 regularization, which penalizes large coefficient values. |
| Cross-Validated Lasso | `model = linear_model.LassoCV()`<br>`model.fit(X_all, y_all)` | A Lasso model that automatically selects the best alpha parameter via cross-validation. |
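To see the sparsity effect in practice, compare how many coefficients each penalty drives to zero. A sketch on hypothetical synthetic data where only a few features are informative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative assumption: only 3 of 20 features carry signal
X, y = make_regression(n_samples=100, n_features=20, n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=2.5).fit(X, y)

# L1 typically zeros out uninformative coefficients; L2 only shrinks them
print(np.sum(lasso.coef_ == 0))  # many zeros
print(np.sum(ridge.coef_ == 0))  # usually none
```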
Classification Algorithms
| Action | Code Example | Description |
| --- | --- | --- |
| Support Vector Machine (SVM) | `from sklearn import svm`<br>`classifier = svm.SVC()`<br>`classifier.fit(X_train, y_train)` | An example of a classification model. Other classifiers like `DecisionTreeClassifier` and `KNeighborsClassifier` follow the same API, as the sketch below shows. |
| Random Forest Classifier | `from sklearn.ensemble import RandomForestClassifier`<br>`rf = RandomForestClassifier(n_estimators=100)`<br>`rf.fit(X_train, y_train)` | Ensemble method using multiple decision trees. |
| Logistic Regression | `from sklearn.linear_model import LogisticRegression`<br>`lr = LogisticRegression()`<br>`lr.fit(X_train, y_train)` | Linear classification algorithm. |
| K-Nearest Neighbors | `from sklearn.neighbors import KNeighborsClassifier`<br>`knn = KNeighborsClassifier(n_neighbors=5)`<br>`knn.fit(X_train, y_train)` | Instance-based learning algorithm. |
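Because every classifier shares the same `fit`/`predict`/`score` interface, they can be swapped in a loop. A sketch assuming hypothetical data from `make_classification`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical binary classification data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The same fit/score calls work for every estimator
for clf in (SVC(), RandomForestClassifier(n_estimators=100),
            LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```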
Clustering Algorithms
| Action | Code Example | Description |
| --- | --- | --- |
| K-Means Clustering | `from sklearn import cluster`<br>`clustering = cluster.KMeans(n_clusters=3)`<br>`clustering.fit(X)` | An unsupervised learning algorithm that groups data into a specified number of clusters (see the sketch below). |
| Hierarchical Clustering | `from sklearn.cluster import AgglomerativeClustering`<br>`hc = AgglomerativeClustering(n_clusters=3)`<br>`hc.fit(X)` | Builds a hierarchy of clusters using agglomerative methods. |
| DBSCAN | `from sklearn.cluster import DBSCAN`<br>`dbscan = DBSCAN(eps=0.3, min_samples=10)`<br>`dbscan.fit(X)` | Density-based clustering algorithm that can find arbitrarily shaped clusters. |
| Gaussian Mixture Model | `from sklearn.mixture import GaussianMixture`<br>`gmm = GaussianMixture(n_components=3)`<br>`gmm.fit(X)` | Probabilistic clustering using Gaussian distributions. |
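After fitting, the cluster assignments and centroids are exposed as attributes. A minimal sketch, assuming toy blob data from `make_blobs`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 3 well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment per sample
print(kmeans.cluster_centers_)  # coordinates of the 3 centroids
print(kmeans.predict(X[:5]))    # assign points to the nearest centroid
```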
Model Evaluation
| Action | Code Example | Description |
| --- | --- | --- |
| Classification Metrics | `from sklearn import metrics`<br>`metrics.confusion_matrix(y_test, y_pred)`<br>`print(metrics.classification_report(y_test, y_pred))` | Functions for assessing the performance of a classification model. |
| Cross-Validation Score | `from sklearn.model_selection import cross_val_score`<br>`scores = cross_val_score(model, X, y, cv=5)` | Evaluates the model across several different train/test splits via cross-validation. |
| Grid Search (Hyperparameter Tuning) | `from sklearn.model_selection import GridSearchCV`<br>`param_grid = {'C': [0.1, 1, 10]}`<br>`grid_search = GridSearchCV(model, param_grid, cv=5)`<br>`grid_search.fit(X_train, y_train)` | Systematically searches for optimal hyperparameters (full example below). |
| ROC Curve & AUC | `from sklearn.metrics import roc_curve, auc`<br>`y_proba = model.predict_proba(X_test)[:, 1]`<br>`fpr, tpr, thresholds = roc_curve(y_test, y_proba)`<br>`auc_score = auc(fpr, tpr)` | Evaluates binary classification performance across decision thresholds. |
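After fitting, `GridSearchCV` exposes the winning parameters and refits the best model for direct prediction. A sketch using an `SVC` on hypothetical data (any estimator with a `C` parameter would work):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical data for illustration
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid_search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)       # e.g. {'C': 1}
print(grid_search.best_score_)        # mean cross-validated score of the best C
y_pred = grid_search.predict(X_test)  # the refit best model predicts directly
```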
Data Preprocessing
Feature Scaling
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalize to the [0, 1] range
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
```
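One caveat: fit the scaler on the training split only, then reuse it to transform the test split; fitting on the full dataset leaks test-set statistics into training. A minimal sketch, assuming `X_train` and `X_test` already exist:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # apply the same transformation to test data
```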
Handling Missing Values
```python
from sklearn.impute import SimpleImputer

# Replace NaNs with the mean value
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Replace NaNs with the median
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
```
Categorical Encoding
```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encoding for categorical features
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)

# Label encoding for target variables
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_categorical)
```
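A concrete sketch of the one-hot encoder on a hypothetical toy column; `handle_unknown='ignore'` (a real `OneHotEncoder` option) encodes categories unseen at fit time as all-zero rows rather than raising an error:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column for illustration
X_categorical = [['red'], ['green'], ['blue'], ['green']]

encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_categorical)

print(encoder.categories_)  # learned categories per column
print(X_encoded.toarray())  # dense view of the sparse one-hot matrix
```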
Feature Selection
```python
from sklearn.feature_selection import SelectKBest, f_regression

# Select top k features using statistical tests
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)
```
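Continuing from the snippet above, the fitted selector reports which columns survived via `get_support`:

```python
import numpy as np

# Boolean mask over the original feature columns
mask = selector.get_support()
selected_indices = np.where(mask)[0]
print(selected_indices)  # positions of the k retained features
```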
Essential scikit-learn Imports
```python
# Core imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report

# Linear models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ensemble methods
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Clustering
from sklearn.cluster import KMeans, DBSCAN
from sklearn.cluster import AgglomerativeClustering
```
Typical ML Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create pipeline: scaling and classification bundled as one estimator
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train pipeline (the scaler is fit on the training data only)
pipeline.fit(X_train, y_train)

# Make predictions (scaling is applied automatically before classification)
y_pred = pipeline.predict(X_test)
```
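Pipelines also plug directly into hyperparameter search: parameters inside a step are addressed as `<step_name>__<param_name>`. A sketch reusing the pipeline and splits above:

```python
from sklearn.model_selection import GridSearchCV

# 'classifier__C' targets the C parameter of the LogisticRegression step
param_grid = {'classifier__C': [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```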
Common Issues & Solutions
Handling Class Imbalance
```python
# SMOTE comes from imbalanced-learn, a separate package (pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE

# Oversample the minority class with synthetic examples
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```
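If adding a dependency is undesirable, many scikit-learn classifiers offer a built-in alternative via `class_weight`, which reweights the loss instead of resampling the data:

```python
from sklearn.linear_model import LogisticRegression

# Weight each class inversely to its frequency in y
lr = LogisticRegression(class_weight='balanced')
lr.fit(X, y)
```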
Dealing with Outliers
```python
from sklearn.ensemble import IsolationForest

# Detect outliers (fit_predict returns 1 for inliers, -1 for outliers)
isolation_forest = IsolationForest(contamination=0.1)
outlier_labels = isolation_forest.fit_predict(X)

# Keep only the inliers
X_clean = X[outlier_labels == 1]
```
Feature Engineering
```python
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial and interaction features up to degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```
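To see which terms were generated, the fitted transformer can report output feature names (available in recent scikit-learn versions):

```python
# Maps expanded columns back to readable names like 'x0 x1' or 'x0^2'
print(poly.get_feature_names_out())
```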
scikit-learn provides a consistent, well-documented API for machine learning in Python. This cheat sheet covers the essential workflow from data preparation to model evaluation.
Updated: January 15, 2025
Author: Danial Pahlavan
Category: Machine Learning & Data Science