scikit-learn Cheat Sheet
Essential guide for machine learning with scikit-learn, covering the fundamental workflow and key algorithms.
Core Machine Learning Workflow
| Action | Code Example | Description |
| --- | --- | --- |
| Split Data into Train/Test Sets | `from sklearn import model_selection`<br>`X_train, X_test, y_train, y_test = model_selection.train_test_split(X_all, y_all, train_size=0.5)` | Randomly partitions the data into training and testing subsets so model performance can be evaluated on unseen data. |
| Create and Train a Model | `from sklearn import linear_model`<br>`model = linear_model.LinearRegression()`<br>`model.fit(X_train, y_train)` | The standard workflow: instantiate a model class, then call its `fit` method with the training data (a complete worked example follows the table). |
| Predict with a Trained Model | `y_pred = model.predict(X_test)` | Uses the trained model to make predictions on new data. |
| Evaluate Model Score (R²) | `model.score(X_test, y_test)` | For regression models, this returns the R² score, a measure of how well the model fits the data. |
| Get Model Coefficients | `model.coef_` | Accesses the fitted coefficients (parameters) of a linear model. |
Regularized Regression
| Action | Code Example | Description |
| --- | --- | --- |
| Lasso Regression (L1) | `model = linear_model.Lasso(alpha=1.0)` | A linear model with L1 regularization, which tends to produce sparse coefficients (many zeros); the sketch below demonstrates this. |
| Ridge Regression (L2) | `model = linear_model.Ridge(alpha=2.5)` | A linear model with L2 regularization, which penalizes large coefficient values. |
| Cross-Validated Lasso | `model = linear_model.LassoCV()`<br>`model.fit(X_all, y_all)` | A Lasso model that automatically selects the best alpha parameter via cross-validation. |
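To see the sparsity effect in practice, compare how many coefficients each penalty drives to zero. A sketch on hypothetical synthetic data where only a few features are informative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative assumption: only 3 of 20 features carry signal
X, y = make_regression(n_samples=100, n_features=20, n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=2.5).fit(X, y)

# L1 typically zeros out uninformative coefficients; L2 only shrinks them
print(np.sum(lasso.coef_ == 0))  # many zeros
print(np.sum(ridge.coef_ == 0))  # usually none
```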
Classification Algorithms
| Action | Code Example | Description |
| --- | --- | --- |
| Support Vector Machine (SVM) | `from sklearn import svm`<br>`classifier = svm.SVC()`<br>`classifier.fit(X_train, y_train)` | An example of a classification model. Other classifiers like `DecisionTreeClassifier` and `KNeighborsClassifier` follow the same API, as the sketch below shows. |
| Random Forest Classifier | `from sklearn.ensemble import RandomForestClassifier`<br>`rf = RandomForestClassifier(n_estimators=100)`<br>`rf.fit(X_train, y_train)` | Ensemble method using multiple decision trees. |
| Logistic Regression | `from sklearn.linear_model import LogisticRegression`<br>`lr = LogisticRegression()`<br>`lr.fit(X_train, y_train)` | Linear classification algorithm. |
| K-Nearest Neighbors | `from sklearn.neighbors import KNeighborsClassifier`<br>`knn = KNeighborsClassifier(n_neighbors=5)`<br>`knn.fit(X_train, y_train)` | Instance-based learning algorithm. |
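Because every classifier shares the same `fit`/`predict`/`score` interface, they can be swapped in a loop. A sketch assuming hypothetical data from `make_classification`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical binary classification data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The same fit/score calls work for every estimator
for clf in (SVC(), RandomForestClassifier(n_estimators=100),
            LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```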
Clustering Algorithms
| Action | Code Example | Description |
| --- | --- | --- |
| K-Means Clustering | `from sklearn import cluster`<br>`clustering = cluster.KMeans(n_clusters=3)`<br>`clustering.fit(X)` | An unsupervised learning algorithm that groups data into a specified number of clusters (see the sketch below). |
| Hierarchical Clustering | `from sklearn.cluster import AgglomerativeClustering`<br>`hc = AgglomerativeClustering(n_clusters=3)`<br>`hc.fit(X)` | Builds a hierarchy of clusters using agglomerative methods. |
| DBSCAN | `from sklearn.cluster import DBSCAN`<br>`dbscan = DBSCAN(eps=0.3, min_samples=10)`<br>`dbscan.fit(X)` | Density-based clustering algorithm that can find arbitrarily shaped clusters. |
| Gaussian Mixture Model | `from sklearn.mixture import GaussianMixture`<br>`gmm = GaussianMixture(n_components=3)`<br>`gmm.fit(X)` | Probabilistic clustering using Gaussian distributions. |
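After fitting, the cluster assignments and centroids are exposed as attributes. A minimal sketch, assuming toy blob data from `make_blobs`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 3 well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment per sample
print(kmeans.cluster_centers_)  # coordinates of the 3 centroids
print(kmeans.predict(X[:5]))    # assign points to the nearest centroid
```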
Model Evaluation
| Action | Code Example | Description |
| --- | --- | --- |
| Classification Metrics | `from sklearn import metrics`<br>`metrics.confusion_matrix(y_test, y_pred)`<br>`print(metrics.classification_report(y_test, y_pred))` | Functions for assessing the performance of a classification model. |
| Cross-Validation Score | `from sklearn.model_selection import cross_val_score`<br>`scores = cross_val_score(model, X, y, cv=5)` | Evaluates the model across several different train/test splits via cross-validation. |
| Grid Search (Hyperparameter Tuning) | `from sklearn.model_selection import GridSearchCV`<br>`param_grid = {'C': [0.1, 1, 10]}`<br>`grid_search = GridSearchCV(model, param_grid, cv=5)`<br>`grid_search.fit(X_train, y_train)` | Systematically searches for optimal hyperparameters (full example below). |
| ROC Curve & AUC | `from sklearn.metrics import roc_curve, auc`<br>`y_proba = model.predict_proba(X_test)[:, 1]`<br>`fpr, tpr, thresholds = roc_curve(y_test, y_proba)`<br>`auc_score = auc(fpr, tpr)` | Evaluates binary classification performance across decision thresholds. |
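After fitting, `GridSearchCV` exposes the winning parameters and refits the best model for direct prediction. A sketch using an `SVC` on hypothetical data (any estimator with a `C` parameter would work):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical data for illustration
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid_search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)       # e.g. {'C': 1}
print(grid_search.best_score_)        # mean cross-validated score of the best C
y_pred = grid_search.predict(X_test)  # the refit best model predicts directly
```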
Data Preprocessing
Feature Scaling
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalize to the [0, 1] range
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
```
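One caveat: fit the scaler on the training split only, then reuse it to transform the test split; fitting on the full dataset leaks test-set statistics into training. A minimal sketch, assuming `X_train` and `X_test` already exist:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # apply the same transformation to test data
```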
Handling Missing Values
```python
from sklearn.impute import SimpleImputer

# Replace NaNs with the mean value
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Replace NaNs with the median
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
```
Categorical Encoding
```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encoding for categorical features
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)

# Label encoding for target variables
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_categorical)
```
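A concrete sketch of the one-hot encoder on a hypothetical toy column; `handle_unknown='ignore'` (a real `OneHotEncoder` option) encodes categories unseen at fit time as all-zero rows rather than raising an error:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column for illustration
X_categorical = [['red'], ['green'], ['blue'], ['green']]

encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_categorical)

print(encoder.categories_)  # learned categories per column
print(X_encoded.toarray())  # dense view of the sparse one-hot matrix
```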
Feature Selection
```python
from sklearn.feature_selection import SelectKBest, f_regression

# Select top k features using statistical tests
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)
```
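Continuing from the snippet above, the fitted selector reports which columns survived via `get_support`:

```python
import numpy as np

# Boolean mask over the original feature columns
mask = selector.get_support()
selected_indices = np.where(mask)[0]
print(selected_indices)  # positions of the k retained features
```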
Essential scikit-learn Imports
```python
# Core imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report

# Linear models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ensemble methods
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Clustering
from sklearn.cluster import KMeans, DBSCAN
from sklearn.cluster import AgglomerativeClustering
```
Typical ML Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create pipeline: scaling and classification bundled as one estimator
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train pipeline (the scaler is fit on the training data only)
pipeline.fit(X_train, y_train)

# Make predictions (scaling is applied automatically before classification)
y_pred = pipeline.predict(X_test)
```
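Pipelines also plug directly into hyperparameter search: parameters inside a step are addressed as `<step_name>__<param_name>`. A sketch reusing the pipeline and splits above:

```python
from sklearn.model_selection import GridSearchCV

# 'classifier__C' targets the C parameter of the LogisticRegression step
param_grid = {'classifier__C': [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```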
Common Issues & Solutions
Handling Class Imbalance
```python
# SMOTE comes from imbalanced-learn, a separate package (pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE

# Oversample the minority class with synthetic examples
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```
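If adding a dependency is undesirable, many scikit-learn classifiers offer a built-in alternative via `class_weight`, which reweights the loss instead of resampling the data:

```python
from sklearn.linear_model import LogisticRegression

# Weight each class inversely to its frequency in y
lr = LogisticRegression(class_weight='balanced')
lr.fit(X, y)
```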
Dealing with Outliers
```python
from sklearn.ensemble import IsolationForest

# Detect outliers (fit_predict returns 1 for inliers, -1 for outliers)
isolation_forest = IsolationForest(contamination=0.1)
outlier_labels = isolation_forest.fit_predict(X)

# Keep only the inliers
X_clean = X[outlier_labels == 1]
```
Feature Engineering
```python
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial and interaction features up to degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```
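To see which terms were generated, the fitted transformer can report output feature names (available in recent scikit-learn versions):

```python
# Maps expanded columns back to readable names like 'x0 x1' or 'x0^2'
print(poly.get_feature_names_out())
```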
scikit-learn provides a consistent, well-documented API for machine learning in Python. This cheat sheet covers the essential workflow from data preparation to model evaluation.
Updated: January 15, 2025
Author: Danial Pahlavan
Category: Machine Learning & Data Science