Numba & Cython - Python Performance Optimization
Advanced techniques for optimizing Python code performance using Numba (Just-In-Time compilation) and Cython (C extensions for Python).
Introduction to Python Performance Issues
Python's dynamic nature provides flexibility but can be slow for computationally intensive tasks. CPU-bound operations benefit from compilation to machine code.
Common Performance Bottlenecks:
- Tight loops with numeric computations
- Array operations and mathematical functions
- Function call overhead in hot paths
- Memory allocation in loops
- Interpreter overhead for simple operations
Numba - Just-In-Time Compilation
What is Numba?
Numba is a JIT compiler for Python that translates Python functions to optimized machine code at runtime. It uses LLVM compiler infrastructure and specializes functions for specific argument types.
Key Features:
- Minimal source changes (usually just a decorator)
- Automatic parallelization and vectorization
- GPU support via CUDA kernels
- Support for NumPy arrays and ufuncs
- Easy integration with existing code
Getting Started with Numba
pip install numba
Basic Numba Usage
import numba as nb
import numpy as np
import time
# Function to benchmark
def slow_sum(arr):
total = 0.0
for x in arr:
total += x ** 2
return total
# Numba JIT compiled version
@nb.jit
def fast_sum(arr):
total = 0.0
for x in arr:
total += x ** 2
return total
# Vectorized NumPy version for comparison
def numpy_sum(arr):
return np.sum(arr ** 2)
# Benchmark
data = np.random.randn(1000000)
# Pure Python
start = time.time()
result1 = slow_sum(data)
time1 = time.time() - start
# Numba JIT
fast_sum(data)  # warm-up call: the first call triggers compilation
start = time.time()
result2 = fast_sum(data)  # second call runs the compiled machine code
time2 = time.time() - start
# NumPy
start = time.time()
result3 = numpy_sum(data)
time3 = time.time() - start
print(f"Python: {time1:.3f}s")
print(f"Numba: {time2:.3f}s")
print(f"NumPy: {time3:.3f}s")
print(f"Numba speedup: {time1/time2:.1f}x")
Compilation Modes
Different JIT Compilation Options
# Mode 1: Regular JIT (compile on first call)
@nb.jit
def regular_jit(x, y):
return x + y
# Mode 2: Eager compilation with an explicit type signature
@nb.jit(nb.int32(nb.int32, nb.int32))
def typed_function(a, b):
return a * 2 + b
# Mode 3: Eager compilation for multiple signatures
@nb.jit([nb.float64(nb.float64[:]), nb.complex128(nb.complex128[:])])
def overloaded_function(arr):
return np.sum(arr)
# Mode 4: Object mode for debugging (slower; keeps Python objects)
@nb.jit(forceobj=True)  # newer Numba prefers forceobj=True over nopython=False
def debuggable_function(x):
print(type(x)) # Works in object mode
return x * 2
# Mode 5: Force nopython mode (no Python objects)
@nb.jit(nopython=True)
def nopython_only(arr):
# Must operate only on supported types
return np.sum(arr ** 2)
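In practice, nb.njit is the usual shorthand for nb.jit(nopython=True), and every compiled dispatcher keeps a record of the signatures it has specialized. A minimal sketch, reusing the imports from the earlier examples:
# njit is shorthand for jit(nopython=True)
@nb.njit
def njit_sum_squares(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

njit_sum_squares(np.arange(10.0))   # first call compiles a float64[:] specialization
print(njit_sum_squares.signatures)  # list of compiled argument-type signatures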
Array Operations with Numba
@nb.jit(parallel=True)  # parallel=True is required for nb.prange below to run in parallel
def complex_array_ops(arr1, arr2):
result = np.empty_like(arr1)
for i in nb.prange(len(arr1)): # Parallel loop
x, y = arr1[i], arr2[i]
# Complex computation
result[i] = np.sin(x) * np.cos(y) + np.exp(-(x**2 + y**2))
return result
@nb.jit
def matrix_operations(A, B):
"""Matrix multiplication with Numba"""
m, n = A.shape
p = B.shape[1]  # B must have shape (n, p)
C = np.zeros((m, p))
for i in range(m):
for j in range(p):
for k in range(n):
C[i, j] += A[i, k] * B[k, j]
return C
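A quick sanity check of matrix_operations against NumPy's built-in matrix multiply (a minimal sketch; the first call still pays the compilation cost):
A = np.random.randn(50, 30)
B = np.random.randn(30, 40)
C = matrix_operations(A, B)
assert np.allclose(C, A @ B)  # should match np.matmul up to floating-point error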
Parallelization and Vectorization
# Automatic parallelization
@nb.jit(parallel=True)
def parallel_computation(arr):
result = np.zeros_like(arr)
for i in nb.prange(len(arr)):
result[i] = np.sin(arr[i]) ** 2 + np.cos(arr[i]) ** 2
return result
# Vectorized operations
@nb.vectorize([nb.float64(nb.float64, nb.float64)])
def vectorized_func(x, y):
return x**2 + y**2 - 2*x*y*np.cos(np.pi/4)
# Universal functions (ufuncs)
@nb.vectorize([nb.float64(nb.float64)], target='parallel')
def parallel_ufunc(x):
# Automatically parallelized across CPU cores
return np.exp(-x**2) * np.sin(10*x)
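Functions built with nb.vectorize behave like NumPy ufuncs: they broadcast over arrays elementwise, and the parallel target spreads the work across CPU cores. A short usage sketch:
x = np.linspace(0.0, 1.0, 1_000_000)
y = np.linspace(1.0, 2.0, 1_000_000)
print(vectorized_func(x, y)[:3])  # elementwise, just like a NumPy ufunc
print(parallel_ufunc(x)[:3])      # same call syntax, work split across cores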
Advanced Numba Features
GPU Programming with CUDA
from numba import cuda
import math
@cuda.jit
def gpu_kernel(arr):
pos = cuda.grid(1)
if pos < arr.size:
arr[pos] = math.sin(arr[pos]) * math.exp(-arr[pos]**2)
def gpu_computation(data):
# Copy data to GPU
d_arr = cuda.to_device(data)
# Configure kernel
threads_per_block = 256
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
# Launch kernel
gpu_kernel[blocks_per_grid, threads_per_block](d_arr)
# Copy result back to host
return d_arr.copy_to_host()
# Usage
data = np.random.randn(1000000)
result = gpu_computation(data)
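On machines without a CUDA-capable GPU and driver, the call above will fail, so it is worth guarding the GPU path. A minimal sketch using Numba's availability check, with a plain NumPy fallback that mirrors the kernel's expression:
if cuda.is_available():
    result = gpu_computation(data)
else:
    # CPU fallback computing the same expression with NumPy
    result = np.sin(data) * np.exp(-data**2)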
Numba Classes and Object-Oriented Programming
from numba.experimental import jitclass
from numba import int32, float64
@jitclass([
('value', float64),
('count', int32),
])
class RunningAverage:
def __init__(self, initial_value):
self.value = initial_value
self.count = 1
def update(self, new_value):
self.value = (self.value * self.count + new_value) / (self.count + 1)
self.count += 1
def get(self):
return self.value
# Usage
avg = RunningAverage(10.0)
for x in [12.0, 8.0, 15.0]:
avg.update(x)
print(f"Current average: {avg.get()}")
Cython - C Extensions for Python
What is Cython?
Cython is a superset of Python that compiles to C code. It allows writing Python-like code that gets translated to efficient C extensions.
Key Features:
- Python-like syntax with optional C type declarations
- C-level performance when typed appropriately
- Easy C library integration
- Incremental optimization (optimize bottlenecks first)
- Source code compatible with Python
Basic Cython Setup
Installation
pip install cython
Basic Cython Function
# example.pyx (Cython source file)
def cython_fibonacci(int n):
cdef int i
cdef long a = 0, b = 1
for i in range(n):
a, b = b, a + b
return a
# setup.py for compilation
from setuptools import setup
from Cython.Build import cythonize
setup(
ext_modules=cythonize("example.pyx")
)
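Once setup.py is in place, the extension is typically built in place and then imported like any other module. A minimal build-and-use sketch (the module name example matches example.pyx above):
python setup.py build_ext --inplace
# afterwards, from Python:
import example
print(example.cython_fibonacci(10))  # 55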
Type Declarations and Optimization
# fibonacci.pyx
import cython
@cython.boundscheck(False) # Turn off bounds checking
@cython.wraparound(False) # Turn off negative indexing
def fast_fibonacci(long n):
cdef long i, a = 0, b = 1
if n < 2:
return n
for i in range(2, n+1):
a, b = b, a + b
return b
# With memory views for arrays
def sum_squares(double[:] arr):
cdef long n = arr.shape[0]
cdef long i
cdef double total = 0.0
for i in range(n):
total += arr[i] * arr[i]
return total
Classes and Extension Types
# myclass.pyx
import cython
@cython.cclass
class FastCounter:
# C-level attributes
cdef public int count
cdef double[:] data
def __init__(self, data):
self.count = 0
self.data = data
@cython.ccall
def increment(self):
self.count += 1
@cython.ccall
def compute_mean(self):
cdef long i, n = self.data.shape[0]
cdef double total = 0.0
for i in range(n):
total += self.data[i]
return total / n
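After the extension is compiled, the class is used exactly like a normal Python class. A minimal usage sketch, assuming the module above was built under the name myclass:
import numpy as np
from myclass import FastCounter

counter = FastCounter(np.random.randn(1_000))
counter.increment()
print(counter.count)           # 1
print(counter.compute_mean())  # mean of the underlying data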
Cython Array Operations
# arrays.pyx
import cython
import numpy as np
cimport numpy as cnp
from libc.math cimport sqrt  # C-level sqrt used by fast_sqrt below
# Type definitions for NumPy arrays
ctypedef cnp.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.wraparound(False)
def elementwise_multiply(cnp.ndarray[DTYPE_t, ndim=2] A,
cnp.ndarray[DTYPE_t, ndim=2] B):
"""
Element-wise multiplication of two 2D arrays
"""
cdef int m = A.shape[0], n = A.shape[1]
cdef int i, j
# Create output array
cdef cnp.ndarray[DTYPE_t, ndim=2] C = np.empty((m, n), dtype=A.dtype)
for i in range(m):
for j in range(n):
C[i, j] = A[i, j] * B[i, j]
return C
cdef inline double fast_sqrt(double x) nogil:
"""Fast square root using C library"""
return sqrt(x)
def vectorized_sqrt(double[:] arr):
cdef int n = arr.shape[0]
cdef int i
result = np.empty(n, dtype=np.float64)
cdef double[:] result_view = result
for i in range(n):
result_view[i] = fast_sqrt(arr[i])
return result
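Because arrays.pyx cimports numpy, the extension must be compiled with NumPy's C headers on the include path. A minimal setup.py sketch for that (file and module names are illustrative):
# setup.py
import numpy as np
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "arrays",
    ["arrays.pyx"],
    include_dirs=[np.get_include()],  # NumPy C headers required by cimport numpy
)
setup(ext_modules=cythonize([ext]))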
Parallel Cython with OpenMP
# parallel.pyx
import cython
import numpy as np
from cython.parallel import prange
@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_sum(double[:] arr, int num_threads=4):
"""Compute sum using OpenMP parallelization"""
cdef int n = arr.shape[0]
cdef int i
cdef double total = 0.0
# Parallel reduction
for i in prange(n, nogil=True, num_threads=num_threads):
total += arr[i]
return total
@cython.boundscheck(False)
def matrix_multiplication(double[:, :] A, double[:, :] B):
"""Parallel matrix multiplication"""
cdef int m = A.shape[0]
cdef int n = A.shape[1]
cdef int p = B.shape[1]
cdef double[:, :] C = np.zeros((m, p))
cdef int i, j, k
# Parallelize outer loops
for i in prange(m, nogil=True):
for j in range(p):
for k in range(n):
C[i, j] += A[i, k] * B[k, j]
return np.asarray(C)
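prange only executes in parallel when the extension is compiled with OpenMP enabled; without the flag it silently falls back to a serial loop. A minimal setup.py sketch for GCC/Clang (MSVC uses /openmp instead of -fopenmp):
# setup.py
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "parallel",
    ["parallel.pyx"],
    extra_compile_args=["-fopenmp"],  # enable OpenMP so prange runs in parallel
    extra_link_args=["-fopenmp"],
)
setup(ext_modules=cythonize([ext]))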
Calling C Functions from Cython
# c_integration.pyx
import numpy as np
cdef extern from "math.h":
double sin(double x)
double cos(double x)
double exp(double x)
cdef extern from "stdlib.h":
void* malloc(size_t size)
void free(void* ptr)
def optimized_waveform(double[:] times, double frequency):
"""Generate waveform using optimized C functions"""
cdef int n = times.shape[0]
cdef int i
result = np.empty(n, dtype=np.float64)
cdef double[:] result_view = result
cdef double omega = 2 * 3.14159 * frequency
for i in range(n):
# Use fast C math functions
result_view[i] = sin(omega * times[i]) + 0.5 * cos(2 * omega * times[i])
return result
def fast_exponential(double[:] arr):
"""Vectorized exponential using C exp function"""
cdef int n = arr.shape[0]
cdef int i
result = np.empty(n, dtype=np.float64)
cdef double[:] result_view = result
for i in range(n):
result_view[i] = exp(arr[i])
return result
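For standard C math functions, Cython also ships ready-made declarations in libc.math, so the extern "math.h" block above can be replaced by a single cimport; using M_PI here would also avoid the truncated 3.14159 constant in optimized_waveform:
# equivalent to the extern "math.h" block above
from libc.math cimport sin, cos, exp, M_PI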
Performance Comparison and When to Use
Performance Benchmarks
import time
import numpy as np
import numba as nb
# Generate test data
data = np.random.randn(1000000)
# Pure Python
def pure_python_compute(arr):
return [x**2 + np.sin(x) for x in arr]
# NumPy vectorized
def numpy_compute(arr):
return arr**2 + np.sin(arr)
# Numba JIT
@nb.jit
def numba_compute(arr):
return arr**2 + np.sin(arr)
# Cython (compiled)
# Assume cython_compute is defined in compiled module
# Benchmark (warm up the Numba version first so its timing excludes compilation)
numba_compute(data)
functions = [pure_python_compute, numpy_compute, numba_compute]  # add cython_compute if available
names = ['Pure Python', 'NumPy', 'Numba']
for func, name in zip(functions, names):
start = time.time()
result = func(data)
elapsed = time.time() - start
print(f"{name}: {elapsed:.3f}s")
When to Use Each Approach
Use Numba when:
- Minimal code changes needed
- Function-based computations
- The one-time compilation cost of the first call is acceptable
- NumPy-heavy operations
- GPU acceleration needed
Use Cython when:
- Complex algorithms requiring fine-grained control
- Integration with C/C++ libraries needed
- Custom data structures and classes
- Ahead-of-time compilation preferred
- OpenMP parallelism required
- Memory optimization critical
Best Practices
General Optimization Tips
- Profile first - identify bottlenecks before optimizing
- Vectorize operations when possible
- Minimize Python object overhead in hot loops
- Consider algorithm complexity over micro-optimizations
- Use appropriate tools for specific optimization needs
Numba Best Practices
- Use function signatures when possible for better optimization
- Avoid Python objects in nopython mode
- Use prange for parallel loops
- Cache compiled machine code for production use (cache=True; see the sketch after this list)
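A minimal sketch combining an explicit signature with on-disk caching, so repeated process start-ups skip recompilation:
import numba as nb

@nb.jit(nb.float64(nb.float64[:]), nopython=True, cache=True)
def cached_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total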
Cython Best Practices
- Type variables for better performance
- Use memory views for array operations
- Disable bounds checking in optimized sections
- Profile Cython code to identify performance issues (Cython's annotation report, shown below, is a quick first check)
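The annotation report highlights lines that still call into the Python C-API, which usually marks the spots worth typing first (the module name is illustrative):
cython -a yourmodule.pyx   # writes yourmodule.html; highlighted lines indicate Python-level overhead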
Both Numba and Cython provide powerful ways to optimize Python code performance. Numba offers the easiest path for existing Python/NumPy code, while Cython provides more control and broader optimization opportunities. The choice depends on your specific performance needs and willingness to modify code.
This guide covers the core concepts of Numba and Cython. For production optimization, always profile your code first to identify real bottlenecks before applying these techniques.
Updated: January 15, 2025
Author: Danial Pahlavan
Category: Performance Optimization