Numba & Cython - Python Performance Optimization

Advanced techniques for optimizing Python code performance using Numba (Just-In-Time compilation) and Cython (C extensions for Python).


Introduction to Python Performance Issues

Python's dynamic nature provides flexibility but can be slow for computationally intensive tasks. CPU-bound operations benefit from compilation to machine code.

Common Performance Bottlenecks:

- Tight loops with numeric computations
- Array operations and mathematical functions
- Function call overhead in hot paths
- Memory allocation in loops
- Interpreter overhead for simple operations
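
The kind of code that suffers most is a tight Python-level loop like the illustrative example below, where every iteration pays interpreter dispatch and object-allocation overhead:

# Illustrative CPU-bound hot loop: nested iteration with per-element
# arithmetic on plain Python objects, no vectorization.
def pairwise_distance_sum(points):
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            total += (dx * dx + dy * dy) ** 0.5
    return total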

Numba - Just-In-Time Compilation

What is Numba?

Numba is a JIT compiler for Python that translates Python functions to optimized machine code at runtime. It uses LLVM compiler infrastructure and specializes functions for specific argument types.

Key Features:

- No source code changes required (minimal decorators)
- Automatic parallelization and vectorization
- GPU support for CUDA kernels
- Support for NumPy arrays and ufuncs
- Easy integration with existing code

Getting Started with Numba

pip install numba

Basic Numba Usage

import numba as nb
import numpy as np
import time

# Function to benchmark
def slow_sum(arr):
    total = 0.0
    for x in arr:
        total += x ** 2
    return total

# Numba JIT compiled version
@nb.jit
def fast_sum(arr):
    total = 0.0
    for x in arr:
        total += x ** 2
    return total

# Vectorized NumPy version for comparison
def numpy_sum(arr):
    return np.sum(arr ** 2)

# Benchmark
data = np.random.randn(1000000)

# Pure Python
start = time.time()
result1 = slow_sum(data)
time1 = time.time() - start

# Numba JIT (call it once first so compilation time is not measured)
fast_sum(data[:10])
start = time.time()
result2 = fast_sum(data)
time2 = time.time() - start

# NumPy
start = time.time()
result3 = numpy_sum(data)
time3 = time.time() - start

print(f"Python: {time1:.3f}s")
print(f"Numba:  {time2:.3f}s")
print(f"NumPy:  {time3:.3f}s")
print(f"Numba speedup: {time1/time2:.1f}x")

Compilation Modes

Different JIT Compilation Options

# Mode 1: Regular JIT (compile on first call)
@nb.jit
def regular_jit(x, y):
    return x + y

# Mode 2: Eager compilation with an explicit type signature
@nb.jit(nb.int32(nb.int32, nb.int32))
def typed_function(a, b):
    return a * 2 + b

# Mode 3: Compile for specific signatures
@nb.jit([nb.float64(nb.float64[:]), nb.complex128(nb.complex128[:])])
def overloaded_function(arr):
    return np.sum(arr)

# Mode 4: Object mode (allows arbitrary Python objects, but runs much slower)
@nb.jit(forceobj=True)
def debuggable_function(x):
    print(type(x))  # Works in object mode
    return x * 2

# Mode 5: Force nopython mode (no Python objects; equivalent to @nb.njit)
@nb.jit(nopython=True)
def nopython_only(arr):
    # Must operate only on supported types
    return np.sum(arr ** 2)
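
Two related options worth knowing: @nb.njit is shorthand for @nb.jit(nopython=True), and cache=True persists the compiled machine code to disk so later runs skip recompilation:

# Mode 6: nopython shorthand with on-disk caching of the compiled code
@nb.njit(cache=True)
def cached_norm(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total ** 0.5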

Array Operations with Numba

@nb.jit(parallel=True)
def complex_array_ops(arr1, arr2):
    result = np.empty_like(arr1)
    for i in nb.prange(len(arr1)):  # Parallel loop (requires parallel=True)
        x, y = arr1[i], arr2[i]
        # Complex computation
        result[i] = np.sin(x) * np.cos(y) + np.exp(-(x**2 + y**2))
    return result

@nb.jit
def matrix_operations(A, B):
    """Naive matrix multiplication with Numba"""
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p))

    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

    return C
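
A quick way to sanity-check the hand-written kernel against NumPy (note that for large matrices NumPy's BLAS-backed A @ B will generally still be faster than the naive loop):

# Verify the Numba loop version against NumPy's optimized matmul
A = np.random.randn(60, 40)
B = np.random.randn(40, 50)
assert np.allclose(matrix_operations(A, B), A @ B)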

Parallelization and Vectorization

# Automatic parallelization
@nb.jit(parallel=True)
def parallel_computation(arr):
    result = np.zeros_like(arr)
    for i in nb.prange(len(arr)):
        result[i] = np.sin(arr[i]) ** 2 + np.cos(arr[i]) ** 2
    return result

# Vectorized operations
@nb.vectorize([nb.float64(nb.float64, nb.float64)])
def vectorized_func(x, y):
    return x**2 + y**2 - 2*x*y*np.cos(np.pi/4)

# Universal functions (ufuncs)
@nb.vectorize([nb.float64(nb.float64)], target='parallel')
def parallel_ufunc(x):
    # Automatically parallelized across CPU cores
    return np.exp(-x**2) * np.sin(10*x)
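
Beyond @nb.vectorize, Numba also offers @nb.guvectorize for generalized ufuncs that operate on whole subarrays; a minimal sketch (function name and operation are illustrative):

# Generalized ufunc: adds a scalar shift to each element of a 1-D array.
# The output array is supplied as the last argument and filled in place.
@nb.guvectorize([(nb.float64[:], nb.float64, nb.float64[:])], '(n),()->(n)')
def shift_array(x, shift, out):
    for i in range(x.shape[0]):
        out[i] = x[i] + shift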

Advanced Numba Features

GPU Programming with CUDA

from numba import cuda
import math

@cuda.jit
def gpu_kernel(arr):
    pos = cuda.grid(1)
    if pos < arr.size:
        arr[pos] = math.sin(arr[pos]) * math.exp(-arr[pos]**2)

def gpu_computation(data):
    # Copy data to GPU
    d_arr = cuda.to_device(data)

    # Configure kernel
    threads_per_block = 256
    blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block

    # Launch kernel
    gpu_kernel[blocks_per_grid, threads_per_block](d_arr)

    # Copy result back to host
    return d_arr.copy_to_host()

# Usage
data = np.random.randn(1000000)
result = gpu_computation(data)

Numba Classes and Object-Oriented Programming

from numba.experimental import jitclass
from numba import int32, float64

@jitclass([
    ('value', float64),
    ('count', int32),
])
class RunningAverage:
    def __init__(self, initial_value):
        self.value = initial_value
        self.count = 1

    def update(self, new_value):
        self.value = (self.value * self.count + new_value) / (self.count + 1)
        self.count += 1

    def get(self):
        return self.value

# Usage
avg = RunningAverage(10.0)
for x in [12.0, 8.0, 15.0]:
    avg.update(x)
    print(f"Current average: {avg.get()}")

Cython - C Extensions for Python

What is Cython?

Cython is a superset of Python that compiles to C code. It allows writing Python-like code that gets translated to efficient C extensions.

Key Features:

- Python-like syntax with optional C type declarations
- C-level performance when typed appropriately
- Easy C library integration
- Incremental optimization (optimize bottlenecks first)
- Source code compatible with Python

Basic Cython Setup

Installation

pip install cython

Basic Cython Function

# example.pyx (Cython source file)
def cython_fibonacci(int n):
    cdef int i
    cdef long a = 0, b = 1
    for i in range(n):
        a, b = b, a + b
    return a

# setup.py for compilation
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("example.pyx")
)
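
Assuming the files above are saved as example.pyx and setup.py, the extension can be built in place and imported like any other module:

# Build the C extension in place:
#   python setup.py build_ext --inplace
import example
print(example.cython_fibonacci(10))  # 55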

Type Declarations and Optimization

# fibonacci.pyx
import cython

@cython.boundscheck(False)  # Turn off bounds checking
@cython.wraparound(False)   # Turn off negative indexing
def fast_fibonacci(long n):
    cdef long i, a = 0, b = 1
    if n < 2:
        return n
    for i in range(2, n+1):
        a, b = b, a + b
    return b

# With memory views for arrays
def sum_squares(double[:] arr):
    cdef long n = arr.shape[0]
    cdef long i
    cdef double total = 0.0

    for i in range(n):
        total += arr[i] * arr[i]

    return total

Classes and Extension Types

# myclass.pyx
import cython

@cython.cclass
class FastCounter:
    # C-level attributes
    cdef public int count
    cdef double[:] data

    def __init__(self, data):
        self.count = 0
        self.data = data

    @cython.ccall
    def increment(self):
        self.count += 1

    @cython.ccall
    def compute_mean(self):
        cdef long i, n = self.data.shape[0]
        cdef double total = 0.0

        for i in range(n):
            total += self.data[i]

        return total / n
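
Once compiled (for instance with a setup.py like the earlier one), the extension type behaves like a normal Python class; a small usage sketch, assuming the module is importable as myclass:

import numpy as np
from myclass import FastCounter

data = np.random.randn(1000)
counter = FastCounter(data)
counter.increment()
print(counter.count)           # 1
print(counter.compute_mean())  # close to 0.0 for standard-normal data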

Cython Array Operations

# arrays.pyx
import cython
import numpy as np
cimport numpy as cnp
from libc.math cimport sqrt

# Type definitions for NumPy arrays
ctypedef cnp.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def elementwise_multiply(cnp.ndarray[DTYPE_t, ndim=2] A,
                        cnp.ndarray[DTYPE_t, ndim=2] B):
    """
    Element-wise multiplication of two 2D arrays
    """
    cdef int m = A.shape[0], n = A.shape[1]
    cdef int i, j

    # Create output array
    cdef cnp.ndarray[DTYPE_t, ndim=2] C = np.empty((m, n), dtype=A.dtype)

    for i in range(m):
        for j in range(n):
            C[i, j] = A[i, j] * B[i, j]

    return C

cdef inline double fast_sqrt(double x) nogil:
    """Fast square root using the C library sqrt (cimported above)"""
    return sqrt(x)

def vectorized_sqrt(double[:] arr):
    cdef int n = arr.shape[0]
    cdef int i

    result = np.empty(n, dtype=np.float64)
    cdef double[:] result_view = result

    for i in range(n):
        result_view[i] = fast_sqrt(arr[i])

    return result

Parallel Cython with OpenMP

# parallel.pyx
from cython.parallel import prange
import cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_sum(double[:] arr, int num_threads=4):
    """Compute sum using OpenMP parallelization"""
    cdef int n = arr.shape[0]
    cdef int i
    cdef double total = 0.0

    # Parallel reduction
    for i in prange(n, nogil=True, num_threads=num_threads):
        total += arr[i]

    return total

@cython.boundscheck(False)
def matrix_multiplication(double[:, :] A, double[:, :] B):
    """Parallel matrix multiplication"""
    cdef int m = A.shape[0]
    cdef int n = A.shape[1]
    cdef int p = B.shape[1]

    cdef double[:, :] C = np.zeros((m, p))
    cdef int i, j, k

    # Parallelize outer loops
    for i in prange(m, nogil=True):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

    return np.asarray(C)
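
Note that prange only runs in parallel when the extension is compiled with OpenMP support; a setup.py sketch for GCC/Clang (MSVC uses /openmp instead):

# setup.py for parallel.pyx with OpenMP enabled
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    "parallel",
    sources=["parallel.pyx"],
    extra_compile_args=["-fopenmp"],
    extra_link_args=["-fopenmp"],
)
setup(ext_modules=cythonize([ext]))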

Calling C Functions from Cython

# c_integration.pyx
import numpy as np

cdef extern from "math.h":
    double sin(double x)
    double cos(double x)
    double exp(double x)

cdef extern from "stdlib.h":
    void* malloc(size_t size)
    void free(void* ptr)

def optimized_waveform(double[:] times, double frequency):
    """Generate waveform using optimized C functions"""
    cdef int n = times.shape[0]
    cdef int i

    result = np.empty(n, dtype=np.float64)
    cdef double[:] result_view = result

    cdef double omega = 2 * 3.14159 * frequency

    for i in range(n):
        # Use fast C math functions
        result_view[i] = sin(omega * times[i]) + 0.5 * cos(2 * omega * times[i])

    return result

def fast_exponential(double[:] arr):
    """Vectorized exponential using C exp function"""
    cdef int n = arr.shape[0]
    cdef int i

    result = np.empty(n, dtype=np.float64)
    cdef double[:] result_view = result

    for i in range(n):
        result_view[i] = exp(arr[i])

    return result
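
After compiling c_integration.pyx, the functions are called like ordinary Python functions; a brief usage sketch:

import numpy as np
from c_integration import optimized_waveform, fast_exponential

times = np.linspace(0.0, 1.0, 1000)      # 1 second sampled at 1 kHz
wave = optimized_waveform(times, 5.0)    # 5 Hz waveform
decay = fast_exponential(-times)         # exp(-t)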

Performance Comparison and When to Use

Performance Benchmarks

import time
import numpy as np
import numba as nb

# Generate test data
data = np.random.randn(1000000)

# Pure Python
def pure_python_compute(arr):
    return [x**2 + np.sin(x) for x in arr]

# NumPy vectorized
def numpy_compute(arr):
    return arr**2 + np.sin(arr)

# Numba JIT
@nb.jit
def numba_compute(arr):
    return arr**2 + np.sin(arr)

# Cython (compiled)
# Assume cython_compute is defined in compiled module

# Benchmark (warm up the Numba version so compilation is not timed)
numba_compute(data[:10])

functions = [pure_python_compute, numpy_compute, numba_compute]  # Add cython_compute if available
names = ['Pure Python', 'NumPy', 'Numba']

for func, name in zip(functions, names):
    start = time.time()
    result = func(data)
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.3f}s")
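
For the missing Cython entry, a hypothetical compute.pyx could look like the sketch below (module and function names are illustrative; it would need to be compiled and imported before being added to the benchmark lists):

# compute.pyx (hypothetical) -- a Cython counterpart to numba_compute
import numpy as np
from libc.math cimport sin

def cython_compute(double[:] arr):
    cdef Py_ssize_t i, n = arr.shape[0]
    result = np.empty(n, dtype=np.float64)
    cdef double[:] out = result
    for i in range(n):
        out[i] = arr[i] * arr[i] + sin(arr[i])
    return result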

When to Use Each Approach

Use Numba when:

- Your hot spots are numerical loops over NumPy arrays
- You want speedups with minimal code changes (a decorator)
- You need GPU kernels or automatic parallelization with little setup
- The hot code can stay within Numba's supported Python/NumPy subset

Use Cython when:

- You need to call C/C++ libraries or manage memory at the C level
- You want to ship precompiled extension modules with no JIT dependency
- You need fine-grained control over types, the GIL, and the generated C code
- The code relies on Python features that Numba does not support

Best Practices

General Optimization Tips

  1. Profile first - identify bottlenecks before optimizing
  2. Vectorize operations when possible
  3. Minimize Python object overhead in hot loops
  4. Consider algorithm complexity over micro-optimizations
  5. Use appropriate tools for specific optimization needs
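
As a minimal illustration of the first tip, the snippet below profiles the pure-Python version from the benchmark above (assuming pure_python_compute and data are defined at module level):

import cProfile
import pstats

# Record a profile of the slow implementation, then show the top entries
cProfile.run("pure_python_compute(data)", "compute.prof")
pstats.Stats("compute.prof").sort_stats("cumulative").print_stats(5)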

Numba Best Practices

- Prefer nopython mode (@nb.njit); use object mode only for debugging
- Warm up functions (or use cache=True) so timings exclude compilation
- Keep hot functions restricted to NumPy arrays and supported operations
- Use parallel=True with nb.prange only when loop iterations are independent

Cython Best Practices

- Add C type declarations (cdef) and memoryviews to the hot paths first
- Disable boundscheck/wraparound only once the code is known to be correct
- Use "cython -a" annotation output to spot lines that still call into Python
- Release the GIL (nogil) in fully typed loops to enable prange parallelism

Both Numba and Cython provide powerful ways to optimize Python code performance. Numba offers the easiest path for existing Python/NumPy code, while Cython provides more control and broader optimization opportunities. The choice depends on your specific performance needs and willingness to modify code.

This guide covers the core concepts of Numba and Cython. For production optimization, always profile your code first to identify real bottlenecks before applying these techniques.

Updated: January 15, 2025
Author: Danial Pahlavan
Category: Performance Optimization