Apache Parquet - Columnar Storage Format
A comprehensive guide to Apache Parquet, the widely used columnar storage format designed for efficient big data processing and analytics.
What is Apache Parquet?
Apache Parquet is an open-source columnar storage file format designed for use in big data ecosystems. It provides efficient data compression and encoding schemes with enhanced performance for complex data processing tasks like filtering, aggregation, and joins.
Key Features:
- Columnar Storage: Data stored by column rather than row
- Efficient Compression: Advanced algorithms with good compression ratios
- Schema Evolution: Support for schema changes over time
- Predicate Pushdown: Filter data at the storage level
- Multiple Encodings: Adaptive encoding based on data type
How Parquet Works
Row vs Column Storage
Traditional Row Storage:
Row 1: John, Smith, 1980, Male, Engineer
Row 2: Jane, Doe, 1985, Female, Manager
Row 3: Bob, Johnson, 1975, Male, Analyst
Storage: FirstName|LastName|BirthYear|Gender|Job (repeated for each row)
Parquet Columnar Storage:
Names: John, Jane, Bob
LastNames: Smith, Doe, Johnson
BirthYear: 1980, 1985, 1975
Gender: Male, Female, Male
Job: Engineer, Manager, Analyst
Benefits of Columnar Storage
- Selective Access: Only read required columns
- Better Compression: Similar values grouped together
- Efficient Filtering: Apply predicates to columns
- Vectorized Operations: Process entire columns at once
- Reduced I/O: Skip irrelevant data
PyArrow Integration
Installation and Setup
pip install pyarrow pandas
Basic Parquet Operations
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
# Create sample data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
    'department': ['Engineering', 'Sales', 'Engineering', 'Marketing']
})
# Convert to PyArrow Table
table = pa.Table.from_pandas(df)
# Write to Parquet
pq.write_table(table, 'employees.parquet')
Reading Parquet Files
# Read entire file
table = pq.read_table('employees.parquet')
df = table.to_pandas()
# Read specific columns only
table = pq.read_table('employees.parquet', columns=['name', 'salary'])
print(f"Shape with selected columns: {table.shape}")
# Read with filtering
import pyarrow.compute as pc
# Read and filter in memory
table = pq.read_table('employees.parquet')
mask = pc.greater(table['salary'], 55000)  # boolean mask over the salary column
filtered = table.filter(mask)
print(f"High earners: {filtered.num_rows}")
Schema and Data Types
Parquet Schema
# Inspect schema
parquet_file = pq.ParquetFile('employees.parquet')
print("Schema:")
print(parquet_file.schema_arrow)
# Schema evolution example
original_schema = pa.schema([
    ('name', pa.string()),
    ('age', pa.int64()),
    ('salary', pa.float64())
])
# Add new column to schema
extended_schema = original_schema.append(pa.field('department', pa.string()))
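One way schema evolution shows up in practice is when newer files carry columns that older files lack. The sketch below (file names and sample rows are illustrative) writes one file per schema version and reads both as a single dataset under the extended schema; the column missing from the older file comes back as nulls.
import pyarrow.dataset as pads
# Older file: only the original three columns
old_table = pa.table({
    'name': ['Eve', 'Frank'],
    'age': [41, 38],
    'salary': [82000.0, 64000.0]
}, schema=original_schema)
pq.write_table(old_table, 'employees_v1.parquet')
# Newer file: includes the added 'department' column
new_table = pa.table({
    'name': ['Grace'],
    'age': [29],
    'salary': [58000.0],
    'department': ['Engineering']
}, schema=extended_schema)
pq.write_table(new_table, 'employees_v2.parquet')
# Read both files under the extended schema; 'department' is null for the old file
ds = pads.dataset(['employees_v1.parquet', 'employees_v2.parquet'],
                  schema=extended_schema, format='parquet')
print(ds.to_table().to_pandas())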
Data Types in Parquet
# Define schema explicitly
custom_schema = pa.schema([
    pa.field('id', pa.int64()),
    pa.field('timestamp', pa.timestamp('s')),
    pa.field('price', pa.decimal128(10, 2)),  # 8 digits before decimal, 2 after
    pa.field('category', pa.dictionary(pa.int8(), pa.string())),  # Dictionary encoding
    pa.field('tags', pa.list_(pa.string()))  # List/array type
])
# Create table with custom schema (decimal columns expect decimal.Decimal values)
from decimal import Decimal
data = {
    'id': [1, 2, 3],
    'timestamp': pd.date_range('2025-01-01', periods=3),
    'price': [Decimal('99.99'), Decimal('149.50'), Decimal('79.95')],
    'category': ['A', 'B', 'A'],
    'tags': [['tech', 'new'], ['home', 'sale'], ['books']]
}
table = pa.table(data, schema=custom_schema)
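To see how these types survive a write/read round trip, write the table and inspect the schema on read. A minimal sketch follows ('typed_data.parquet' is an illustrative file name); note that second-resolution timestamps are stored at a finer resolution supported by Parquet, and dictionary-encoded columns come back as plain strings unless requested explicitly.
pq.write_table(table, 'typed_data.parquet')
# By default the 'category' column is decoded back to plain strings
roundtrip = pq.read_table('typed_data.parquet')
print(roundtrip.schema)
# Keep the dictionary representation for selected columns on read
roundtrip_dict = pq.read_table('typed_data.parquet', read_dictionary=['category'])
print(roundtrip_dict.schema.field('category').type)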
Partitioning and Optimization
Partitioning by Columns
# Write partitioned data (uses the employees table, which has a 'department' column)
pq.write_to_dataset(
    table,
    root_path='partitioned_data',
    partition_cols=['department']
)
# Partition structure created:
# partitioned_data/
# ├── department=Engineering/
# │   └── part-0.parquet
# ├── department=Sales/
# │   └── part-0.parquet
# └── department=Marketing/
#     └── part-0.parquet
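Reading a partitioned dataset back works the same way: the partition column is reconstructed from the directory names, and filters on it skip whole directories. A minimal sketch against the hive-style layout written above:
import pyarrow.dataset as pads
# 'department' is recovered from the department=... directory names
ds = pads.dataset('partitioned_data', format='parquet', partitioning='hive')
# Only the department=Engineering directory is actually read
engineering = ds.to_table(filter=pads.field('department') == 'Engineering')
print(engineering.to_pandas())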
Row Group Optimization
# Control row group size for better performance
pq.write_table(
    table,
    'optimized.parquet',
    row_group_size=100000,  # Number of rows per row group
    use_dictionary=True,    # Use dictionary encoding for strings
    compression='snappy'    # Compression codec
)
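Row groups are also the unit you can read back individually, which is handy for sampling a large file or checking that the chosen row_group_size produced sensible chunks. A small sketch against the file written above:
pf = pq.ParquetFile('optimized.parquet')
print(f"Row groups: {pf.metadata.num_row_groups}")
# Size and row count of each row group
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    print(f"Row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
# Read just the first row group instead of the whole file
sample = pf.read_row_group(0)
print(sample.num_rows)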
Compression Algorithms
Available Compression Codecs
# Different compression options
from pathlib import Path
compression_options = ['snappy', 'gzip', 'zstd', 'lz4', 'brotli']
# Write with each codec and compare file sizes
for codec in compression_options:
    pq.write_table(table, f'data_{codec}.parquet', compression=codec)
    file_size = Path(f'data_{codec}.parquet').stat().st_size
    print(f"{codec}: {file_size} bytes")
Compression Trade-offs
- Snappy: Fast compression/decompression, good for analytical workloads
- Gzip: Better compression ratio, slower than Snappy
- Zstandard (Zstd): Configurable compression-speed trade-off
- LZ4: Very fast compression/decompression
- Brotli: Highest compression ratio, slowest (a timing sketch follows this list)
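A rough way to see these trade-offs on your own data is to time reads of the files written in the size comparison above. A minimal sketch (timings vary by machine and dataset):
import time
# Reuse the data_<codec>.parquet files written in the comparison above
for codec in ['snappy', 'gzip', 'zstd', 'lz4', 'brotli']:
    path = f'data_{codec}.parquet'
    start = time.perf_counter()
    pq.read_table(path)
    elapsed = time.perf_counter() - start
    print(f"{codec}: read in {elapsed * 1000:.1f} ms")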
Advanced Features
Predicate Pushdown
# Filter at read time (pushdown to storage layer)
from pyarrow.dataset import dataset
# Create dataset for advanced operations
ds = dataset('employees.parquet', format='parquet')
# Define filter
filter_expr = pc.field('salary') > 55000
# Read only data that satisfies filter
filtered_data = ds.to_table(filter=filter_expr)
print(f"Filtered results: {filtered_data.num_rows} rows")
Chunked Reading
# Process large files in chunks
parquet_file = pq.ParquetFile('large_file.parquet')
for i, batch in enumerate(parquet_file.iter_batches(batch_size=10000)):
    # Process each batch (process_batch is a placeholder for your own logic)
    df = batch.to_pandas()
    process_batch(df)
    print(f"Processed batch {i+1}")
Metadata and Statistics
# Access file metadata
metadata = pq.read_metadata('data.parquet')
print(f"Number of row groups: {metadata.num_row_groups}")
print(f"Total rows: {metadata.num_rows}")
print(f"Created by: {metadata.created_by}")
# Get column statistics for each row group
for i in range(metadata.num_row_groups):
    rg_meta = metadata.row_group(i)
    for j in range(rg_meta.num_columns):
        col = rg_meta.column(j)
        if col.statistics is not None:
            print(f"Column {col.path_in_schema}: min={col.statistics.min}, max={col.statistics.max}")
Integration with Big Data Tools
Spark Integration
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ParquetExample").getOrCreate()
# Read Parquet with Spark
df = spark.read.parquet("large_dataset.parquet")
# Apply transformations
result = df.filter("salary > 50000").groupBy("department").avg("salary")
# Write back to Parquet with partitioning
result.write.partitionBy("department").parquet("salary_analysis.parquet")
Pandas Integration
# Pandas can read/write Parquet directly (if pyarrow installed)
df = pd.read_parquet('data.parquet')
# Write with specific options
df.to_parquet('output.parquet',
              engine='pyarrow',
              compression='snappy',
              index=False)
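Column selection and filtering are also exposed through pandas, which forwards them to the underlying PyArrow engine. A small sketch (file and column names are illustrative):
# Read only two columns, and only rows matching the filter
df = pd.read_parquet('data.parquet',
                     engine='pyarrow',
                     columns=['name', 'salary'],
                     filters=[('salary', '>', 55000)])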
Performance Optimization
Best Practices for Performance
# Optimize for your access patterns
# For analytical workloads (few columns, many rows)
table = pq.read_table('data.parquet', columns=['col1', 'col3', 'col5'])
# For time-based queries, partition on the time columns
# (assumes the table contains 'year' and 'month' columns)
pq.write_to_dataset(table,
                    root_path='time_partitioned',
                    partition_cols=['year', 'month'])
# Use an appropriate row group size
pq.write_table(table, 'optimized.parquet',
               row_group_size=500000,  # Balance memory use vs query performance
               compression='zstd',
               compression_level=3)
Schema Design Guidelines
- Use appropriate data types: choose only the precision you need (e.g. int32 instead of int64)
- Consider dictionary encoding for low-cardinality string columns
- Partition strategically: use common query filters as partition keys
- Avoid deep nesting when possible; prefer flat schemas for analytics (see the sketch after this list)
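A quick way to check the impact of these choices is to write the same data twice, once with wide types and dictionary encoding disabled, and once with narrower types plus dictionary encoding, then compare file sizes. A minimal sketch with synthetic, illustrative data (column and file names are hypothetical):
from pathlib import Path
import random
n = 100_000
ids = list(range(n))
statuses = [random.choice(['new', 'active', 'closed']) for _ in range(n)]
# Wide integer type, dictionary encoding disabled
wide = pa.table({'id': pa.array(ids, pa.int64()),
                 'status': pa.array(statuses, pa.string())})
pq.write_table(wide, 'wide.parquet', use_dictionary=False)
# Narrower integer type, dictionary encoding for the low-cardinality column
narrow = pa.table({'id': pa.array(ids, pa.int32()),
                   'status': pa.array(statuses, pa.string())})
pq.write_table(narrow, 'narrow.parquet', use_dictionary=['status'])
print(Path('wide.parquet').stat().st_size, Path('narrow.parquet').stat().st_size)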
Comparison with Other Formats
Parquet vs CSV
- Parquet: Compressed, columnar, typed, faster queries
- CSV: Human readable, universal support, slower for analytics
Parquet vs ORC
- Parquet: Better compression, cross-engine support
- ORC: Optimized for Hive, better predicate pushdown
Parquet vs Delta Lake
- Parquet: File format only, no transactions
- Delta Lake: Adds ACID transactions, time travel, schema enforcement
File Format Details
Parquet File Structure
Parquet File:
├── Header (magic number "PAR1")
├── Row Groups (1+):
│   ├── Column Chunks:
│   │   ├── Page Header
│   │   ├── Data Pages (compressed/encoded)
│   │   └── Dictionary Pages (optional)
│   └── Metadata
└── Footer:
    ├── Metadata
    └── Magic number "PAR1"
Encoding Types
- Plain: No encoding, direct values
- Dictionary: Replaces repeated values with small integer codes (can be controlled per column; see the sketch after this list)
- Run Length Encoding (RLE): Compresses runs of identical values
- Delta Encoding: Stores differences between values
- Bit Packing: Packs small integers more efficiently
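Most of these encodings are chosen automatically by the writer, but dictionary encoding can be enabled for specific columns when you know their cardinality. A one-line sketch ('encoded.parquet' is an illustrative file name; 'department' refers to the employees table used earlier):
# Dictionary-encode only the low-cardinality 'department' column;
# other columns keep the writer's default encodings
pq.write_table(table, 'encoded.parquet', use_dictionary=['department'])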
Apache Parquet has become a de facto standard for analytical workloads in modern data lakes and warehouses. Its columnar format and advanced compression techniques make it well suited to big data processing, providing significant performance improvements over traditional row-based formats.
This guide covers the core functionality of Apache Parquet with PyArrow. For production systems, consider additional tools like Apache Spark, Presto, or Athena for processing Parquet files at scale.
Updated: January 15, 2025
Author: Danial Pahlavan
Category: Data Formats