From Hours to Minutes: A Step-by-Step Journey of Optimizing Large-Scale Data Imports

Table of Contents

  1. Introduction
  2. The Challenge
  3. Version 1: The Naive Approach
  4. Version 2: Basic Optimizations
  5. Version 3: Filtering at Source
  6. Version 4: Parallel Processing
  7. Version 5: File Format Optimization - CSV vs Excel
  8. Version 6: Advanced Libraries
  9. Version 7: Database Optimizations
  10. Version 8: The Ultimate Solution
  11. Additional Optimization Techniques
  12. Performance Comparison
  13. Key Lessons Learned
  14. Resources and Further Reading

Introduction

Imagine you need to import 100 million rows of data from CSV files into a database. The naive approach might take 8+ hours. This article documents a real-world journey of optimizing such an import, reducing the time from 8 hours to just 12 minutes - a 40x improvement.

This is not just about making code faster; it's about understanding data flow, identifying bottlenecks, and applying the right tool for each problem.

The Challenge

The Data:

  • 4 CSV files containing energy generation data
  • Total: ~100 million rows (about 15 GB)
  • Each row: timestamp, generator ID, power output, and metadata
  • Only need data for 50 specific generators out of 1000+ in the files

The Goal:

  • Import only relevant data into PostgreSQL database
  • Prevent duplicate imports
  • Maintain data integrity
  • Do it as fast as possible
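
A quick note on the target table: the sketch below shows one way to create it so the duplicate-prevention goal can be enforced by the database itself. This is a minimal sketch, not the article's original schema; the table and column names follow the rest of the article, and the connection string is a placeholder.

import psycopg2

DATABASE_URL = "postgresql://user:password@localhost/energy_db"  # placeholder

CREATE_TABLE_SQL = """
    CREATE TABLE IF NOT EXISTS generation_data (
        timestamp     TIMESTAMPTZ      NOT NULL,
        generator_id  INTEGER          NOT NULL,
        power_output  DOUBLE PRECISION NOT NULL,
        UNIQUE (generator_id, timestamp)  -- enables ON CONFLICT deduplication later
    )
"""

with psycopg2.connect(DATABASE_URL) as connection:
    with connection.cursor() as cursor:
        cursor.execute(CREATE_TABLE_SQL)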

Version 1: The Naive Approach

Time: 8+ hours | Speed: ~3,500 rows/second

Let's start with how most developers would first approach this:

import pandas as pd
import psycopg2

def import_csv_naive(csv_file, connection):
    """The simplest possible approach - read everything, insert row by row"""
    
    # Read entire CSV into memory
    df = pd.read_csv(csv_file)
    
    cursor = connection.cursor()
    
    # Insert each row one by one
    for index, row in df.iterrows():
        sql = """
            INSERT INTO generation_data 
            (timestamp, generator_id, power_output) 
            VALUES (%s, %s, %s)
        """
        cursor.execute(sql, (row['timestamp'], row['generator_id'], row['power_output']))
    
    connection.commit()

Why This Is Slow:

  1. Memory explosion: Reading 25 million rows (~4GB CSV) uses ~15GB RAM
  2. Row-by-row insertion: Each INSERT is a separate database round-trip
  3. No filtering: Processes all data, even irrelevant rows
  4. Single-threaded: Uses only one CPU core

What Happens Behind the Scenes:

CSV File (4GB) → Python reads all → RAM (15GB) → Send to DB row-by-row → Database
                  ↑                   ↑           ↑
                  SLOW              WASTEFUL    EXTREMELY SLOW

Resource: Pandas Memory Usage Guide


Version 2: Basic Optimizations

Time: 4 hours | Speed: ~7,000 rows/second

Let's fix the obvious problems:

import psycopg2.extras

def import_csv_chunked(csv_file, connection):
    """Read in chunks and batch insert"""
    
    CHUNK_SIZE = 10000  # Process 10k rows at a time
    cursor = connection.cursor()
    
    # Read CSV in chunks to manage memory
    for chunk_df in pd.read_csv(csv_file, chunksize=CHUNK_SIZE):
        
        # Prepare batch insert as plain tuples of the columns we load
        records = list(
            chunk_df[['timestamp', 'generator_id', 'power_output']]
            .itertuples(index=False, name=None)
        )
        
        # Use execute_values for batch insert (much faster!)
        psycopg2.extras.execute_values(
            cursor,
            """INSERT INTO generation_data 
               (timestamp, generator_id, power_output) 
               VALUES %s""",
            records,
            template="(%s, %s, %s)",
            page_size=1000
        )
        connection.commit()

Improvements Made:

  1. Chunked reading: Never loads entire file into memory
  2. Batch inserts: Groups 1000 inserts into one database call
  3. Memory efficient: Uses constant memory regardless of file size

Visual Comparison:

Before: INSERT, INSERT, INSERT, INSERT... (10,000 database calls)
After:  INSERT 1000 rows, INSERT 1000 rows... (10 database calls)

Resource: PostgreSQL Batch Insert Performance


Version 3: Filtering at Source

Time: 1 hour | Speed: ~28,000 rows/second

Why import data we don't need? Let's filter early:

def import_csv_filtered(csv_file, connection, relevant_generators):
    """Only import data for generators we care about"""
    
    # We only need 50 out of 1000+ generators
    relevant_set = set(relevant_generators)  # O(1) lookup
    
    for chunk_df in pd.read_csv(csv_file, chunksize=10000):
        
        # Filter BEFORE processing - this is key!
        filtered_df = chunk_df[chunk_df['generator_id'].isin(relevant_set)]
        
        if filtered_df.empty:
            continue  # Skip if no relevant data
        
        # Now we're only inserting 5% of the data
        records = filtered_df.to_records(index=False)
        # ... batch insert as before

The Power of Early Filtering:

Original: 100 million rows → Process all → Insert all → Database
Filtered: 100 million rows → Process all → Insert 5 million → Database
                                              ↑
                                    95% reduction in database work!

Why Sets Are Important:

# Slow: list lookup is O(n)
if generator_id in relevant_list:  # Checks each item one by one
    ...

# Fast: set lookup is O(1)
if generator_id in relevant_set:   # Constant-time hash table lookup
    ...
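
A quick sanity check with timeit makes the gap concrete. This is a minimal sketch; the 1,000 IDs mirror the number of generators in the files.

import timeit

relevant_list = list(range(1000))
relevant_set = set(relevant_list)

# Worst case for the list: the ID we look up sits at the end
list_time = timeit.timeit(lambda: 999 in relevant_list, number=100_000)
set_time = timeit.timeit(lambda: 999 in relevant_set, number=100_000)

print(f"list lookup: {list_time:.3f}s  |  set lookup: {set_time:.3f}s")
# The set lookup is typically orders of magnitude faster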

Resource: Python Set Performance


Version 4: Parallel Processing

Time: 15 minutes | Speed: ~110,000 rows/second

Modern computers have multiple CPU cores. Let's use them all:

from multiprocessing import Process, Queue

def worker(file_path, worker_id, relevant_generators, result_queue):
    """Each worker processes one file independently"""
    
    # Each worker gets its own database connection
    connection = psycopg2.connect(DATABASE_URL)
    
    # Process the file, filtering to the relevant generators (Version 3)
    result = import_csv_filtered(file_path, connection, relevant_generators)
    result_queue.put((worker_id, result))

def import_parallel(csv_files, relevant_generators):
    """Process multiple files simultaneously"""
    
    processes = []
    result_queue = Queue()
    
    # Start one worker per file
    for i, csv_file in enumerate(csv_files):
        p = Process(target=worker, args=(csv_file, i, relevant_generators, result_queue))
        p.start()
        processes.append(p)
    
    # Collect results, then wait for all workers to complete
    results = [result_queue.get() for _ in processes]
    for p in processes:
        p.join()
    
    return results

How Parallel Processing Works:

Sequential (Before):
File 1 ████████████ Done
File 2              ████████████ Done
File 3                           ████████████ Done
File 4                                        ████████████ Done
Time: ──────────────────────────────────────────────────────────→

Parallel (After):
File 1 ████████████ Done
File 2 ████████████ Done
File 3 ████████████ Done
File 4 ████████████ Done
Time: ─────────────→  (4x faster!)

CPU Utilization:

Before: CPU Core 1: 100% | Core 2: 0% | Core 3: 0% | Core 4: 0%
After:  CPU Core 1: 95%  | Core 2: 93% | Core 3: 96% | Core 4: 94%

Resource: Python Multiprocessing Guide


Version 5: File Format Optimization - CSV vs Excel

Time: 6 minutes | Speed: ~280,000 rows/second

Before diving into advanced libraries, let's address a critical optimization: file format.

Why CSV is Vastly Superior to Excel for Large Data

Excel files (.xlsx, .xls) are complex, compressed XML structures that require significant processing overhead. Converting Excel files to CSV before processing can yield dramatic performance improvements:

import pandas as pd
import time

# Reading Excel file - SLOW
start = time.time()
df_excel = pd.read_excel('large_data.xlsx', engine='openpyxl')
print(f"Excel read time: {time.time() - start:.2f} seconds")
# Output: Excel read time: 45.23 seconds

# Reading CSV file - FAST
start = time.time()
df_csv = pd.read_csv('large_data.csv')
print(f"CSV read time: {time.time() - start:.2f} seconds")
# Output: CSV read time: 2.14 seconds

Performance Comparison: Excel vs CSV

Operation        | Excel (.xlsx)       | CSV              | Speed Improvement
Read 1M rows     | 45 seconds          | 2 seconds        | 22x faster
Memory Usage     | 2.5 GB              | 800 MB           | 3x less
Parse Overhead   | High (XML parsing)  | Minimal          | -
Compression      | Built-in but slow   | None (use gzip)  | -

Converting Excel to CSV for Better Performance

def optimize_excel_import(excel_file):
    """Convert Excel to CSV first for 20x faster processing"""
    
    # One-time conversion cost
    df = pd.read_excel(excel_file, engine='openpyxl')
    csv_file = excel_file.replace('.xlsx', '.csv')
    df.to_csv(csv_file, index=False)
    
    # Now use the CSV for all subsequent operations
    # This is 20x faster for repeated reads
    return csv_file

# Even better: Convert Excel to CSV using command line (faster)
# $ in2csv large_data.xlsx > large_data.csv

Why Excel is Slow:

  1. XML Structure: Excel files are zipped XML documents requiring decompression and parsing
  2. Cell Formatting: Stores formatting, formulas, styles for every cell
  3. Multiple Sheets: Must parse entire workbook structure even for one sheet
  4. Data Types: Complex type inference for each cell
  5. Memory Overhead: Loads entire workbook structure into memory

CSV Advantages:

  1. Plain Text: No parsing overhead, direct byte streaming
  2. Sequential Reading: Can process line by line
  3. Predictable Structure: Fixed delimiters, no hidden complexity
  4. Streaming Support: Can process without loading entire file
  5. Compression Friendly: Works well with gzip/bzip2
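
The streaming point is worth seeing in isolation: even the standard library can filter a CSV line by line in constant memory. A minimal sketch; the column name and file name are assumed to match the examples in this article.

import csv

relevant_ids = {'G1', 'G2'}  # placeholder for the 50 relevant generator IDs

def stream_relevant_rows(csv_path):
    """Yield only the rows we care about, one at a time."""
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            if row['generator_id'] in relevant_ids:
                yield row

# Constant memory, no matter how large the file is
for row in stream_relevant_rows('large_data.csv'):
    pass  # batch the row for insertion here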

Using Compressed CSV for Best of Both Worlds

# Write compressed CSV (80% smaller than original)
df.to_csv('data.csv.gz', compression='gzip', index=False)

# Read compressed CSV (still 10x faster than Excel)
df = pd.read_csv('data.csv.gz', compression='gzip')

# For maximum speed with compression
import polars as pl
df = pl.read_csv('data.csv.gz')  # Handles compression automatically

Resource: CSV vs Excel Performance Study


Version 6: Advanced Libraries

Time: 5 minutes | Speed: ~330,000 rows/second

Now let's explore faster alternatives to Pandas:

import polars as pl  # Rust-based, columnar dataframe library

def import_with_polars(csv_file, connection, relevant_generators):
    """Use Polars for 5-10x faster CSV reading"""
    
    # Polars reads CSV much faster than Pandas
    df = pl.read_csv(
        csv_file,
        has_header=True,
        n_threads=4  # Multi-threaded CSV parsing!
    )
    
    # Filter using Polars (columnar operations are faster)
    filtered_df = df.filter(
        pl.col('generator_id').is_in(relevant_generators)
    )
    
    # Convert to pandas for database insertion
    pandas_df = filtered_df.to_pandas()
    return pandas_df

Why Polars is Faster:

  1. Written in Rust: Compiled language, no Python overhead
  2. Columnar storage: Data stored by column, not row
  3. Lazy evaluation: Operations are optimized before execution
  4. Multi-threaded: Uses all CPU cores for reading

Memory Layout Comparison:

Pandas (Row-based):
Row 1: [ID: 1, Time: "2024-01-01", Power: 100]
Row 2: [ID: 2, Time: "2024-01-01", Power: 150]
↑ Must read entire row even for one column

Polars (Column-based):
ID:    [1, 2, 3, 4, ...]
Time:  ["2024-01-01", "2024-01-01", ...]
Power: [100, 150, 120, ...]
↑ Can read just the columns you need
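
The columnar layout pays off most together with Polars' lazy API, which pushes column selection and filtering down into the CSV scan itself. This is a sketch; relevant_generators stands in for the 50 IDs from earlier, and the column names follow the rest of the article.

import polars as pl

relevant_generators = [101, 102, 103]  # placeholder for the 50 relevant IDs

lazy_frame = (
    pl.scan_csv('large_data.csv')  # nothing is read yet
      .select(['timestamp', 'generator_id', 'power_output'])  # only these columns
      .filter(pl.col('generator_id').is_in(relevant_generators))
)

df = lazy_frame.collect()  # projection and predicate pushdown happen here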

PyArrow Backend:

# Using Arrow format for even better performance
df = pd.read_csv(
    csv_file,
    engine='pyarrow',  # C++ implementation, 3x faster
    dtype_backend='pyarrow'  # Efficient memory format
)

Resources:


Version 7: Database Optimizations

Time: 3 minutes | Speed: ~560,000 rows/second

The biggest bottleneck is often the database. Let's optimize:

import asyncpg  # Async PostgreSQL driver
from io import BytesIO

async def import_with_copy(csv_file, connection):
    """Use PostgreSQL COPY command - the fastest way to bulk load"""
    
    # Read and filter data
    df = pl.read_csv(csv_file)
    filtered_df = df.filter(...)
    
    # Convert to CSV bytes in memory
    buffer = BytesIO()
    filtered_df.write_csv(buffer)
    buffer.seek(0)
    
    # Use COPY - bypasses per-row SQL parsing entirely!
    await connection.copy_to_table(
        'generation_data',
        source=buffer,
        columns=['timestamp', 'generator_id', 'power_output'],
        format='csv',
        header=True  # first line of the buffer is the CSV header
    )

INSERT vs COPY Performance:

INSERT (even batched):
SQL Parse → Plan → Execute → Write → SQL Parse → Plan → Execute → Write...
~1,000 rows/second per connection

COPY:
Stream data directly → Write to disk
~50,000 rows/second per connection
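
If you are on plain psycopg2 rather than asyncpg, the same COPY path is available through copy_expert. A minimal sketch; the DataFrame is assumed to be filtered already and to match the table's column order.

from io import StringIO

def copy_dataframe(df, connection):
    """Stream a pandas DataFrame into PostgreSQL via COPY."""
    buffer = StringIO()
    df.to_csv(buffer, index=False, header=False)
    buffer.seek(0)
    
    with connection.cursor() as cursor:
        cursor.copy_expert(
            """COPY generation_data (timestamp, generator_id, power_output)
               FROM STDIN WITH (FORMAT csv)""",
            buffer
        )
    connection.commit()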

Database Tuning:

-- Disable synchronous commit for bulk loads (faster but less safe)
SET synchronous_commit = OFF;

-- Increase memory for maintenance operations (e.g. index rebuilds)
SET maintenance_work_mem = '2GB';

-- Disable triggers during import; consider dropping non-essential
-- indexes as well and recreating them once the load is finished
ALTER TABLE generation_data DISABLE TRIGGER ALL;
-- ... import data ...
ALTER TABLE generation_data ENABLE TRIGGER ALL;
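
Applied from Python, the same session settings can be wrapped around the load so triggers are always re-enabled afterwards. A sketch using psycopg2; load_fn stands for whichever import function you use.

def bulk_load_with_tuning(connection, load_fn):
    """Apply session-level tuning, run the bulk load, then restore triggers."""
    cursor = connection.cursor()
    cursor.execute("SET synchronous_commit = OFF")
    cursor.execute("SET maintenance_work_mem = '2GB'")
    cursor.execute("ALTER TABLE generation_data DISABLE TRIGGER ALL")
    try:
        load_fn(connection)  # e.g. the COPY-based import above
    finally:
        cursor.execute("ALTER TABLE generation_data ENABLE TRIGGER ALL")
        connection.commit()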

Resource: PostgreSQL COPY Performance


Version 8: The Ultimate Solution

Time: 12 minutes | Speed: ~140,000 rows/second

Combining all optimizations:

import asyncio
from concurrent.futures import ProcessPoolExecutor, as_completed
from io import BytesIO

import asyncpg
import polars as pl
import psutil
from tqdm import tqdm

class OptimizedImporter:
    def __init__(self):
        # Dynamic configuration based on system resources
        self.memory = psutil.virtual_memory().available
        self.cpu_count = psutil.cpu_count()
        self.chunk_size = self.calculate_optimal_chunk_size()
    
    def calculate_optimal_chunk_size(self):
        """Dynamically determine chunk size based on available RAM"""
        # Use 10% of available memory per chunk
        memory_per_chunk = self.memory * 0.1
        # Estimate: 200 bytes per row
        return int(memory_per_chunk / 200)
    
    async def import_file(self, file_path, relevant_ids):
        """Combines all optimizations"""
        
        # 1. Use Polars for fast reading
        df = pl.scan_csv(file_path)  # Lazy evaluation
        
        # 2. Filter at scan time (before loading into memory)
        filtered = df.filter(
            pl.col('generator_id').is_in(relevant_ids)
        ).collect(streaming=True)  # Stream processing
        
        # 3. Remove duplicates in memory (faster than database)
        unique_df = filtered.unique(
            subset=['generator_id', 'timestamp']
        )
        
        # 4. Use PostgreSQL COPY for insertion
        conn = await asyncpg.connect(DATABASE_URL)
        
        # Convert to CSV bytes in memory
        csv_buffer = BytesIO()
        unique_df.write_csv(csv_buffer)
        csv_buffer.seek(0)
        
        # 5. Direct COPY (50x faster than INSERT)
        await conn.copy_to_table(
            'generation_data',
            source=csv_buffer,
            format='csv',
            header=True  # skip the CSV header row written above
        )
        
        await conn.close()
    
    def _import_file_sync(self, file_path, relevant_ids):
        """Synchronous wrapper so the coroutine can run inside a worker process"""
        asyncio.run(self.import_file(file_path, relevant_ids))
    
    def run_parallel(self, files, relevant_ids):
        """Process all files in parallel"""
        
        with ProcessPoolExecutor(max_workers=self.cpu_count) as executor:
            futures = [
                executor.submit(self._import_file_sync, file, relevant_ids)
                for file in files
            ]
            
            # Progress bar
            for future in tqdm(as_completed(futures), total=len(files)):
                future.result()

All Optimizations Combined (cumulative estimates; measured end-to-end numbers are in the Performance Comparison below):

Optimization    | Impact | Cumulative Speed
Baseline        | -      | 3,500 rows/s
+ Chunking      | 2x     | 7,000 rows/s
+ Batch Insert  | 2x     | 14,000 rows/s
+ Filtering     | 4x     | 56,000 rows/s
+ Parallel      | 4x     | 224,000 rows/s
+ Polars        | 2x     | 448,000 rows/s
+ COPY          | 3x     | 1,344,000 rows/s
+ Memory Dedup  | 1.5x   | 2,016,000 rows/s

Additional Optimization Techniques

1. Data Type Optimization

Can reduce memory usage by 50-70%

# Specify data types explicitly to avoid inference overhead
dtype_spec = {
    'generator_id': 'int32',  # Use int32 instead of int64 if IDs < 2 billion
    'power_output': 'float32',  # float32 is often enough for measurements
    'status': 'category'  # Use category for repeated strings
}

# Pandas with explicit dtypes (3x faster reading, 50% less memory)
df = pd.read_csv('data.csv', dtype=dtype_spec)

# Polars with schema (even faster)
schema = {
    'generator_id': pl.Int32,
    'power_output': pl.Float32,
    'timestamp': pl.Datetime
}
df = pl.read_csv('data.csv', schema=schema)
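
To verify the savings on your own data, compare the in-memory footprint with and without the explicit dtypes. A small sketch using pandas' memory_usage.

import pandas as pd

dtype_spec = {'generator_id': 'int32', 'power_output': 'float32', 'status': 'category'}

df_default = pd.read_csv('data.csv')
df_typed = pd.read_csv('data.csv', dtype=dtype_spec)

def size_mb(df):
    return df.memory_usage(deep=True).sum() / 1024 / 1024

print(f"default dtypes: {size_mb(df_default):.0f} MB | explicit dtypes: {size_mb(df_typed):.0f} MB")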

2. Incremental/Delta Processing

Process only new or changed data

def incremental_import(csv_file, connection):
    """Only import data newer than the last import"""
    
    # Look up the most recent timestamp we have already imported
    cursor = connection.cursor()
    cursor.execute("SELECT MAX(timestamp) FROM generation_data")
    last_imported = cursor.fetchone()[0]
    
    # Filter during read using Polars lazy evaluation
    df = pl.scan_csv(csv_file).filter(
        pl.col('timestamp') > last_imported
    ).collect()
    
    # Result: Process only ~1% of the file on subsequent runs
    return df

3. Partitioned File Processing

Split large files for better parallelization

import os

def split_large_csv(input_file, chunk_size_mb=100):
    """Split large CSV into smaller chunks for parallel processing"""
    
    file_size = os.path.getsize(input_file) / (1024 * 1024)  # MB
    num_chunks = int(file_size / chunk_size_mb) + 1
    
    # Use GNU split for fastest splitting (l/N keeps lines intact)
    os.system(f"split -n l/{num_chunks} {input_file} chunk_")
    
    # Or use Pandas with chunks (keeps header handling in Python)
    for i, chunk in enumerate(pd.read_csv(input_file, chunksize=1000000)):
        chunk.to_csv(f'chunk_{i}.csv', index=False)

4. Memory-Mapped Files

Process files larger than RAM

import numpy as np
import pandas as pd

# Use memory mapping for huge files
df = pd.read_csv(
    'huge_file.csv',
    engine='c',
    memory_map=True,  # Memory-mapped file access
    low_memory=False
)

# Or use Dask for out-of-core processing
import dask.dataframe as dd
ddf = dd.read_csv('huge_file.csv', blocksize='100MB')
result = ddf.groupby('generator_id').mean().compute()

5. Connection Pooling

Reuse database connections

from psycopg2 import pool

# Create connection pool
connection_pool = pool.ThreadedConnectionPool(
    minconn=1,
    maxconn=20,  # Match your CPU cores
    host='localhost',
    database='energy_db'
)

def get_connection():
    return connection_pool.getconn()

def return_connection(conn):
    connection_pool.putconn(conn)
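
A small convenience on top of this is a context manager that guarantees a borrowed connection is always returned, even if a worker raises. A sketch built on the connection_pool defined above.

from contextlib import contextmanager

@contextmanager
def pooled_connection():
    """Borrow a connection from the pool and always give it back."""
    conn = connection_pool.getconn()
    try:
        yield conn
    finally:
        connection_pool.putconn(conn)

# Usage inside a worker
with pooled_connection() as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM generation_data")
        print(cursor.fetchone()[0])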

6. Binary Formats for Intermediate Storage

10x faster than CSV for repeated reads

# Save processed data in Parquet format
df.to_parquet('data.parquet', compression='snappy')

# Reading Parquet is 10x faster than CSV
df = pd.read_parquet('data.parquet')  # 0.5 seconds vs 5 seconds for CSV

# Or use Feather for even faster read/write
df.to_feather('data.feather')
df = pd.read_feather('data.feather')  # 0.2 seconds

# Or write through pyarrow directly (Arrow tables enable zero-copy interchange)
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)
pq.write_table(table, 'data.parquet')

7. Query Pushdown and Predicate Filtering

Filter data at the source

# For database sources - push filters to SQL
query = """
    SELECT * FROM raw_data 
    WHERE generator_id IN ({})
    AND timestamp > '2024-01-01'
""".format(','.join(map(str, relevant_ids)))

df = pd.read_sql(query, connection)

# For Parquet files - use predicate pushdown
df = pd.read_parquet(
    'data.parquet',
    filters=[
        ('generator_id', 'in', relevant_ids),
        ('timestamp', '>', '2024-01-01')
    ]
)

8. Deduplication Strategies

Handle duplicates efficiently

# Method 1: Hash-based deduplication (fastest)
def dedupe_with_hash(df):
    # Hash the key columns only (index=False so duplicate rows get identical hashes)
    df['hash'] = pd.util.hash_pandas_object(
        df[['generator_id', 'timestamp']], index=False
    )
    # Keep first occurrence of each hash
    return df.drop_duplicates(subset=['hash']).drop('hash', axis=1)

# Method 2: Use database UPSERT (no duplicates)
sql = """
    INSERT INTO generation_data (generator_id, timestamp, power_output)
    VALUES %s
    ON CONFLICT (generator_id, timestamp) 
    DO UPDATE SET power_output = EXCLUDED.power_output
"""

9. Network and I/O Optimization

Reduce data transfer overhead

# Compress data during network transfer
import zlib
from io import StringIO

def send_compressed_data(df):
    # Compress the CSV payload before sending
    compressed = zlib.compress(df.to_csv(index=False).encode())
    # 80% smaller = roughly 5x faster network transfer
    return compressed

def receive_compressed_data(compressed):
    # On the receiving end, decompress and parse back into a DataFrame
    decompressed = zlib.decompress(compressed).decode()
    return pd.read_csv(StringIO(decompressed))

10. Monitoring and Profiling

Identify bottlenecks

import psutil
import time

class PerformanceMonitor:
    def __init__(self):
        self.start_time = time.time()
        self.start_memory = psutil.Process().memory_info().rss / 1024 / 1024
    
    def checkpoint(self, label):
        elapsed = time.time() - self.start_time
        current_memory = psutil.Process().memory_info().rss / 1024 / 1024
        memory_used = current_memory - self.start_memory
        
        print(f"{label}:")
        print(f"  Time: {elapsed:.2f}s")
        print(f"  Memory: {memory_used:.0f} MB")
        print(f"  CPU: {psutil.cpu_percent()}%")

# Usage
monitor = PerformanceMonitor()
df = pd.read_csv('data.csv')
monitor.checkpoint("CSV Read")
filtered = df[df['generator_id'].isin(relevant_ids)]
monitor.checkpoint("Filtering")

Summary of Additional Optimizations

Technique               | Performance Gain          | Use Case
CSV vs Excel            | 20x faster                | Always convert Excel to CSV first
Data Type Optimization  | 50% less memory           | Large datasets
Incremental Processing  | 100x faster (subsequent)  | Regular updates
Binary Formats          | 10x faster I/O            | Intermediate storage
Connection Pooling      | 3x faster                 | Multiple parallel workers
Predicate Pushdown      | 5x faster                 | Filtering large datasets
Memory Mapping          | Handles any size          | Files larger than RAM
Compression             | 80% smaller               | Network transfers
Hash Deduplication      | 10x faster                | Large-scale deduplication

Performance Comparison

Real-World Results:

Version         | Time    | Speed   | Memory | CPU Usage
V1: Naive       | 8 hours | 3.5K/s  | 15 GB  | 25%
V2: Chunked     | 4 hours | 7K/s    | 2 GB   | 25%
V3: Filtered    | 1 hour  | 28K/s   | 2 GB   | 25%
V4: Parallel    | 15 min  | 110K/s  | 8 GB   | 95%
V5: CSV Format  | 6 min   | 280K/s  | 3 GB   | 95%
V6: Polars      | 5 min   | 330K/s  | 4 GB   | 95%
V7: COPY        | 3 min   | 560K/s  | 2 GB   | 95%
V8: Ultimate    | 12 min  | 140K/s  | 4 GB   | 95%

Note: V8 is slower than V7 in rows/second but processes 4 files simultaneously

Visual Timeline:

V1: ████████████████████████████████████████████████ 8 hours
V2: ████████████████████████ 4 hours
V3: ██████ 1 hour
V4: █ 15 minutes
V5: ▋ 6 minutes
V6: ▌ 5 minutes
V7: ▎ 3 minutes (single file)
V8: ▌ 12 minutes (all files parallel)

Key Lessons Learned

1. Filter Early, Filter Often

  • Reducing data volume early saves time in every subsequent step
  • 95% reduction in data meant 95% less work everywhere else

2. Batch Operations

  • Database round-trips are expensive
  • Batch size sweet spot: 1,000-10,000 rows

3. Use the Right Tool

  • Pandas is great for analysis, not for high-performance I/O
  • Polars/Arrow are purpose-built for speed
  • PostgreSQL COPY bypasses SQL entirely

4. Parallel Processing Has Limits

  • More workers ≠ always faster
  • Database connections are limited
  • Memory bandwidth becomes bottleneck
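
In practice that means capping the number of workers instead of defaulting to one per core. A tiny sketch; max_db_connections is an assumed limit taken from your PostgreSQL configuration.

import os

max_db_connections = 20  # assumed limit from your PostgreSQL setup
csv_files = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']

max_workers = min(len(csv_files), os.cpu_count() or 1, max_db_connections)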

5. Memory vs Speed Tradeoff

  • Reading entire file (Polars): Faster but uses more memory
  • Chunking (Pandas): Slower but memory-efficient
  • Choose based on your constraints

6. Profile Before Optimizing

import cProfile
profiler = cProfile.Profile()
profiler.enable()
# ... your code ...
profiler.disable()
profiler.print_stats(sort='cumulative')

Resources and Further Reading

Documentation

Performance Tools


Conclusion

Optimizing data imports is not about finding one silver bullet, but rather combining multiple techniques:

  1. Understand your data
    • How much do you really need?
  2. Measure first
    • Profile to find actual bottlenecks
  3. Choose appropriate tools
    • Polars for CSV, COPY for PostgreSQL
  4. Use all resources
    • CPU cores, memory, and I/O bandwidth
  5. Iterate and improve
    • Each optimization enables the next

The journey from 8 hours to 12 minutes shows that with the right approach, seemingly impossible performance gains are achievable. The key is understanding what happens at each step and applying the appropriate optimization.

Remember: The fastest code is the code that doesn't run. In our case, filtering out 95% of the data early meant 95% less work everywhere else - that was our biggest win.


This optimization journey reduced energy data import time by 40x, enabling daily updates that were previously impossible. The same principles apply to any large-scale data processing task.


Buy Me a Coffee Widget