Working with billion-row datasets in Python (using VAEX)

by SkillAiNest


# Introduction

Handling large-scale datasets containing billions of rows is a major challenge in data science and analytics. Traditional tools like Pandas work well for small to medium-sized datasets that fit in system memory, but as dataset sizes grow, they become slow, consume large amounts of random access memory (RAM), and often crash with out-of-memory (OOM) errors.

This is where Vaex comes in: a high-performance Python library for out-of-core data processing. Vaex lets you explore, transform, visualize, and analyze large tabular datasets efficiently and with a small memory footprint, even on a standard laptop.

# What is Vaex?

Vaex is a lazy, out-of-core Python DataFrame library (like Pandas) designed for data larger than your RAM.

Key Features:

Vaex is designed to efficiently handle large datasets by working directly with data on disk and reading only the parts needed, avoiding loading entire files into memory.

Vaex uses lazy evaluation, meaning operations are only computed when results are actually requested. Through memory mapping, it can open columnar file formats such as HDF5, Apache Arrow, and Parquet, which store data by column instead of by row, almost instantly.

Built on a custom C/C++ backend, Vaex can compute statistics and perform operations at a rate on the order of a billion rows per second, accelerating large-scale analysis even on modest hardware.

It offers a Pandas-like application programming interface (API) that makes the transition smooth for users already familiar with Pandas, letting them leverage big data capabilities without a steep learning curve.

# Comparing Vaex and Dask

Vaex is not comparable to Dask as a whole, but it is similar to Dask DataFrames, which are built on top of Pandas DataFrames. This means Dask inherits some Pandas issues, such as the requirement that data be fully loaded into RAM for processing in some contexts. This is not the case with Vaex. Vaex does not make DataFrame copies, so it can process larger DataFrames on machines with less main memory. Both Vaex and Dask use lazy processing. The main difference is that Vaex computes an expression only when needed, whereas with Dask we need to explicitly call compute(). Note that data must be in HDF5 or Apache Arrow format to take full advantage of Vaex.
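
Below is a minimal sketch of that difference, assuming hypothetical files trips.parquet (for Dask) and trips.hdf5 (for Vaex), with dask installed separately:

import dask.dataframe as dd
import vaex

# Dask: everything stays lazy until you explicitly call .compute()
ddf = dd.read_parquet('trips.parquet')
mean_lazy = ddf['fare_amount'].mean()   # still a task graph
mean_value = mean_lazy.compute()        # computation happens here

# Vaex: an aggregation is computed as soon as the result is requested
df = vaex.open('trips.hdf5')
mean_value = df.fare_amount.mean()      # no separate compute() step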

# Why Traditional Tools Struggle

Tools like pandas load the entire dataset into RAM before processing. For datasets larger than memory, this leads to:

  • Slow performance
  • System crashes (OOM errors)
  • Limited interactivity

Vaex never loads the entire dataset into memory. Instead, it:

  • Streams data from disk
  • Uses virtual columns and lazy evaluation to defer computations
  • Materializes results only when explicitly needed

This enables the analysis of large datasets even on modest hardware.

# How Vaex Works Under the Hood

// Out-of-core execution

Vaex reads data from disk as needed using memory mapping. This allows working on data files much larger than RAM can hold.

// Lazy evaluation

Instead of executing each operation immediately, Vaex builds a computation graph. Computations are performed only when you request a result (such as when printing or plotting).
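
A minimal sketch of this behavior, assuming a hypothetical file your_huge_dataset.hdf5 with a numeric column x:

import vaex

df = vaex.open('your_huge_dataset.hdf5')

expr = df.x * 2 + 1   # an Expression object; nothing is computed yet
value = expr.mean()   # the scan over the file happens here, on request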

// Virtual columns

Virtual columns are expressions defined on a dataset that do not occupy memory until they are calculated. This saves RAM and speeds up workflows.
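
For example (a sketch assuming hypothetical columns fare_amount and trip_distance):

# Stored as an expression, not as a new array in memory
df['fare_per_mile'] = df.fare_amount / df.trip_distance

The division runs lazily whenever fare_per_mile is used, so no RAM is spent on the new column.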

# Getting Started with Vaex

// Installing VAEX

Create a clean virtual environment:

conda create -n vaex_demo python=3.9
conda activate vaex_demo

Install Vaex with pip:

pip install vaex-core vaex-hdf5 vaex-viz

Upgrade Vaex:

pip install --upgrade vaex

Install the support libraries:

pip install pandas numpy matplotlib

// Opening large datasets

Vaex supports various popular storage formats for handling large datasets. It can work directly with HDF5, Apache Arrow, and Parquet files, all of which are optimized for efficient disk access and fast analytics. Vaex can also read CSV files, but they should first be converted to one of the more efficient formats for better performance on large datasets.
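
A sketch of that one-time conversion, assuming a hypothetical your_huge_dataset.csv. With convert=True, Vaex reads the CSV in chunks and writes an HDF5 copy next to it, so the CSV never needs to fit in RAM:

import vaex

# First run converts to your_huge_dataset.csv.hdf5; later runs open the HDF5 copy directly
df = vaex.open('your_huge_dataset.csv', convert=True)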

How to open a Parquet file:

import vaex

df = vaex.open("your_huge_dataset.parquet")
print(df)

You can now inspect the dataset structure without loading it into memory.
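
For instance, you can peek at the schema and the first few rows without a full load:

print(df.column_names)  # column names only; no data is read
print(df.head(5))       # reads just the first rows from disk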

// Basic operations in Vaex

Filtering data:

filtered = df[df.sales > 1000]

This does not compute the result immediately. Instead, the filter is registered and applied only when needed.
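
Only when you ask for a concrete result does the scan happen (continuing the hypothetical sales example):

n_big = len(filtered)             # materializes the row count
mean_big = filtered.sales.mean()  # streams through the data once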

Group by and aggregate:

result = df.groupby("category", agg={"mean_sales": vaex.agg.mean("sales")})
print(result)

Vaex computes aggregates efficiently using parallel algorithms and minimal memory.

Computing statistics:

mean_price = df["price"].mean()
print(mean_price)

Vaex computes this on the fly by scanning the dataset in chunks.
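
If you need several statistics at once, Vaex's delayed evaluation can batch them into a single pass over the data. A sketch, using the same hypothetical price column:

mean_price = df["price"].mean(delay=True)  # returns a promise, not a value
max_price = df["price"].max(delay=True)
df.execute()                               # one scan computes both
print(mean_price.get(), max_price.get())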

// Demonstration with the taxi dataset

We’ll create a realistic 50-million-row taxi dataset to demonstrate Vaex’s capabilities:

import vaex
import numpy as np
import pandas as pd
import time

Set the random seed for reproducibility:

np.random.seed(42)
print("Creating 50 million row dataset...")
n = 50_000_000

Generate realistic taxi trip data:

data = {
    'passenger_count': np.random.randint(1, 7, n),
    'trip_distance': np.random.exponential(3, n),
    'fare_amount': np.random.gamma(10, 1.5, n),
    'tip_amount': np.random.gamma(2, 1, n),
    'total_amount': np.random.gamma(12, 1.8, n),
    'payment_type': np.random.choice(('credit', 'cash', 'mobile'), n),
    'pickup_hour': np.random.randint(0, 24, n),
    'pickup_day': np.random.randint(1, 8, n),
}

Create VAEX DataFrame:

df_vaex = vaex.from_dict(data)

Export in HDF5 format (efficient for Vaex):

df_vaex.export_hdf5('taxi_50M.hdf5')
print(f"Shape: {df_vaex.shape}")
print(f"Created dataset with {n:,} rows")

Output:

Shape: (50000000, 8)
Created dataset with 50,000,000 rows

Now we have a dataset of 50 million rows with 8 columns.

// Vaex vs Pandas performance

Opening large files with Vaex's memory-mapped opener:

start = time.time()
df_vaex = vaex.open('taxi_50M.hdf5')
vaex_time = time.time() - start

print(f"Vaex opened {df_vaex.shape(0):,} rows in {vaex_time:.4f} seconds")
print(f"Memory usage: ~0 MB (memory-mapped)")

Output:

Vaex opened 50,000,000 rows in 0.0199 seconds
Memory usage: ~0 MB (memory-mapped)

Pandas: load into memory (don’t try this with 50 million rows!):

# This would fail on most machines
df_pandas = pd.read_hdf('taxi_50M.hdf5')

This will likely result in a memory error! With 50 million rows and 8 columns, the in-memory footprint is several gigabytes (roughly 50M rows × 8 columns × 8 bytes ≈ 3.2 GB, before string overhead). Vaex opens files almost instantly, regardless of size, because it doesn’t load the data into memory.

Basic Aggregation: Calculate statistics on 50 million rows:

start = time.time()
stats = {
    'mean_fare': df_vaex.fare_amount.mean(),
    'mean_distance': df_vaex.trip_distance.mean(),
    'total_revenue': df_vaex.total_amount.sum(),
    'max_fare': df_vaex.fare_amount.max(),
    'min_fare': df_vaex.fare_amount.min(),
}
agg_time = time.time() - start

print(f"\nComputed 5 aggregations in {agg_time:.4f} seconds:")
print(f"  Mean fare: ${stats('mean_fare'):.2f}")
print(f"  Mean distance: {stats('mean_distance'):.2f} miles")
print(f"  Total revenue: ${stats('total_revenue'):,.2f}")
print(f"  Fare range: ${stats('min_fare'):.2f} - ${stats('max_fare'):.2f}")

Output:

Computed 5 aggregations in 0.8771 seconds:
  Mean fare: $15.00
  Mean distance: 3.00 miles
  Total revenue: $1,080,035,827.27
  Fare range: $1.25 - $55.30

Filtering operations: filter for long trips:

start = time.time()
long_trips = df_vaex[df_vaex.trip_distance > 10]
filter_time = time.time() - start

print(f"\nFiltered for trips > 10 miles in {filter_time:.4f} seconds")
print(f"  Found: {len(long_trips):,} long trips")
print(f"  Percentage: {(len(long_trips)/len(df_vaex)*100):.2f}%")

Output:

Filtered for trips > 10 miles in 0.0486 seconds
Found: 1,784,122 long trips
Percentage: 3.57%

Multiple conditions:

start = time.time()
premium_trips = df_vaex[(df_vaex.trip_distance > 5) & 
                        (df_vaex.fare_amount > 20) & 
                        (df_vaex.payment_type == 'credit')]
multi_filter_time = time.time() - start

print(f"\nMultiple condition filter in {multi_filter_time:.4f} seconds")
print(f"  Premium trips (>5mi, >$20, credit): {len(premium_trips):,}")

Output:

Multiple condition filter in 0.0582 seconds
Premium trips (>5mi, >$20, credit): 457,191

Group By Operations:

start = time.time()
by_payment = df_vaex.groupby('payment_type', agg={
    'mean_fare': vaex.agg.mean('fare_amount'),
    'mean_tip': vaex.agg.mean('tip_amount'),
    'total_trips': vaex.agg.count(),
    'total_revenue': vaex.agg.sum('total_amount')
})
groupby_time = time.time() - start

print(f"\nGroupBy operation in {groupby_time:.4f} seconds")
print(by_payment.to_pandas_df())

Output:

GroupBy operation in 5.6362 seconds
  payment_type  mean_fare  mean_tip  total_trips  total_revenue
0       credit  15.001817  2.000065     16663623   3.599456e+08
1       mobile  15.001200  1.999679     16667691   3.600165e+08
2         cash  14.999397  2.000115     16668686   3.600737e+08

A more complex group-by, aggregating by pickup hour:

start = time.time()
by_hour = df_vaex.groupby('pickup_hour', agg={
    'avg_distance': vaex.agg.mean('trip_distance'),
    'avg_fare': vaex.agg.mean('fare_amount'),
    'trip_count': vaex.agg.count()
})
complex_groupby_time = time.time() - start

print(f"\nGroupBy by hour in {complex_groupby_time:.4f} seconds")
print(by_hour.to_pandas_df().head(10))

Output:

GroupBy by hour in 1.6910 seconds
   pickup_hour  avg_distance   avg_fare  trip_count
0            0      2.998120  14.997462     2083481
1            1      3.000969  14.998814     2084650
2            2      3.003834  15.001777     2081962
3            3      3.001263  14.998196     2081715
4            4      2.998343  14.999593     2083882
5            5      2.997586  15.003988     2083421
6            6      2.999887  15.011615     2083213
7            7      3.000240  14.996892     2085156
8            8      3.002640  15.000326     2082704
9            9      2.999857  14.997857     2082284

// Advanced Vaex features

Virtual columns (computed columns) let you add columns without copying any data:

df_vaex['tip_percentage'] = (df_vaex.tip_amount / df_vaex.fare_amount) * 100
df_vaex['is_generous_tipper'] = df_vaex.tip_percentage > 20
df_vaex['rush_hour'] = (df_vaex.pickup_hour >= 7) & (df_vaex.pickup_hour <= 9) | \
                       (df_vaex.pickup_hour >= 17) & (df_vaex.pickup_hour <= 19)

These are calculated on the fly with no memory overhead:

print("Added 3 virtual columns with zero memory overhead")
generous_tippers = df_vaex[df_vaex.is_generous_tipper]
print(f"Generous tippers (>20% tip): {len(generous_tippers):,}")

rush_hour_trips = df_vaex[df_vaex.rush_hour]
print(f"Rush hour trips: {len(rush_hour_trips):,}")

Output:

Added 3 virtual columns with zero memory overhead
Generous tippers (>20% tip): 11,997,433
Rush hour trips: 12,498,848

Correlation analysis (since the synthetic columns were generated independently, we should expect a correlation near zero):

corr = df_vaex.correlation(df_vaex.trip_distance, df_vaex.fare_amount)
print(f"Correlation (distance vs fare): {corr:.4f}")

Percentage:

try:
    percentiles = df_vaex.percentile_approx('fare_amount', (25, 50, 75, 90, 95, 99))
except AttributeError:
    # Fallback for Vaex builds without percentile_approx
    percentiles = (
        df_vaex.fare_amount.quantile(0.25),
        df_vaex.fare_amount.quantile(0.50),
        df_vaex.fare_amount.quantile(0.75),
        df_vaex.fare_amount.quantile(0.90),
        df_vaex.fare_amount.quantile(0.95),
        df_vaex.fare_amount.quantile(0.99),
    )

print(f"\nFare percentiles:")
print(f"25th: ${percentiles(0):.2f}")
print(f"50th (median): ${percentiles(1):.2f}")
print(f"75th: ${percentiles(2):.2f}")
print(f"90th: ${percentiles(3):.2f}")
print(f"95th: ${percentiles(4):.2f}")
print(f"99th: ${percentiles(5):.2f}")

Standard deviation:

std_fare = df_vaex.fare_amount.std()
print(f"\nStandard deviation of fares: ${std_fare:.2f}")

Additional useful statistics:

print(f"\nAdditional statistics:")
print(f"Mean: ${df_vaex.fare_amount.mean():.2f}")
print(f"Min: ${df_vaex.fare_amount.min():.2f}")
print(f"Max: ${df_vaex.fare_amount.max():.2f}")

Output:

Correlation (distance vs fare): -0.0001

Fare percentiles:
  25th: $11.57
  50th (median): $nan
  75th: $nan
  90th: $nan
  95th: $nan
  99th: $nan

Standard deviation of fares: $4.74

Additional statistics:
  Mean: $15.00
  Min: $1.25
  Max: $55.30

// Data export

# Export filtered data
high_value_trips = df_vaex[df_vaex.total_amount > 50]

Export to different formats:

start = time.time()
high_value_trips.export_hdf5('high_value_trips.hdf5')
export_time = time.time() - start
print(f"Exported {len(high_value_trips):,} rows to HDF5 in {export_time:.4f}s")

You can also export to CSV, Parquet, etc.

high_value_trips.export_csv('high_value_trips.csv')
high_value_trips.export_parquet('high_value_trips.parquet')

Output:

Exported 13,054 rows to HDF5 in 5.4508s

// Performance summary dashboard

print("VAEX PERFORMANCE SUMMARY")
print(f"Dataset size:           {n:,} rows")
print(f"File size on disk:      ~2.4 GB")
print(f"RAM usage:              ~0 MB (memory-mapped)")
print()
print(f"Open time:              {vaex_time:.4f} seconds")
print(f"Single aggregation:     {agg_time:.4f} seconds")
print(f"Simple filter:          {filter_time:.4f} seconds")
print(f"Complex filter:         {multi_filter_time:.4f} seconds")
print(f"GroupBy operation:      {groupby_time:.4f} seconds")
print()
print(f"Throughput:             ~{n/groupby_time:,.0f} rows/second")

Output:

VAEX PERFORMANCE SUMMARY
Dataset size:           50,000,000 rows
File size on disk:      ~2.4 GB
RAM usage:              ~0 MB (memory-mapped)

Open time:              0.0199 seconds
Single aggregation:     0.8771 seconds
Simple filter:          0.0486 seconds
Complex filter:         0.0582 seconds
GroupBy operation:      5.6362 seconds

Throughput:             ~8,871,262 rows/second

# Concluding thoughts

Vaex is ideal when you’re working with datasets larger than 1 GB that don’t fit in RAM, exploring big data interactively, performing feature engineering on millions of rows, or building data preprocessing pipelines.

You should not use Vaex for datasets smaller than about 100 MB; for those, Pandas is easier to use. If you are dealing with complex joins across multiple tables, a structured query language (SQL) database may be a better fit. If you need the full Pandas API, note that Vaex has limited compatibility. For real-time streaming data, other tools are more appropriate.

Vaex fills a gap in the Python data science ecosystem: the ability to work efficiently and interactively on billion-row datasets without loading everything into memory. Its memory-mapped architecture, lazy execution model, and optimized algorithms make it a powerful tool for big data exploration, even on a laptop. Whether you’re analyzing large-scale logs, scientific surveys, or high-frequency time series, Vaex helps bridge the gap between ease of use and big data scalability.

Shito Olomide is a software engineer and technical writer passionate about leveraging modern technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shito on Twitter.
