Pandas vs Polars: A complete comparison of syntax, speed and memory

by SkillAiNest


# Introduction

If you’ve worked with data in Python, you’ve almost certainly used Pandas. It has been the go-to library for data manipulation for over a decade. But recently, Polars has been gaining serious traction. Polars promises to be faster, more memory-efficient, and more intuitive than Pandas. But is it worth learning? And how different is it really?

In this article, we will compare Pandas and Polars. You’ll see performance benchmarks and learn the syntax differences. By the end, you’ll be able to make an informed decision for your next data project.

You can find the code on GitHub.

# Getting started

Let’s install both libraries first:

pip install pandas polars

Note: This article uses Pandas 2.2.2 and Polars 1.31.0.

For this comparison, we also need a dataset that is large enough to show real performance differences. We will use Faker to generate test data (install it with pip install faker):

Now we are ready to start coding.

# Measuring speed by reading large CSV files

Let’s start with one of the most common tasks: reading a CSV file. We will create a dataset with 1 million rows to see the real performance difference.

First, let’s create our sample data:

import pandas as pd
from faker import Faker
import random

# Generate a large CSV file for testing
fake = Faker()
Faker.seed(42)
random.seed(42)

data = {
    'user_id': range(1000000),
    'name': [fake.name() for _ in range(1000000)],
    'email': [fake.email() for _ in range(1000000)],
    'age': [random.randint(18, 80) for _ in range(1000000)],
    'salary': [random.randint(30000, 150000) for _ in range(1000000)],
    'department': [random.choice(['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'])
                   for _ in range(1000000)]
}

df_temp = pd.DataFrame(data)
df_temp.to_csv('large_dataset.csv', index=False)
print("✓ Generated large_dataset.csv with 1M rows")

This code creates a CSV file with realistic data. Now compare the reading speed:

import pandas as pd
import polars as pl
import time

# pandas: Read CSV
start = time.time()
df_pandas = pd.read_csv('large_dataset.csv')
pandas_time = time.time() - start

# Polars: Read CSV
start = time.time()
df_polars = pl.read_csv('large_dataset.csv')
polars_time = time.time() - start

print(f"Pandas read time: {pandas_time:.2f} seconds")
print(f"Polars read time: {polars_time:.2f} seconds")
print(f"Polars is {pandas_time/polars_time:.1f}x faster")

Output when reading the sample CSV:

Pandas read time: 1.92 seconds
Polars read time: 0.23 seconds
Polars is 8.2x faster

Here’s what’s going on: we time how long each library takes to read the same CSV file and then compute the speedup factor. Pandas uses its traditional single-threaded CSV reader, while Polars automatically parallelizes reading across multiple CPU cores.

On most machines, you’ll find that Polars is 2-5x faster at reading CSVs. The difference becomes even more pronounced with larger files.
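A single time.time() measurement like the one above can be noisy, especially for I/O. If you want more trustworthy numbers, repeat each read a few times and keep the best result. A minimal sketch (the benchmark helper is my own, not part of either library):

```python
import statistics
import time

def benchmark(fn, repeats=3):
    """Run fn several times; return (best, mean) wall-clock seconds.

    Taking the minimum of several runs filters out noise from disk
    caching and background processes, so it is a fairer basis for
    comparison than a single measurement.
    """
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times), statistics.mean(times)

# For the article's comparison you would pass e.g.
#   benchmark(lambda: pl.read_csv('large_dataset.csv'))
# Here we time a trivial stand-in so the sketch is self-contained:
best, mean = benchmark(lambda: sum(range(1_000_000)))
print(f"best: {best:.4f}s  mean: {mean:.4f}s")
```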

# Measuring memory usage during operations

Speed is not the only consideration. Let’s see how much memory each library uses. We will perform a series of operations and measure the memory consumption. Run pip install psutil if you don’t already have it in your environment:

import pandas as pd
import polars as pl
import psutil
import os
import gc # Import garbage collector for better memory release attempts

def get_memory_usage():
    """Get current process memory usage in MB"""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

# --- Test with Pandas ---
gc.collect()
initial_memory_pandas = get_memory_usage()

df_pandas = pd.read_csv('large_dataset.csv')
filtered_pandas = df_pandas[df_pandas['age'] > 30]
grouped_pandas = filtered_pandas.groupby('department')['salary'].mean()

pandas_memory = get_memory_usage() - initial_memory_pandas
print(f"Pandas memory delta: {pandas_memory:.1f} MB")

del df_pandas, filtered_pandas, grouped_pandas
gc.collect()

# --- Test with Polars (eager mode) ---
gc.collect()
initial_memory_polars = get_memory_usage()

df_polars = pl.read_csv('large_dataset.csv')
filtered_polars = df_polars.filter(pl.col('age') > 30)
grouped_polars = filtered_polars.group_by('department').agg(pl.col('salary').mean())

polars_memory = get_memory_usage() - initial_memory_polars
print(f"Polars memory delta: {polars_memory:.1f} MB")

del df_polars, filtered_polars, grouped_polars
gc.collect()

# --- Summary ---
if pandas_memory > 0 and polars_memory > 0:
  print(f"Memory savings (Polars vs Pandas): {(1 - polars_memory/pandas_memory) * 100:.1f}%")
elif pandas_memory == 0 and polars_memory > 0:
  print(f"Polars used {polars_memory:.1f} MB while Pandas used 0 MB.")
elif polars_memory == 0 and pandas_memory > 0:
  print(f"Polars used 0 MB while Pandas used {pandas_memory:.1f} MB.")
else:
  print("Cannot compute memory savings due to zero or negative memory usage delta in both frameworks.")

This code measures the memory footprint:

  1. We use the psutil library to track memory usage before and after the operations.
  2. Both libraries read the same file and perform filtering and grouping.
  3. We calculate the difference in memory consumption.

Sample output:

Pandas memory delta: 44.4 MB
Polars memory delta: 1.3 MB
Memory savings (Polars vs Pandas): 97.1%

The above results show the memory-usage delta for both Pandas and Polars when performing the filtering and aggregation operations on large_dataset.csv.

  • Pandas memory delta: the memory used by Pandas for the operations.
  • Polars memory delta: the memory used by Polars for the same operations.
  • Memory savings (Polars vs Pandas): how much less memory Polars used compared to Pandas, as a percentage.

Thanks to its columnar data layout and optimized execution engine, Polars routinely uses less memory. Typically, you will see a 30% to 70% improvement with Polars.

Note: However, sequential memory measurements within the same Python process via psutil.Process(...).memory_info().rss can be misleading. Python’s memory allocator does not always return memory to the operating system immediately, so the ‘clean’ baseline for a later test may still be affected by earlier operations. For the most accurate comparison, run each test in a separate, isolated Python process.
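One way to follow that advice without leaving your script is to spawn a fresh interpreter for each measurement. A rough standard-library-only sketch (tracemalloc tracks Python-level allocations; you could substitute the psutil RSS reading from above for OS-level numbers):

```python
import subprocess
import sys
import textwrap

# Each measurement runs in a brand-new interpreter, so allocations from
# earlier tests cannot pollute the baseline.
child_code = textwrap.dedent("""
    import tracemalloc
    tracemalloc.start()
    data = list(range(1_000_000))      # stand-in for loading a DataFrame
    current, peak = tracemalloc.get_traced_memory()
    print(peak // (1024 * 1024))       # peak usage in MB
""")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True, check=True,
)
print(f"Child peak memory: {result.stdout.strip()} MB")
```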

# Comparing syntax for basic functions

Now let’s see how the syntax differs between the two libraries. We’ll cover the most common operations you’ll use.

// Selecting columns

Let’s select a subset of columns. We’ll create a very small dataframe for this (and subsequent examples).

import pandas as pd
import polars as pl

# Create sample data
data = {
    'name': ['Anna', 'Betty', 'Cathy'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
}

# Pandas approach
df_pandas = pd.DataFrame(data)
result_pandas = df_pandas[['name', 'salary']]

# Polars approach
df_polars = pl.DataFrame(data)
result_polars = df_polars.select(['name', 'salary'])
# Alternative: more expressive
result_polars_alt = df_polars.select([pl.col('name'), pl.col('salary')])

print("Pandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)

Here are the main differences:

  • Pandas uses bracket notation: df[['col1', 'col2']]
  • Polars uses the .select() method
  • Polars also favors the more expressive pl.col() syntax, which becomes powerful for complex operations.

Output:

Pandas result:
    name  salary
0   Anna   50000
1  Betty   60000
2  Cathy   70000

Polars result:
shape: (3, 2)
┌───────┬────────┐
│ name  ┆ salary │
│ ---   ┆ ---    │
│ str   ┆ i64    │
╞═══════╪════════╡
│ Anna  ┆ 50000  │
│ Betty ┆ 60000  │
│ Cathy ┆ 70000  │
└───────┴────────┘

Both produce the same result, but the Polars syntax is more explicit about what you’re doing.

// Filtering rows

Now let’s filter the rows:

# pandas: Filter rows where age > 28
filtered_pandas = df_pandas[df_pandas['age'] > 28]

# Alternative Pandas syntax with query
filtered_pandas_alt = df_pandas.query('age > 28')

# Polars: Filter rows where age > 28
filtered_polars = df_polars.filter(pl.col('age') > 28)

print("Pandas filtered:")
print(filtered_pandas)
print("\nPolars filtered:")
print(filtered_polars)

Check out the differences:

  • In Pandas, we use Boolean indexing with bracket notation; you can also use the .query() method.
  • Polars uses the .filter() method with pl.col() expressions.
  • The Polars syntax reads more like SQL: “filter where column age is greater than 28”.

Output:

Pandas filtered:
    name  age  salary
1  Betty   30   60000
2  Cathy   35   70000

Polars filtered:
shape: (2, 3)
┌───────┬─────┬────────┐
│ name  ┆ age ┆ salary │
│ ---   ┆ --- ┆ ---    │
│ str   ┆ i64 ┆ i64    │
╞═══════╪═════╪════════╡
│ Betty ┆ 30  ┆ 60000  │
│ Cathy ┆ 35  ┆ 70000  │
└───────┴─────┴────────┘

// Adding new columns

Now add new columns to the data frame:

# pandas: Add a new column
df_pandas['bonus'] = df_pandas['salary'] * 0.1
df_pandas['total_comp'] = df_pandas['salary'] + df_pandas['bonus']

# Polars: Add new columns
df_polars = df_polars.with_columns([
    (pl.col('salary') * 0.1).alias('bonus'),
    (pl.col('salary') * 1.1).alias('total_comp')
])

print("Pandas with new columns:")
print(df_pandas)
print("\nPolars with new columns:")
print(df_polars)

Output:

Pandas with new columns:
    name  age  salary   bonus  total_comp
0   Anna   25   50000  5000.0     55000.0
1  Betty   30   60000  6000.0     66000.0
2  Cathy   35   70000  7000.0     77000.0

Polars with new columns:
shape: (3, 5)
┌───────┬─────┬────────┬────────┬────────────┐
│ name  ┆ age ┆ salary ┆ bonus  ┆ total_comp │
│ ---   ┆ --- ┆ ---    ┆ ---    ┆ ---        │
│ str   ┆ i64 ┆ i64    ┆ f64    ┆ f64        │
╞═══════╪═════╪════════╪════════╪════════════╡
│ Anna  ┆ 25  ┆ 50000  ┆ 5000.0 ┆ 55000.0    │
│ Betty ┆ 30  ┆ 60000  ┆ 6000.0 ┆ 66000.0    │
│ Cathy ┆ 35  ┆ 70000  ┆ 7000.0 ┆ 77000.0    │
└───────┴─────┴────────┴────────┴────────────┘

Here’s what’s going on:

  • Pandas uses direct column assignment, which mutates the dataframe in place.
  • Polars uses .with_columns() and returns a new dataframe (immutable by default).
  • In Polars, you use .alias() to name the new column.

The Polars approach promotes immutability and makes data transformations more readable.
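Pandas can mimic that immutable style too: .assign() returns a new DataFrame instead of mutating in place, which is the closest Pandas analogue of with_columns. A brief sketch:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anna", "Betty", "Cathy"],
                   "salary": [50000, 60000, 70000]})

# assign() leaves df untouched and returns a new frame;
# lambdas let later columns refer to earlier ones in the same call
df2 = df.assign(
    bonus=lambda d: d["salary"] * 0.1,
    total_comp=lambda d: d["salary"] + d["bonus"],
)

print("bonus" in df.columns)       # False: the original is unchanged
print(df2["total_comp"].tolist())  # [55000.0, 66000.0, 77000.0]
```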

# Performance measurement in grouping and aggregation

Let’s look at a more practical example: grouping data and calculating multiple sums. This code shows how we group data by department, calculate multiple statistics on different columns, and time both operations to see the performance difference:

# Load our large dataset
df_pandas = pd.read_csv('large_dataset.csv')
df_polars = pl.read_csv('large_dataset.csv')

# pandas: Group by department and calculate stats
import time

start = time.time()
result_pandas = df_pandas.groupby('department').agg({
    'salary': ['mean', 'median', 'std'],
    'age': 'mean'
}).reset_index()
result_pandas.columns = ['department', 'avg_salary', 'median_salary', 'std_salary', 'avg_age']
pandas_time = time.time() - start

# Polars: Same operation
start = time.time()
result_polars = df_polars.group_by('department').agg([
    pl.col('salary').mean().alias('avg_salary'),
    pl.col('salary').median().alias('median_salary'),
    pl.col('salary').std().alias('std_salary'),
    pl.col('age').mean().alias('avg_age')
])
polars_time = time.time() - start

print(f"Pandas time: {pandas_time:.3f}s")
print(f"Polars time: {polars_time:.3f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x")
print("\nPandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)

Output:


Pandas time: 0.126s
Polars time: 0.077s
Speedup: 1.6x

Pandas result:
    department    avg_salary  median_salary    std_salary    avg_age
0  Engineering  89954.929266        89919.0  34595.585863  48.953405
1      Finance  89898.829762        89817.0  34648.373383  49.006690
2           HR  90080.629637        90177.0  34692.117761  48.979005
3    Marketing  90071.721095        90154.0  34625.095386  49.085454
4        Sales  89980.433386        90065.5  34634.974505  49.003168

Polars result:
shape: (5, 5)
┌─────────────┬──────────────┬───────────────┬──────────────┬───────────┐
│ department  ┆ avg_salary   ┆ median_salary ┆ std_salary   ┆ avg_age   │
│ ---         ┆ ---          ┆ ---           ┆ ---          ┆ ---       │
│ str         ┆ f64          ┆ f64           ┆ f64          ┆ f64       │
╞═════════════╪══════════════╪═══════════════╪══════════════╪═══════════╡
│ HR          ┆ 90080.629637 ┆ 90177.0       ┆ 34692.117761 ┆ 48.979005 │
│ Sales       ┆ 89980.433386 ┆ 90065.5       ┆ 34634.974505 ┆ 49.003168 │
│ Engineering ┆ 89954.929266 ┆ 89919.0       ┆ 34595.585863 ┆ 48.953405 │
│ Marketing   ┆ 90071.721095 ┆ 90154.0       ┆ 34625.095386 ┆ 49.085454 │
│ Finance     ┆ 89898.829762 ┆ 89817.0       ┆ 34648.373383 ┆ 49.00669  │
└─────────────┴──────────────┴───────────────┴──────────────┴───────────┘

Break down the syntax:

  • Pandas uses a dictionary to define the aggregations, which can get confusing for complex operations.
  • Polars uses method chaining: each operation is explicit and named.

The Polars syntax is more verbose but also more readable: you can see at a glance which statistics are being calculated.
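For what it’s worth, Pandas also offers “named aggregation” (available since 0.25), which avoids both the dict-of-lists and the manual column renaming above and reads much closer to the Polars version. A sketch on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "HR", "HR"],
    "salary": [50000, 60000, 55000, 65000],
})

# Each keyword becomes an output column: name=(input_column, aggfunc)
result = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    median_salary=("salary", "median"),
).reset_index()

print(result)
```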

# Understanding lazy evaluation in Polars

Lazy evaluation is one of the most helpful features of Polars. It means Polars does not process your query immediately; instead, it plans the entire operation and optimizes it before running it.

Let’s see this in action:

import polars as pl

# Read in lazy mode
df_lazy = pl.scan_csv('large_dataset.csv')

# Build a complex query
result = (
    df_lazy
    .filter(pl.col('age') > 30)
    .filter(pl.col('salary') > 50000)
    .group_by('department')
    .agg([
        pl.col('salary').mean().alias('avg_salary'),
        pl.len().alias('employee_count')
    ])
    .filter(pl.col('employee_count') > 1000)
    .sort('avg_salary', descending=True)
)

# Nothing has been executed yet!
print("Query plan created, but not executed")

# Now execute the optimized query
import time
start = time.time()
result_df = result.collect()  # This runs the query
execution_time = time.time() - start

print(f"\nExecution time: {execution_time:.3f}s")
print(result_df)

Output:

Query plan created, but not executed

Execution time: 0.177s
shape: (5, 3)
┌─────────────┬───────────────┬────────────────┐
│ department  ┆ avg_salary    ┆ employee_count │
│ ---         ┆ ---           ┆ ---            │
│ str         ┆ f64           ┆ u32            │
╞═════════════╪═══════════════╪════════════════╡
│ HR          ┆ 100101.595816 ┆ 132212         │
│ Marketing   ┆ 100054.012365 ┆ 132470         │
│ Sales       ┆ 100041.01049  ┆ 132035         │
│ Finance     ┆ 99956.527217  ┆ 132143         │
│ Engineering ┆ 99946.725458  ┆ 132384         │
└─────────────┴───────────────┴────────────────┘

Here, scan_csv() does not load the file immediately; it only records the intent to read it. We chain multiple filters, a grouping, and a sort. Polars analyzes the entire query and optimizes it; for example, it can push filters down so that rows are filtered while the file is being read.

Only when we call .collect() does the actual computation happen. The optimized query runs much faster than executing each step separately would.

# Wrapping up

As you have seen, Polars is extremely useful for data processing in Python. It’s faster, more memory-efficient, and has a cleaner API than Pandas. That said, Pandas isn’t going anywhere. It has over a decade of development, a vast ecosystem, and millions of users. For many projects, Pandas is still the right choice.

If you work on large-scale analysis, data engineering projects, and the like, learn Polars. The syntax differences aren’t huge, and the performance gains are real. But keep Pandas in your toolkit for compatibility and quick exploratory work.

Start by trying Polars on a side project or a data pipeline that is running slowly. You will quickly see whether it is right for your use case. Happy data wrangling!

Bala Priya C is a developer and technical writer from India. She loves working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
