How to work with the ORC file format in Python – a guide with examples

by SkillAiNest

If you’ve worked with big data or analytics platforms, you’ve probably heard of ORC files. But what exactly are they, and how can you work with them in Python?

In this tutorial, I’ll walk you through the basics of reading, writing, and manipulating ORC files using Python. By the end, you’ll understand when to use ORC and how to integrate it into your data pipelines.

You can find the code on GitHub.

Table of Contents

  1. What is the ORC file format?

  2. Prerequisites

  3. Reading ORC files in Python

  4. Writing ORC files with compression

  5. Working with complex data types

  6. A practical example: processing log data

  7. When should you use ORC?

What is the ORC file format?

ORC stands for Optimized Row Columnar. It is a columnar storage file format designed for Hadoop workloads. Unlike traditional row-based formats like CSV, ORC stores data in columns, which makes it incredibly efficient for analytical queries.

Here’s why ORC is popular:

  • ORC files are highly compressed, often 75% smaller than text files

  • Columnar format means you only read the columns you need

  • You can add or remove columns without rewriting the data

  • ORC includes a lightweight index for faster queries

Many organizations use ORC for their big data processing because it works well with Apache Hive, Spark, and Presto.

Prerequisites

Before we begin, make sure you have:

  • Python 3.10 or later installed

  • Basic understanding of dataframes (Pandas or similar)

  • Familiarity with file I/O operations

You will need to install these libraries:

pip install pyarrow pandas

Why PyArrow? It’s the Python implementation of Apache Arrow, which provides excellent support for columnar formats such as ORC and Parquet. It’s fast, memory efficient, and actively maintained.
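
If you want to confirm the install worked, a quick check like this (a minimal sketch, nothing more) prints the PyArrow version and verifies that the ORC module can be imported:

import pyarrow as pa
import pyarrow.orc as orc  # raises ImportError if your PyArrow build lacks ORC support

print(pa.__version__)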

Reading ORC files in Python

Let’s start by reading an ORC file. First, I’ll show you how to create a sample ORC file so we have something to work with.

Creating a sample ORC file

Here’s how we’ll create a simple employee dataset and save it as an ORC file:

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc


data = {
    'employee_id': [101, 102, 103, 104, 105],
    'name': ['Alice Johnson', 'Bob Smith', 'Carol White', 'David Brown', 'Eve Davis'],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
    'salary': [95000, 65000, 88000, 72000, 71000],
    'years_experience': [5, 3, 7, 4, 3]
}

df = pd.DataFrame(data)


table = pa.Table.from_pandas(df)
orc.write_table(table, 'employees.orc')

print("ORC file created successfully!")

Output:

ORC file created successfully!

Let me break down what’s going on here. We start with a pandas DataFrame containing employee information. Then we convert it to a PyArrow Table, which is Arrow’s in-memory, columnar representation of the data. Finally, we use orc.write_table() to write it to disk in ORC format.

The conversion to a PyArrow Table is necessary because ORC is a columnar format, and PyArrow handles the translation from row-based pandas data to column-based storage.
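
If you’re curious what schema will end up in the file, you can print it from the Table before writing (an optional check):

# The Arrow schema is what ORC stores alongside the data
print(table.schema)

This lists each column name with its Arrow type, for example employee_id as int64 and name as string.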

Reading an ORC file

Now that we have an ORC file, let’s read it back:


table = orc.read_table('employees.orc')


df_read = table.to_pandas()

print(df_read)
print(f"\nData types:\n{df_read.dtypes}")

Output:

   employee_id           name   department  salary  years_experience
0          101  Alice Johnson  Engineering   95000                 5
1          102      Bob Smith        Sales   65000                 3
2          103    Carol White  Engineering   88000                 7
3          104    David Brown           HR   72000                 4
4          105      Eve Davis        Sales   71000                 3

Data types:
employee_id          int64
name                object
department          object
salary               int64
years_experience     int64
dtype: object

The orc.read_table() function loads the entire ORC file into memory as a PyArrow Table. We then convert it back to pandas for familiar DataFrame operations.

Notice how the data types are preserved. ORC maintains schema information, so your integers stay integers and your strings stay strings.
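
You can also peek at that schema without loading any rows, using the ORCFile class (a small sketch; the exact metadata attributes available can vary between PyArrow versions):

# Inspect file metadata without reading the data itself
orc_file = orc.ORCFile('employees.orc')
print(orc_file.schema)    # stored column schema
print(orc_file.nrows)     # number of rows in the file
print(orc_file.nstripes)  # number of stripes (ORC's internal row groups)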

Reading specific columns

This is where ORC really shines. When working with large datasets, you often don’t need all the columns. ORC lets you read only what you need:


table_subset = orc.read_table('employees.orc', columns=['name', 'salary'])
df_subset = table_subset.to_pandas()

print(df_subset)

Output:

            name  salary
0  Alice Johnson   95000
1      Bob Smith   65000
2    Carol White   88000
3    David Brown   72000
4      Eve Davis   71000

This is called column pruning, and it’s a huge performance win. If your ORC file has 50 columns but you only need 3, you read just a small fraction of the data. That translates into faster load times and lower memory usage.
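
To see the effect in practice, you can compare the in-memory size of the full table with the pruned one (a rough illustration; the exact numbers depend on your data and PyArrow version):

# Compare in-memory sizes: full table vs. just the two columns we need
table_full = orc.read_table('employees.orc')
print(f"Full table:  {table_full.nbytes:,} bytes")
print(f"Two columns: {table_subset.nbytes:,} bytes")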

Writing ORC files with compression

ORC supports multiple compression codecs. Let’s explore how to use compression when writing files:


large_data = {
    'id': range(10000),
    'value': (f"data_{i}" for i in range(10000)),
    'category': ('A', 'B', 'C', 'D') * 2500
}

df_large = pd.DataFrame(large_data)
table_large = pa.Table.from_pandas(df_large)


orc.write_table(table_large, 'data_zlib.orc', compression='ZLIB')


orc.write_table(table_large, 'data_snappy.orc', compression='SNAPPY')


orc.write_table(table_large, 'data_zstd.orc', compression='ZSTD')

import os
print(f"ZLIB size: {os.path.getsize('data_zlib.orc'):,} bytes")
print(f"SNAPPY size: {os.path.getsize('data_snappy.orc'):,} bytes")
print(f"ZSTD size: {os.path.getsize('data_zstd.orc'):,} bytes")

Output:

ZLIB size: 23,342 bytes
SNAPPY size: 44,978 bytes
ZSTD size: 6,380 bytes

Different compression codecs offer different tradeoffs. ZLIB gives better compression but is slower. SNAPPY is fast but produces larger files. ZSTD offers a good balance between compression ratio and speed.

For most use cases, I recommend ZSTD. It is fast enough for real-time processing and provides excellent compression.
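
Compression speed matters as much as file size. A quick (and admittedly unscientific) way to compare codecs on your own data is to time the writes, something like this:

# Rough timing comparison of the three codecs; results depend on your machine and data
import time

for codec in ['ZLIB', 'SNAPPY', 'ZSTD']:
    start = time.perf_counter()
    orc.write_table(table_large, f'data_{codec.lower()}.orc', compression=codec)
    print(f"{codec}: {time.perf_counter() - start:.3f}s")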

Working with complex data types

ORC handles nested data structures natively. Here’s how to work with lists and nested data:


complex_data = {
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'purchases': [
        ['laptop', 'mouse'],
        ['keyboard'],
        ['monitor', 'cable', 'stand']
    ],
    'ratings': [
        [4.5, 5.0],
        [3.5],
        [4.0, 4.5, 5.0]
    ]
}

df_complex = pd.DataFrame(complex_data)
table_complex = pa.Table.from_pandas(df_complex)
orc.write_table(table_complex, 'complex_data.orc')


table_read = orc.read_table('complex_data.orc')
df_read = table_read.to_pandas()

print(df_read)
print(f"\nType of 'purchases' column: {type(df_read('purchases')(0))}")

Output:

   user_id   name                purchases          ratings
0        1  Alice          [laptop, mouse]       [4.5, 5.0]
1        2    Bob               [keyboard]            [3.5]
2        3  Carol  [monitor, cable, stand]  [4.0, 4.5, 5.0]

Type of 'purchases' column: <class 'numpy.ndarray'>

ORC preserves the list structure, which is incredibly useful for storing JSON-like data or aggregated information. Each cell can contain a list, and ORC handles variable-length collections efficiently.
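
Once the data is back in pandas, you can flatten those list columns with explode() if you need one row per item (a small sketch; multi-column explode needs pandas 1.3+ and matching list lengths per row, which we have here):

# One row per purchase/rating pair instead of one row per user
df_flat = df_read.explode(['purchases', 'ratings'])
print(df_flat)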

A practical example: processing log data

Let’s put this together with a practical example. Imagine you are processing web server logs:

from datetime import datetime, timedelta
import random


log_data = []
start_date = datetime(2025, 1, 1)

for i in range(1000):
    log_data.append({
        'timestamp': start_date + timedelta(minutes=i),
        'user_id': random.randint(1000, 9999),
        'endpoint': random.choice(['/api/users', '/api/products', '/api/orders']),
        'status_code': random.choice([200, 200, 200, 404, 500]),
        'response_time_ms': random.randint(50, 2000)
    })

df_logs = pd.DataFrame(log_data)


table_logs = pa.Table.from_pandas(df_logs)
orc.write_table(table_logs, 'server_logs.orc', compression='ZSTD')


table_subset = orc.read_table('server_logs.orc')
df_subset = table_subset.to_pandas()


errors = df_subset[df_subset['status_code'] >= 400]
print(f"Total errors: {len(errors)}")
print(f"\nError breakdown:\n{errors['status_code'].value_counts()}")
print(f"\nSlowest error response: {errors['response_time_ms'].max()}ms")

Output:

Total errors: 387

Error breakdown:
status_code
404    211
500    176
Name: count, dtype: int64

Slowest error response: 1994ms

This example shows why ORC is such a good fit for log storage. You can write logs continuously, compress them efficiently, and query them quickly. The columnar format means you can filter by status code without reading the endpoint or response time data.
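
For example, if all you care about is the error rate, you can skip the other columns entirely when reading the file (the same column pruning trick from earlier):

# Read only the status_code column and compute the error rate from it
status_only = orc.read_table('server_logs.orc', columns=['status_code']).to_pandas()
error_rate = (status_only['status_code'] >= 400).mean()
print(f"Error rate: {error_rate:.1%}")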

When should you use ORC?

Use ORC when you:

  • Work with big data platforms (Hadoop, Spark, Hive).

  • Need efficient storage for analytics workloads

  • Have wide tables where you frequently query specific columns

  • Want built-in compression and indexing

Don’t use ORC when you:

  • Need row-by-row processing – use Avro instead

  • Work with small datasets – CSV is more convenient in such cases

  • Need human-readable files – use JSON instead

  • Don’t have big data infrastructure

Conclusion

ORC is a powerful format for data engineering and analytics. With PyArrow, working with ORC in Python is both straightforward and performant.

You learned how to read and write ORC files, use compression, handle complex data types, and apply these concepts to real-world scenarios. Columnar storage and compression make ORC an excellent choice for big data pipelines.

Try integrating ORC into your next data project. You’ll see significant improvements in storage costs and query performance.

Happy coding!
