If you’ve worked with big data or analytics platforms, you’ve probably heard of ORC files. But what exactly are they, and how can you work with them in Python?
In this tutorial, I walk you through the basics of reading, writing, and manipulating ORC files using Python. By the end, you’ll understand when to use ORC and how to integrate it into your data pipelines.
You can find the code on GitHub.
What is the ORC file format?
ORC stands for Optimized Row Columnar. It is a columnar storage file format designed for Hadoop workloads. Unlike traditional row-based formats like CSV, ORC stores data in columns, which makes it incredibly efficient for analytical queries.
Here’s why ORC is popular:
ORC files are highly compressed, often 75% smaller than text files
Columnar format means you only read the columns you need
You can add or remove columns without rewriting the data
ORC includes a lightweight index for faster queries
Most organizations use ORC for their big data processing because it works well with Apache Hive, Spark, and Presto.
Prerequisites
Before we begin, make sure you have:
Python 3.10 or later installed
Basic understanding of dataframes (Pandas or similar)
Familiarity with file I/O operations
You will need to install these libraries:
pip install pyarrow pandas
So why do we need PyArrow? PyArrow is the Python implementation of Apache Arrow, which provides excellent support for columnar formats such as ORC and Parquet. It is fast, memory efficient, and actively maintained.
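To confirm the installation worked, a quick sanity check like the one below should print a version number (yours will differ) without raising an error:
import pyarrow
import pyarrow.orc  # raises ImportError if your PyArrow build lacks ORC support

print(pyarrow.__version__)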
Reading ORC files in Python
Let’s start by reading an ORC file. First, I’ll show you how to create a sample ORC file so we have something to work with.
Creating a sample ORC file
Here’s how we’ll create a simple employee dataset and save it as an ORC file:
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc
data = {
    'employee_id': [101, 102, 103, 104, 105],
    'name': ['Alice Johnson', 'Bob Smith', 'Carol White', 'David Brown', 'Eve Davis'],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
    'salary': [95000, 65000, 88000, 72000, 71000],
    'years_experience': [5, 3, 7, 4, 3]
}
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)
orc.write_table(table, 'employees.orc')
print("ORC file created successfully!")
Output:
ORC file created successfully!
Let me break down what’s going on here. We start with a pandas DataFrame containing employee information. Then we convert it to a PyArrow Table, which is PyArrow’s in-memory representation of columnar data. Finally, we use orc.write_table() to write it to disk in ORC format.
The conversion to a PyArrow Table is necessary because ORC is a columnar format, and PyArrow handles the translation from row-based pandas to column-based storage.
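If you’d rather skip the explicit Table step, recent pandas versions ship thin ORC wrappers that call PyArrow under the hood (DataFrame.to_orc needs pandas 1.5+; the file name below is just an example):
df.to_orc('employees_direct.orc')             # writes via PyArrow under the hood
df_back = pd.read_orc('employees_direct.orc')
print(df_back.head())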
Reading an ORC file
Now that we have an ORC file, let’s read it back:
table = orc.read_table('employees.orc')
df_read = table.to_pandas()
print(df_read)
print(f"\nData types:\n{df_read.dtypes}")
Output:
employee_id name department salary years_experience
0 101 Alice Johnson Engineering 95000 5
1 102 Bob Smith Sales 65000 3
2 103 Carol White Engineering 88000 7
3 104 David Brown HR 72000 4
4 105 Eve Davis Sales 71000 3
Data types:
employee_id int64
name object
department object
salary int64
years_experience int64
dtype: object
The orc.read_table() function loads the entire ORC file into memory as a PyArrow Table. We then convert it back to pandas for familiar DataFrame operations.
Notice how the data types are preserved. ORC maintains schema information, so your integers stay integers and your strings stay strings.
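You can also inspect an ORC file’s schema and layout without loading any rows. Here’s a minimal sketch using pyarrow.orc.ORCFile against the employees.orc file we just wrote:
orc_file = orc.ORCFile('employees.orc')
print(orc_file.schema)   # column names and types stored in the file
print(f"Rows: {orc_file.nrows}, stripes: {orc_file.nstripes}")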
Reading specific columns
This is where ORC really shines. When working with large datasets, you often don’t need all the columns. ORC lets you read only what you need:
table_subset = orc.read_table('employees.orc', columns=['name', 'salary'])
df_subset = table_subset.to_pandas()
print(df_subset)
Output:
name salary
0 Alice Johnson 95000
1 Bob Smith 65000
2 Carol White 88000
3 David Brown 72000
4 Eve Davis 71000
This is called column pruning, and it’s a huge performance improvement. If your ORC file has 50 columns but you only need 3, you read only a fraction of the data. This translates into faster load times and less memory usage.
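One quick way to see the effect, even on this tiny file, is to compare the in-memory footprint of the full table with the two-column read; the gap grows dramatically with wider tables:
table_full = orc.read_table('employees.orc')
table_two_cols = orc.read_table('employees.orc', columns=['name', 'salary'])
print(f"All columns: {table_full.nbytes:,} bytes in memory")
print(f"Two columns: {table_two_cols.nbytes:,} bytes in memory")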
Writing ORC files with compression
ORC supports multiple compression codecs. Let’s explore how to use compression when writing files:
large_data = {
    'id': range(10000),
    'value': [f"data_{i}" for i in range(10000)],
    'category': ['A', 'B', 'C', 'D'] * 2500
}
df_large = pd.DataFrame(large_data)
table_large = pa.Table.from_pandas(df_large)
orc.write_table(table_large, 'data_zlib.orc', compression='ZLIB')
orc.write_table(table_large, 'data_snappy.orc', compression='SNAPPY')
orc.write_table(table_large, 'data_zstd.orc', compression='ZSTD')
import os
print(f"ZLIB size: {os.path.getsize('data_zlib.orc'):,} bytes")
print(f"SNAPPY size: {os.path.getsize('data_snappy.orc'):,} bytes")
print(f"ZSTD size: {os.path.getsize('data_zstd.orc'):,} bytes")
Output:
ZLIB size: 23,342 bytes
SNAPPY size: 44,978 bytes
ZSTD size: 6,380 bytes
Different compression codecs offer different tradeoffs. ZLIB compresses well but is on the slower side. Snappy is fast but produces larger files. ZSTD offers a strong balance of compression ratio and speed, and in this example it produced the smallest file.
For most use cases, I recommend ZSTD. It is fast enough for real-time processing and provides excellent compression.
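If you want to check the speed side of the tradeoff on your own data, a rough timing sketch like this one (reusing table_large from above) will do; exact numbers depend on your hardware and data:
import os
import time

for codec in ['ZLIB', 'SNAPPY', 'ZSTD']:
    path = f'data_{codec.lower()}.orc'
    start = time.perf_counter()
    orc.write_table(table_large, path, compression=codec)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{codec}: {elapsed_ms:.1f} ms to write, {os.path.getsize(path):,} bytes")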
Working with complex data types
ORC handles nested data structures natively. Here’s how to work with lists and nested data:
complex_data = {
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'purchases': [
        ['laptop', 'mouse'],
        ['keyboard'],
        ['monitor', 'cable', 'stand']
    ],
    'ratings': [
        [4.5, 5.0],
        [3.5],
        [4.0, 4.5, 5.0]
    ]
}
df_complex = pd.DataFrame(complex_data)
table_complex = pa.Table.from_pandas(df_complex)
orc.write_table(table_complex, 'complex_data.orc')
table_read = orc.read_table('complex_data.orc')
df_read = table_read.to_pandas()
print(df_read)
print(f"\nType of 'purchases' column: {type(df_read('purchases')(0))}")
Output:
user_id name purchases ratings
0 1 Alice [laptop, mouse] [4.5, 5.0]
1 2 Bob [keyboard] [3.5]
2 3 Carol [monitor, cable, stand] [4.0, 4.5, 5.0]
Type of 'purchases' column: <class 'numpy.ndarray'>
ORC preserves the list structure, which is incredibly useful for storing JSON-like data or aggregated information. Each cell can contain a list, and ORC handles variable-length collections efficiently.
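Once the nested data is back in pandas, you can flatten it for analysis. A small sketch, assuming the df_read from above and a pandas version new enough for multi-column explode (1.3+):
# Each row expands into one row per purchase, keeping the matching rating
exploded = df_read.explode(['purchases', 'ratings'])
print(exploded[['name', 'purchases', 'ratings']])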
A practical example: processing log data
Let’s put this together with a practical example. Imagine you are processing web server logs:
from datetime import datetime, timedelta
import random
log_data = []
start_date = datetime(2025, 1, 1)
for i in range(1000):
    log_data.append({
        'timestamp': start_date + timedelta(minutes=i),
        'user_id': random.randint(1000, 9999),
        'endpoint': random.choice(['/api/users', '/api/products', '/api/orders']),
        'status_code': random.choice([200, 200, 200, 404, 500]),
        'response_time_ms': random.randint(50, 2000)
    })
df_logs = pd.DataFrame(log_data)
table_logs = pa.Table.from_pandas(df_logs)
orc.write_table(table_logs, 'server_logs.orc', compression='ZSTD')
table_subset = orc.read_table('server_logs.orc')
df_subset = table_subset.to_pandas()
errors = df_subset[df_subset['status_code'] >= 400]
print(f"Total errors: {len(errors)}")
print(f"\nError breakdown:\n{errors['status_code'].value_counts()}")
print(f"\nSlowest error response: {errors['response_time_ms'].max()}ms")
Output:
Total errors: 387
Error breakdown:
status_code
404 211
500 176
Name: count, dtype: int64
Slowest error response: 1994ms
This example shows why ORC is a good fit for log storage. You can write logs continuously, compress them efficiently, and query them quickly. The columnar format means you can filter by status code without reading the endpoint or response time data.
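To push that idea further, you can hand both the column selection and the row filter to the scan itself. Here’s a minimal sketch using PyArrow’s dataset API, which supports the ORC format in recent PyArrow versions:
import pyarrow.dataset as ds

log_dataset = ds.dataset('server_logs.orc', format='orc')
errors_table = log_dataset.to_table(
    columns=['status_code', 'response_time_ms'],  # column pruning
    filter=ds.field('status_code') >= 400         # row filtering at scan time
)
print(errors_table.to_pandas()['status_code'].value_counts())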
When should you use ORC?
Use ORC when you:
Work with big data platforms (Hadoop, Spark, Hive)
Need efficient storage for analytics workloads
Have wide tables where you frequently query specific columns
Want built-in compression and indexing
Avoid ORC when you:
Need row-by-row processing (use Avro instead)
Work with small datasets (CSV is more convenient; see the quick size comparison after this list)
Need human-readable files (use JSON)
Have no big data infrastructure
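As a rough check of the small-dataset point, here is a self-contained sketch comparing CSV and ORC sizes for the same data; the file names are just examples, and on very small tables the gap can shrink or even reverse:
import os
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

df_demo = pd.DataFrame({
    'id': range(1000),
    'label': [f"row_{i}" for i in range(1000)]
})
df_demo.to_csv('demo.csv', index=False)
orc.write_table(pa.Table.from_pandas(df_demo), 'demo.orc', compression='ZSTD')
print(f"CSV size: {os.path.getsize('demo.csv'):,} bytes")
print(f"ORC size: {os.path.getsize('demo.orc'):,} bytes")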
Conclusion
ORC is a powerful format for data engineering and analytics. With PyArrow, working with ORC in Python is both straightforward and performant.
You learned how to read and write ORC files, use compression, handle complex data types, and apply these concepts to real-world scenarios. Columnar storage and compression make ORC an excellent choice for big data pipelines.
Try integrating ORC into your next data project. You’ll see significant improvements in storage costs and query performance.
Happy coding!