

Photo by author
# Introduction
Hugging Face Datasets provides one of the most straightforward ways to load a dataset: a single line of code. These datasets are commonly stored in formats such as CSV, Parquet, and Arrow. Although all three are designed to hold tabular data, they work very differently under the hood. The choice of format determines how the data is stored on disk, how quickly it can be loaded, how much storage space it needs, and how faithfully data types are preserved. These differences become increasingly prominent as datasets grow larger and models more complex. In this article, we'll look at how Hugging Face Datasets works with CSV, Parquet, and Arrow, what actually makes them different on disk and in memory, and when each one makes sense to use. So, let's begin.
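To ground the discussion, here is that one-line load in practice. This is a minimal sketch using the public imdb dataset from the Hub as an example; any Hub dataset name works the same way:

```python
from datasets import load_dataset

# One line: downloads (or reads from cache) the data
# and returns an Arrow-backed Dataset object.
ds = load_dataset("imdb", split="train")
print(ds)
```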
# 1. CSV
CSV stands for Comma-Separated Values. It's just text: one record per line, with columns separated by commas (or by tabs, in the TSV variant). Almost every tool can open it, e.g. Excel, Google Sheets, Pandas, and databases. It is easy to read and edit by hand.
Example:
```
name,age,city
Kanwal,30,New York
Qasim,25,Edmonton
```

Hugging Face Datasets treats CSV as a row-oriented format, meaning it reads the data row by row. While this is acceptable for small datasets, performance degrades as the data scales. CSV also has other limitations, such as:
- No schema: since all data is stored as text, types must be inferred every time the file is loaded. This can cause errors if the data is inconsistent.
- Large size and slow I/O: text storage inflates file size, and parsing numbers from text is CPU-intensive.
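A minimal sketch of loading the example above, assuming it is saved as a local file named data.csv (a placeholder path):

```python
from datasets import load_dataset

# Column types are inferred from the text on every load; inconsistent
# rows can change the inferred type or cause errors.
ds = load_dataset("csv", data_files="data.csv", split="train")
print(ds[0])        # {'name': 'Kanwal', 'age': 30, 'city': 'New York'}
print(ds.features)  # the inferred schema, e.g. 'age' as int64
```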
# 2. Parquet
Parquet is a binary, column-oriented format. Instead of writing one row after another like CSV, Parquet groups values by column. This makes reads and queries much faster when you only need a few columns, and compression keeps file sizes and I/O low. Parquet also stores a schema, so types are preserved. It works best for batch processing and large-scale analytics rather than many small, frequent updates to the same file (it's better for batch writes than constant edits). If we take the CSV example above, Parquet puts all the names together, all the ages together, and all the cities together. This is the columnar layout, and the example would look like this:
```
Names: Kanwal, Qasim
Ages: 30, 25
Cities: New York, Edmonton
```

Parquet also stores metadata for each column: its type, min/max values, value counts, and compression information. This enables faster reads, efficient storage, and accurate type handling. Compression algorithms such as Snappy or gzip further reduce disk space. Parquet's main strengths are:
- Compression: similar values within a column compress well, so files are small and cheap to store.
- Column-wise reads: load only the columns you need, which speeds up queries.
- Rich typing: the schema is stored with the data, so there is no type guessing on every load.
- Scale: works well for millions or billions of rows.
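A minimal sketch of writing and reloading Parquet with the library (the file name data.parquet is a placeholder):

```python
from datasets import Dataset, load_dataset

# Build a small in-memory dataset and write it out as Parquet.
ds = Dataset.from_dict({
    "name": ["Kanwal", "Qasim"],
    "age": [30, 25],
    "city": ["New York", "Edmonton"],
})
ds.to_parquet("data.parquet")

# Reload it: the schema (e.g. 'age' as int64) travels with the file,
# so there is no re-inference from text.
reloaded = load_dataset("parquet", data_files="data.parquet", split="train")
print(reloaded.features)
```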
# 3. Arrow
Arrow is not quite the same kind of thing as CSV or Parquet. It is a columnar format designed to be held in memory for fast operations. In Hugging Face Datasets, every dataset is backed by an Arrow table, whether you started with a CSV, Parquet, or Arrow file. Continuing with the same example table, Arrow also stores data column by column, but in memory:
```
Names: contiguous memory block storing Kanwal, Qasim
Ages: contiguous memory block storing 30, 25
Cities: contiguous memory block storing New York, Edmonton
```

Because the data lives in contiguous blocks, operations on columns (such as filtering, mapping, or aggregating) are extremely fast. Arrow also supports memory mapping, which lets a dataset be read from disk without being fully loaded into RAM. The main advantages of this format are:
- Zero-copy reads: memory-map files without loading everything into RAM.
- Fast column access: the columnar layout enables vectorized operations.
- Rich types: handles nested data, lists, and tensors.
- Interoperability: works with Pandas, Spark, Polars, and more.
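A minimal sketch of the Arrow-backed workflow; the directory name my_dataset is a placeholder. save_to_disk writes Arrow files, and load_from_disk memory-maps them rather than copying them into RAM:

```python
from datasets import Dataset, load_from_disk

ds = Dataset.from_dict({
    "name": ["Kanwal", "Qasim"],
    "age": [30, 25],
    "city": ["New York", "Edmonton"],
})

# Save the dataset as Arrow files on disk.
ds.save_to_disk("my_dataset")

# Reload: the Arrow files are memory-mapped, so even large datasets
# open quickly without being fully loaded into RAM.
reloaded = load_from_disk("my_dataset")
print(reloaded.data)  # the underlying Arrow table
```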
# Wrap Up
Hugging Face Datasets makes it easy to switch between formats. Use CSV for quick experiments, Parquet for storing large tables, and Arrow for fast in-memory training. Knowing when to use each keeps your pipeline fast and simple, so you can spend more time on the model.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT." As a 2022 Google Generation Scholar for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.