Data cleaning on the command line for beginning data scientists

by SkillAiNest


# Introduction

When you think of data cleaning, you probably think of spinning up a Jupyter Notebook. But here’s something that might surprise you: you can do some powerful data cleaning right in your terminal, using tools already installed on your system.

In this article, you’ll learn how to use basic command-line utilities to clean, transform, and explore data files. No installations are required. Just your terminal and some CSV files.

Before we start cleaning data on the command line, let’s talk about why it matters:

  • Command-line tools are easy to use and fast, especially on large files.
  • They are built into Linux/macOS and available on Windows (e.g. via WSL or Git Bash).
  • They’re great for getting a first look at the data before loading and analyzing it with Python.
  • They’re easy to chain together in a script, and those scripts can be used for automation.

Now, let’s start coding!

Note: You can find all the commands in this article in a bash script on GitHub. To get the most out of this article, I encourage you to open your terminal and code along.

# Creating the sample data

Let’s create a messy CSV file to work with. It simulates real-world data problems you’re likely to run into.

cat > messy_data.csv << 'EOF'
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
   Bob Davis    ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Charlie Brown,45,70000,Sales,charlie@example.com
Dave Wilson,31,52000,Engineering,
Emma Davis,,58000,Marketing,emma@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
EOF

This dataset has several common problems: stray whitespace, missing values, and duplicate rows. It’s perfect for practice!

# 1. Exploring your data with head, tail, and wc

Before cleaning up your data, you need to understand what you’re dealing with. Let’s start with the basics.

# See the first 5 rows (including header)
head -n 5 messy_data.csv

# See the last 3 rows
tail -n 3 messy_data.csv

# Count total rows (including header)
wc -l messy_data.csv

Here’s what’s happening:

  • head -n 5 shows the first 5 lines, giving you a quick preview.
  • tail -n 3 displays the last 3 lines (useful for checking whether the data is complete).
  • wc -l counts lines; subtract 1 for the header to get your record count.

Output:

name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
   Bob Davis    ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
11 messy_data.csv

# 2. Viewing specific columns with cut

You don’t always need to see all columns. Let’s just extract the names and departments.

cut -d',' -f1,4 messy_data.csv

Breaking it down:

  • cut is a tool for extracting parts of each line.
  • -d',' sets the delimiter to a comma (for CSV files).
  • -f1,4 selects fields (columns) 1 and 4.
  • You can also use ranges: -f1-3 for columns 1 to 3.
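One subtlety worth knowing: cut always emits fields in file order, so it cannot reorder columns. A minimal sketch on a throwaway file (demo.csv is a hypothetical name mirroring the columns above):

```shell
# A tiny demo file (hypothetical, same columns as messy_data.csv)
printf 'name,age,salary,department,email\nJohn Lee,32,50000,Engineering,john@example.com\n' > demo.csv

# A range: columns 1 through 3
cut -d',' -f1-3 demo.csv

# Caveat: cut always prints fields in file order, so -f4,1 still
# outputs column 1 before column 4 -- it cannot reorder columns
cut -d',' -f4,1 demo.csv
```

If you need to reorder columns, reach for awk instead of cut.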

The output is:

name,department
John Lee,Engineering
Jane Smith,Marketing
   Bob Davis    ,Engineering
Alice Williams,Marketing
Charlie Brown,Sales
Dave Wilson,Engineering
Emma Davis,Marketing
Frank Miller,Sales
John Lee,Engineering
Grace Lee,Engineering

# 3. Removing duplicate rows with sort and uniq

Notice that “John Lee” appears twice in our dataset. Let’s fix that.

# Save the header first
head -n 1 messy_data.csv > cleaned_data.csv

# Remove duplicates from the data (excluding header)
tail -n +2 messy_data.csv | sort | uniq >> cleaned_data.csv

Here’s what each command does: head -n 1 keeps only the header row. tail -n +2 outputs everything from line 2 onward (skipping the header). Then sort orders the lines; note that uniq only removes adjacent duplicate lines, which is why the data must be sorted first. Finally, >> appends to the file (versus >, which overwrites).
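As a shortcut, sort -u combines sort and uniq into one step. A minimal sketch on a throwaway file (dedupe_demo.csv is a hypothetical name):

```shell
# A throwaway file with one duplicate row
printf 'name,dept\nJohn,Eng\nJane,Mkt\nJohn,Eng\n' > dedupe_demo.csv

# Keep the header, then deduplicate the rest with sort -u
# (equivalent to sort | uniq)
head -n 1 dedupe_demo.csv > dedupe_clean.csv
tail -n +2 dedupe_demo.csv | sort -u >> dedupe_clean.csv

wc -l < dedupe_clean.csv   # 3 lines: header + 2 unique rows
```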

# 4. Searching and filtering with grep

Now let’s do some searching and filtering. Want to find all engineers, or filter out rows with missing data? grep is handy for both tasks.

# Find all engineers
grep "Engineering" messy_data.csv

# Find rows with empty fields (two consecutive commas)
grep ",," messy_data.csv

# Exclude rows with missing data
grep -v ",," messy_data.csv > no_missing.csv

here, grep "pattern" Search for lines containing this pattern. grep -v Reverses the match (displays lines that do not match). This is a quick way to filter out incomplete records, provided that the value results in a missing value in double commas (,,).

# 5. Trimming whitespace with sed

Notice how the record for “Bob Davis” has extra spaces? Let’s clean those up.

sed 's/^[[:blank:]]*//; s/[[:blank:]]*$//' messy_data.csv > trimmed_data.csv

Let’s unpack the command: sed is a stream editor for transforming text. s/pattern/replacement/ is its substitution syntax. ^[[:blank:]]* matches spaces and tabs at the beginning of a line, and [[:blank:]]*$ matches them at the end. The semicolon separates the two operations (trim the start of the line, then trim the end).

# 6. Substituting values with sed

Sometimes you need to standardize values or fix typos. Let’s replace all occurrences of “Engineering” with “Tech”.

# Replace all "Engineering" with "Tech"
sed 's/Engineering/Tech/g' messy_data.csv

Next, let’s fill in the empty email fields (denoted by commas at the end of the line) with the default email value.

# Replace empty email fields with "no-email@example.com"
sed 's/,$/,no-email@example.com/' messy_data.csv

Run the above commands and observe the output. I have omitted the output here to avoid repetition.

Breaking it down:

  • The g flag means “global” – replace all occurrences on each line.
  • ,$ matches a comma at the end of a line (indicating an empty last field).
  • You can chain multiple substitutions with ; between them.
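The chaining mentioned above can be sketched on a throwaway file (sub_demo.csv and the "Unknown" default are hypothetical):

```shell
printf 'name,dept\nAmy,Engineering\nBob,\n' > sub_demo.csv

# Two substitutions chained with a semicolon: rename the department,
# then fill an empty trailing field with a default value
sed 's/Engineering/Tech/g; s/,$/,Unknown/' sub_demo.csv
```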

# 7. Counting and summing with awk

awk is extremely useful for field-based operations. Let’s do some basic analysis.

# Count records by department
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c

# Calculate average age (excluding header and empty values)
tail -n +2 messy_data.csv | awk -F',' '{if($2) sum+=$2; if($2) count++} END {print "Average age:", sum/count}'

In this awk command, -F',' sets the field separator to a comma, and $2 refers to the second field (age). The condition if($2) ensures that only non-empty values are processed, while sum += $2 accumulates the total. Finally, the END block runs after all lines have been read, to calculate and print the average age.
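The same pattern generalizes to a group-by: awk’s associative arrays can accumulate a sum and a count per category. A sketch on a hypothetical file (pay_demo.csv and its columns are made up for illustration):

```shell
printf 'name,salary,dept\nAmy,50000,Eng\nBob,60000,Eng\nCara,55000,Mkt\n' > pay_demo.csv

# Average salary per department: sum[] and count[] are associative
# arrays keyed by the department field ($3)
tail -n +2 pay_demo.csv | awk -F',' '
  { sum[$3] += $2; count[$3]++ }
  END { for (d in sum) print d, sum[d] / count[d] }'
```

Note that `for (d in sum)` iterates in an unspecified order; pipe through sort if you need stable output.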

Output:

      5 Engineering
      3 Marketing
      2 Sales
Average age: 33

# 8. Combining commands with pipes

When you chain these command-line tools together, you get more useful processing.

# Get unique departments, sorted alphabetically
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq

# Find engineers with salary > 55000
tail -n +2 messy_data.csv | grep "Engineering" | awk -F',' '$3 > 55000' | cut -d',' -f1,3

# Count employees per department with counts
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c | sort -rn

Here, each | (pipe) passes the output of one command as input to the next. This lets you build complex data transformations step by step. The final sort -rn in the last example sorts the counts in reverse numerical order.

The results:

Engineering
Marketing
Sales
   Bob Davis    ,60000
      5 Engineering
      3 Marketing
      2 Sales

# 9. Changing Data Formats

Sometimes you need to work with different delimiters. Here, we use a tab as the separator instead of a comma.

# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' messy_data.csv > data.tsv

# Add a new column with a fixed value
awk -F',' 'BEGIN{OFS=","} {print $0, "2024"}' messy_data.csv > data_with_year.csv

In this awk command, BEGIN{OFS=","} sets the output field separator to a comma. $0 represents the entire input line, and print $0, "2024" appends “2024” as a new column to each line of output.
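The new column doesn’t have to be a fixed value; it can be computed from existing fields. A sketch adding a hypothetical 10% bonus column (bonus_demo.csv is a made-up file):

```shell
printf 'name,salary\nAmy,50000\nBob,60000\n' > bonus_demo.csv

# Add a computed column: a 10% bonus; NR==1 handles the header row
# separately so it gets a column name instead of a number
awk -F',' 'BEGIN{OFS=","} NR==1 {print $0, "bonus"; next} {print $0, $2 * 0.1}' bonus_demo.csv
```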

# 10. A complete cleaning pipeline

Let’s put it all together into one useful command that cleans up our messy data:

# Save header
head -n 1 messy_data.csv > final_clean.csv

# Clean the data: remove duplicates, trim whitespace, exclude missing values
tail -n +2 messy_data.csv | \
  sed 's/^[[:blank:]]*//; s/[[:blank:]]*$//' | \
  grep -v ",," | \
  sort | \
  uniq >> final_clean.csv

echo "Cleaning complete! Check final_clean.csv"

This pipeline first saves the header to preserve the column names, then skips it when processing the data rows. It trims leading and trailing whitespace from each line, removes any rows containing empty fields (detected as double commas), sorts the data, and eliminates duplicate entries. Finally, it appends the cleaned data to the output file.
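As noted earlier, ",," only catches empty fields between commas; a missing final field shows up as a trailing comma. If you want the pipeline to drop those rows too, grep’s -e option accepts multiple patterns. A sketch on a throwaway file (gap_demo.csv is hypothetical):

```shell
printf 'name,email\nAmy,amy@example.com\nBob,\n' > gap_demo.csv

# -e supplies multiple patterns; with -v, lines matching either
# ",," (empty middle field) or ",$" (empty final field) are dropped
grep -v -e ',,' -e ',$' gap_demo.csv
```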

# The result

Command-line data cleaning is a powerful yet under-discussed skill for data scientists. These tools are fast and reliable. Although you’ll still use Python for complex analysis, mastering these basics will make you more efficient and give you options when Python isn’t ideal.

Best of all, these skills are transferable to data engineering, devops, and systems administration roles. Learning to manipulate data on the command line makes you a more versatile developer.

Start practicing with your own datasets, and you’ll be surprised how often you reach for these tools instead of digging around in Python notebooks. Happy data cleaning!

Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates resource overviews and coding tutorials.
