

# Introduction
If you’re just starting your data science journey, you may think you need tools like Python, R, or other software to run statistical analysis on data. However, the command line is already a powerful statistical toolkit.
Command-line tools can often process large datasets faster than loading them into memory-heavy applications. They are easy to script and automate. Additionally, these tools work on any Unix system without installing anything extra.
In this article, you’ll learn how to perform essential statistical operations directly from your terminal using only the built-in Unix tools.
🔗 Here is the full Bash script on GitHub. It is highly recommended for following along with the code as well as the concepts.
To follow this tutorial, you will need:
- A Unix-like environment (Linux, macOS, or Windows with WSL)
- The standard Unix tools, which come pre-installed
Open your terminal to get started.
# Creating sample data
Before we can analyze the data, we need a dataset. Create a simple CSV file representing daily website traffic by running the following command in your terminal:
cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF
This creates a new file called traffic.csv with a header row and ten rows of sample data.
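For a quick, aligned preview of what you just wrote, the column tool (available on most Linux distributions and macOS, though not one of the strictly standard utilities the rest of this tutorial sticks to) can pretty-print the CSV:
column -t -s',' traffic.csv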
# Exploring your data
// Count the rows in your dataset
One of the first things to check about a dataset is how many records it contains. The wc (word count) command with the -l flag counts the number of lines in the file:
wc -l traffic.csv
Output: 11 traffic.csv (11 lines total, minus 1 header line = 10 data rows).
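If you would rather skip the mental subtraction, a small variation counts only the data rows by stripping the header first, the same tail -n +2 pattern reused throughout this tutorial:
tail -n +2 traffic.csv | wc -l
This prints 10, the number of data rows.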
// View your data
Before diving into the calculations, it is helpful to verify the data structure. The head command displays the first few lines of the file:
head -n 5 traffic.csv
This displays the first 5 lines, allowing you to preview the data:
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
// Extract a column
To work with specific columns in a CSV file, use the cut command with a delimiter and a field number. The following command extracts the visitors column:
cut -d',' -f2 traffic.csv | tail -n +2
This extracts field 2 (the visitors column) with cut, while tail -n +2 skips the header row.
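Since most of the later pipelines reuse this exact extraction, you can optionally save the column to a plain text file once and read from that instead. The visitors.txt filename below is just an example; the later commands in this tutorial do not depend on it:
cut -d',' -f2 traffic.csv | tail -n +2 > visitors.txt
sort -n visitors.txt | head -n 3
The second command, for instance, lists the three lowest daily visitor counts.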
# Calculating measures of central tendency
// Finding the mean (average).
The mean is the sum of all values divided by the number of values. We can calculate it by extracting the target column, then using awk to accumulate the values:
cut -d',' -f2 traffic.csv | tail -n +2 | awk '{sum+=$1; count++} END {print "Mean:", sum/count}'
The awk command accumulates the sum and the count as it processes each line, then divides the two in the END block.
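As a quick cross-check, you can compute the same mean without awk by summing with paste and dividing with bc. Note that bc is present on most Unix systems but may be a separate package on minimal installations:
SUM=$(cut -d',' -f2 traffic.csv | tail -n +2 | paste -sd+ - | bc)
N=$(tail -n +2 traffic.csv | wc -l)
echo "scale=2; $SUM / $N" | bc
Here paste -sd+ joins the values into one long addition expression and bc evaluates it; the result should match the awk output above.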
Next, we calculate the median and mode.
// Finding the median
The median is the middle value when the dataset is sorted. For an even number of values, it is the average of the two middle values. First sort the data, then pick out the middle:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'
This sorts the data numerically with sort -n, stores the values in an array, then finds the middle value (or the average of the two middle values when the count is even).
// Finding the mode
The mode is the most frequent value. We find this by sorting, counting occurrences, and identifying which value appears most often:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'
This sorts the values, counts duplicates with uniq -c, sorts by frequency in reverse order, and picks the top result.
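One caveat: in this particular sample dataset every visitor count appears exactly once, so the mode is not very informative. Dropping head -n 1 prints the full frequency table, which lets you check for ties before trusting a single answer:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn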
# Calculating measures of dispersion (or spread)
// Finding the maximum value
To find the largest value in our dataset, we compare each value and track the maximum.
awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv
This skips the header row with NR>1, compares each value to the current max, and updates it whenever a larger value is found.
// Finding the minimum value
Similarly, to find the smallest value, initialize the minimum from the first data row and update it when smaller values are found:
awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv
Run the above commands to retrieve the maximum and minimum values.
// Finding both min and max
Instead of running two separate commands, we can find both the minimum and maximum in a single pass:
awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv
This single-pass approach initializes both variables from the first data row, then updates each one independently.
// Calculate the (population) standard deviation
The standard deviation measures how spread out the values are. For the complete population, use this formula:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv
This accumulates the sum and the sum of squares, then applies the formula \(\sqrt{\frac{\sum x^2}{n} - \mu^2}\).
// Calculating the sample standard deviation
When working with a sample rather than the entire population, apply Bessel's correction (dividing by \(n-1\)) for an unbiased estimate:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv
// Calculating the variance
The variance is the square of the standard deviation. This is another measure of dispersion useful in many statistical calculations:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv
This calculation mirrors the standard deviation but omits the square root.
# Calculating percentiles
// Calculating Quartiles
Quartiles divide sorted data into four equal parts. They are particularly useful for understanding the distribution of the data:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
q1_pos = (count+1)/4
q2_pos = (count+1)/2
q3_pos = 3*(count+1)/4
print "Q1 (25th percentile):", arr[int(q1_pos)]
print "Q2 (Median):", (count%2==1) ? arr[int(q2_pos)] : (arr[count/2]+arr[count/2+1])/2
print "Q3 (75th percentile):", arr[int(q3_pos)]
}'
This script stores the sorted values in an array, calculates the quartile positions using the \((n+1)/4\) rule, and extracts the values at those positions. Output:
Q1 (25th percentile): 1100
Q2 (Median): 1355
Q3 (75th percentile): 1520
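If you also want the interquartile range (IQR), the spread of the middle 50% of the values, one extra print inside the END block of the quartile script above is enough, reusing its q1_pos and q3_pos variables:
print "IQR:", arr[int(q3_pos)] - arr[int(q1_pos)]
For this dataset that works out to 1520 - 1100 = 420.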
// Calculating any percentile
You can calculate any percentile by adjusting the position calculation. The following flexible approach uses linear interpolation:
PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
pos = (count+1) * p/100
idx = int(pos)
frac = pos - idx
if(idx >= count) print p "th percentile:", arr[count]
else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'
This computes the position as \((n+1) \times (p/100)\), then uses linear interpolation between array indices for fractional positions.
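If you compute percentiles often, a small shell function makes the pipeline reusable. This is a minimal sketch; the percentile function name and the extra guard for very low percentiles are additions of this example rather than part of the original script:
percentile() {
  # usage: percentile <p> <csv-file> <field-number>
  cut -d',' -f"$3" "$2" | tail -n +2 | sort -n | awk -v p="$1" '
    {arr[NR]=$1; count=NR}
    END {
      pos = (count+1) * p/100
      idx = int(pos)
      frac = pos - idx
      if (idx >= count) print p "th percentile:", arr[count]
      else if (idx < 1) print p "th percentile:", arr[1]
      else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
    }'
}
percentile 90 traffic.csv 2
percentile 25 traffic.csv 4
The last two lines show example calls: the 90th percentile of visitors and the 25th percentile of bounce rate.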
# Working with multiple columns
Often, you will want to calculate statistics across multiple columns at once. Here’s how to calculate averages for visitors, page views, and bounce rate together:
awk -F',' '
NR>1 {
v_sum += $2
pv_sum += $3
br_sum += $4
count++
}
END {
print "Average visitors:", v_sum/count
print "Average page views:", pv_sum/count
print "Average bounce rate:", br_sum/count
}' traffic.csv
This maintains separate accumulators for each column and shares the same count across all three, yielding the following output:
Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06
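You can push the same one-pass idea further. The sketch below is an extension of the commands above (not part of the original tutorial): it loops over fields 2 through 4 and reports both the mean and the sample standard deviation for each column:
awk -F',' '
NR>1 {
  for (i = 2; i <= 4; i++) {
    sum[i] += $i
    sumsq[i] += $i * $i
  }
  count++
}
END {
  split("visitors page_views bounce_rate", name, " ")
  for (i = 2; i <= 4; i++) {
    mean = sum[i] / count
    sd = sqrt((sumsq[i] - sum[i] * sum[i] / count) / (count - 1))
    printf "%s: mean = %.2f, sample std dev = %.2f\n", name[i-1], mean, sd
  }
}' traffic.csv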
// Calculating correlation
Correlation measures the relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation):
awk -F', *' '
NR>1 {
x[NR-1] = $2
y[NR-1] = $3
sum_x += $2
sum_y += $3
count++
}
END {
if (count < 2) exit
mean_x = sum_x / count
mean_y = sum_y / count
for (i = 1; i <= count; i++) {
dx = x[i] - mean_x
dy = y[i] - mean_y
cov += dx * dy
var_x += dx * dx
var_y += dy * dy
}
sd_x = sqrt(var_x / count)
sd_y = sqrt(var_y / count)
correlation = (cov / count) / (sd_x * sd_y)
print "Correlation:", correlation
}' traffic.csv
This computes the Pearson correlation by dividing the covariance by the product of the two standard deviations.
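Since awk lets you pass variables with -v, a lightly modified version (shown here as a sketch, with the column numbers passed as parameters rather than hard-coded) can correlate any pair of columns, for example visitors against bounce_rate:
awk -F',' -v xcol=2 -v ycol=4 '
NR>1 {
  x[NR-1] = $xcol
  y[NR-1] = $ycol
  sum_x += $xcol
  sum_y += $ycol
  count++
}
END {
  if (count < 2) exit
  mean_x = sum_x / count
  mean_y = sum_y / count
  for (i = 1; i <= count; i++) {
    dx = x[i] - mean_x
    dy = y[i] - mean_y
    cov += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  }
  print "Correlation:", (cov / count) / (sqrt(var_x / count) * sqrt(var_y / count))
}' traffic.csv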
# Conclusion
The command line is a powerful tool for data analysis. You can process large volumes of data, calculate meaningful statistics, and automate reports, all without installing anything beyond what is already on your system.
These skills complement your Python and R knowledge rather than replacing them. Use command-line tools for quick exploration and data validation, then transition to specialized tools for complex modeling and visualization when needed.
Best of all, these tools are available on virtually every system you’ll use in your data science career. Open your terminal and start exploring your data.
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.