

# Introduction
If you’re just starting your data science journey, you may think you need tools like Python, R, or other software to run statistical analysis on data. However, the command line is already a powerful statistical toolkit.
Command-line tools can often process large datasets faster than loading them into memory-heavy applications. They are easy to script and automate. Additionally, these tools work on any Unix system without installing anything extra.
In this article, you’ll learn how to perform essential statistical operations directly from your terminal using only the built-in Unix tools.
🔗 Here is the full Bash script on GitHub. It is highly recommended for following along with the code as well as the concepts.
To follow this tutorial, you will need:
- A Unix-like environment (Linux, macOS, or Windows with WSL)
- The standard Unix tools, which come pre-installed
Open your terminal to get started.
# Creating sample data
Before we can analyze the data, we need a dataset. Create a simple CSV file representing daily website traffic by running the following command in your terminal:
cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF
This creates a new file called traffic.csv with a header row and ten rows of sample data.
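For a quick, aligned preview of what you just wrote, the column tool (available on most Linux distributions and macOS, though not one of the strictly standard utilities the rest of this tutorial sticks to) can pretty-print the CSV:
column -t -s',' traffic.csv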
# Exploring your data
// Count the rows in your dataset
One of the first things to check about a dataset is how many records it contains. The wc (word count) command with the -l flag counts the number of lines in the file:
wc -l traffic.csv
Output: 11 traffic.csv (11 lines total, minus 1 header line = 10 data rows).
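If you would rather skip the mental subtraction, a small variation counts only the data rows by stripping the header first, the same tail -n +2 pattern reused throughout this tutorial:
tail -n +2 traffic.csv | wc -l
This prints 10, the number of data rows.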
// View your data
Before diving into the calculations, it is helpful to verify the data structure. The head command displays the first few lines of the file:
head -n 5 traffic.csv
This displays the first 5 lines, allowing you to preview the data:
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
// Extract a column
To work with specific columns in a CSV file, use the cut command with a delimiter and a field number. The following command extracts the visitors column:
cut -d',' -f2 traffic.csv | tail -n +2
This extracts field 2 (the visitors column) with cut, while tail -n +2 skips the header row.
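Since most of the later pipelines reuse this exact extraction, you can optionally save the column to a plain text file once and read from that instead. The visitors.txt filename below is just an example; the later commands in this tutorial do not depend on it:
cut -d',' -f2 traffic.csv | tail -n +2 > visitors.txt
sort -n visitors.txt | head -n 3
The second command, for instance, lists the three lowest daily visitor counts.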
# Calculating measures of central tendency
// Finding the mean (average).
The mean is the sum of all values divided by the number of values. We can calculate it by extracting the target column, then using awk to accumulate the values:
cut -d',' -f2 traffic.csv | tail -n +2 | awk '{sum+=$1; count++} END {print "Mean:", sum/count}'
The awk command accumulates the sum and the count as it processes each line, then divides the two in the END block.
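As a quick cross-check, you can compute the same mean without awk by summing with paste and dividing with bc. Note that bc is present on most Unix systems but may be a separate package on minimal installations:
SUM=$(cut -d',' -f2 traffic.csv | tail -n +2 | paste -sd+ - | bc)
N=$(tail -n +2 traffic.csv | wc -l)
echo "scale=2; $SUM / $N" | bc
Here paste -sd+ joins the values into one long addition expression and bc evaluates it; the result should match the awk output above.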
Next, we calculate the median and mode.
// Finding the median
The median is the middle value when the dataset is sorted. For an even number of values, it is the average of the two middle values. First sort the data, then pick out the middle:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'
This sorts the data numerically with sort -n, stores the values in an array, then finds the middle value (or the average of the two middle values when the count is even).
// Finding the mode
The mode is the most frequent value. We find this by sorting, counting occurrences, and identifying which value appears most often:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'
This sorts the values, counts duplicates with uniq -c, sorts by frequency in reverse order, and picks the top result.
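One caveat: in this particular sample dataset every visitor count appears exactly once, so the mode is not very informative. Dropping head -n 1 prints the full frequency table, which lets you check for ties before trusting a single answer:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn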
# Calculating measures of dispersion (or spread)
// Finding the maximum value
To find the largest value in our dataset, we compare each value and track the maximum.
awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv
This skips the header row with NR>1, compares each value to the current max, and updates it whenever a larger value is found.
// Finding the minimum value
Similarly, to find the smallest value, initialize the minimum from the first data row and update it when smaller values are found:
awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv
Run the above commands to retrieve the maximum and minimum values.
// Finding both min and max
Instead of running two separate commands, we can find both the minimum and maximum in a single pass:
awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv
This single-pass approach initializes both variables from the first data row, then updates each one independently.
// Calculate the (population) standard deviation
The standard deviation measures how spread out the values are. For the complete population, use this formula:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv
This accumulates the sum and the sum of squares, then applies the formula \(\sqrt{\frac{\sum x^2}{n} - \mu^2}\).
// Calculating the sample standard deviation
When working with a sample rather than the entire population, apply Bessel's correction (dividing by \(n-1\)) for an unbiased estimate:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv
// Calculating the variance
The variance is the square of the standard deviation. This is another measure of dispersion useful in many statistical calculations:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv
This calculation mirrors the standard deviation but omits the square root.
# Calculating percentiles
// Calculating Quartiles
Quartiles divide sorted data into four equal parts. They are particularly useful for understanding the distribution of the data:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
q1_pos = (count+1)/4
q2_pos = (count+1)/2
q3_pos = 3*(count+1)/4
print "Q1 (25th percentile):", arr[int(q1_pos)]
print "Q2 (Median):", (count%2==1) ? arr[int(q2_pos)] : (arr[count/2]+arr[count/2+1])/2
print "Q3 (75th percentile):", arr[int(q3_pos)]
}'
This script stores the sorted values in an array, calculates the quartile positions using the \((n+1)/4\) rule, and extracts the values at those positions. Output:
Q1 (25th percentile): 1100
Q2 (Median): 1355
Q3 (75th percentile): 1520
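If you also want the interquartile range (IQR), the spread of the middle 50% of the values, one extra print inside the END block of the quartile script above is enough, reusing its q1_pos and q3_pos variables:
print "IQR:", arr[int(q3_pos)] - arr[int(q1_pos)]
For this dataset that works out to 1520 - 1100 = 420.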
// Calculating any percentile
You can calculate any percentile by adjusting the position calculation. The following flexible approach uses linear interpolation:
PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
pos = (count+1) * p/100
idx = int(pos)
frac = pos - idx
if(idx >= count) print p "th percentile:", arr[count]
else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'
This computes the position as \((n+1) \times (p/100)\), then uses linear interpolation between array indices for fractional positions.
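If you compute percentiles often, a small shell function makes the pipeline reusable. This is a minimal sketch; the percentile function name and the extra guard for very low percentiles are additions of this example rather than part of the original script:
percentile() {
  # usage: percentile <p> <csv-file> <field-number>
  cut -d',' -f"$3" "$2" | tail -n +2 | sort -n | awk -v p="$1" '
    {arr[NR]=$1; count=NR}
    END {
      pos = (count+1) * p/100
      idx = int(pos)
      frac = pos - idx
      if (idx >= count) print p "th percentile:", arr[count]
      else if (idx < 1) print p "th percentile:", arr[1]
      else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
    }'
}
percentile 90 traffic.csv 2
percentile 25 traffic.csv 4
The last two lines show example calls: the 90th percentile of visitors and the 25th percentile of bounce rate.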
# Working with multiple columns
Often, you will want to calculate statistics across multiple columns at once. Here’s how to calculate averages for visitors, page views, and bounce rate together:
awk -F',' '
NR>1 {
v_sum += $2
pv_sum += $3
br_sum += $4
count++
}
END {
print "Average visitors:", v_sum/count
print "Average page views:", pv_sum/count
print "Average bounce rate:", br_sum/count
}' traffic.csv
This maintains separate accumulators for each column and shares the same count across all three, yielding the following output:
Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06
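You can push the same one-pass idea further. The sketch below is an extension of the commands above (not part of the original tutorial): it loops over fields 2 through 4 and reports both the mean and the sample standard deviation for each column:
awk -F',' '
NR>1 {
  for (i = 2; i <= 4; i++) {
    sum[i] += $i
    sumsq[i] += $i * $i
  }
  count++
}
END {
  split("visitors page_views bounce_rate", name, " ")
  for (i = 2; i <= 4; i++) {
    mean = sum[i] / count
    sd = sqrt((sumsq[i] - sum[i] * sum[i] / count) / (count - 1))
    printf "%s: mean = %.2f, sample std dev = %.2f\n", name[i-1], mean, sd
  }
}' traffic.csv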
// Calculating correlation
Correlation measures the relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation):
awk -F', *' '
NR>1 {
x[NR-1] = $2
y[NR-1] = $3
sum_x += $2
sum_y += $3
count++
}
END {
if (count < 2) exit
mean_x = sum_x / count
mean_y = sum_y / count
for (i = 1; i <= count; i++) {
dx = x[i] - mean_x
dy = y[i] - mean_y
cov += dx * dy
var_x += dx * dx
var_y += dy * dy
}
sd_x = sqrt(var_x / count)
sd_y = sqrt(var_y / count)
correlation = (cov / count) / (sd_x * sd_y)
print "Correlation:", correlation
}' traffic.csv
This computes the Pearson correlation by dividing the covariance by the product of the two standard deviations.
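Since awk lets you pass variables with -v, a lightly modified version (shown here as a sketch, with the column numbers passed as parameters rather than hard-coded) can correlate any pair of columns, for example visitors against bounce_rate:
awk -F',' -v xcol=2 -v ycol=4 '
NR>1 {
  x[NR-1] = $xcol
  y[NR-1] = $ycol
  sum_x += $xcol
  sum_y += $ycol
  count++
}
END {
  if (count < 2) exit
  mean_x = sum_x / count
  mean_y = sum_y / count
  for (i = 1; i <= count; i++) {
    dx = x[i] - mean_x
    dy = y[i] - mean_y
    cov += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  }
  print "Correlation:", (cov / count) / (sqrt(var_x / count) * sqrt(var_y / count))
}' traffic.csv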
# Conclusion
The command line is a powerful tool for data analysis. You can process large volumes of data, calculate meaningful statistics, and automate reports, all without installing anything beyond what is already on your system.
These skills complement your Python and R knowledge rather than replacing them. Use command-line tools for quick exploration and data validation, then transition to specialized tools for complex modeling and visualization when needed.
Best of all, these tools are available on virtually every system you’ll use in your data science career. Open your terminal and start exploring your data.
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.