Analytics patterns every data scientist should master.

Photo by the author Canva

# Introduction

Data analysis problems are not really unique. However, “even though your problems are non-unique, that doesn’t make them go away,” to paraphrase Neil Young. What will make them go away? Realizing that, beneath the surface, most of them rely on a handful of reusable patterns.

I’ll show you those patterns, so you can reuse them in your work or job interview, regardless of data or industry. Data is always just that – data. All samples will be inside. PostgreSQL Based on the coding interview questions StrataScratch. Then I will relate them to real business situations.

Paradigms of Analytics in Data Science.

# 1. Joins + Filters: Finding the right subset

Question: Movie Duration Match from Amazon

Work: Developing a feature that recommends individual movies from Amazon’s content database that fit a given flight duration.

For Flight 101, look for movies that have a runtime less than or equal to the duration of the flight. The output should contain a list of recommended movies for the flight, incl flight_id, movie_idand movie_duration.

Solution:

SELECT fs.flight_id,
       ec.movie_id,
       ec.duration AS movie_duration
FROM flight_schedule fs
JOIN entertainment_catalog ec ON ec.duration <= fs.flight_duration
WHERE fs.flight_id = 101
ORDER BY ec.duration;

Output:

Paradigms of Analytics in Data Science.

Pattern: JOIN + FILTER is the pattern where you join two datasets and filter rows (among them WHERE Before or within collection HAVING after collection) based on conditions.

This is the series.

Paradigms of Analytics in Data Science.

Identify the base table: the dataset that defines what you are analyzing (flight_schedule)
Include supplementary data: datasets that include context or attributes (entertainment_catalog)
Apply filters: Remove rows you don’t need (WHERE fs.flight_id = 101)

// Business use

HR: Incorporating employees with their working hours to identify overtime
Retail: Joining orders with product details to analyze product category performance
Streaming: Engaging users with their sessions to find active viewers

# 2. Functions of Window: Classification and Layout

Question: Top posts per channel from Meta

Work: Identify the top 3 posts with the highest like count for each channel. Assign a rank to each post based on the number of likes it has, with a difference in rank when posts have the same number of likes.

The output should show the channel name, post ID, post creation date, and like count for each post.

Solution:

WITH ranked_posts AS
  (SELECT post_id,
          channel_id,
          created_at,
          likes,
          RANK() OVER (PARTITION BY channel_id ORDER BY likes DESC) AS post_rank
   FROM posts
   WHERE likes > 0)
   
SELECT c.channel_name,
       r.post_id,
       r.created_at,
       r.likes
FROM ranked_posts AS r
JOIN channels AS c ON r.channel_id = c.channel_id
WHERE r.post_rank <= 3;

Output:

Paradigms of Analytics in Data Science.

Pattern: These are the window functions used for classification.

RANK(): classification with a difference
DENSE_RANK(): classification without differentiation
ROW_NUMBER(): Unique sequence with no relation

When grading, follow this pattern.

Paradigms of Analytics in Data Science.

Divide the data: Define the logical grouping you are analyzing (PARTITION BY channel_id)
Order within each distribution: specify a ranking or time order (ORDER BY likes DESC)
Apply the Hierarchical Window function – RANK(), DENSE_RANK() or ROW_NUMBER() OVER() Depends on the job

// Business use

This pattern is used to identify top performers, for example:

Sales: Top sales representatives per territory.
Education: Classifying students according to test scores in each class
Logistics: Ranking delivery drivers by completed deliveries in each region

# 3. Aggregation + grouping: roll-up pattern

Question: Same day orders from Walmart

Work: Find customers who started a session and placed an order on the same day. Calculate the total number of orders placed that day and the total order value for that day.

Solution:

SELECT s.user_id,
       s.session_date,
       COUNT(o.order_id) AS total_orders,
       SUM(o.order_value) AS total_order_value
FROM
  (SELECT DISTINCT user_id,
                   session_date
   FROM sessions) s
JOIN order_summary o ON s.user_id = o.user_id
AND s.session_date = o.order_date
GROUP BY s.user_id, s.session_date;

Output:

Paradigms of Analytics in Data Science.

Pattern: This pattern is for summarizing data, such as customers, dates, products, or other analytical dimensions.

This is the setting.

Paradigms of Analytics in Data Science.

Identify the grouping dimension: the column you want to group by (user_id And session_date)
Group data: Use GROUP BY On selected dimensions to group data
Sum the matrix: Sum the values for each group using the sum functions.
Filter aggregate results (optional): Use HAVING Keeping only specific groups in terms of overall value

// Business use

E-commerce: Orders and revenue per user per day
SaaS: Logins per user per week
Finance: Transactions per account per quarter

# 4. Pivoting: Converting rows to columns

Question: Highest paid from the city of San Francisco

Work: Create a pivot table that shows the highest payout for each employee in each year, sorted by employee name in ascending order. The table should show the years 2011, 2012, 2013 and 2014.

Solution:

SELECT employeename,
       MAX(pay_2011) AS pay_2011,
       MAX(pay_2012) AS pay_2012,
       MAX(pay_2013) AS pay_2013,
       MAX(pay_2014) AS pay_2014
FROM
    (SELECT employeename,
            CASE 
                WHEN year = 2011
                THEN totalpay
                ELSE 0
            END AS pay_2011,
            CASE 
                WHEN year = 2012
                THEN totalpay
                ELSE 0
            END AS pay_2012,
            CASE 
                WHEN year = 2013
                THEN totalpay
                ELSE 0
            END AS pay_2013,
            CASE 
                WHEN year = 2014
                THEN totalpay
                ELSE 0
            END AS pay_2014
    FROM sf_public_salaries) pmt
GROUP BY employeename
ORDER BY employeename;

Output:

Pattern: Pivoting changes row values to columns. This is useful when comparing metrics across years, categories, or segments.

This is the setting.

Paradigms of Analytics in Data Science.

Identify the key column: the column you want to display as a row (employeename)
Select Pivot Column: The field whose unique values will become the new column (year)
Define Metric: Specify the metric you want to calculate and collect (totalpay)
Create a conditional column using: CASE WHEN (Or PIVOTwhere supported) to assign values to each column based on the pivot column
Aggregate the conditional expression in the outer query: Aggregate each pivot column.
Group data: Use GROUP BY On the key column to group the output

// Business use

Finance: Comparing earnings quarter to quarter
HR: Comparison of salaries throughout the year
Retail: Comparing monthly sales totals

# 5. Aggregate metrics: growth, sustainability, and progress

Question: Earnings over time from Amazon

Work: Calculate the 3-month rolling average of total revenue from purchases. Returns—represented by negative purchase values—should not be included in the calculation.

The output should show year-month (YYYY-MM) and rolling averages, sorted from earliest to most recent month.

Solution:

SELECT t.month,
       AVG(t.monthly_revenue) OVER (ORDER BY t.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_revenue
FROM
  (SELECT to_char(created_at::date, 'YYYY-MM') AS month,
          SUM(purchase_amt) AS monthly_revenue
   FROM amazon_purchases
   WHERE purchase_amt > 0
   GROUP BY to_char(created_at::date, 'YYYY-MM')
   ORDER BY to_char(created_at::date, 'YYYY-MM')
) t
ORDER BY t.month ASC;

Output:

Paradigms of Analytics in Data Science.

Pattern: Aggregate metrics (such as running totals, moving averages, or running counts) are used to understand trends rather than showing individual time periods separately.

This is the setting.

Paradigms of Analytics in Data Science.

Pre-aggregation by time period (optional): Summarize the analytics data into a total by the desired time period (all displayed in the query).
Apply an aggregate function: Use an aggregate function on the column you want to aggregate in the main query.
Convert an aggregate function to a window function: use OVER() Clause
Sort the time intervals: sort the data within the distribution by date so that the aggregate calculation is applied correctly (ORDER BY t.month)
Specify Window Frame: Specify the number of previous or subsequent periods to include in the total calculation (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

// Business use

E-commerce: Total revenue is running.
SaaS: Aggregate Active Users
Product Analytics: Cohort Retention Curves
Finance: Trailing Average
Operations: Rolling total support tickets

# 6. Funnel Analysis: Tracking sequential behavior

Question: Penetration analysis by Spotify

Work: We have to revise the requirements. The new task is to measure how users progress through the stages of engagement on Spotify. The stages of engagement are:

Active — The user had at least one session.
Engaged – User had 5+ sessions.
Power User – User has had 5+ sessions and at least 10 listening hours in the last 30 days

For each country, calculate how many users reach each stage and the overall conversion rate from first activity to power user status.

Solution:

WITH base AS (
  SELECT country,
         user_id,
         sessions,
         listening_hours,
         last_active_date
  FROM penetration_analysis
),

stage_1 AS (
  SELECT DISTINCT user_id, country
  FROM base
  WHERE sessions > 0
),

stage_2 AS (
  SELECT DISTINCT user_id, country
  FROM base
  WHERE sessions >= 5
),

stage_3 AS (
  SELECT DISTINCT user_id, country
  FROM base
  WHERE sessions >= 5 AND listening_hours >= 10
)

SELECT country,
       COUNT(DISTINCT s1.user_id) AS users_started,
       COUNT(DISTINCT s2.user_id) AS engaged_5_sessions,
       COUNT(DISTINCT s3.user_id) AS power_users,
       ROUND(100.0 * COUNT(DISTINCT s3.user_id) / NULLIF(COUNT(DISTINCT s1.user_id), 0), 2
  ) AS conversion_rate
FROM stage_1 s1
LEFT JOIN stage_2 s2 USING (user_id, country)
LEFT JOIN stage_3 s3 USING (user_id, country)
GROUP BY country;

Output:

Pattern: Funnel analysis shows how users move through a series of ordered stages. Because the analysis depends on completing the previous one, it focuses on conversions and drop-offs.

This is the setting.

Paradigms of Analytics in Data Science.

Define steps: Identify each step that the user must complete.
Extract one dataset per step: write a common table expression (CTE) or subquery for each step, containing only the users who are eligible
Ensure stage order: If needed, filter by timestamp or sequence rules so that later stages come after earlier ones
Join stages: Join using stage datasets. LEFT JOIN To see how many users reach each step.
Count users and calculate conversion rates: Compare the number of users at each stage.

// Business use

E-Commerce: View -> Add to Cart -> Purchase
SaaS: Signup -> Activate -> Maintain
Streaming: Listen once -> Engage regularly -> Become a power user.

# 7. Time-based comparisons: Period-over-period metrics

Question: Daily violation counts from the City of San Francisco

Work: Determine the change in the number of daily violations by calculating the difference between the number of current and previous violations as of the inspection date.

Show the change in inspection date and number of daily violations, from earliest to most recent inspection ordered.

Solution:

SELECT inspection_date::DATE,
       COUNT(violation_id) - LAG(COUNT(violation_id)) OVER(ORDER BY inspection_date::DATE) AS diff
FROM sf_restaurant_health_violations
GROUP BY 1
ORDER BY 1;

Output:

Paradigms of Analytics in Data Science.

Pattern: This pattern is useful when you want to see how a metric changes over time.

This is the setting.

Paradigms of Analytics in Data Science.

Aggregate Data: Summary of events over time periods (daily/weekly/monthly totals)
Apply window function: Use LAG() or LEAD() To access values from the previous or following period
Set the time period: Use ORDER BY In clause OVER() To sort the data chronologically so that comparisons are accurate.
Calculate the difference: Subtract the previous value from the current value to get the difference.

// Business use

Product: Daily changes to active users
Operations: Daily changes in support volume
Finance: Month-to-month revenue delta

# wrap up

Internalize these seven patterns and watch data analysis problems dissolve before your eyes. I am sure they will be helpful in many business situations and job interviews.

Net Rosiedi A data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers all things SQL.

# Introduction

# 1. Joins + Filters: Finding the right subset

// Business use

# 2. Functions of Window: Classification and Layout

// Business use

# 3. Aggregation + grouping: roll-up pattern

// Business use

# 4. Pivoting: Converting rows to columns

// Business use

# 5. Aggregate metrics: growth, sustainability, and progress

// Business use

# 6. Funnel Analysis: Tracking sequential behavior

// Business use

# 7. Time-based comparisons: Period-over-period metrics

// Business use

# wrap up

Editor's pick

Get latest news

Analytics patterns every data scientist should master.

# Introduction

# 1. Joins + Filters: Finding the right subset

// Business use

# 2. Functions of Window: Classification and Layout

// Business use

# 3. Aggregation + grouping: roll-up pattern

// Business use

# 4. Pivoting: Converting rows to columns

// Business use

# 5. Aggregate metrics: growth, sustainability, and progress

// Business use

# 6. Funnel Analysis: Tracking sequential behavior

// Business use

# 7. Time-based comparisons: Period-over-period metrics

// Business use

# wrap up

Getting Started with Nanobot: Build Your First AI Agent

Drift: AI agent to run robot simulations 10x faster and more reliably.

You may also like

Leave a Comment Cancel Reply

Editor's pick

Get latest news