

# Introduction
Technical screening for data science roles at FAANG companies is very thorough. However, even they can't come up with an endless stream of unique interview questions. Once you've gone through the process enough times, you start to notice that some SQL patterns keep showing up.
Here are the top 5, with examples and code (PostgreSQL) for practice.


Master these and you will be ready for most SQL interviews.
# Pattern #1: Aggregating data with GROUP BY
Using aggregate functions with GROUP BY allows you to aggregate metrics by category.
This pattern is often combined with data filtering, which means using one of two clauses:
- WHERE: Filters data before aggregation.
- HAVING: Filters data after aggregation.
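To make the difference between the two clauses concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module instead of PostgreSQL so that it runs without a database server, and the `orders` table with its `region` and `amount` columns is a hypothetical example, not part of any interview question.

```python
import sqlite3

# In-memory database with a small, made-up orders table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('north', 50), ('north', 200),
        ('south', 30), ('south', 40),
        ('west', 500);
""")

# WHERE removes individual rows before grouping:
# only orders above 35 are included in each region's sum.
where_rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE amount > 35
    GROUP BY region
    ORDER BY region
""").fetchall()

# HAVING removes whole groups after aggregation:
# only regions whose total exceeds 100 are returned.
having_rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > 100
    ORDER BY region
""").fetchall()

print(where_rows)   # [('north', 250), ('south', 40), ('west', 500)]
print(having_rows)  # [('north', 250), ('west', 500)]
```

Note that the south region survives the WHERE filter with a reduced total, while HAVING drops it entirely because its full total is only 70.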
Example: This Meta interview question asks you to find the total number of comments per user in the 30 days or less before 2020-02-10. Users with no comments should be excluded from the output.
We use SUM() with a GROUP BY clause to total the number of comments per user. Outputting only comments within the specified time period is achieved by filtering the data before aggregation, i.e., using WHERE. There is no need to work out by hand which date is "30 days before 2020-02-10"; we simply subtract 30 days from that date using an INTERVAL expression.
```sql
SELECT user_id,
       SUM(number_of_comments) AS number_of_comments
FROM fb_comments_count
-- Keep only rows from the 30 days up to and including 2020-02-10
WHERE created_at BETWEEN '2020-02-10'::DATE - 30 * INTERVAL '1 day' AND '2020-02-10'::DATE
GROUP BY user_id;
```

Here is the output.
| user_id | number_of_comments |
|---|---|
| 5 | 1 |
| 8 | 4 |
| 9 | 2 |
| ... | ... |
| 99 | 2 |
Business Use:
- User activity metrics: DAU and MAU, churn rate.
- Revenue metrics: per region/product/time period.
- User engagement: average session length, average clicks per user.
# Pattern #2: Filtering with subqueries
When using subqueries for filtering, you create a subset of data, then filter the main query against it.
The two main subtypes are:
- Scalar subqueries: Return a single value, e.g., the maximum amount.
- Correlated subqueries: Reference values from the outer query and depend on them to return their result.
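The two subtypes can be compared side by side in a short sketch. It again uses Python's built-in sqlite3 module so it runs without a server; the `orders` table and its `customer` and `amount` columns are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database with a made-up orders table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 10), ('ann', 40),
        ('bob', 25), ('bob', 5),
        ('cid', 40);
""")

# Scalar subquery: evaluated once, returns a single value (the overall maximum).
scalar_rows = conn.execute("""
    SELECT customer, amount
    FROM orders
    WHERE amount = (SELECT MAX(amount) FROM orders)
    ORDER BY customer
""").fetchall()

# Correlated subquery: references the outer row (o.customer),
# so it returns each customer's own maximum.
correlated_rows = conn.execute("""
    SELECT o.customer, o.amount
    FROM orders o
    WHERE o.amount = (SELECT MAX(i.amount)
                      FROM orders i
                      WHERE i.customer = o.customer)
    ORDER BY o.customer
""").fetchall()

print(scalar_rows)      # [('ann', 40), ('cid', 40)]
print(correlated_rows)  # [('ann', 40), ('bob', 25), ('cid', 40)]
```

The scalar version finds only the orders that match the global maximum, while the correlated version returns the largest order for every customer.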
Example: This interview question from Meta asks you to build a recommendation system for Facebook. For each user, you should find pages that the user doesn't follow but at least one of their friends does. The output should contain the ID of the user and the ID of the page that should be recommended to that user.
The outer query returns all user-page pairs where the page is followed by at least one friend.
Then, we use a subquery in the WHERE clause to remove pages that the user already follows. The subquery has two conditions: one that only considers pages followed by that specific user (checks for that user only), and one that checks whether the page considered for recommendation is among the pages that user follows (checks for that page only).
Since the subquery returns the pages the user already follows, using NOT EXISTS in WHERE excludes all these pages from the recommendation.
```sql
SELECT DISTINCT f.user_id,
       p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
-- Exclude pages the user already follows
WHERE NOT EXISTS
    (SELECT *
     FROM users_pages pg
     WHERE pg.user_id = f.user_id
       AND pg.page_id = p.page_id);
```

Here is the output.
| user_id | page_id |
|---|---|
| 1 | 23 |
| 1 | 24 |
| 1 | 28 |
| ... | ... |
| 5 | 25 |
Business Use:
- Customer activity: Latest login per user, latest subscription change.
- Sales: Highest order per customer, highest revenue order per region.
- Product Performance: The most purchased products in each category, the highest revenue products in each month.
- User behavior: longest session per user, first purchase per customer.
- Reviews and opinions: Top reviewer, latest reviews for every product.
- Operations: Latest shipment status per order, fastest delivery time in every region.
# Pattern #3: Ranking with window functions
Using window functions such as ROW_NUMBER(), RANK(), and DENSE_RANK() allows you to order rows within data partitions and then identify the first, second, or nth record.
Here is what each of these ranking window functions does:
- ROW_NUMBER(): Assigns a unique sequential number within each partition. Tied values get different row numbers.
- RANK(): Assigns the same rank to tied values and skips the following ranks for the next non-tied value.
- DENSE_RANK(): The same as RANK(), only it doesn't skip ranks after ties.
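The behavior of the three functions on tied values is easiest to see on a tiny dataset. The sketch below uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer, which recent Python builds typically bundle); the `scores` table is a hypothetical example.

```python
import sqlite3

# In-memory database with a made-up scores table containing one tie (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('a', 90), ('b', 90), ('c', 80), ('d', 70);
""")

# ROW_NUMBER() gets name as a tie-breaker so its numbering is deterministic;
# RANK() and DENSE_RANK() order by score only, so the tie is visible.
ranking_rows = conn.execute("""
    SELECT name,
           score,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for row in ranking_rows:
    print(row)
# ('a', 90, 1, 1, 1)
# ('b', 90, 2, 1, 1)
# ('c', 80, 3, 3, 2)
# ('d', 70, 4, 4, 3)
```

After the tie on 90, RANK() jumps to 3 while DENSE_RANK() continues with 2, and ROW_NUMBER() never repeats a value.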
Example: In an Amazon interview question, we need to find the highest daily order cost between 2019-02-01 and 2019-05-01. If a customer has multiple orders on a particular day, sum up the order costs on a daily basis. The output should contain the customer's first name, the total cost of their order, and the order date.
In the first common table expression (CTE), we find the orders between the specified dates and get each customer's daily total for each date.
In the second CTE, we use RANK() to rank customers for each date by their daily order total, in descending order.
Now, we join the second CTE with the customers table to output the desired columns and keep only the rows ranked first, i.e., the highest daily order totals.
```sql
WITH customer_daily_totals AS (
    -- Total order cost per customer per day within the date range
    SELECT o.cust_id,
           o.order_date,
           SUM(o.total_order_cost) AS total_daily_cost
    FROM orders o
    WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
    GROUP BY o.cust_id, o.order_date
),
ranked_daily_totals AS (
    -- Rank customers within each date, highest daily total first
    SELECT cust_id,
           order_date,
           total_daily_cost,
           RANK() OVER (PARTITION BY order_date ORDER BY total_daily_cost DESC) AS rnk
    FROM customer_daily_totals
)
SELECT c.first_name,
       rdt.order_date,
       rdt.total_daily_cost AS max_cost
FROM ranked_daily_totals rdt
JOIN customers c ON rdt.cust_id = c.id
WHERE rdt.rnk = 1
ORDER BY rdt.order_date;
```

Here is the output.
| first_name | order_date | max_cost |
|---|---|---|
| Mia | 2019-02-01 | 100 |
| Frieda | 2019-03-01 | 80 |
| Mia | 2019-03-01 | 80 |
| ... | ... | ... |
| Frieda | 2019-04-23 | 120 |
Business Use:
- User Activity: “Most active users last month”.
- Revenue: “Second highest revenue generating region”.
- Product Popularity: “Top 10 Best Selling Products”.
- Purchases: "First purchase of every customer".
# Pattern #4: Calculating Moving Averages and Cumulative Amounts
A moving (rolling) average calculates the average over the last N rows, usually months or days. It is calculated using AVG() as a window function with the window definition ROWS BETWEEN N PRECEDING AND CURRENT ROW, which covers the current row and the N rows before it.
The cumulative sum (running total) is the sum from the first row to the current row, which is reflected in the window definition ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW in the SUM() window function.
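Both window definitions can be tried on a handful of rows. This is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or newer is needed for window functions); the `monthly` table and its values are made up for illustration.

```python
import sqlite3

# In-memory database with made-up monthly revenue figures (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (month TEXT, revenue INTEGER);
    INSERT INTO monthly VALUES
        ('2020-01', 100), ('2020-02', 200), ('2020-03', 300), ('2020-04', 400);
""")

window_rows = conn.execute("""
    SELECT month,
           -- 3-row moving average: the current row plus the 2 rows before it
           AVG(revenue) OVER (ORDER BY month
                              ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg,
           -- Running total: everything from the first row up to the current row
           SUM(revenue) OVER (ORDER BY month
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM monthly
    ORDER BY month
""").fetchall()

for row in window_rows:
    print(row)
# ('2020-01', 100.0, 100)
# ('2020-02', 150.0, 300)
# ('2020-03', 200.0, 600)
# ('2020-04', 300.0, 1000)
```

For the first two months fewer than three rows are available, so the average is taken over whatever rows the window contains.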
Example: An Amazon interview question asks us to find the 3-month rolling average of total revenue from purchases. We should output the year-month (YYYY-MM) and the 3-month rolling average, sorted from the earliest to the latest month.
Also, returns (negative purchase values) should not be included.
We use a subquery to calculate monthly revenue with SUM() and convert the purchase date to the YYYY-MM format with the TO_CHAR() function.
Then, we use AVG() to calculate the moving average. In the OVER() clause, we order the data by month and specify the window ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. This gives a 3-month moving average, which takes into account the current month and the previous two.
```sql
SELECT t.month,
       AVG(t.monthly_revenue) OVER(ORDER BY t.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_revenue
FROM
    -- Monthly revenue, excluding returns (negative amounts)
    (SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
            SUM(purchase_amt) AS monthly_revenue
     FROM amazon_purchases
     WHERE purchase_amt > 0
     GROUP BY 1
     ORDER BY 1) AS t
ORDER BY t.month ASC;
```

Here is the output.
| month | avg_revenue |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 23493.5 |
| 2020-03 | 25535.666666666668 |
| ... | ... |
| 2020-10 | 21211 |
To calculate the cumulative sum, we do it like this.
```sql
SELECT t.month,
       SUM(t.monthly_revenue) OVER(ORDER BY t.month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_sum
FROM
    -- Monthly revenue, excluding returns (negative amounts)
    (SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
            SUM(purchase_amt) AS monthly_revenue
     FROM amazon_purchases
     WHERE purchase_amt > 0
     GROUP BY 1
     ORDER BY 1) AS t
ORDER BY t.month ASC;
```

Here is the output.
| month | cum_sum |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 46987 |
| 2020-03 | 76607 |
| ... | ... |
| 2020-10 | 239869 |
Business Use:
- Engagement metrics: 7-day moving average of DAU or messages sent, cumulative cancellations.
- Financial KPIs: 30-day moving averages (prices, conversions, stock prices), revenue reporting (cumulative YTD).
- Product performance: Moving average of logins per user, cumulative app installs.
- Operations: Cumulative orders shipped, tickets resolved, bugs closed.
# Pattern #5: Applying Conditional Aggregations
Conditional aggregation lets you compute multiple segmented metrics in a single pass by placing a CASE WHEN statement inside aggregate functions.
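A minimal sketch of the idea, using Python's built-in sqlite3 module so it runs without a server. The `orders` table with its `region`, `status`, and `amount` columns is a hypothetical example.

```python
import sqlite3

# In-memory database with a made-up orders table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, status TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('north', 'completed', 100),
        ('north', 'refunded',   40),
        ('north', 'completed',  60),
        ('south', 'completed',  80),
        ('south', 'canceled',   20);
""")

# Each aggregate sees only the rows its CASE expression lets through:
# COUNT ignores the NULLs produced when no WHEN branch matches,
# and SUM adds 0 for rows that are not completed.
segment_rows = conn.execute("""
    SELECT region,
           COUNT(CASE WHEN status = 'completed' THEN 1 END)           AS completed_orders,
           COUNT(CASE WHEN status <> 'completed' THEN 1 END)          AS other_orders,
           SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()

print(segment_rows)  # [('north', 2, 1, 160), ('south', 1, 1, 80)]
```

All three metrics come from one scan of the table and one GROUP BY, instead of three separate filtered queries.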
Example: An Amazon interview question asks you to identify active users by finding users who made a second purchase within 1 to 7 days of their first purchase. The output should contain only those user IDs. Same-day purchases should be ignored.
The first CTE identifies users and their purchase dates, collapsing same-day purchases into one row by using the DISTINCT keyword.
The second CTE numbers each user's purchase dates from oldest to newest.
The final CTE finds the first and second purchase dates for each user using conditional aggregation. We use MAX() to pick the single non-NULL value for the first and second purchase dates.
Finally, we use the result of the last CTE and retain only those users who made a second (non-NULL) purchase within 1 to 7 days of their first purchase.
```sql
WITH daily AS (
    -- One row per user per purchase date
    SELECT DISTINCT user_id,
           created_at::DATE AS purchase_date
    FROM amazon_transactions
),
ranked AS (
    -- Number each user's purchase dates from oldest to newest
    SELECT user_id,
           purchase_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) AS rn
    FROM daily
),
first_two AS (
    -- Pivot the first and second purchase dates into columns
    SELECT user_id,
           MAX(CASE WHEN rn = 1 THEN purchase_date END) AS first_date,
           MAX(CASE WHEN rn = 2 THEN purchase_date END) AS second_date
    FROM ranked
    WHERE rn <= 2
    GROUP BY user_id
)
SELECT user_id
FROM first_two
WHERE second_date IS NOT NULL AND (second_date - first_date) BETWEEN 1 AND 7
ORDER BY user_id;
```

Here is the output.
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| ... |
| 143 |
Business Use:
- Subscription reporting: Paid vs. free users, active vs. inactive users, by plan tier.
- Marketing funnel dashboards: Signups vs. purchasers by traffic source, emails opened vs. clicked vs. converted.
- E-commerce: Completed vs. refunded vs. canceled orders by region, new vs. returning buyers.
- Product analysis: iOS vs. Android vs. web usage, feature adopted vs. not adopted in each group.
- Finance: Revenue from new vs. existing customers, gross vs. net revenue.
- A/B testing and experiments: Control vs. treatment metrics.
# Conclusion
If you want a job at FAANG (and other) companies, pay attention to these five SQL patterns for interviews. Of course, they aren't the only SQL concepts that get tested, but they are tested the most. By focusing on them, you ensure that your preparation for most SQL interviews at FAANG companies is as effective as possible.
Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview tips, shares data science projects, and covers everything SQL.