Most candidates fail these SQL concepts in data interviews

Photo by Author | Canva

The interviewer's job is to find the most suitable candidates for the advertised position. In doing so, they will happily set SQL interview questions to see if they can catch you off guard. There are several SQL concepts that candidates often get wrong.

Hopefully, you'll be among those who avoid that fate, because I'll explain these concepts in detail below, each with an example problem and the correct way to solve it.

. 1. Window functions

Why It's Hard: Candidates memorize what each window function does but don't really understand how window frames, partitions, or ordering actually work.

Common Mistake: Not specifying ORDER BY in ranking window functions or in value window functions, such as LEAD() or LAG(), and expecting the query to work or the result to be deterministic.

Example: In this example, you need to find users who made another purchase within 7 days of any previous purchase.

You might write this query.

WITH ordered_tx AS (
  SELECT user_id,
         created_at::date AS tx_date,
         LAG(created_at::DATE) OVER (PARTITION BY user_id) AS prev_tx_date
  FROM amazon_transactions
)

SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;

At first glance, everything may look fine. The code even outputs something that may look like the correct answer.

First, we are lucky that the code works at all! That is only because I'm writing it in PostgreSQL. In some other SQL flavors, you would get an error, since ORDER BY is mandatory in ranking and analytical window functions.

Second, the output is wrong. I highlighted some rows that should not be there. Why do they appear, then?

They appear because we did not specify an ORDER BY clause in LAG(). Without it, the row order is arbitrary. So we are comparing the current transaction with some random earlier row for that user, not with the one that occurred immediately before it.

This is not what the question asks. We need to compare each transaction to the one that came right before it by date. In other words, we need to state this explicitly in the ORDER BY clause inside the LAG() function.

WITH ordered_tx AS (
  SELECT user_id,
         created_at::date AS tx_date,
         LAG(created_at::DATE) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_tx_date
  FROM amazon_transactions
)

SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;
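To see what LAG() returns once the ordering is explicit, here is a minimal sketch on made-up data. The sample_tx rows are hypothetical and only stand in for amazon_transactions; the syntax assumes PostgreSQL, where subtracting two dates yields an integer number of days.

```sql
-- Hypothetical sample data, invented for illustration only.
WITH sample_tx (user_id, created_at) AS (
  VALUES (1, DATE '2020-01-01'),
         (1, DATE '2020-01-05'),
         (1, DATE '2020-03-01'),
         (2, DATE '2020-02-01'),
         (2, DATE '2020-02-20')
)
SELECT user_id,
       created_at,
       -- Previous purchase date for the same user, in chronological order
       LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_tx_date,
       -- Days since that previous purchase (NULL for each user's first row)
       created_at - LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS days_since_prev
FROM sample_tx
ORDER BY user_id, created_at;
```

Each user's first row has a NULL prev_tx_date, which is why the solution filters on prev_tx_date IS NOT NULL. In this sample, only user 1 has a gap of 7 days or fewer (4 days, between January 1 and January 5).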

. 2. Filtering With Aggregates (Especially HAVING vs. WHERE)

Why It's Hard: People often do not understand the logical processing order in SQL, which is: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY. This order means that WHERE filters rows before aggregation, and HAVING filters after it. Logically, it also means that you cannot use aggregate functions in the WHERE clause.

Common Mistake: Trying to use aggregate functions in WHERE in a grouped query and getting an error.

Example: This interview question asks you to find the total revenue made by each winery. Only wineries where 90 is the lowest number of points for any of their varieties should be considered.

Many people will see this as an easy question and hastily write this query.

SELECT winery,
       variety,
       SUM(price) AS total_revenue
FROM winemag_p1
WHERE MIN(points) >= 90
GROUP BY winery, variety
ORDER BY winery, total_revenue DESC;

However, this code will throw an error saying that aggregate functions are not allowed in the WHERE clause. That pretty much explains everything. The solution? Move the filtering condition from WHERE to HAVING.

SELECT winery,
       variety,
       SUM(price) AS total_revenue
FROM winemag_p1
GROUP BY winery, variety
HAVING MIN(points) >= 90
ORDER BY winery, total_revenue DESC;
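The two clauses are not mutually exclusive. As a rough sketch on invented data (sample_wines and its rows are hypothetical, not the real winemag_p1 table), the query below uses WHERE for a row-level condition and HAVING for a group-level one, which mirrors the processing order described above.

```sql
-- Hypothetical sample data, invented for illustration only.
WITH sample_wines (winery, variety, points, price) AS (
  VALUES ('A', 'Merlot', 92, 30),
         ('A', 'Merlot', 95, 50),
         ('B', 'Syrah',  88, 20),
         ('B', 'Syrah',  93, 40),
         ('C', 'Malbec', 91, NULL)
)
SELECT winery,
       variety,
       SUM(price) AS total_revenue
FROM sample_wines
WHERE price IS NOT NULL      -- row-level filter: runs before grouping
GROUP BY winery, variety
HAVING MIN(points) >= 90     -- group-level filter: runs after aggregation
ORDER BY winery;
```

Only winery A survives, with a total_revenue of 80. Winery B is removed by HAVING because its lowest score is 88, and winery C never forms a group because WHERE discarded its only row first.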

. 3. Self-Joins for Time-Based or Event-Based Comparisons

Why It's Hard: The idea of joining a table with itself is quite unintuitive, so candidates often forget that it is an option.

Common Mistake: Using subqueries and complicating the query when joining a table with itself would be simpler and faster, especially when filtering by dates or events.

Example: Here is a question that asks you to show the change in each currency's exchange rate between January 1, 2020 and July 1, 2020.

You can solve this by writing a correlated subquery in the outer SELECT that fetches the July 1 exchange rate, and then subtracting the January 1 exchange rate, which comes from the inner subquery.

SELECT jan_rates.source_currency,
  (SELECT exchange_rate 
   FROM sf_exchange_rate 
   WHERE source_currency = jan_rates.source_currency AND date = '2020-07-01') - jan_rates.exchange_rate AS difference
FROM (SELECT source_currency, exchange_rate
      FROM sf_exchange_rate
      WHERE date = '2020-01-01'
) AS jan_rates;

It returns the correct output, but such a solution is unnecessarily complicated. A much simpler solution, with fewer lines of code, is to join the table with itself and then apply the two date filtering conditions in the WHERE clause.

SELECT jan.source_currency,
       jul.exchange_rate - jan.exchange_rate AS difference
FROM sf_exchange_rate jan
JOIN sf_exchange_rate jul ON jan.source_currency = jul.source_currency
WHERE jan.date = '2020-01-01' AND jul.date = '2020-07-01';
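One thing worth knowing about the self-join version is how it treats currencies that are missing one of the two dates. Here is a small sketch on invented data (sample_rates and its values are hypothetical, standing in for sf_exchange_rate), again assuming PostgreSQL.

```sql
-- Hypothetical sample data, invented for illustration only.
WITH sample_rates (source_currency, date, exchange_rate) AS (
  VALUES ('EUR', DATE '2020-01-01', 1.12),
         ('EUR', DATE '2020-07-01', 1.10),
         ('GBP', DATE '2020-01-01', 1.32),
         ('GBP', DATE '2020-07-01', 1.25),
         ('JPY', DATE '2020-01-01', 0.0092)   -- no July row on purpose
)
SELECT jan.source_currency,
       jul.exchange_rate - jan.exchange_rate AS difference
FROM sample_rates jan
JOIN sample_rates jul ON jan.source_currency = jul.source_currency
WHERE jan.date = '2020-01-01' AND jul.date = '2020-07-01'
ORDER BY jan.source_currency;
```

The result contains EUR and GBP only. JPY has no July row, so the inner join drops it. If such currencies should appear with a NULL difference, one option is a LEFT JOIN with the July date condition moved into the ON clause.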

. 4. Subqueries vs. Common Table Expressions (CTEs)

Why It's Hard: People often get stuck on subqueries because they learn them before Common Table Expressions (CTEs) and keep using them for any query with layered logic. However, subqueries can get messy very quickly.

Common Mistake: Using deeply nested SELECT statements when CTEs would be much simpler.

Example: In this interview question from Google and Netflix, you need to find the top actors based on their average movie rating within the genre in which they appear most frequently.

The following is the solution using CTEs.

WITH genre_stats AS
  (SELECT actor_name,
          genre,
          COUNT(*) AS movie_count,
          AVG(movie_rating) AS avg_rating
   FROM top_actors_rating
   GROUP BY actor_name,
            genre),
            
max_genre_count AS
  (SELECT actor_name,
          MAX(movie_count) AS max_count
   FROM genre_stats
   GROUP BY actor_name),
     
top_genres AS
  (SELECT gs.*
   FROM genre_stats gs
   JOIN max_genre_count mgc ON gs.actor_name = mgc.actor_name
   AND gs.movie_count = mgc.max_count),
     
top_genre_avg AS
  (SELECT actor_name,
          MAX(avg_rating) AS max_avg_rating
   FROM top_genres
   GROUP BY actor_name),
   
filtered_top_genres AS
  (SELECT tg.*
   FROM top_genres tg
   JOIN top_genre_avg tga ON tg.actor_name = tga.actor_name
   AND tg.avg_rating = tga.max_avg_rating),

ranked_actors AS
  (SELECT *,
          DENSE_RANK() OVER (
                             ORDER BY avg_rating DESC) AS rank
   FROM filtered_top_genres),
   
final_selection AS
  (SELECT MAX(rank) AS max_rank
   FROM ranked_actors
   WHERE rank <= 3)
   
SELECT actor_name,
       genre,
       avg_rating
FROM ranked_actors
WHERE rank <=
    (SELECT max_rank
     FROM final_selection);

It is relatively complex, but it still consists of seven clear CTEs, and the descriptive CTE names make the code much easier to read.

Curious what the same solution would look like using only subqueries? Here it is.

SELECT ra.actor_name,
       ra.genre,
       ra.avg_rating
FROM (
    SELECT *,
           DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
    FROM (
        SELECT tg.*
        FROM (
            SELECT gs.*
            FROM (
                SELECT actor_name,
                       genre,
                       COUNT(*) AS movie_count,
                       AVG(movie_rating) AS avg_rating
                FROM top_actors_rating
                GROUP BY actor_name, genre
            ) AS gs
            JOIN (
                SELECT actor_name,
                       MAX(movie_count) AS max_count
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS genre_stats
                GROUP BY actor_name
            ) AS mgc
            ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
        ) AS tg
        JOIN (
            SELECT actor_name,
                   MAX(avg_rating) AS max_avg_rating
            FROM (
                SELECT gs.*
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS gs
                JOIN (
                    SELECT actor_name,
                           MAX(movie_count) AS max_count
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS genre_stats
                    GROUP BY actor_name
                ) AS mgc
                ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
            ) AS top_genres
            GROUP BY actor_name
        ) AS tga
        ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
    ) AS filtered_top_genres
) AS ra
WHERE ra.rank <= (
    SELECT MAX(rank)
    FROM (
        SELECT *,
               DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
        FROM (
            SELECT tg.*
            FROM (
                SELECT gs.*
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS gs
                JOIN (
                    SELECT actor_name,
                           MAX(movie_count) AS max_count
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS genre_stats
                    GROUP BY actor_name
                ) AS mgc
                ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
            ) AS tg
            JOIN (
                SELECT actor_name,
                       MAX(avg_rating) AS max_avg_rating
                FROM (
                    SELECT gs.*
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS gs
                    JOIN (
                        SELECT actor_name,
                               MAX(movie_count) AS max_count
                        FROM (
                            SELECT actor_name,
                                   genre,
                                   COUNT(*) AS movie_count,
                                   AVG(movie_rating) AS avg_rating
                            FROM top_actors_rating
                            GROUP BY actor_name, genre
                        ) AS genre_stats
                        GROUP BY actor_name
                    ) AS mgc
                    ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
                ) AS top_genres
                GROUP BY actor_name
            ) AS tga
            ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
        ) AS filtered_top_genres
    ) AS ranked_actors
    WHERE rank <= 3
);

The same logic is repeated needlessly across the subqueries. How many subqueries are there? I have no idea. The code is impossible to maintain. Even though I just wrote it, I would need half a day to understand it if I wanted to change something tomorrow. In addition, the completely meaningless subquery aliases do not help.
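The core advantage is that a CTE is defined once and can be referenced several times. As a stripped-down sketch (the sample_ratings rows are hypothetical, and this covers only the "most frequent genre per actor" step, not the full interview solution), genre_stats below is written once and used both in the main query and in the correlated subquery.

```sql
-- Hypothetical sample data, invented for illustration only.
WITH sample_ratings (actor_name, genre, movie_rating) AS (
  VALUES ('Ann', 'Drama',  8.0),
         ('Ann', 'Drama',  9.0),
         ('Ann', 'Comedy', 7.0),
         ('Bob', 'Action', 6.0)
),
genre_stats AS (
  -- Defined once: movie count and average rating per actor and genre
  SELECT actor_name,
         genre,
         COUNT(*) AS movie_count,
         AVG(movie_rating) AS avg_rating
  FROM sample_ratings
  GROUP BY actor_name, genre
)
SELECT gs.actor_name,
       gs.genre,
       gs.avg_rating
FROM genre_stats gs
-- Referenced a second time without repeating the aggregation logic
WHERE gs.movie_count = (SELECT MAX(g2.movie_count)
                        FROM genre_stats g2
                        WHERE g2.actor_name = gs.actor_name)
ORDER BY gs.actor_name;
```

With subqueries only, the body of genre_stats would have to be pasted in twice, which is exactly the repetition seen in the long query above.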

. 5. Handling NULLs in Logic

Why It's Hard: Candidates often think that NULL is equal to something. It is not. NULL is not equal to anything, not even itself. Logic that involves NULLs behaves differently from logic that involves actual values.

Common Mistake: Using = NULL instead of IS NULL in filtering, or losing output rows because NULLs break the logic of a condition.
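A quick way to convince yourself is to count rows under different conditions. This is a minimal sketch on invented data (sample_users and the referrer column are hypothetical), assuming PostgreSQL with its default settings.

```sql
-- Hypothetical sample data, invented for illustration only.
WITH sample_users (user_id, referrer) AS (
  VALUES (1, 'ads'),
         (2, NULL),
         (3, 'email')
)
SELECT
  -- = NULL evaluates to NULL (unknown), so no row passes the filter
  (SELECT COUNT(*) FROM sample_users WHERE referrer = NULL)   AS eq_null_rows,
  -- IS NULL is the correct test and finds user 2
  (SELECT COUNT(*) FROM sample_users WHERE referrer IS NULL)  AS is_null_rows,
  -- <> silently skips the NULL row as well, so only 'email' matches
  (SELECT COUNT(*) FROM sample_users WHERE referrer <> 'ads') AS not_ads_rows,
  -- IS DISTINCT FROM treats NULL as a comparable value
  (SELECT COUNT(*) FROM sample_users WHERE referrer IS DISTINCT FROM 'ads') AS distinct_from_ads_rows;
```

The four counts come out as 0, 1, 1, and 2. The third one is the sneaky case: a "not equal" filter quietly drops the NULL row, which is how rows go missing from an output without any error.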

Example: There is an interview question by IBM that asks you to calculate the total number of interactions and the total number of content items created for each customer.

It does not sound too difficult, so you might write this solution with two CTEs, where one CTE counts the number of interactions per customer, while the other counts the number of content items created by each customer. In the final SELECT, you FULL OUTER JOIN the two CTEs, and you have a solution. Right?

WITH interactions_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_interactions
   FROM customer_interactions
   GROUP BY customer_id),
   
content_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_content_items
   FROM user_content
   GROUP BY customer_id)
   
SELECT i.customer_id,
  i.total_interactions,
  c.total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;

Almost right. (By the way, in the output you see double quotation marks ("") instead of NULL. That is how the StrataScratch UI displays it, but trust me, the engine still treats them as what they are: NULL values.)

The highlighted rows contain NULLs. That makes the output wrong. A NULL value is neither a customer ID nor a number of interactions or content items, which the question explicitly asks you to show.

What we are missing in the above solution is COALESCE() to handle NULLs in the final SELECT. Now, all the customers without interactions will get their IDs from the content_summary CTE. Also, for customers that have no interactions or no content items, we will now replace NULL with 0, which is a valid number.

WITH interactions_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_interactions
   FROM customer_interactions
   GROUP BY customer_id),
   
content_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_content_items
   FROM user_content
   GROUP BY customer_id)
   
SELECT COALESCE(i.customer_id, c.customer_id) AS customer_id,
       COALESCE(i.total_interactions, 0) AS total_interactions,
       COALESCE(c.total_content_items, 0) AS total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;

. 6. Group-Based Deduplication

Why It's Hard: Group-based deduplication means that you are selecting one row per group, e.g., "most recent", "highest score", etc. At first, it sounds like you only need to pick one row per user. But you cannot use GROUP BY unless you aggregate. On the other hand, you often need a full row, not the single value that aggregation and GROUP BY return.

Common Mistake: Using GROUP BY + LIMIT 1 (or DISTINCT ON, which is PostgreSQL-specific) instead of ROW_NUMBER() or RANK(), the latter if you want ties included.

Example: This question asks you to identify the best-selling item for each month, and there is no need to separate the months by year. The best-selling item is determined by the total of unitprice * quantity.

The naive approach would be this. First, extract the sale month from invoicedate, select description, and find the total sales by summing unitprice * quantity. Then, to get the total sales by month and product description, we simply GROUP BY those two columns. Finally, we only need to use ORDER BY to sort the output from the best-selling to the worst-selling product and use LIMIT 1 to output only the first row, i.e., the best-selling item.

SELECT DATE_PART('MONTH', invoicedate) AS sale_month,
       description,
       SUM(unitprice * quantity) AS total_paid
FROM online_retail
GROUP BY sale_month, description
ORDER BY total_paid DESC
LIMIT 1;

As I said, it is naive. The output somewhat resembles what we need, but we need it for every month, not just one.

One correct approach is to use the RANK() window function. With this approach, we follow a similar procedure to the previous code. The difference is that the query now becomes a subquery in the FROM clause. In addition, we use RANK() to partition the data by month and then rank the rows within each partition (i.e., for each month separately) from the best-selling to the worst-selling item.

Then, in the main query, we simply select the required columns and output only the rows where the rank is 1 by specifying that condition in the WHERE clause.

SELECT month,
       description,
       total_paid
FROM
  (SELECT DATE_PART('month', invoicedate) AS month,
          description,
          SUM(unitprice * quantity) AS total_paid,
          RANK() OVER (PARTITION BY DATE_PART('month', invoicedate) ORDER BY SUM(unitprice * quantity) DESC) AS rnk
   FROM online_retail
   GROUP BY month, description) AS tmp
WHERE rnk = 1;
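The choice between RANK() and ROW_NUMBER() only matters when there are ties. Here is a small sketch on invented, already-aggregated data (sample_sales and its rows are hypothetical) that puts the two side by side.

```sql
-- Hypothetical sample data, invented for illustration only.
WITH sample_sales (sale_month, description, total_paid) AS (
  VALUES (1, 'mug',  100),
         (1, 'lamp', 100),   -- ties with 'mug' in month 1
         (1, 'pen',   40),
         (2, 'mug',   70),
         (2, 'lamp',  90)
),
ranked AS (
  SELECT sale_month,
         description,
         total_paid,
         -- Tied rows share the same rank
         RANK() OVER (PARTITION BY sale_month ORDER BY total_paid DESC) AS rnk,
         -- Every row gets a unique number; description breaks the tie deterministically
         ROW_NUMBER() OVER (PARTITION BY sale_month ORDER BY total_paid DESC, description) AS rn
  FROM sample_sales
)
SELECT sale_month, description, total_paid, rnk, rn
FROM ranked
WHERE rnk = 1
ORDER BY sale_month, description;
```

Filtering on rnk = 1 returns both 'lamp' and 'mug' for month 1, plus 'lamp' for month 2. Filtering on rn = 1 instead would return exactly one row per month and keep only 'lamp' for month 1, so pick the function based on whether the question wants ties included.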

. Conclusion

The six concepts we have covered commonly appear in SQL coding interview questions. Pay attention to them, practice interview questions that involve these concepts, learn the correct approaches, and you will significantly improve your chances in your interviews.

Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
