7 Common Pitfalls Holding You Back in SQL Interviews

data by malcolm
7 min readJan 20, 2022

--

At this point in my career, I estimate I’ve hosted well over a hundred real and mock SQL interviews and hit 10k hours of writing SQL queries. Mastery is subjective, but I think I’m getting closer and may have a few tips to share for those looking to upskill and ace any SQL interview.

Your brain on SQL

I recall taking three courses in college related to database administration where SQL was briefly covered. The CRUD topics were pretty straightforward, but I was quickly overwhelmed when it came to SELECT statements. To this day, I still don’t know how I learned the basics. I guess it was a lot of on-the-job experimentation and being surrounded by experienced data professionals. Many years later, I’m sharing with you what I believe are the most common traps holding you back from SQL mastery.

The easiest raise I ever got (was because of SQL)

Many years ago I was being recruited for a data analyst role. In the interview process, I was given a 24-hour take-home problem set that was mostly SQL. I decided to invest 100% of my effort into the assessment. The hiring manager was impressed by my solutions and decided that I should instead interview for the senior position on the team. I eventually received and accepted the offer for the senior position. I’m guessing that was the easiest $25k raise I ever made.

1. You think Python (or R) is the ideal tool

In interviews, I like to ask candidates: “when would you use Python vs SQL?”

There’s no wrong answer here. However, I would argue that you would always use SQL unless you can’t. If you are transforming or aggregating data, use SQL. If you need access to statistics or web scraping packages, use Python. The answer I’m not looking for is “Python Pandas because it’s better than SQL”. This answer is a hint that you aren’t very comfortable with SQL. People who can write both Pandas and SQL fluently will tell you that SQL is much easier to manipulate data and often times more efficient (because you are leveraging the query engine built for scale).

2. You don’t understand execution order

The database execution engine is like a line cook

A rookie mistake I often see is not knowing the SQL order of operations. Here’s the general order of clauses you write your query in:

  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. ORDER BY

However, that’s not the order that the database is executing your queries. Here’s the order that the database reads and executes your query:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY
  7. LIMIT

Spot the difference? Filtering occurs before column selection (and creation). A frequent mistake I see some make is creating a new column in their SELECT clause but then attempting to filter those columns in the WHERE clause. For example, imagine you are trying to find the 100th user that signed up. So you use a row_number window function like so:

select *
, row_number() over (order by signup_date) as rn
from users

When filtering for the 100th user, a rookie might try the following:

select *
, row_number() over (order by signup_date) as rn
from users
where rn = 100
-- returns an error "rn" is not defined

The problem with the above is that you aren’t considering the order of operations from the database execution perspective (hint: WHEREexecutes before the SELECT). The WHERE is unaware of any rn column and cannot be filtered.

The solution, in this case, is to use a common table expression (CTE) like so:

with rankings as (
select *
, row_number() over (order by signup_date) as rn
from users
)
select *
from rankings
where rn = 100

If you thought the solution was to use a subquery, you aren’t wrong, but see the next section about why subqueries are the devil.

3. You write subqueries

Eventually, you get better at SQL and you feel comfortable writing a much longer query. To get the job done, your query contains many nested subqueries. Do you know how painful it is to attempt to review your query?

Me opening a SQL file containing nested subqueries

The problem with subqueries (and especially nested subqueries) is that I have no idea where the query really “begins”. A nested subquery reads more like a Fibonacci spiral.

Check out this problem that I pulled from w3 resources:

If we want to retrieve that unique job_id and there average salary from the employees table which unique job_id have a salary is smaller than (the maximum of averages of min_salary of each unique job_id from the jobs table which job_id are in the list, picking from (the job_history table which is within the department_id 50 and 100))

Here’s the subquery based solution they provided:

select job_id, avg(salary)  
from employees
group by job_id
having avg(salary)< (
select max(myavg)
from (
select job_id, avg(min_salary) as myavg
from jobs
where job_id in (
select job_id
from job_history
where department_id between 50 AND 100
)
group by job_id
) ss
)

Not only is this difficult to interpret to the reviewer, but you’re also making it unnecessarily hard for yourself to write the query in the first place. Here’s how I would rewrite the solution using CTEs:

with deparment_jobs as (
select job_id
from job_history
where department_id between 50 AND 100
), avg_min_salaries as (
select j.job_id, avg(min_salary) as myavg
from jobs j
join department_jobs d using(job_id)
group by 1
)
select job_id, avg(salary) as avg_salary
from employees
group by 1
having avg(salary) > ( select max(myavg) from avg_min_salaries )

See how much easier that is to read? The query now reads like English; top to bottom and left to right. This is the beauty of CTEs — clean code. Notice, the one instance that a subquery should be preferred over a CTE: when it’s more elegant. It’s these formatting decisions that demonstrate the art of SQL.

4. You only use inner join

One of the biggest differences between rookies and pros is the type of join they like to use by default. Rookies like to use inner join, whereas pros like to default to left join. Why is that? It has to do with the type of reports and database designs that pros are typically working with. In real-world environments, you’ll often encounter data warehouse designs like star-schema, where you have a transactional (fact) table that you need to left join dimension table to enrich the fact table. The reason for the left join is that, more often than not, a transaction doesn’t have a foreign key to every dimensional table, and inner join would otherwise exclude the transaction from being included in the final output. You don’t learn this until you’ve made the mistake.

5. You don’t understand window functions

Once you understand window functions you will be eternally grateful. Rookies often get stuck on problems like:

  • Find the Nth user that signed up
  • Find the Nth user for each country
  • Calculate the subtotal users for each country in addition to the total users
  • Find the cumulative total signups over time

All of these problems would be a major headache to attempt to solve without window functions. Window functions allow you to apply aggregate functions (sum, avg, count, min, etc.) and ranking functions (row_number, rank, etc.) across rows without a group by. It’s almost as if you can choose which panes of glass to aggregate.

For a good primer on window functions, check this out. Just remember that window functions and group by don’t mix like oil and water. However, when you start combining window functions with CTEs, you become a more lethal querier.

6. You go way too slow

In high school, I took art classes almost every day. I distinctly remember the art teacher telling us we had to paint as fast as possible: “You become a better painter by painting faster. You have 3 minutes to start and finish the painting”.

The same is also true for mastering SQL. I see rookies going way too slow during their SQL interviews and often don’t finish all the questions before the time ends. When practicing, you should also be trying to improve your solution times. Often, the interviewers are looking to see if you have a diverse toolbelt of SQL techniques and best practices; they care less about if you “smudged your (query) painting” in the process — that is unless your solutions are being evaluated by a computer.

7. You don’t read the error messages

When I host an interactive SQL interview, I often hear the interviewee asking me: “does this look right?”. The first “person” you should be asking if it looks correct, is the database. Try and run the query and see what the database says. Does the query return results or raise an error? If there’s an error, did you carefully read the error message?

ERROR:  column "jobs.job_id" must appear in the GROUP BY clause or be used in an aggregate function

That kind of error message should be obvious. Sure, sometimes the error message is wildly off from the root cause. Regardless, a pro would run the query before asking for verbal confirmation.

Hopefully, you’ve found this article helpful and learned about the subtle differences that separate a SQL pro from a rookie. If any topic is unclear or you have suggestions for other common rookie mistakes during interviews, I invite you to leave a comment. A big thank you to all my peers that have taught me SQL tricks and tips throughout the years.

--

--

No responses yet