This is the fourth post in a series of posts about doing data science with SQL. The previous post went over the commands required to set up the example recipes database from the first post in this series.
In this post, I will use the example recipes database from the
first post to
go over the basics of querying in SQL with the
I will start with the basic operators of filtering, joining, and
aggregating. Then I will show how these simple commands can be
combined to create powerful queries. By the end of this post, you
should be able to write advanced SQL queries.
SELECT, FROM, and WHERE in SQL
We can use the
SELECT statement in SQL to query data from a
database. For example, we might be interested in finding
all of the ingredients in the “Tomato Soup” recipe
recipes database described in the
in this series).
This query is non-trivial because this information is spread across three tables. As a first, step we could query for the recipe ID of this recipe with:
SELECT recipe_id FROM recipes WHERE recipe_name="Tomato Soup"
This says to take the
recipes table and take the
column for all rows where the
recipe_name column has a
particular value. This query returns the table
Given this recipe ID, we can get the ingredient IDs for the recipe using a similar query on the recipe-ingredients-mapping table:
SELECT ingredient_id FROM recipe_ingredients WHERE recipe_id = 2
Finally, we can find the ingredient names knowing their IDs:
SELECT ingredient_name FROM ingredients WHERE ingredient_id=3 OR ingredient_id=6
The JOIN Operator in SQL
Because our data is spread across three tables, it is cumbersome and error-prone to have to run multiple queries to find the information we want. We can avoid this by joining the tables together.
When we join two tables on a column in SQL, it will create every
possible combination of rows in the output table where the condition
holds For example, if we joined
on the recipe ID:
SELECT * FROM recipes JOIN recipe_ingredients ON recipes.recipe_id = recipe_ingredients.recipe_id
We get the table:
This joined table includes the recipe names along with the recipe IDs for each recipe-ingredient pair.
Getting back to our example from above, we can compute the ingredient
IDs for ‘Tomato Soup’ by joining
on the recipe ID.
SELECT recipe_ingredients.ingredient_id FROM recipes JOIN recipe_ingredients ON recipes.recipe_id = recipe_ingredients.recipe_id WHERE recipes.recipe_name = 'Tomato Soup'
In the next section, we will show how we can also join with the
ingredients table to directly get the ingredient names.
Having to use the full table names repeatedly in a SQL query is cumbersome. SQL provides a convenient shorthand where we can give each table a nickname:
SELECT b.ingredient_id FROM recipes AS a JOIN recipe_ingredients AS b ON a.recipe_id = b.recipe_id WHERE a.recipe_name = 'Tomato Soup'
This is the same query as before, but slightly less verbose.
JOINing Multiple Tables in SQL
SQL allows us to join multiple tables for even more powerful queries. Getting back to our original example, we can directly find the ingredient names for the ingredients in ‘Tomato Soup’ by joining all three tables together:
SELECT c.ingredient_name FROM recipes AS a JOIN recipe_ingredients AS b ON a.recipe_id = b.recipe_id JOIN ingredients AS c ON b.ingredient_id = c.ingredient_id WHERE a.recipe_name = "Tomato Soup"
As expected, this returns the table:
What is great about SQL is that by joining tables together, we can ask very diverse questions about our data. For example, finding all the recipes that include ‘tomatoes’ is just as straightforward:
SELECT a.recipe_name FROM recipes AS a JOIN recipe_ingredients AS b ON a.recipe_id = b.recipe_id JOIN ingredients AS c ON b.ingredient_id = c.ingredient_id WHERE c.ingredient_name = "tomatoes"
The GROUP BY Operator In SQL
The next important concept in SQL is aggregating rows.
This is done with the
GROUP BY command.
Supposed for example that we wanted to find the number
of ingredients in each recipe. We could do this by
grouping the rows in the
recipe_ingredients table by the
recipe ID and counting the number or grouped rows:
SELECT recipe_id, COUNT(ingredient_id) AS num_ingredients FROM recipe_ingredients GROUP BY recipe_id ORDER BY num_ingredients DESC
The code returns:
We can combine the
GROUP BY and
JOIN operators in a single query.
To compute in addition the price of each recipe, we would need to figure
out the price of each ingredient by joining with the ingredients table.
This query would look like:
SELECT recipe_id, COUNT(a.ingredient_id) AS num_ingredients, SUM(a.amount*b.ingredient_price) AS total_price FROM recipe_ingredients as a JOIN ingredients as b ON a.ingredient_id = b.ingredient_id GROUP BY a.recipe_id
Similarly, if we want to make the table display recipe names, we could also JOIN with the recipes tables:
SELECT c.recipe_name, COUNT(a.ingredient_id) AS num_ingredients, SUM(a.amount*b.ingredient_price) AS total_price FROM recipe_ingredients AS a JOIN ingredients AS b ON a.ingredient_id = b.ingredient_id JOIN recipes AS c ON a.recipe_id = c.recipe_id GROUP BY a.recipe_id
This returns a nicer formated table:
Finally, as a shorthand, SQL allows you to
refer to the columns in the
clause by numbering
SELECT c.recipe_name, COUNT(a.ingredient_id) AS num_ingredients, SUM(a.amount*b.ingredient_price) AS total_price FROM recipe_ingredients AS a JOIN ingredients AS b ON a.ingredient_id = b.ingredient_id JOIN recipes AS c ON a.recipe_id = c.recipe_id GROUP BY 1
Some people consider this to be more elegant and less error prone.
Aggregation Functions in SQL
As you saw above, SQL can apply different aggregation functions. This query demonstrates more of them:
SELECT COUNT(ingredient_price) as count, AVG(ingredient_price) as avg, SUM(ingredient_price) as sum, MIN(ingredient_price) as min, MAX(ingredient_price) as max, STDDEV(ingredient_price) as stddev, SUM(ingredient_price) as sum FROM ingredients
Note here that when you leave out the
GROUP BY class, but include
an aggregation function, SQL assumes that you want to group all
You can find the full list of aggregation functions in MySQL here.
The HAVING Operator in SQL
HAVING clause in SQL is almost exactly like the
clause, but filters the table after the aggregation has been
Suppose we wanted to find only recipes with 2 ingredients in it.
We could use the
SELECT recipe_id, COUNT(ingredient_id) AS num_ingredients FROM recipe_ingredients GROUP BY recipe_id HAVING num_ingredients = 2
This creates the table
As you will see below,
HAVING is just a convenient shorthand to
avoid using a subquery.
Subqueries in SQL
A more challenging query would be to make a list of the number of ingredients, but only for recipes that include tomatoes.
To do this, we first would need to find all the recipes which include tomatoes and then count the number of ingredients for each of those recipes.
We could imagine doing this in two steps. First, we find the recipes that have tomatoes in it:
SELECT a.recipe_id FROM recipe_ingredients AS a JOIN ingredients AS b ON a.ingredient_id = b.ingredient_id WHERE b.ingredient_name = 'Tomatoes'
This creates the table:
Next, we could joining this table with the ingredient count table from the query above to filter out the recipes that aren’t in this table.
This leads us to the idea of subqueries. Because every SQL query returns a table, another SQL query can be used instead of a table inside of another SQL query.
The final query is:
SELECT b.recipe_name, COUNT(a.ingredient_id) AS num_ingredients FROM recipe_ingredients AS a JOIN recipes AS b ON a.recipe_id = b.recipe_id JOIN ( SELECT c.recipe_id FROM recipe_ingredients AS c JOIN ingredients AS d ON c.ingredient_id = d.ingredient_id WHERE d.ingredient_name = 'Tomatoes' ) AS e ON b.recipe_id = e.recipe_id GROUP BY a.recipe_id
As expected, this returns
What’s cool about SQL is that it is very flexible and can allow multiple subqueries to be nested together.
The DISTINCT Operator
DISTINCT operator can be used to find all of the unique
For example, to find all the recipes that include either beef or cheese, we could use the SQL query:
SELECT DISTINCT recipe_name FROM recipe_ingredients AS a JOIN ingredients AS b ON a.ingredient_id = b.ingredient_id JOIN recipes AS c ON a.recipe_id = c.recipe_id WHERE b.ingredient_name = 'Cheese' OR b.ingredient_name = 'Beef'
Note that here the
keyword is required because otherwise two rows would
be returned for tacos since they contain both
cheese and beef.
We can count the number of distinct recipes by
COUNT keyword outside the
SELECT COUNT(DISTINCT recipe_name) AS num_recipes FROM recipe_ingredients AS a JOIN ingredients AS b ON a.ingredient_id = b.ingredient_id JOIN recipes AS c ON a.recipe_id = c.recipe_id WHERE b.ingredient_name = 'Cheese' OR b.ingredient_name = 'Beef'
The ORDER BY operator in SQL
ORDER BY can be used to sort the output rows based on a particular
column. For example, if we wanted to sort the ingredients by how
expensive they are in descending order of price, we could run the
SELECT * FROM ingredients ORDER BY ingredient_price DESC
If we wanted to sort columns of the same price alphabetically by name, we could use a similar query but perform a second sort based on the price:
SELECT * FROM ingredients ORDER BY ingredient_price DESC, ingredient_name
This creates the table:
The LIMIT operator in SQL
We can use the
LIMIT operator to limit the number of results returned
by the query. For example, to get only the most expensive ingredient, we could use
SELECT * FROM ingredients ORDER BY ingredient_price DESC LIMIT 1
This returns just one result:
Self AND Inequality Joins
The final concepts we will learn about is self and equality joins. As a concrete example, supposed that we wanted to compute the number of shared ingredients for all pairs of recipes.
To compute this, we can join the ingredient-recipe mapping table with itself and select for rows that have the same ingredient. This will create a row for ever matching ingredient in every pair of recipes:
SELECT a.recipe_id, b.recipe_id, a.ingredient_id FROM recipe_ingredients AS a JOIN recipe_ingredients AS b ON a.ingredient_id = b.ingredient_id AND a.recipe_id != b.recipe_id ORDER BY a.recipe_id, b.recipe_id
Note that we have to filter for
a.recipe_id != b.recipe_id to avoid
matching recipe with themselves. Joins with an inequality
condition are unsurprisingly called inequality joins.
This table shows the recipe 1 (“Tacos”) and recipe 2 (“Tomato Soup”) share ingredient 3 (“Tomatoes”). Similarly, recipe 1 (“Tacos”) and recipe 3 (“Grilled Cheese”) share ingredient 5 (“Cheese”).
One issue with this query is that it matches every pair of ingredients twice. To avoid this, we can modify the query to return only rows when the first recipe ID is less than the second.
Finally, we can can aggregate over the recipe IDs to compute the count of shared ingredients:
SELECT a.recipe_id, b.recipe_id, COUNT(*) as num_shared FROM recipe_ingredients AS a JOIN recipe_ingredients AS b ON a.recipe_id < b.recipe_id AND a.ingredient_id = b.ingredient_id GROUP BY a.recipe_id, b.recipe_id
As expected, this returns
We can include the recipe names by also joining with the recipes table:
SELECT c.recipe_name AS recipe_1, d.recipe_name AS recipe_2, COUNT(*) AS num_shared FROM recipe_ingredients AS a JOIN recipe_ingredients AS b ON a.recipe_id < b.recipe_id AND a.ingredient_id = b.ingredient_id JOIN recipes AS c ON a.recipe_id = c.recipe_id JOIN recipes AS d ON b.recipe_id = d.recipe_id GROUP BY a.recipe_id, b.recipe_id ORDER BY recipe_1, recipe_2
From these examples, I hope you see that the simple SQL operators can be combined to perform very powerful queries.