SQL Data Science: Understanding and Exploiting Joins

What is a join in SQL?

SQL joins allow you to combine data from multiple database tables based on common columns. This way, you can join information together and create meaningful connections between related datasets.

Join Types in SQL

There are several types of SQL joins:

  • inner join
  • left outer join
  • right outer join
  • full outer join
  • cross join

Let's explain each type.

SQL inner join

An inner join returns only rows where there is a match in the two tables being joined. It merges rows from two tables based on a shared key or column, discarding non-matching rows.

[Figure: visualization of an inner join]

In SQL, this type of join is performed using the keywords JOIN or INNER JOIN.
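As a minimal sketch of this behavior, the snippet below runs an inner join through Python's built-in sqlite3 module. The tables authors and books, and every row in them, are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical sample tables; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books VALUES (10, 1, 'Joins'), (11, 1, 'Indexes'), (12, 4, 'Orphan');
""")

# Only rows with a match on both sides survive an inner join.
inner_rows = conn.execute("""
    SELECT a.name, b.title
    FROM authors a
    INNER JOIN books b ON a.id = b.author_id
    ORDER BY b.id
""").fetchall()
print(inner_rows)
```

Bob and Cy have no books, and the book with author_id 4 has no author, so none of them appear in the result.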

SQL left outer join

A left outer join returns all rows from the left (or first) table and the matching rows from the right (or second) table. If there is no match, it returns NULL values for the columns of the right table.

[Figure: visualization of a left outer join]

If you want to use this join in SQL, you can use the LEFT OUTER JOIN or LEFT JOIN keywords. The two spellings are equivalent.
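The sketch below shows the NULL-filling behavior of a left join using Python's built-in sqlite3 module. The authors and books tables are hypothetical sample data, not part of any question discussed later.

```python
import sqlite3

# Hypothetical sample tables; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books VALUES (10, 1, 'Joins'), (11, 1, 'Indexes');
""")

# Every author is kept; authors without books get NULL (None in Python) for title.
left_rows = conn.execute("""
    SELECT a.name, b.title
    FROM authors a
    LEFT JOIN books b ON a.id = b.author_id
    ORDER BY a.id, b.id
""").fetchall()
print(left_rows)
```

The sqlite3 module maps SQL NULL to Python's None, which is how the unmatched authors show up in the result.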

SQL right outer join

A right join is the opposite of a left join. It returns all rows from the right table and the matching rows from the left table. If there is no match, it returns NULL values for the columns of the left table.

In SQL, this type of join is performed using the keywords RIGHT OUTER JOIN or RIGHT JOIN.

SQL full outer join

A full outer join returns all rows from both tables, matching rows where possible, and filling non-matching rows with NULL values.

The keywords for this join in SQL are FULL OUTER JOIN or FULL JOIN.

SQL cross join

This type of join combines all rows from one table with all rows from a second table. In other words, it returns the Cartesian product: every possible pairing of a row from the first table with a row from the second.

[Figure: visualization of a cross join]

When cross-joining in SQL, the keyword is CROSS JOIN.

Understanding SQL join syntax

To perform a join in SQL, you need to specify the tables to join, the columns to match on, and the type of join to perform. The basic syntax for joining tables in SQL is as follows:

SELECT columns
FROM table1
JOIN table2
ON table1.column = table2.column;

This is the general pattern for JOIN.

Reference the first (or left) table in the FROM clause. Then follow it with JOIN and the second (or right) table.

Then there is the join condition in the ON clause. Here you can specify the columns that will be used to join the two tables. Typically, it's a shared column that is a primary key in one table and a foreign key in a second table.

Note: A primary key is a unique identifier for each record in a table. A foreign key establishes a link between two tables, i.e. it is a column in the second table that references the first table. We'll show you what this means in an example.
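To make the primary key and foreign key relationship concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are borrowed from the first example later in the article, but the constraint declarations and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE ms_projects (
        id INTEGER PRIMARY KEY,          -- unique identifier of each project
        title TEXT,
        budget INTEGER
    );
    CREATE TABLE ms_emp_projects (
        emp_id INTEGER,
        project_id INTEGER REFERENCES ms_projects (id)  -- foreign key
    );
    INSERT INTO ms_projects VALUES (1, 'Project1', 30000);
    INSERT INTO ms_emp_projects VALUES (100, 1);
""")

# A row pointing at a project that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO ms_emp_projects VALUES (101, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

The rejected insert shows what the foreign key guarantees: every project_id in the second table refers to an existing id in the first.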

If you want to use left join, right join or full join, you just use these keywords instead of JOIN - everything else in the code is exactly the same!

The situation with cross joins is slightly different. Their nature is to join all combinations of rows in the two tables, which is why no ON clause is required. The syntax is shown below.

SELECT columns
FROM table1
CROSS JOIN table2;

In other words, you only need to reference one table in the FROM and the second table in the CROSS JOIN.

Alternatively, you can reference the two tables in FROM and separate them with a comma - this is shorthand for CROSS JOIN.

SELECT columns
FROM table1, table2;
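The two spellings can be compared directly. The sketch below, which uses Python's built-in sqlite3 module and two hypothetical tables (sizes and colors), checks that both forms return the same Cartesian product.

```python
import sqlite3

# Hypothetical sample tables; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sizes (label TEXT);
    CREATE TABLE colors (name TEXT);
    INSERT INTO sizes VALUES ('S'), ('M');
    INSERT INTO colors VALUES ('red'), ('green'), ('blue');
""")

explicit = conn.execute(
    "SELECT s.label, c.name FROM sizes s CROSS JOIN colors c"
).fetchall()
implicit = conn.execute(
    "SELECT s.label, c.name FROM sizes s, colors c"
).fetchall()

# 2 sizes x 3 colors = 6 combinations, whichever syntax is used.
print(len(explicit), sorted(explicit) == sorted(implicit))
```

Because the row count is the product of the two table sizes, cross joins on large tables grow very quickly.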

Self-join: a special type of join in SQL

There is also a specific way of joining tables: joining a table with itself. This is known as a self-join.

It's not exactly a unique type of join, since any of the previously mentioned join types can also be used for self-joins.

The syntax for self-joins is similar to what I showed you earlier. The main difference is that the same table is referenced in both FROM and JOIN.

SELECT columns
FROM table1 t1
JOIN table1 t2
ON t1.column = t2.column;

Also, you need to give the tables two aliases to differentiate them. What you're doing is joining the table with itself and treating it as two tables.
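A common use of a self-join is a table whose rows refer to other rows of the same table. The sketch below assumes a hypothetical employees table with a manager_id column and runs the query through Python's built-in sqlite3 module.

```python
import sqlite3

# Hypothetical table: manager_id points at another row of the same table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),
        (2, 'Ben', 1),
        (3, 'Cleo', 1),
        (4, 'Dev', 2);
""")

# The aliases e and m let the same table play two roles: employee and manager.
pairs = conn.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    ORDER BY e.id
""").fetchall()
print(pairs)
```

Ada has no manager, so the inner self-join drops her row. A LEFT JOIN would keep her with a NULL manager.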

I just wanted to mention this here, but I won't go into further detail. If you're interested in self-joins, see this illustrated guide to self-joins in SQL.

SQL join example

Time to show you how everything I mentioned works in practice. I'll use the SQL JOIN interview questions from StrataScratch to show each of the different types of joins in SQL.

1. JOIN example

This question from Microsoft wants you to list each project and calculate the project's budget per employee.

Expensive Projects

"Given a list of projects and employees mapped to each project, calculate the amount of project budget allocated to each employee. The output should include the project title and the project budget per employee, rounded to the closest integer. Order your list by projects with the highest budget per employee first."

Data

The question gives two tables.

ms_projects

id: int
title: varchar
budget: int

ms_emp_projects

emp_id: int
project_id: int

Now, the column id in the table ms_projects is the primary key of the table. The same column can be found in the table ms_emp_projects, albeit with a different name: project_id. This is a foreign key in that table, referencing the first table.

I will use these two columns to join the tables in the solution.

Code

SELECT title AS project,
       ROUND((budget/COUNT(emp_id)::FLOAT)::NUMERIC, 0) AS budget_emp_ratio
FROM ms_projects a
JOIN ms_emp_projects b 
ON a.id = b.project_id
GROUP BY title, budget
ORDER BY budget_emp_ratio DESC;

I have joined the two tables using JOIN. The table ms_projects is referenced in FROM, and ms_emp_projects is referenced after JOIN. I gave both tables aliases so that I don't have to use their long names later.

Now, I need to specify the columns on which to join the tables. I've already mentioned which column is the primary key in one table and which is the foreign key in the other, so I'll use them here.

I equate these two columns because I want to get all rows with the same project ID. I also put each table's alias in front of its column.

Now I can access the data in both tables and list the columns in SELECT. The first column is the project title, and the second column is calculated.

This calculation uses the COUNT() function to count the number of employees on each project. Then, I divide each project's budget by the number of employees. I cast the count to a floating-point value so that the division is not truncated to an integer, then cast the result to NUMERIC and round it to zero decimal places.
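To see why the cast matters, here is a small sketch that runs an adapted version of the query in SQLite through Python's built-in sqlite3 module. The sample rows are made up, and the PostgreSQL `::` casts are replaced with CAST(... AS REAL) because SQLite does not support the `::` syntax. The extra truncated_ratio column is added here only for comparison.

```python
import sqlite3

# Made-up sample data; only the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ms_projects (id INTEGER PRIMARY KEY, title TEXT, budget INTEGER);
    CREATE TABLE ms_emp_projects (emp_id INTEGER, project_id INTEGER);
    INSERT INTO ms_projects VALUES (1, 'Alpha', 10001), (2, 'Beta', 9000);
    INSERT INTO ms_emp_projects VALUES (100, 1), (101, 1), (102, 2), (103, 2), (104, 2);
""")

ratios = conn.execute("""
    SELECT title AS project,
           ROUND(budget / CAST(COUNT(emp_id) AS REAL), 0) AS budget_emp_ratio,
           budget / COUNT(emp_id) AS truncated_ratio  -- integer division, for comparison
    FROM ms_projects a
    JOIN ms_emp_projects b ON a.id = b.project_id
    GROUP BY title, budget
    ORDER BY budget_emp_ratio DESC
""").fetchall()
print(ratios)
```

For Alpha, 10001 divided by 2 employees is 5000.5, which rounds to 5001, while plain integer division silently returns 5000.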

Output

[Figure: query output]

2. Left join example

Let's practice this join on an Airbnb interview question. It wants you to find the number of orders, the number of customers, and the total cost of the order for each city.

Customer Orders and Details

"Find the number of orders, the number of customers, and the total cost of orders for each city. Only include cities where at least 5 orders have been placed, and count all customers in each city, even if they did not place an order.

Output each calculation along with the corresponding city name."

Data

You are given the tables customers and orders.

customers

id: int
first_name: varchar
last_name: varchar
city: varchar
address: varchar
phone_number: varchar

orders

id: int
cust_id: int
order_date: datetime
order_details: varchar
total_order_cost: int

The shared columns are id from the table customers and cust_id from the table orders. I will use these columns to join the tables.

Code

Here's how to solve this problem using a left join.

SELECT c.city,
       COUNT(DISTINCT o.id) AS orders_per_city,
       COUNT(DISTINCT c.id) AS customers_per_city,
       SUM(o.total_order_cost) AS orders_cost_per_city
FROM customers c
LEFT JOIN orders o ON c.id = o.cust_id
GROUP BY c.city
HAVING COUNT(o.id) >=5;

I reference the table customers in FROM (which is our left table) and left join it with orders on the customer id column.

Now I can select the city, use COUNT() to get the number of orders and customers by city, and use SUM() to calculate the total order cost by city.

To get all these calculations by city, I group the output by city.

There is an additional requirement in the question: "Only include cities where at least 5 orders have been placed..." I use HAVING to show only cities with five or more orders.

The question is, why did I use LEFT JOIN instead of JOIN? The clue is in the question: "...and count all customers in each city, even if they did not place an order." It might be that not all customers have placed an order. This means I want to keep every row from the table customers, which is exactly what the definition of a left join provides.

If I use JOIN, the result will be wrong because I will miss customers who didn't place any order.
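The difference is easy to reproduce. The sketch below runs the same query twice, once with LEFT JOIN and once with JOIN, using Python's built-in sqlite3 module. The rows are made up and the tables keep only the columns the query needs.

```python
import sqlite3

# Made-up data: three customers in one city, but only customer 1 has orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER, total_order_cost INTEGER);
    INSERT INTO customers VALUES (1, 'Oslo'), (2, 'Oslo'), (3, 'Oslo');
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 10), (3, 1, 10), (4, 1, 10), (5, 1, 10);
""")

QUERY = """
    SELECT c.city,
           COUNT(DISTINCT o.id) AS orders_per_city,
           COUNT(DISTINCT c.id) AS customers_per_city,
           SUM(o.total_order_cost) AS orders_cost_per_city
    FROM customers c
    {join_type} orders o ON c.id = o.cust_id
    GROUP BY c.city
    HAVING COUNT(o.id) >= 5
"""
left_result = conn.execute(QUERY.format(join_type="LEFT JOIN")).fetchall()
inner_result = conn.execute(QUERY.format(join_type="JOIN")).fetchall()
print(left_result)   # counts all three customers
print(inner_result)  # counts only the customer who placed orders
```

Both versions agree on the number of orders and the total cost. Only the customer count differs, which is why the wrong join type can go unnoticed.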

Note: The complexity of joins in SQL is not reflected in their syntax, but in their semantics! As you can see, each join is written the same way, only the keywords have changed. However, each join works differently and thus can output different results depending on the data. Therefore, you must fully understand what each join does, and choose the one that returns exactly what you want!

Output

[Figure: query output]

3. Right join example

A right join is the mirror image of a left join. That's why I can easily solve the previous problem using RIGHT JOIN. Let me tell you how.

Data

The tables remain the same; I'll just use a different type of join.

Code

SELECT c.city,
       COUNT(DISTINCT o.id) AS orders_per_city,
       COUNT(DISTINCT c.id) AS customers_per_city,
       SUM(o.total_order_cost) AS orders_cost_per_city
FROM orders o
RIGHT JOIN customers c ON o.cust_id = c.id 
GROUP BY c.city
HAVING COUNT(o.id) >=5;

Here's what changed. When I use RIGHT JOIN, I switch the order of the tables. Now the table orders becomes the left table and the table customers becomes the right one. The join condition remains the same. I just switched the order of the columns to reflect the order of the tables, but that wasn't necessary.

By switching the order of the tables and using a RIGHT JOIN, I'm outputting all customers again, even if they haven't placed any orders.

The rest of the query is the same as in the previous example. The same goes for output.

Note: In practice, right joins are relatively rarely used. LEFT JOIN seems more natural to SQL users, so they use it more often. Anything that can be done with a RIGHT JOIN can also be done with a LEFT JOIN. Therefore, there is no specific situation where RIGHT JOIN may be preferred.

Output

[Figure: query output, the same as in the left join example]

4. Full join example

This question from Salesforce and Tesla wants you to calculate the net difference between the number of products companies launched in 2020 and the number they launched the year before.

New Products

"You are given a table of product launches by company by year. Write a query to count the net difference between the number of products companies launched in 2020 and the number of products companies launched in the previous year. Output the names of the companies and the net difference in products released in 2020 compared to the previous year."

Data

The question provides one table with the following columns.

car_launches

year: int
company_name: varchar
product_name: varchar

How can I join tables when there is only one table? Well, let's see that too!

Code

This query is a bit complicated, so I'll reveal it gradually.

SELECT company_name,
       product_name AS brand_2020
FROM car_launches
WHERE YEAR = 2020;

The first SELECT statement finds the company and product names for the year 2020. This query will later be converted to a subquery.

This question wants you to find the difference between 2020 and 2019. So let's write the same query for 2019.

SELECT company_name,
       product_name AS brand_2019
FROM car_launches
WHERE YEAR = 2019;

Now I'm going to turn these queries into subqueries and join them using a full outer join.

SELECT *
FROM
  (SELECT company_name,
          product_name AS brand_2020
   FROM car_launches
   WHERE YEAR = 2020) a
FULL OUTER JOIN
  (SELECT company_name,
          product_name AS brand_2019
   FROM car_launches
   WHERE YEAR = 2019) b 
ON a.company_name = b.company_name;

Subqueries can be treated as tables and thus can be joined. I give the first subquery an alias and put it in the FROM clause. I then joined it with a second subquery on the company name column using a "full outer join".

Using this type of SQL join, I am merging all companies and products in 2020 with all companies and products in 2019.

Now I can complete my query. I select the company name. Also, I use the COUNT() function to find the number of products launched in each year, then subtract the 2019 count from the 2020 count to get the difference. Finally, I group the output by company and sort it alphabetically by company.

Here is the entire query.

-- COALESCE keeps the name of a company that launched products in only one of the two years
SELECT COALESCE(a.company_name, b.company_name) AS company_name,
       (COUNT(DISTINCT a.brand_2020)-COUNT(DISTINCT b.brand_2019)) AS net_products
FROM
  (SELECT company_name,
          product_name AS brand_2020
   FROM car_launches
   WHERE YEAR = 2020) a
FULL OUTER JOIN
  (SELECT company_name,
          product_name AS brand_2019
   FROM car_launches
   WHERE YEAR = 2019) b 
ON a.company_name = b.company_name
GROUP BY COALESCE(a.company_name, b.company_name)
ORDER BY company_name;

Output

[Figure: list of companies and their differences in product launches between 2020 and 2019]
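As a cross-check, the same numbers can be computed without a join at all. The sketch below is not the solution discussed above but a different technique, conditional aggregation, run through Python's built-in sqlite3 module on made-up rows. It avoids FULL OUTER JOIN, which older SQLite versions do not support.

```python
import sqlite3

# Made-up rows; only the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE car_launches (year INTEGER, company_name TEXT, product_name TEXT);
    INSERT INTO car_launches VALUES
        (2019, 'Toyota', 'Avalon'),
        (2019, 'Toyota', 'Camry'),
        (2020, 'Toyota', 'Corolla'),
        (2019, 'Honda', 'Accord'),
        (2020, 'Honda', 'Passport'),
        (2020, 'Honda', 'Pilot'),
        (2019, 'Jeep', 'Cherokee');
""")

# CASE yields NULL for the other year, and COUNT ignores NULLs,
# so each COUNT sees only the products of one year.
net = conn.execute("""
    SELECT company_name,
           COUNT(DISTINCT CASE WHEN year = 2020 THEN product_name END)
         - COUNT(DISTINCT CASE WHEN year = 2019 THEN product_name END) AS net_products
    FROM car_launches
    WHERE year IN (2019, 2020)
    GROUP BY company_name
    ORDER BY company_name
""").fetchall()
print(net)
```

Jeep launched products only in 2019 in this sample, so it is the kind of row that the full outer join has to preserve as well.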

5. Cross join example

This question from Deloitte is great for showing how CROSS JOIN works.

Maximum of Two Numbers

"Given a list of numbers, consider all possible permutations of two numbers, assuming that pairs of numbers (x, y) and (y, x) are two different permutations. Then, for each permutation, find the maximum of the two numbers.

Output three columns: the first number, the second number, and the maximum of the two."

In other words, you need to pair every number in the list with every number in the list, and then find the larger value in each pair.

Data

This question gives us a table with one column.

deloitte_numbers

number: int

Code

This code is an example of a CROSS JOIN, which is also an example of a self-join.

SELECT dn1.number AS number1,
       dn2.number AS number2,
       CASE
           WHEN dn1.number > dn2.number THEN dn1.number
           ELSE dn2.number
       END AS max_number
FROM deloitte_numbers AS dn1
CROSS JOIN deloitte_numbers AS dn2;

I reference the table in FROM and give it an alias. I then cross-join it with itself by referencing it after the cross-join and giving the table another alias.

The single table can now be used as if it were two. I select the number column from each copy. I then use a CASE expression to return the larger of the two numbers.

Why use a cross join here? Remember, it's a type of SQL join that will show all combinations of all rows from all tables. That's exactly what the question is asking!
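To see the result on a tiny input, the sketch below runs the query against three made-up numbers using Python's built-in sqlite3 module. An ORDER BY is added here only to make the printed order predictable.

```python
import sqlite3

# Made-up numbers; the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deloitte_numbers (number INTEGER);
    INSERT INTO deloitte_numbers VALUES (1), (2), (3);
""")

permutations = conn.execute("""
    SELECT dn1.number AS number1,
           dn2.number AS number2,
           CASE
               WHEN dn1.number > dn2.number THEN dn1.number
               ELSE dn2.number
           END AS max_number
    FROM deloitte_numbers AS dn1
    CROSS JOIN deloitte_numbers AS dn2
    ORDER BY number1, number2
""").fetchall()

# 3 numbers give 3 x 3 = 9 ordered pairs, including each number paired with itself.
for row in permutations:
    print(row)
```

Both (1, 2) and (2, 1) appear, each with a maximum of 2, which matches the requirement to treat them as different permutations.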

Output

[Figure: snapshot of the number pairs and the larger number in each pair]

Using SQL joins for data science

Now that you know how to use SQL joins, the question is how to use this knowledge in data science.

SQL joins play a vital role in data science tasks such as data exploration, data cleaning, and feature engineering.

Here are a few examples of how to take advantage of SQL joins:

  1. Merging data: By joining tables, disparate data sources can be brought together to analyze relationships and correlations between multiple data sets. For example, joining a customer table with a transaction table can provide insights into customer behavior and buying patterns.
  2. Data Validation: Joins can be used to validate data quality and integrity. By comparing data from different tables, inconsistencies, missing values, or outliers can be identified. This helps you with data cleansing and ensures that the data used for analysis is accurate and reliable.
  3. Feature Engineering: Joins help create new features for machine learning models. By combining related tables, you can extract meaningful information and generate features that capture important relationships in your data. This can enhance the predictive power of the model.
  4. Aggregation and Analysis: Joins enable you to perform complex aggregation and analysis across multiple tables. By combining data from various sources, you can gain a holistic view of your data and gain valuable insights. For example, joining a sales table with a product table can help you analyze sales performance by product category or region.
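The data validation point can be sketched with a left join that keeps only the unmatched rows, sometimes called an anti-join. The example uses Python's built-in sqlite3 module, and the customers and transactions tables are hypothetical.

```python
import sqlite3

# Hypothetical tables: transaction 2 refers to a customer that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO transactions VALUES (1, 1, 20), (2, 3, 35), (3, 2, 15);
""")

# After the LEFT JOIN, unmatched transactions have NULL in every customers column,
# so filtering on c.id IS NULL isolates the orphaned rows.
orphans = conn.execute("""
    SELECT t.id, t.customer_id
    FROM transactions t
    LEFT JOIN customers c ON t.customer_id = c.id
    WHERE c.id IS NULL
    ORDER BY t.id
""").fetchall()
print(orphans)
```

An empty result would mean every transaction points at a known customer. Here the check flags one inconsistent row.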

Best practices for SQL joins

As I already mentioned, the complexity of joins is not reflected in their syntax. You see that the syntax is relatively simple.

Best practices for joins reflect this as well: they are concerned not with the code itself, but with what the join does and how it performs.

To get the most out of joins in SQL, consider the following best practices.

  1. Know your data: Become familiar with the structures and relationships in your data. This will help you choose the appropriate join type, and choose the correct columns to match against.
  2. Use indexes: If your tables are large or joined frequently, consider adding indexes on the columns used for joins. Indexes can significantly improve query performance.
  3. Be careful with performance: Joining large tables or multiple tables can be computationally expensive. Optimize queries by filtering data, using appropriate join types, and considering temporary tables or subqueries.
  4. Test and Validate: Always validate join results for correctness. Perform sanity checks and verify that the joined data conforms to your expectations and business logic.
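One simple sanity check for the last point is to compare row counts before and after a join. The sketch below uses Python's built-in sqlite3 module and hypothetical tables in which the lookup table accidentally contains a duplicate key.

```python
import sqlite3

# Hypothetical tables: customer_names has two rows for cust_id 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER);
    CREATE TABLE customer_names (cust_id INTEGER, name TEXT);  -- no uniqueness constraint
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO customer_names VALUES (1, 'Ann'), (1, 'Ann B.'), (2, 'Bob');
""")

orders_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
joined_count = conn.execute("""
    SELECT COUNT(*)
    FROM orders o
    JOIN customer_names n ON o.cust_id = n.cust_id
""").fetchone()[0]

# If the join key were unique in customer_names, both counts would match.
fan_out_detected = joined_count > orders_count
print(orders_count, joined_count, fan_out_detected)
```

A joined count larger than the original one means rows were duplicated, which would inflate any SUM or COUNT computed on top of the join.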

Conclusion

SQL joins are a fundamental concept that enables data scientists to combine and analyze data from multiple sources. By understanding the different types of SQL joins, mastering their syntax, and using them effectively, data scientists can unlock valuable insights, validate data quality, and make data-driven decisions.

I showed you how to do this with five examples. Now you can harness the power of SQL joins in your data science projects and get better results.

Original Link: SQL Data Science: Understanding and Exploiting Joins (mvrlink.com)

Origin blog.csdn.net/ygtu2018/article/details/132145308