【Pandas】 SQL Queries

Introduction

Pandas is a very popular Python library for data analysis. However, if you are used to retrieving and manipulating data from a database using SQL, Pandas' syntax can be a bit daunting. This is where pandasql comes in handy: it lets you run SQL queries directly on Pandas DataFrames.

In this blog post, I will show you how to run SQL queries on a Pandas DataFrame using pandasql. The examples use the "tips" dataset from the seaborn library.

About pandasql

Installation

```bash
pip install pandas seaborn pandasql
```

# Code
```python
import pandas as pd
import seaborn as sns
from pandasql import sqldf

# Load the tips dataset from seaborn
tips = sns.load_dataset('tips')

query1 = """
SELECT *
FROM tips
LIMIT 5;
"""

result1 = sqldf(query1, globals())
print(result1)

query2 = """
SELECT *
FROM tips
WHERE total_bill > 30
LIMIT 5;
"""

# Filter rows with a WHERE clause
result2 = sqldf(query2, globals())
print(result2)

query3 = """
SELECT day, avg(total_bill)
FROM tips
GROUP BY day;
"""

# Aggregate the total bill per day with GROUP BY
result3 = sqldf(query3, globals())
print(result3)
```

Commentary

```python
import pandas as pd
import seaborn as sns
from pandasql import sqldf

# Load the tips dataset from seaborn
tips = sns.load_dataset('tips')
```

Import the required libraries and load the dataset.
Seaborn's "tips" dataset contains records of tips paid in a restaurant.

```python
query = """
SELECT *
FROM tips
LIMIT 5;
"""

result = sqldf(query, globals())
print(result)
#    total_bill   tip     sex smoker  day    time  size
# 0       16.99  1.01  Female     No  Sun  Dinner     2
# 1       10.34  1.66    Male     No  Sun  Dinner     3
# 2       21.01  3.50    Male     No  Sun  Dinner     3
# 3       23.68  3.31    Male     No  Sun  Dinner     2
# 4       24.59  3.61  Female     No  Sun  Dinner     4
```

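Use pandasql's sqldf() function to execute the query. The second argument, globals(), tells sqldf where to look up the DataFrame named in the FROM clause (here, tips).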
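
For readers curious about what happens behind that call: as far as I know, pandasql by default copies the referenced DataFrames into an in-memory SQLite database and runs the query there. The following is only a rough sketch of that idea using pandas and the standard-library sqlite3 module. It is not pandasql's actual implementation, and tips_sample is a small made-up stand-in for the real dataset.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the tips data (a few hand-typed rows)
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 35.26, 21.01, 39.42],
    "tip": [1.01, 5.00, 3.50, 7.58],
    "day": ["Sun", "Sun", "Sun", "Sat"],
})

# Copy the DataFrame into an in-memory SQLite table named "tips"
conn = sqlite3.connect(":memory:")
tips_sample.to_sql("tips", conn, index=False)

# Run plain SQL against that table and read the result back as a DataFrame
first_rows = pd.read_sql_query("SELECT * FROM tips LIMIT 2;", conn)
conn.close()

print(first_rows)
```

One practical consequence of this design is that queries are written in the SQLite dialect of SQL.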

```python
query2 = """
SELECT *
FROM tips
WHERE total_bill > 30
LIMIT 5;
"""

result2 = sqldf(query2, globals())
print(result2)
#    total_bill   tip     sex smoker  day    time  size
# 0       35.26  5.00  Female     No  Sun  Dinner     4
# 1       39.42  7.58    Male     No  Sat  Dinner     4
# 2       31.27  5.00    Male     No  Sat  Dinner     3
# 3       30.40  5.60    Male     No  Sun  Dinner     4
# 4       32.40  6.00    Male     No  Sun  Dinner     4
```

Rows can be filtered using the WHERE clause.
This query extracts rows where the total bill exceeds 30.
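
For comparison, here is a minimal sketch of how the same filter might look in plain Pandas. It assumes a small hand-made DataFrame, tips_sample, rather than the real dataset, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the tips data
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 35.26, 21.01, 39.42, 30.40],
    "tip": [1.01, 5.00, 3.50, 7.58, 5.60],
})

# WHERE total_bill > 30  -> boolean mask
# LIMIT 5                -> head(5)
expensive = tips_sample[tips_sample["total_bill"] > 30].head(5)

# Renumber the rows from 0, as the SQL result does
expensive = expensive.reset_index(drop=True)
print(expensive)
```

The boolean mask plays the role of WHERE, and head(5) plays the role of LIMIT 5.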

```python
query3 = """
SELECT day, avg(total_bill)
FROM tips
GROUP BY day;
"""

result3 = sqldf(query3, globals())
print(result3)
#     day  avg(total_bill)
# 0   Fri        17.151579
# 1   Sat        20.441379
# 2   Sun        21.410000
# 3  Thur        17.682742
```

Rows can be grouped using GROUP BY.
This query groups by day (day of the week) and computes the average total bill.
The average bill tends to be higher on Saturdays and Sundays.
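
The Pandas counterpart of this aggregation could be sketched as follows. Again, tips_sample is a made-up miniature dataset, and avg_total_bill is just a column name I chose for the example.

```python
import pandas as pd

# Hypothetical stand-in for the tips data
tips_sample = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
})

# GROUP BY day + avg(total_bill), using named aggregation
avg_by_day = tips_sample.groupby("day", as_index=False).agg(
    avg_total_bill=("total_bill", "mean")
)
print(avg_by_day)
```

On the SQL side, writing avg(total_bill) AS avg_total_bill would give the result column a similarly readable name.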

In summary

pandasql is a powerful tool for querying Pandas DataFrames using SQL. For data analysts and data scientists familiar with SQL, this library can make using Pandas more intuitive.

In this blog post, I introduced the basic usage of pandasql through examples on seaborn's tips dataset: selecting rows, filtering with WHERE, and aggregating with GROUP BY.

Try pandasql to simplify your data analysis!

Origin blog.csdn.net/Allan_lam/article/details/134964931