Author | Liu Zaoqi
Source | Early Python (ID: zaoqi-python)
SQL and Python are the two languages that almost every data analyst needs to know today. How do they differ when processing data? This article uses MySQL and pandas to demonstrate seven operations commonly used in data analysis. I hope it helps readers who have mastered one of these languages quickly pick up the other!
Before reading, you can download the sample data used in this article from the URL below, import it into MySQL and pandas, and type the code as you read.
https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/tips.csv
Select
In SQL, we can use the SELECT statement to select data from a table; the results are returned as a result table. The syntax is as follows:
SELECT column_name,column_name
FROM table_name;
If you don't want to display all the records, you can limit the number of rows with LIMIT (MySQL) or TOP (SQL Server). For example, to select some columns from the tips table, you can use the following statement:
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
In pandas, we can select columns by passing a list of column names to the DataFrame.
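As a minimal sketch of the column selection described above: the small `tips` frame here is a hand-made stand-in for tips.csv (its values are illustrative only), and `head(5)` plays the role of `LIMIT 5`.

```python
import pandas as pd

# Hand-made stand-in for tips.csv; the values are illustrative only.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "No", "No", "No", "No", "No"],
    "time": ["Dinner"] * 6,
})

# SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
selected = tips[["total_bill", "tip", "smoker", "time"]].head(5)
print(selected)
```

Note the double brackets: the inner pair is the Python list of column names, the outer pair is the DataFrame indexing operation.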
In SQL, you can perform calculations while selecting, such as adding a column
SELECT *, tip/total_bill as tip_rate
FROM tips
LIMIT 5;
This can also be done using DataFrame.assign() in pandas
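A possible pandas counterpart of the `tip_rate` query is sketched below. The sample frame is again a hand-made stand-in for tips.csv, and the name `with_rate` is only chosen for this illustration.

```python
import pandas as pd

# Hand-made stand-in for tips.csv; the values are illustrative only.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
})

# SELECT *, tip/total_bill AS tip_rate FROM tips LIMIT 5;
# assign() returns a new DataFrame and leaves `tips` unchanged.
with_rate = tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5)
print(with_rate)
```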
Find
Single condition search
In SQL, the WHERE clause extracts the records that meet the specified conditions. The syntax is as follows:
SELECT column_name,column_name
FROM table_name
WHERE column_name operator value;
For example, find the records in the sample data where time is 'Dinner':
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
In pandas, filtering by condition can take many forms. For example, you can pass a Series of True/False values to a DataFrame, which returns all rows marked True.
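The boolean-Series approach might look like the sketch below. The sample rows are made up for illustration, and `is_dinner` is just a name picked here for the intermediate mask.

```python
import pandas as pd

# Hand-made stand-in for tips.csv; the values are illustrative only.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 27.20, 22.76, 17.29, 19.44],
    "tip": [1.01, 1.66, 4.00, 3.00, 2.71, 3.00],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Lunch"],
})

# Comparing a column with a value yields a Series of True/False values.
is_dinner = tips["time"] == "Dinner"

# SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
dinner = tips[is_dinner].head(5)
print(dinner)
```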
Multi-condition search
In SQL, multi-condition search can be done using AND/OR
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In pandas, the same can be done by combining conditions with & (AND) and | (OR).
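A sketch of the multi-condition search in pandas, using made-up sample rows. The OR condition in the second query is an extra example that is not part of the SQL above.

```python
import pandas as pd

# Hand-made stand-in for tips.csv; the values are illustrative only.
tips = pd.DataFrame({
    "total_bill": [16.99, 48.27, 34.30, 7.25, 50.81],
    "tip": [1.01, 6.73, 6.70, 5.15, 10.00],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
})

# SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
dinner_big_tip = tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]

# Illustrative OR condition: WHERE time = 'Lunch' OR total_bill > 45;
lunch_or_large = tips[(tips["time"] == "Lunch") | (tips["total_bill"] > 45)]

print(dinner_big_tip)
print(lunch_or_large)
```

Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparison operators in Python.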
Find empty values
Checking for null values in pandas is done using the notna() and isna() methods.
frame[frame['col1'].notna()]
In SQL, this is done with IS NULL and IS NOT NULL:
SELECT *
FROM frame
WHERE col2 IS NULL;
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
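For reference, both SQL queries above can be mirrored in pandas roughly as follows. The contents of `frame` are an assumption made for this sketch, since the article does not show how the table is built.

```python
import pandas as pd

# Assumed sample data: one missing value in each column.
frame = pd.DataFrame({
    "col1": ["A", "B", None, "C", "D"],
    "col2": ["F", None, "G", "H", "I"],
})

# SELECT * FROM frame WHERE col2 IS NULL;
col2_null = frame[frame["col2"].isna()]

# SELECT * FROM frame WHERE col1 IS NOT NULL;
col1_not_null = frame[frame["col1"].notna()]

print(col2_null)
print(col1_not_null)
```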
Update
Use UPDATE in SQL
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In pandas there are many ways to do this, such as using the loc indexer
tips.loc[tips['tip'] < 2, 'tip'] *= 2
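To see the update in action, here is a small self-contained sketch on made-up rows. Only the rows selected by the condition are modified, and the change happens in place.

```python
import pandas as pd

# Hand-made stand-in for tips.csv; the values are illustrative only.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 30.00],
    "tip": [1.01, 1.66, 3.50, 5.00],
})

# UPDATE tips SET tip = tip*2 WHERE tip < 2;
# The first argument of loc selects the rows, the second the column.
tips.loc[tips["tip"] < 2, "tip"] *= 2
print(tips)
```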
Delete
Use DELETE in SQL
DELETE FROM tips
WHERE tip > 9;
In pandas, we choose the rows that should be kept instead of deleting them
tips = tips.loc[tips['tip'] <= 9]
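The sketch below runs the "keep what you want" approach on made-up rows and, as an alternative that the article does not mention, shows `DataFrame.drop()` removing the same rows by index label.

```python
import pandas as pd

# Hand-made stand-in for tips.csv; the values are illustrative only.
tips = pd.DataFrame({
    "total_bill": [16.99, 48.33, 21.01, 50.81],
    "tip": [1.01, 9.50, 3.50, 10.00],
})

# DELETE FROM tips WHERE tip > 9;  (keep the complement of the condition)
kept = tips.loc[tips["tip"] <= 9]

# Alternative: drop the rows whose index labels match the condition.
dropped = tips.drop(tips[tips["tip"] > 9].index)

print(kept)
```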
Grouping
In pandas, grouping is done with the groupby() method. groupby() typically refers to a process in which we split the data set into groups, apply a function (usually an aggregation) to each group, and then combine the results.
A common SQL operation is to get the number of records in each group across the data set. For example, counting the records for each sex:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
The equivalent operation in pandas is tips.groupby('sex').size(). Note that we use size() instead of count(). This is because count() applies the function to each column and returns the number of non-null records in each column!
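The difference between size() and count() is easiest to see when a value is missing. In the sketch below, the sample rows are made up and one `total_bill` is deliberately set to NaN.

```python
import pandas as pd

# Hand-made stand-in for tips.csv; one total_bill is missing on purpose.
tips = pd.DataFrame({
    "total_bill": [16.99, float("nan"), 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
})

# SELECT sex, count(*) FROM tips GROUP BY sex;
group_sizes = tips.groupby("sex").size()

# count() reports non-null values per column instead of rows per group.
non_null_counts = tips.groupby("sex").count()

print(group_sizes)
print(non_null_counts)
```

Here the Male group has three rows, so size() reports 3, while count() reports 2 for `total_bill` because one value is missing.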
Join
In pandas, you can use join() or merge() to perform joins. Each method has parameters that let you specify the type of join (LEFT, RIGHT, INNER, FULL) and the columns to join on.
Now let us create two sets of sample data and use code to demonstrate the different joins.
import numpy as np
import pandas as pd

# Two small frames that share a 'key' column; the values are random
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': np.random.randn(4)})
Inner join
An inner join uses a comparison operator to match rows in two tables based on the values of the columns they share. In SQL, inner joins are implemented with INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
In pandas you can use merge(), for example pd.merge(df1, df2, on='key'). merge() also provides parameters, such as left_on and right_index, for joining a column of one DataFrame with the index of another.
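A minimal sketch of both forms of the inner join. To keep the output reproducible, the `value` columns use fixed numbers here instead of the random values in the sample data; `indexed_df2` is a helper name introduced only for this example.

```python
import pandas as pd

# Same keys as the sample data, with fixed values for reproducibility.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"],
                    "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"],
                    "value": [10.0, 20.0, 30.0, 40.0]})

# SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;
# merge() performs an inner join by default.
inner = pd.merge(df1, df2, on="key")

# Joining a column of df1 against the index of another DataFrame.
indexed_df2 = df2.set_index("key")
by_index = pd.merge(df1, indexed_df2, left_on="key", right_index=True)

print(inner)
print(by_index)
```

Because both frames have a `value` column, pandas disambiguates them with the default suffixes `_x` (left) and `_y` (right).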
Left/right outer join
To achieve left/right outer joins in SQL, you can use LEFT OUTER JOIN and RIGHT OUTER JOIN
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
In pandas, the same can be achieved with merge() by setting the how keyword to 'left' or 'right'
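The left and right outer joins could be sketched as follows, again with fixed numbers in place of the random sample values. Rows without a match on the other side get NaN in the columns coming from that side.

```python
import pandas as pd

# Same keys as the sample data, with fixed values for reproducibility.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"],
                    "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"],
                    "value": [10.0, 20.0, 30.0, 40.0]})

# SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;
left = pd.merge(df1, df2, on="key", how="left")

# SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;
right = pd.merge(df1, df2, on="key", how="right")

print(left)
print(right)
```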
Full join
A full join returns all rows from both the left and right tables, whether or not they match, but not all databases support it; MySQL, for example, does not. Where it is supported, a full join is written in SQL with FULL OUTER JOIN
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
In pandas, you can also use merge() and set the how keyword to 'outer'
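A sketch of the full join in pandas, using fixed numbers instead of the random sample values. Every key from either frame appears in the result.

```python
import pandas as pd

# Same keys as the sample data, with fixed values for reproducibility.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"],
                    "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"],
                    "value": [10.0, 20.0, 30.0, 40.0]})

# SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
full = pd.merge(df1, df2, on="key", how="outer")
print(full)
```

Since pandas performs the join in memory, this works even when the data originally came from a database without FULL OUTER JOIN, such as MySQL.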
Union
The UNION operator in SQL merges the result sets of two or more SELECT statements. UNION is similar to UNION ALL, but UNION removes duplicate rows. The sample code is as follows:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
In pandas, you can use concat() to achieve UNION ALL
concat() keeps duplicate rows, just as UNION ALL does. To remove them, as UNION does, call drop_duplicates() on the result.
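Both variants can be sketched as below. The two frames are built to match the city/rank rows shown in the SQL output above; the variable names `union_all` and `union` are chosen only for this illustration.

```python
import pandas as pd

# Frames matching the rows shown in the SQL output comment.
df1 = pd.DataFrame({"city": ["Chicago", "San Francisco", "New York City"],
                    "rank": [1, 2, 3]})
df2 = pd.DataFrame({"city": ["Chicago", "Boston", "Los Angeles"],
                    "rank": [1, 4, 5]})

# SELECT city, rank FROM df1 UNION ALL SELECT city, rank FROM df2;
union_all = pd.concat([df1, df2])

# SELECT city, rank FROM df1 UNION SELECT city, rank FROM df2;
union = pd.concat([df1, df2]).drop_duplicates()

print(union_all)
print(union)
```

concat() keeps the original index labels of each frame, so labels repeat in the result; passing `ignore_index=True` renumbers the rows if that matters.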
That is all for this article. As you can see, each language has its own strengths in different scenarios. To learn more, read the official documentation and keep practicing!
Source: pandas official documentation
https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
Compilation: Liu Zaoqi (with deletion and modification)