Python or SQL for introductory data analysis in 2020? A comparison of seven common operations!

Author | Liu Zaoqi

Source | Early Python (ID: zaoqi-python)

Head picture | CSDN download from Oriental IC

SQL and Python are two languages that nearly every data analyst today needs to know. How do they differ when processing data? This article uses MySQL and pandas to demonstrate seven operations commonly used in data analysis. I hope it helps readers who have mastered one of these languages quickly pick up the other!

Before reading on, you can download the sample data used in this article from the URL below and import it into both MySQL and pandas, so that you can type the code as you read!

https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/tips.csv

Select

In SQL, we can use the SELECT statement to select data from a table. The results are stored in a result table. The syntax is as follows:

SELECT column_name,column_name
FROM table_name;

If you don't want to display all the records, you can limit the number of rows with LIMIT (MySQL) or TOP (SQL Server). For example, to select some columns from the tips table, you can use the following statement:

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

In pandas, column selection is done by passing a list of column names to the DataFrame, for example tips[['total_bill', 'tip', 'smoker', 'time']].head(5).
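The pandas snippet for this step can be sketched as follows. To keep the example self-contained it builds a small hand-typed stand-in for the tips table (same column names as tips.csv, illustrative values) rather than loading the file; with the real data you would read it with pd.read_csv() first.

```python
import pandas as pd

# Illustrative stand-in for tips.csv: same column names, hand-typed values.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "No", "No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2, 4, 4],
})

# Equivalent of: SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
selected = tips[["total_bill", "tip", "smoker", "time"]].head(5)
print(selected)
```

Note the double brackets: the outer pair indexes the DataFrame, the inner pair is the list of column names.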

In SQL, you can perform calculations while selecting, such as adding a computed column:

SELECT *, tip/total_bill as tip_rate
FROM tips
LIMIT 5;

This can also be done in pandas with DataFrame.assign(), for example tips.assign(tip_rate=tips['tip'] / tips['total_bill']).
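A minimal sketch of the assign() version, again using a small hand-typed stand-in for the tips table (the values are illustrative, not loaded from the file):

```python
import pandas as pd

# Illustrative stand-in for tips.csv: same column names, hand-typed values.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "smoker": ["No", "No", "No", "No", "No", "No"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Equivalent of: SELECT *, tip/total_bill AS tip_rate FROM tips LIMIT 5;
# assign() returns a new DataFrame and leaves `tips` unchanged.
with_rate = tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5)
print(with_rate)
```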

  

Find

Single condition search

In SQL, the WHERE clause is used to extract records that meet the specified conditions. The syntax is as follows:

SELECT column_name,column_name
FROM table_name
WHERE column_name operator value;

For example, to find the records where time = 'Dinner' in the sample data:

SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;

In pandas, there are several ways to filter by condition. For example, you can pass a Series of True/False values to a DataFrame, which returns all rows where the value is True: tips[tips['time'] == 'Dinner'].
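The boolean-mask approach described above might look like this. The frame is a hand-typed stand-in for the tips table with illustrative values, chosen so that both Dinner and Lunch rows are present.

```python
import pandas as pd

# Illustrative stand-in for tips.csv: same column names, hand-typed values.
tips = pd.DataFrame({
    "total_bill": [16.99, 30.40, 34.30, 23.68, 12.50, 39.42],
    "tip": [1.01, 5.92, 6.70, 3.31, 2.00, 7.58],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

# The comparison produces a Series of True/False values, one per row.
is_dinner = tips["time"] == "Dinner"

# Equivalent of: SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
dinner = tips[is_dinner].head(5)
print(dinner)
```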

Multi-condition search

In SQL, multi-condition search can be done using AND/OR:

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;

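pandas offers the equivalent operators & (AND) and | (OR); each condition must be wrapped in parentheses, for example tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)].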
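A sketch of combining conditions in pandas, on a hand-typed stand-in for the tips table (illustrative values). The parentheses around each comparison are required because & and | bind more tightly than == and >.

```python
import pandas as pd

# Illustrative stand-in for tips.csv: same column names, hand-typed values.
tips = pd.DataFrame({
    "total_bill": [16.99, 30.40, 34.30, 23.68, 12.50, 39.42],
    "tip": [1.01, 5.92, 6.70, 3.31, 2.00, 7.58],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

# Equivalent of: SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
dinner_big_tip = tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]

# Equivalent of: SELECT * FROM tips WHERE time = 'Dinner' OR tip > 5.00;
dinner_or_big_tip = tips[(tips["time"] == "Dinner") | (tips["tip"] > 5.00)]

print(dinner_big_tip)
print(dinner_or_big_tip)
```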

Find null values

In pandas, null values are checked with the notna() and isna() methods:

frame[frame['col1'].notna()]

In SQL, the same checks are written with IS NULL and IS NOT NULL:

SELECT *
FROM frame
WHERE col2 IS NULL;


SELECT *
FROM frame
WHERE col1 IS NOT NULL;
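The two SQL queries above can be mirrored in pandas as sketched below. The frame and its contents are hypothetical, made up only to have a few missing entries to filter on.

```python
import pandas as pd

# Hypothetical frame with some missing values (None marks a missing entry).
frame = pd.DataFrame({
    "col1": ["A", "B", None, "C", "D"],
    "col2": ["F", None, "G", "H", "I"],
})

# Equivalent of: SELECT * FROM frame WHERE col2 IS NULL;
col2_missing = frame[frame["col2"].isna()]

# Equivalent of: SELECT * FROM frame WHERE col1 IS NOT NULL;
col1_present = frame[frame["col1"].notna()]

print(col2_missing)
print(col1_present)
```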

Update

In SQL, use UPDATE:

UPDATE tips
SET tip = tip*2
WHERE tip < 2;

In pandas, there are many ways to do this, such as using the loc indexer:

tips.loc[tips['tip'] < 2, 'tip'] *= 2

Delete

In SQL, use DELETE:

DELETE FROM tips
WHERE tip > 9;

In pandas, we select the rows that should be kept instead of deleting the others:

tips = tips.loc[tips['tip'] <= 9]

Grouping

In pandas, grouping is done with the groupby() method. groupby() usually refers to a process in which we split the data set into several groups, apply a function (usually an aggregation) to each group, and then combine the groups together.

A common SQL operation is to get the number of records in each group across the entire data set. For example, to count the records for each sex:

SELECT sex, count(*)
FROM tips
GROUP BY sex;

The equivalent operation in pandas is tips.groupby('sex').size(). Note that we use size() instead of count(). This is because count() applies the function to each column and returns the number of non-null records in each column!
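The difference between size() and count() can be seen in a short sketch. The frame is a hand-typed stand-in for the tips table with illustrative values, and one tip is deliberately left missing so that the two methods disagree.

```python
import pandas as pd

# Illustrative stand-in for tips.csv; one tip is missing on purpose.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, None, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
})

# Equivalent of: SELECT sex, count(*) FROM tips GROUP BY sex;
# size() counts rows per group, whether or not they contain nulls.
rows_per_sex = tips.groupby("sex").size()

# count() returns one column per original column, counting non-null values only.
non_null_per_column = tips.groupby("sex").count()

print(rows_per_sex)
print(non_null_per_column)
```

To count a single column the way SQL's count(column) does, you can select it first, for example tips.groupby('sex')['total_bill'].count().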

Join

In pandas, you can use join() or merge() to perform joins. Each method has parameters that let you specify the type of join (LEFT, RIGHT, INNER, FULL) and the columns to join on.

Now let us create two new sets of sample data and use code to demonstrate the different joins.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': np.random.randn(4)})

df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': np.random.randn(4)})

Inner join

An inner join uses comparison operators to match rows in two tables based on the values of the columns they share. In SQL, it is written with INNER JOIN:

SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;

In pandas you can use merge(), which performs an inner join by default: pd.merge(df1, df2, on='key'). merge() also provides parameters (such as left_on and right_index) for joining the columns of one DataFrame with the index of another.
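A sketch of the inner join in pandas. The key columns match the sample data above, but fixed numbers replace np.random.randn(4) so the output is reproducible; those values are an assumption made for illustration.

```python
import pandas as pd

# Same keys as the sample data; fixed values instead of random ones.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [10.0, 20.0, 30.0, 40.0]})

# Equivalent of: SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;
# how="inner" is the default; overlapping column names get _x / _y suffixes.
inner = pd.merge(df1, df2, on="key")

# Joining a column of df1 against the index of another DataFrame.
indexed_df2 = df2.set_index("key")
inner_on_index = pd.merge(df1, indexed_df2, left_on="key", right_index=True)

print(inner)
print(inner_on_index)
```

Key D appears twice in df2, so it produces two rows in the result, just as it would in SQL.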

Left/right outer join

In SQL, left and right outer joins are written with LEFT OUTER JOIN and RIGHT OUTER JOIN:

SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;


SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;

In pandas, the same can be achieved with merge() by setting the how keyword to 'left' or 'right', for example pd.merge(df1, df2, on='key', how='left').
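Both outer joins can be sketched as follows. The keys match the sample data, while the fixed values are an assumption standing in for the random numbers.

```python
import pandas as pd

# Same keys as the sample data; fixed values instead of random ones.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [10.0, 20.0, 30.0, 40.0]})

# Equivalent of: SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;
left = pd.merge(df1, df2, on="key", how="left")

# Equivalent of: SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;
right = pd.merge(df1, df2, on="key", how="right")

print(left)   # keys A and C have NaN in value_y (no match in df2)
print(right)  # key E has NaN in value_x (no match in df1)
```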

Full join

A full join returns all rows from both the left and right tables, whether or not they match, but not all databases support it; MySQL, for example, does not. In SQL, a full join is written with FULL OUTER JOIN:

SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;

In pandas, you can again use merge() and set the how keyword to 'outer': pd.merge(df1, df2, on='key', how='outer').
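A sketch of the full join in pandas, which works regardless of whether the database behind your SQL supports FULL OUTER JOIN. The fixed values are an assumption standing in for the random numbers in the sample data.

```python
import pandas as pd

# Same keys as the sample data; fixed values instead of random ones.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [10.0, 20.0, 30.0, 40.0]})

# Equivalent of: SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
# Rows without a match on either side are kept and filled with NaN.
full = pd.merge(df1, df2, on="key", how="outer")
print(full)
```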

Union

The UNION operation in SQL merges the result sets of two or more SELECT statements. UNION is similar to UNION ALL, but UNION removes duplicate rows. The sample code is as follows:

SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/

In pandas, you can use concat() to achieve UNION ALL: pd.concat([df1, df2]).
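A sketch of UNION ALL with concat(). The two frames are rebuilt here with the city and rank values shown in the SQL output above; how they are split between df1 and df2 is inferred from that output.

```python
import pandas as pd

# Frames reconstructed from the SQL result shown above.
df1 = pd.DataFrame({"city": ["Chicago", "San Francisco", "New York City"],
                    "rank": [1, 2, 3]})
df2 = pd.DataFrame({"city": ["Chicago", "Boston", "Los Angeles"],
                    "rank": [1, 4, 5]})

# Equivalent of: SELECT city, rank FROM df1 UNION ALL SELECT city, rank FROM df2;
union_all = pd.concat([df1, df2])
print(union_all)
```

The original row labels are kept, so they repeat (0, 1, 2, 0, 1, 2); passing ignore_index=True to concat() renumbers them.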


The result above corresponds to UNION ALL, which retains duplicate rows. To remove them, as UNION does, you can additionally call drop_duplicates(): pd.concat([df1, df2]).drop_duplicates().
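A sketch of the UNION equivalent, using the same city and rank frames reconstructed from the SQL output earlier in this section:

```python
import pandas as pd

# Frames reconstructed from the SQL result shown earlier.
df1 = pd.DataFrame({"city": ["Chicago", "San Francisco", "New York City"],
                    "rank": [1, 2, 3]})
df2 = pd.DataFrame({"city": ["Chicago", "Boston", "Los Angeles"],
                    "rank": [1, 4, 5]})

# Equivalent of: SELECT city, rank FROM df1 UNION SELECT city, rank FROM df2;
# drop_duplicates() keeps the first occurrence of each fully identical row.
union = pd.concat([df1, df2]).drop_duplicates()
print(union)
```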

That concludes this article. As you can see, the two languages each have their own characteristics in different scenarios. If you want to learn more, read the official documentation and practice!

Source: pandas official documentation

https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html

Compiled by: Liu Zaoqi (abridged and adapted)



Origin blog.csdn.net/csdnsevenn/article/details/109233553