Pandas and SQL comparison

I come from a SQL background and have only recently started using Pandas. At first, a lot of things seemed easier to handle in SQL, but after getting familiar with Pandas I found that it often has a very concise solution, and in some areas Pandas has a clear advantage. Below are some scenarios I have run into:

Sliding window / curve smoothing
Scenario:

Monthly data can jitter quite a bit, so we want to accumulate the last n months of data and average it to get a smoother monthly trend.
DB solution:

I could not think of a particularly simple way to do this in SQL; tips from readers are welcome. (A possible sketch, for databases that support window frames, follows.)
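Just a sketch of one possibility, assuming the database supports standard window frames; the table and column names (cost, city, month, score) are borrowed from the Pandas snippet below:

select city, month,
       avg(score) over (partition by city
                        order by month
                        rows between 2 preceding and current row) as rolling_avg
from cost
-- note: unlike rolling(3) in Pandas, the first two months of each city
-- are averaged over fewer than 3 rows instead of being left empty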
Pandas solution:

cost.sort_values(['city', 'month']).groupby('city')['score'].rolling(3).mean()
Notes:

Obviously, Pandas is much more convenient here.
Jupyter example

Income distribution
Scenario:

A person's spending is spread over many categories, such as entertainment and education. For each record, we now need to calculate what percentage of that person's total spending it accounts for.
DB solution:

Because the SQL would get fairly long, here is just the idea (a sketch follows):
1) Group by person and compute each person's total
2) Join that result set (the subquery) back to the original table by person, then compute the percentage
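A rough sketch of the two steps above; the table and column names (cost, person_id, amount) are borrowed from the Pandas snippet below:

select c.person_id, c.amount,
       c.amount / t.total as percentage
from cost c
join (select person_id, sum(amount) as total
      from cost
      group by person_id) t           -- step 1: per-person totals
  on c.person_id = t.person_id        -- step 2: join back and divide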
Pandas solution:

cost['total'] = cost.groupby('person_id')['amount'].transform('sum')
cost['percentage'] = cost.amount/cost.total
Notes:

Which solution is better goes without saying.
Jupyter example

Handling outliers
Scenario:

Handling abnormal or missing values. For example, when collecting growth statistics for primary school students, some students have no recorded height. Replacing the missing values with the global average is clearly not appropriate; a better alternative is to replace them with the average for the same gender and age.
DB solution:

Completely analogous to the income distribution case above: compute the average for each age and gender, join it back, and finally use nvl to replace the null values (sketch below).
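A rough sketch of that idea; the table and column names (students, age, gender, height) are assumed, and nvl is Oracle-style (coalesce elsewhere):

select s.age, s.gender,
       nvl(s.height, a.avg_height) as height   -- fill missing heights
from students s
join (select age, gender, avg(height) as avg_height
      from students
      group by age, gender) a                  -- per-(age, gender) averages
  on s.age = a.age and s.gender = a.gender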
Pandas solution:

df.groupby(['age', 'gender'])['height'].transform(lambda x: x.fillna(x.mean()))
Notes:

Which solution is better goes without saying.
Jupyter example

Row/column transposition
Scenario:

Converting rows to columns and columns to rows.
DB solution:

Columns to rows: one select per column, then union them (sketch below)
Rows to columns: select id, max(case when type = 'gender' then value else null end) as gender
       from student
       group by id
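For the columns-to-rows direction, a small sketch of the "multiple selects then union" idea, assuming a wide table with one column per subject (math, english are only illustrative names):

select id, 'math' as type, math as value from student
union all
select id, 'english' as type, english as value from student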
Pandas Solution:

Columns to rows: student.stack()
Rows to columns: student.stack().unstack()
Notes:

When there are many columns, the DB solution becomes very cumbersome.
Jupyter example

Ranking within groups (row_number)
Scenario:

Find the student with the best score in each class.
DB solution:

select * from (select class_id, student_id, row_number() over (partition by class_id order by score desc) rank from student_score) where rank = 1
Pandas solution:

student_score.sort_values(['class_id', 'score'], ascending=False).groupby('class_id').nth(0)
Notes:

The two solutions are roughly equivalent; Pandas is slightly better.
Jupyter example

Removing redundant records
Scenario:

For instance, the database keeps multiple historical score records for each student, but you only want the most recent one.
DB solution:

select * from student_score where id in (
    select max(id) from student_score group by student_id
)   
Pandas solution:

student_score.drop_duplicates(subset='student_id', keep='last')
Notes:

The two solutions are not very different.
Jupyter example

Month-over-month change
Scenario:

Calculate each city's month-over-month change, i.e. (this month's score - last month's score) / last month's score.
DB Solution:

select c2.city, c2.month,  (c2.score - c1.score)/c1.score
from cost c1, cost c2 where 
    c1.month = c2.month-1
    and c1.city = c2.city
Pandas solution:

cost['previous'] = cost.sort_values(['city', 'month']).groupby('city')['score'].shift(1)
cost['percentage'] = (cost.score -  cost.previous)/cost.previous
Notes:

The two solutions differ quite a bit; being more familiar with Pandas, I find the Pandas one more direct.
Jupyter example

Row-level aggregation
Scenario:

For example, a city samples PM2.5 once a minute and stores one row per day with 24 * 60 columns. The application needs the highest PM2.5 value for each day, and the time at which that highest value occurred.
DB solution:

Maximum: select greatest(col1, col2, col3, ...), day from records

Time of the maximum: I could not think of a good way (a clumsy sketch follows).
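One clumsy possibility, just as a sketch: repeat the greatest() call inside a case expression to recover which column held the maximum. With 24 * 60 columns this is clearly unworkable, which is exactly the point of the comparison:

select day,
       greatest(col1, col2, col3) as max_value,
       case greatest(col1, col2, col3)    -- compare the max back against each column
            when col1 then 'col1'
            when col2 then 'col2'
            else 'col3'
       end as max_col
from records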
Pandas solution:

Maximum: df['max'] = df.apply(lambda row: max(row), axis=1)

Time of the maximum: df['max_sn'] = df.apply(lambda row: row.idxmax(), axis=1)
Notes:

Pandas is clearly a bit better here, and when the number of columns is large, the advantage becomes even more obvious.
Jupyter example

Conclusion
Different techniques suit different scenarios, so to some extent this comparison is unfair; for instance, join operations, which SQL does best, are not covered here. The purpose of the comparison is simply to better understand the different strengths of SQL and Pandas.
---------------------
Author: flyfoxs
Source: CSDN
Original: https://blog.csdn.net/flyfoxs/article/details/81322649
Disclaimer: This is an original article by the blogger; please include a link to the original when reposting.
