Summary of commonly used Pandas DataFrame basics

Note: every snippet below assumes the data has already been converted into a DataFrame.

(So the first step is always to convert your data to DataFrame format.)

Example DataFrame format:

import pandas as pd
data = {
    "code": ['000008', '000009', '000021', '000027', '000034', '000058', '000062', '000063', '000063', '000063', '000063'],
    "name": ['神州科技', '中国宝安', '深科技', '深圳能源', '神州数码', '深赛格', '深圳华强', '中兴通讯', '中兴通讯', '中兴通讯', '中兴通讯'],
    "concept": ['5G', '创投', '芯片概念', '创投', '网络安全', '创投', '创投', '芯片概念', '边缘计算', '网络安全', '5G'],
}
stock_df = pd.DataFrame(data=data)
print(stock_df)

1. Replace values in a column

For example, map the code 'ys4ng35toofdviy9ce0pn1uxw2x7trjb' to '娱乐' (entertainment); this dictionary-mapping approach is recommended when many values need replacing:

dicts = {'ys4ng35toofdviy9ce0pn1uxw2x7trjb':'娱乐',
        'vekgqjtw3ax20udsniycjv1hdsa7t4oz':'经济',
        'vjzy0fobzgxkcnlbrsduhp47f8pxcoaj':'军事',
        'uamwbfqlxo7bu0warx6vkhefigkhtoz3':'政治',
        'lyr1hbrnmg9qzvwuzlk5fas7v628jiqx':'文化',
        }
# res is assumed to be a DataFrame whose 'name' column holds the raw codes
res['name'] = res['name'].map(lambda x: dicts[x] if x in dicts else x)
print(res)

or:

For example, replace '5G' with '6G' and '创投' with '创业投资' (chained str.replace is recommended when only a few values need replacing):

stock_df['concept'] = stock_df['concept'].str.replace('5G', '6G').str.replace('创投', '创业投资')
print(stock_df)  

2. Group statistics

Given a DataFrame res such as:

  name  value
0   娱乐      8
1   经济      5
2   军事      3
3   政治      3
4   娱乐      2
5   文化      1
6   政治      1
7   经济      1
8   军事      1
9   文化      1
# Group statistics: to sum values grouped by a single column, use sum()
result = res.groupby(['name']).sum().reset_index()
print(result)


Output:
 name  value
0   军事      4
1   娱乐     10
2   政治      4
3   文化      2
4   经济      6

3. Aggregated statistics (grouping by multiple columns)

# Aggregated statistics: to count rows grouped by multiple columns, use size()
data = result.groupby(['name', 'type']).size().reset_index(name='value')
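The frame here is assumed to have both a 'name' and a 'type' column; 'type' does not appear in the earlier examples, so a self-contained sketch with made-up data might look like:

import pandas as pd

demo = pd.DataFrame({
    'name': ['娱乐', '娱乐', '经济', '经济', '军事'],
    'type': ['A', 'A', 'B', 'A', 'B'],
})
# size() counts the rows in each (name, type) group
counts = demo.groupby(['name', 'type']).size().reset_index(name='value')
print(counts)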

4. Sort by a column

# Sort by 'value' in descending order
result = result.sort_values(['value'], ascending=False)

5. Convert a DataFrame to a dictionary

# Output as a list of dicts, the format the front end needs
data_dict = result.to_dict(orient='records')
print(data_dict)
# Convert only two specified columns into a list of dicts
res_df = res_df[['zhongzhi_date', 'cumsum']].to_dict(orient='records')


# Result:
[{'name': '娱乐', 'value': 10}, {'name': '经济', 'value': 6}, 
{'name': '军事', 'value': 4}, {'name': '政治', 'value': 4},
{'name': '文化', 'value': 2}]

6. Merge multiple rows of a DataFrame into one row

def ab(df):
    # join all values in the group into one comma-separated string
    return ','.join(df.values)

df = df.groupby(['code', 'name'])['concept'].apply(ab)
df = df.reset_index()
print(df)

7. Add a column and delete a column

# Add a new column filled with a constant value
stock_df['new_column'] = '股票'
print(stock_df)
# Delete a column, method 1: pop (use either this or method 2, not both in a row)
stock_df.pop('new_column')
print(stock_df)
# Delete a column, method 2: drop
stock_df = stock_df.drop('new_column', axis=1)
print(stock_df)

8. Delete rows where a column's string length exceeds 8

# Drop rows whose 'name' is longer than 8 characters
text_data = text_data.drop(text_data[text_data['name'].str.len() > 8].index)

9. Convert a DataFrame column to string or integer

# Convert the code column to string
stock_df['code'] = stock_df['code'].astype(str)
# Convert the code column to integer (note: leading zeros such as '000008' are lost)
stock_df['code'] = stock_df['code'].astype(int)

10. Delete rows that contain a specific substring

# Drop rows whose 'name' contains 'null'
result = result[~result['name'].str.contains('null')]
# Drop rows whose 'name' contains '0'
result = result[~result['name'].str.contains('0')]

11. Replace or delete characters within a column of text

text_data['name'] = text_data['name'].map(lambda x: x.replace('罪', ''))  # delete every '罪' character

12. Slice a substring out of a column

For example, the casenumber column has a format like '***刑通[3099]666号', where the two characters at positions 3-4 identify the case type:

# Take characters 3-4 of casenumber as the category abbreviation
result['category'] = result['casenumber'].map(lambda x: str(x)[3:5])
# Map the abbreviations to full category names
result['category'] = result['category'].replace('刑通', '刑事案件').replace('民申', '民事案件').replace('刑申', '刑事案件').replace('行申', '行政案件').replace('刑认', '刑事案件')
print(result)

13. Format a datetime column as a string

result['noticetime'] = result['noticetime'].dt.strftime('%Y%m%d')
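.dt.strftime only works on a datetime column; if noticetime is stored as text, a minimal sketch (made-up data) of converting it first would be:

import pandas as pd

result = pd.DataFrame({'noticetime': ['2023-06-01 10:30:00', '2023-06-02 08:00:00']})  # hypothetical data
result['noticetime'] = pd.to_datetime(result['noticetime'])  # ensure datetime dtype so .dt is available
result['noticetime'] = result['noticetime'].dt.strftime('%Y%m%d')
print(result)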

14. Merge two DataFrames and fill in null/missing values

pd.merge can join rows from different DataFrames on one or more keys, and fillna can then replace the resulting null values (here with '不详', "unknown").

1. Joins rows of different DataFrames based on a single key or multiple keys

2. Similar to a SQL join operation

3. By default, the overlapping column names are used as the join keys ("foreign keys")

        on explicitly specifies the join key

        left_on is the join key of the left DataFrame

        right_on is the join key of the right DataFrame

4. The default is an inner join, i.e. the keys in the result are the intersection

result = pd.merge(res2, res, how='outer').fillna('不详')
  • concat: joins multiple objects together along an axis
  • merge: joins rows from different DataFrames based on one or more keys (see the sketch below)
  • join: inner keeps the intersection of keys, outer keeps the union
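A minimal runnable sketch of an explicit-key merge plus fillna, using two made-up frames rather than the res/res2 from the article:

import pandas as pd

res = pd.DataFrame({'name': ['娱乐', '经济', '军事'], 'value': [10, 6, 4]})
res2 = pd.DataFrame({'name': ['娱乐', '文化'], 'rank': [1, 5]})

# Outer join on the shared 'name' column; unmatched cells become NaN,
# which fillna then replaces with '不详'
result = pd.merge(res2, res, how='outer', on='name').fillna('不详')
print(result)

# When the key columns have different names, use left_on / right_on
res3 = res2.rename(columns={'name': 'title'})
result2 = pd.merge(res3, res, how='outer', left_on='title', right_on='name').fillna('不详')
print(result2)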

15. Append a character to the end of every value in a column

stock_df['new_column'] = stock_df['new_column'].map(lambda x: str(x) + '年')  # append '年' (year)

16. Handle missing data with dropna / fillna

# Drop rows that contain missing values (returns a new DataFrame)
stock_df.dropna()
# Fill missing values with the string 'null' (returns a new DataFrame)
stock_df.fillna('null')

17. Get the index and values, and preview data (the first 5 rows by default)

# Get the index
print(stock_df.index)
# Get the underlying values
print(stock_df.values)
# Preview data: head() shows the first 5 rows by default (first 3 here)
print(stock_df.head(3))
# Preview data: tail() shows the last 5 rows by default (last 3 here)
print(stock_df.tail(3))

18. Delete duplicate data

# Mark duplicate rows
df_obj.duplicated()
# Drop duplicate rows
df_obj.drop_duplicates()
# Drop rows duplicated in the specified column
df_obj.drop_duplicates('data2')

19. Commonly used statistical calculations: sum, mean, max, min

axis=0 (the default) aggregates down each column, producing one result per column; axis=1 aggregates across each row, producing one result per row.

stock_df.sum()
stock_df.max()
stock_df.min(axis=1)
# Summary statistics
print(stock_df.describe())
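A quick numeric sketch (made-up data) showing the difference between the two axes:

import pandas as pd

num_df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
print(num_df.sum())        # axis=0, the default: one sum per column -> a 6, b 60
print(num_df.sum(axis=1))  # axis=1: one sum per row -> 11, 22, 33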

20. Aggregation

import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b',
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1,10, 8),
            'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)

# Built-in aggregation functions
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean(numeric_only=True))  # numeric_only avoids a TypeError on the string column key2
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())

# Custom aggregation function, passed to the agg method
def peak_range(df):
    """Return the range (max - min) of the values."""
    # the argument is the group of values belonging to each key
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj5.groupby('key1').agg(lambda df : df.max() - df.min()))

21. Use str.contains for regex matching and filtering

# str.contains supports regex filtering: for rows where the 合计 column contains 'change', set the 数量 column to 'C'
df.loc[df['合计'].str.contains('change'), '数量'] = 'C'
# print(df)
# Change some columns based on conditions on other columns: locate rows where 数量 equals 'A', then set 合计 in those rows to 'changed'
df.loc[df['数量'] == 'A', '合计'] = 'changed'  # key line: modifies df in place

22. Cumulative sum of a column

res_df['cumsum'] = res_df['data_num'].cumsum()
print(res_df)

23. Apply an operation to a whole column (add a number to every value)

res_total[0] = 7131  # a base value to add to every row
res_df['cumsum'] = res_df['cumsum'] + res_total[0]
print(res_df)

24. Assign values to one column based on the value range of another column (see the sketch below)
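The original gives no code here; one common approach, sketched with made-up data and hypothetical column names (data_num, level), is pd.cut:

import pandas as pd

res_df = pd.DataFrame({'data_num': [3, 15, 42, 87]})  # hypothetical data
# Bin data_num into value ranges and assign a label per range
res_df['level'] = pd.cut(res_df['data_num'],
                         bins=[0, 10, 50, 100],
                         labels=['low', 'mid', 'high'])
print(res_df)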

25. Assign values to another column based on conditions over several columns (see the sketch below)
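Again no code in the original; a minimal sketch using np.where with conditions on several made-up columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [1.2, 1.5, 0.9], 'close': [1.3, 1.4, 1.1]})  # hypothetical data
# Combine conditions on several columns to fill a new column
df['signal'] = np.where((df['close'] > df['open']) & (df['open'] > 1.0), 'UP', 'DOWN')
print(df)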

26. Compare the current row with the previous row in Pandas

Example data and the desired Position column:

      Open     High      Low    Close  Volume Position
0  1.20821  1.20821  1.20793  1.20794  138.96        -
1  1.20794  1.20795  1.20787  1.20788  119.61     DOWN
2  1.20788  1.20793  1.20770  1.20779  210.42     DOWN
3  1.20779  1.20791  1.20779  1.20789   77.51     DOWN
4  1.20789  1.20795  1.20789  1.20792   56.97     DOWN
# shift(1) refers to the previous row (import numpy as np is assumed);
# the first row has no previous row, so its comparisons are False and it falls into "DOWN"
df['Position'] = np.where((df['Volume'] > df['Volume'].shift(1)) &
    (df['Close'] >= df['Close'].shift(1)) & (df['Open'] <= df['Open'].shift(1)),
    "UP", "DOWN")

27. Process a column (e.g. trim a timestamp string)

concat_df['changed_on'] = concat_df['changed_on'].apply(lambda x: str(x)[:19])  # keep only the first 19 characters, e.g. 'YYYY-MM-DD HH:MM:SS'

28. Combine multiple columns into one column

res_df['huji_address'] = res_df['huji_name'] + res_df['hujisuozaidi']  # string columns are concatenated element-wise

29. Drop duplicate records based on a column, keeping the last one

res_df = res_df.drop_duplicates(['p_name'], keep='last').reset_index(drop=True)  # reset_index(drop=True) renumbers the rows; reindex() with no arguments would not

30. Subtract one time column from another

df_data['new_time'] = pd.to_datetime(df_data['target_time']) - pd.to_datetime(df_data['start_time'])
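The result is a timedelta column; a small sketch (made-up data) of pulling out days or total seconds via the .dt accessor:

import pandas as pd

df_data = pd.DataFrame({'start_time': ['2023-06-01 08:00:00'],
                        'target_time': ['2023-06-03 10:30:00']})  # hypothetical data
df_data['new_time'] = pd.to_datetime(df_data['target_time']) - pd.to_datetime(df_data['start_time'])
print(df_data['new_time'].dt.days)             # whole days of each timedelta
print(df_data['new_time'].dt.total_seconds())  # total seconds of each timedelta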

31. Set, remove, and reset the DataFrame index

Replace the default index (setting a column as the index removes the default integer index):

res_df.set_index(['report_time'], inplace=True)

Move the index back into a column and restore the default integer index:

res_df = res_df.reset_index()

Reset the index and discard the old one:

res_df = res_df.reset_index(drop=True)

32. Pandas counts the number of occurrences of each value in a column

# A column can be referenced as df.colname; value_counts() counts each distinct value ('city' is the column name here)
df2 = df1.city.value_counts()  
print(df2)

Pandas official documentation: https://pandas.pydata.org/docs/

Source: blog.csdn.net/weixin_40547993/article/details/131315045