Instructions for 159 commonly used methods of the Python pandas library

The pandas library is designed for data analysis, and it is a large part of what makes Python a powerful and efficient data analysis environment.

One, Pandas data structure

1. import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

2. S1=pd.Series(['a','b','c']) A Series is a data structure composed of a set of data and a set of indexes (the row index)

3. S1=pd.Series(['a','b','c'],index=(1,3,4)) specify the index

4. S1=pd.Series({1:'a',2:'b',3:'c'}) specify the index in dictionary form

5. S1.index returns the index (an attribute, not a method)

6. S1.values returns the values (also an attribute)

7. df=pd.DataFrame(['a','b','c']) A DataFrame is a data structure composed of a set of data and two sets of indexes (row and column indexes)

8. df=pd.DataFrame([['a','A'],['b','B'],['c','C']],columns=['lowercase','uppercase'],index=['one','two','three'])

columns is the column index, and index is the row index

9. pip install -i  https://pypi.tuna.tsinghua.edu.cn/simple  pyspider installs a package (here pyspider) through the Tsinghua mirror

10. data={'lowercase':['a','b','c'],'uppercase':['A','B','C']} pass in a dictionary

df=pd.DataFrame(data)

11. df.index and df.columns return the row and column indexes (attributes, not methods)
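A minimal sketch tying the constructors above together (the values and labels reuse the examples from items 2-10):

import pandas as pd

# A Series: one set of data plus a row index
s1 = pd.Series(['a', 'b', 'c'], index=[1, 3, 4])
print(s1.index)    # the row index
print(s1.values)   # the values: ['a' 'b' 'c']

# A DataFrame: one set of data plus row and column indexes
data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']}
df = pd.DataFrame(data, index=['one', 'two', 'three'])
print(df.index)    # the row index
print(df.columns)  # the column index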

Two, read the data

12. df=pd.read_excel(r'C:\user\...xlsx',sheet_name='sheet1') or

pd.read_excel(r'C:\user\...xlsx',sheet_name=0) reads an Excel sheet

13. pd.read_excel(r'C:\user\...xlsx',index_col=0,header=0)

index_col specifies the column to use as the row index, header specifies the row to use as the column index

14. pd.read_excel(r'C:\user\...xlsx',usecols=[0,1]) imports only the specified columns, without setting index_col or header

15. pd.read_table(r'C:\user\...txt',sep='') imports a txt file; sep specifies the delimiter used in the file

16. df.head(2) displays the first two rows; by default the first 5 rows are displayed

17. df.shape shows the number of rows and columns of the data, excluding the row and column indexes

18. df.info() shows the type of each column of data in the table

19. df.describe() returns summary statistics (count, mean, standard deviation, quartiles, etc.) for the numeric columns in the table
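A short sketch of a typical read-and-inspect workflow from items 12-19 (the file path and column positions are placeholders, not paths from the original article):

import pandas as pd

# sheet_name=0 reads the first sheet; column 0 becomes the row index
df = pd.read_excel(r'C:\user\data.xlsx', sheet_name=0, index_col=0, header=0)

print(df.head(2))     # first two rows (default is 5)
print(df.shape)       # (rows, columns), indexes excluded
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for numeric columns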

Three, data preprocessing

20. df.info() can also show which columns in the table contain empty values

21. df.isnull() marks missing values: it returns True where a value is missing, otherwise False

22. df.dropna() deletes rows containing missing values by default

23. df.dropna(how='all') deletes only rows whose values are all empty; rows that are not entirely empty are kept

24. df.fillna(0) fills all missing values with 0

25. df.fillna({'sex':'male','age':'30'}) fills missing values in the sex column with 'male' and in the age column with '30'

26. df.drop_duplicates() checks all columns for duplicate values by default and keeps the first occurrence

27. df.drop_duplicates(subset='gender') checks for duplicates only in the gender column, keeping the first occurrence

28. df.drop_duplicates(subset=['gender','company'],keep='last') checks gender and company together for duplicates

keep defaults to 'first' (keep the first occurrence); it can also be set to 'last' (keep the last) or False (keep none of the duplicates)

29. df['ID'].dtype views the data type of the ID column

30. df['ID'].astype('float') converts the ID column to the float data type

31. Data types: int, float, object, string, unicode, datetime

32. df['ID'][1] the second value of the ID column

33. df.columns=['uppercase','lowercase','Chinese'] adds a column index to a table that has none

34. df.index=[1,2,3] adds a row index

35. df.set_index('number') specifies an existing column to use as the row index

36. df.rename(index={'order number':'new order number','customer name':'new customer name'}) renames the row index

37. df.rename(columns={1:'one',2:'two'}) renames the column index

38. df.reset_index() converts all index levels into columns by default

39. df.reset_index(level=0) converts only level 0 of the index into a column

40. df.reset_index(drop=True) drops the original index instead of converting it into a column
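A minimal preprocessing sketch combining items 21-40 (the toy table and its column names are illustrative assumptions):

import pandas as pd
import numpy as np

df = pd.DataFrame({'sex': ['male', None, 'female', 'female'],
                   'age': [25, np.nan, 30, 30]})

print(df.isnull())                           # True where a value is missing
df = df.fillna({'sex': 'male', 'age': 30})   # per-column fill values
df = df.drop_duplicates(keep='first')        # keep the first of each duplicate row
df['age'] = df['age'].astype('float')        # convert a column's dtype
df = df.rename(columns={'sex': 'gender'})    # rename a column index label
df = df.set_index('gender').reset_index()    # set a row index, then turn it back into a column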

Four, data selection

41. df[['ID','name']] multiple column names must be wrapped in a list

42. df.iloc[[1,3],[2,4]] selects data by row and column position numbers

43. df.iloc[1,1] selects the value in the second row and second column (positions count from 0; in the source file this is the third row, because the first row serves as the column index)

44. df.iloc[:,0:4] # gets the values of columns 1 to 4

45. df.loc['一'] # loc selects row data by row name; the result is a Series but can be accessed like a list

46. df.loc['一'][0] or df.loc['一']['serial number']

47. df.iloc[1] # iloc selects row data by row number

48. df.iloc[[1,3]] # select multiple rows by number; wrap the numbers in a list, otherwise it becomes row-and-column selection

49. df.iloc[1:3] # selects the second and third rows (the end of the slice is excluded)

50. df[df['age']<45] # a boolean condition returns all rows that satisfy it, with every column included

51. df[(df['age']<45)&(df['ID']<4)] # combine multiple conditions to select data

52. df.iloc[[1,3],[2,4]] is equivalent to df.loc[['一','二'],['age','ID']] # loc uses names, iloc uses numbers

53. df[df['age']<45][['age','ID']] # first filter the rows by the age condition, then select the columns

54. df.iloc[1:3,2:4]#Slice index
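A small selection sketch covering items 41-54 (the table, labels and conditions are illustrative):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4], 'age': [30, 44, 50, 28]},
                  index=['一', '二', '三', '四'])

print(df.loc['一'])                           # a row by label, returned as a Series
print(df.iloc[1])                             # a row by position
print(df.iloc[1:3])                           # second and third rows (slice end excluded)
print(df[(df['age'] < 45) & (df['ID'] < 4)])  # multi-condition row filter
print(df[df['age'] < 45][['age', 'ID']])      # filter rows, then pick columns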

Five, numerical operations

55. df['age'].replace(100,33)# Replace 100 in the age column with 33

56. df.replace(np.NaN,0) # equivalent to fillna(0); np.NaN is how missing values are represented in Python

57. df.replace(['A','B'],'C') # many-to-one replacement: A and B are both replaced with C

58. df.replace({'A':'a','B':'b','C':'c'}) # many-to-many replacement

59. df.sort_values(by=['Application Form Number'],ascending=False) # sorts the application form number column in descending order; ascending=True (the default) sorts in ascending order

60. df.sort_values(by=['Application Form Number'],na_position='first') # sorts the application form number column in ascending order with missing values placed first

By default missing values are placed last

61. df.sort_values(by=['col1','col2'],ascending=[False,True])#Multi-column sorting

62. df['Sales'].rank(method='first') # ranks the sales column (ranking, not sorting); method can be first/min/max/average

63. df.drop(['Sales','ID'],axis=1) # deletes columns by column name

64. df.drop(df.columns[[4,5]],axis=1) # deletes columns by position number

65. df.drop(columns=['Sales','ID']) # deleting columns this way does not require axis=1

66. df.drop(['a','b'],axis=0) # deletes rows by row name

67. df.drop(df.index[[4,5]],axis=0) # deletes rows by position number

68. df.drop(index=['a','b']) # deleting rows this way does not require axis=0

69. df['ID'].value_counts() # counts how many times each value appears in the ID column

70. df['ID'].value_counts(normalize=True,sort=False) # returns the counts as proportions, without sorting

71. df['ID'].unique() # gets the unique values of the column

72. df['age'].isin(['a',11]) # checks whether each value of the column is 'a' or 11

73. pd.cut(df['ID'],bins=[0,3,6,10]) # bins specifies the cut boundaries

74. pd.qcut(df['ID'],3) # cuts the ID column into 3 parts, keeping the number of values in each part as equal as possible

75. df.insert(2,'product',['book','pen','calculator']) # inserts a new column as the third column

76. df['product']=['book','pen','calculator'] # inserts a new column at the end of the table

77. df.T swaps rows and columns (transpose)

78. df.stack() # converts tabular data into tree-shaped data

79. df.set_index(['ID','name']).stack().reset_index() # converts a wide table into a long table: first set the shared columns as the row index, then stack the remaining columns into tree-shaped data, and finally reset the row index

80. df.melt(id_vars=['ID','name'],var_name='year',value_name='sale') # converts a wide table into a long table: id_vars names the columns that stay unchanged, var_name names the new column holding the former column indexes, and value_name names the new column holding the values

81. df['C1'].apply(lambda x:x+1) # equivalent to map(), but must be used with a lambda

82. df.applymap(lambda x:x+1) applies the same function to every value in the table
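A sketch of the replace/sort/rank/apply operations above (the toy data is an assumption):

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [100.0, 25.0, np.nan], 'Sales': [10, 30, 20]})

df['age'] = df['age'].replace(100, 33)           # one-to-one replacement
df = df.replace(np.nan, 0)                       # same effect as fillna(0)
df = df.sort_values(by=['Sales'], ascending=False)
df['rank'] = df['Sales'].rank(method='first')    # ranking, not sorting
print(df['Sales'].value_counts(normalize=True))  # counts as proportions
df['age'] = df['age'].apply(lambda x: x + 1)     # element-wise function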

Six, data calculation

83. df['ID']+df['ID'] # columns can be added, subtracted, multiplied and divided

84. df['ID']>df['ID'] # columns can be compared with operators such as > < == !=

85. df.count() # counts the number of non-null values in each column

86. df.count(axis=1) # counts the number of non-null values in each row

87. df['ID'].count() # counts the number of non-null values in the specified column

88. df.sum(axis=1) # sums each row (the default axis=0 sums each column)

89. df.mean(axis=1) # mean of each row/column

90. df.max(axis=1) # maximum of each row/column

91. df.min(axis=1) # minimum of each row/column

92. df.median(axis=1) # median of each row/column

93. df.mode(axis=1) # most frequent value in each row/column

94. df.var(axis=1) # variance of each row/column

95. df.std(axis=1) # standard deviation of each row/column

96. df.quantile(0.25) # first quartile; 0.5, 0.75 and other quantiles work the same way

97. df.corr() # computes the correlations between all columns of the DataFrame
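A compact sketch of the column/row statistics in items 85-97 (the two-column table is illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(df.count())         # non-null values per column
print(df.sum(axis=1))     # row sums; the default axis=0 gives column sums
print(df.mean())          # column means
print(df.quantile(0.25))  # first quartile of each column
print(df.corr())          # pairwise correlations between columns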

Seven, time series

98. from datetime import datetime

99. datetime.now() # returns the current date and time (year, month, day, hour, minute, second)

100. datetime.now().year # returns the year; .month and .day work the same way

101. datetime.now().weekday()+1 # returns the day of the week (Monday is 1)

102. datetime.now().isocalendar() # returns the ISO calendar tuple, including the week number

103. (2018, 41, 7) # example output of 102: the 7th day of the 41st week of 2018

104. datetime.now().date() # returns only the date (year, month, day)

105. datetime.now().time() # returns only the time

106. datetime.now().strftime('%Y-%m-%d %H:%M:%S') # returns e.g. '2020-03-13 09:09:12'

107. from dateutil.parser import parse

108. parse(str_time) # converts a time string into a datetime object

109. pd.DatetimeIndex(['2020-02-03','2020-03-05']) # sets a time index

110. data['2018'] # gets the data for 2018

111. data['2018-01'] # gets the data for January 2018

112. data['2018-01-05':'2018-01-15'] # gets the data for this period

113. Handling a table whose index is not a time index:

114. df[df['Deal time']==datetime(2018,8,5)]

115. df[df['Deal time']>datetime(2018,8,5)]

116. df[(df['Deal time']>datetime(2018,8,5))&(df['Deal time']<datetime(2018,8,15))]

117. cha = datetime(2018,5,21,19,50) - datetime(2018,5,18,17,50)

118. cha.days # returns the day component of the time difference

119. cha.seconds # returns the remaining seconds of the time difference (whole days excluded)

120. cha.seconds/3600 # returns the remaining time difference in hours

121. datetime(2018,5,21,19,50)+timedelta(days=1) # moves one day later (timedelta is also imported from the datetime module)

122. datetime(2018,5,21,19,50)+timedelta(seconds=20) # moves 20 seconds later

123. datetime(2018,5,21,19,50)-timedelta(days=1) # moves one day earlier
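A runnable sketch of the datetime arithmetic in items 117-123:

from datetime import datetime, timedelta

cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50)
print(cha.days)            # 3: the whole-day component of the difference
print(cha.seconds / 3600)  # 2.0: the remaining hours beyond whole days

print(datetime(2018, 5, 21, 19, 50) + timedelta(days=1))  # one day later
print(datetime(2018, 5, 21, 19, 50) - timedelta(days=1))  # one day earlier
print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))       # formatted timestamp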

Eight, pivot table

124. df.groupby('Customer Classification').count() # counts rows after grouping by customer classification

125. df.groupby('Customer Classification').sum() # sums after grouping by customer classification

126. df.groupby(['customer classification','area classification']).sum() # sums after grouping by multiple columns (the columns must be passed as a list)

127. df.groupby(['customer classification','area classification'])['ID'].sum() # sums the ID column after grouping by multiple columns

128. df['ID'] # taking a single column out of a DataFrame gives a Series

129. df.groupby(df['ID']).sum() is equivalent to df.groupby('ID').sum()

130. df.groupby('customer classification').aggregate(['sum','count']) # aggregate can apply multiple aggregation methods at once

131. df.groupby('customer classification').aggregate({'ID':'count','sales':'sum'})

132. # aggregate can apply different aggregations to different columns

133. df.groupby('Customer Classification').sum().reset_index() # after grouping and aggregating, reset the index to get back a standard DataFrame

134. pd.pivot_table(data,values,index,columns,aggfunc,fill_value,margins,dropna,margins_name)

135. Pivot-table parameters: data is the DataFrame; values are the values to aggregate; index is the row index; columns is the column index; aggfunc is how the values are aggregated; fill_value is how empty values are filled; margins controls whether a totals row/column is added; margins_name is the name of the totals row/column

136. pd.pivot_table(df,values=['ID','sales'],index='customer classification',columns='area',aggfunc={'ID':'count','sales':'sum'},fill_value=0,margins=True,dropna=False,margins_name='Total')
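A sketch of grouping and pivoting with a tiny made-up table (the column names follow the items above; the data is an assumption):

import pandas as pd

df = pd.DataFrame({'customer classification': ['A', 'A', 'B'],
                   'area': ['east', 'west', 'east'],
                   'ID': [1, 2, 3],
                   'sales': [100, 200, 300]})

# Different aggregations per column after grouping
print(df.groupby('customer classification').aggregate({'ID': 'count', 'sales': 'sum'}))

# The same summary as a pivot table with a totals row
print(pd.pivot_table(df, values=['ID', 'sales'], index='customer classification',
                     columns='area', aggfunc={'ID': 'count', 'sales': 'sum'},
                     fill_value=0, margins=True, margins_name='Total'))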

Nine, multi-table splicing

137. pd.merge(df1,df2) # by default automatically finds the common column of the two tables and joins on it

138. pd.merge(df1,df2,on='student number') # on specifies the join column, which must be a column shared by both tables

139. pd.merge(df1,df2,on=['student number','name']) # on can also specify several shared join columns

140. pd.merge(df1,df2,left_on='student number',right_on='number') # use left_on and right_on when the join columns have different names in the two tables

141. pd.merge(df1,df2,left_index=True,right_index=True) # when the join columns of both tables are their index columns

142. pd.merge(df1,df2,left_index=True,right_on='number') # when one join column is an index column and the other is an ordinary column

143. pd.merge(df1,df2,on='student number',how='inner') # keeps only the key values present in both tables (inner join)

144. pd.merge(df1,df2,on='student number',how='left') # keeps all key values from the left table (left join)

145. pd.merge(df1,df2,on='student number',how='right') # keeps all key values from the right table (right join)

146. pd.merge(df1,df2,on='student number',how='outer') # keeps all key values from both tables (outer join)

147. pd.concat([df1,df2]) # vertically concatenates two tables with the same structure, keeping the original index values

148. pd.concat([df1,df2],ignore_index=True) # vertically concatenates two tables with the same structure and resets the index

149. pd.concat([df1,df2],ignore_index=True).drop_duplicates() # removes duplicate rows after concatenating
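A merge/concat sketch using two made-up tables keyed by 'student number':

import pandas as pd

df1 = pd.DataFrame({'student number': [1, 2, 3], 'score': [90, 80, 70]})
df2 = pd.DataFrame({'student number': [2, 3, 4], 'class': ['a', 'b', 'c']})

print(pd.merge(df1, df2, on='student number', how='inner'))  # keys in both tables
print(pd.merge(df1, df2, on='student number', how='left'))   # all keys from df1
print(pd.merge(df1, df2, on='student number', how='outer'))  # all keys from both

# Vertical concatenation with a fresh index, duplicates removed
print(pd.concat([df1, df1], ignore_index=True).drop_duplicates())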

Ten, export files

150. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\测试.xlsx') # exports to .xlsx with the to_excel method; the output path is given through the excel_writer parameter

151. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document') # sets the sheet name

152. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document',index=False) # exports without the row index

153. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document',index=False,columns=['ID','sales','name']) # sets the columns to export

154. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document',index=False,columns=['ID','sales','name'],encoding='utf-8') # sets the export encoding

155. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document',index=False,columns=['ID','sales','name'],encoding='utf-8',na_rep=0) # fills missing values with 0 on export

156. writer=pd.ExcelWriter(excelpath,engine='xlsxwriter') # exports multiple DataFrames to multiple sheets of one file

157. df1.to_excel(writer,sheet_name='表一')

158. df2.to_excel(writer,sheet_name='表二')

159. writer.save()
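A sketch of the multi-sheet export from items 156-159 (the path is a placeholder, and the xlsxwriter package must be installed):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2]})
df2 = pd.DataFrame({'ID': [3, 4]})

# One workbook, one sheet per DataFrame
writer = pd.ExcelWriter(r'C:\users\zhoulifu\Desktop\test.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='表一', index=False)
df2.to_excel(writer, sheet_name='表二', index=False)
writer.save()  # note: recent pandas versions use writer.close() instead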

