The pandas library is designed for data analysis, and it is a key reason Python is such a powerful and efficient data-analysis environment.
One, Pandas data structure
1. import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. S1=pd.Series(['a','b','c']) a Series is a data structure composed of one set of data and one set of indexes (row labels)
3. S1=pd.Series(['a','b','c'],index=(1,3,4)) specify the index
4. S1=pd.Series({1:'a',2:'b',3:'c'}) specify the index in dictionary form
5. S1.index returns the index (an attribute, so no parentheses)
6. S1.values returns the values (also an attribute)
7. Df=pd.DataFrame(['a','b','c']) A DataFrame is a data structure composed of a set of data and two sets of indexes (row and column indexes)
8. Df=pd.DataFrame([['a','A'],['b','B'],['c','C']],columns=['lowercase','uppercase'],index=['one','two','three'])
columns is the column index, and index is the row index
9. pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspider  install from the Tsinghua PyPI mirror
10. data={'lowercase':['a','b','c'],'uppercase':['A','B','C']} pass in a dictionary
Df=pd.DataFrame(data)
11. Df.index and Df.columns return the row and column indexes (attributes, no parentheses)
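The constructors in points 2-11 can be tried end to end; this is a minimal runnable sketch using the same illustrative labels:

```python
import pandas as pd

# A Series: one set of values plus a row index
s1 = pd.Series(['a', 'b', 'c'], index=[1, 3, 4])

# A DataFrame: values plus row and column indexes, built from a dict
data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']}
df = pd.DataFrame(data, index=['one', 'two', 'three'])

print(list(s1.index))      # [1, 3, 4]
print(list(df.columns))    # ['lowercase', 'uppercase']
```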
Two, reading data
12. df=pd.read_excel(r'C:\user\...xlsx',sheet_name='sheet1') or
pd.read_excel(r'C:\user\...xlsx',sheet_name=0)  read an Excel sheet by name or by position
13. pd.read_excel(r'C:\user\...xlsx',index_col=0,header=0)
index_col specifies which column becomes the row index; header specifies which row holds the column names
14. pd.read_excel(r'C:\user\...xlsx',usecols=[0,1]) import only the specified columns (used without index_col and header)
15. pd.read_table(r'C:\user\...txt',sep='\t') import a txt file; sep specifies the delimiter (here tab)
16. df.head(2) displays the first two rows, the first 5 rows are displayed by default
17. df.shape displays several rows and columns of data, excluding row and column indexes
18. df.info() shows the data type of each column in the table
19. df.describe() returns distribution statistics (count, mean, std, quartiles, etc.) for the numeric columns in the table
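Because the Excel paths above are placeholders, this sketch builds the frame in memory (made-up column names and values) and shows the inspection calls from points 16-19:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'sales': [10.0, 20.0, 15.0, 30.0, 25.0, 5.0]})

print(df.head(2))            # first two rows (head() alone shows five)
print(df.shape)              # (6, 2) -- rows and columns, indexes excluded
stats = df.describe()        # count/mean/std/quartiles for numeric columns
print(stats.loc['mean', 'sales'])   # 17.5
```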
Three, data preprocessing
20. df.info() also shows, via the non-null counts, which columns contain empty values
21. df.isnull() flags missing values: it returns True where a value is missing and False otherwise
22, df.dropna() deletes rows with missing values by default
23. df.dropna(how='all') delete rows with all empty values, and rows with not all empty values will not be deleted
24, df.fillna(0) fills all empty values with 0
25. df.fillna({'sex':'male','age':30}) fills nulls in the sex column with 'male' and nulls in the age column with 30
26. df.drop_duplicates() checks all columns for duplicates by default and keeps the first occurrence
27. df.drop_duplicates(subset='gender') checks duplicates only in the gender column, keeping the first occurrence
28. df.drop_duplicates(subset=['gender','company'],keep='last') checks gender and company together
keep defaults to 'first' (keep the first occurrence); it can be 'last' (keep the last) or False (drop all duplicates)
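A runnable sketch of points 21-28 (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sex': ['male', np.nan, 'female', 'female'],
                   'age': [30, 25, np.nan, 25]})

filled = df.fillna({'sex': 'male', 'age': 30})   # per-column fill values
print(filled.isnull().sum().sum())               # 0 -- no missing values left

dropped = df.dropna()        # drops any row containing a missing value
print(len(dropped))          # 2

dupes = pd.DataFrame({'gender': ['m', 'm', 'f'], 'company': ['A', 'B', 'A']})
print(len(dupes.drop_duplicates(subset='gender', keep='last')))  # 2
```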
29, df['ID'].dtype view the data type of the ID column
30. df['ID'].astype('float') converts the data type of the ID column to float type
31. Common dtypes: int64, float64, object (strings), bool, datetime64
32. df['ID'][1] the second value of the ID column (position 1 with the default integer index)
33, df.columns=['uppercase','lowercase','Chinese'] add column index for non-indexed table
34, df.index=[1,2,3] Add row index
35. df.set_index('number') specifies the column to be used as a row index
36. df.rename(index={'order number':'new order number','customer name':'new customer name'}) renames row index labels
37. df.rename(columns={1:'one',2:'two'}) renames column index labels
38. df.reset_index() converts all indexes into columns by default
39, df.reset_index(level=0) converts level 0 index into column
40. df.reset_index(drop=True) drops the original index instead of turning it into a column
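Points 35-38 in a small runnable sketch (the 'number'/'name' labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'number': [101, 102, 103], 'name': ['a', 'b', 'c']})

indexed = df.set_index('number')          # use a column as the row index
print(indexed.loc[102, 'name'])           # 'b'

renamed = indexed.rename(index={101: 'first'},
                         columns={'name': 'Name'})   # braces, not parens
print(list(renamed.columns))              # ['Name']

flat = indexed.reset_index()              # index back to a regular column
print(list(flat.columns))                 # ['number', 'name']
```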
Four, data selection
41. df[['ID','name']] multiple column names need to be loaded into the list
42, df.iloc[[1,3],[2,4]] select data with row and column number
43. df.iloc[1,1] selects the value in the 2nd row, 2nd column (positions are 0-based; the header row is not counted)
44, df.iloc[:,0:4] #Get the value of column 1 to column 4
45. df.loc['一'] #loc uses the row name to select the row data, the format is Series, but it can be accessed in the form of a list
46. df.loc['一'][0] or df.loc['一']['serial number']
47. df.iloc[1]#iloc select row data with row number
48. df.iloc[[1,3]]# Multi-row numbering to select row data, use list to encapsulate, otherwise it becomes row and column selection
49. df.iloc[1:3]#Selects the 2nd and 3rd rows (the end of the slice is excluded)
50. df[df['age']<45] #Boolean condition: returns every row where age is under 45, with all columns
51. df[(df['age']<45)&(df['ID']<4)] #Judging multi-condition selection data
52. df.iloc[[1,3],[2,4]] is equivalent to df.loc[['一','二'],['age','ID']] #loc uses labels, iloc uses positions
53. df[df['age']<45][['age','ID']]# first select rows through age conditions, and then specify columns through different indexes
54. df.iloc[1:3,2:4]#Slice index
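The selection rules above can be checked directly; a minimal sketch with made-up labels and ages:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'age': [30, 50, 40, 20]},
                  index=['一', '二', '三', '四'])

print(df.iloc[1, 1])          # 50 -- 2nd row, 2nd column, by position
print(df.loc['一', 'age'])    # 30 -- by label

young = df[df['age'] < 45]    # boolean filtering, all columns kept
print(len(young))             # 3

both = df[(df['age'] < 45) & (df['ID'] < 4)]   # multi-condition selection
print(len(both))              # 2
```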
Five, numerical operations
55. df['age'].replace(100,33)# Replace 100 in the age column with 33
56. df.replace(np.nan,0)# equivalent to fillna(0); np.nan is how a missing value is represented
57. df.replace(['A','B'],'C')#Many-to-one replacement: A and B are both replaced with C
58. df.replace({'A':'a','B':'b','C':'c'})#Many-to-many replacement
59. df.sort_values(by=['Application Form Number'],ascending=False)#Sorts the application form number column in descending order; True (the default) sorts ascending
60. df.sort_values(by=['Application Form Number'],na_position='first')#Sorts ascending with missing values placed first
(by default missing values go last, na_position='last')
61. df.sort_values(by=['col1','col2'],ascending=[False,True])#Multi-column sorting
62. df['Sales'].rank(method='first')#Ranks sales (without sorting); method can be 'first', 'min', 'max', or 'average'
63. df.drop(['Sales','ID'],axis=1)#Drops columns by column name
64. df.drop(df.columns[[4,5]],axis=1)#Delete column, which is the number
65. df.drop(columns=['Sales','ID'])#Drops columns this way without writing axis=1
66. df.drop(['a','b'],axis=0)#Drops rows by row label
67. df.drop(df.index[[4,5]],axis=0)#Delete row, which is the number
68. df.drop(index=['a','b'])#Delete rows in this way, you don’t need to write axis=0
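Points 55-68 in one runnable sketch (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [100.0, 33.0, np.nan, 45.0],
                   'ID': [4, 2, 3, 1]})

fixed = df['age'].replace(100, 33)       # one-to-one replacement
print(fixed.iloc[0])                     # 33.0

asc = df.sort_values(by=['ID'])          # ascending is the default
print(asc['ID'].tolist())                # [1, 2, 3, 4]

nafirst = df.sort_values(by=['age'], na_position='first')
print(np.isnan(nafirst['age'].iloc[0]))  # True -- NaN sorted first

slim = df.drop(columns=['ID'])           # drop a column without axis=1
print(list(slim.columns))                # ['age']
```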
69. df['ID'].value_counts()# Count the number of times the data in the ID column appears
70. df['ID'].value_counts(normalize=True,sort=False)#Counts occurrences as proportions (normalize=True) without sorting by frequency (sort=False)
71. df['ID'].unique()#Get the unique value of the column
72. df['age'].isin(['a',11])#Checks, value by value, whether the column contains 'a' or 11
73. pd.cut(df['ID'],bins=[0,3,6,10])#Use bins to specify the cut partition
74. pd.qcut(df['ID'],3)#ID column is cut into 3 parts, the number of data in each part is as consistent as possible
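Counting and binning (points 69-74) in a short sketch; the values are made up:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 5, 8, 9])

print(s.value_counts()[1])        # 2 -- the value 1 appears twice
print(sorted(s.unique()))         # [1, 2, 5, 8, 9]
print(s.isin([1, 9]).sum())       # 3 -- three values are 1 or 9

binned = pd.cut(s, bins=[0, 3, 6, 10])   # explicit bin edges
print(binned.value_counts())             # counts per interval
```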
75. df.insert(2,'product',['book','pen','calculator'])#insert the third column
76. df['product']=['book','pen','calculator']#Inserts a new column at the end of the table
77. df.T swaps rows and columns (transpose)
78. df.stack()#Converts tabular (wide) data into stacked (long) data
79. df.set_index(['ID','name']).stack().reset_index()#Wide-to-long conversion: first set the shared columns as the row index, then stack the remaining
columns into long format, and finally reset the row index
80. df.melt(id_vars=['ID','name'],var_name='year',value_name='sale')#Wide-to-long conversion: id_vars lists the columns that stay
unchanged, var_name names the new column holding the old column labels, and value_name names the column holding the values
81. df['C1'].apply(lambda x:x+1)#Applies a function element-wise (similar to map()), used together with lambda
82. df.applymap(lambda x:x+1) applies the same function to every value in the table (renamed DataFrame.map in newer pandas)
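The reshaping calls above can be sketched as follows (the year columns and values are invented):

```python
import pandas as pd

wide = pd.DataFrame({'ID': [1, 2], 'name': ['a', 'b'],
                     '2019': [10, 20], '2020': [30, 40]})

# Wide-to-long: ID/name stay fixed, the year columns stack into rows
long = wide.melt(id_vars=['ID', 'name'], var_name='year', value_name='sale')
print(len(long))             # 4 rows: 2 ids x 2 years

# Element-wise function on a single column
plus_one = wide['ID'].apply(lambda x: x + 1)
print(plus_one.tolist())     # [2, 3]
```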
Six, data calculation
83. df['ID']+df['ID']# columns support addition, subtraction, multiplication and division
84. df['ID']>df['ID']# columns support comparison operations such as > < == !=
85. df.count()#Count the number of non-null values in each column
86, df.count(axis=1)#Count the number of non-empty values in each row
87. df['ID'].count()#Count the number of non-empty values in the specified column
88. df.sum(axis=1)#Sum of each row (the default, axis=0, works per column)
89. df.mean(axis=1)#Mean of each row
90. df.max(axis=1)#Maximum of each row
91. df.min(axis=1)#Minimum of each row
92. df.median(axis=1)#Median of each row
93. df.mode(axis=1)#Most frequent value in each row
94. df.var(axis=1)#Variance of each row
95. df.std(axis=1)#Standard deviation of each row
96. df.quantile(0.25)# 1/4 quantile of each column; 0.5, 0.75 and other quantiles work the same way
97. df.corr()#Pairwise correlations of the numeric columns in the DataFrame
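The statistics in points 85-96 on a tiny frame (values invented; note how NaN is skipped by count and mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, np.nan], 'b': [4, 4, 6, 8]})

print(df.count().tolist())       # [3, 4] -- non-null count per column
print(df['b'].sum())             # 22
print(df.mean(axis=1).iloc[0])   # 2.5 -- mean across the first row
print(df['b'].median())          # 5.0
print(df['b'].quantile(0.25))    # 4.0
```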
Seven, time series
98、from datetime import datetime
99. datetime.now()#Returns the current year, month, day, hour, minute and second
100. datetime.now().year# returns the year; .month and .day work the same way
101. datetime.now().weekday()+1#Returns the day of the week (weekday() counts Monday as 0)
102. datetime.now().isocalendar()#Returns the ISO calendar tuple (year, week number, weekday)
103. (2018, 41, 7) # the 7th day of the 41st week of 2018
104. datetime.now().date()# Returns only the year, month and day
105. datetime.now().time()#Returns only the time
106. datetime.now().strftime('%Y-%m-%d %H:%M:%S')#Returns e.g. 2020-03-13 09:09:12
107. from dateutil.parser import parse
108. parse(str_time)#Converts a time string into a datetime
109. pd.DatetimeIndex(['2020-02-03','2020-03-05'])#Builds a time index
110, data['2018']#Get the data of 2018
111, data['2018-01']#Get the data of January 2018
112. data['2018-01-05':'2018-01-15']#Get data for this period
113. Filtering a table whose time column is not the index:
114. df[df['Deal time']==datetime(2018,8,5)]  (leading zeros such as 08 are a syntax error in Python 3)
115. df[df['Deal time']>datetime(2018,8,5)]
116. df[(df['Deal time']>datetime(2018,8,5))&(df['Deal time']<datetime(2018,8,15))]
117. cha = datetime(2018,5,21,19,50) - datetime(2018,5,18,17,50)
118. cha.days#The whole-day part of the time difference
119. cha.seconds#The remaining seconds beyond the whole days (not the total)
120. cha.seconds/3600#The remaining hours beyond the whole days
121. datetime(2018,5,21,19,50)+timedelta(days=1)# moves one day later (timedelta is imported from datetime)
122. datetime(2018,5,21,19,50)+timedelta(seconds=20)# moves 20 seconds later
123. datetime(2018,5,21,19,50)-timedelta(days=1)# moves one day earlier
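The timedelta arithmetic above, runnable (the dates are the same illustrative ones used in the notes):

```python
from datetime import datetime, timedelta

cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50)
print(cha.days)            # 3
print(cha.seconds)         # 7200 -- the sub-day remainder, in seconds
print(cha.seconds / 3600)  # 2.0 hours beyond the whole days

later = datetime(2018, 5, 21, 19, 50) + timedelta(days=1)
print(later.day)           # 22

# 2018-10-14 is the 7th day of ISO week 41 of 2018
print(datetime(2018, 10, 14).isocalendar()[:2])   # (2018, 41)
```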
Eight, pivot table
124, df.groupby('Customer Classification').count()#Calculate the number after customer classification
125. df.groupby('Customer Classification').sum()#Sum operation after customer classification
126. df.groupby(['customer classification','area classification']).sum()#Sum after grouping by multiple columns (passed as a list)
127. df.groupby(['customer classification','area classification'])['ID'].sum()#Sum of the ID column after grouping by multiple columns
128. df['ID']#DataFrame takes out one column is the Series type
129. df.groupby(df['ID']).sum() is equivalent to df.groupby('ID').sum()
130. df.groupby('customer classification').aggregate(['sum','count'])# aggregate applies multiple aggregations at once
131, df.groupby('customer classification').aggregate({'ID':'count','sales':'sum'})
132, # aggregate can do different summary operations for different columns
133. df.groupby('Customer Classification').sum().reset_index()# After grouping and summarizing, reset the index and become a standard DataFrame
134. pd.pivot_table(data,values,index,columns,aggfunc,fill_value,margins,dropna,margins_name)
135. Pivot table arguments: data: the DataFrame; values: the values to aggregate; index: row index; columns: column index; aggfunc: how the values are aggregated; fill_value: fill for empty cells; margins: whether to add a totals row/column; margins_name: the label of the totals row/column
136. pd.pivot_table(df,values=['ID','sales'],index='customer classification',columns='area',aggfunc={'ID':'count','sales':'sum'},fill_value=0,margins=True,margins_name='Total')
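Grouping and pivoting side by side in a runnable sketch (customer/area labels and sales figures are made up):

```python
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'b', 'b'],
                   'area': ['n', 's', 'n', 's'],
                   'sales': [10, 20, 30, 40]})

grouped = df.groupby('customer')['sales'].sum()
print(grouped['a'])                  # 30

# Different aggregations per column via a dict
summary = df.groupby('customer').aggregate({'sales': ['sum', 'count']})

pivot = pd.pivot_table(df, values='sales', index='customer',
                       columns='area', aggfunc='sum',
                       fill_value=0, margins=True, margins_name='Total')
print(pivot.loc['Total', 'Total'])   # 100 -- grand total
```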
Nine, multi-table splicing
137, pd.merge(df1,df2)#Automatically find common columns in the two tables for splicing by default
138. pd.merge(df1,df2,on='student number')#on specifies the join column when it is a shared column
139. pd.merge(df1,df2,on=['student number','name'])#on can also take several shared join columns
140. pd.merge(df1,df2,left_on='student number',right_on='number')#left_on/right_on specify the key in each table when the column names differ
141. pd.merge(df1,df2,left_index=True,right_index=True)#When the join keys of both tables are their index columns
142. pd.merge(df1,df2,left_index=True,right_on='number')#When one key is an index column and the other is a regular column
143. pd.merge(df1,df2,on='student number',how='inner')#Returns only rows whose key appears in both tables (inner join)
144. pd.merge(df1,df2,on='student number',how='left')#Keeps every row of the left table (left join)
145. pd.merge(df1,df2,on='student number',how='right')#Keeps every row of the right table (right join)
146. pd.merge(df1,df2,on='student number',how='outer')#Keeps all keys from both tables (outer join)
147. pd.concat([df1,df2])# Two tables with the same structure are connected vertically, retaining the original index value
148. pd.concat([df1,df2], ignore_index=True)# Two tables with the same structure are connected vertically, and the index value is reset
149, pd.concat([df1,df2], ignore_index=True).drop_duplicates()#Remove duplicate values after splicing
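The join behaviors above can be verified on two tiny tables (student numbers and scores invented):

```python
import pandas as pd

df1 = pd.DataFrame({'student number': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'student number': [2, 3, 4], 'score': [90, 80, 70]})

inner = pd.merge(df1, df2, on='student number', how='inner')
print(len(inner))              # 2 -- only students 2 and 3 appear in both

left = pd.merge(df1, df2, on='student number', how='left')
print(len(left))               # 3 -- every row of df1 kept

stacked = pd.concat([df1, df1], ignore_index=True)   # vertical, index reset
print(stacked.index.tolist())  # [0, 1, 2, 3, 4, 5]
```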
Ten, exporting files
150. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\测试.xlsx')#Export to .xlsx with the to_excel method; the target path is passed via excel_writer
151. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document')
152, df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document', index=False)#Export is to remove the index
153. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document', index=False,columns=['ID','sales','name']) #Set the exported columns
154. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document', index=False,columns=['ID','sales','name'], encoding='utf-8')#Set the export encoding (this parameter was removed from to_excel in newer pandas)
155, df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx',sheet_name='document', index=False,columns=['ID','sales','name'], encoding='utf-8',na_rep=0)#Missing value filling
156. writer=pd.ExcelWriter(excelpath,engine='xlsxwriter')#Export multiple DataFrames to multiple sheets of one file
157. df1.to_excel(writer,sheet_name='表一')
158. df2.to_excel(writer,sheet_name='表二')
159. writer.save()  (in newer pandas, use writer.close() instead)