159 commonly used methods of the Python pandas library

The pandas library is designed for data analysis, and it is a large part of what makes Python a powerful and efficient data analysis environment.

1. Pandas data structure

1. import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

2. S1 = pd.Series(['a', 'b', 'c'])  A Series is a data structure made of one set of data and one set of indexes (the row index)

3. S1 = pd.Series(['a', 'b', 'c'], index=(1, 3, 4))  specify the index explicitly

4. S1 = pd.Series({1: 'a', 2: 'b', 3: 'c'})  specify the index in dictionary form (the keys become the index)

5. S1.index returns the index (an attribute, not a method)

6. S1.values returns the values

7. Df = pd.DataFrame(['a', 'b', 'c'])  A DataFrame is a data structure made of one set of data and two sets of indexes (row and column indexes)

8. Df = pd.DataFrame([['a', 'A'], ['b', 'B'], ['c', 'C']], columns=['lowercase', 'uppercase'], index=['one', 'two', 'three'])

columns is the column index; index is the row index

9. pip install -i <Tsinghua mirror URL> pyspider  install a package through the Tsinghua mirror

10. data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']}  pass in a dictionary

Df = pd.DataFrame(data)

11. Df.index and Df.columns return the row and column indexes (attributes, not methods)

 

2. Reading data

12. df = pd.read_excel(r'C:\user\...xlsx', sheet_name='sheet1') or

pd.read_excel(r'C:\user\...xlsx', sheet_name=0)  read an Excel sheet, by sheet name or by position

13. pd.read_excel(r'C:\user\...xlsx', index_col=0, header=0)

index_col specifies the column to use as the row index; header specifies the row to use as the column index

14. pd.read_excel(r'C:\user\...xlsx', usecols=[0, 1])  import only the specified columns; index_col and header may be omitted

15. pd.read_table(r'C:\user\...txt', sep=' ')  import a txt file; sep specifies the separator

16. df.head(2) shows the first two rows; by default the first five rows are shown

17. df.shape shows the number of rows and columns of the data, excluding the row and column indexes

18. df.info()  view the data types of the data in the table

19. df.describe() returns distribution statistics (count, mean, variance, etc.) for the numeric columns in the table

 

3. Data preprocessing

20. df.info()  shows which columns in the table contain empty values

21. df.isnull() determines which values are missing: it returns True where a value is missing and False otherwise

22. df.dropna() deletes rows containing missing values by default

23. df.dropna(how='all') deletes only rows that are entirely empty; rows that are not entirely empty are kept

24. df.fillna(0) fills all null values with 0

25. df.fillna({'Gender': 'Male', 'Age': '30'})  fill nulls in the Gender column with 'Male' and nulls in the Age column with '30'

26. df.drop_duplicates() checks all columns for duplicate values by default and keeps the first row of each set of duplicates

27. df.drop_duplicates(subset='gender')  look for duplicates in the gender column only, keeping the first row

28. df.drop_duplicates(subset=['gender', 'company'], keep='last')  check the gender and company columns together

keep defaults to 'first' (keep the first occurrence); it can be set to 'last' (keep the last) or False (keep none)

29. df['ID'].dtype  view the data type of the ID column

30. df['ID'].astype('float')  convert the ID column to the float type

31. Data types: int, float, object, string, unicode, datetime

32. df['ID'][1]  the second value in the ID column

33. df.columns = ['uppercase', 'lowercase', 'Chinese']  add a column index to a table that lacks one

34. df.index = [1, 2, 3]  add a row index

35. df.set_index('number')  specify the column to use as the row index

36. df.rename(index={'order number': 'new order number', 'customer name': 'new customer name'})  rename the row index

37. df.rename(columns={1: 'one', 2: 'two'})  rename the column index

38. df.reset_index() converts all index levels to columns by default

39. df.reset_index(level=0)  convert only the level-0 index to a column

40. df.reset_index(drop=True)  discard the original index instead of converting it

 

4. Data selection

41. df[['ID', 'name']]  multiple column names must be wrapped in a list

42. df.iloc[[1, 3], [2, 4]]  select data by row and column numbers

43. df.iloc[1, 1]  selects the value in the second row and second column (positions are zero-based)

44. df.iloc[:, 0:4]  get the values of the first through fourth columns

45. df.loc['一']  loc selects row data by row name; the result is a Series but it can be accessed like a list

46. df.loc['一'][0] or df.loc['一']['Serial Number']

47. df.iloc[1]  iloc selects row data by row number

48. df.iloc[[1, 3]]  selecting multiple rows by number requires a list; otherwise it becomes row-and-column selection

49. df.iloc[1:3]  selects the second and third rows (the slice excludes position 3)

50. df[df['age'] < 45]  a boolean condition returns all rows that satisfy it, not just the age column

51. df[(df['age'] < 45) & (df['ID'] < 4)]  select data with multiple conditions

52. df.iloc[[1, 3], [2, 4]] is equivalent to df.loc[['一', '二'], ['age', 'ID']]  loc uses names, iloc uses numbers

53. df[df['age'] < 45][['age', 'ID']]  first select rows by the age condition, then pick the columns

54. df.iloc[1:3, 2:4]  slice indexing

 

5. Numerical operations

55. df['age'].replace(100, 33)  replace 100 in the age column with 33

56. df.replace(np.nan, 0)  equivalent to fillna(0); np.nan is how Python represents missing values

57. df.replace(['A', 'B'], 'C')  many-to-one replacement: both A and B are replaced with C

58. df.replace({'A': 'a', 'B': 'b', 'C': 'c'})  many-to-many replacement

59. df.sort_values(by=['application order number'], ascending=False)  sort the application order number column in descending order; True (the default) sorts ascending

60. df.sort_values(by=['application form number'], na_position='first')  sort the application form number column in ascending order with missing values placed first

by default missing values are placed last

61. df.sort_values(by=['col1', 'col2'], ascending=[False, True])  multi-column sorting

62. df['Sales'].rank(method='first')  compute the rank of each value in the Sales column; method can be 'first', 'min', 'max' or 'average'

63. df.drop(['Sales', 'ID'], axis=1)  delete columns by column name

64. df.drop(df.columns[[4, 5]], axis=1)  delete columns by position

65. df.drop(columns=['Sales', 'ID'])  deleting columns this way does not require axis=1

66. df.drop(['a', 'b'], axis=0)  delete rows by row label

67. df.drop(df.index[[4, 5]], axis=0)  delete rows by position

68. df.drop(index=['a', 'b'])  deleting rows this way does not require axis=0

69. df['ID'].value_counts()  count the number of occurrences of each value in the ID column

70. df['ID'].value_counts(normalize=True, sort=False)  normalize=True returns proportions instead of counts; sort=False leaves the result unsorted

71. df['ID'].unique()  get the unique values of the column

72. df['age'].isin(['a', 11])  check whether each value in the column is 'a' or 11

73. pd.cut(df['ID'], bins=[0, 3, 6, 10])  bins specifies the boundaries between intervals

74. pd.qcut(df['ID'], 3)  cut the ID column into 3 parts with the number of values in each part as equal as possible

75. df.insert(2, 'Commodity', ['Book', 'Pen', 'Calculator'])  insert a column as the third column

76. df['commodity'] = ['book', 'pen', 'calculator']  insert a new column at the end of the table

77. df.T  swap rows and columns (transpose)

78. df.stack()  convert table data into tree-shaped (long) data

79. df.set_index(['ID', 'name']).stack().reset_index()  convert a wide table into a long table: first set the shared columns as the row index, then stack the remaining columns into tree-shaped data, then reset the row index

80. df.melt(id_vars=['ID', 'Name'], var_name='year', value_name='sale')  id_vars names the columns that stay fixed when the wide table becomes a long table; var_name names the column that receives the old column labels; value_name names the column that receives the values

81. df['C1'].apply(lambda x: x + 1)  works like map(), usually paired with a lambda

82. df.applymap(lambda x: x + 1)  apply the same function to every value in the table

 

6. Data operations

83. df['ID'] + df['ID']  columns can be added, subtracted, multiplied and divided

84. df['ID'] > df['ID']  comparison operations such as >, <, ==, != also work

85. df.count()  count the non-null values in each column

86. df.count(axis=1)  count the non-null values in each row

87. df['ID'].count()  count the non-null values of the specified column

88. df.sum(axis=1)  sum of each row; without axis=1, the sum of each column

89. df.mean(axis=1)  mean of each row (or of each column without axis=1)

90. df.max(axis=1)  maximum of each row

91. df.min(axis=1)  minimum of each row

92. df.median(axis=1)  median of each row

93. df.mode(axis=1)  mode (most frequent value) of each row

94. df.var(axis=1)  variance of each row

95. df.std(axis=1)  standard deviation of each row

96. df.quantile(0.25)  the 1/4 quantile of each column; 0.5, 0.75 and other quantiles work too

97. df.corr()  compute pairwise correlations over the entire DataFrame

 

7. Time series

98. from datetime import datetime, timedelta  (timedelta is used in items 121 to 123)

99. datetime.now()  return the current year, month, day, hour, minute and second

100. datetime.now().year  return the year; .month and .day also work

101. datetime.now().weekday() + 1  return the day of the week (weekday() itself counts from 0 for Monday)

102. datetime.now().isocalendar()  return the ISO calendar tuple (year, week number, weekday)

103. (2018, 41, 7)  sample output: year 2018, week 41, day 7

104. datetime.now().date()  return only the year, month and day

105. datetime.now().time()  return only the time

106. datetime.now().strftime('%Y-%m-%d %H:%M:%S')  returns e.g. 2020-03-13 09:09:12

107. from dateutil.parser import parse

108. parse(str_time)  convert a time string into datetime format

109. pd.DatetimeIndex(['2020-02-03', '2020-03-05'])  build a time index

110. data['2018']  get the data for 2018

111. data['2018-01']  get the data for January 2018

112. data['2018-01-05':'2018-01-15']  get the data for this period

113. Handling tables whose index is not a time index:

114. df[df['deal time'] == datetime(2018, 8, 5)]

115. df[df['deal time'] > datetime(2018, 8, 5)]

116. df[(df['deal time'] > datetime(2018, 8, 5)) & (df['deal time'] < datetime(2018, 8, 15))]  note that month and day must not carry leading zeros (datetime(2018, 08, 05) is a syntax error)

117. cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50)

118. cha.days  return the day part of the time difference

119. cha.seconds  return the seconds part of the time difference (the remainder after whole days, not the total)

120. cha.seconds / 3600  convert that seconds part to hours

121. datetime(2018, 5, 21, 19, 50) + timedelta(days=1)  shift one day later

122. datetime(2018, 5, 21, 19, 50) + timedelta(seconds=20)  shift 20 seconds later

123. datetime(2018, 5, 21, 19, 50) - timedelta(days=1)  shift one day earlier

 

8. Pivot tables

124. df.groupby('customer classification').count()  count after grouping by customer classification

125. df.groupby('customer classification').sum()  sum after grouping by customer classification

126. df.groupby(['customer classification', 'region classification']).sum()  sum after grouping by multiple columns

127. df.groupby(['customer classification', 'region classification'])['ID'].sum()  sum the ID column after multi-column grouping

128. df['ID']  taking one column out of a DataFrame yields a Series

129. df.groupby(df['ID']).sum() is equivalent to df.groupby('ID').sum()

130. df.groupby('customer classification').aggregate(['sum', 'count'])  aggregate can apply several aggregation methods at once

131. df.groupby('customer classification').aggregate({'ID': 'count', 'sales': 'sum'})

132. aggregate can perform different summary operations on different columns

133. df.groupby('customer classification').sum().reset_index()  reset the index after grouping and summarizing to get a standard DataFrame

134. pd.pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)

135. Pivot table parameters: data: the data table df; values: the values to aggregate; index: the row index; columns: the column index; aggfunc: the calculation applied to values; fill_value: how to fill empty values; margins: whether to add totals; margins_name: the name of the totals row/column

136. pd.pivot_table(df, values=['ID', 'sales'], index='customer classification', columns='region', aggfunc={'ID': 'count', 'sales': 'sum'}, fill_value=0, margins=True, dropna=None, margins_name='total')

 

9. Multi-table merging

137. pd.merge(df1, df2)  by default the common columns of the two tables are found automatically and joined on

138. pd.merge(df1, df2, on='student ID')  on specifies the join column, which must be a common column

139. pd.merge(df1, df2, on=['student ID', 'name'])  on can also specify several common join columns

140. pd.merge(df1, df2, left_on='student ID', right_on='number')  when the join columns of the two tables have different names, use left_on and right_on

141. pd.merge(df1, df2, left_index=True, right_index=True)  when the join columns of both tables are index columns

142. pd.merge(df1, df2, left_index=True, right_on='number')  when one join key is an index column and the other is an ordinary column

143. pd.merge(df1, df2, on='student ID', how='inner')  return only rows whose key appears in both tables (inner join)

144. pd.merge(df1, df2, on='student ID', how='left')  keep every row of the left table (left join)

145. pd.merge(df1, df2, on='student ID', how='right')  keep every row of the right table (right join)

146. pd.merge(df1, df2, on='student ID', how='outer')  keep all keys from both tables (outer join)

147. pd.concat([df1, df2])  vertically concatenate two tables with the same structure, keeping the original index values

148. pd.concat([df1, df2], ignore_index=True)  vertically concatenate and reset the index

149. pd.concat([df1, df2], ignore_index=True).drop_duplicates()  remove duplicate rows after concatenating

 

10. Exporting files

150. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx')  export to .xlsx with the to_excel method; the path is passed through the excel_writer parameter

151. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document')

152. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False)  export without the index

153. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'])  choose which columns are exported

154. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8')  set the export encoding

155. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8', na_rep=0)  fill missing values on export

156. writer = pd.ExcelWriter(excelpath, engine='xlsxwriter')  export multiple DataFrames to multiple sheets of one file

157. df1.to_excel(writer, sheet_name='Sheet1')

158. df2.to_excel(writer, sheet_name='Sheet2')

159. writer.save()
