The pandas library is designed for data analysis, and it is a large part of what makes Python a powerful and efficient data-analysis environment.
1. Pandas data structure
1. import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. s1 = pd.Series(['a', 'b', 'c'])  # a Series is a data structure made of one set of values and one index (the row index)
3. s1 = pd.Series(['a', 'b', 'c'], index=(1, 3, 4))  # specify the index explicitly
4. s1 = pd.Series({1: 'a', 2: 'b', 3: 'c'})  # specify the index via a dictionary
5. s1.index  # returns the index (an attribute, not a method)
6. s1.values  # returns the values (also an attribute)
7. df = pd.DataFrame(['a', 'b', 'c'])  # a DataFrame is made of values plus two indexes (a row index and a column index)
8. df = pd.DataFrame([['a', 'A'], ['b', 'B'], ['c', 'C']], columns=['lowercase', 'uppercase'], index=['one', 'two', 'three'])
# columns is the column index, index is the row index
9. pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspider  # install via the Tsinghua mirror
10. data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']}  # pass in a dictionary
df = pd.DataFrame(data)
11. df.index and df.columns  # the row and column indexes (attributes)
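The constructors above can be sketched with toy values; the labels here are made up for illustration:

```python
import pandas as pd

# A Series: one set of values plus one (row) index.
s1 = pd.Series(['a', 'b', 'c'], index=[1, 3, 4])

# A DataFrame: values plus a row index and a column index.
df = pd.DataFrame([['a', 'A'], ['b', 'B'], ['c', 'C']],
                  columns=['lowercase', 'uppercase'],
                  index=['one', 'two', 'three'])

# index, values and columns are attributes, not methods.
row_labels = list(s1.index)    # [1, 3, 4]
col_labels = list(df.columns)  # ['lowercase', 'uppercase']
```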
2. Reading data
12. df = pd.read_excel(r'C:\user\...xlsx', sheet_name='sheet1') or pd.read_excel(r'C:\user\...xlsx', sheet_name=0)  # read an Excel sheet by name or by position
13. pd.read_excel(r'C:\user\...xlsx', index_col=0, header=0)  # index_col picks the column to use as the row index; header picks the row to use as the column index
14. pd.read_excel(r'C:\user\...xlsx', usecols=[0, 1])  # import only the specified columns; can be used without index_col and header
15. pd.read_table(r'C:\user\...txt', sep=' ')  # import a txt file; sep specifies the separator
16. df.head(2)  # shows the first two rows; the default is the first five
17. df.shape  # the number of rows and columns, excluding the row and column indexes
18. df.info()  # view the data type of each column in the table
19. df.describe()  # distribution statistics (count, mean, std, quartiles, etc.) of the numeric columns in the table
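Since the Excel paths above are placeholders, a small in-memory table can stand in to show the inspection calls; the column names are invented:

```python
import pandas as pd

# Toy table standing in for data loaded with pd.read_excel / pd.read_table.
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'sales': [10, 20, 30, 40, 50, 60]})

first_two = df.head(2)      # first two rows (head() alone shows five)
rows, cols = df.shape       # (6, 2): counts exclude the indexes
stats = df.describe()       # count, mean, std, min, quartiles, max per numeric column
mean_sales = stats.loc['mean', 'sales']   # 35.0
```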
3. Data preprocessing
20. df.info()  # also shows which columns contain null values
21. df.isnull()  # returns True where a value is missing, False otherwise
22. df.dropna()  # by default deletes rows that contain missing values
23. df.dropna(how='all')  # deletes only rows that are entirely empty; partially filled rows are kept
24. df.fillna(0)  # fills all null values with 0
25. df.fillna({'Gender': 'Male', 'Age': 30})  # fill nulls in the Gender column with 'Male' and nulls in the Age column with 30
26. df.drop_duplicates()  # by default checks all columns for duplicates and keeps the first occurrence
27. df.drop_duplicates(subset='gender')  # checks only the gender column for duplicates, keeping the first row
28. df.drop_duplicates(subset=['gender', 'company'], keep='last')  # checks the gender and company columns together; keep defaults to 'first' (keep the first occurrence) and can be set to 'last' (keep the last) or False (keep none)
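A sketch of the missing-value and duplicate handling above, on an invented table:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Gender': ['Male', np.nan, 'Female', 'Female'],
                   'Age': [25, 30, np.nan, np.nan]})

dropped = df.dropna()                              # keeps only fully filled rows
filled = df.fillna({'Gender': 'Male', 'Age': 30})  # per-column fill values
deduped = filled.drop_duplicates(subset='Gender')  # first row per Gender value
```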
29. df['ID'].dtype  # view the data type of the ID column
30. df['ID'].astype('float')  # convert the ID column to float
31. Common data types: int, float, object, string, unicode, datetime
32. df['ID'][1]  # the second value in the ID column
33. df.columns = ['uppercase', 'lowercase', 'Chinese']  # add a column index to a table that has none
34. df.index = [1, 2, 3]  # add a row index
35. df.set_index('number')  # use the specified column as the row index
36. df.rename(index={'order number': 'new order number', 'customer name': 'new customer name'})  # rename row index values
37. df.rename(columns={1: 'one', 2: 'two'})  # rename the column index
38. df.reset_index()  # by default converts all index levels into columns
39. df.reset_index(level=0)  # converts only the level-0 index into a column
40. df.reset_index(drop=True)  # drops the original index instead of turning it into a column
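The index round trip (set_index, rename, reset_index) on a made-up table:

```python
import pandas as pd

df = pd.DataFrame({'number': [101, 102, 103], 'sales': [5, 7, 9]})

indexed = df.set_index('number')                # a column becomes the row index
renamed = indexed.rename(index={101: 'first'})  # rename one row label
flat = indexed.reset_index()                    # the index goes back to being a column
```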
4. Data selection
41. df[['ID', 'name']]  # multiple column names must be wrapped in a list
42. df.iloc[[1, 3], [2, 4]]  # select data by row and column numbers
43. df.iloc[1, 1]  # selects the value in the 2nd row, 2nd column (positions are 0-based)
44. df.iloc[:, 0:4]  # get the values of the first through fourth columns
45. df.loc['one']  # loc selects row data by row label; the result is a Series but can be accessed like a list
46. df.loc['one'][0] or df.loc['one']['Serial Number']
47. df.iloc[1]  # iloc selects row data by row number
48. df.iloc[[1, 3]]  # select multiple rows by number; wrap the numbers in a list, otherwise it becomes row-and-column selection
49. df.iloc[1:3]  # selects the 2nd and 3rd rows (the slice end is exclusive)
50. df[df['age'] < 45]  # filter with a condition; returns all columns of the matching rows, not only the age column
51. df[(df['age'] < 45) & (df['ID'] < 4)]  # select data with multiple conditions
52. df.iloc[[1, 3], [2, 4]] is equivalent to df.loc[['one', 'two'], ['age', 'ID']]  # loc uses labels, iloc uses numbers
53. df[df['age'] < 45][['age', 'ID']]  # first filter rows by the age condition, then select the specified columns
54. df.iloc[1:3, 2:4]  # slice indexing
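The label/position contrast and boolean filtering, sketched on a toy table:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'age': [30, 50, 40, 20]},
                  index=['one', 'two', 'three', 'four'])

by_label = df.loc['two', 'age']    # loc uses labels
by_position = df.iloc[1, 1]        # iloc uses 0-based positions; same cell here
young = df[df['age'] < 45]         # boolean filter: every column of the matching rows
both = df[(df['age'] < 45) & (df['ID'] < 4)]  # combine conditions with & and parentheses
```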
5. Numerical operations
55. df['age'].replace(100, 33)  # replace 100 in the age column with 33
56. df.replace(np.nan, 0)  # equivalent to fillna(0); np.nan is how a missing value is represented
57. df.replace(['A', 'B'], 'C')  # many-to-one replacement: A and B are both replaced by C
58. df.replace({'A': 'a', 'B': 'b', 'C': 'c'})  # many-to-many replacement
59. df.sort_values(by=['application order number'], ascending=False)  # sort the column in descending order; True (the default) sorts ascending
60. df.sort_values(by=['application form number'], na_position='first')  # sort ascending with missing values placed first; by default missing values go last
61. df.sort_values(by=['col1', 'col2'], ascending=[False, True])  # multi-column sorting
62. df['Sales'].rank(method='first')  # rank the Sales column; method can be 'first', 'min', 'max', or 'average'
63. df.drop(['Sales', 'ID'], axis=1)  # delete columns by name
64. df.drop(df.columns[[4, 5]], axis=1)  # delete columns by number
65. df.drop(columns=['Sales', 'ID'])  # deleting columns this way does not require axis=1
66. df.drop(['a', 'b'], axis=0)  # delete rows by row label
67. df.drop(df.index[[4, 5]], axis=0)  # delete rows by number
68. df.drop(index=['a', 'b'])  # deleting rows this way does not require axis=0
69. df['ID'].value_counts()  # count the occurrences of each value in the ID column
70. df['ID'].value_counts(normalize=True, sort=False)  # occurrence proportions of each value, without sorting
71. df['ID'].unique()  # get the unique values of the column
72. df['age'].isin(['a', 11])  # check whether each value in the column is 'a' or 11
73. pd.cut(df['ID'], bins=[0, 3, 6, 10])  # bins specifies the boundaries between intervals
74. pd.qcut(df['ID'], 3)  # cut the ID column into 3 parts with as close to equal counts as possible
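Counting and binning on an invented Series:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 5, 7, 9])

counts = s.value_counts()               # occurrences of each value
shares = s.value_counts(normalize=True) # proportions instead of counts
binned = pd.cut(s, bins=[0, 3, 6, 10])  # intervals (0,3], (3,6], (6,10]
bin_sizes = binned.value_counts()       # how many values landed in each interval
```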
75. df.insert(2, 'Commodity', ['Book', 'Pen', 'Calculator'])  # insert a new column as the third column
76. df['commodity'] = ['book', 'pen', 'calculator']  # append a new column at the end of the table
77. df.T  # swap rows and columns (transpose)
78. df.stack()  # convert table data into tree (long) data
79. df.set_index(['ID', 'name']).stack().reset_index()  # convert a wide table into a long table: first set the shared columns as the row index, then stack the remaining columns into tree data, then reset the row index
80. df.melt(id_vars=['ID', 'Name'], var_name='year', value_name='sale')  # id_vars names the columns that stay unchanged when the wide table becomes a long table; var_name is the new column name for the original column index; value_name is the new column name for the values
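The wide-to-long conversion with melt, on an invented table with one column per year:

```python
import pandas as pd

wide = pd.DataFrame({'ID': [1, 2], 'Name': ['Ann', 'Bob'],
                     '2019': [10, 20], '2020': [30, 40]})

# id_vars stay as columns; the year columns collapse into year/sale pairs.
long_df = wide.melt(id_vars=['ID', 'Name'], var_name='year', value_name='sale')
```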
81. df['C1'].apply(lambda x: x + 1)  # equivalent to map(); usually used with a lambda
82. df.applymap(lambda x: x + 1)  # apply the same function to every value in the table
6. Data operations
83. df['ID'] + df['ID']  # columns support addition, subtraction, multiplication, and division
84. df['ID'] > df['ID']  # columns support comparisons such as >, <, ==, !=
85. df.count()  # count the non-null values in each column
86. df.count(axis=1)  # count the non-null values in each row
87. df['ID'].count()  # count the non-null values of the specified column
88. df.sum() / df.sum(axis=1)  # sum of each column (default) / each row (axis=1)
89. df.mean() / df.mean(axis=1)  # mean of each column / row
90. df.max() / df.max(axis=1)  # maximum of each column / row
91. df.min() / df.min(axis=1)  # minimum of each column / row
92. df.median() / df.median(axis=1)  # median of each column / row
93. df.mode() / df.mode(axis=1)  # most frequent value of each column / row
94. df.var() / df.var(axis=1)  # variance of each column / row
95. df.std() / df.std(axis=1)  # standard deviation of each column / row
96. df.quantile(0.25)  # the 1/4 quantile; can also be 0.5, 0.75, etc.
97. df.corr()  # correlation matrix over the whole DataFrame
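The axis convention above, sketched on a two-column toy table:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

col_sums = df.sum()        # default axis=0: one result per column
row_sums = df.sum(axis=1)  # axis=1: one result per row
col_means = df.mean()
q1 = df.quantile(0.25)     # first quartile of each column
```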
7. Time series
98. from datetime import datetime
99. datetime.now()  # the current year, month, day, hour, minute, second
100. datetime.now().year  # the current year; likewise .month and .day
101. datetime.now().weekday() + 1  # the day of the week (weekday() counts Monday as 0)
102. datetime.now().isocalendar()  # the ISO calendar tuple, e.g.
103. (2018, 41, 7)  # year 2018, week 41, day 7
104. datetime.now().date()  # only the year, month, and day
105. datetime.now().time()  # only the time
106. datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # e.g. returns '2020-03-13 09:09:12'
107. from dateutil.parser import parse
108. parse(str_time)  # convert a time string into a datetime object
109. pd.DatetimeIndex(['2020-02-03', '2020-03-05'])  # set a time index
110. data['2018']  # get the data for 2018
111. data['2018-01']  # get the data for January 2018
112. data['2018-01-05':'2018-01-15']  # get the data for this period
113. Filtering a table whose index is not a time index:
114. df[df['deal time'] == datetime(2018, 8, 5)]
115. df[df['deal time'] > datetime(2018, 8, 5)]
116. df[(df['deal time'] > datetime(2018, 8, 5)) & (df['deal time'] < datetime(2018, 8, 15))]
117. cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50)
118. cha.days  # the whole-day part of the time difference
119. cha.seconds  # the seconds part of the difference, excluding whole days
120. cha.seconds / 3600  # the hours part of the difference
121. datetime(2018, 5, 21, 19, 50) + timedelta(days=1)  # shift one day later (timedelta also comes from the datetime module)
122. datetime(2018, 5, 21, 19, 50) + timedelta(seconds=20)  # shift 20 seconds later
123. datetime(2018, 5, 21, 19, 50) - timedelta(days=1)  # shift one day earlier
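The timedelta arithmetic above is runnable as-is; note that .days and .seconds are components of the difference, not two ways to express the whole difference:

```python
from datetime import datetime, timedelta

cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50)
whole_days = cha.days           # 3 whole days
leftover_secs = cha.seconds     # remaining seconds beyond whole days: 7200
leftover_hours = cha.seconds / 3600   # 2.0 hours

later = datetime(2018, 5, 21, 19, 50) + timedelta(days=1)    # one day later
earlier = datetime(2018, 5, 21, 19, 50) - timedelta(days=1)  # one day earlier
```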
8. Pivot tables
124. df.groupby('customer classification').count()  # count after grouping by customer classification
125. df.groupby('customer classification').sum()  # sum after grouping by customer classification
126. df.groupby(['customer classification', 'region classification']).sum()  # group by multiple columns (passed as a list), then sum
127. df.groupby(['customer classification', 'region classification'])['ID'].sum()  # sum the ID column after multi-column grouping
128. df['ID']  # selecting one column from a DataFrame yields a Series
129. df.groupby(df['ID']).sum() is equivalent to df.groupby('ID').sum()
130. df.groupby('customer classification').aggregate(['sum', 'count'])  # aggregate can apply several aggregation methods at once
131. df.groupby('customer classification').aggregate({'ID': 'count', 'sales': 'sum'})
132. As above: aggregate can apply a different aggregation to each column
133. df.groupby('customer classification').sum().reset_index()  # reset the index after grouping and summarizing to get a standard DataFrame
134. pd.pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
135. Pivot table parameters: data: the DataFrame; values: the values to aggregate; index: the row grouping; columns: the column grouping; aggfunc: the aggregation applied to values; fill_value: what to fill nulls with; margins: whether to add a totals row/column; margins_name: the name of the totals row/column
136. pd.pivot_table(df, values=['ID', 'sales'], index='customer classification', columns='region', aggfunc={'ID': 'count', 'sales': 'sum'}, fill_value=0, margins=True, dropna=True, margins_name='total')
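A small sketch of groupby and pivot_table on an invented sales table (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'B', 'B'],
                   'region': ['east', 'west', 'east', 'west'],
                   'sales': [10, 20, 30, 40]})

grouped = df.groupby('customer')['sales'].sum()   # one total per customer
summary = df.groupby('customer').aggregate({'sales': ['sum', 'count']})
pivot = pd.pivot_table(df, values='sales', index='customer',
                       columns='region', aggfunc='sum',
                       fill_value=0, margins=True, margins_name='total')
```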
9. Multi-table joining
137. pd.merge(df1, df2)  # by default joins automatically on the columns the two tables share
138. pd.merge(df1, df2, on='student number')  # on specifies the join column when it is a shared column
139. pd.merge(df1, df2, on=['student number', 'name'])  # on can also take several shared join columns
140. pd.merge(df1, df2, left_on='student number', right_on='number')  # join on shared data when the two tables name the column differently, using left and right keys
141. pd.merge(df1, df2, left_index=True, right_index=True)  # when the join key of both tables is their row index
142. pd.merge(df1, df2, left_index=True, right_on='number')  # when one join key is an index and the other is an ordinary column
143. pd.merge(df1, df2, on='student number', how='inner')  # keep only keys present in both tables (inner join)
144. pd.merge(df1, df2, on='student number', how='left')  # keep all rows of the left table (left join)
145. pd.merge(df1, df2, on='student number', how='right')  # keep all rows of the right table (right join)
146. pd.merge(df1, df2, on='student number', how='outer')  # keep all keys from both tables (outer join)
147. pd.concat([df1, df2])  # vertically concatenate two tables with the same structure, keeping the original index values
148. pd.concat([df1, df2], ignore_index=True)  # vertically concatenate and reset the index
149. pd.concat([df1, df2], ignore_index=True).drop_duplicates()  # remove duplicate rows after concatenation
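The join modes above, sketched on two invented tables that share keys 2 and 3:

```python
import pandas as pd

df1 = pd.DataFrame({'student number': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cat']})
df2 = pd.DataFrame({'student number': [2, 3, 4], 'score': [80, 90, 70]})

inner = pd.merge(df1, df2, on='student number', how='inner')  # keys in both: 2, 3
left = pd.merge(df1, df2, on='student number', how='left')    # all left keys
outer = pd.merge(df1, df2, on='student number', how='outer')  # union of keys
stacked = pd.concat([df1, df1], ignore_index=True).drop_duplicates()
```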
10. Exporting files
150. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx')  # export to .xlsx with the to_excel method; the path goes in the excel_writer parameter
151. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document')  # set the sheet name
152. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False)  # export without the index
153. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'])  # set which columns are exported
154. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8')  # set the export encoding
155. df.to_excel(excel_writer=r'C:\users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8', na_rep=0)  # fill missing values on export
156. writer = pd.ExcelWriter(excelpath, engine='xlsxwriter')  # export multiple DataFrames to separate sheets of one file
157. df1.to_excel(writer, sheet_name='sheet one')
158. df2.to_excel(writer, sheet_name='sheet two')
159. writer.save()