Pandas DataFrame basics
1.1 Introduction
Pandas is an open-source Python library for data analysis. It provides two core data structures: DataFrame and Series.
A DataFrame represents an entire spreadsheet or rectangular table of data.
A Series represents a single column of data; selecting one column of a DataFrame returns a Series.
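The relationship between the two structures can be checked directly. The following is a minimal sketch; the column names and values are invented for illustration and do not come from any real data set.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2000, 2000, 2001],
    "age": [30.0, 40.0, 50.0],
})

# The whole table is a DataFrame
table_type = type(df)

# Selecting one column with single brackets returns a Series
age_column = df["age"]
column_type = type(age_column)

print(table_type)
print(column_type)
```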
1.2 Load the data set
Import the library
import pandas
import pandas as pd  # the conventional alias
Example
df = pd.read_csv(r'../file_name', sep='\t')
print(df.head())
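To try the loading step without a file on disk, the sketch below feeds read_csv an in-memory string. The tab-separated text and its column names are hypothetical stand-ins for a real file.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk
tsv_text = "country\tyear\tage\nA\t2000\t30\nB\t2001\t40\n"

# sep='\t' tells read_csv that the fields are separated by tabs
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# head() shows the first rows (up to five by default)
print(df.head())
```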
Commonly used attributes and functions:
type()      # Built-in function to view the data type of a variable, for example type(df)
df.shape    # Get the number of rows and columns
df.columns  # Get the column names
df.dtypes   # Get the data type of each column
df.info()   # Get more detailed information about the data
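A short sketch of these attributes in use, on a small made-up table (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2000, 2000, 2001],
    "age": [30.0, 40.0, 50.0],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = df.shape

# columns holds the column labels
column_names = list(df.columns)

# dtypes is a Series mapping each column name to its data type
column_types = df.dtypes

# info() is a method; it prints a summary of the index, columns and memory use
df.info()
```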
1.3 View columns, rows and cells
1.3.1 Column subsets
# Get a single column
column_1 = df['column_name']
# Get multiple columns
columns_subset = df[['column_1', 'column_2', 'column_3']]
# Use head() and tail() to view the selected columns
column_1.head()
column_1.tail()
columns_subset.head()
columns_subset.tail()
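The difference between single and double brackets shows up in the type of the result. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2000, 2000, 2001],
    "age": [30.0, 40.0, 50.0],
})

# Single brackets with one name: the result is a Series
single = df["age"]

# Double brackets (a list of names): the result is a DataFrame
multi = df[["country", "age"]]

print(single.head(2))
print(multi.tail(1))
```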
1.3.2 Row subsets
# Two methods: loc and iloc
# loc gets a row subset based on the index label (row name, time series). In the examples below the row label equals the row number.
# iloc gets a row subset based on the integer position of the row (row number).
# Get the first row; counting starts from 0
df.loc[0]
# Get the last row; returns a Series
df_row_index = df.shape[0] - 1
df.loc[df_row_index]
# tail() also returns the last row, but as a DataFrame
df.tail(n=1)
# loc cannot take a label that is not in the index, such as -1; a KeyError is raised.
# loc selects multiple rows
# Select the 2nd, 12th, 112th and 1112th rows
df.loc[[1, 11, 111, 1111]]
# iloc gets a single row by positive or negative position
# Get the 2nd row
df.iloc[1]
# Get the 100th row
df.iloc[99]
# Get the last row
df.iloc[-1]
# Get multiple rows (pass a list of positions)
df.iloc[[1, 11, 111, 1111]]
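The distinction between labels and positions is easiest to see when the index labels are not 0, 1, 2. The sketch below uses a made-up frame whose labels (10, 20, 30) are an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical data with index labels that differ from the row positions
df = pd.DataFrame({"age": [30.0, 40.0, 50.0]}, index=[10, 20, 30])

by_label = df.loc[10]       # the row whose label is 10
by_position = df.iloc[0]    # the first row by position (the same row here)
last_row = df.iloc[-1]      # negative positions count from the end
several = df.iloc[[0, 2]]   # a list of positions returns a DataFrame

# loc looks up labels, so -1 is treated as a label that does not exist
try:
    df.loc[-1]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(missing_label_raises)
```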
1.3.3 Mixed acquisition of row and column subsets
The general syntax for loc and iloc uses square brackets with a comma. The row selection goes to the left of the comma and the column selection goes to the right, that is, df.loc[[row],[column]] and df.iloc[[row],[column]].
# Keep in mind the difference between loc and iloc: loc takes labels, iloc takes integer positions
df.loc['row name', 'column name']
df.loc[['row name 1', 'row name 2'], ['column name 1', 'column name 2']]
df.iloc[[row number], [column number]]
df.iloc[[row number 1, row number 2], [column number 1, column number 2]]
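The placeholder patterns above can be made concrete as follows. This is a sketch only; the row labels r1 to r3 and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with string row labels
df = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "year": [2000, 2000, 2001],
        "age": [30.0, 40.0, 50.0],
    },
    index=["r1", "r2", "r3"],
)

# One row label and one column label: a single cell value
cell = df.loc["r1", "age"]

# Lists of labels on both sides of the comma: a smaller DataFrame
block_by_label = df.loc[["r1", "r3"], ["country", "age"]]

# The same selection by integer position; iloc does not accept column names
block_by_position = df.iloc[[0, 2], [0, 2]]

print(block_by_label)
```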
1.4 Grouping and aggregation calculations
1.4.1 Grouping method
# groupby() function
df.groupby('grouping column')['column to display'].<aggregation function>()
# Example: the average age for each year
df.groupby('year')['age'].mean()
# Grouping by multiple columns
# Example: the average age and average GDP for each continent in each year
df.groupby(['year', 'continent'])[['age', 'gdp']].mean()
# reset_index() flattens the grouped result into a regular table; it looks tidier but loses the hierarchical layout.
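A small worked sketch of grouping and flattening. The numbers are invented so that the group means are easy to check by hand; they are not real data.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "continent": ["Asia", "Europe", "Asia", "Asia"],
    "age": [30.0, 40.0, 50.0, 70.0],
    "gdp": [1.0, 2.0, 3.0, 5.0],
})

# One grouping column: a Series indexed by year
mean_age = df.groupby("year")["age"].mean()

# Two grouping columns: a DataFrame with a two-level (hierarchical) index
grouped = df.groupby(["year", "continent"])[["age", "gdp"]].mean()

# reset_index() turns the index levels back into ordinary columns
flat = grouped.reset_index()

print(grouped)
print(flat)
```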
1.4.2 Group frequency calculation
1.5 Basic Drawing
Chinese characters in plot labels appear garbled when no default font that supports Chinese has been set.
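One common way to set such a default is through matplotlib's rcParams, since pandas plotting is built on matplotlib. The sketch below assumes a font named SimHei is installed on the system; that font name is an assumption, and any installed font with Chinese glyphs can be substituted.

```python
import matplotlib

# Assumption: a font named "SimHei" is installed; replace it with any
# installed font that contains Chinese glyphs
matplotlib.rcParams["font.sans-serif"] = ["SimHei"]

# Use the ASCII hyphen for negative numbers so the minus sign still
# renders after the font change
matplotlib.rcParams["axes.unicode_minus"] = False
```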