1. Description
Pandas is undoubtedly one of the best Python libraries for tabular data wrangling and manipulation. However, if you are new to Pandas and trying to get a firm grip on the library, the official Pandas documentation can seem daunting and overwhelming as a starting point.
2. Overview of Pandas topics
The list of topics is as follows:
List of topics in the official Pandas API documentation (image from author) (source: here)
You can find the code for this article here.
3. List of usage methods
3.1 Import library
To use the pandas library, you first need to import it. The widely adopted convention is to give pandas the alias pd:
import pandas as pd
3.2 Read CSV
CSV is generally the most popular file format to read Pandas dataframes from. You can use the pd.read_csv() method to create a pandas dataframe from a CSV file:
file = "file.csv"
df = pd.read_csv(file)
print(df)
>>
col1 col2 col3
0 1 2 A
1 3 4 B
We can verify the type of the created object using the type() function:
type(df)
>>
pandas.core.frame.DataFrame
3.3 Storing DataFrames to CSV
Just as CSV files are commonly read into dataframes, they are also widely used to store dataframes.
Use the df.to_csv() method as shown below:
df.to_csv("file.csv", sep = "|", index = False)
The sep argument specifies the column delimiter, and index=False instructs Pandas not to write the index of the DataFrame to the CSV file.
!cat file.csv
>>
col1|col2|col3
1|2|A
3|4|B
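As a rough sketch of the round trip described above, the snippet below writes the same small dataframe with a "|" delimiter and reads it back. It uses an in-memory io.StringIO buffer instead of a file on disk, which is an assumption made here only to keep the example self-contained; the names buffer, csv_text and restored are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [3, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# Write to an in-memory text buffer rather than to "file.csv"
buffer = io.StringIO()
df.to_csv(buffer, sep="|", index=False)
csv_text = buffer.getvalue()
print(csv_text)

# When reading back, sep must match the delimiter used for writing
restored = pd.read_csv(io.StringIO(csv_text), sep="|")
print(restored)
```

If sep is left at its default (a comma) when reading, the whole line would end up in a single column, so the delimiter has to be passed on both sides.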
3.4 Create data frame
To create a pandas dataframe, use the pd.DataFrame() method:
data = [[1, 2, "A"],
[3, 4, "B"]]
df = pd.DataFrame(data,
columns = ["col1", "col2", "col3"])
print(df)
>>
col1 col2 col3
0 1 2 A
1 3 4 B
3.4.1 From a list of lists
A popular approach is to convert a given list of lists into a dataframe:
data = [[1, 2, "A"],
[3, 4, "B"]]
df = pd.DataFrame(data,
columns = ["col1", "col2", "col3"])
print(df)
>>
col1 col2 col3
0 1 2 A
1 3 4 B
3.4.2 From dictionary
Another popular approach is to convert a Python dictionary to a DataFrame:
data = {'col1': [1, 2],
'col2': [3, 4],
'col3': ["A", "B"]}
df = pd.DataFrame(data=data)
print(df)
>>
col1 col2 col3
0 1 3 A
1 2 4 B
You can read more about creating dataframes here.
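The two approaches differ in orientation: in a list of lists each inner list is one row, while in a dictionary each key is one column. The sketch below builds the same dataframe both ways to make that visible; the variable names are illustrative.

```python
import pandas as pd

# Row-oriented input: each inner list becomes one row
rows = [[1, 3, "A"],
        [2, 4, "B"]]
df_rows = pd.DataFrame(rows, columns=["col1", "col2", "col3"])

# Column-oriented input: each key becomes one column
cols = {"col1": [1, 2],
        "col2": [3, 4],
        "col3": ["A", "B"]}
df_cols = pd.DataFrame(cols)

print(df_rows)
print(df_cols)
```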
3.5 The shape of the data frame
Dataframes are essentially matrices with column headings. Therefore, they have a specific number of rows and columns.
You can print the dimensions using the shape attribute:
print(df)
print("Shape:", df.shape)
>>
col1 col2 col3
0 1 3 A
1 2 4 B
Shape: (2, 3)
Here, the first element of the tuple (2) is the number of rows and the second element (3) is the number of columns.
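Since shape is an ordinary tuple, it can be unpacked directly. A minimal sketch, with n_rows and n_cols as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2],
                   "col2": [3, 4],
                   "col3": ["A", "B"]})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")

# len() of a dataframe gives the number of rows
print(len(df))
```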
3.6 View the first N rows
Typically, in a real dataset, you will have many rows.
In this case, one is usually only interested in looking at the first n rows of the dataframe.
You can use the df.head(n) method to print the first n rows:
print(df.head(5))
>>
col1 col2 col3
0 1 2 A
1 3 4 B
2 5 6 C
3 7 8 D
4 9 10 E
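The ten-row dataframe behind this output is not shown in the article. The sketch below rebuilds something similar under the assumption that col1 holds the odd numbers 1 to 19, col2 the even numbers 2 to 20, and col3 the letters A to J (only the first five rows are confirmed by the output above). It also uses df.tail(), the counterpart of df.head().

```python
import string

import pandas as pd

# Assumed reconstruction of the 10-row example dataframe
df = pd.DataFrame({
    "col1": range(1, 20, 2),
    "col2": range(2, 21, 2),
    "col3": list(string.ascii_uppercase[:10]),
})

print(df.head(3))  # first 3 rows
print(df.head())   # n defaults to 5
print(df.tail(2))  # last 2 rows
```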
3.7 Print the data type of the column
Pandas assigns an appropriate data type to each column in a data frame.
You can print the data types of all columns using the dtypes attribute:
df.dtypes
>>
col1 int64
col2 int64
col3 object
dtype: object
3.8 Modify the data type of a column
If you want to change the data type of a column, you can use the astype() method:
import numpy as np
df["col1"] = df["col1"].astype(np.int8)
print(df.dtypes)
>>
col1 int8
col2 int64
col3 object
dtype: object
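astype() can also be called on the whole dataframe with a dictionary that maps column names to target types. The following is a small sketch of that variant; the choice of float64 for col2 is purely illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [3, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# Convert several columns at once; columns not listed keep their type
converted = df.astype({"col1": np.int8, "col2": "float64"})
print(converted.dtypes)

# astype() returns a new dataframe, so df itself is unchanged
print(df.dtypes)
```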
3.9 Printing descriptive information about a data frame
3.9.1 Method 1
The first method, df.info(), prints the non-null counts (and therefore the missing value statistics) and the data types of each column.
df.info()
>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 10 non-null int8
1 col2 10 non-null int64
2 col3 10 non-null object
dtypes: int64(1), int8(1), object(1)
memory usage: 298.0+ bytes
3.9.2 Method 2
The second method, df.describe(), is relatively more descriptive and prints standard statistics such as the mean, standard deviation, maximum, etc. for each numeric column.
print(df.describe())
>>
col1 col2
count 10.00 10.00
mean 10.00 11.00
std 6.06 6.06
min 1.00 2.00
25% 5.50 6.50
50% 10.00 11.00
75% 14.50 15.50
max 19.00 20.00
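The numbers in this table can be reproduced from individual Series methods. The sketch below assumes the numeric columns are the odd numbers 1 to 19 and the even numbers 2 to 20, which is consistent with the statistics printed above but is not stated in the article.

```python
import pandas as pd

# Assumed numeric columns of the 10-row example dataframe
df = pd.DataFrame({"col1": range(1, 20, 2),
                   "col2": range(2, 21, 2)})

stats = df.describe()
print(stats.round(2))

# describe() is a summary of per-column methods such as these
print(df["col1"].mean(), df["col1"].std(), df["col1"].max())
```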
3.10 Filling NaN values
In real datasets, missing data is almost inevitable. You can use the df.fillna() method to replace missing values with a specific value.
df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, "B"]],
columns = ["col1", "col2", "col3"])
print(df)
>>
col1 col2 col3
0 1.0 2 A
1 NaN 4 B
Replace the missing value with 0 as follows:
df.fillna(0, inplace = True)
print(df)
>>
col1 col2 col3
0 1.0 2 A
1 0.0 4 B
You can read more about dealing with missing data in my previous blog.
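fillna() also accepts a dictionary, so each column can get its own replacement value. Here is a minimal sketch; the missing entry in col3 and the "missing" placeholder are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, None]],
                  columns=["col1", "col2", "col3"])

# Per-column fill values; without inplace=True a new dataframe is returned
filled = df.fillna({"col1": 0, "col3": "missing"})
print(filled)

# The original dataframe still contains the missing values
print(df.isna().sum())
```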
3.11 Joining DataFrames
If you want to merge two dataframes on a join key, use the pd.merge() method:
df1 = ...
df2 = ...
print(df1)
print(df2)
>>
col1 col2 col3
0 1 2 A
1 3 4 A
2 5 6 B
col3 col4
0 A X
1 B Y
pd.merge(df1, df2, on = "col3")
>>
col1 col2 col3 col4
0 1 2 A X
1 3 4 A X
2 5 6 B Y
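The definitions of df1 and df2 are elided in the article. The sketch below rebuilds them from the printed output (so the constructor calls are a reconstruction, not the author's code) and runs the same merge.

```python
import pandas as pd

# Reconstructed from the printed output above
df1 = pd.DataFrame([[1, 2, "A"],
                    [3, 4, "A"],
                    [5, 6, "B"]],
                   columns=["col1", "col2", "col3"])
df2 = pd.DataFrame([["A", "X"],
                    ["B", "Y"]],
                   columns=["col3", "col4"])

# Default is an inner join: keep rows whose key appears in both frames
merged = pd.merge(df1, df2, on="col3")
print(merged)
```

Each row of df1 picks up the col4 value of the matching col3 key in df2, which is why "X" appears twice in the result.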
3.12 Sorting a DataFrame
Sorting is another typical operation used by data scientists to order dataframes. A dataframe can be sorted using the df.sort_values() method.
df = pd.DataFrame([[1, 2, "A"],
[5, 8, "B"],
[3, 10, "B"]],
columns = ["col1", "col2", "col3"])
print(df.sort_values("col1"))
>>
col1 col2 col3
0 1 2 A
2 3 10 B
1 5 8 B
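sort_values() also accepts a list of columns and a matching list of sort directions. A short sketch on the same data, sorting by col3 ascending and then by col2 descending:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# Sort by col3 (ascending), breaking ties by col2 (descending)
sorted_df = df.sort_values(["col3", "col2"], ascending=[True, False])
print(sorted_df)
```

As in the example above, the original index labels travel with their rows.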
3.13 Grouping DataFrames
To group a dataframe and perform aggregations, use the groupby() method in Pandas as follows:
df = pd.DataFrame([[1, 2, "A"],
[5, 8, "B"],
[3, 10, "B"]],
columns = ["col1", "col2", "col3"])
df.groupby("col3").agg({"col1": "sum", "col2": "max"})
>>
col1 col2
col3
A 1 2
B 8 10
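An alternative way to write the same aggregation is named aggregation, where each output column gets an explicit name. This is a sketch of that variant; the output names col1_sum and col2_max are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# output_name=(input_column, aggregation)
result = df.groupby("col3").agg(
    col1_sum=("col1", "sum"),
    col2_max=("col2", "max"),
)
print(result)
```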
3.14 Renaming columns
If you want to rename the column headers, use the df.rename() method like this:
df = pd.DataFrame([[1, 2, "A"],
[5, 8, "B"],
[3, 10, "B"]],
columns = ["col_A", "col2", "col3"])
df.rename(columns = {"col_A":"col1"})
>>
col1 col2 col3
0 1 2 A
1 5 8 B
2 3 10 B
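Note that df.rename() returns a new dataframe rather than modifying the original, so the result has to be assigned if you want to keep it. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "B"]],
                  columns=["col_A", "col2", "col3"])

renamed = df.rename(columns={"col_A": "col1"})

print(list(df.columns))       # the original keeps its headers
print(list(renamed.columns))  # the returned copy has the new header
```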
3.15 Delete column
If you want to drop a column, use the df.drop() method:
df = pd.DataFrame([[1, 2, "A"],
[5, 8, "B"],
[3, 10, "B"]],
columns = ["col1", "col2", "col3"])
print(df.drop(columns = ["col1"]))
>>
col2 col3
0 2 A
1 8 B
2 10 B
3.16 Adding a new column
Two widely used methods of adding new columns are:
3.16.1 Method 1
You can add new columns using the assignment operator:
df = pd.DataFrame([[1, 2], [3, 4]],
columns = ["col1", "col2"])
df["col3"] = df["col1"] + df["col2"]
print(df)
>>
col1 col2 col3
0 1 2 3
1 3 4 7
3.16.2 Method 2
Alternatively, you can use the df.assign() method as follows:
df = pd.DataFrame([[1, 2], [3, 4]],
columns = ["col1", "col2"])
df = df.assign(col3 = df["col1"] + df["col2"])
print(df)
>>
col1 col2 col3
0 1 2 3
1 3 4 7
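df.assign() also accepts callables, which makes it possible to add several columns in one call, including one that depends on a column created earlier in the same call. The sketch below adds a hypothetical col4 for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=["col1", "col2"])

# Each lambda receives the dataframe built so far
result = df.assign(
    col3=lambda d: d["col1"] + d["col2"],
    col4=lambda d: d["col3"] * 2,
)
print(result)
```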
3.17 Filtering DataFrames
There are various ways to filter a dataframe based on conditions.
Method 1: Boolean filtering
Here, a row is selected if the condition evaluates to True for that row.
df = pd.DataFrame([[1, 2, "A"],
[5, 8, "B"],
[3, 10, "B"]],
columns = ["col1", "col2", "col3"])
print(df[df["col2"] > 5])
>>
col1 col2 col3
1 5 8 B
2 3 10 B
The value in col2 must be greater than 5 for a row to be selected.
The isin() method is used to select the rows whose values belong to a given list of values.
df = pd.DataFrame([[1, 2, "A"],
[5, 8, "B"],
[3, 10, "C"]],
columns = ["col1", "col2", "col3"])
filter_list = ["A", "C"]
print(df[df["col3"].isin(filter_list)])
>>
col1 col2 col3
0 1 2 A
2 3 10 C
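Boolean conditions can be combined with & (and), | (or) and ~ (not). Each condition needs its own parentheses. A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "C"]],
                  columns=["col1", "col2", "col3"])

filter_list = ["A", "C"]

# Rows where col2 > 5 AND col3 is in the list
both = df[(df["col2"] > 5) & (df["col3"].isin(filter_list))]
print(both)

# Rows where col3 is NOT in the list
not_in = df[~df["col3"].isin(filter_list)]
print(not_in)
```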
Method 2: Select a column
You can also select an entire column as follows:
df["col1"] ## or df.col1
>>
0 1
1 5
2 3
Name: col1, dtype: int64
Method 3: Select by label
In label-based selection, every requested label must be present in the index of the dataframe.
Integers are also valid labels, but they refer to labels rather than positions.
Consider the following dataframe.
df = pd.DataFrame([[6, 5, 10],
[5, 8, 6],
[3, 10, 4]],
columns = ["Maths", "Science", "English"],
index = ["John", "Mark", "Peter"])
print(df)
>>
Maths Science English
John 6 5 10
Mark 5 8 6
Peter 3 10 4
We use the df.loc accessor for label-based selection.
df.loc["John"]
>>
Maths 6
Science 5
English 10
Name: John, dtype: int64
df.loc["Mark", ["Maths", "English"]]
>>
Maths 5
English 6
Name: Mark, dtype: int64
However, with df.loc[] you cannot use a position to select rows of this dataframe, like this:
df.loc[0]
>>
Execution Error
KeyError: 0
To achieve the above, you should use position-based selection with df.iloc[].
Method 4: Select by position
df.iloc[0]
>>
Maths 6
Science 5
English 10
Name: John, dtype: int64
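The difference between the two accessors is easiest to see when the index itself consists of integers. The sketch below reuses the marks data with a hypothetical integer index [2, 3, 1], so that label 1 and position 1 point at different rows.

```python
import pandas as pd

df = pd.DataFrame([[6, 5, 10],
                   [5, 8, 6],
                   [3, 10, 4]],
                  columns=["Maths", "Science", "English"],
                  index=[2, 3, 1])

# loc looks up the index label 1, which is the third row
by_label = df.loc[1]
print(by_label)

# iloc looks up position 1, which is the second row
by_position = df.iloc[1]
print(by_position)
```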
3.18 Finding Unique Values in a DataFrame
To print all distinct values in a column, use the unique() method.
df = pd.DataFrame([[1, 2, "A"],
[5, 8, "B"],
[3, 10, "A"]],
columns = ["col1", "col2", "col3"])
df["col3"].unique()
>>
array(['A', 'B'], dtype=object)
If you want to print the number of unique values, use nunique() instead.
df["col3"].nunique()
>>
2
3.19 Applying functions to data frames
If you want to apply a function to a dataframe, use the apply() method. With axis=1, the function is called once per row:
def add_cols(row):
return row.col1 + row.col2
df = pd.DataFrame([[1, 2],
[5, 8],
[3, 9]],
columns = ["col1", "col2"])
df["col3"] = df.apply(add_cols, axis=1)
print(df)
>>
col1 col2 col3
0 1 2 3
1 5 8 13
2 3 9 12
You can also apply functions to individual columns like so:
def square_col(num):
return num**2
df = pd.DataFrame([[1, 2],
[5, 8],
[3, 9]],
columns = ["col1", "col2"])
df["col3"] = df.col1.apply(square_col)
print(df)
>>
col1 col2 col3
0 1 2 1
1 5 8 25
2 3 9 9
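For simple arithmetic like the two examples above, the same results can usually be obtained with vectorized column operations, without apply(). The following sketch compares both routes; it is meant as an illustration rather than a statement about which one the article recommends.

```python
import pandas as pd

df = pd.DataFrame([[1, 2],
                   [5, 8],
                   [3, 9]],
                  columns=["col1", "col2"])

# Row-wise apply, as in the example above
via_apply = df.apply(lambda row: row["col1"] + row["col2"], axis=1)

# Vectorized equivalents operating on whole columns
vectorized_sum = df["col1"] + df["col2"]
vectorized_square = df["col1"] ** 2

print(via_apply.tolist())
print(vectorized_sum.tolist())
print(vectorized_square.tolist())
```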
3.20 Handling Duplicate Data
You can mark all duplicate rows with the df.duplicated() method:
df = pd.DataFrame([[1, "A"],
[2, "B"],
[1, "A"]],
columns = ["col1", "col2"])
df.duplicated(keep=False)
Because keep=False is passed, every row that has a duplicate is marked True.
Also, you can use the df.drop_duplicates() method to remove duplicate rows like this:
df = pd.DataFrame([[1, "A"],
[2, "B"],
[1, "A"]],
columns = ["col1", "col2"])
print(df.drop_duplicates())
>>
col1 col2
0 1 A
1 2 B
One copy of each duplicated row (by default, the first occurrence) is kept.
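The keep argument controls which occurrences count as duplicates. The sketch below prints the masks for keep=False and for the default keep="first", and shows drop_duplicates() with keep="last" for comparison.

```python
import pandas as pd

df = pd.DataFrame([[1, "A"],
                   [2, "B"],
                   [1, "A"]],
                  columns=["col1", "col2"])

# keep=False marks every member of a duplicate group
mask_all = df.duplicated(keep=False)
print(mask_all.tolist())

# The default keep="first" leaves the first occurrence unmarked
mask_first = df.duplicated()
print(mask_first.tolist())

# keep="last" retains the last occurrence instead of the first
last_kept = df.drop_duplicates(keep="last")
print(last_kept)
```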
3.21 Finding the distribution of values
To find the frequency of each unique value in a column, use the value_counts() method:
df = pd.DataFrame([[1, "A"],
[2, "B"],
[1, "A"]],
columns = ["col1", "col2"])
print(df.value_counts("col2"))
>>
col2
A 2
B 1
dtype: int64
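value_counts() is also available on a single column (a Series), and passing normalize=True returns relative frequencies instead of counts. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, "A"],
                   [2, "B"],
                   [1, "A"]],
                  columns=["col1", "col2"])

# Absolute counts per distinct value
counts = df["col2"].value_counts()
print(counts)

# Fractions that sum to 1
shares = df["col2"].value_counts(normalize=True)
print(shares)
```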
3.22 Resetting the Index of a DataFrame
To reset the index of a dataframe, use the df.reset_index() method:
df = pd.DataFrame([[6, 5, 10],
[5, 8, 6],
[3, 10, 4]],
columns = ["col1", "col2", "col3"],
index = [2, 3, 1])
print(df.reset_index())
>>
index col1 col2 col3
0 2 6 5 10
1 3 5 8 6
2 1 3 10 4
To discard the old index instead of keeping it as a column, pass drop=True to the above method:
df.reset_index(drop=True)
>>
col1 col2 col3
0 6 5 10
1 5 8 6
2 3 10 4
3.23 Find Crosstabulation
To return the frequency of each combination of values in two columns, use the pd.crosstab() method:
df = pd.DataFrame([["A", "X"],
["B", "Y"],
["C", "X"],
["A", "X"]],
columns = ["col1", "col2"])
print(pd.crosstab(df.col1, df.col2))
>>
col2 X Y
col1
A 2 0
B 0 1
C 1 0
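pd.crosstab() can also append row and column totals when margins=True is passed; the totals appear under the label "All". A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([["A", "X"],
                   ["B", "Y"],
                   ["C", "X"],
                   ["A", "X"]],
                  columns=["col1", "col2"])

# margins=True adds an "All" row and an "All" column with totals
ct = pd.crosstab(df["col1"], df["col2"], margins=True)
print(ct)
```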
3.24 Pivoting DataFrames
Pivot tables are a commonly used data analysis tool in Excel. Similar to the crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate data.
Consider the following dataframe:
df = ...
print(df)
>>
Name Subject Marks
0 John Maths 6
1 Mark Maths 5
2 Peter Maths 3
3 John Science 5
4 Mark Science 8
5 Peter Science 10
6 John English 10
7 Mark English 6
8 Peter English 4
Using the pd.pivot_table() method, you can convert column entries into column headers:
pd.pivot_table(df,
index = ["Name"],
columns=["Subject"],
values='Marks',
fill_value=0)
>>
Subject English Maths Science
Name
John 10 6 5
Mark 6 5 8
Peter 4 3 10
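The definition of df is elided in the article. The sketch below rebuilds it from the printed long-format table (a reconstruction, not the author's code) and applies the same pivot. Since every Name and Subject pair occurs exactly once, the default aggregation (the mean) simply returns the single mark, although it may be displayed as a float depending on the Pandas version.

```python
import pandas as pd

# Reconstructed from the printed output above
df = pd.DataFrame({
    "Name": ["John", "Mark", "Peter"] * 3,
    "Subject": ["Maths"] * 3 + ["Science"] * 3 + ["English"] * 3,
    "Marks": [6, 5, 3, 5, 8, 10, 10, 6, 4],
})

pivoted = pd.pivot_table(df,
                         index=["Name"],
                         columns=["Subject"],
                         values="Marks",
                         fill_value=0)
print(pivoted)
```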