30 Methods You Should Master to Become a Pandas Professional

1. Description

        Pandas is undoubtedly one of the best Python libraries for tabular data wrangling and manipulation. However, if you are new to Pandas and trying to get a firm grip on the library, things can seem daunting and overwhelming if you start with the official Pandas documentation.

2. Overview of Pandas topics

        The official Pandas API documentation covers a long list of topics:

[Figure: list of topics in the official Pandas API documentation (image from author)]

You can find the code for this article here.

3. List of usage methods

3.1 Import library

        To use the Pandas library, you first need to import it. The widely adopted convention is to give pandas the alias pd.

import pandas as pd

3.2 Read CSV

        CSV is generally the most popular file format for loading data into a Pandas DataFrame. You can use the pd.read_csv() method to create a DataFrame from a CSV file:

file = "file.csv"

df = pd.read_csv(file)
print(df)

>>
   col1  col2 col3
0     1     2    A
1     3     4    B

We can verify the type of the object created using Python's type() function.

type(df)
>>
pandas.core.frame.DataFrame
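
As a minimal sketch of the same call without a file on disk, the snippet below reads the two-row table from an in-memory string. Using io.StringIO as a stand-in for file.csv is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Stand-in for the contents of "file.csv" (assumed here for illustration).
csv_text = "col1,col2,col3\n1,2,A\n3,4,B\n"

# pd.read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(type(df))
```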

3.3 Storing DataFrames to CSV

        Just as CSV files are commonly used to load DataFrames, they are also widely used to save them.

Use the df.to_csv() method as shown below:

df.to_csv("file.csv", sep = "|", index = False)

        The sep argument sets the column delimiter, and index=False instructs Pandas not to write the index of the DataFrame to the CSV file.

!cat file.csv
>>
col1|col2|col3
1|2|A
3|4|B
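
A quick way to check a separator choice is to round-trip the data. The sketch below writes to a string instead of a file (an assumption made to keep it self-contained) and reads it back with the same sep.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [3, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(sep="|", index=False)
print(csv_text)

# The same separator must be passed when reading the data back.
restored = pd.read_csv(io.StringIO(csv_text), sep="|")
print(restored)
```

If sep is omitted when reading, Pandas falls back to a comma and the whole line ends up in a single column.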

3.4 Create data frame

To create a Pandas DataFrame, use the pd.DataFrame() method. Two common inputs are a list of lists and a dictionary.

3.4.1 From a list of lists

        A popular approach is to convert a given list of lists into a dataframe:

data = [[1, 2, "A"], 
        [3, 4, "B"]]

df = pd.DataFrame(data, 
                  columns = ["col1", "col2", "col3"])
print(df)
>>
   col1  col2 col3
0     1     2    A
1     3     4    B

3.4.2 From dictionary

        Another popular approach is to convert a Python dictionary to a DataFrame:

data = {'col1': [1, 2], 
        'col2': [3, 4], 
        'col3': ["A", "B"]}

df = pd.DataFrame(data=data)
print(df)
>>
   col1  col2 col3
0     1     3    A
1     2     4    B

You can read more about creating DataFrames here.
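
A third input shape that pd.DataFrame() also accepts is a list of dictionaries, one per row. This is not covered in the text above; it is shown here only as a small sketch alongside the other two.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names.
records = [
    {"col1": 1, "col2": 3, "col3": "A"},
    {"col1": 2, "col2": 4, "col3": "B"},
]

df = pd.DataFrame(records)
print(df)
```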

3.5 The shape of the data frame

        A DataFrame is essentially a matrix with column headings. Therefore, it has a specific number of rows and columns.

        You can print its dimensions using the shape attribute:

print(df)
print("Shape:", df.shape)
>>
   col1  col2 col3
0     1     3    A
1     2     4    B
Shape: (2, 3)

        Here, the first element of the tuple (2) is the number of rows and the second element (3) is the number of columns.

3.6 View the first N rows

        Typically, a real dataset has many rows.

        In this case, you are usually only interested in looking at the first n rows of the DataFrame.

You can use the df.head(n) method to print the first n rows:

print(df.head(5))
>>
   col1  col2 col3
0     1     2    A
1     3     4    B
2     5     6    C
3     7     8    D
4     9    10    E
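
The sketch below builds a hypothetical 10-row frame (the article's own data is not shown, so the values are assumptions) and contrasts head() with its counterpart tail().

```python
import pandas as pd

# Hypothetical 10-row frame, built only so that head() has something to trim.
df = pd.DataFrame({
    "col1": range(1, 20, 2),
    "col2": range(2, 21, 2),
    "col3": list("ABCDEFGHIJ"),
})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # n defaults to 5

print(first_three)
print(last_two)
```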

3.7 Print the data type of the column

        Pandas assigns an appropriate data type to each column in a data frame.

        You can print the data types of all columns using the dtypes attribute:

df.dtypes
>>
col1     int64
col2     int64
col3    object
dtype: object

3.8 Modify the data type of a column

        If you want to change the data type of a column, you can use the astype() method like this:

import numpy as np

df["col1"] = df["col1"].astype(np.int8)

print(df.dtypes)
>>
col1      int8
col2     int64
col3    object
dtype: object
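
astype() can also be called on the whole DataFrame with a dictionary, which converts several columns in one step. A minimal sketch, using the same small frame as earlier sections:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [3, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# A dict maps column names to target dtypes; other columns are left alone.
converted = df.astype({"col1": "int8", "col2": "float64"})

print(converted.dtypes)
```

The call returns a new DataFrame, so df keeps its original types unless the result is assigned back.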

3.9 Printing descriptive information about a data frame

3.9.1 Method 1

        The first method, df.info(), prints the non-null count (and therefore how many values are missing) and the data type of each column.

df.info()
>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col1    10 non-null     int8  
 1   col2    10 non-null     int64 
 2   col3    10 non-null     object
dtypes: int64(1), int8(1), object(1)
memory usage: 298.0+ bytes

3.9.2 Method 2

        The second method, df.describe(), is relatively more descriptive and prints standard statistics such as the mean, standard deviation, and maximum for each numeric column.

print(df.describe())
>>
        col1   col2
count  10.00  10.00
mean   10.00  11.00
std     6.06   6.06
min     1.00   2.00
25%     5.50   6.50
50%    10.00  11.00
75%    14.50  15.50
max    19.00  20.00

3.10 Filling NaN values

        In real datasets, missing data is almost inevitable. You can use the df.fillna() method to replace missing values with a specific value.

df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, "B"]], 
                  columns = ["col1", "col2", "col3"])
print(df)
>>
   col1  col2 col3
0   1.0     2    A
1   NaN     4    B

Filling the missing value with 0 gives:

df.fillna(0, inplace = True)
print(df)
>>
   col1  col2 col3
0   1.0     2    A
1   0.0     4    B
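
A single fill value rarely suits every column. As a sketch, fillna() also accepts a dictionary of per-column fill values; the "missing" placeholder and the extra None in col3 are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, None]],
                  columns=["col1", "col2", "col3"])

# A dict gives each column its own fill value.
filled = df.fillna({"col1": 0, "col3": "missing"})

print(df.isna().sum())   # missing values per column, before filling
print(filled)
```

Without inplace=True the original frame is left untouched, which makes it easier to compare before and after.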

3.11 Joining DataFrames

If you want to merge two DataFrames on a join key, use the pd.merge() method:

df1 = ...
df2 = ...

print(df1)
print(df2)
>>
   col1  col2 col3
0     1     2    A
1     3     4    A
2     5     6    B
  col3 col4
0    A    X
1    B    Y

pd.merge(df1, df2, on = "col3")
>>
   col1  col2 col3 col4
0     1     2    A    X
1     3     4    A    X
2     5     6    B    Y
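
The definitions of df1 and df2 are elided above. The sketch below rebuilds them from the printed output (an assumption) and adds a left join against a trimmed lookup table to show what the how parameter changes.

```python
import pandas as pd

# Reconstructed from the printed output above; the original
# definitions are elided, so these are assumptions.
df1 = pd.DataFrame([[1, 2, "A"], [3, 4, "A"], [5, 6, "B"]],
                   columns=["col1", "col2", "col3"])
df2 = pd.DataFrame([["A", "X"], ["B", "Y"]],
                   columns=["col3", "col4"])

# The default is an inner join on the key column.
inner = pd.merge(df1, df2, on="col3")

# how="left" keeps every row of df1, even without a match on the right.
left = pd.merge(df1, df2.iloc[:1], on="col3", how="left")

print(inner)
print(left)
```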

3.12 Sorting a DataFrame

        Sorting is another typical operation data scientists use to order DataFrames. A DataFrame can be sorted using the df.sort_values() method.

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

print(df.sort_values("col1"))
>>
   col1  col2 col3
0     1     2    A
2     3    10    B
1     5     8    B
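
sort_values() also takes a list of columns and a matching list of directions. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# Sort by col3 ascending, then break ties by col2 descending.
ordered = df.sort_values(["col3", "col2"], ascending=[True, False])

print(ordered)
```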

3.13 Grouping DataFrames

        To group a DataFrame and perform aggregations, use the groupby() method as follows:

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

df.groupby("col3").agg({"col1": "sum", "col2": "max"})
>>
      col1  col2
col3            
A        1     2
B        8    10
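
When the result needs descriptive column names, named aggregation is one option. The output names below (col1_sum, col2_max, n_rows) are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# Named aggregation: output_name=(input_column, aggregation).
summary = df.groupby("col3").agg(
    col1_sum=("col1", "sum"),
    col2_max=("col2", "max"),
    n_rows=("col1", "count"),
)

print(summary)
```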

3.14 Renaming columns

        If you want to rename the column headers, use the df.rename() method like this:

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col_A", "col2", "col3"])

df.rename(columns = {"col_A":"col1"})
>>
   col1  col2 col3
0     1     2    A
1     5     8    B
2     3    10    B
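
One detail worth checking is that rename() returns a new DataFrame rather than changing the original. A small sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"]],
                  columns=["col_A", "col2", "col3"])

# rename() returns a new DataFrame; df itself is unchanged
# unless the result is assigned back.
renamed = df.rename(columns={"col_A": "col1"})

print(list(df.columns))
print(list(renamed.columns))
```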

3.15 Delete column

        If you want to drop a column, use the df.drop() method:

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

print(df.drop(columns = ["col1"]))
>>
   col2 col3
0     2    A
1     8    B
2    10    B

3.16 Adding a new column

        Two widely used methods of adding new columns are:

3.16.1 Method 1

        You can add new columns using the assignment operator:

df = pd.DataFrame([[1, 2], [3, 4]], 
                  columns = ["col1", "col2"])

df["col3"] = df["col1"] + df["col2"]
print(df)
>>
   col1  col2  col3
0     1     2     3
1     3     4     7

3.16.2 Method 2

        Alternatively, you can use the df.assign() method as follows:

df = pd.DataFrame([[1, 2], [3, 4]], 
                  columns = ["col1", "col2"])

df = df.assign(col3 = df["col1"] + df["col2"])

print(df)
>>
   col1  col2  col3
0     1     2     3
1     3     4     7
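
assign() also accepts callables, which helps when one new column depends on another created in the same call. A minimal sketch (col4 is added here for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["col1", "col2"])

# A callable receives the frame as it stands at that point, so col4
# can refer to col3 created earlier in the same assign() call.
result = df.assign(
    col3=lambda d: d["col1"] + d["col2"],
    col4=lambda d: d["col3"] * 2,
)

print(result)
```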

3.17 Filtering DataFrames

        There are various ways to filter a dataframe based on conditions.

Method 1: Boolean filtering

        Here, a row is selected if the condition evaluates to True for that row.

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

print(df[df["col2"] > 5])
>>
   col1  col2 col3
1     5     8    B
2     3    10    B

The value in col2 must be greater than 5 for a row to be selected.

The isin() method selects the rows whose values belong to a given list of values.

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "C"]], 
                  columns = ["col1", "col2", "col3"])

filter_list = ["A", "C"]
print(df[df["col3"].isin(filter_list)])
>>
   col1  col2 col3
0     1     2    A
2     3    10    C
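
Several conditions can be combined into one mask. The sketch below reuses the frame above; the particular conditions are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "C"]],
                  columns=["col1", "col2", "col3"])

# Combine masks with & (and), | (or), ~ (not); each condition
# needs its own parentheses.
both = df[(df["col2"] > 5) & (df["col3"].isin(["A", "C"]))]
not_b = df[~(df["col3"] == "B")]

print(both)
print(not_b)
```

Python's and / or keywords do not work here because they try to reduce a whole column to a single True or False.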

Method 2: Selecting columns

You can also select an entire column as follows:

df["col1"]  # or df.col1
>>
0    1
1    5
2    3
Name: col1, dtype: int64

Method 3: Select by label

In label-based selection, every label requested must be present in the index of the DataFrame.

Integers are also valid labels, but they refer to the index label, not the position.

Consider the following dataframe.

df = pd.DataFrame([[6, 5,  10], 
                   [5, 8,  6], 
                   [3, 10, 4]], 
                  columns = ["Maths", "Science", "English"],
                  index = ["John", "Mark", "Peter"])

print(df)
>>
       Maths  Science  English
John       6        5       10
Mark       5        8        6
Peter      3       10        4

We use df.loc for label-based selection.

df.loc["John"]
>>
Maths       6
Science     5
English    10
Name: John, dtype: int64

df.loc["Mark", ["Maths", "English"]]
>>
Maths      5
English    6
Name: Mark, dtype: int64

        However, df.loc[] does not allow selecting by position, like this:

df.loc[0]
>>
Execution Error

KeyError: 0

        To achieve the above, you should use position-based selection with df.iloc[].

Method 4: Select by position

df.iloc[0]
>>
Maths       6
Science     5
English    10
Name: John, dtype: int64
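
One difference between the two accessors that is easy to miss is how slices are bounded. A small sketch on the same marks table:

```python
import pandas as pd

df = pd.DataFrame([[6, 5, 10], [5, 8, 6], [3, 10, 4]],
                  columns=["Maths", "Science", "English"],
                  index=["John", "Mark", "Peter"])

# Label slices include the end label.
by_label = df.loc["John":"Mark", "Maths":"Science"]

# Position slices exclude the end position, like Python lists.
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```

Both selections return the same two rows and two columns, even though the slice bounds look different.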

3.18 Finding Unique Values in a DataFrame

        To print all distinct values in a column, use the unique() method.

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "A"]], 
                  columns = ["col1", "col2", "col3"])

df["col3"].unique()
>>
array(['A', 'B'], dtype=object)

If you want to print the number of unique values, use nunique() instead.

df["col3"].nunique()
>>
2

3.19 Applying functions to data frames

If you want to apply a function to a DataFrame, use the apply() method. With axis=1, the function receives one row at a time:

def add_cols(row):
    return row.col1 + row.col2

df = pd.DataFrame([[1, 2], 
                   [5, 8], 
                   [3, 9]], 
                  columns = ["col1", "col2"])
                  
df["col3"] = df.apply(add_cols, axis=1)
print(df)
>>
   col1  col2  col3
0     1     2     3
1     5     8    13
2     3     9    12

You can also apply a function to an individual column (a Series) like so:

def square_col(num):
    return num**2

df = pd.DataFrame([[1, 2], 
                   [5, 8], 
                   [3, 9]], 
                  columns = ["col1", "col2"])
                  
df["col3"] = df.col1.apply(square_col)
print(df)
>>
   col1  col2  col3
0     1     2     1
1     5     8    25
2     3     9     9
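
For simple arithmetic like the two examples above, plain column operations give the same result without calling a Python function per row. The column names col_sum and col1_sq are assumptions made for this sketch.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [5, 8], [3, 9]], columns=["col1", "col2"])

# Column arithmetic operates on whole columns at once.
df["col_sum"] = df["col1"] + df["col2"]
df["col1_sq"] = df["col1"] ** 2

# Row-wise apply() for comparison; it yields the same sums.
via_apply = df.apply(lambda row: row.col1 + row.col2, axis=1)

print(df)
```

apply() remains useful when the logic cannot be expressed as column operations.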

3.20 Handling Duplicate Data

        You can mark all duplicate rows with the df.duplicated() method:

df = pd.DataFrame([[1, "A"], 
                   [2, "B"], 
                   [1, "A"]], 
                  columns = ["col1", "col2"])
                  
df.duplicated(keep=False)

With keep=False, every row that has a duplicate is marked True, including its first occurrence.
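
To make the effect of keep concrete, this sketch compares keep=False with the default on the same three rows:

```python
import pandas as pd

df = pd.DataFrame([[1, "A"], [2, "B"], [1, "A"]],
                  columns=["col1", "col2"])

all_marked = df.duplicated(keep=False)   # every member of a duplicate group
later_only = df.duplicated()             # default keep="first"

print(all_marked.tolist())
print(later_only.tolist())
```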

You can also remove duplicate rows with the df.drop_duplicates() method like this:

df = pd.DataFrame([[1, "A"], 
                   [2, "B"], 
                   [1, "A"]], 
                  columns = ["col1", "col2"])
                  
print(df.drop_duplicates())
>>
   col1 col2
0     1    A
1     2    B

        By default, the first occurrence of each duplicated row is kept.
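
Duplicates can also be judged on a subset of columns. In the sketch below, col3 is an extra column added for illustration, and keep="last" retains the final occurrence.

```python
import pandas as pd

# col3 is added for illustration so the kept row is easy to spot.
df = pd.DataFrame([[1, "A", 10], [2, "B", 20], [1, "A", 30]],
                  columns=["col1", "col2", "col3"])

# Only col1 and col2 decide what counts as a duplicate;
# keep="last" retains the final occurrence instead of the first.
deduped = df.drop_duplicates(subset=["col1", "col2"], keep="last")

print(deduped)
```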

3.21 Finding the Distribution of Values

        To find the frequency of each unique value in a column, use the value_counts() method:

df = pd.DataFrame([[1, "A"], 
                   [2, "B"], 
                   [1, "A"]], 
                  columns = ["col1", "col2"])
                  
print(df.value_counts("col2"))
>>
col2
A    2
B    1
dtype: int64
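
The same method is available on a single column, where normalize=True returns proportions instead of counts. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, "A"], [2, "B"], [1, "A"]],
                  columns=["col1", "col2"])

counts = df["col2"].value_counts()                 # absolute frequencies
shares = df["col2"].value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```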

3.22 Resetting the Index of a DataFrame

To reset the index of a DataFrame, use the df.reset_index() method:

df = pd.DataFrame([[6, 5,  10], 
                   [5, 8,  6], 
                   [3, 10, 4]], 
                  columns = ["col1", "col2", "col3"],
                  index = [2, 3, 1])

print(df.reset_index())
>>
   index  col1  col2  col3
0      2     6     5    10
1      3     5     8     6
2      1     3    10     4

To discard the old index instead of keeping it as a column, pass drop=True to the method:

df.reset_index(drop=True)
>>
   col1  col2  col3
0     6     5    10
1     5     8     6
2     3    10     4

3.23 Finding the Cross-tabulation

        To return the frequency of each combination of values in two columns, use the pd.crosstab() method:

df = pd.DataFrame([["A", "X"], 
                   ["B", "Y"], 
                   ["C", "X"],
                   ["A", "X"]], 
                  columns = ["col1", "col2"])

print(pd.crosstab(df.col1, df.col2))
>>
col2  X  Y
col1      
A     2  0
B     0  1
C     1  0
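
Row and column totals are often wanted alongside the counts. As a sketch, margins=True adds them to the same table:

```python
import pandas as pd

df = pd.DataFrame([["A", "X"], ["B", "Y"], ["C", "X"], ["A", "X"]],
                  columns=["col1", "col2"])

# margins=True appends row and column totals labelled "All".
table = pd.crosstab(df["col1"], df["col2"], margins=True)

print(table)
```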

3.24 Pivoting DataFrames

        Pivot tables are a commonly used data analysis tool in Excel. Similar to the crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate data.

        Consider the following dataframe:

df = ...

print(df)
>>
    Name  Subject  Marks
0   John    Maths      6
1   Mark    Maths      5
2  Peter    Maths      3
3   John  Science      5
4   Mark  Science      8
5  Peter  Science     10
6   John  English     10
7   Mark  English      6
8  Peter  English      4

        Using the pd.pivot_table() method, you can convert column entries into column headers:

pd.pivot_table(df, 
               index = ["Name"],
               columns=["Subject"], 
               values='Marks',
               fill_value=0)
>>
Subject  English  Maths  Science
Name                            
John          10      6        5
Mark           6      5        8
Peter          4      3       10
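
The definition of df is elided above. The sketch below rebuilds it from the printed frame (an assumption) and passes aggfunc explicitly, since that argument decides how repeated Name and Subject pairs are combined.

```python
import pandas as pd

# Reconstructed from the printed frame above; the original
# definition is elided, so this is an assumption.
df = pd.DataFrame({
    "Name": ["John", "Mark", "Peter"] * 3,
    "Subject": ["Maths"] * 3 + ["Science"] * 3 + ["English"] * 3,
    "Marks": [6, 5, 3, 5, 8, 10, 10, 6, 4],
})

# aggfunc decides how repeated (Name, Subject) pairs are combined;
# the default is the mean.
wide = pd.pivot_table(df, index="Name", columns="Subject",
                      values="Marks", aggfunc="sum", fill_value=0)

print(wide)
```

In this data every pair occurs once, so the sum and the default mean select the same marks.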

4. Postscript

        The above covers some basic Pandas data operations. In reality, Pandas has a huge API that cannot be described in full here; this article explains only part of it, and the rest is best learned through use.

Origin blog.csdn.net/gongdiwudu/article/details/131914649