Practical exercise: Python data analysis [pandas]

Foreword

This article is based on the sample datasets from "Python for Data Analysis".
Combining the provided sample data, we analyze what each piece of code does and extend the data analysis and visualization. The article works through four examples: the MovieLens dataset, baby names in the United States from 1880 to 2010, the USDA food database, and the 2012 Federal Election Commission database.


1. MovieLens dataset

  GroupLens Research provides a collection of movie rating data collected from MovieLens users from the late 1990s to the early 2000s. The data includes movie ratings, genres, release years, and viewer information (age, zip code, gender, occupation).

  The MovieLens 1M dataset contains 1 million ratings of 4,000 movies by 6,000 users. The data is spread across three tables: ratings, user information, and movie information.

(1) First, load the standard Python data analysis libraries under their usual abbreviations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

(2)

unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("datasets/movielens/users.dat", sep="::",
                      header=None, names=unames, engine="python")

rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("datasets/movielens/ratings.dat", sep="::",
                        header=None, names=rnames, engine="python")

mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("datasets/movielens/movies.dat", sep="::",
                       header=None, names=mnames, engine="python")
  • unames lists the column names for the user table
  • rnames lists the column names for the ratings table
  • mnames lists the column names for the movie table

  Let's introduce the usage of pd.read_table:

  • The first parameter is filepath_or_buffer (file),
    generally the path of the file to read, e.g. a CSV file.

  • The second parameter is sep (separator),
    which specifies the field delimiter. By default read_csv assumes a comma and read_table assumes a tab; here sep="::" overrides it.

  • The third parameter is header,
    the row to use for the column names; by default row 0 of the data is treated as the header.

  • The fourth parameter is names (column name)
    for the case where the original data has no header and you want to set the column name.

  • The fifth parameter is engine, the parser
    pandas uses when reading the data. pandas currently ships two engines, c and python. The default is c because it parses faster, but its feature set is smaller; if you use a feature the c engine lacks (such as the multi-character separator "::" here), pandas automatically falls back to the python engine.

  Of course, there are more parameters than these; consult the pandas documentation when you need them.
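To see these parameters in action without the real .dat files, here is a minimal sketch that parses a made-up two-row sample (hypothetical values, not the actual users.dat contents) in the same "::"-delimited, headerless layout:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample in the same layout as users.dat;
# column names are supplied via `names` because there is no header row.
sample = StringIO("1::F::1::10::48067\n2::M::56::16::70072")

unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table(sample, sep="::", header=None,
                      names=unames, engine="python")

print(users.shape)          # two rows, five named columns
print(list(users.columns))
```

Because "::" is more than one character, the c engine cannot handle it, which is exactly why engine="python" is passed in the original code.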

Analysis function:

  1. User information table (users): This code reads user information data from the file users.dat and creates a DataFrame object named users. The file path is specified using the read_table function, the separator is ::, the column name line is not specified, but the custom column name list unames is used. The engine="python" parameter specifies to use the Python parsing engine. Columns in the data include UserID, Gender, Age, Occupation, and Zip Code.
  2. Ratings table (ratings): This code reads the ratings data from the file ratings.dat and creates a DataFrame object named ratings. Similarly, the read_table function is used to specify the file path, the separator is ::, and the column name line is not specified, but the custom column name list rnames is used. Columns in the data include user id, movie id, rating and timestamp.
  3. Movie information table (movies): This code reads movie information data from the file movies.dat and creates a DataFrame object named movies. Similarly, the read_table function is used to specify the file path, the separator is ::, and the column name line is not specified, but the custom column name list mnames is used. Columns in the data include Movie ID, Title, and Genre.

(3)

users.head(5)
ratings.head(5)
movies.head(5)
ratings

Output result:
insert image description here
analysis function:

  1. Run in separate cells, these expressions display the first 5 rows of users, ratings, and movies, and then the entire ratings DataFrame.

(4)

data = pd.merge(pd.merge(ratings, users), movies)
data
data.iloc[0]

Output result:
insert image description here
analysis function:

  1. Use the pd.merge() function to merge the ratings, users, and movies DataFrames (joining on their shared key columns) and store the result in data. Evaluating data then displays the merged DataFrame, and data.iloc[0] shows its first row.
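The key-inference behavior can be sketched with toy stand-ins for the three tables (made-up rows, not the MovieLens data): pd.merge joins on the overlapping column names automatically, first "user_id", then "movie_id".

```python
import pandas as pd

# Toy stand-ins for ratings/users/movies.
ratings = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 20, 10],
                        "rating": [5, 3, 4]})
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["A", "B"]})

# Inner merge on user_id, then on movie_id: one wide row per rating.
data = pd.merge(pd.merge(ratings, users), movies)
print(data)
```

Each rating row ends up carrying the matching user attributes and movie title, which is what makes the later pivot tables possible.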

(5)

mean_ratings = data.pivot_table("rating", index="title",
                                columns="gender", aggfunc="mean")
mean_ratings.head(5)

Output result:
insert image description here
analysis function:

  1. pivot_table() uses data as its source: "rating" supplies the values, "title" the row index, and "gender" the column index; with "mean" as the aggregation function, it computes each movie's average rating broken down by gender.
  2. By executing mean_ratings.head(5), you can view the results of the first 5 rows.
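The same pivot can be reproduced on a few made-up rows (hypothetical titles and ratings) to see the shape of the result: one row per title, one column per gender, NaN where a combination has no ratings.

```python
import pandas as pd

# Toy ratings: title A rated by one woman and two men, B by one woman.
data = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "gender": ["F", "M", "M", "F"],
    "rating": [4, 2, 4, 5],
})

mean_ratings = data.pivot_table("rating", index="title",
                                columns="gender", aggfunc="mean")
print(mean_ratings)
```

Title A averages 4.0 from women and 3.0 from men; title B has no male ratings, so its "M" cell is NaN.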

(6)

ratings_by_title = data.groupby("title").size()
ratings_by_title.head()
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

Output result:
insert image description here
analysis function:

  1. First use the groupby() function to group the data by movie title ("title"), and use the size() function to calculate the number of ratings corresponding to each movie title.
  2. Next, use the head() function to view the results of the first few rows, where each row represents a movie title, and the corresponding value represents the number of ratings for the movie title.
  3. Then, filter out the active movie titles according to the number of ratings, and use the index attribute to obtain the index of the movie titles with the number of ratings greater than or equal to 250. The resulting active_titles is an indexed list containing the active movie titles.
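The count-then-filter pattern can be sketched on toy data (a threshold of 3 stands in for the 250 used on the real dataset):

```python
import pandas as pd

# Toy data: title A has 4 ratings, title B only 2.
data = pd.DataFrame({"title": ["A"] * 4 + ["B"] * 2,
                     "rating": [5, 4, 3, 4, 2, 5]})

# Count ratings per title, then keep titles at or above the threshold.
ratings_by_title = data.groupby("title").size()
active_titles = ratings_by_title.index[ratings_by_title >= 3]
print(list(active_titles))
```

Only "A" survives the filter; the boolean mask on the Series selects the matching entries of its index, yielding an Index of active titles.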

(7)

mean_ratings = mean_ratings.loc[active_titles]
mean_ratings

Output result:
insert image description here
analysis function:

  1. Use the loc indexer to restrict mean_ratings to the rows for active movie titles; active_titles is the Index of titles with at least 250 ratings.

(8)

mean_ratings = mean_ratings.rename(index={
    "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)":
        "Seven Samurai (Shichinin no samurai) (1954)"})

Functional Analysis:

  1. Use the rename() function to change the title of a movie in the index to a new title. Specifically, amend "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)" in the index to read "Seven Samurai (Shichinin no samurai) (1954)".

(9)

top_female_ratings = mean_ratings.sort_values("F", ascending=False)
top_female_ratings.head()

Output result:
insert image description here
analysis function:

  1. Use the sort_values() function to sort mean_ratings in descending order by the "F" column, the average rating from female viewers.
  2. Through top_female_ratings.head(), you can view the results of the first few rows after sorting.

(10)

mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

Analysis function:

  1. First, the code calculates the difference between male and female average ratings for each movie using mean_ratings["M"] - mean_ratings["F"].
  2. Then, use an assignment statement to store the result in a new column "diff" in mean_ratings.
  3. Finally, mean_ratings has an additional column "diff", which represents the difference between the average ratings for each movie by male and female users.

(11)

sorted_by_diff = mean_ratings.sort_values("diff")
sorted_by_diff.head()

Output result:
insert image description here
analysis function:

  1. This code sorts mean_ratings by the "diff" column using the sort_values function; the first argument is the name of the column to sort by. Since ascending is not specified, the sort is ascending by default, so the movies women rated far higher than men come first.
  2. Finally, a new DataFrame object is generated, named sorted_by_diff. The code then displays the first five rows of sorted_by_diff using the .head() function.

(12)

sorted_by_diff[::-1].head()

Output result:
insert image description here
analysis function:

  1. This code uses the slice syntax [::-1] to flip the order of sorted_by_diff, and then uses the .head() function to display the first five rows of data after flipping.

(13)

rating_std_by_title = data.groupby("title")["rating"].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.head()

Output result:
insert image description here
analysis function:

  1. This code first uses the groupby function to group the data by the "title" column, and then uses the std function to calculate the standard deviation of each group, which is the standard deviation of ratings for each movie. Finally, a new Series object is generated, named rating_std_by_title.
  2. Next, the code uses the .loc[] selector to select the row in rating_std_by_title whose row index is in active_titles. Ultimately, rating_std_by_title only contains rating standard deviation data for movies with 250 or more ratings.
  3. Then, the code uses the .head() function to display the first five rows of data for rating_std_by_title.

(14)

rating_std_by_title.sort_values(ascending=False)[:10]

Output result:
insert image description here
analysis function:

  1. This code sorts rating_std_by_title with the sort_values function; the ascending=False argument sorts in descending order. The slice syntax [:10] then selects the first ten rows, i.e. the ten movies with the most divergent ratings.

(15)

movies["genres"].head()
movies["genres"].head().str.split("|")
movies["genre"] = movies.pop("genres").str.split("|")
movies.head()

Output result:
insert image description here
analysis function:

  1. This code first displays the first five rows of data in the "genres" column of movies using the .head() function.
  2. Next, the code uses .str.split("|") to split each string in the "genres" column into a list on the "|" delimiter. The .head() function then displays the first five rows of the split data.
  3. The code then pops the "genres" column of movies using the pop function and stores the split results in a new column "genre". The pop function pops the specified column from the DataFrame object and returns the data for that column.
  4. Finally, the code uses the .head() function to display the first five rows of the updated movies. Now, movies has an additional column "genre", which represents a list of movie genres.

(16)

movies_exploded = movies.explode("genre")
movies_exploded[:10]

Output result:
insert image description here
analysis function:

  1. This code uses the explode function to split the "genre" column of movies into rows. The explode function will split the list in the specified column into rows, each containing one element of the list.
  2. Finally, a new DataFrame object is generated, named movies_exploded. Then, the code uses the slice syntax [:10] to select the first ten rows of data.
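The whole genres pipeline can be sketched on two made-up movies: split the "|"-joined string into a list, then explode the list so each genre gets its own row.

```python
import pandas as pd

# Toy movies table with "|"-joined genre strings.
movies = pd.DataFrame({"movie_id": [1, 2],
                       "genres": ["Animation|Comedy", "Drama"]})

# pop removes "genres" and returns it; split turns it into lists.
movies["genre"] = movies.pop("genres").str.split("|")

# explode repeats each row once per list element.
movies_exploded = movies.explode("genre")
print(movies_exploded)
```

Movie 1 now occupies two rows (Animation and Comedy) while movie 2 keeps one; the original index values repeat for exploded rows, which is why the real code later merges rather than relying on the index.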

(17)

ratings_with_genre = pd.merge(pd.merge(movies_exploded, ratings), users)
ratings_with_genre.iloc[0]
genre_ratings = (ratings_with_genre.groupby(["genre", "age"])
                 ["rating"].mean()
                 .unstack("age"))
genre_ratings[:10]

Output result:
insert image description here
analysis function:

  1. This code first uses the pd.merge() function to merge the three DataFrame objects movies_exploded, ratings, and users together. Finally, a new DataFrame object is generated, named ratings_with_genre.
  2. Next, the code uses the .iloc[0] selector to get the first row of data for ratings_with_genre.
  3. The code then uses the groupby function to group the ratings_with_genre by the "genre" and "age" columns, and then uses the mean function to calculate the mean of each group, which is the average rating for each genre and age group. Finally, use the unstack function to split the "age" column into multiple columns.
  4. Finally, a new DataFrame object is generated, named genre_ratings. Then, the code uses the slice syntax [:10] to select the first ten rows of data.
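The group-then-unstack step can be illustrated on toy rows (hypothetical genres, ages, and ratings): grouping on two keys yields a two-level index, and unstack pivots one level into columns.

```python
import pandas as pd

# Toy ratings with a genre and an age-group label per row.
df = pd.DataFrame({"genre": ["Drama", "Drama", "Comedy", "Comedy"],
                   "age": [1, 18, 1, 18],
                   "rating": [4.0, 5.0, 3.0, 1.0]})

# Mean rating per (genre, age), then pivot "age" into columns.
genre_ratings = df.groupby(["genre", "age"])["rating"].mean().unstack("age")
print(genre_ratings)
```

The result has one row per genre and one column per age group, the same wide layout the article uses to compare age groups side by side.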

2. Baby names in the United States from 1880 to 2010

  The U.S. Social Security Administration (SSA) provides data on the frequency of baby names from 1880 to the present. There are many things you can do with this data:
  • Visualize the proportion of babies given a particular name over time
  • Determine a name's relative ranking
  • Determine the most popular names for each year, or the most or least popular names

(1)

!head -n 10 datasets/babynames/yob1880.txt

Analysis function:

  1. This is an IPython/Jupyter shell escape: the leading ! runs the system head command, printing the first 10 lines of yob1880.txt. The raw file has no header row; each line holds a name, a sex, and a birth count, separated by commas.

(2)

names1880 = pd.read_csv("datasets/babynames/yob1880.txt",
                        names=["name", "sex", "births"])
names1880

Output result:
insert image description here
analysis function:

  1. pd.read_csv reads the comma-separated file yob1880.txt into a DataFrame named names1880. Because the file has no header row, the names parameter supplies the column labels "name", "sex", and "births".

(3)

names1880.groupby("sex")["births"].sum()

Output result:
insert image description here
analysis function:

  1. First, groupby("sex") splits names1880 by the values of the "sex" column, creating one group for females ("F") and one for males ("M").
  2. Then, ["births"] selects the "births" column within each group.
  3. Finally, .sum() totals the selected column, giving the number of recorded births for each sex in 1880.

(4)

pieces = []
for year in range(1880, 2011):
    path = f"datasets/babynames/yob{year}.txt"
    frame = pd.read_csv(path, names=["name", "sex", "births"])

    # Add a column for the year
    frame["year"] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

names

Output result:
insert image description here
analysis function:

  1. First, an empty list pieces is created to hold the DataFrame for each year.
  2. The loop runs from 1880 through 2010 (range(1880, 2011) stops before 2011), and the year is interpolated into each file path with an f-string.
  3. Each file is read with pd.read_csv using the column names ["name", "sex", "births"], a "year" column holding the current loop year is added, and the frame is appended to pieces.
  4. After the loop, pd.concat joins all the frames into a single DataFrame; ignore_index=True discards the per-file indexes and builds a fresh, contiguous integer index.
  5. The resulting names DataFrame holds all baby-name records from 1880 to 2010, one row per name, sex, and year.
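A miniature version of the same accumulate-and-concatenate pattern, with two made-up years and made-up counts:

```python
import pandas as pd

pieces = []
for year in [1880, 1881]:
    # Stand-in for pd.read_csv(path, ...): one small frame per year.
    frame = pd.DataFrame({"name": ["Mary", "John"], "births": [7, 9]})
    frame["year"] = year
    pieces.append(frame)

# ignore_index=True replaces the repeated 0,1 indexes with 0..3.
names = pd.concat(pieces, ignore_index=True)
print(names)
```

Without ignore_index=True the result would carry the duplicated per-frame indexes 0, 1, 0, 1; with it, the index is a clean integer sequence.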

(5)

total_births = names.pivot_table("births", index="year",
                                 columns="sex", aggfunc=sum)
total_births.tail()
total_births.plot(title="Total births by sex and year")

Output result:
insert image description here
analysis function:

  1. total_births is built with pivot_table: the "births" column supplies the values, "year" the row index, and "sex" the column index, with sum as the aggregation. The result is a DataFrame whose rows are years, whose columns are sexes, and whose cells hold the total births for each year/sex combination.
  2. total_births.tail() displays the last few rows, i.e. the most recent years.
  3. total_births.plot(title="Total births by sex and year") draws the table as a chart: the horizontal axis is the year, the vertical axis is the total number of births, with one line per sex, which makes the long-run trend easy to see.

(6)

def add_prop(group):
    group["prop"] = group["births"] / group["births"].sum()
    return group
names = names.groupby(["year", "sex"], group_keys=False).apply(add_prop)

names

Output result:
insert image description here
analysis function:

  1. The add_prop function takes one group and adds a "prop" column: the group's "births" values divided by the group's total births, i.e. each name's share of births within that group.
  2. names is grouped by the "year" and "sex" columns; the group_keys=False parameter keeps the group keys from being prepended to the index of the result.
  3. apply runs add_prop on every group, so each (year, sex) group gains a "prop" column, and the pieces are reassembled into a new names DataFrame.
  4. The result lets us analyze each name's proportion of births within its year and sex.

(7)

names.groupby(["year", "sex"])["prop"].sum()

Output result:
insert image description here
analysis function:

  1. groupby(["year", "sex"]) groups names by year and sex, producing a result with a two-level index.
  2. ["prop"] selects the "prop" column computed and added in the previous step.
  3. .sum() totals the proportions within each group; since each group's proportions should add up to 1, this serves as a sanity check on the add_prop calculation.

(8)

def get_top1000(group):
    return group.sort_values("births", ascending=False)[:1000]
grouped = names.groupby(["year", "sex"])
top1000 = grouped.apply(get_top1000)
top1000.head()

Output result:
insert image description here
analysis function:

  1. The get_top1000 function takes a group, sorts it in descending order by the "births" column, and returns the first 1,000 rows, i.e. the 1,000 most common names in the group.
  2. names is grouped by the "year" and "sex" columns with groupby, creating the grouped object grouped.
  3. grouped.apply(get_top1000) runs the function on every group and assigns the result to top1000: a DataFrame holding the 1,000 most frequent names for each year/sex combination.

(9)

top1000 = top1000.reset_index(drop=True)

(10)

top1000.head()

Output result:
insert image description here
analysis function:

  1. top1000.reset_index(drop=True) resets the index of top1000; the drop=True parameter discards the old group-based index instead of keeping it as columns, leaving a contiguous integer index.
  2. top1000.head() then displays the first few rows of the reindexed DataFrame, confirming the data and the new index.

(11)

boys = top1000[top1000["sex"] == "M"]
girls = top1000[top1000["sex"] == "F"]

(12)

total_births = top1000.pivot_table("births", index="year",
                                   columns="name",
                                   aggfunc=sum)

(13)

total_births.info()
subset = total_births[["John", "Harry", "Mary", "Marilyn"]]
subset.plot(subplots=True, figsize=(12, 10),
            title="Number of births per year")

Output result:
insert image description here
insert image description here
(14)

plt.figure()

Output result:
insert image description here
analysis function:

  1. Steps (11)-(13) above first split top1000 with boolean filters into boys (sex "M") and girls (sex "F"). total_births then pivots top1000 with "births" as the values, "year" as the row index, "name" as the column index, and sum as the aggregation, so each cell holds the births for one name in one year. total_births.info() summarizes the result (column dtypes and non-null counts), subset keeps only the columns "John", "Harry", "Mary", and "Marilyn", and subset.plot(subplots=True, figsize=(12, 10), title="Number of births per year") draws each name in its own vertically stacked subplot, showing how its birth count changes over time.
  2. plt.figure() in step (14) then opens a fresh figure so the next chart does not draw into those axes.

(15)

table = top1000.pivot_table("prop", index="year",
                            columns="sex", aggfunc=sum)
table.plot(title="Sum of table1000.prop by year and sex",
           yticks=np.linspace(0, 1.2, 13))

Output result:
insert image description here
analysis function:

  1. table = top1000.pivot_table("prop", index="year", columns="sex", aggfunc=sum) sums the "prop" column for each year/sex combination: rows are years, columns are sexes, and each cell is the total proportion of births covered by the top 1,000 names.
  2. table.plot(title="Sum of table1000.prop by year and sex", yticks=np.linspace(0, 1.2, 13)) plots the table; np.linspace(0, 1.2, 13) generates 13 evenly spaced tick values from 0 to 1.2 for the vertical axis. A declining curve means the top 1,000 names cover a shrinking share of all births, i.e. naming diversity is increasing.

(16)

df = boys[boys["year"] == 2010]
df

Output result:
insert image description here
analysis function:

  1. df = boys[boys["year"] == 2010] filters the boys DataFrame down to the records where the "year" column equals 2010, creating a new DataFrame df of the top male names in that year.

(17)

prop_cumsum = df["prop"].sort_values(ascending=False).cumsum()
prop_cumsum[:10]
prop_cumsum.searchsorted(0.5)

Analysis function:

  1. prop_cumsum = df["prop"].sort_values(ascending=False).cumsum() sorts the "prop" column of df in descending order and takes its cumulative sum, producing a Series whose values climb toward the total proportion covered.
  2. prop_cumsum[:10] displays the first 10 cumulative values.
  3. prop_cumsum.searchsorted(0.5) finds the first position where the cumulative sum reaches or exceeds 0.5, i.e. how many of the most popular names are needed to cover half of all male births in 2010.
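The sort/cumsum/searchsorted idiom is easy to verify on a tiny made-up proportion Series:

```python
import pandas as pd

# Five hypothetical name proportions (they sum to 1).
prop = pd.Series([0.05, 0.40, 0.15, 0.25, 0.15])

# Descending sort: 0.40, 0.25, 0.15, 0.15, 0.05
# Cumulative sum:  0.40, 0.65, 0.80, 0.95, 1.00
cumsum = prop.sort_values(ascending=False).cumsum()

# Position of the first cumulative value >= 0.5, plus 1 for a count.
n_needed = int(cumsum.searchsorted(0.5)) + 1
print(n_needed)
```

Here the two largest proportions (0.40 and 0.25) are enough to pass 50%, so the answer is 2. Note searchsorted returns a zero-based position, which is why the article adds 1 to turn it into a count of names.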

(18)

df = boys[boys.year == 1900]
in1900 = df.sort_values("prop", ascending=False).prop.cumsum()
in1900.searchsorted(0.5) + 1

Analysis function:

  1. df = boys[boys.year == 1900] filters the boys DataFrame down to the records where the "year" column equals 1900.
  2. in1900 = df.sort_values("prop", ascending=False).prop.cumsum() sorts df in descending order by the "prop" column and computes that column's cumulative sum, creating the Series in1900.
  3. in1900.searchsorted(0.5) + 1 finds the first position where the cumulative sum reaches or exceeds 0.5 and adds 1, giving the number of male names needed in 1900 for the cumulative proportion to reach one half.

(19)

def get_quantile_count(group, q=0.5):
    group = group.sort_values("prop", ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1

diversity = top1000.groupby(["year", "sex"]).apply(get_quantile_count)
diversity = diversity.unstack()

(20)

fig = plt.figure()

Output result:
insert image description here
analysis function:

  1. get_quantile_count generalizes the previous computation: it sorts a group by "prop" in descending order, takes the cumulative sum, and uses searchsorted plus 1 to return how many names are needed to reach the quantile q (0.5 by default).
  2. top1000.groupby(["year", "sex"]).apply(get_quantile_count) runs the function on each year/sex group, and unstack() pivots the "sex" level of the resulting index into columns, leaving a DataFrame diversity whose rows are years and whose columns are sexes.
  3. fig = plt.figure() in step (20) opens a fresh figure for the plot that follows.

(21)

diversity.head()
diversity.plot(title="Number of popular names in top 50%")

Output result:
insert image description here
analysis function:

  1. diversity.head() displays the first few rows of the diversity DataFrame: for each year and sex, the number of names needed to cover the top 50% of births.
  2. diversity.plot(title="Number of popular names in top 50%") draws one line per sex, with the year on the horizontal axis and the name count on the vertical axis. A rising curve indicates growing naming diversity over time.

(22)

def get_last_letter(x):
    return x[-1]

last_letters = names["name"].map(get_last_letter)
last_letters.name = "last_letter"

table = names.pivot_table("births", index=last_letters,
                          columns=["sex", "year"], aggfunc=sum)

Analysis function:

  1. get_last_letter returns the last character of a string, and names["name"].map(get_last_letter) applies it to every name, yielding a Series last_letters of final letters; last_letters.name = "last_letter" labels the Series so the resulting index is identifiable.
  2. table = names.pivot_table("births", index=last_letters, columns=["sex", "year"], aggfunc=sum) aggregates births with the last letter as the row index and (sex, year) pairs as a two-level column index, so each cell is the total births for one combination of last letter, sex, and year.

(23)

subtable = table.reindex(columns=[1910, 1960, 2010], level="year")
subtable.head()

Output result:
insert image description here
analysis function:

  1. First, subtable.sum() sums each column of the subtable DataFrame, i.e. it computes the total number of births for each sex/year combination across all last letters. The result is a Series with one total per column.
  2. Then, letter_prop = subtable / subtable.sum() divides each column of subtable by its column total. Each cell thus becomes the proportion of births whose names end in that letter, for that sex and year. The result is a new DataFrame letter_prop whose columns each sum to 1.

(24)

subtable.sum()
letter_prop = subtable / subtable.sum()
letter_prop

Output result:
insert image description here
analysis function:

  1. First, fig, axes = plt.subplots(2, 1, figsize=(10, 8)) creates a figure containing two subplots. The 2, 1 arguments give a layout of two rows and one column, i.e. the subplots are stacked vertically. The figsize=(10, 8) parameter makes the figure 10 inches wide and 8 inches tall.
  2. Then, letter_prop["M"].plot(kind="bar", rot=0, ax=axes[0], title="Male") draws the male letter proportions in the first subplot. kind="bar" selects a bar chart, rot=0 keeps the x-axis tick labels horizontal, ax=axes[0] targets the first subplot, and title="Male" sets its title.
  3. Next, letter_prop["F"].plot(kind="bar", rot=0, ax=axes[1], title="Female", legend=False) draws the female proportions in the second subplot with the same settings, except the title is "Female" and legend=False suppresses the legend.

(25)

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop["M"].plot(kind="bar", rot=0, ax=axes[0], title="Male")
letter_prop["F"].plot(kind="bar", rot=0, ax=axes[1], title="Female",
                      legend=False)

Output result:
insert image description here
insert image description here
(26)

plt.subplots_adjust(hspace=0.25)

Output result:
insert image description here
(27)

letter_prop = table / table.sum()

dny_ts = letter_prop.loc[["d", "n", "y"], "M"].T
dny_ts.head()

Output result:
insert image description here
(28)

fig = plt.figure()

Output result:

(29)

dny_ts.plot()

Output result:
insert image description here
analysis function:

  1. First, all_names = pd.Series(top1000["name"].unique()) creates a Series all_names containing all unique names: it extracts the unique values from the "name" column of the top1000 DataFrame and wraps them in a Series.
  2. Next, lesley_like = all_names[all_names.str.contains("Lesl")] uses the str.contains method to filter all_names down to the names that contain "Lesl". This creates a new Series lesley_like of names related to "Lesl".

(30)

all_names = pd.Series(top1000["name"].unique())
lesley_like = all_names[all_names.str.contains("Lesl")]
lesley_like

Output result:
insert image description here
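
The filtering step can be sketched on toy data (names made up):

```python
import pandas as pd

# Minimal sketch of substring filtering with str.contains
# (hypothetical name list).
all_names = pd.Series(["Lesley", "Leslie", "Leslee", "Mary", "John"])
lesley_like = all_names[all_names.str.contains("Lesl")]
```

Boolean indexing keeps only the rows where str.contains returned True.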
analysis function:

  1. First, filtered = top1000[top1000[“name”].isin(lesley_like)] uses the isin method to filter out records whose “name” column in the top1000 DataFrame is contained in the lesley_like Series. This will create a new DataFrame filtered containing records related to lesley_like.
  2. Then, filtered.groupby("name")["births"].sum() performs a grouping operation on the filtered DataFrame, groups the data by the "name" column, and calculates the total number of births for each name. This returns a new Series where each element is the total number of births for each name.

(31)

filtered = top1000[top1000["name"].isin(lesley_like)]
filtered.groupby("name")["births"].sum()

Output result:
insert image description here
analysis function:

  1. First, table = filtered.pivot_table("births", index="year", columns="sex", aggfunc="sum") creates a pivot table using the pivot_table function. The function takes the "births" column as the value, the "year" column as the row index, and the "sex" column as the column index. The aggregate function "sum" is applied to the "births" column to calculate the total number of births for each year and gender combination. Thus, table is a new DataFrame with rows representing years, columns representing gender, and cell values ​​being the total number of births for each year and gender combination.
  2. Then, table = table.div(table.sum(axis="columns"), axis="index") normalizes each row: table.sum(axis="columns") computes the total number of births for each year, and div(..., axis="index") divides each cell by its row's total, so the F and M shares in each row sum to 1.
  3. Finally, table.tail() is used to display the last few rows of the table DataFrame. This will output the last few rows of the DataFrame for viewing the normalized data.

(32)

table = filtered.pivot_table("births", index="year",
                             columns="sex", aggfunc="sum")
table = table.div(table.sum(axis="columns"), axis="index")
table.tail()

Output result:
insert image description here
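
The div normalization can be checked on a tiny table with invented numbers:

```python
import pandas as pd

# Sketch of the same row-wise normalization on toy numbers:
# divide each row by its total so the F and M shares sum to 1 per year.
t = pd.DataFrame({"F": [30.0, 10.0], "M": [70.0, 90.0]},
                 index=pd.Index([2009, 2010], name="year"))
norm = t.div(t.sum(axis="columns"), axis="index")
```

After the division every row of `norm` sums to 1, which is what makes the male/female trend lines comparable across years.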
(33)

fig = plt.figure()

Output result:
insert image description here
(34)

table.plot(style={"M": "k-", "F": "k--"})

Output result:
insert image description here

3. USDA Food Database

  The USDA provides a database of food nutrition information. Each food record has some identifying attributes along with two lists covering its nutrients and portion sizes. Data in this form is not well suited for analysis, so some work is needed to reshape it into a better form.

Analysis function:

  1. First, json.load(open(“datasets/usda_food/database.json”)) opens the file named “database.json” and uses the json.load function to load the contents of the file as a data structure in Python. This will return a data object containing the contents of the file, usually in the form of a dictionary, list, or some other combination.
  2. Then, len(db) is used to calculate the number of elements in the loaded data object db. This will return the number of elements in db, representing the number of items in the data object.
  3. To summarize, what this code does is load a JSON file and convert it to a data structure in Python. Then, get the number of items in the data object by counting the number of elements in the loaded data object.

(1)

import json
db = json.load(open("datasets/usda_food/database.json"))
len(db)

Output result:
insert image description here
analysis function:

  1. First, db[0].keys() gets the key of the first element (dictionary) in db. This will return a list of all keys in that element, for looking at the keys in the dictionary.
  2. Next, db[0][“nutrients”][0] gets the first element of the value whose key is “nutrients” in the first element dictionary in db. This will return the first element in the list corresponding to the "nutrients" key, for viewing the contents of that element.
  3. Then, nutrients = pd.DataFrame(db[0]["nutrients"]) creates a DataFrame nutrients that uses the value with key "nutrients" in the first element dictionary in db as the data source. This way, the nutrients DataFrame will contain all the elements in the "nutrients" list, and each element will be a row in the DataFrame.
  4. Finally, nutrients.head(7) is used to display the first 7 rows of the nutrients DataFrame. This will output the first 7 rows of the DataFrame for viewing the contents of the data.

(2)

db[0].keys()
db[0]["nutrients"][0]
nutrients = pd.DataFrame(db[0]["nutrients"])
nutrients.head(7)

Output result:
insert image description here
analysis function:

  1. First, info_keys = ["description", "group", "id", "manufacturer"] defines a list info_keys containing specific keys. These keys are used to extract the corresponding value from each element (dictionary) in db.
  2. Then, info = pd.DataFrame(db, columns=info_keys) uses the pd.DataFrame function to create a DataFrame info that uses the db data object as the data source and the keys in the info_keys list as the column names. This way, the info DataFrame will contain the value from the specified key in the dictionary for each element in db, and each element will become a row in the DataFrame.
  3. Next, info.head() is used to display the first few rows of the info DataFrame. This will output the first few rows of the DataFrame, useful for viewing the contents of the data.
  4. Finally, info.info() is used to display relevant information about the info DataFrame, including the data type of each column and the number of non-null values, etc.

(3)

info_keys = ["description", "group", "id", "manufacturer"]
info = pd.DataFrame(db, columns=info_keys)
info.head()
info.info()

Output result:
insert image description here
analysis function:

  1. info[“group”] selects the “group” column in the info DataFrame, and then the pd.value_counts() function counts the values ​​of this column. [:10] means to get the top 10 items of the statistical results.
  2. Finally, pd.value_counts(info[“group”])[:10] will return a Series containing the top 10 most frequently occurring values ​​in the “group” column and their corresponding frequencies.

(4)

pd.value_counts(info["group"])[:10]

Output result:
insert image description here
analysis function:

  1. First, an empty list nutrients is created to store the "nutrients" data in each element.
  2. Then, use a for loop to iterate through each element in db. For each element, convert the value of the "nutrients" key in its dictionary to DataFrame fnuts. Also, add a column called "id" to the fnuts DataFrame and set it to the current element's "id" value.
  3. Next, add the fnuts DataFrame to the nutrients list.
  4. Finally, all the DataFrames in the nutrients list are concatenated using the pd.concat() function to create a DataFrame containing all the "nutrients" data. The ignore_index=True parameter means to ignore the original index and regenerate a new continuous index.

(5)

nutrients = []

for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)
nutrients

Output result:
insert image description here
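
The flattening loop can be illustrated with two made-up records shaped like the USDA entries:

```python
import pandas as pd

# Sketch of flattening nested JSON-like records into one DataFrame
# (toy records with the same shape as the USDA entries).
recs = [
    {"id": 1, "nutrients": [{"description": "Protein", "value": 25.0}]},
    {"id": 2, "nutrients": [{"description": "Zinc, Zn", "value": 2.4},
                            {"description": "Protein", "value": 3.0}]},
]

frames = []
for rec in recs:
    f = pd.DataFrame(rec["nutrients"])
    f["id"] = rec["id"]       # tag every nutrient row with its food id
    frames.append(f)

flat = pd.concat(frames, ignore_index=True)
```

Building a list of frames and concatenating once at the end is much faster than appending to a DataFrame inside the loop.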
analysis function:

  1. First, nutrients.duplicated().sum() uses the duplicated function to detect duplicate rows in the nutrients DataFrame, and counts the number of duplicate rows through the sum function. This will return the number of duplicate rows.
  2. Then, nutrients = nutrients.drop_duplicates() uses the drop_duplicates function to remove duplicate rows from the nutrients DataFrame. This will update the nutrients DataFrame to remove duplicate rows.
  3. Next, col_mapping = {"description" : "food", "group" : "fgroup"} defines a dictionary col_mapping to specify the column name mapping that needs to be renamed.
  4. Then, info = info.rename(columns=col_mapping, copy=False) uses the rename function to rename the column names in the info DataFrame according to the mapping relationship in the col_mapping dictionary. This will update the info DataFrame, renaming the "description" column to "food" and the "group" column to "fgroup".
  5. Next, info.info() displays information about the updated info DataFrame, including the data type of each column and the number of non-null values.
  6. Finally, col_mapping = {"description" : "nutrient", "group" : "nutgroup"} defines a dictionary col_mapping to specify the column name mapping that needs to be renamed.

(6)

nutrients.duplicated().sum()  # number of duplicates
nutrients = nutrients.drop_duplicates()

(7)

col_mapping = {"description" : "food",
               "group"       : "fgroup"}
info = info.rename(columns=col_mapping, copy=False)
info.info()
col_mapping = {"description" : "nutrient",
               "group" : "nutgroup"}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients

Output result:
insert image description here
insert image description here
analysis function:

  1. First, ndata = pd.merge(nutrients, info, on="id") uses the merge function to merge the nutrients DataFrame and info DataFrame based on the "id" column. This will create a new DataFrame ndata that contains the merged results of rows with the same "id" value from both DataFrames.
  2. Next, ndata.info() displays relevant information about the ndata DataFrame, including the data type of each column and the number of non-null values, etc.
  3. Finally, ndata.iloc[30000] selects the row with index 30000 in ndata DataFrame by index and returns the data of this row. This can be used to view the data content of a particular row.

(8)

ndata = pd.merge(nutrients, info, on="id")
ndata.info()
ndata.iloc[30000]

Output result:
insert image description here
(9)

fig = plt.figure()

Output result:
insert image description here
analysis function:

  1. First, result = ndata.groupby([“nutrient”, “fgroup”])[“value”].quantile(0.5) performs a grouping operation on the ndata DataFrame. Group by the "nutrient" and "fgroup" columns and calculate the quantile for the "value" column, specified as 0.5. This will return a Series result containing the quantile values ​​for each combination.
  2. Then, result[“Zinc, Zn”] selects from the result Series the item whose key is “Zinc, Zn”, which is the name of the specified nutrient. This will return a Series containing the corresponding nutrient quantiles.
  3. Next, result[“Zinc, Zn”].sort_values() sorts the selected Series by quantile values ​​in ascending order.
  4. Finally, .plot(kind="barh") plots the sorted results in a horizontal bar graph. This will generate a bar graph with the name of the nutrient on the y-axis and the corresponding quantile value on the x-axis.

(10)

result = ndata.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)
result["Zinc, Zn"].sort_values().plot(kind="barh")

Output result:
insert image description here
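
A miniature version of this grouped median, on invented values:

```python
import pandas as pd

# Toy version of the median-by-group computation (hypothetical values).
df = pd.DataFrame({
    "nutrient": ["Zinc, Zn"] * 4,
    "fgroup": ["Fruits", "Fruits", "Meats", "Meats"],
    "value": [0.1, 0.3, 4.0, 6.0],
})

# quantile(0.5) is the median of "value" within each group.
med = df.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)
```

The result is indexed by (nutrient, fgroup), so `med["Zinc, Zn"]` selects the per-food-group medians for that nutrient, ready for a barh plot.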
analysis function:

  1. First, by_nutrient = ndata.groupby(["nutgroup", "nutrient"]) groups the ndata DataFrame by the "nutgroup" and "nutrient" columns, creating a grouped object by_nutrient.
  2. Then, def get_maximum(x): return x.loc[x.value.idxmax()] defines a function get_maximum that takes a DataFrame x, finds the index of the row with the largest "value" via x.value.idxmax(), and selects that row with loc.
  3. Next, max_foods = by_nutrient.apply(get_maximum)[["value", "food"]] applies get_maximum to each group and keeps only the "value" and "food" columns. This creates a DataFrame max_foods holding, for each nutrient, the row with the maximum value.
  4. Finally, max_foods["food"] = max_foods["food"].str[:50] truncates the strings in the "food" column to at most 50 characters, keeping the labels compact.

(11)

by_nutrient = ndata.groupby(["nutgroup", "nutrient"])

def get_maximum(x):
    return x.loc[x.value.idxmax()]

max_foods = by_nutrient.apply(get_maximum)[["value", "food"]]

# make the food a little smaller
max_foods["food"] = max_foods["food"].str[:50]
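
A toy sketch of "row with the maximum value per group". Note it takes a slightly different route than the code above, using groupby(...)["value"].idxmax() instead of apply, which selects the same rows on data like this (foods and values are invented):

```python
import pandas as pd

# Hypothetical nutrient/food rows.
df = pd.DataFrame({
    "nutrient": ["Protein", "Protein", "Zinc, Zn"],
    "food": ["Gelatins, dry powder", "Soy protein isolate", "Mollusks, oyster"],
    "value": [85.6, 80.7, 90.9],
})

# idxmax per group returns the index of each group's max-value row;
# loc then pulls those rows out of the original frame.
top = df.loc[df.groupby("nutrient")["value"].idxmax()]
```

This idxmax route avoids the per-group Python function call that apply incurs.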

Analysis function:

  1. max_foods.loc["Amino Acids"] selects the rows whose outer index label is "Amino Acids", i.e. the "value" and "food" columns for the maximum-value row of each nutrient in that group.
  2. Then, ["food"] selects the "food" column from those rows, giving the food with the largest value for each nutrient in the group.

(12)

max_foods.loc["Amino Acids"]["food"]

Output result:
insert image description here

4. 2012 Federal Election Commission Database

  The U.S. Federal Election Commission publishes data on contributions to political campaigns. The data includes each donor's name, occupation and employer, address, and contribution amount. Here are some analyses you can try:

  - Donation statistics by occupation and employer
  - Donation statistics by amount
  - Donation statistics by state

(1)

fec = pd.read_csv("datasets/fec/P00000001-ALL.csv", low_memory=False)
fec.info()

Output result:
insert image description here

Analysis function:

  1. This code reads a CSV file named "datasets/fec/P00000001-ALL.csv" using the read_csv function from the pandas library and stores it in a variable named fec. The low_memory=False parameter means not to use low memory mode during reading.
  2. Next, use the info() function to display information about the fec data frame, including the index data type and the number of non-null values.

(2)

fec.iloc[123456]

Output result:
insert image description here
analysis function:

  1. This line of code uses the iloc attribute to select the row at index 123456 in the fec dataframe. iloc is a position-based index that allows you to select rows or columns in a data frame by integer position.

(3)

unique_cands = fec["cand_nm"].unique()
unique_cands
unique_cands[2]

Output result:
insert image description here
analysis function:

  1. This code first uses the unique function to get the unique values ​​of the "cand_nm" column in the fec dataframe and stores them in a variable named unique_cands.
  2. Next, it prints out the value of the unique_cands variable, showing all unique candidate names.
  3. Finally, it prints out the element at index 2 in unique_cands, the third unique candidate name.

(4)

parties = {"Bachmann, Michelle": "Republican",
           "Cain, Herman": "Republican",
           "Gingrich, Newt": "Republican",
           "Huntsman, Jon": "Republican",
           "Johnson, Gary Earl": "Republican",
           "McCotter, Thaddeus G": "Republican",
           "Obama, Barack": "Democrat",
           "Paul, Ron": "Republican",
           "Pawlenty, Timothy": "Republican",
           "Perry, Rick": "Republican",
           "Roemer, Charles E. 'Buddy' III": "Republican",
           "Romney, Mitt": "Republican",
           "Santorum, Rick": "Republican"}

Analysis function:

  1. This code defines a dictionary called parties, which contains the names of some candidates and the political party they belong to. The keys in the dictionary are the names of the candidates, and the values ​​are the political parties they belong to.

(5)

fec["cand_nm"][123456:123461]
fec["cand_nm"][123456:123461].map(parties)
# Add it as a column
fec["party"] = fec["cand_nm"].map(parties)
fec["party"].value_counts()

Output result:
insert image description here
analysis function:

  1. This code first uses slicing to select rows 123456 to 123460 (excluding row 123461) of the "cand_nm" column in the fec dataframe and prints out the values ​​for those rows.
  2. Next, it uses the map function to map the candidate names in these rows to the political parties they belong to, and prints out the mapped results.
  3. It then uses the map function to map all candidate names in the "cand_nm" column of the fec dataframe to the political party they belong to and stores the result in a new "party" column.
  4. Finally, it calculates the frequency distribution of the values ​​of the "party" column in the fec data frame using the value_counts function and prints out the result.

(6)

(fec["contb_receipt_amt"] > 0).value_counts()

Output result:
insert image description here
analysis function:

  1. This line of code first selects values ​​greater than 0 in the "contb_receipt_amt" column of the fec data frame using boolean indexing, and then uses the value_counts function to calculate the frequency distribution of these values.

(7)

fec = fec[fec["contb_receipt_amt"] > 0]

Analysis function:

  1. This line of code updates the fec dataframe by selecting rows in the fec dataframe with boolean indexing in the "contb_receipt_amt" column greater than 0 and storing those rows in the fec variable.

(8)

fec_mrbo = fec[fec["cand_nm"].isin(["Obama, Barack", "Romney, Mitt"])]

Analysis function:

  1. This line of code first uses the isin function to select the rows in the "cand_nm" column of the fec data frame that have the value "Obama, Barack" or "Romney, Mitt" and then stores those rows in a new variable called fec_mrbo.

(9)

fec["contbr_occupation"].value_counts()[:10]

Output result:
insert image description here
analysis function:

  1. This line of code first calculates the frequency distribution of the values ​​of the "contbr_occupation" column in the fec data frame using the value_counts function, then uses slicing to select the top 10 values ​​and prints them.

(10)

occ_mapping = {
   "INFORMATION REQUESTED PER BEST EFFORTS" : "NOT PROVIDED",
   "INFORMATION REQUESTED" : "NOT PROVIDED",
   "INFORMATION REQUESTED (BEST EFFORTS)" : "NOT PROVIDED",
   "C.E.O.": "CEO"
}

def get_occ(x):
    # If no mapping provided, return x
    return occ_mapping.get(x, x)

fec["contbr_occupation"] = fec["contbr_occupation"].map(get_occ)

Analysis function:

  1. This code first defines a dictionary called occ_mapping to map some occupation names to new values.
  2. Next, it defines a function called get_occ that takes an argument x and uses the mapping in the occ_mapping dictionary to return the new value. If x does not have a corresponding key in the dictionary, return x itself.
  3. It then uses the map function to map the values ​​in the "contbr_occupation" column in the fec dataframe to new values ​​and stores the result back in the "contbr_occupation" column.
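
The dict-with-fallback cleanup pattern can be sketched with toy values:

```python
import pandas as pd

# Hypothetical raw occupation strings.
occ_map = {"C.E.O.": "CEO", "INFORMATION REQUESTED": "NOT PROVIDED"}

raw = pd.Series(["C.E.O.", "TEACHER", "INFORMATION REQUESTED"])
# dict.get(x, x) returns the mapped value, or x unchanged if unmapped.
cleaned = raw.map(lambda x: occ_map.get(x, x))
```

Using dict.get with the original value as the default is what lets unmapped entries pass through untouched; a plain `raw.map(occ_map)` would instead turn them into NaN.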

(11)

emp_mapping = {
   "INFORMATION REQUESTED PER BEST EFFORTS" : "NOT PROVIDED",
   "INFORMATION REQUESTED" : "NOT PROVIDED",
   "SELF" : "SELF-EMPLOYED",
   "SELF EMPLOYED" : "SELF-EMPLOYED",
}

def get_emp(x):
    # If no mapping provided, return x
    return emp_mapping.get(x, x)

fec["contbr_employer"] = fec["contbr_employer"].map(get_emp)

Analysis function:

  1. This code first defines a dictionary called emp_mapping that maps some employer names to new values.
  2. Next, it defines a function called get_emp that takes an argument x and uses the mapping in the emp_mapping dictionary to return the new value. If x does not have a corresponding key in the dictionary, return x itself.
  3. It then uses the map function to map the values ​​in the "contbr_employer" column in the fec dataframe to new values ​​and stores the result back in the "contbr_employer" column.

(12)

by_occupation = fec.pivot_table("contb_receipt_amt",
                                index="contbr_occupation",
                                columns="party", aggfunc="sum")
over_2mm = by_occupation[by_occupation.sum(axis="columns") > 2000000]
over_2mm

Output result:
insert image description here
analysis function:

  1. This code first creates a pivot table using the pivot_table function, where the row labels are the values ​​of the "contbr_occupation" column in the fec dataframe, the column labels are the values ​​of the "party" column, and the values ​​are the sum of the "contb_receipt_amt" column. Each cell in the pivot table represents the total donations for a specific occupation and a specific political party.
  2. Next, it uses a Boolean index to select rows in the pivot table whose row sum is greater than 2000000 and stores these rows in a variable named over_2mm.
  3. Finally, it prints the value of the over_2mm variable, showing the occupations whose total donations exceed 2,000,000.

(13)

plt.figure()

Output result:
insert image description here
analysis function:

  1. This line of code creates a new figure window using the figure function from the matplotlib library

(14)

over_2mm.plot(kind="barh")

Output result:
insert image description here
analysis function:

  1. This line of code uses the plot function to draw a horizontal bar chart (kind="barh") in the current figure window, with total donation amounts on the x-axis and occupations on the y-axis.

(15)

def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)["contb_receipt_amt"].sum()
    return totals.nlargest(n)

Analysis function:

  1. The function first groups the group data frame by the key column using the groupby function, and then calculates the sum of the "contb_receipt_amt" column in each group.
  2. Next, it selects the top n largest values ​​using the nlargest function and returns them.
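
get_top_amounts' logic, run on a made-up group:

```python
import pandas as pd

# Hypothetical donation rows for one candidate's group.
group = pd.DataFrame({
    "contbr_occupation": ["RETIRED", "RETIRED", "ATTORNEY", "TEACHER"],
    "contb_receipt_amt": [100.0, 200.0, 250.0, 50.0],
})

# Sum donations per occupation, then keep the n largest totals.
totals = group.groupby("contbr_occupation")["contb_receipt_amt"].sum()
top2 = totals.nlargest(2)
```

nlargest(2) returns the two biggest totals sorted in descending order, which is exactly what the apply in step (16) collects per candidate.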

(16)

grouped = fec_mrbo.groupby("cand_nm")
grouped.apply(get_top_amounts, "contbr_occupation", n=7)
grouped.apply(get_top_amounts, "contbr_employer", n=10)

Output result:
insert image description here
analysis function:

  1. This code first groups the fec_mrbo dataframe by the "cand_nm" column using the groupby function and stores the result in a variable named grouped.
  2. Next, it applies the get_top_amounts function to each group using the apply function, calculates the top 7 maximum values ​​of the "contbr_occupation" column in each group, and prints out the results.
  3. It then applies the get_top_amounts function to each group using the apply function again, calculates the top 10 maximum values ​​of the "contbr_employer" column in each group, and prints out the results.

(17)

bins = np.array([0, 1, 10, 100, 1000, 10000,
                 100_000, 1_000_000, 10_000_000])
labels = pd.cut(fec_mrbo["contb_receipt_amt"], bins)
labels

Output result:
insert image description here
analysis function:

  1. This code first defines an array named bins, representing the boundary values ​​for binning.
  2. Next, it bins the values ​​of the "contb_receipt_amt" column in the fec_mrbo data frame using the cut function and stores the result in a variable called labels.
  3. Finally, it prints out the value of the labels variable, showing which bin each donation amount belongs to.
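
A minimal pd.cut sketch reusing the same bin edges on invented amounts:

```python
import numpy as np
import pandas as pd

# Same bin edges as the article; amounts are hypothetical.
bins = np.array([0, 1, 10, 100, 1000, 10000,
                 100_000, 1_000_000, 10_000_000])
amts = pd.Series([5.0, 2500.0, 40.0])

# Each amount is labeled with the half-open interval (left, right]
# it falls into.
cats = pd.cut(amts, bins)
```

Because the bucket sizes span several orders of magnitude, logarithmically spaced edges like these give far more informative buckets than equal-width bins would.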

(18)

grouped = fec_mrbo.groupby(["cand_nm", labels])
grouped.size().unstack(level=0)

Output result:
insert image description here
analysis function:

  1. This code first groups the fec_mrbo dataframe by the "cand_nm" column and the labels variable using the groupby function and stores the result in a variable named grouped.
  2. Next, it calculates the size of each group using the size function, then converts the result to a data frame using the unstack function, and prints it.

(19)

plt.figure()

Output result:
insert image description here
analysis function:

  1. This line of code creates a new figure window using the figure function from the matplotlib library

(20)

bucket_sums = grouped["contb_receipt_amt"].sum().unstack(level=0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis="columns"),
                              axis="index")
normed_sums
normed_sums[:-2].plot(kind="barh")

Output result:
insert image description here
analysis function:

  1. This code first uses the sum function to calculate the sum of the "contb_receipt_amt" column in each group, then uses the unstack function to convert the result into a data frame and stores the result in a variable called bucket_sums.
  2. Next, it divides each cell's value by the row sum using the div function, and stores the result in a variable called normed_sums.
  3. It then prints out the value of the normed_sums variable, showing each candidate's contribution percentage for different contribution amount ranges.
  4. Finally, it uses slicing (normed_sums[:-2]) to drop the last two bins, then uses the plot function to draw a horizontal bar chart in the current figure window, showing each candidate's share of contributions within each donation-size bucket.

(21)

grouped = fec_mrbo.groupby(["cand_nm", "contbr_st"])
totals = grouped["contb_receipt_amt"].sum().unstack(level=0).fillna(0)
totals = totals[totals.sum(axis="columns") > 100000]
totals.head(10)

Output result:
insert image description here
analysis function:

  1. This code first groups the fec_mrbo dataframe by the columns "cand_nm" and "contbr_st" using the groupby function and stores the result in a variable named grouped.
  2. Next, it uses the sum function to calculate the sum of the "contb_receipt_amt" column in each group, and then uses the unstack function to convert the result into a data frame. Since some groups may not have data, it fills missing values ​​with 0 using the fillna function.
  3. It then selects the rows with a row sum greater than 100000 using Boolean indexing and stores these rows in the totals variable.
  4. Finally, it uses the head function to select the first 10 rows and print them.

(22)

percent = totals.div(totals.sum(axis="columns"), axis="index")
percent.head(10)

Output result:
insert image description here
analysis function:

  1. This code first divides the value of each cell in the totals variable by the row sum using the div function, and stores the result in a variable named percent.
  2. Next, it uses the head function to select the first 10 rows and print them.

Summarize

  As the saying goes: without accumulating single steps, one cannot travel a thousand miles; without accumulating small streams, there can be no rivers and seas. After working through this article, readers should have a solid grasp of how to use Python and pandas for data analysis, and be able to put these skills to use in future study and work!

Origin blog.csdn.net/qq_56086076/article/details/131678268