Data Analysis of the COVID-19 Epidemic in the United States Based on Spark


Foreword

The 2020 COVID-19 epidemic in the United States was a major global public health event, with a profound impact on global politics, the economy, society, and other fields. Throughout the epidemic, scientists played an important role, actively investigating the characteristics of the virus, its transmission mechanisms, and prevention and control strategies, and continuously publishing research results.
This paper uses Spark for data processing and analysis to understand how the 2020 US COVID-19 epidemic spread across the country and to explore relationships in the state-level epidemic data. Data processing and visualization are implemented with Spark and Python.
Through the collection, cleaning, integration, and analysis of the data, the goal is to gain a more comprehensive picture of the spread of the epidemic in the United States, provide data support and guidance for epidemic prevention and control, and offer a practical case study of applying these technologies in the field of data analysis.

1. Requirements analysis

The task is to analyze data on confirmed COVID-19 cases in the United States in 2020, using Python as the programming language, Spark for the analysis, and Python again to describe and visualize the analysis results. There are two main aspects of analysis:
Time trend analysis: analyze the daily/weekly/monthly number of new confirmed cases, trends in the cure rate and death rate, and other time-related indicators.
Geographical distribution analysis: analyze the numbers of confirmed cases, deaths, and recoveries in different states/cities, and explore the relationship between geographical differences and factors such as population density, providing experience and reference for future epidemic prevention and control.

1.1 Data source

The dataset comes from the US COVID-19 epidemic dataset on the data website Kaggle (download the dataset from Xuetong - the end-of-term assignment). It is organized in the data table us-counties.csv and covers confirmed cases up to 2020-05-19. The data contains the following fields:

Field Name | Field Meaning | Example
date | report date | 2020/1/21; 2020/1/22
county | county (subordinate unit of the state) | Snohomish
state | state | Washington
cases | cumulative number of confirmed cases in the county as of this date | 1, 2, 3…
deaths | cumulative number of deaths in the county as of this date | 1, 2, 3…

Part of the data is shown in the figure:

Figure 1 - Sample data from us-counties.csv

1.2 Specific requirements and goals
1) The original dataset is a .csv file. To make it easier for Spark to read it and generate an RDD or DataFrame, first convert the csv to a .txt file; the conversion is implemented in Python.
2) Upload the file to the HDFS file system at the path "/user/hadoop/us-counties.txt".
3) Read us-counties.txt programmatically to generate a DataFrame.
4) Use Spark to analyze the data. The following indicators are computed, and the results are saved as .json files:
(1) The cumulative number of confirmed cases, deaths, and the fatality rate for each US state; these results are also saved to a MySQL database.
(2) The ten states with the largest number of confirmed cases in the United States.
(3) The ten states with the highest number of deaths in the United States.
(4) The ten states with the smallest number of confirmed cases in the United States.
5) Download the Spark result .json files to a local folder and visualize the results.
6) The program source code requires line comments for key code, IPO comments for functions, and attribute and method comments for classes and objects.

2. Overall design

2.1 The environment used in this experiment

(1) Oracle VM VirtualBox virtual machine
(2) Ubuntu system
(3) Hadoop 2.10.0, MySQL
(4) Python: 3.8
(5) Spark: 2.4.7
(6) Anaconda and Jupyter Notebook

2.2 Implementation process

Figure 2 - Implementation flowchart

3. Detailed design

3.1 Use Python to convert the file type

Code:
import pandas as pd

# Read the original csv and re-save it as a tab-separated txt file
data = pd.read_csv("us-counties.csv")
file_dir = './'
data.to_csv(file_dir + 'us-counties.txt', sep='\t', index=False, header=True)
After running, a new us-counties.txt file appears in the same directory.

3.2 Upload files to HDFS

Upload the us-counties.txt file from Windows to the Ubuntu system through a shared folder.
Code:
sudo mount -t vboxsf sharefile /home/hadoop/download
where sharefile is the name of the shared folder on the host, and /home/hadoop/download is the mount point inside the virtual machine.
Switch to the Hadoop directory and start HDFS.
Code:
cd /usr/local/hadoop/
./sbin/start-all.sh
Open a new terminal and upload the us-counties.txt file to HDFS.
Code:
cd /usr/local/hadoop/
./bin/hdfs dfs -put /home/hadoop/documents/us-counties.txt /user/hadoop/
./bin/hdfs dfs -ls /user/hadoop/

3.3 Start MySQL and pyspark

Start MySQL:
mysql -u root -p
Start pyspark:
cd /usr/local/spark
./bin/pyspark

3.4 Read and analyze the data with pyspark

(1) Count the cumulative number of confirmed cases, deaths, and the fatality rate of each state in the United States, and save the results to the MySQL database

code:

Read the txt file:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

rdd = spark.sparkContext.textFile("/user/hadoop/us-counties.txt")

Convert the data to DataFrame format:

# Split each line on tabs, define the schema, then attach it to the rows
df = rdd.map(lambda x: x.split("\t"))
schemaString = "date county state cases deaths"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(" ")]
schema = StructType(fields)
df = df.map(lambda p: Row(p[0], p[1], p[2], p[3], p[4]))
df = spark.createDataFrame(df, schema)  # attach the schema (header) to the data

Convert each column to the appropriate type:

df = df.withColumn("cases", df["cases"].cast("int"))
df = df.withColumn("deaths", df["deaths"].cast("int"))
df = df.withColumn("date", df["date"].cast("date"))

Compute the cumulative number of confirmed cases, deaths, and the case fatality rate for each state:

from pyspark.sql.functions import when

result1 = df.groupBy("state").sum("cases", "deaths")

# Convert values with a denominator of 0 to 0 to avoid division by zero
result1 = result1.select("state", "sum(cases)", "sum(deaths)").withColumn(
    "mortality_rate",
    when(result1["sum(cases)"] == 0, 0).otherwise(result1["sum(deaths)"] / result1["sum(cases)"]))

# Write the data into the database; the database name is spark and the table
# name is result1 (the table does not need to be created in advance)
result1.write.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/spark",
    driver="com.mysql.jdbc.Driver",
    dbtable="result1",
    user="root",
    password="1"
).mode("overwrite").save()

# Also save the data locally on Ubuntu
result1.repartition(1).write.format("csv").save("file:///usr/local/test/result1.csv")
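
To sanity-check what was written, the table can be read back through the same JDBC connection. This is a minimal sketch, assuming the same database name, credentials, and driver as the write above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse the running pyspark session

# Read the result1 table back from MySQL and preview a few rows
check = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/spark",
    driver="com.mysql.jdbc.Driver",
    dbtable="result1",
    user="root",
    password="1"
).load()
check.show(5)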

(2) Count the ten states with the largest number of confirmed cases in the United States

Code:
# Group by the state field, sum the cases column, sort in descending order of
# the summed cumulative totals, and take the top 10
result2 = df.groupBy("state").sum("cases").orderBy("sum(cases)", ascending=False).limit(10)
# Set the number of partitions to 1, the file type to json, and write in overwrite mode
result2.repartition(1).write.format("json").mode("overwrite").save("file:///usr/local/test/quezhentop10.json")

(3) Count the ten states with the most deaths in the United States

Code:
# Group by the state field, sum the deaths column, sort in descending order of
# the summed totals, and take the top 10
result3 = df.groupBy("state").sum("deaths").orderBy("sum(deaths)", ascending=False).limit(10)
result3.repartition(1).write.format("json").mode("overwrite").save("file:///usr/local/test/deathstop10.json")

(4) Count the ten states with the smallest number of confirmed cases in the United States

Code:
# Group by the state field, sum the cases column, sort in ascending order of
# the summed totals, and take the top 10
result4 = df.groupBy("state").sum("cases").orderBy("sum(cases)").limit(10)
result4.repartition(1).write.format("json").mode("overwrite").save("file:///usr/local/test/quezhenbot10.json")

4. Test and analysis of program results

4.1 Result of using Python to convert the file type, as shown in Figure 3

Figure 3 - Converted txt file

4.2 Start HDFS and upload files

Figure 4 - Starting HDFS

Figure 5 - Uploading files to HDFS

4.3 Start MySQL and pyspark

As shown in Figures 6 and 7:

Figure 6 - Starting MySQL

Figure 7 - Starting pyspark

4.4 Cumulative confirmed cases, deaths, and case fatality rate in each state

View the results in pyspark, as shown in Figure 8:

Figure 8 - Cumulative confirmed cases, deaths, and fatality rates

View the data in MySQL, as shown in Figure 9:

Figure 9 - Cumulative confirmed cases, deaths, and fatality rates in MySQL

4.5 The ten states with the largest number of confirmed cases

As shown in Figure 10:

Figure 10 - Ten states with the most confirmed cases

4.6 The ten states with the highest number of deaths

As shown in Figure 11:

Figure 11 - Ten states with the most deaths

4.7 The ten states with the smallest number of confirmed cases

As shown in Figure 12:

Figure 12 - Ten states with the fewest confirmed cases

4.8 Line chart of the cumulative confirmed cases and deaths drawn with pyecharts

As shown in Figure 13:

Figure 13 - Line chart of cumulative confirmed cases and deaths
From March 2020 onward, the cumulative number of confirmed cases grew steadily, and the number of deaths remained high from early April onward.
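
A chart like Figure 13 can be generated with pyecharts. The sketch below is one way to do it, assuming the pyecharts v1 API is available and the df DataFrame from section 3.4 is still in scope; the output file name is illustrative:

from pyecharts import options as opts
from pyecharts.charts import Line

# Aggregate national daily totals from the DataFrame built in section 3.4
daily = df.groupBy("date").sum("cases", "deaths").orderBy("date").collect()
dates = [str(r["date"]) for r in daily]
cases = [r["sum(cases)"] for r in daily]
deaths = [r["sum(deaths)"] for r in daily]

line = (
    Line()
    .add_xaxis(dates)
    .add_yaxis("cumulative confirmed", cases, is_symbol_show=False)
    .add_yaxis("cumulative deaths", deaths, is_symbol_show=False)
    .set_global_opts(title_opts=opts.TitleOpts(title="US cumulative confirmed cases and deaths"))
)
line.render("cases_deaths_line.html")  # writes an interactive HTML chart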

4.9 Line charts of the cumulative confirmed cases and deaths in each state

As shown in Figure 14 and Figure 15:

Figure 14 - Line chart 1 of cumulative confirmed cases and deaths by state

Figure 15 - Line chart 2 of cumulative confirmed cases and deaths by state
It can be seen that the number of confirmed cases in each state kept increasing over time.

4.10 Cumulative confirmed cases and deaths by state

As shown in Figure 16:

Figure 16 - Cumulative confirmed cases and deaths in each state
Note: The cumulative case count and the death count are highly positively correlated: a state with a higher cumulative case count is likely to have a higher death toll as well. Both are unevenly distributed; for example, the cumulative cases and deaths in New York State are far higher than in other states, while those in South Dakota are relatively low.
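
The claimed correlation can also be checked numerically. A minimal sketch, assuming the df DataFrame from section 3.4 is in scope, using Spark's built-in Pearson correlation:

# Compute state-level totals, then the Pearson correlation between the two columns
state_totals = df.groupBy("state").sum("cases", "deaths")
corr = state_totals.stat.corr("sum(cases)", "sum(deaths)")
print("Pearson correlation between cumulative cases and deaths: %.3f" % corr)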

4.11 The 20 states with the most confirmed cases - word cloud

As shown in Figure 17:

Figure 17 - Word cloud of confirmed cases
Explanation: 1. The font size and color in the word cloud reflect the number of confirmed cases in each state: the larger the font and the darker the color, the higher the count. The word cloud shows that the number of confirmed cases in New York State is significantly higher than in other states, where the counts are comparatively small.
2. The word placement and layout in the word cloud are randomly generated, so each rendering may differ, but the information conveyed is the same.
3. A word cloud makes it easy to display and compare the differences in confirmed cases among states at a glance, helping us better understand and analyze the development of the epidemic.
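
One way to build such a word cloud with pyecharts (a sketch assuming the v1 API and the df DataFrame from section 3.4; the output file name is illustrative):

from pyecharts import options as opts
from pyecharts.charts import WordCloud

# Collect the top 20 states by summed case counts as (word, weight) pairs
top20 = df.groupBy("state").sum("cases").orderBy("sum(cases)", ascending=False).limit(20).collect()
pairs = [(r["state"], int(r["sum(cases)"])) for r in top20]

wc = (
    WordCloud()
    .add("", pairs, word_size_range=[15, 80])
    .set_global_opts(title_opts=opts.TitleOpts(title="Top 20 states by confirmed cases"))
)
wc.render("cases_wordcloud.html")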

4.12 The ten states with the highest death toll in the United States

As shown in Figure 18:

Figure 18 - Ten states with the most deaths
Note: The horizontal axis shows the state names, and the vertical axis shows the number of deaths in each state.
As can be seen from the histogram, New York State has the largest number of deaths, reaching nearly 1 million, while the remaining states have relatively few. The histogram gives a clearer picture of the epidemic's impact on different states and helps policy makers and the public make corresponding decisions; on this basis, targeted measures can be taken according to each state's situation to effectively curb the spread of the virus and protect public health and safety.
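
A histogram like Figure 18 can be drawn from the deathstop10.json output of section 3.4. A minimal sketch, assuming the pyecharts v1 API and the output path used earlier (Spark writes the JSON as one object per line inside the output directory):

import json, glob

from pyecharts import options as opts
from pyecharts.charts import Bar

# Spark's .json output is a directory; read the part file(s) inside it
rows = []
for path in glob.glob("/usr/local/test/deathstop10.json/part-*.json"):
    with open(path) as f:
        rows.extend(json.loads(line) for line in f)

bar = (
    Bar()
    .add_xaxis([r["state"] for r in rows])
    .add_yaxis("deaths", [r["sum(deaths)"] for r in rows])
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Ten states with the most deaths"),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45)),
    )
)
bar.render("deaths_top10_bar.html")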

4.13 Mortality rate by state in the United States

As shown in Figure 19:

Figure 19 - Death rate of each state
Each sector in the pie chart represents a state, and its area corresponds to that state's death rate.
As can be seen from the pie chart, New York State has the highest death rate at 7%, while the other states have relatively low rates.
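
A sketch of how such a pie chart could be produced with pyecharts, assuming result1 from section 3.4 (with its mortality_rate column) is still in scope:

from pyecharts import options as opts
from pyecharts.charts import Pie

# Collect (state, mortality rate in %) pairs from the result1 DataFrame
rows = result1.collect()
pairs = [(r["state"], round(float(r["mortality_rate"]) * 100, 2)) for r in rows]

pie = (
    Pie()
    .add("mortality rate", pairs, radius=["30%", "70%"])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}%"))
    .set_global_opts(title_opts=opts.TitleOpts(title="Mortality rate by state"))
)
pie.render("mortality_pie.html")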

4.14 Pie chart of cumulative deaths and the death ratio in the United States

As shown in Figure 20:

Figure 20 - Pie chart of cumulative deaths and the death ratio
The chart shows the ratio of total deaths to non-deaths. The two slices in the pie chart represent the number of deaths and the number of non-deaths, with the area corresponding to the size of each category. As of the end of the dataset, cumulative COVID-19 deaths in the United States account for about 4.7% of the total number of cases, while non-deaths account for about 95.3%.
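
The two slice values can be computed directly from the DataFrame. A minimal sketch, assuming df from section 3.4 is in scope and the pyecharts v1 API:

from pyspark.sql import functions as F
from pyecharts import options as opts
from pyecharts.charts import Pie

# Overall totals: deaths vs. the remaining (non-death) cases
totals = df.agg(F.sum("cases").alias("cases"), F.sum("deaths").alias("deaths")).first()
pairs = [("deaths", int(totals["deaths"])),
         ("non-deaths", int(totals["cases"] - totals["deaths"]))]

pie = (
    Pie()
    .add("", pairs)
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))  # {d} renders the percentage
)
pie.render("overall_fatality_pie.html")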

4.15 Top 10 states by cumulative cases in the United States - funnel chart

As shown in Figure 21:

Figure 21 - Funnel chart
The chart compares the top 10 states by case count. Each segment of the funnel represents a state, with its size corresponding to that state's cumulative number of cases. As can be seen from the funnel chart, the top three states are New York, New Jersey, and Massachusetts, with cumulative cases of 13.23 million, 4.88 million, and 2.47 million, respectively.
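
A sketch of the funnel chart, assuming the pyecharts v1 API and the quezhentop10.json output directory written in section 3.4:

import json, glob

from pyecharts import options as opts
from pyecharts.charts import Funnel

# Read the top-10 states back from the Spark JSON output (one object per line)
rows = []
for path in glob.glob("/usr/local/test/quezhentop10.json/part-*.json"):
    with open(path) as f:
        rows.extend(json.loads(line) for line in f)

funnel = (
    Funnel()
    .add("cases", [(r["state"], r["sum(cases)"]) for r in rows],
         label_opts=opts.LabelOpts(position="inside"))
    .set_global_opts(title_opts=opts.TitleOpts(title="Top 10 states by cumulative cases"))
)
funnel.render("cases_funnel.html")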

4.16 The ten states with the fewest cumulative cases

As shown in Figure 22:

Figure 22 - Ten states with the fewest cumulative cases
As can be seen from the histogram, the three states with the fewest cumulative cases are the Northern Mariana Islands, the Virgin Islands, and Alaska, with 689, 3,028, and 16,291 cumulative cases respectively.

5. Conclusion and experience

Through the processing and analysis of US COVID-19 epidemic data, the following conclusions and insights were drawn:
1. The COVID-19 epidemic in the United States spread rapidly in early 2020, especially on the east coast, although toward the end of the period covered the outbreak began to ease in some states.
2. The degree to which the epidemic affected different states varies greatly, with some states having significantly higher death and infection rates than others. This relates to factors such as each state's population density, economic level, and medical resources.
3. Protective measures such as social distancing and masks can effectively slow the spread of the virus and reduce death and infection rates.
4. Timely monitoring and early warning based on epidemic data can help the government and the public make more effective prevention and control decisions. At the same time, data processing and analysis technologies are playing an increasingly important role in epidemic prevention and control.
In short, data analysis of the US COVID-19 epidemic not only gives a more comprehensive understanding of how the epidemic spread across the country, but also provides data support and guidance for epidemic prevention and control, and serves as a practical case of applying these technologies in the field of data analysis.
I learned the following points:
1. Data cleaning and preprocessing are very important. Before analysis, the raw data must be cleaned and preprocessed, including removing missing values, duplicates, and outliers, to ensure accurate and reliable results.
2. Data visualization presents data more clearly. Presenting data in charts makes it more intuitive, reveals relationships and trends between the data, and supports further analysis and reasoning.
3. The choice of analysis methods and techniques also matters. Appropriate methods and techniques should be selected according to the characteristics of the data and the needs of the problem in order to reach more accurate and meaningful conclusions.
During the experiment we encountered some difficulties. One of the main problems was the relatively large scale of the dataset, which made data processing and computation time-consuming and affected the efficiency of the experiment.
To solve this problem, we used Spark for data processing and analysis, leveraging its distributed computing capability to accelerate computation and improve efficiency. In the end, we successfully completed the experiment and obtained meaningful results.

