Python application examples (2) data visualization (4)


Download data from the web and visualize it. There is an enormous amount of data online, much of it never examined closely. If you can analyze this data, you can discover patterns and associations that others have not noticed.

The data we access and visualize is stored in two common formats: CSV and JSON. We will use the Python module csv to process weather data stored in CSV format and find the maximum and minimum temperatures for two regions over a period of time. We then use Matplotlib to chart the downloaded data, showing temperature changes in two different locations: Sitka, Alaska, and Death Valley, California. Finally, we use the module json to access earthquake data stored in JSON format, and use Plotly to draw a scatterplot showing the location and magnitude of these earthquakes.

1. CSV file format

To store data in a text file, an easy way is to write the data to the file as a series of comma-separated values. Such files are called CSV files. For example, here is a row of weather data in CSV format:

"USW00025333","SITKA AIRPORT, AK US","2018-01-01","0.45",,"48","38"

This is the weather data for January 1, 2018 in Sitka, Alaska. It includes the maximum and minimum temperatures for the day, along with other measurements. CSV files are cumbersome for humans to read, but programs can easily extract and manipulate values from them, which speeds up data analysis.
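To see why a dedicated parser helps, here is a minimal sketch that feeds the sample row above to csv.reader. The row is held in an in-memory io.StringIO buffer purely for illustration; that buffer is an assumption of this sketch, not part of the original program. The station name contains a comma, so naive splitting on commas yields the wrong number of fields.

```python
import csv
import io

# The sample row from the text, held in memory instead of a file.
line = '"USW00025333","SITKA AIRPORT, AK US","2018-01-01","0.45",,"48","38"'

# Naive splitting breaks the station name at its embedded comma.
naive_fields = line.split(',')

# csv.reader honors the quotes and keeps the name as a single field.
row = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), len(row))  # number of fields found by each approach
print(row[1])                       # the station name, comma included
print(repr(row[4]))                 # the empty value between two commas
```

The empty string at index 4 shows that a missing measurement still occupies its column, so positions stay aligned with the header.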

We will first process a small amount of Sitka weather data in CSV format. Copy the file sitka_weather_07-2018_simple.csv into the data folder inside the folder where the programs for this chapter are stored.

1.1 Parsing the CSV file header

The csv module is included in the Python standard library and can be used to parse the rows of a CSV file, letting us quickly extract the values of interest. Let's start with the first line of the file, which contains a series of headers describing the information in the lines that follow: sitka_highs.py

  import csv

  filename = 'data/sitka_weather_07-2018_simple.csv'
❶ with open(filename) as f:
❷     reader = csv.reader(f)
❸     header_row = next(reader)
      print(header_row)

After importing the module csv, assign the name of the file to be used to filename. Next, open the file and assign the returned file object to f (see ❶). Then, create a reader object associated with the file by calling csv.reader() and passing it the previously stored file object as an argument (see ❷). The reader object is assigned to reader.

The reader object is an iterator, so Python's built-in function next() can be called on it to return the next line in the file. In the above code, next() is called only once, so it returns the first line of the file, which contains the headers (see ❸). The returned data is stored in header_row. As you can see, header_row contains weather-related headers that indicate what data each row contains:

['STATION', 'NAME', 'DATE', 'PRCP', 'TAVG', 'TMAX', 'TMIN']

The reader processes the comma-separated first line of the file and stores each item as an element in a list. The header STATION is the code of the weather station where the data was recorded. Its position indicates that the first value on each line is the weather station code. The header NAME indicates that the second value on each line is the name of the weather station. The remaining headers describe the other recorded values. For now, we are most concerned with the date (DATE), maximum temperature (TMAX), and minimum temperature (TMIN). This is a simple dataset containing only precipitation and temperature data. When you download weather data yourself, you can choose to include many other measurements, such as wind speed and direction and more detailed precipitation data.
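The behavior of next() on a reader can be sketched without the data file. The example below assumes an in-memory buffer holding the header shown above plus the sample row from section 1; both the buffer and the variable names first_data and leftover are illustrative, not part of sitka_highs.py.

```python
import csv
import io

# Header from the text plus the sample row shown earlier, held in memory.
text = (
    '"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"\n'
    '"USW00025333","SITKA AIRPORT, AK US","2018-01-01","0.45",,"48","38"\n'
)
reader = csv.reader(io.StringIO(text))

header_row = next(reader)      # first call: the header line
first_data = next(reader)      # second call: the first data line
leftover = next(reader, None)  # the default is returned once the reader is exhausted

print(header_row)
print(first_data[2])
print(leftover)
```

Each call advances the reader by one line, which is why a later for loop over the same reader starts at the first data line.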

1.2 Printing the headers and their positions

To make the header data easier to understand, print each header and its position in the list: sitka_highs.py

  --snip--
  with open(filename) as f:
      reader = csv.reader(f)
      header_row = next(reader)

❶     for index, column_header in enumerate(header_row):
          print(index, column_header)

In the loop, enumerate() (see ❶) is called on the list to get the index of each element and its value. (Note that we removed the line print(header_row) in favor of showing this more verbose version.) The output is as follows, indicating the index of each file header:

0 STATION
1 NAME
2 DATE
3 PRCP
4 TAVG
5 TMAX
6 TMIN
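Instead of reading these positions off the printed output, they can also be looked up by name. The following is a small sketch of that idea. The dictionary column_index and the variables date_col, tmax_col, and tmin_col are hypothetical names introduced here, not part of the original program.

```python
# The header row as printed above.
header_row = ['STATION', 'NAME', 'DATE', 'PRCP', 'TAVG', 'TMAX', 'TMIN']

# Build a name -> position mapping once, using enumerate() as in the loop above.
column_index = {name: index for index, name in enumerate(header_row)}

date_col = column_index['DATE']
tmax_col = column_index['TMAX']
tmin_col = column_index['TMIN']

print(date_col, tmax_col, tmin_col)
```

Looking columns up by name keeps the program working if a downloaded file orders its columns differently, whereas a hard-coded index would silently read the wrong column.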

1.3 Extract and read data

Now that we know which columns hold the data we need, let's read some of it. First, read the daily maximum temperature: sitka_highs.py

  --snip--
  with open(filename) as f:
      reader = csv.reader(f)
      header_row = next(reader)

      # Get the high temperatures from the file.
❶     highs = []
❷     for row in reader:
❸         high = int(row[5])
          highs.append(high)

  print(highs)

Create an empty list called highs (see ❶), then iterate through the remaining lines in the file (see ❷). The reader object continues reading the CSV file from where it left off, automatically returning the line after its current position on each pass. Since the header line has already been read, the loop starts at the second line, which contains the first row of actual data. On each pass through the loop, the value at index 5 (column TMAX) is appended to highs (see ❸). The file stores this value as a string, so the function int() converts it to a number before it is appended.

The data currently stored in highs is as follows:

[62, 58, 70, 70, 67, 59, 58, 62, 66, 59, 56, 63, 65, 58, 56, 59, 64, 60, 60,
 61, 65, 65, 63, 59, 64, 65, 68, 66, 64, 67, 65]
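With the values in a list, the maximum and minimum mentioned in the introduction can be found with Python's built-in max() and min(). The sketch below copies the list printed above. The names hottest, coolest, and hottest_day are illustrative, and the day-of-month calculation assumes the file holds one row per day starting on July 1.

```python
# The July 2018 highs for Sitka, copied from the output above.
highs = [62, 58, 70, 70, 67, 59, 58, 62, 66, 59, 56, 63, 65, 58, 56, 59, 64,
         60, 60, 61, 65, 65, 63, 59, 64, 65, 68, 66, 64, 67, 65]

hottest = max(highs)
coolest = min(highs)

# index() returns the first position of the value; add 1 to get the day of
# the month, assuming one row per day starting on July 1.
hottest_day = highs.index(hottest) + 1

print(len(highs), hottest, coolest, hottest_day)
```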

After extracting the maximum temperature for each day and storing it in a list, it's time to visualize the data.

1.4 Draw a temperature graph

To visualize these temperature data, first use Matplotlib to create a simple graph showing daily maximum temperatures, as follows: sitka_highs.py

  import csv

  import matplotlib.pyplot as plt

  filename = 'data/sitka_weather_07-2018_simple.csv'
  with open(filename) as f:
      --snip--

  # Plot the high temperatures.
  # Note: Matplotlib 3.6 and later name this style 'seaborn-v0_8'.
  plt.style.use('seaborn')
  fig, ax = plt.subplots()
❶ ax.plot(highs, c='red')

  # Format the plot.
❷ ax.set_title("Daily high temperatures - July 2018", fontsize=24)
❸ ax.set_xlabel('', fontsize=16)
  ax.set_ylabel("Temperature (F)", fontsize=16)
  ax.tick_params(axis='both', which='major', labelsize=16)

  plt.show()

Pass the list of maximum temperatures to plot() (see ❶), and pass c='red' to draw the data points in red. (Red is used for the high temperatures, and blue will be used for the low temperatures.) Next, set some other formatting, such as the title and font size (see ❷), which are covered in Chapter 15. Since no dates have been added yet, the x-axis has no label: we pass an empty string to ax.set_xlabel(), which still sets the font size for that label (see ❸). The figure shows the resulting plot: a simple line graph of the daily maximum temperature for July 2018 in Sitka, Alaska.

[Figure: line graph of daily maximum temperatures in Sitka, Alaska, July 2018]

1.5 Adding dates to charts

Now the temperature graph can be improved by extracting the dates along with the high temperatures and passing both to plot(), as follows: sitka_highs.py

  import csv
  from datetime import datetime

  import matplotlib.pyplot as plt

  filename = 'data/sitka_weather_07-2018_simple.csv'
  with open(filename) as f:
      reader = csv.reader(f)
      header_row = next(reader)

      # Get the dates and high temperatures from the file.
❶     dates, highs = [], []
      for row in reader:
❷         current_date = datetime.strptime(row[2], '%Y-%m-%d')
          high = int(row[5])
          dates.append(current_date)
          highs.append(high)

  # Plot the high temperatures.
  # Note: Matplotlib 3.6 and later name this style 'seaborn-v0_8'.
  plt.style.use('seaborn')
  fig, ax = plt.subplots()
❸ ax.plot(dates, highs, c='red')

  # Format the plot.
  ax.set_title("Daily high temperatures - July 2018", fontsize=24)
  ax.set_xlabel('', fontsize=16)
❹ fig.autofmt_xdate()
  ax.set_ylabel("Temperature (F)", fontsize=16)
  ax.tick_params(axis='both', which='major', labelsize=16)

  plt.show()

We create two empty lists to store the dates and maximum temperatures extracted from the file (see ❶). Then we convert the string containing the date (row[2]) into a datetime object (see ❷) and append it to the list dates. At ❸, we pass the dates and the maximum temperatures to plot(). At ❹, we call fig.autofmt_xdate() to draw the date labels diagonally so they don't overlap. The figure shows the improved graph.

[Figure: the same line graph with dates along the x-axis]
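The date conversion used above can be tried on its own. This is a minimal sketch of datetime.strptime() with the same '%Y-%m-%d' format. The input strings and the variable parsed_ok are made up for illustration.

```python
from datetime import datetime

# Parse a date string in the same year-month-day layout as the DATE column.
current_date = datetime.strptime('2018-07-01', '%Y-%m-%d')
print(current_date)
print(current_date.year, current_date.month, current_date.day)

# A string that does not match the format raises ValueError.
try:
    datetime.strptime('07/01/2018', '%Y-%m-%d')
    parsed_ok = True
except ValueError:
    parsed_ok = False
print(parsed_ok)
```

Because the result is a datetime object rather than a string, Matplotlib can place the points at the correct positions along a time axis.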


Origin blog.csdn.net/qq_41600018/article/details/131749290