Python data files and network data serialization in detail

1. Introduction to ETL

Most of the available data is stored in text files. This data can be unstructured text (such as a tweet or a literary work), or more structured data in which each line is a record and the fields are separated by special characters such as commas, tabs, or the pipe symbol "|".

Text files may be very large, a data set may be spread across dozens or even hundreds of files, and the data in them may be incomplete or riddled with dirty data. With all these variables, it's almost inevitable that you will need to read and use text file data.

As long as data files exist, the data will need to be fetched from them, parsed, converted into a useful format, and then operated on in some way. In fact, there is a standard term for this process: extract-transform-load (ETL).

Extraction refers to the process of reading and parsing data sources on demand. Transformation is cleaning and normalizing data, as well as combining, decomposing, or reorganizing its internal records. Loading means storing the transformed data into a new location, which can be another file or a database.

2. Text file reading

The first part of ETL is extraction, which involves opening a file and reading its contents. This process seems simple, but even something this simple can run into difficulties, such as file size. If the file is too large to fit into memory for manipulation, the code needs to be structured carefully so that it handles only a small portion of the file at a time, possibly one line at a time.

1. Text encoding: ASCII, Unicode, etc.

Another possible pitfall is character encoding. Most of the data exchanged in the real world is in text files, but exactly what "text" means can vary from application to application, person to person, and, of course, country to country.

Sometimes text means characters encoded in ASCII, which contains 128 characters, only 95 of which are printable. The good news about ASCII is that it is the lowest common denominator for most data interchange. The bad news is that there are so many alphabets and writing systems in the world that ASCII doesn't even begin to deal with their complexity. If you read a file assuming ASCII, you're almost guaranteed to run into trouble, with errors raised for character values it can't make sense of. It could be a German ü, a Portuguese ç, or pretty much anything other than English.

The reason for the error is that ASCII is based on 7 bits, while the bytes in a typical file are 8 bits, allowing 256 possible values instead of the 128 values of 7 bits. These extra values are usually used to store additional characters, such as extra punctuation (for example, en and em dashes), extra symbols (such as trademark, copyright, and degree symbols), accented letters, and so on.

The problem is that when you read a text file and encounter a character from the 128 values outside the ASCII range, you have no way to determine its encoding. For example, if the character value is 214, is it a division sign, an Ö, or something else? Without knowing the encoding used by the code that created the file, there's no way to tell.

There is an option that reduces this confusion: Unicode. The Unicode encoding called UTF-8 accepts the basic ASCII characters without modification and also handles a nearly unlimited set of other characters and symbols defined by the Unicode standard. Because of its flexibility, UTF-8 is used in more than 85% of web pages.

This means that it is best to assume UTF-8 encoding when reading text files. If the file contains only ASCII characters, it can still be read correctly, and if there are other characters encoded in UTF-8, it can also be read normally. Python 3's string type is designed to handle Unicode by default, which is really good news.

Even with Unicode, you will sometimes encounter text containing values that cannot be decoded successfully. Fortunately, Python's open function accepts an optional errors parameter, which determines how encoding and decoding errors are handled when reading and writing files. The default option is 'strict', which raises an error whenever an encoding error is encountered.

There are several other useful options: 'ignore' skips the characters that cause errors, 'replace' substitutes a marker character (usually ?) for each bad character, 'backslashreplace' replaces the offending bytes with backslash escape sequences, and 'surrogateescape' converts unusual bytes to special Unicode code points on read and back to their original byte sequence on write. How strict a scheme you need depends on the specific use case.

Below is a short example file containing invalid UTF-8 characters; let's see how the different options handle it.

First, write the bytes to the file in binary mode:

>>> open('test.txt', 'wb').write(bytes([65, 66, 67, 255, 192, 193]))
6

The above code generates a file containing "ABC" followed by 3 non-ASCII bytes, which may be displayed as different characters depending on the encoding used to view them.

If you view the file with vim, you will see the following:

ABCÿÀÁ
~

Now that the text file is available, try reading it with the default error handling option 'strict':

>>> x = open('test.txt').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: 
      invalid start byte

The fourth byte has the value 255, which is not valid UTF-8 at that position, so the 'strict' error handling raises an exception. Let's see how the other error handling options deal with this file, remembering that the last 3 bytes are the ones that cause trouble:

>>> open('test.txt', errors='ignore').read()
'ABC'
>>> open('test.txt', errors='replace').read()
'ABC���'
>>> open('test.txt', errors='surrogateescape').read()
'ABC\udcff\udcc0\udcc1'
>>> open('test.txt', errors='backslashreplace').read()
'ABC\\xff\\xc0\\xc1'
>>>

If you don't want problematic characters to appear at all, use the 'ignore' option. The 'replace' option merely marks the positions of invalid characters, while the other options attempt to preserve the invalid characters, uninterpreted, in various ways.

2. Unstructured text

Unstructured text files are the easiest type of data to read, but also the hardest to extract information from. Different approaches to unstructured text processing can vary widely, depending on the nature of the text and what it will be used for. A short example can be used here to demonstrate some fundamental issues and lay the groundwork for a discussion of structured text data files.

One of the simplest problems is to determine the format of the basic logical unit in the file. If you're dealing with a collection of news stories, you need to be able to break them down into cohesive units.

In many cases you may not need to treat each story or news piece as a whole data item. But if you do, you have to identify the data units you need and then come up with a strategy for splitting the file accordingly. Perhaps you need to process the text by paragraph, in which case you have to determine how paragraphs are separated in the file and write code accordingly. If paragraphs happen to correspond to lines of the text file, the job is easy. However, lines in text files are usually shorter than paragraphs, so some extra work is required.

Let's look at some examples:

Call me Ishmael.  Some years ago--never mind how long precisely--
having little or no money in my purse, and nothing particular
to interest me on shore, I thought I would sail about a little
and see the watery part of the world.  It is a way I have
of driving off the spleen and regulating the circulation.
Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I
find myself involuntarily pausing before coffin warehouses,
and bringing up the rear of every funeral I meet;
and especially whenever my hypos get such an upper hand of me,
that it requires a strong moral principle to prevent me from
deliberately stepping into the street, and methodically knocking
people's hats off--then, I account it high time to get to sea
as soon as I can.  This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword;
I quietly take to the ship.  There is nothing surprising in this.
If they but knew it, almost all men in their degree, some time
or other, cherish very nearly the same feelings towards
the ocean with me.

There now is your insular city of the Manhattoes, belted round by wharves
as Indian isles by coral reefs--commerce surrounds it with her surf.
Right and left, the streets take you waterward.  Its extreme downtown
is the battery, where that noble mole is washed by waves, and cooled
by breezes, which a few hours previous were out of sight of land.
Look at the crowds of water-gazers there.

The lines of text are broken more or less arbitrarily by the layout, and each paragraph is set off by a blank line. If you want to treat each paragraph as a processing unit, you need to split the text on the blank lines.

Fortunately, the task is easily accomplished with the string split() method. Each newline character in the string is represented by "\n". The last line of a paragraph ends with a newline, and a following blank line contributes a second newline, so paragraphs are separated by two consecutive newlines:

>>> moby_text = open("moby_01.txt").read()     ⇽---  read the whole file into a single string
>>> moby_paragraphs = moby_text.split("\n\n")     ⇽---  split on two consecutive newlines
>>> print(moby_paragraphs[1])
There now is your insular city of the Manhattoes, belted round by wharves
as Indian isles by coral reefs--commerce surrounds it with her surf.
Right and left, the streets take you waterward.  Its extreme downtown
is the battery, where that noble mole is washed by waves, and cooled
by breezes, which a few hours previous were out of sight of land.
Look at the crowds of water-gazers there.

Splitting text into paragraphs is a very simple first step when dealing with unstructured text. Additional normalization operations may be required on the text prior to processing.

Suppose you want to count the number of occurrences of each word in a text file. Splitting the file on whitespace gives you a list of its words, but counting occurrences accurately is harder, because "This", "this", "this." and "this," would all be counted as different words. The solution is to normalize the text, removing punctuation and making the case of all words consistent, before processing.

For the example text above, the code to generate a normalized word list might look like this:

>>> moby_text = open("moby_01.txt").read()     ⇽---  read the whole file into a single string
>>> moby_paragraphs = moby_text.split("\n\n")
>>> moby = moby_paragraphs[1].lower()     ⇽---  convert everything to lowercase
>>> moby = moby.replace(".", "")     ⇽---  remove periods
>>> moby = moby.replace(",", "")     ⇽---  remove commas
>>> moby_words = moby.split()
>>> print(moby_words)
['there', 'now', 'is', 'your', 'insular', 'city', 'of', 'the', 'manhattoes',
     'belted', 'round', 'by', 'wharves', 'as', 'indian', 'isles', 'by',
     'coral', 'reefs--commerce', 'surrounds', 'it', 'with', 'her', 'surf',
     'right', 'and', 'left', 'the', 'streets', 'take', 'you', 'waterward',
     'its', 'extreme', 'downtown', 'is', 'the', 'battery', 'where', 'that',
     'noble', 'mole', 'is', 'washed', 'by', 'waves', 'and', 'cooled', 'by',
     'breezes', 'which', 'a', 'few', 'hours', 'previous', 'were', 'out',
     'of', 'sight', 'of', 'land', 'look', 'at', 'the', 'crowds', 'of',
     'water-gazers', 'there']
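
Once the words are normalized, counting them is straightforward. This is not part of the original example, but a minimal sketch with collections.Counter from the standard library might look like this:

>>> from collections import Counter
>>> word_counts = Counter(moby_words)
>>> word_counts['of']
4
>>> word_counts.most_common(3)
[('of', 4), ('the', 4), ('by', 4)]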

3. Plain text files with delimiters

While unstructured text files are easy to read, the downside is that they lack structure. It is often much more useful if the file is organized so that individual data values can be extracted. The simplest solution is to split the file into lines, with each line containing one piece of data. It might be a list of file names to process, a list of people's names to print (say, on name tags), or a series of temperature readings from a remote monitoring device. In these cases, parsing the data is as simple as reading each line and converting it to the correct type if necessary; the data is then ready for use.
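
As a minimal sketch (the file name and its one-reading-per-line contents are assumed for illustration), reading such a file might be no more than this:

>>> with open("temps.txt") as infile:
...     readings = [float(line) for line in infile]
...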

In most cases, however, things are not that simple. Usually related pieces of information need to be grouped, and your code needs to read them together. A common practice is to put several related values on the same line, separated by a special character. That way, as the file is read line by line, each line can be split on the special character and the field values placed into variables for later processing.

The following file is a simple example of a delimited format containing temperature data:

State|Month Day, Year Code|Avg Daily Max Air Temperature (F)|Record Count for 
     Daily Max Air Temp (F)
Illinois|1979/01/01|17.48|994
Illinois|1979/01/02|4.64|994
Illinois|1979/01/03|11.05|994
Illinois|1979/01/04|9.51|994
Illinois|1979/05/15|68.42|994
Illinois|1979/05/16|70.29|994
Illinois|1979/05/17|75.34|994
Illinois|1979/05/18|79.13|994
Illinois|1979/05/19|74.94|994

The data here is pipe-delimited; that is, each field in a row is separated from the next by the pipe character "|".

Four fields are given here: the state where the observations were made, the date, the average daily maximum temperature, and the number of stations reporting. Other commonly used delimiters are tabs and commas. The comma is probably the most common delimiter, but the delimiter can be any character that does not appear in the data. Comma separation is so common that the format is often referred to as CSV (comma-separated values), and files of this type often have a .csv extension.

No matter what character is used as the delimiter, as long as you know it in advance, you can write Python code to split each line into multiple fields and return it as a list.

For the above example, the string split() method can be used to split each row into a list of data values:

>>> line = "Illinois|1979/01/01|17.48|994"
>>> print(line.split("|"))
['Illinois', '1979/01/01', '17.48', '994']

Note that this technique is very simple, but all the data values end up as strings, which may be inconvenient for later processing.
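
If the numeric fields need to be numbers rather than strings, a conversion step can follow the split. This is just a sketch based on the sample line above:

>>> line = "Illinois|1979/01/01|17.48|994"
>>> state, date, avg_max, count = line.split("|")
>>> avg_max, count = float(avg_max), int(count)
>>> avg_max, count
(17.48, 994)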

4. csv module

If you need to do more processing with delimited data files, you should be familiar with the csv module and its options.

When asked what my favorite module in the Python standard library is, I have mentioned the csv module more than once. Not because it is glamorous (it isn't), but because it has saved me a great deal of effort; relying on it, I have avoided more bugs in my career than with any other module.

The csv module is a perfect example of Python's "batteries included" philosophy. While it's entirely possible to write your own code to read delimited files, and in many cases it isn't even particularly difficult, it's much easier and more reliable to use the Python module. The csv module has been tested and optimized, and it has many features that you probably wouldn't bother to write yourself but that are genuinely convenient and time-saving when you need them.

First look at the data above and think about how you would read it. The code that parses the data has to do two things: read each line and strip the trailing newline, then split the line on the pipe character and append the resulting list of fields to a list of rows. A solution might look like this:

>>> results = []
>>> for line in open("temp_data_pipes_00a.txt"):
...     fields = line.strip().split("|")
...     results.append(fields)
... 
>>> results
[['State', 'Month Day, Year Code', 'Avg Daily Max Air Temperature (F)', 
     'Record Count for Daily Max Air Temp (F)'], ['Illinois', '1979/01/01', 
     '17.48', '994'], ['Illinois', '1979/01/02', '4.64', '994'], ['Illinois', 
     '1979/01/03', '11.05', '994'], ['Illinois', '1979/01/04', '9.51', 
     '994'], ['Illinois', '1979/05/15', '68.42', '994'], ['Illinois', '1979/
     05/16', '70.29', '994'], ['Illinois', '1979/05/17', '75.34', '994'], 
     ['Illinois', '1979/05/18', '79.13', '994'], ['Illinois', '1979/05/19', 
     '74.94', '994']]

If the csv module were used to accomplish the same job, the code might look like this:

>>> import csv
>>> results = [fields for fields in 
     csv.reader(open("temp_data_pipes_00a.txt", newline=''), delimiter="|")]
>>> results
[['State', 'Month Day, Year Code', 'Avg Daily Max Air Temperature (F)', 
     'Record Count for Daily Max Air Temp (F)'], ['Illinois', '1979/01/01', 
     '17.48', '994'], ['Illinois', '1979/01/02', '4.64', '994'], ['Illinois', 
     '1979/01/03', '11.05', '994'], ['Illinois', '1979/01/04', '9.51', 
     '994'], ['Illinois', '1979/05/15', '68.42', '994'], ['Illinois', '1979/
     05/16', '70.29', '994'], ['Illinois', '1979/05/17', '75.34', '994'], 
     ['Illinois', '1979/05/18', '79.13', '994'], ['Illinois', '1979/05/19', 
     '74.94', '994']]

For this simple example, using the csv module doesn't seem to buy much over hand-written code. Still, the code is two lines shorter and a bit cleaner, and there's no need to bother stripping newlines. The real advantage comes when you have to tackle more challenging cases.

The data in the example above is real, but it has been simplified and cleaned. The raw data obtained from the source is more complex: it contains more fields, some fields are enclosed in quotes while others are not, and the first field is empty.

The original file is tab-delimited, but for illustration purposes, it will be comma-delimited here:

"Notes","State","State Code","Month Day, Year","Month Day, Year Code",Avg 
     Daily Max Air Temperature (F),Record Count for Daily Max Air Temp 
     (F),Min Temp for Daily Max Air Temp (F),Max Temp for Daily Max Air Temp 
     (F),Avg Daily Max Heat Index (F),Record Count for Daily Max Heat Index 
     (F),Min for Daily Max Heat Index (F),Max for Daily Max Heat Index 
     (F),Daily Max Heat Index (F) % Coverage

,"Illinois","17","Jan 01, 1979","1979/01/
     01",17.48,994,6.00,30.50,Missing,0,Missing,Missing,0.00%
,"Illinois","17","Jan 02, 1979","1979/01/02",4.64,994,-
     6.40,15.80,Missing,0,Missing,Missing,0.00%
,"Illinois","17","Jan 03, 1979","1979/01/03",11.05,994,-
     0.70,24.70,Missing,0,Missing,Missing,0.00%
,"Illinois","17","Jan 04, 1979","1979/01/
     04",9.51,994,0.20,27.60,Missing,0,Missing,Missing,0.00%
,"Illinois","17","May 15, 1979","1979/05/
     15",68.42,994,61.00,75.10,Missing,0,Missing,Missing,0.00%
,"Illinois","17","May 16, 1979","1979/05/
     16",70.29,994,63.40,73.50,Missing,0,Missing,Missing,0.00%
,"Illinois","17","May 17, 1979","1979/05/
     17",75.34,994,64.00,80.50,82.60,2,82.40,82.80,0.20%
,"Illinois","17","May 18, 1979","1979/05/
     18",79.13,994,75.50,82.10,81.42,349,80.20,83.40,35.11%
,"Illinois","17","May 19, 1979","1979/05/
     19",74.94,994,66.90,83.10,82.87,78,81.60,85.20,7.85%

Note that some fields contain commas themselves. The convention in that case is to quote the field to indicate that the commas inside it should not be interpreted as delimiters. As shown here, it is common to quote only some of the fields, especially those whose values might contain a delimiter, although a few fields that are unlikely to contain delimiters are quoted as well.

Handling this case with hand-rolled code becomes more complicated and bloated. You can no longer simply split each line on the delimiter; you have to make sure you split only on delimiters that are not inside quoted strings, and you have to remove the quotes around strings, which may or may not be present. With the csv module, no changes to the code are needed at all. In fact, since the comma is the default delimiter, you don't even need to specify it:

>>> results2 = [fields for fields in csv.reader(open("temp_data_01.csv", 
     newline=''))]
>>> results2
 [['Notes', 'State', 'State Code', 'Month Day, Year', 'Month Day, Year Code', 
      'Avg Daily Max Air Temperature (F)', 'Record Count for Daily Max Air 
      Temp (F)', 'Min Temp for Daily Max Air Temp (F)', 'Max Temp for Daily 
      Max Air Temp (F)', 'Avg Daily Min Air Temperature (F)', 'Record Count 
      for Daily Min Air Temp (F)', 'Min Temp for Daily Min Air Temp (F)', 'Max 
      Temp for Daily Min Air Temp (F)', 'Avg Daily Max Heat Index (F)', 
      'Record Count for Daily Max Heat Index (F)', 'Min for Daily Max Heat 
      Index (F)', 'Max for Daily Max Heat Index (F)', 'Daily Max Heat Index 
       (F) % Coverage'], ['', 'Illinois', '17', 'Jan 01, 1979', '1979/01/01', 
      '17.48', '994', '6.00', '30.50', '2.89', '994', '-13.60', '15.80', 
      'Missing', '0', 'Missing', 'Missing', '0.00%'], ['', 'Illinois', '17', 
      'Jan 02, 1979', '1979/01/02', '4.64', '994', '-6.40', '15.80', '-9.03', 
      '994', '-23.60', '6.60', 'Missing', '0', 'Missing', 'Missing', '0.00%'], 
       ['', 'Illinois', '17', 'Jan 03, 1979', '1979/01/03', '11.05', '994', '-
      0.70', '24.70', '-2.17', '994', '-18.30', '12.90', 'Missing', '0', 
      'Missing', 'Missing', '0.00%'], ['', 'Illinois', '17', 'Jan 04, 1979', 
      '1979/01/04', '9.51', '994', '0.20', '27.60', '-0.43', '994', '-16.30', 
      '16.30', 'Missing', '0', 'Missing', 'Missing', '0.00%'], ['', 
      'Illinois', '17', 'May 15, 1979', '1979/05/15', '68.42', '994', '61.00', 
      '75.10', '51.30', '994', '43.30', '57.00', 'Missing', '0', 'Missing', 
      'Missing', '0.00%'], ['', 'Illinois', '17', 'May 16, 1979', '1979/05/
      16', '70.29', '994', '63.40', '73.50', '48.09', '994', '41.10', '53.00', 
      'Missing', '0', 'Missing', 'Missing', '0.00%'], ['', 'Illinois', '17', 
      'May 17, 1979', '1979/05/17', '75.34', '994', '64.00', '80.50', '50.84', 
      '994', '44.30', '55.70', '82.60', '2', '82.40', '82.80', '0.20%'], ['', 
      'Illinois', '17', 'May 18, 1979', '1979/05/18', '79.13', '994', '75.50', 
      '82.10', '55.68', '994', '50.00', '61.10', '81.42', '349', '80.20', 
      '83.40', '35.11%'], ['', 'Illinois', '17', 'May 19, 1979', '1979/05/19', 
      '74.94', '994', '66.90', '83.10', '58.59', '994', '50.90', '63.20', 
      '82.87', '78', '81.60', '85.20', '7.85%']]

Note that the surrounding quotation marks have been removed and that field values containing commas are preserved intact, without any stray characters.

5. Read the csv file and save it as a list of dictionaries

In the above example, each row of data is returned as a list of fields. This result is adequate in many cases, but sometimes it is more convenient to return each row as a dictionary, where the field names can be used as the keys of the dictionary.

For this scenario, the csv library provides a DictReader object, which can take a list of fields as an argument, or read the field names from the first line of the data file. If you want to open a data file with DictReader, the code will look like this:

>>> results = [fields for fields in csv.DictReader(open("temp_data_01.csv", 
     newline=''))]
>>> results[0]
OrderedDict([('Notes', ''), ('State', 'Illinois'), ('State Code', '17'), 
     ('Month Day, Year', 'Jan 01, 1979'), ('Month Day, Year Code', '1979/01/
     01'), ('Avg Daily Max Air Temperature (F)', '17.48'), ('Record Count for 
     Daily Max Air Temp (F)', '994'), ('Min Temp for Daily Max Air Temp (F)', 
     '6.00'), ('Max Temp for Daily Max Air Temp (F)', '30.50'), ('Avg Daily 
     Min Air Temperature (F)', '2.89'), ('Record Count for Daily Min Air Temp 
     (F)', '994'), ('Min Temp for Daily Min Air Temp (F)', '-13.60'), ('Max 
     Temp for Daily Min Air Temp (F)', '15.80'), ('Avg Daily Max Heat Index 
     (F)', 'Missing'), ('Record Count for Daily Max Heat Index (F)', '0'), 
     ('Min for Daily Max Heat Index (F)', 'Missing'), ('Max for Daily Max Heat Index (F)', 
     'Missing'), ('Daily Max Heat Index (F) % Coverage',  '0.00%')])

Note that csv.DictReader returns OrderedDict objects, so the fields keep their original order. Although they look slightly different, they behave just like dictionaries:

>>> results[0]['State']
'Illinois'

If the data is complex and specific fields need to be operated on, DictReader makes it easier to be sure you're getting the right field, and it also makes the code somewhat easier to understand. Conversely, if the data set is very large, keep in mind that DictReader can take on the order of twice as long to read the same amount of data.
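
If the file is too large to hold comfortably in memory, the rows can also be handled one at a time instead of being collected into a list. The following is only a sketch; process() is a hypothetical placeholder for whatever per-row handling you need:

>>> with open("temp_data_01.csv", newline='') as infile:
...     for row in csv.DictReader(infile):
...         if row['Avg Daily Max Heat Index (F)'] != 'Missing':
...             process(row)
...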

3. Excel file processing

Since Excel can read and write CSV files, the quickest and easiest way to extract data from an Excel spreadsheet file is often to open it in Excel and save it as a CSV file.

However, extracting with Excel doesn't always make sense, especially if there are a lot of files to process. Even though it is theoretically possible to automate the process of opening and saving CSV files, it may be faster to process Excel files directly in this case.

Spreadsheet files can have multiple sheets in the same file, macros, and a variety of cell formats, an in-depth discussion of which is beyond the scope here. The fact is that Python's standard library does not contain a module for reading and writing Excel files; an external module needs to be installed to handle the Excel format. Fortunately, there are several modules that can do the job.

This example uses a module called OpenPyXL, available from the Python package repository. Execute the following command on the command line to install:

$ pip install openpyxl

The spreadsheet (not shown here) contains the same temperature data as the earlier examples.

Reading the file is fairly simple, but still more tedious than reading a CSV file. First load the workbook, then find the desired sheet, and then iterate over the rows, extracting the data from each cell. Here is sample code to read the spreadsheet:

>>> from openpyxl import load_workbook
>>> wb = load_workbook('temp_data_01.xlsx')
>>> results = []
>>> ws = wb.worksheets[0]
>>> for row in ws.iter_rows():
...     results.append([cell.value for cell in row])
...
>>> print(results)
[['Notes', 'State', 'State Code', 'Month Day, Year', 'Month Day, Year Code', 
     'Avg Daily Max Air Temperature (F)', 'Record Count for Daily Max Air 
     Temp (F)', 'Min Temp for Daily Max Air Temp (F)', 'Max Temp for Daily
     Max Air Temp (F)', 'Avg Daily Max Heat Index (F)', 'Record Count for 
     Daily Max Heat Index (F)', 'Min for Daily Max Heat Index (F)', 'Max for 
     Daily Max Heat Index (F)', 'Daily Max Heat Index (F) % Coverage'], 
     [None, 'Illinois', 17, 'Jan 01, 1979', '1979/01/01', 17.48, 994, 6, 
     30.5, 'Missing', 0, 'Missing', 'Missing', '0.00%'], [None, 'Illinois', 
     17, 'Jan 02, 1979', '1979/01/02', 4.64, 994, -6.4, 15.8, 'Missing', 0, 
     'Missing', 'Missing', '0.00%'], [None, 'Illinois', 17, 'Jan 03, 1979', 
     '1979/01/03', 11.05, 994, -0.7, 24.7, 'Missing', 0, 'Missing', 
     'Missing', '0.00%'], [None, 'Illinois', 17, 'Jan 04, 1979', '1979/01/
     04', 9.51, 994, 0.2, 27.6, 'Missing', 0, 'Missing', 'Missing', '0.00%'], 
     [None, 'Illinois', 17, 'May 15, 1979', '1979/05/15', 68.42, 994, 61, 
     75.1, 'Missing', 0, 'Missing', 'Missing', '0.00%'], [None, 'Illinois', 
     17, 'May 16, 1979', '1979/05/16', 70.29, 994, 63.4, 73.5, 'Missing', 0, 
     'Missing', 'Missing', '0.00%'], [None, 'Illinois', 17, 'May 17, 1979', 
     '1979/05/17', 75.34, 994, 64, 80.5, 82.6, 2, 82.4, 82.8, '0.20%'], 
     [None, 'Illinois', 17, 'May 18, 1979', '1979/05/18', 79.13, 994, 75.5, 
     82.1, 81.42, 349, 80.2, 83.4, '35.11%'], [None, 'Illinois', 17, 'May 19, 
     1979', '1979/05/19', 74.94, 994, 66.9, 83.1, 82.87, 78, 81.6, 85.2, 
     '7.85%']]

The code above achieves the same result as the much simpler CSV-processing code. Spreadsheets are considerably more complex objects, so it's not surprising that the code to read them is more complex as well. You should also understand exactly how the data is stored in the spreadsheet: if it contains formatting that carries meaning, if labels need to be ignored or handled separately, or if formulas and references must be dealt with, you will have to dig into how those parts are handled and write more complex code.

Spreadsheets have other problems too. They are usually limited to about a million rows, and although that sounds like a lot, the need to process larger data sets comes up more and more often. Spreadsheets also sometimes apply annoying formatting automatically. At one company I worked for, part numbers consisted of a digit and at least one letter followed by some combination of digits and letters, so a part number like 1E20 was entirely possible.

Most spreadsheets automatically interpret 1E20 as scientific notation and store it as 1.00E+20 (1×10^20), while leaving 1F20 as a string. Preventing this is surprisingly difficult for various reasons, and with a large data set the problem may not be discovered until many processing steps, or even all of them, have been completed.

Therefore, it is recommended to use CSV or delimited files whenever possible. Often users can save spreadsheets in CSV format, so there is usually no need to tolerate the extra complexity and formatting headaches that spreadsheets bring.

4. Data cleaning

One problem that is often encountered when dealing with text-based data files is dirty data. "Dirty" means that the data contains all kinds of surprises, such as null values, values that are illegal in the current encoding, extra whitespace characters, and so on.

The data may also be out of order, or in an intractable order. The process of dealing with these situations is called "data cleaning".

1. Cleaning

As a very simple example of data cleaning, consider a file exported from a spreadsheet or other financial program: the columns holding monetary values may contain percent and currency symbols such as %, $, and £, and the digits may be grouped with periods or commas. Data from other sources may have its own quirks that make processing tricky if they aren't caught in advance. Let's look at the earlier temperature data again.

The first row of data looks like this:

[None, 'Illinois', 17, 'Jan 01, 1979', '1979/01/01', 17.48, 994, 6, 30.5, 
    2.89, 994, -13.6, 15.8, 'Missing', 0, 'Missing', 'Missing', '0.00%']

Some columns are clearly text, such as 'State' (field 2) and 'Notes' (field 1), and won't need much processing. There are also two date fields in different formats, and you may well want to use them in calculations, perhaps to sort or group the data by month or day, or to compute the time difference between two rows.

The remaining fields look like various kinds of numbers: the temperatures are decimals and the record counts are integers. Note, however, that the heat index fields are a bit different: when the 'Max Temp for Daily Max Air Temp (F)' value is below 80, the heat index is not reported and is instead marked as 'Missing', with a record count of 0. Also note that the 'Daily Max Heat Index (F) % Coverage' field gives the percentage of temperature records that include a heat index. Both of these will cause problems if you want to do math on these fields, because 'Missing' and numbers ending in % will both be parsed as strings, not numbers.

Cleaning like this can be done at different points in the processing. Typically, I tend to clean as the file is read, which most likely means replacing 'Missing' with None or an empty string as each line is processed. You could also leave the 'Missing' strings untouched and write code that avoids doing math on them.
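
A minimal sketch of cleaning while reading, based on the temp_data_01.csv file from earlier; replacing 'Missing' with None and stripping the trailing % are just one possible set of choices:

>>> import csv
>>> cleaned = []
>>> for row in csv.DictReader(open("temp_data_01.csv", newline='')):
...     for key, value in row.items():
...         if value == 'Missing':
...             row[key] = None                # treat 'Missing' as no value
...         elif value.endswith('%'):
...             row[key] = float(value[:-1])   # '0.00%' -> 0.0
...     cleaned.append(row)
...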

2. Sort

As mentioned earlier, it is often useful to sort the data in a text file before processing it. Sorting data makes it easier to spot and deal with duplicate values, and helps to group related rows together for faster and easier processing. I once received a file with 20 million lines, containing many attributes and values, and needed to match an unknown number of lines with data items in the main SKU list.

Simply sorting the rows by item ID can greatly speed up the collection of attributes for each item. The sorting scheme depends on the size of the data file relative to available memory, and the complexity of the sort. If all the lines of the file fit comfortably into available memory, then the simplest solution is to read all the lines into a list and use the list sorting method:

>>> lines = open("datafile").readlines()
>>> lines.sort()

The sorted() function can also be used, as in sorted_lines = sorted(lines). It leaves the original list in its existing order, which is usually not needed here, and its disadvantage is that it creates a new copy of the list. That slightly increases processing time and, more importantly, uses twice the memory, which may be the bigger concern.

If the data set is larger than memory and the sort criterion is very simple, say sorting on an easily grabbed field, it may be easier to preprocess the data with an external utility such as the UNIX sort command:

$ sort data > data.srt

Whichever approach you use, sorting can be done in reverse order, and keys can be built from values within the data rather than just the beginning of each line; this is where studying the documentation of your sorting tool pays off. As a simple example in Python, let's sort lines of text case-insensitively.

To do this, provide a key function to the sort method, which will convert the data items to lowercase before doing the data comparison:

>>> lines.sort(key=str.lower)

The following example uses a lambda function to ignore the first 5 characters of each string:

>>> lines.sort(key=lambda x: x[5:])

Key functions make it very convenient to control sorting behavior in Python. Be careful, though: the key function is called for every item being sorted, so an overly complex key function can mean a real performance hit, especially for large data sets.

3. Problems and pitfalls in data cleaning

There seem to be as many kinds of dirty data as there are data sources and use cases. Data will always have quirks, any of which can keep the processing from being done accurately or even keep the data from loading at all. So it's impossible to list every possible problem and how to handle it, but some general tips can be given.

  • Be careful with whitespace and null characters. The problem with whitespace is that it's invisible, but that doesn't mean it won't cause trouble. Extra whitespace at the beginning and end of data lines, extra whitespace before and after fields, and tabs instead of spaces (or vice versa) can all make loading and processing more troublesome, and the cause isn't always obvious. Similarly, a text file containing null characters (ASCII 0) may look fine on inspection but break when loaded and processed. (A small cleaning sketch follows this list.)
  • Be careful with punctuation. Stray commas or periods can break the formatting of a CSV file and mess up the processing of numeric fields, and unescaped or unmatched quotes can confuse things as well.
  • Break the process into steps and debug them one at a time. It's easier to debug when each step stands on its own line, even though that is more verbose and uses more variables. But it's worth it: it makes any exception that is raised easier to interpret, and it makes debugging easier whether you use print statements, logging, or the Python debugger. It may also help to save the data after each step, and to shrink the file down to just the line that caused the error.
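
The sketch below illustrates the whitespace and null-character issues from the first point above; the sample line is invented for the example:

>>> raw = " Illinois \t| 1979/01/01 | 17.48 |994\x00\n"
>>> fields = [field.replace("\x00", "").strip() for field in raw.split("|")]
>>> fields
['Illinois', '1979/01/01', '17.48', '994']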

5. Data file writing

The final part of the ETL process may involve saving the transformed data to a database, but it often means writing the data out to files, which may then be used as input by other applications or for analysis. Usually there should be a specification of the file to produce, listing the data fields to include along with their names, formats, and constraints.

1. CSV and other delimited files

Perhaps the easiest way to do this is to write the data to a CSV file. Because the data loading, parsing, cleaning, and transformation process has already been completed, it is very unlikely that the data itself will have any unresolved issues. Using the csv module in the Python standard library can also make life easier.

The process of writing a delimited file with the csv module is almost the inverse of the reading process. It is also necessary to specify the delimiter to be used, and the csv module will also handle all cases where the delimiter is inside the field:

>>> temperature_data = [['State', 'Month Day, Year Code', 'Avg Daily Max Air 
     Temperature (F)', 'Record Count for Daily Max Air Temp (F)'], 
     ['Illinois', '1979/01/01', '17.48', '994'], ['Illinois', '1979/01/02', 
     '4.64', '994'], ['Illinois', '1979/01/03', '11.05', '994'], ['Illinois', 
     '1979/01/04', '9.51', '994'], ['Illinois', '1979/05/15', '68.42', 
     '994'], ['Illinois', '1979/05/16', '70.29', '994'], ['Illinois', '1979/
     05/17', '75.34', '994'], ['Illinois', '1979/05/18', '79.13', '994'], 
     ['Illinois', '1979/05/19', '74.94', '994']]
>>> csv.writer(open("temp_data_03.csv", "w", 
     newline='')).writerows(temperature_data)

The above code will generate the following file:

State,"Month Day, Year Code",Avg Daily Max Air Temperature (F),Record Count 
    for Daily Max Air Temp (F)
Illinois,1979/01/01,17.48,994
Illinois,1979/01/02,4.64,994
Illinois,1979/01/03,11.05,994 
Illinois,1979/01/04,9.51,994
Illinois,1979/05/15,68.42,994
Illinois,1979/05/16,70.29,994
Illinois,1979/05/17,75.34,994
Illinois,1979/05/18,79.13,994
Illinois,1979/05/19,74.94,994

Just as when reading CSV files, you can write dictionaries instead of lists by using DictWriter. If you do, note a couple of points: you must supply the field names as a list when creating the DictWriter, and you can use its writeheader method to write the header line at the beginning of the file.

So assume the same data as above, but in dictionary form:

{'State': 'Illinois', 'Month Day, Year Code': '1979/01/01', 'Avg Daily Max
    Air Temperature (F)': '17.48', 'Record Count for Daily Max Air Temp
    (F)': '994'}

A DictWriter object from the csv module can then write each dictionary record, line by line, into the corresponding fields of the CSV file:

>>> fields = ['State', 'Month Day, Year Code', 'Avg Daily Max Air Temperature 
     (F)', 'Record Count for Daily Max Air Temp (F)']
>>> dict_writer = csv.DictWriter(open("temp_data_04.csv", "w"), 
     fieldnames=fields)
>>> dict_writer.writeheader()
>>> dict_writer.writerows(data)     ⇽---  data is a list of dicts like the one shown above
>>> del dict_writer

2. Writing Excel files

Writing to spreadsheet files is similar to reading, which is to be expected. First you need to create a workbook or spreadsheet file, then create one or more sheets, and finally write the data into the appropriate cells.

Of course, you can create a spreadsheet file from a CSV data file, as follows:

>>> from openpyxl import Workbook
>>> data_rows = [fields for fields in csv.reader(open("temp_data_01.csv"))]
>>> wb = Workbook()
>>> ws = wb.active
>>> ws.title = "temperature data"
>>> for row in data_rows:
...     ws.append(row)
...
>>> wb.save("temp_data_02.xlsx")

You can also add formatting to cells when writing them to a spreadsheet file.
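
For example, continuing the session above, a minimal sketch that bolds the header row might look like this (Font comes from openpyxl.styles; this is just one possible bit of formatting):

>>> from openpyxl.styles import Font
>>> for row in ws.iter_rows(min_row=1, max_row=1):   # just the header row
...     for cell in row:
...         cell.font = Font(bold=True)
...
>>> wb.save("temp_data_02.xlsx")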

3. Data file packaging

If there are multiple interrelated data files, or if the files are large in size, it may make sense to pack them into a compressed archive. Although there are many archive formats in use today, zip files are still popular and available to users of almost all platforms.
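
A minimal sketch with the standard library's zipfile module, using the file names from the earlier examples, might look like this:

>>> import zipfile
>>> with zipfile.ZipFile("temp_data.zip", "w", zipfile.ZIP_DEFLATED) as archive:
...     archive.write("temp_data_01.csv")
...     archive.write("temp_data_02.xlsx")
...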

6. Obtain network data

Python can also be used to fetch data files over the network. In some cases these are text or spreadsheet files, but in other cases the data comes in a more structured format served by a REST or SOAP application programming interface (API). Sometimes getting the data means scraping it from a website.

1. Get the file

Before you can do anything with a data file, you have to get it. Sometimes that's as simple as manually downloading a zip archive, or the file may have been pushed to your machine from somewhere else. But often more work is involved: you may need to retrieve a large number of files from a remote server, retrieve files on a regular schedule, or deal with a retrieval process complex enough that doing it by hand is painful. In those cases, it's desirable to have Python fetch the data files automatically.

The first thing to be clear about is that while retrieving files in Python works perfectly well, a Python script is neither the only way nor always the best way to do it. When deciding, there are two main things to consider.

  • Is there a simpler option? Depending on your operating system and experience, you may find that simple shell scripts and command-line tools are easier to configure and use. If those tools aren't available, or if you or the people who will maintain the process aren't comfortable with them, a Python script is worth considering.
  • Is the retrieval process complex or tightly coupled to the processing? While never desirable, this does happen. My rule of thumb is that if a shell script would need many lines, or if I have to puzzle over how to do something in a shell script, it's probably time to switch to Python.

2. Use Python to get files from FTP server

The File Transfer Protocol (FTP) has been around for a long time, and is still an easy way to share files if you don't have to worry too much about security. If you want to use Python to access the FTP server, you can use the ftplib module in the standard library.

The steps are simple: create an FTP object, connect to the server, and log in with a username and password (using the username "anonymous" and an empty password is fairly common).

To continue working with weather data, you can connect to the National Oceanic and Atmospheric Administration (NOAA) FTP server as follows:

>>> import ftplib
>>> ftp = ftplib.FTP('tgftp.nws.noaa.gov')
>>> ftp.login()
'230 Login successful.'

Once connected, you can use the ftp object to list and change directories:

>>> ftp.cwd('data')
'250 Directory successfully changed.'
>>> ftp.nlst()
['climate', 'fnmoc', 'forecasts', 'hurricane_products', 'ls_SS_services', 
     'marine', 'nsd_bbsss.txt', 'nsd_cccc.txt', 'observations', 'products', 
     'public_statement', 'raw', 'records', 'summaries', 'tampa', 
     'watches_warnings', 'zonecatalog.curr', 'zonecatalog.curr.tar']

Files can then be fetched, for example, weather report (METAR) files:

>>> x = ftp.retrbinary('RETR observations/metar/decoded/KORD.TXT',
     open('KORD.TXT', 'wb').write)
'226 Transfer complete.'

Here the ftp.retrbinary method is given the path of the file on the remote server (as part of a RETR command) and a callback for handling the data locally; in this case, the callback is the write method of a local file opened in binary write mode under the same name.

If you open KORD.TXT to view, you will see the download data in it:

CHICAGO O'HARE INTERNATIONAL, IL, United States (KORD) 41-59N 087-55W 200M
Jan 01, 2021 - 09:51 PM EST / 2021.01.02 0251 UTC
Wind: from the E (090 degrees) at 6 MPH (5 KT):0
Visibility: 10 mile(s):0
Sky conditions: mostly cloudy
Temperature: 33.1 F (0.6 C)
Windchill: 28 F (-2 C):1
Dew Point: 21.9 F (-5.6 C)
Relative Humidity: 63%
Pressure (altimeter): 30.14 in. Hg (1020 hPa)
Pressure tendency: 0.01 inches (0.2 hPa) lower than three hours ago 
ob: KORD 020251Z 09005KT 10SM SCT150 BKN250 01/M06 A3014 RMK AO2 SLP214 
     T00061056 58002
cycle: 3

ftplib also provides FTP_TLS in place of FTP, allowing you to connect to the server over FTP with TLS encryption:

ftp = ftplib.FTP_TLS('tgftp.nws.noaa.gov')

3. Obtain files through SFTP protocol

If the data requires more security, for example when business data is moved across the network in a corporate setting, the SFTP protocol is the more common choice. SFTP is a full-featured protocol that allows file access, transfer, and management over a Secure Shell (SSH) connection. Although SFTP stands for SSH File Transfer Protocol and FTP stands for File Transfer Protocol, the two are actually unrelated: SFTP is not a reimplementation of FTP over SSH but a new design specific to SSH.

Because SSH has become the de facto standard for accessing remote servers, and SFTP support can be easily enabled on the server side (often enabled by default), it is attractive to use SSH-based transmission.

Python's standard library does not include an SFTP/SCP client module, but the community-developed library paramiko handles SFTP operations as well as SSH connection management. The easiest way to get paramiko is to install it with pip. If the NOAA site used SFTP (it doesn't, so the code below won't work!), the SFTP equivalent of the code above would look like this:

>>> import paramiko
>>> t = paramiko.Transport((hostname, port))
>>> t.connect(username=username, password=password)
>>> sftp = paramiko.SFTPClient.from_transport(t)
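
From there, fetching a file is much like the FTP case. The following is only a sketch; the remote path simply mirrors the earlier FTP example and is an assumption, since this server doesn't actually offer SFTP:

>>> sftp.get('observations/metar/decoded/KORD.TXT', 'KORD.TXT')
>>> sftp.close()
>>> t.close()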

It's worth noting that while paramiko supports running commands on a remote server and receiving their output, just as in a direct ssh session, it does not include scp functionality. That's a feature you won't often miss: if you have only a file or two to move over an ssh connection, it's usually easier and simpler to use the command-line scp tool.

4. Obtain files via HTTP/HTTPS protocol

The last way of retrieving data files covered here is fetching them over an HTTP or HTTPS connection. This is probably the easiest option of all: it amounts to retrieving data from a web server, and support for accessing web servers is ubiquitous. Again, Python may not be necessary; plenty of command-line tools retrieve files over HTTP/HTTPS and offer most of the features you're likely to need, the two most common being wget and curl. Still, if there's a reason to do the retrieval in Python code, the process isn't difficult.

By far the easiest and most reliable way to access HTTP/HTTPS servers from Python code is the requests library, and pip install requests is the easiest way to install it.

Once requests is installed, fetching a file is straightforward: import requests, then use the appropriate HTTP verb (usually GET) to connect to the server and retrieve the data.

The following routine will retrieve monthly temperature data for Heathrow Airport since 1948, as a text file served from a web server. You can put the URL into the browser, load the page, and save the file.

If the page is large, or if there are many pages to fetch, it will be easier to use the following code:

>>> import requests 
>>> response = requests.get("http://www.epubit.com:8083/quickpythonbook?heathrowdata.txt")

The response object returned carries quite a bit of information, including the headers returned by the web server; if something goes wrong while fetching the file, this information can help with debugging. Usually, though, the most interesting part of the response is the data that came back.

To get that data, access the response object's text attribute, which holds the response body as a string, or its content attribute, which holds the body as bytes:

>>> print(response.text)
Heathrow (London Airport)
Location 507800E 176700N, Lat 51.479 Lon -0.449, 25m amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by  ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, 
     otherwise sunshine data taken from a Campbell Stokes recorder.
   yyyy  mm   tmax    tmin     af     rain     sun
              degC    degC   days       mm   hours
   1948   1    8.9     3.3    ---     85.0    ---
   1948   2    7.9     2.2    ---     26.0    ---
   1948   3   14.2     3.8    ---     14.0    ---
   1948   4   15.4     5.1    ---     35.0    ---
   1948   5   18.1     6.9    ---     57.0    ---

Usually you will want to write the response text to a file for later processing, but depending on your needs you might first do some cleaning, or even process the data directly.
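
A minimal sketch of saving the response body for later processing (the output file name is an arbitrary choice):

>>> with open("heathrowdata.txt", "w") as outfile:
...     outfile.write(response.text)
...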

5. Get data through API

Providing data through an API has become quite common, following the trend toward decoupling applications into services that communicate over web APIs. APIs can work in various ways, but they typically use standard HTTP operations such as GET, POST, PUT, and DELETE over the regular HTTP/HTTPS protocols.

Getting data in this way is very similar to the file retrieval process, but the data is not in a static file. Applications do not directly serve static files containing data, but dynamically query, assemble, and serve data from some other data source upon request.

Although APIs can be built in very different ways, the most common style is the REST (REpresentational State Transfer) interface, which runs over the same HTTP/HTTPS protocols as the web. The most common way to get data is through a GET request, which is what web browsers use to request web pages. When fetching data via GET, the parameters used to select the data are usually appended to the URL in a query string.

To get the current weather on Mars from the Curiosity rover, the URL is http://marsweather.ingenology.com/v1/latest/?format=json, where ?format=json is a query string parameter specifying that the data should be returned as JSON. To get the weather for a particular Martian day (sol), counted from the start of the mission, say sol 155, the URL is http://marsweather.ingenology.com/v1/archive/?sol=155&format=json.

To get the Mars weather for a range of Earth dates, such as the whole of October 2021, use http://marsweather.ingenology.com/v1/archive/?terrestrial_date_start=2021-10-01&terrestrial_date_end=2021-10-31.

Note that the items in the query string are separated by the "&" symbol. Once you know the URL to use, you can fetch data from the API with the requests library and either process it on the fly or save it to a file for later processing.

The easiest way to do this is to retrieve the file:

>>> import requests
>>> response = requests.get("http://marsweather.ingenology.com/v1/latest/?format=json")
>>> response.text
'{"report": {"terrestrial_date": "2021-01-08", "sol": 1573, "ls": 295.0, 
     "min_temp": -74.0, "min_temp_fahrenheit": -101.2, "max_temp": -2.0, 
     "max_temp_fahrenheit": 28.4, "pressure": 872.0, "pressure_string": 
     "Higher", "abs_humidity": null, "wind_speed": null, "wind_direction": "-
     -", "atmo_opacity": "Sunny", "season": "Month 10", "sunrise": "2021-01-
     08T12:29:00Z", "sunset": "2021-01-09T00:45:00Z"}}'
>>> response = requests.get("http://marsweather.ingenology.com/v1/archive/?sol=155&format=json")
>>> response.text
'{"count": 1, "next": null, "previous": null, "results": 
     [{"terrestrial_date": "2021-01-18", "sol": 155, "ls": 243.7, "min_temp": 
     -64.45, "min_temp_fahrenheit": -84.01, "max_temp": 2.15, 
     "max_temp_fahrenheit": 35.87, "pressure": 9.175, "pressure_string": 
     "Higher", "abs_humidity": null, "wind_speed": 2.0, "wind_direction": 
     null, "atmo_opacity": null, "season": "Month 9", "sunrise": null, 
     "sunset": null}]}'

Remember that spaces and most punctuation should be escaped in query parameters, as these are not allowed in URLs. Many browsers automatically escape URLs.

Finally, as another example, suppose you need to grab data for the hour between 12:00 and 1:00 pm on January 10, 2021. With this API (the City of Chicago data portal), the date range is specified with a query string parameter of the form $where=date between "<start datetime>" and "<end datetime>", where the start and end times are quoted ISO-format datetimes.

Therefore, the URL to get that hour of data would be https://data.cityofchicago.org/resource/6zsd-86xi.json?$where=date between "2021-01-10T12:00:00" and "2021-01-10T13:00:00".

This URL contains several characters that are not acceptable in URLs, such as quotes and spaces. The requests library quotes (escapes) URLs properly before sending them, which is a good example of how it simplifies things for you. The URL actually sent will be https://data.cityofchicago.org/resource/6zsd-86xi.json?$where=date%20between%20%222021-01-10T12:00:00%22%20and%20%222021-01-10T13:00:00%22.

Note that, without any effort on your part, all the quotes have been escaped as %22 and all the spaces as %20.
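
If you'd rather not assemble query strings by hand at all, requests can build and escape them for you through its params argument. A sketch using the Mars weather example from above:

>>> response = requests.get("http://marsweather.ingenology.com/v1/archive/",
...                         params={"sol": 155, "format": "json"})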

7. Network data serialization processing

Data provided by an API is sometimes available as plain text, but more often it comes in a structured format.

The two most common file formats are JSON and XML. Both formats are based on plain text, but the content is structured so that it is more flexible and capable of storing more complex information. 

1. JSON data

JSON stands for JavaScript Object Notation, and it dates back to 1999. JSON consists of only two structures: a collection of key/value pairs, called an object, which is very similar to a Python dictionary; and an ordered list of values, called an array, which is very similar to a Python list.

Keys can only be strings in double quotes, while values can be double-quoted strings, numbers, true, false, null, arrays, or objects. These elements make JSON a lightweight way to represent most data so that it's easy to transmit over the network and still reasonably easy for humans to read.

JSON has become so ubiquitous that most programming languages can convert it to and from their native data types. In Python that facility is the json module, which has been part of the standard library since version 2.6. The originally externally maintained version of the module is still available as simplejson, but in Python 3 the standard library version is much more commonly used.

The data retrieved by the rover API is in JSON format. If you want to send JSON over the network, you need to serialize the JSON object, that is, convert it to a sequence of bytes.

So while the chunk of data retrieved from the rover API looks like a JSON object, it is really just a string representation of one. To turn that string into a Python dictionary, use the json module's loads() function.

For example, to get Mars weather reports, do as before, but this time convert the data into a Python dictionary:

>>> import json
>>> import requests
>>> response = requests.get("http://marsweather.ingenology.com/v1/latest/?format=json")
>>> weather = json.loads(response.text)
>>> weather
{'report': {'terrestrial_date': '2021-01-10', 'sol': 1575, 'ls': 296.0,
     'min_temp': -58.0, 'min_temp_fahrenheit': -72.4, 'max_temp': 0.0,
     'max_temp_fahrenheit': None, 'pressure': 860.0, 'pressure_string':
     'Higher', 'abs_humidity': None, 'wind_speed': None,
     'wind_direction': '--', 'atmo_opacity': 'Sunny', 'season': 'Month 10',
     'sunrise': '2021-01-10T12:30:00Z', 'sunset': '2021-01-11T00:46:00Z'}}
>>> weather['report']['sol']
1575

Note that json.loads() takes the string representation of a JSON object and converts ("loads") it into a Python dictionary. Similarly, the json.load() function can read data from any file-like object that supports a read method.
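For instance, json.load() will read just as happily from an in-memory file-like object such as io.StringIO, which is a convenient way to see the behavior without creating a file:

>>> import io
>>> import json
>>> json.load(io.StringIO('{"sol": 1575}'))
{'sol': 1575}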

Looking back at the way the weather dictionary was displayed above, the content is hard to take in. Improved formatting, also known as pretty printing, makes data structures much easier to read. Let's use Python's pprint module to view the contents of the example dictionary:

>>> from pprint import pprint as pp
>>> pp(weather)
{'report': {'abs_humidity': None,
            'atmo_opacity': 'Sunny',
            'ls': 296.0,
            'max_temp': 0.0,
            'max_temp_fahrenheit': None,
            'min_temp': -58.0,
            'min_temp_fahrenheit': -72.4,
            'pressure': 860.0,
            'pressure_string': 'Higher',
            'season': 'Month 10',
            'sol': 1575,
            'sunrise': '2021-01-10T12:30:00Z',
            'sunset': '2021-01-11T00:46:00Z',
            'terrestrial_date': '2021-01-10',
            'wind_direction': '--',
            'wind_speed': None}}

Both loading functions accept additional arguments that control how raw JSON is parsed and decoded into Python objects. The following table lists the default conversions.

The default decoding relationship for converting JSON to Python objects:

JSON            Python
object          dict
array           list
string          str
number (int)    int
number (real)   float
true            True
false           False
null            None
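A quick check in the interpreter shows these default conversions in action:

>>> import json
>>> json.loads('{"values": [1, 2.5, "text", true, false, null]}')
{'values': [1, 2.5, 'text', True, False, None]}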

Fetching JSON data with the requests library and then parsing it with json.loads() works perfectly well, but because this is such a common use of requests, the library provides a shortcut: the response object has a json() method that does the conversion directly.

So in this example, instead of:

>>> weather = json.loads(response.text)

Instead use:

>>> weather = response.json()

The result is the same, but the code is simpler, readable, and more Pythonic.

If you want to write JSON to a file, or serialize it to a string, the inverses of load() and loads() are dump() and dumps(). json.dump() takes two arguments, the object and a file object with a write() method, and writes to the file; json.dumps() takes just the object and returns a string. In both cases, the encoding of the JSON-formatted output is highly customizable.

So, if you want to write a Mars weather report to a JSON file, you can do it like this:

>>> outfile = open("mars_data_01.json", "w")
>>> json.dump(weather, outfile)
>>> outfile.close()
>>> json.dumps(weather)
'{"report": {"terrestrial_date": "2021-01-11", "sol": 1576, "ls": 296.0, 
    "min_temp": -72.0, "min_temp_fahrenheit": -97.6, "max_temp": -1.0,
    "max_temp_fahrenheit": 30.2, "pressure": 869.0, "pressure_string": 
    "Higher", "abs_humidity": null, "wind_speed": null, "wind_direction": "-
    -", "atmo_opacity": "Sunny", "season": "Month 10", "sunrise": "2021-01-
    11T12:31:00Z", "sunset": "2021-01-12T00:46:00Z"}}'

As you can see, the entire object has been encoded into a string. Again, like with the pprint module, it might be handy at this point to format the string in a more readable way.

This is easy to do by passing the indent parameter to dump() or dumps():

>>> print(json.dumps(weather, indent=2))
{
  "report": {
    "terrestrial_date": "2021-01-10",
    "sol": 1575,
    "ls": 296.0,
    "min_temp": -58.0,
    "min_temp_fahrenheit": -72.4,
    "max_temp": 0.0,
    "max_temp_fahrenheit": null,
    "pressure": 860.0,
    "pressure_string": "Higher",
    "abs_humidity": null,
    "wind_speed": null,
    "wind_direction": "--",
    "atmo_opacity": "Sunny",
    "season": "Month 10",
    "sunrise": "2021-01-10T12:30:00Z",
    "sunset": "2021-01-11T00:46:00Z"
  }
}

However, be aware that if you repeatedly call json.dump() to write a series of objects to the same file, the result is a series of valid JSON objects, but the file as a whole is not a single valid JSON object, and an attempt to read and parse the entire file with one call to json.load() will fail. To encode multiple objects as one JSON object, you need to put them all into a list, or better still into a single enclosing object, and encode that object to the file.

If you need to save two or more days of Martian weather data in JSON format, you have to choose an approach. One option is to call json.dump() once for each object, writing each one on its own line, which produces a file containing multiple JSON-formatted objects.

Assuming weather_list is a list of weather report objects, the code might look like this:

>>> outfile = open("mars_data.json", "w")
>>> for report in weather_list:
...     json.dump(weather, outfile) 
>>> outfile.close()

Then, when reading the data back, each line needs to be loaded as a separate JSON object:

>>> weather_list = []
>>> for line in open("mars_data.json"):
...     weather_list.append(json.loads(line))

Alternatively, you can put the list into a single JSON object. Because a bare top-level array can be a security vulnerability in JSON, the recommended solution is to wrap the array in a dictionary:

>>> outfile = open("mars_data.json", "w")
>>> weather_obj = {"reports": weather_list, "count": 2} 
>>> json.dump(weather_obj, outfile)
>>> outfile.close()

With this approach, a JSON-formatted object can be loaded from a file in one step:

>>> with open("mars_data.json") as infile:
...     weather_obj = json.load(infile)

If the size of the JSON file is manageable, the second method is preferable. For very large files, though, it may be less ideal, because error handling becomes harder and you may run out of memory.

2. XML data

Extensible Markup Language (XML) dates from the end of the 20th century. XML uses angle-bracket tags similar to HTML, and elements are nested to form a tree structure. XML was intended to be readable by both machines and humans, but it is often so long-winded and complex that humans find it hard to follow.

Nonetheless, because XML is an established standard, needing to work with data in XML format is fairly common. And although XML is machine-readable, it's likely that you'll want to convert it into a format that is easier to work with.

Let's look at an example of XML data; here is an XML version of some weather forecast data:

<dwml xmlns:xsd="http://www.w3.org/2001/XMLSchema"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      version="1.0"
      xsi:noNamespaceSchemaLocation="http://www.nws.noaa.gov/forecasts/xml/DWMLgen/schema/DWML.xsd">
  <head>
    <product srsName="WGS 1984" concise-name="glance" operational-mode="official">
      <title>
        NOAA's National Weather Service Forecast at a Glance
      </title>
      <field>meteorological</field>
      <category>forecast</category>
      <creation-date refresh-frequency="PT1H">2021-01-08T02:52:41Z</creation-date>
    </product>
    <source>
      <more-information>http://www.nws.noaa.gov/forecasts/xml/</more-information>
      <production-center>
        Meteorological Development Laboratory
        <sub-center>Product Generation Branch</sub-center>
      </production-center>
      <disclaimer>http://www.nws.noaa.gov/disclaimer.html</disclaimer>
      <credit>http://www.weather.gov/</credit>
      <credit-logo>http://www.weather.gov/images/xml_logo.gif</credit-logo>
      <feedback>http://www.weather.gov/feedback.php</feedback>
    </source>
  </head>
  <data>
    <location>
      <location-key>point1</location-key>
      <point latitude="41.78" longitude="-88.65"/>
    </location>
    ...
  </data>
</dwml>

The above example is just the first part of the XML document; most of the data has been omitted. Even so, it shows some of the issues commonly found in XML data. In particular, the format is verbose: in some places the tags take up more space than the data they contain. The example also shows the nested, tree-like structure typical of XML, as well as the sizable metadata header that often precedes the actual data. If data file formats were ranked from simple to complex, CSV or delimited files would sit at the simple end and XML at the complex end.

The file also demonstrates another feature of XML that makes data extraction slightly harder: XML can store data either as attributes on a tag or as text values inside the tag. If you look at the point element at the end of the example, you'll see that it contains no text value at all.

The element is a single <point> tag whose only data is its latitude and longitude attributes:

<point latitude="41.78" longitude="-88.65"/>

The above code is of course valid XML, suitable for storing data, but it is also possible to store the same data in the following format:

<point>
   <latitude>41.78</latitude>
   <longitude>-88.65</longitude>
</point>

Without inspecting the data closely, and often without studying the document's XML schema, it's hard to know in advance which approach a given dataset will use.

Because of this complexity, extracting even simple data from XML can be awkward, and there are several options for processing it. The Python standard library includes modules for parsing and handling XML, but none of them is particularly convenient for simple data extraction.
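For comparison, here is a minimal sketch using the standard library's xml.etree.ElementTree, assuming the sample document above has been saved as observations_01.xml. It works, but you need to know the exact element path and handle attributes and text yourself:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse("observations_01.xml")
>>> point = tree.getroot().find("data/location/point")   # path from the <dwml> root
>>> point.get("latitude"), point.get("longitude")
('41.78', '-88.65')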

For simple data extraction, the most user-friendly utility I could find is a library called xmltodict, which parses XML data and returns a dictionary corresponding to the tree structure.

Behind the scenes, xmltodict uses the standard library's expat XML parser to parse the document into an object tree and uses that tree to build its dictionary, so xmltodict can handle anything the parser can handle. It can also take a dictionary and "unparse" it back into XML if necessary, which makes it a very pleasant tool to use. After a few years of use, it has met all of my XML-processing needs. xmltodict can be installed with pip install xmltodict.

To convert XML to a dictionary, you can import xmltodict and use the parse method on XML-formatted strings: 

>>> import xmltodict
>>> data = xmltodict.parse(open("observations_01.xml").read())

For compactness, the contents of the file are read and passed directly to the parse method here. The parsed data object is an ordered dictionary containing the same data you would get if it had been loaded from JSON:

{
    "dwml": {
        "@xmlns:xsd": "http://www.w3.org/2001/XMLSchema",
        "@xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
        "@version": "1.0",
        "@xsi:noNamespaceSchemaLocation": "http://www.nws.noaa.gov/forecasts/ xml/DWMLgen/schema/DWML.xsd",
        "head": {
            "product": {
                "@srsName": "WGS 1984",
                "@concise-name": "glance",
                "@operational-mode": "official",
                "title": "NOAA's National Weather Service Forecast at a Glance",
                "field": "meteorological",
                "category": "forecast",
                "creation-date": {
                    "@refresh-frequency": "PT1H",
                    "#text": "2021-01-08T02:52:41Z"
                }
            },
            "source": {
                "more-information": "http://www.nws.noaa.gov/forecasts/xml/",
                "production-center": {
                    "sub-center": "Product Generation Branch",
                    "#text": "Meteorological Development Laboratory"
                },
                "disclaimer": "http://www.nws.noaa.gov/disclaimer.html",
                "credit": "http://www.weather.gov/",
                "credit-logo": "http://www.weather.gov/images/xml_logo.gif",
                "feedback": "http://www.weather.gov/feedback.php"
            }
        },
        "data": {
            "location": {
                "location-key": "point1",
                "point": {
                    "@latitude": "41.78",
                    "@longitude": "-88.65"
                }
            }
        }
    }
}

Note that all attributes have been extracted from the tag, prefixed with @ to indicate that they were originally attributes of the parent tag.

Also note that when an XML node contains both a text value and nested elements, as "production-center" does with its nested "sub-center", the text value is stored under the key "#text".
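With those conventions in mind, values can be pulled out of the parsed data dictionary with ordinary indexing, for example:

>>> data["dwml"]["data"]["location"]["point"]["@latitude"]
'41.78'
>>> data["dwml"]["head"]["product"]["creation-date"]["#text"]
'2021-01-08T02:52:41Z'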

As mentioned above, the result of parsing is actually an ordered dictionary, officially an OrderedDict, so if you print it out directly it looks like this:

OrderedDict([('dwml', OrderedDict([('@xmlns:xsd', 'http://www.w3.org/2001/XMLSchema'),
    ('@xmlns:xsi', 'http://www.w3.org/2001/XMLSchema-instance'), ('@version', '1.0'),
    ('@xsi:noNamespaceSchemaLocation', 'http://www.nws.noaa.gov/forecasts/xml/DWMLgen/schema/DWML.xsd'),
    ('head', OrderedDict([('product', OrderedDict([('@srsName', 'WGS 1984'),
    ('@concise-name', 'glance'), ('@operational-mode', 'official'),
    ('title', "NOAA's National Weather Service Forecast at a Glance"),
    ('field', 'meteorological'), ('category', 'forecast'),
    ('creation-date', OrderedDict([('@refresh-frequency', 'PT1H'),
    ('#text', '2021-01-08T02:52:41Z')]))])),
    ('source', OrderedDict([('more-information', 'http://www.nws.noaa.gov/forecasts/xml/'),
    ('production-center', OrderedDict([('sub-center', 'Product Generation Branch'),
    ('#text', 'Meteorological Development Laboratory')])),
    ('disclaimer', 'http://www.nws.noaa.gov/disclaimer.html'),
    ('credit', 'http://www.weather.gov/'),
    ('credit-logo', 'http://www.weather.gov/images/xml_logo.gif'),
    ('feedback', 'http://www.weather.gov/feedback.php')]))])),
    ('data', OrderedDict([('location', OrderedDict([('location-key', 'point1'),
    ('point', OrderedDict([('@latitude', '41.78'), ('@longitude', '-88.65')]))])),
    ('#text', '…')]))]))])

Although the display form of an OrderedDict looks a bit odd, being built from lists of tuples, it behaves exactly like a normal dictionary except that the order of its elements is guaranteed to be preserved, which is useful here.

If an element is repeated, it becomes a list. Elsewhere in the complete version of the file shown earlier, the following section appears (some elements have been omitted here):

<time-layout>
    <start-valid-time period-name="Monday">2021-01-09T07:00:00-06:00</start-valid-time>
    <end-valid-time>2021-01-09T19:00:00-06:00</end-valid-time>
    <start-valid-time period-name="Tuesday">2021-01-10T07:00:00-06:00</start-valid-time>
    <end-valid-time>2021-01-10T19:00:00-06:00</end-valid-time>
    <start-valid-time period-name="Wednesday">2021-01-11T07:00:00-06:00</start-valid-time>
    <end-valid-time>2021-01-11T19:00:00-06:00</end-valid-time>
</time-layout>

Note that the two elements "start-valid-time" and "end-valid-time" alternate and repeat. Each repeated element becomes a list in the dictionary, and the order of the items within each list is preserved:

            "time-layout": 
                {
                    "start-valid-time": [
                        {
                            "@period-name": "Monday",
                            "#text": "2021-01-09T07:00:00-06:00"
                        },
                        {
                            "@period-name": "Tuesday",
                            "#text": "2021-01-10T07:00:00-06:00"
                        },
                        {
                            "@period-name": "Wednesday",
                            "#text": "2021-01-11T07:00:00-06:00"
                        }
                    ],
                    "end-valid-time": [
                        "2021-01-09T19:00:00-06:00",
                        "2021-01-10T19:00:00-06:00",
                        "2021-01-11T19:00:00-06:00"
                    ]
                },
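Once they are lists, the repeated elements can be processed with an ordinary loop. A small sketch, assuming the "time-layout" fragment above has been assigned to a variable named time_layout (the name is used here only for illustration):

>>> for start in time_layout["start-valid-time"]:
...     print(start["@period-name"], start["#text"])
...
Monday 2021-01-09T07:00:00-06:00
Tuesday 2021-01-10T07:00:00-06:00
Wednesday 2021-01-11T07:00:00-06:00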

Dictionaries and lists, even nested ones, are easy to handle in Python, so using xmltodict is an effective way to deal with most XML. In fact, I have used it in production against a wide variety of XML documents over the past few years and never had a problem.
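As a closing note, the unparse() function mentioned earlier goes in the other direction, turning a dictionary back into XML. A minimal sketch:

>>> import xmltodict
>>> xml_text = xmltodict.unparse({"point": {"@latitude": "41.78", "@longitude": "-88.65"}})
>>> # xml_text now holds an XML document string with a <point> element carrying the two attributes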

8. Network data crawling

In some cases, the data resides on one website and is not available anywhere else for some reason. At this point, it may make sense to collect data from the web page itself, through the process of crawling or scraping.

Before going into the details of scraping, a disclaimer: scraping or crawling websites that you don't own or control is, at best, a legal gray area, with many inconclusive and conflicting considerations involving the site's terms of use, the way the site is accessed, and the use made of the scraped data. Unless you control the site you want to scrape, the answer to the question "Is it legal to scrape this site?" tends to be "It depends."

If you decide to scrape a production website, you also need to keep an eye on the load on the website. While established high-traffic sites may be able to handle whatever is thrown at them, smaller, less active sites may be brought to a standstill by a series of consecutive requests. At the very least, be careful not to let scraping become an inadvertent denial-of-service attack.
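One simple precaution is to pause between requests. The sketch below assumes url_list is a list of page URLs you intend to fetch (a hypothetical name used only for illustration):

>>> import time
>>> import requests
>>> pages = []
>>> for url in url_list:
...     pages.append(requests.get(url).text)   # fetch the page body
...     time.sleep(1)                          # wait a second between requests to limit the load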

That said, I have had situations where it was actually easier to scrape some data from our own website than to obtain it through official corporate channels.

Scraping a website involves two operations: fetching the web page and extracting the data from it. Fetching the page can be handled with the requests module and is fairly straightforward.
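In the example that follows, the page is read from a local file for simplicity, but fetching it over HTTP would look roughly like this (the URL is a placeholder):

>>> import requests
>>> response = requests.get("http://www.example.com/test.html")   # placeholder URL
>>> html = response.text   # the page source as a string, ready to be parsed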

Let's consider a very simple web page, just a few lines of text with no CSS or JavaScript, shown below.

test.html file:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html> <head>
<title>Title</title>
</head>

<body>
<h1>Heading 1</h1>

This is plain text, and is boring
<span class="special">this is special</span>

Here is a <a href="http://bitbucket.dev.null">link</a>
<hr>
<address>Ann Address, Somewhere, AState 00000
</address>
</body> </html>

Assume that you are only interested in a couple of things on the page above: element content with the "special" class, and web links. You could process the file by searching for the strings class="special" and "<a href" and then writing code to pick the data out of them, but even with regular expressions that approach is tedious, error-prone, and hard to maintain. It's much easier to use a library that knows how to parse HTML, such as Beautiful Soup. If you want to try the following code and experiment with parsing HTML pages, you can install the library with pip install bs4.

Once Beautiful Soup is installed, parsing an HTML page is simple. For the sample page above, assume the page has already been retrieved (probably with the requests library); all that remains is to parse the HTML.

The first step is to load the text and create the Beautiful Soup parser:

>>> import bs4
>>> html = open("test.html").read()
>>> bs = bs4.BeautifulSoup(html, "html.parser")

That's all it takes to parse the HTML into the parser object bs. Beautiful Soup parser objects have a lot of clever tricks, and if you work with HTML at all, it's worth spending some time experimenting to get a feel for what they can do. This example only needs two of them: extracting content by HTML tag and getting data by CSS class.

The first thing to do is to find the web link. The HTML tag of a web page link is <a>, and Beautiful Soup converts all tags to lowercase by default. So to find all link tags, you can call the bs object itself with "a" as an argument:

>>> a_list = bs("a")
>>> print(a_list)
[<a href="http://bitbucket.dev.null">link</a>]

This returns a list of all the HTML link tags; in this case there is only one. Even if all you got were this list, it wouldn't be too bad.

But the elements returned in the list are themselves parser objects, so getting the rest of the link and its text is easy:

>>> a_item = a_list[0]
>>> a_item.text
'link'
>>> a_item["href"]
'http://bitbucket.dev.null'

The other thing to look for is content with the CSS class "special", which can be extracted with the parser's select method, as follows:

>>> special_list = bs.select(".special")
>>> print(special_list)
[<span class="special">this is special</span>]
>>> special_item = special_list[0]
>>> special_item.text
'this is special'
>>> special_item["class"]
['special']

Because the items returned by calling a tag or using the select method are themselves parser objects, they can be nested, which makes it possible to pull just about anything out of HTML or even XML.
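As a small sketch of that nesting with the sample page above, a tag returned by one search can itself be searched:

>>> body = bs("body")[0]          # the <body> tag is itself a searchable parser object
>>> [a["href"] for a in body("a")]
['http://bitbucket.dev.null']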

Summary:

  • ETL is the process of obtaining data, reading it in one format, and converting it into another format you can work with, while ensuring data consistency along the way. ETL is a fundamental step in most data-processing workflows.
  • Character encoding can be an issue with text files, but Python can handle some encoding issues when the file is loaded.
  • Delimited or CSV files are very common and are best handled with the csv module.
  • Spreadsheet files can be significantly more complex than CSV, but they can be handled in largely the same way.
  • Currency symbols, punctuation marks, and null characters are among the most common data cleaning issues, so watch out for them.
  • Presorting data files can speed up other processing steps.
  • A Python script might not be the best option for fetching files, so be sure to consider multiple options.
  • The requests module is the best choice for fetching files over HTTP/HTTPS from Python.
  • Fetching files from an API is very similar to fetching static files.
  • The parameters of an API request usually need to be URL-encoded and appended to the request URL as a query string.
  • The data provided by an API is quite commonly a string in JSON format, and XML is also used.
  • It may be illegal or unethical to scrape websites that you do not control, and you should also take care not to overload the site's server.
