[100 days proficient in python] Day27: file and IO operation_CSV file processing

CSV (Comma Separated Values) is a common text file format used to store tabular data. Each row represents a record, and the fields within a row are separated by a comma or another delimiter. CSV files can be opened with a plain text editor or edited with spreadsheet software (e.g., Microsoft Excel, Google Sheets).

2 How to use the csv module

Python's csv module provides functions for working with CSV files. It contains objects for reading and writing CSV files, such as csv.reader, csv.writer, csv.DictReader, and csv.DictWriter.

3 Example of reading and writing CSV files

3.1 Example of reading CSV files

Suppose we have a CSV file named data.csv with the following content:

Name,Age,City
John,30,New York
Jane,25,San Francisco
Mike,35,Chicago

We can read and process this CSV file using csv.reader:

import csv

# Read the CSV file and process its data
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)

    # Iterate over every row
    for row in csv_reader:
        print(row)

output:

['Name', 'Age', 'City']
['John', '30', 'New York']
['Jane', '25', 'San Francisco']
['Mike', '35', 'Chicago']

3.2 Example of writing CSV file

Now, suppose we have a set of dictionary data that we want to write to a new CSV file output.csv:

import csv

# Data to write
data = [
    {"Name": "Alice", "Age": 28, "City": "London"},
    {"Name": "Bob", "Age": 32, "City": "Paris"},
    {"Name": "Eve", "Age": 24, "City": "Berlin"}
]

# Write the CSV file
with open('output.csv', 'w', newline='') as file:
    fieldnames = ['Name', 'Age', 'City']
    csv_writer = csv.DictWriter(file, fieldnames=fieldnames)

    # Write the header row
    csv_writer.writeheader()

    # Write the data rows
    csv_writer.writerows(data)

print("Data has been written to output.csv.")

Contents of output.csv:

Name,Age,City
Alice,28,London
Bob,32,Paris
Eve,24,Berlin

4 Common data processing of CSV files

4.1 Read a specific column of a CSV file

After reading a CSV file with csv.reader or csv.DictReader, we can keep only the columns we need for further processing. A column can be selected by its index or by its name.

Example: Suppose we have a CSV file named data.csv with the following content:

Name,Age,City
John,30,New York
Jane,25,San Francisco
Mike,35,Chicago

We will show two methods to read specific columns of CSV files:

Method 1: Use column indexes

import csv

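# Read the CSV file and collect one column
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)

    # Column index 1 is the second column (Age)
    column_index = 1

    # List that will hold the values of that column
    specific_column_data = []

    # Iterate over every row (the header row is included)
    for row in csv_reader:
        # Take the value at the chosen index and append it to the list
        specific_column_data.append(row[column_index])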

print("Specific column data:", specific_column_data)

output:

Specific column data: ['Age', '30', '25', '35']

Method 2: Use column names

import csv

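# Read the CSV file and collect one column
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.DictReader(file)

    # Select the column named 'Age'
    column_name = 'Age'

    # List that will hold the values of that column
    specific_column_data = []

    # Iterate over every data row (the header is consumed by DictReader)
    for row in csv_reader:
        # Take the value under the chosen column name and append it to the list
        specific_column_data.append(row[column_name])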

print("Specific column data:", specific_column_data)

output:

Specific column data: ['30', '25', '35']

In the above examples, we read the CSV file with csv.reader and csv.DictReader respectively, and extract the required column by index or by name. The values are stored in a list for later processing. Note that Method 1 also returns the header value 'Age', because csv.reader treats the header as an ordinary row.

Note: With csv.DictReader, each row is parsed as a dictionary whose keys are the column names taken from the first row (header) of the CSV file, so a value can be accessed by column name. With csv.reader, each row is parsed as a list, so a value is accessed by column index.
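
If the header value is not wanted when using csv.reader, one possible approach is to consume the header with next() before looping. The following is only a minimal sketch: it uses an in-memory io.StringIO object as a stand-in for data.csv (an assumption made so the example runs without a file on disk) and also converts the ages to integers, since the csv module always returns strings.

```python
import csv
import io

# Hypothetical in-memory stand-in for data.csv
sample = io.StringIO(
    "Name,Age,City\nJohn,30,New York\nJane,25,San Francisco\nMike,35,Chicago\n",
    newline="",
)

reader = csv.reader(sample)
header = next(reader)            # consume the header row so it is not mixed into the data
age_index = header.index("Age")  # look the position up by name instead of hard-coding 1

# csv yields strings, so convert explicitly when numbers are needed
ages = [int(row[age_index]) for row in reader]
print(ages)  # [30, 25, 35]
```

Looking the index up from the header keeps the code working if the column order in the file changes.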

4.2 Read a specific line of a CSV file

To read a specific line of a CSV file, we can read the file line by line with csv.reader or csv.DictReader and check the line number as we go. Here is an example for each reader:

Example 1: Read a specific row using csv.reader

Suppose we have a CSV file named data.csv with the following content:

Name,Age,City
John,30,New York
Jane,25,San Francisco
Mike,35,Chicago

We can use csv.reader to read the CSV file and return the row at a specific row number:

import csv

# Read a specific row of the CSV file
def read_specific_row(csv_file, row_number):
    with open(csv_file, 'r', newline='') as file:
        csv_reader = csv.reader(file)
        for i, row in enumerate(csv_reader):
            if i == row_number:
                return row

# Read the second line of the file (index 1; index 0 is the header row)
specific_row = read_specific_row('data.csv', 1)
print("Specific row data:", specific_row)

output:

Specific row data: ['John', '30', 'New York']

Example 2: Using csv.DictReader to read a specific row

If the first line of the CSV file contains the column names, we can use csv.DictReader to read the file and return a specific data row as a dictionary:

import csv

# Read a specific data row of the CSV file
def read_specific_row(csv_file, row_number):
    with open(csv_file, 'r', newline='') as file:
        csv_reader = csv.DictReader(file)
        for i, row in enumerate(csv_reader):
            if i == row_number:
                return row

# Read the second data row (index 1; DictReader has already consumed the header)
specific_row = read_specific_row('data.csv', 1)
print("Specific row data:", specific_row)

output:

Specific row data: {'Name': 'Jane', 'Age': '25', 'City': 'San Francisco'}

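In the above examples, we used csv.reader and csv.DictReader to read the CSV file and return the row at a given index. Row numbers are 0-based, because indexes in Python start at 0, but the two readers count differently: csv.reader treats the header as row 0, so index 1 is John's row, whereas csv.DictReader consumes the header to build the dictionary keys, so index 1 is the second data row (Jane). The row_number parameter can be adjusted to read other rows; if it is past the end of the file, the function returns None.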
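
As a possible alternative to the enumerate loop, itertools.islice can skip directly to the wanted row. This is only a sketch under a few assumptions: the helper name read_data_row is made up for illustration, it takes any iterable of lines instead of a file name, and the sample text stands in for data.csv.

```python
import csv
import io
from itertools import islice

def read_data_row(lines, row_number):
    """Return the row_number-th data row (0-based, header excluded), or None if out of range."""
    reader = csv.DictReader(lines)
    # islice skips row_number rows lazily; next() falls back to None if the input is too short
    return next(islice(reader, row_number, None), None)

# Hypothetical in-memory stand-in for data.csv
sample_text = "Name,Age,City\nJohn,30,New York\nJane,25,San Francisco\nMike,35,Chicago\n"

second = read_data_row(io.StringIO(sample_text, newline=""), 1)
missing = read_data_row(io.StringIO(sample_text, newline=""), 10)
print(second["Name"])  # Jane
print(missing)         # None
```

Returning None explicitly for an out-of-range index makes the "row not found" case visible to the caller.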

5 Special handling of CSV files

When dealing with CSV files, some situations come up often enough to deserve special attention. The most common ones are described below.

5.1 Handling fields containing commas, newlines, and quotes

To process CSV files containing commas, quotes, and newlines, you can use Python's csv module to read and write the data. The csv module handles these special characters automatically: it wraps fields that contain commas, quotes, or newlines in quotes, and it escapes quotes inside a quoted field by doubling them.

Example:

Suppose we want to process the following CSV file named data.csv, which contains special characters:

Name,Age,Description
John,30,"A software, ""guru"" with 5 years of experience. Fluent in English and Español."
Jane,25,"A data analyst with ""extensive"" skills.
Passionate about data visualization."
Mike,35,"Project manager with experience leading international teams.
Deutsch sprechen."

We can use the following code to read and process this CSV file containing special characters:

import csv

# Read a CSV file that contains special characters and print its content
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)

output:

['Name', 'Age', 'Description']
['John', '30', 'A software, "guru" with 5 years of experience. Fluent in English and Español.']
['Jane', '25', 'A data analyst with "extensive" skills.\nPassionate about data visualization.']
['Mike', '35', 'Project manager with experience leading international teams.\nDeutsch sprechen.']

In the output, we can see that csv.reader correctly handles fields containing commas, quotes, and newlines: each quoted field is returned as a single value, and doubled quotes are turned back into single quote characters.

If you want to write data to a CSV file that contains special characters, you can use the following sample code:

import csv

# Data to write, including fields with special characters
data = [
    ["Name", "Age", "Description"],
    ["John", 30, 'A software, "guru" with 5 years of experience. Fluent in English and Español.'],
    ["Jane", 25, 'A data analyst with "extensive" skills.\nPassionate about data visualization.'],
    ["Mike", 35, 'Project manager with experience leading international teams.\nDeutsch sprechen.']
]

# Write the CSV file; QUOTE_MINIMAL quotes a field only when it needs quoting
with open('output.csv', 'w', newline='') as file:
    csv_writer = csv.writer(file, quoting=csv.QUOTE_MINIMAL)

    # Write the data
    csv_writer.writerows(data)

print("CSV file with fields containing special characters has been created.")

When writing data, we use csv.writer with quoting=csv.QUOTE_MINIMAL (the default), which means that a field is wrapped in quotes only when necessary, that is, when it contains the delimiter, the quote character, or a line break.

Output file content:

Name,Age,Description
John,30,"A software, ""guru"" with 5 years of experience. Fluent in English and Español."
Jane,25,"A data analyst with ""extensive"" skills.
Passionate about data visualization."
Mike,35,"Project manager with experience leading international teams.
Deutsch sprechen."

In the output file, the csv module has quoted every field that contains a special character, doubled the embedded quote characters, and written the embedded line breaks as real line breaks inside the quoted fields.

When reading CSV files, csv.reader parses such quoted fields correctly as long as its delimiter and quotechar settings match the file. When writing CSV files, use csv.writer with a suitable quoting setting so that fields containing special characters are quoted.
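
One way to check this behavior is a round trip: write some rows, read the text back, and compare. The sketch below does this in memory with io.StringIO instead of a real file (an assumption made to keep the example self-contained); the sample rows are invented for illustration.

```python
import csv
import io

rows = [
    ["Name", "Description"],
    ["John", 'A software, "guru"'],   # contains a comma and quotes
    ["Jane", "Line one\nLine two"],   # contains a line break
]

# Write the rows; the default quoting is csv.QUOTE_MINIMAL
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
written = buffer.getvalue()
print(written)

# Parse the text again and check that no field was lost or split
parsed = list(csv.reader(io.StringIO(written, newline="")))
print(parsed == rows)  # True
```

The comparison only holds here because every value is a string; numbers would come back as strings after reading.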

5.2 Handling non-ASCII characters

When reading and writing CSV files, you can use the encoding parameter of open() to specify the encoding of the file.

CSV files typically use UTF-8 encoding, which supports text that contains non-ASCII characters.

import csv

# Read a CSV file that contains non-ASCII characters
with open("data.csv", "r", newline="", encoding="utf-8") as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)

# Write a CSV file that contains non-ASCII characters
data = [["中文", "English"], ["数据", "Data"]]
with open("data.csv", "w", newline="", encoding="utf-8") as file:
    csv_writer = csv.writer(file)
    csv_writer.writerows(data)
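
Some UTF-8 files, for example ones exported from spreadsheet software, start with a byte order mark (BOM). The sketch below illustrates, under that assumption, how Python's "utf-8-sig" codec differs from plain "utf-8" in this case. The temporary file name is hypothetical and is used only so that no real data.csv is overwritten.

```python
import csv
import os
import tempfile

# Hypothetical temporary file for the demonstration
path = os.path.join(tempfile.mkdtemp(), "bom_example.csv")

# "utf-8-sig" writes a BOM at the start of the file
with open(path, "w", newline="", encoding="utf-8-sig") as file:
    csv.writer(file).writerows([["中文", "English"], ["数据", "Data"]])

# Decoding with plain "utf-8" leaves the BOM attached to the first field
with open(path, "r", newline="", encoding="utf-8") as file:
    header_plain = next(csv.reader(file))

# Decoding with "utf-8-sig" strips the BOM if one is present
with open(path, "r", newline="", encoding="utf-8-sig") as file:
    header_sig = next(csv.reader(file))

# ascii() is used so the invisible BOM character shows up as an escape sequence
print(ascii(header_plain[0]))  # '\ufeff\u4e2d\u6587'
print(ascii(header_sig[0]))    # '\u4e2d\u6587'
```

If the first column name of a file unexpectedly fails to match, a leftover BOM like this is one thing worth checking.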

5.3 Handling empty fields

If a CSV file contains an empty field, it can be represented by an empty string or by a specific placeholder value (such as "NA" or "None").

When reading CSV files, the skipinitialspace parameter of csv.reader can be used to ignore spaces that immediately follow a delimiter.

5.3.1 Reading empty fields

Suppose we have a CSV file named data.csv with the following content:

Name,Age,City,Description
John,30,New York,"Software engineer with 5 years of experience. Fluent in English and Español."
Jane,,San Francisco,"Data analyst with a passion for data visualization. Speaks français."
Mike,35,, "Project manager with experience leading international teams. Deutsch sprechen."

Note the empty fields in the CSV file above, and the space in front of the quoted Description field in the last row.

We can still use csv.reader to read the CSV file and process the empty fields:

Example 1:

import csv

# Read the CSV file and print its content
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        # Strip whitespace and replace empty fields with None
        processed_row = [field.strip() if field.strip() else None for field in row]
        print(processed_row)

output:

['Name', 'Age', 'City', 'Description']
['John', '30', 'New York', 'Software engineer with 5 years of experience. Fluent in English and Español.']
['Jane', None, 'San Francisco', 'Data analyst with a passion for data visualization. Speaks français.']
['Mike', '35', None, '"Project manager with experience leading international teams. Deutsch sprechen."']

Explanation:

The first line is the header line of the CSV file, which is output unchanged.

In Jane's row the Age field is empty, so it is replaced with None.

In Mike's row the City field is empty, so it is replaced with None.

The Description fields are never empty. In Mike's row, however, the space after the comma means the field does not start with a quote character, so csv.reader treats the double quotes as ordinary characters and they remain in the value.

To handle empty fields, we use a list comprehension over the fields in each row. field.strip() removes whitespace (spaces, line breaks, and so on) from both ends of the field, and a conditional expression then checks whether anything is left. If so, the stripped value is kept; otherwise the field is replaced with None, which represents a missing value.

Example 2:

The same CSV file, which contains empty fields and a leading space, can also be read with csv.reader and skipinitialspace=True:

import csv

# Read the CSV file, ignoring spaces right after a delimiter, and print its content
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file, skipinitialspace=True)
    for row in csv_reader:
        print(row)

output:

['Name', 'Age', 'City', 'Description']
['John', '30', 'New York', 'Software engineer with 5 years of experience. Fluent in English and Español.']
['Jane', '', 'San Francisco', 'Data analyst with a passion for data visualization. Speaks français.']
['Mike', '35', '', 'Project manager with experience leading international teams. Deutsch sprechen.']

In this example, we read the CSV file with csv.reader and pass skipinitialspace=True so that spaces immediately after a delimiter are ignored. In Mike's row, the Description field is preceded by a space; with skipinitialspace=True that space is dropped and the double quotes are recognized as field quoting, so the value is returned without them. The empty Age and City fields are returned as empty strings ('').

5.3.2 Specifying parameters to handle empty fields

How to handle empty fields in CSV files is usually decided case by case. An empty field can be represented by an empty string ('') or by a specific placeholder such as "NA" or "None". Choose the representation that best fits the organization and requirements of your data.
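
As an illustration of the placeholder approach, the sketch below writes None as "NA" and maps both "NA" and empty strings back to None when reading. The MISSING constant, the sample records, and the use of an in-memory io.StringIO buffer are all assumptions made for this example.

```python
import csv
import io

MISSING = "NA"  # hypothetical placeholder; pick whatever marker fits your data

records = [
    {"Name": "Jane", "Age": None, "City": "San Francisco"},
    {"Name": "Mike", "Age": 35, "City": None},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Name", "Age", "City"])
writer.writeheader()
for record in records:
    # Replace None with the placeholder before writing
    writer.writerow({key: MISSING if value is None else value for key, value in record.items()})

print(buffer.getvalue())

# On the way back in, map the placeholder (and empty strings) to None again
buffer.seek(0)
restored = [
    {key: None if value in ("", MISSING) else value for key, value in row.items()}
    for row in csv.DictReader(buffer)
]
print(restored)
```

A placeholder only works if it can never occur as a real value in the same column, which is why it should be chosen per data set.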

In Python's csv module, the quoting parameter of csv.writer and csv.DictWriter controls when fields are wrapped in quotes, which also determines how an empty field appears in the output file.

How each option writes an empty field:

csv.QUOTE_MINIMAL (default): Only fields containing special characters are quoted, so an empty field is written as nothing between two delimiters (for example, Jane,,San Francisco).

csv.QUOTE_ALL: Every field is quoted, so an empty field is written as a pair of double quotes ("").

csv.QUOTE_NONNUMERIC: Every non-numeric field is quoted. An empty string is not a number, so it is also written as "", while numbers such as 30 are left unquoted.

csv.QUOTE_NONE: No field is quoted, so an empty field is written as nothing between two delimiters. With this option the writer raises csv.Error if a field contains the delimiter or the quote character and no escapechar is set.

The writer treats None like an empty string. When these files are read back with the default reader settings, empty fields are returned as the empty string (''), not as None; converting them to None is left to your own code.

Example:

Assume we have a CSV file named data.csv with empty fields, as follows:

Name,Age,City,Description
John,30,New York,
Jane,,San Francisco,"Data analyst with a passion for data visualization."
Mike,35,,Project manager

We will use csv.writer to write the same data, including its empty fields, with each quoting option and compare the results.

import csv

# Quoting options to compare, and the file each one is written to
quoting_options = [csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE]
output_files = ['output_minimal.csv', 'output_all.csv', 'output_nonnumeric.csv', 'output_none.csv']

# Write one file per quoting option
for quoting, output_file in zip(quoting_options, output_files):
    # Data to write, including empty fields
    data = [
        ["John", 30, "New York", ""],
        ["Jane", "", "San Francisco", "Data analyst with a passion for data visualization."],
        ["Mike", 35, "", "Project manager"]
    ]

    # Write the CSV file with the current quoting option
    with open(output_file, 'w', newline='') as file:
        csv_writer = csv.writer(file, quoting=quoting)

        # Write the data
        csv_writer.writerows(data)

print("CSV files with different quoting options have been created.")

In the above example, we write the same data, including empty fields, with four different quoting options, each to its own output file.

The four output files correspond to csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE. You can open the individual files to see how each option writes empty fields.
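
To compare the options without opening four files, the same idea can be sketched in memory. The example below writes a single invented row with each quoting option into an io.StringIO buffer (an assumption made for brevity) and prints the resulting line.

```python
import csv
import io

row = ["Jane", "", 25, "San Francisco"]  # second field is empty

options = {
    "QUOTE_MINIMAL": csv.QUOTE_MINIMAL,
    "QUOTE_ALL": csv.QUOTE_ALL,
    "QUOTE_NONNUMERIC": csv.QUOTE_NONNUMERIC,
    "QUOTE_NONE": csv.QUOTE_NONE,
}

results = {}
for name, quoting in options.items():
    buffer = io.StringIO(newline="")
    csv.writer(buffer, quoting=quoting).writerow(row)
    # Drop the line terminator so each result prints on one line
    results[name] = buffer.getvalue().rstrip("\r\n")
    print(f"{name:<17} {results[name]}")

# QUOTE_MINIMAL     Jane,,25,San Francisco
# QUOTE_ALL         "Jane","","25","San Francisco"
# QUOTE_NONNUMERIC  "Jane","",25,"San Francisco"
# QUOTE_NONE        Jane,,25,San Francisco
```

Only QUOTE_ALL and QUOTE_NONNUMERIC make an empty string visible as "" in the file; the other two leave nothing between the delimiters.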