A detailed guide to reading and writing CSV files in Python

CSV is one of the most popular data interchange formats. Programs often need to read and write information through channels other than the keyboard and console, and exchanging text files is a common way to share information between programs.

This article reviews how to read, process, and parse CSV data from text files using Python.

What is a CSV file?

A CSV (comma-separated values) file is a plain text file that uses a specific structure to arrange tabular data. Because it is plain text, it can contain only text data, in other words printable ASCII or Unicode characters.

The structure of a CSV file is given by its name. Typically CSV files use commas to separate each specific data value.

column 1 name,column 2 name,column 3 name
1st row data 1,1st row data 2,1st row data 3
2nd row data 1,2nd row data 2,2nd row data 3

Notice how each piece of data is separated by a comma. Usually the first row names each data column. Every line after that is actual data, and the number of lines is limited only by file size constraints.

The comma (,) is not the only delimiter in use. Other popular delimiters include the tab (\t), colon (:), and semicolon (;) characters.

Correct parsing of a CSV file requires knowing which delimiter is being used.
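
As a minimal sketch of why the delimiter matters, the snippet below parses the same text twice with Python's built-in csv module (covered later in this article): once with the correct delimiter and once with the default comma. The semicolon-separated sample data is made up for illustration.

```python
import csv
import io

# Hypothetical sample data that uses semicolons instead of commas.
sample = "name;department;birthday\nJohn;IT;November\n"

# Parsing with the matching delimiter splits each line into three fields.
right = list(csv.reader(io.StringIO(sample), delimiter=";"))

# Parsing with the default delimiter (a comma) leaves each line as one field.
wrong = list(csv.reader(io.StringIO(sample)))

print(right[1])  # ['John', 'IT', 'November']
print(wrong[1])  # ['John;IT;November']
```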

Where do CSV files come from?

CSV files are often created by programs that process large amounts of data. They are a convenient way to export data from spreadsheets and databases, and import or use data in other programs. For example, the results of a data mining program can be exported as a CSV file, which can then be imported into a spreadsheet to analyze data, generate graphs for presentations, or prepare reports for publication.

CSV files are easy to work with programmatically in Python, which can process them directly with its standard library.

Parsing CSV files with the built-in csv library

The csv library is designed to work out-of-the-box with CSV files generated with Excel and accommodates a variety of CSV formats.

Reading CSV files with csv

The CSV file is opened as a text file with Python's built-in open() function, which returns a file object. That file object is then passed to csv.reader, which does the reading.

# employee_birthday.txt
name,department,birthday
John,IT,November
Tom,IT,March

In the code below, each row returned by the reader is a list of strings containing the data found between the delimiters. The first row returned contains the column names and is handled separately.

import csv

with open('employee_birthday.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # The first row holds the column names.
            print(f'names are {", ".join(row)}')
            line_count += 1
        else:
            # Every later row is a list of strings, one per field.
            print(f'\t{row[0]} works in the {row[1]} department, and was born in {row[2]}.')
            line_count += 1
    print(f'Processed {line_count} lines.')

names are name, department, birthday
	John works in the IT department, and was born in November.
	Tom works in the IT department, and was born in March.
Processed 3 lines.
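
The line counter above is needed only to tell the header apart from the data. As a sketch of an alternative, the header can be consumed once with next() before looping. The example reads the same employee data from an in-memory string (an assumption made so that the snippet runs without a file on disk).

```python
import csv
import io

# In-memory stand-in for employee_birthday.txt.
sample = io.StringIO("name,department,birthday\nJohn,IT,November\nTom,IT,March\n")

csv_reader = csv.reader(sample)
header = next(csv_reader)   # consume the first row: the column names
rows = list(csv_reader)     # the remaining rows are the data

print(header)     # ['name', 'department', 'birthday']
print(len(rows))  # 2
```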

Reading CSV files into a dictionary with csv

In addition to working with lists of individual string elements, CSV data can also be read directly into a dictionary.

import csv

with open('employee_birthday.txt', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # DictReader consumes the header row itself; each row's keys are the column names.
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        # Fields are looked up by column name; the key must match the header exactly.
        print(f'\t{row["name"]} works in the {row["department"]} department, and was born in {row["birthday"]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')


Column names are name, department, birthday
	John works in the IT department, and was born in November.
	Tom works in the IT department, and was born in March.
Processed 3 lines.

Optional Python CSV reader parameters

  • delimiter specifies the character used to separate fields. The default is a comma (',').
  • quotechar specifies the character used to surround fields that contain the delimiter. The default is a double quote ('"').
  • escapechar specifies the character used to escape the delimiter when quoting is not used. The default is no escape character.

These parameters matter when the data itself contains the delimiter, as in the following file:

name,address,date joined
john,1132 Anywhere Lane Hoboken NJ, 07030,Jan 4
erica,1234 Smith Lane Hoboken NJ, 07030,March 2

This CSV file contains three fields: name, address, and date joined, separated by commas. The problem is that the address data also contains a comma, just before the zip code.

There are three ways to handle this.

  • Use a different delimiter, so that commas in the data are no longer special. The optional delimiter parameter specifies the new delimiter.
  • Enclose the data in quotes. The special meaning of the chosen delimiter is ignored inside quoted strings. The optional quotechar parameter specifies the character used for quoting.
  • Escape the delimiter in the data. Escape characters work as they do in string literals, cancelling the special meaning of the character being escaped (here, the delimiter). If an escape character is used, it must be specified with the optional escapechar parameter.
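
The three approaches can be sketched as follows. Each variant encodes the first record of the file above in a different way; the pipe delimiter and the backslash escape character are choices made for this illustration, not requirements of the csv module.

```python
import csv
import io

# Option 1: a different delimiter (here a pipe), so commas in the data are harmless.
piped = "john|1132 Anywhere Lane Hoboken NJ, 07030|Jan 4\n"
row_a = next(csv.reader(io.StringIO(piped), delimiter="|"))

# Option 2: quote the field that contains the delimiter.
quoted = 'john,"1132 Anywhere Lane Hoboken NJ, 07030",Jan 4\n'
row_b = next(csv.reader(io.StringIO(quoted), quotechar='"'))

# Option 3: escape the delimiter inside the field with a backslash.
escaped = "john,1132 Anywhere Lane Hoboken NJ\\, 07030,Jan 4\n"
row_c = next(csv.reader(io.StringIO(escaped), escapechar="\\"))

# All three produce the same three fields.
print(row_a)
print(row_a == row_b == row_c)
```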

Writing CSV files with csv

CSV files can be written using the writer object and its .writerow() method.

import csv

# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open('employee_file.csv', mode='w', newline='') as employee_file:
    employee_writer = csv.writer(employee_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    employee_writer.writerow(['John Smith', 'Accounting', 'November'])
    employee_writer.writerow(['Erica Meyers', 'IT', 'March'])

The quoting parameter controls when the writer puts quotes around fields:

  • csv.QUOTE_MINIMAL: the writer quotes only fields that contain special characters such as the delimiter, the quotechar, or any character in the lineterminator.
  • csv.QUOTE_ALL: the writer quotes all fields.
  • csv.QUOTE_NONNUMERIC: the writer quotes all non-numeric fields; the reader converts all unquoted fields to float.
  • csv.QUOTE_NONE: the writer never quotes fields. If the data contains a character that needs escaping and escapechar is not set, an error is raised; the reader performs no special processing of quote characters.

The code above writes the following to employee_file.csv:

John Smith,Accounting,November
Erica Meyers,IT,March
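
To compare the quoting modes side by side, here is a small sketch that writes the same row under each mode into an in-memory buffer. The render helper and the sample row are hypothetical and exist only for this illustration; lineterminator is set to "\n" to keep the printed output tidy.

```python
import csv
import io


def render(row, **fmt):
    """Write one row to an in-memory buffer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n", **fmt).writerow(row)
    return buffer.getvalue()


# Hypothetical row: a plain string, a string containing the delimiter, and a number.
row = ["John Smith", "Accounting, Payroll", 42]

minimal = render(row, quoting=csv.QUOTE_MINIMAL)
quote_all = render(row, quoting=csv.QUOTE_ALL)
nonnumeric = render(row, quoting=csv.QUOTE_NONNUMERIC)
# QUOTE_NONE needs an escapechar here because the data contains the delimiter.
no_quotes = render(row, quoting=csv.QUOTE_NONE, escapechar="\\")

print(minimal, end="")     # John Smith,"Accounting, Payroll",42
print(quote_all, end="")   # "John Smith","Accounting, Payroll","42"
print(nonnumeric, end="")  # "John Smith","Accounting, Payroll",42
print(no_quotes, end="")   # John Smith,Accounting\, Payroll,42
```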

Writing CSV files from a dictionary with csv

Unlike DictReader, DictWriter requires the fieldnames parameter, which determines the columns and their order in the output.

import csv

# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open('employee_file2.csv', mode='w', newline='') as csv_file:
    fieldnames = ['emp_name', 'dept', 'birth_month']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    # Write the column names first, then one row per dictionary.
    writer.writeheader()
    writer.writerow({'emp_name': 'John Smith', 'dept': 'Accounting', 'birth_month': 'November'})
    writer.writerow({'emp_name': 'Erica Meyers', 'dept': 'IT', 'birth_month': 'March'})
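
As a quick check of what DictWriter produces, the sketch below writes one record to an in-memory buffer (an assumption made to avoid touching the file system) and reads it back with DictReader.

```python
import csv
import io

fieldnames = ["emp_name", "dept", "birth_month"]
buffer = io.StringIO()

# Write a header row and one record, keyed by field name.
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({"emp_name": "John Smith", "dept": "Accounting", "birth_month": "November"})

# Rewind and read the same text back as dictionaries.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(rows[0]["emp_name"])  # John Smith
```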

Parsing CSV files with the pandas library


pandas is a third-party library, so install it first.

pip install pandas

Reading CSV files with pandas

# hrdata.csv
Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8

Read the CSV file using pandas.

import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df)

             Name Hire Date   Salary  Sick Days remaining
0  Graham Chapman  03/15/14  50000.0                   10
1     John Cleese  06/01/15  65000.0                    8
2       Eric Idle  05/12/14  45000.0                   10
3     Terry Jones  11/01/13  70000.0                    3
4   Terry Gilliam  08/12/14  48000.0                    7
5   Michael Palin  05/23/13  66000.0                    8

Specify an index column when reading the CSV file so that the default integer index is replaced.

import pandas as pd
df = pd.read_csv('hrdata.csv', index_col='Name')
print(df)

               Hire Date   Salary  Sick Days remaining
Name                                                  
Graham Chapman  03/15/14  50000.0                   10
John Cleese     06/01/15  65000.0                    8
Eric Idle       05/12/14  45000.0                   10
Terry Jones     11/01/13  70000.0                    3
Terry Gilliam   08/12/14  48000.0                    7
Michael Palin   05/23/13  66000.0                    8

Parse the Hire Date field as dates instead of leaving it as strings.

import pandas as pd
df = pd.read_csv('hrdata.csv', index_col='Name', parse_dates=['Hire Date'])
print(df)

                Hire Date   Salary  Sick Days remaining
Name                                                   
Graham Chapman 2014-03-15  50000.0                   10
John Cleese    2015-06-01  65000.0                    8
Eric Idle      2014-05-12  45000.0                   10
Terry Jones    2013-11-01  70000.0                    3
Terry Gilliam  2014-08-12  48000.0                    7
Michael Palin  2013-05-23  66000.0                    8

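The column names can also be replaced while reading. header=0 tells pandas that the first row is the existing header, and names supplies the new column names, which index_col and parse_dates then refer to.

import pandas as pd
df = pd.read_csv('hrdata.csv', 
        index_col='Employee', 
        parse_dates=['Hired'], 
        header=0, 
        names=['Employee', 'Hired','Salary', 'Sick Days'])
print(df)
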
                    Hired   Salary  Sick Days
Employee                                     
Graham Chapman 2014-03-15  50000.0         10
John Cleese    2015-06-01  65000.0          8
Eric Idle      2014-05-12  45000.0         10
Terry Jones    2013-11-01  70000.0          3
Terry Gilliam  2014-08-12  48000.0          7
Michael Palin  2013-05-23  66000.0          8

Writing CSV files with pandas

Write operations are as simple as read operations.

import pandas as pd
df = pd.read_csv('hrdata.csv', 
        index_col='Employee', 
        parse_dates=['Hired'],
        header=0, 
        names=['Employee', 'Hired', 'Salary', 'Sick Days'])
df.to_csv('hrdata_modified.csv')
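
One detail worth knowing: to_csv writes the index as the first column by default. The sketch below, which assumes pandas is installed and uses an in-memory copy of the first two rows of hrdata.csv, compares the default output with index=False. Calling to_csv without a path returns the CSV text as a string instead of writing a file.

```python
import io

import pandas as pd

# In-memory stand-in for the first rows of hrdata.csv.
raw = (
    "Name,Hire Date,Salary,Sick Days remaining\n"
    "Graham Chapman,03/15/14,50000.00,10\n"
    "John Cleese,06/01/15,65000.00,8\n"
)
df = pd.read_csv(io.StringIO(raw), index_col="Name")

with_index = df.to_csv()                # the index (Name) is the first column
without_index = df.to_csv(index=False)  # the index is dropped entirely

print(with_index.splitlines()[0])     # Name,Hire Date,Salary,Sick Days remaining
print(without_index.splitlines()[0])  # Hire Date,Salary,Sick Days remaining
```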

Origin blog.csdn.net/qq_20288327/article/details/123688215