Detailed explanation of CSV read and write operations necessary for data science

This article walks through the basics of CSV files. Most CSV reading, processing, and writing tasks for imported data can be handled with Python's built-in csv library. When large amounts of data need to be read and processed, the pandas library also provides fast and easy CSV handling.

What is a CSV file

A CSV file (Comma Separated Values file) is a plain text file that uses a specific structure to arrange tabular data. Because it is plain text, it can contain only text data, in other words printable ASCII or Unicode characters.

As the name suggests, a CSV file typically uses commas to separate the individual data values.

column 1 name,column 2 name,column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3
...

Notice how each piece of data is separated by a comma. The first line usually holds the column names, although some files omit it. Every line after that is actual data, and the number of lines is limited only by the file size.

Where do CSV files come from?

CSV files are often created by programs that process large amounts of data. They are a convenient way to export data from spreadsheets and databases and import it into other programs. For example, the results of a data mining program can be exported as a CSV file, which can then be imported into a spreadsheet to analyze data, generate graphs for presentations, or prepare reports for publication.

CSV files are very easy to work with programmatically. Any language that supports text file input and string manipulation, such as Python, can process CSV files directly.
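As a minimal sketch of that idea, the snippet below parses a small CSV string using only str.splitlines() and str.split(). The sample text and the variable names are made up for illustration, and this naive approach assumes that no field contains a comma, a quote, or a line break.

```python
# Hypothetical sample data: a header line followed by two data lines.
text = (
    "name,department,birthday month\n"
    "John Smith,Accounting,November\n"
    "Erica Meyers,IT,March\n"
)

lines = text.splitlines()
header = lines[0].split(",")                    # column names from the first line
rows = [line.split(",") for line in lines[1:]]  # remaining lines are data

# Pair each value with its column name.
records = [dict(zip(header, row)) for row in rows]

for record in records:
    print(record)
```

Plain string splitting stops working as soon as a field contains the delimiter itself, which is one reason to prefer a dedicated CSV parser.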

CSV library to parse CSV files

The csv library provides functions to read and write CSV files. It is designed to work out of the box with CSV files generated by Excel and can adapt to a variety of CSV formats. It contains objects and other code for reading, writing, and manipulating data from CSV files.

Read CSV file

CSV files are opened as text files using Python's built-in open() function, which returns a file object. Reading from the CSV file is then done using the reader object.

employee_birthday.txt

name,department,birthday month
John Smith,Accounting,November
Erica Meyers,IT,March

Reading rows directly with csv.reader.

import csv

# newline='' lets the csv module handle line endings itself
with open('employee_birthday.txt', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # The first row holds the column names
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            print(f'\t{row[0]} works in the {row[1]} department, and was born in {row[2]}.')
            line_count += 1
    print(f'Processed {line_count} lines.')

Reading rows as dictionaries with csv.DictReader.

import csv

with open('employee_birthday.txt', mode='r', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # DictReader consumes the header row; its names become the keys of each row
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        print(f'\t{row["name"]} works in the {row["department"]} department, and was born in {row["birthday month"]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')

Both versions produce the same output.

Column names are name, department, birthday month
    John Smith works in the Accounting department, and was born in November.
    Erica Meyers works in the IT department, and was born in March.
Processed 3 lines.

CSV reader parameter

The reader object can handle different styles of CSV files by specifying additional parameters.

  • delimiter specifies the character used to separate fields. The default is a comma (',').
  • quotechar specifies the character used to surround fields that contain the delimiter. The default is a double quote ('"').
  • escapechar specifies the character used to escape the delimiter when quotes are not used. The default is no escape character.
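The following is a small sketch of how these parameters might be used, not a complete reference. The two sample strings are invented for illustration and are read from memory with io.StringIO instead of from a file.

```python
import csv
import io

# Hypothetical data: fields separated by ';', text wrapped in single quotes.
semicolon_text = "name;motto\n'John Smith';'Work hard; play hard'\n"
reader = csv.reader(io.StringIO(semicolon_text), delimiter=";", quotechar="'")
semicolon_rows = list(reader)

# Hypothetical data: no quotes, the comma inside a field is escaped with a backslash.
escaped_text = "name,motto\nJohn Smith,Work hard\\, play hard\n"
reader = csv.reader(io.StringIO(escaped_text), quoting=csv.QUOTE_NONE, escapechar="\\")
escaped_rows = list(reader)

print(semicolon_rows[1])
print(escaped_rows[1])
```

In both cases the second field comes back as a single value even though it contains the delimiter character.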

employee_addresses.txt

name,address,date joined
john smith,1132 Anywhere Lane Hoboken NJ, 07030,Jan 4
erica meyers,1234 Smith Lane Hoboken NJ, 07030,March 2

This CSV file contains three fields: name, address, and date joined, separated by commas. The problem is that the data for the address field also contains a comma to represent the zip code.

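How should this be handled? The reader parameters described above offer three options: use a different delimiter, wrap the affected fields in the quotechar, or escape the embedded commas with an escapechar.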
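One possible fix, sketched below, is to wrap the address in double quotes, the default quotechar. The quoted variant of the row is an assumption for illustration; the original file does not contain quotes.

```python
import csv
import io

# As in employee_addresses.txt: the address itself contains a comma.
unquoted = (
    "name,address,date joined\n"
    "john smith,1132 Anywhere Lane Hoboken NJ, 07030,Jan 4\n"
)
# The same row with the address wrapped in double quotes (assumed fix).
quoted = (
    "name,address,date joined\n"
    'john smith,"1132 Anywhere Lane Hoboken NJ, 07030",Jan 4\n'
)

broken_row = list(csv.reader(io.StringIO(unquoted)))[1]
fixed_row = list(csv.reader(io.StringIO(quoted)))[1]

print(len(broken_row), broken_row)  # four fields: the address was split in two
print(len(fixed_row), fixed_row)    # three fields: the address stays whole
```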

Writing of CSV files

CSV files can be written using a writer object and its .writerow() method.

import csv

# newline='' prevents extra blank lines between rows on some platforms
with open('employee_file.csv', mode='w', newline='') as employee_file:
    employee_writer = csv.writer(employee_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    employee_writer.writerow(['John Smith', 'Accounting', 'November'])
    employee_writer.writerow(['Erica Meyers', 'IT', 'March'])

quotechar is used to enclose fields containing special characters to eliminate ambiguity.

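The quoting parameter controls the quoting behavior:

  • csv.QUOTE_NONNUMERIC quotes all non-numeric fields.
  • csv.QUOTE_ALL quotes all fields.
  • csv.QUOTE_MINIMAL quotes only fields that contain special characters such as the delimiter, the quotechar, or a line break.
  • csv.QUOTE_NONE never quotes fields. Embedded delimiters must be escaped with escapechar instead.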
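To compare the four modes side by side, the sketch below writes one made-up row to an in-memory buffer under each setting. The render() helper and the sample row are hypothetical and exist only for this illustration.

```python
import csv
import io

row = ["John Smith", "Accounting, Finance", 42]  # hypothetical sample row


def render(**writer_options):
    """Write `row` to an in-memory buffer and return the resulting line."""
    buffer = io.StringIO()
    csv.writer(buffer, **writer_options).writerow(row)
    return buffer.getvalue().rstrip("\r\n")


minimal = render(quoting=csv.QUOTE_MINIMAL)
quote_all = render(quoting=csv.QUOTE_ALL)
nonnumeric = render(quoting=csv.QUOTE_NONNUMERIC)
# QUOTE_NONE needs an escapechar here because the second field contains a comma.
no_quotes = render(quoting=csv.QUOTE_NONE, escapechar="\\")

print(minimal)     # John Smith,"Accounting, Finance",42
print(quote_all)   # "John Smith","Accounting, Finance","42"
print(nonnumeric)  # "John Smith","Accounting, Finance",42
print(no_quotes)   # John Smith,Accounting\, Finance,42
```

Without an escapechar, csv.QUOTE_NONE raises csv.Error for a row like this one, since the embedded comma cannot be written unambiguously.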

Rows can also be written from dictionaries with csv.DictWriter.

import csv

# newline='' prevents extra blank lines between rows on some platforms
with open('employee_file2.csv', mode='w', newline='') as csv_file:
    fieldnames = ['emp_name', 'dept', 'birth_month']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'emp_name': 'John Smith', 'dept': 'Accounting', 'birth_month': 'November'})
    writer.writerow({'emp_name': 'Erica Meyers', 'dept': 'IT', 'birth_month': 'March'})

employee_file2.csv

emp_name,dept,birth_month
John Smith,Accounting,November
Erica Meyers,IT,March
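When the records are already collected in a list, DictWriter.writerows() can write them in one call. The sketch below assumes the same field names as above and uses an in-memory io.StringIO buffer in place of a real file, then reads the result back with csv.DictReader to check the round trip.

```python
import csv
import io

employees = [
    {"emp_name": "John Smith", "dept": "Accounting", "birth_month": "November"},
    {"emp_name": "Erica Meyers", "dept": "IT", "birth_month": "March"},
]

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=["emp_name", "dept", "birth_month"])
writer.writeheader()
writer.writerows(employees)  # writes every dictionary in the list

# Read the data back to confirm nothing was lost.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip == employees)
```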

Parse CSV file using pandas library

pandas is an open source Python library that provides high-performance data analysis tools and easy-to-use data structures.

pandas read CSV file

hrdata.csv

Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8

pandas reads the whole file into a DataFrame with a single call.

import pandas
df = pandas.read_csv('hrdata.csv')
print(df)

             Name Hire Date   Salary  Sick Days remaining
0  Graham Chapman  03/15/14  50000.0                   10
1     John Cleese  06/01/15  65000.0                    8
2       Eric Idle  05/12/14  45000.0                   10
3     Terry Jones  11/01/13  70000.0                    3
4   Terry Gilliam  08/12/14  48000.0                    7
5   Michael Palin  05/23/13  66000.0                    8

pandas can also parse date columns and use a column as the index while reading the data.

import pandas
df = pandas.read_csv('hrdata.csv', index_col='Name', parse_dates=['Hire Date'])
print(df)

                Hire Date   Salary  Sick Days remaining
Name                                                   
Graham Chapman 2014-03-15  50000.0                   10
John Cleese    2015-06-01  65000.0                    8
Eric Idle      2014-05-12  45000.0                   10
Terry Jones    2013-11-01  70000.0                    3
Terry Gilliam  2014-08-12  48000.0                    7
Michael Palin  2013-05-23  66000.0                    8
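Once the column has been parsed, it can be used for date-based filtering. The sketch below is an illustration under a few assumptions: it reads a shortened copy of the data from a string through io.StringIO instead of hrdata.csv, and the hired_in_2014 name is made up.

```python
import io

import pandas

# Shortened copy of hrdata.csv, kept in memory for this example.
csv_text = """Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
"""

df = pandas.read_csv(io.StringIO(csv_text), index_col="Name", parse_dates=["Hire Date"])

# A parsed date column supports the .dt accessor, for example to filter by year.
hired_in_2014 = df[df["Hire Date"].dt.year == 2014]

print(df.dtypes)
print(hired_in_2014.index.tolist())
```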

pandas write to CSV file

A DataFrame read with pandas can be written directly to a new CSV file with to_csv().

import pandas
df = pandas.read_csv('hrdata.csv', 
            index_col='Employee', 
            parse_dates=['Hired'],
            header=0, 
            names=['Employee', 'Hired', 'Salary', 'Sick Days'])
df.to_csv('hrdata_modified.csv')

hrdata_modified.csv

Employee,Hired,Salary,Sick Days
Graham Chapman,2014-03-15,50000.0,10
John Cleese,2015-06-01,65000.0,8
Eric Idle,2014-05-12,45000.0,10
Terry Jones,2013-11-01,70000.0,3
Terry Gilliam,2014-08-12,48000.0,7
Michael Palin,2013-05-23,66000.0,8
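By default to_csv() writes the index as the first column. The sketch below, which uses a small made-up DataFrame, shows the effect of index=False. Calling to_csv() without a path returns the CSV text as a string, which is convenient for inspection.

```python
import io

import pandas

# Small hypothetical data set with "Employee" as the index column.
source = "Employee,Salary\nGraham Chapman,50000.00\nJohn Cleese,65000.00\n"
df = pandas.read_csv(io.StringIO(source), index_col="Employee")

with_index = df.to_csv()                # index column is written first
without_index = df.to_csv(index=False)  # only the Salary column is written

print(with_index)
print(without_index)
```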

Origin blog.csdn.net/qq_20288327/article/details/124097783