This article walks you through the basics of CSV files. Most CSV reading, processing, and writing tasks can be handled by Python's built-in csv library. If large amounts of data need to be read and processed, the pandas library also provides fast and easy CSV processing.
What is a CSV file
A CSV file (Comma Separated Values file) is a plain text file that uses a specific structure to arrange tabular data. Because it is a plain text file, it can only contain actual text data, in other words printable ASCII or Unicode characters.
As the name suggests, a CSV file typically uses commas to separate the individual data values:
column 1 name,column 2 name, column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3
...
Notice how each piece of data is separated by a comma. The first row usually holds the column names, although it is sometimes omitted. Every line after that is actual data, and the number of lines is limited only by the file size.
Where did the CSV file come from?
CSV files are often created by programs that process large amounts of data. They are a convenient way to export data from spreadsheets and databases and to import it into other programs. For example, the results of a data mining program can be exported as a CSV file, which can then be imported into a spreadsheet to analyze the data, generate graphs for presentations, or prepare reports for publication.
CSV files are very easy to work with programmatically. Any language that supports text file input and string manipulation, such as Python, can process CSV files directly.
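As a rough illustration of that point, the sketch below parses a small CSV text using nothing but string methods. The sample data and the `parse_simple_csv` helper are made up for this example. The approach is deliberately naive: it breaks as soon as a field contains a comma or a quote, which is one reason to prefer a dedicated CSV parser.

```python
def parse_simple_csv(text):
    """Split plain CSV text into a header list and a list of row lists.

    Naive on purpose: quoted fields and embedded commas are not handled.
    """
    # Drop blank lines so a trailing newline does not create an empty row.
    lines = [line for line in text.splitlines() if line.strip()]
    header = [name.strip() for name in lines[0].split(',')]
    rows = [[value.strip() for value in line.split(',')] for line in lines[1:]]
    return header, rows


# Hypothetical sample data, used only for this illustration.
sample = "fruit,color,count\napple,red,3\nbanana,yellow,5\n"

header, rows = parse_simple_csv(sample)
print(header)
print(rows)
```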
CSV library to parse CSV files
The csv library provides functions to read and write CSV files. It is designed to work out of the box with CSV files generated by Excel, and it can be adapted to a variety of other CSV formats. The library contains objects and other code for reading, writing, and manipulating data from CSV files.
Read CSV file
CSV files are opened as text files using Python's built-in open() function, which returns a file object. Reading from the CSV file is then done using the reader object.
employee_birthday.txt
name,department,birthday month
John Smith,Accounting,November
Erica Meyers,IT,March
Direct read method.
import csv

# newline='' lets the csv module handle line endings itself.
with open('employee_birthday.txt', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # The first row holds the column names.
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            # Each later row is a list of strings, indexed by position.
            print(f'\t{row[0]} works in the {row[1]} department, and was born in {row[2]}.')
            line_count += 1
    print(f'Processed {line_count} lines.')
Reading rows as dictionaries.
import csv

with open('employee_birthday.txt', mode='r', newline='') as csv_file:
    # DictReader consumes the first row and uses it as the dictionary keys.
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # Joining a dict joins its keys, i.e. the column names.
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        print(f'\t{row["name"]} works in the {row["department"]} department, and was born in {row["birthday month"]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')
Both versions produce the same output.
Column names are name, department, birthday month
John Smith works in the Accounting department, and was born in November.
Erica Meyers works in the IT department, and was born in March.
Processed 3 lines.
CSV reader parameters
The reader object can handle different styles of CSV files when given additional parameters.
- delimiter specifies the character used to separate each field. The default is a comma (',').
- quotechar specifies the character used to surround fields that contain the delimiter. The default is a double quote ('"').
- escapechar specifies the character used to escape the delimiter when quotes are not used. By default there is no escape character.
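To make the first two parameters concrete, here is a minimal sketch that reads semicolon-separated text in which single quotes surround a field containing the delimiter. The sample text is hypothetical, and `io.StringIO` stands in for an opened file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample: ';' separates fields, and single quotes protect a
# field that itself contains a ';'.
semicolon_text = "name;motto\nJohn Smith;'Work hard; play hard'\n"

reader = csv.reader(io.StringIO(semicolon_text), delimiter=';', quotechar="'")
semicolon_rows = list(reader)

for row in semicolon_rows:
    print(row)
```

The quoted field comes back as a single value with the quote characters removed.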
employee_addresses.txt
name,address,date joined
john smith,1132 Anywhere Lane Hoboken NJ, 07030,Jan 4
erica meyers,1234 Smith Lane Hoboken NJ, 07030,March 2
This CSV file contains three fields: name, address, and date joined, separated by commas. The problem is that the data for the address field also contains a comma to represent the zip code.
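There are three common ways to deal with this: use a delimiter that does not occur in the data, wrap the affected field in quotes, or escape the embedded delimiter with an escape character.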
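The problem and two possible fixes can be sketched with the first data row of employee_addresses.txt. The quoted and escaped variants of the line are hypothetical rewrites of that row, not the contents of the original file, and `io.StringIO` stands in for an opened file.

```python
import csv
import io

# The row as it appears in the file: the comma before the zip code is
# indistinguishable from a field separator.
raw_line = "john smith,1132 Anywhere Lane Hoboken NJ, 07030,Jan 4\n"
naive_fields = next(csv.reader(io.StringIO(raw_line)))
print(len(naive_fields))  # four fields instead of the intended three

# Fix 1 (assumed rewrite of the row): wrap the address in double quotes,
# which is the reader's default quotechar.
quoted_line = 'john smith,"1132 Anywhere Lane Hoboken NJ, 07030",Jan 4\n'
quoted_fields = next(csv.reader(io.StringIO(quoted_line)))

# Fix 2 (assumed rewrite of the row): escape the embedded comma with a
# backslash and pass that character as escapechar.
escaped_line = "john smith,1132 Anywhere Lane Hoboken NJ\\, 07030,Jan 4\n"
escaped_fields = next(csv.reader(io.StringIO(escaped_line), escapechar='\\'))

print(quoted_fields)
print(escaped_fields)
```

Both fixes require changing how the file is written. Switching to a delimiter that never occurs in the data, and passing it as `delimiter`, is a further option.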
Writing CSV files
CSV files can be written with a writer object and its .writerow() method.
import csv

# newline='' prevents extra blank lines between rows on some platforms.
with open('employee_file.csv', mode='w', newline='') as employee_file:
    employee_writer = csv.writer(employee_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Each call writes one list of values as one line of the file.
    employee_writer.writerow(['John Smith', 'Accounting', 'November'])
    employee_writer.writerow(['Erica Meyers', 'IT', 'March'])
quotechar is used to enclose fields containing special characters to eliminate ambiguity.
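The quoting parameter controls when fields are quoted. It accepts one of several constants:
- csv.QUOTE_NONNUMERIC # quote all non-numeric fields
- csv.QUOTE_ALL # quote all fields
- csv.QUOTE_MINIMAL # quote only fields containing special characters such as the delimiter or quotechar
- csv.QUOTE_NONE # never quote fields; an escapechar is needed if the data contains the delimiter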
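The effect of each constant is easiest to see side by side. The sketch below writes the same row with each quoting mode into an in-memory buffer; the row values and the `render_row` helper are hypothetical and exist only for this comparison.

```python
import csv
import io


def render_row(row, quoting, **writer_options):
    """Return the text that csv.writer produces for one row."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting, **writer_options).writerow(row)
    return buffer.getvalue()


# Made-up row: one plain string, one string containing the delimiter,
# and one number.
row = ['Erica Meyers', 'IT, Support', 42]

minimal = render_row(row, csv.QUOTE_MINIMAL)
quote_all = render_row(row, csv.QUOTE_ALL)
nonnumeric = render_row(row, csv.QUOTE_NONNUMERIC)
# QUOTE_NONE needs an escapechar here because the data contains a comma.
none_escaped = render_row(row, csv.QUOTE_NONE, escapechar='\\')

for label, text in [('MINIMAL', minimal), ('ALL', quote_all),
                    ('NONNUMERIC', nonnumeric), ('NONE', none_escaped)]:
    print(f'{label:<11}{text.rstrip()}')
```

Without an escapechar, the csv.QUOTE_NONE call would raise csv.Error for this row, because the writer has no way to protect the embedded comma.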
Writing rows from dictionaries.
import csv

with open('employee_file2.csv', mode='w', newline='') as csv_file:
    # fieldnames fixes the column order and is required by DictWriter.
    fieldnames = ['emp_name', 'dept', 'birth_month']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    # Write the column names as the first line.
    writer.writeheader()
    writer.writerow({'emp_name': 'John Smith', 'dept': 'Accounting', 'birth_month': 'November'})
    writer.writerow({'emp_name': 'Erica Meyers', 'dept': 'IT', 'birth_month': 'March'})
employee_file2.csv
emp_name,dept,birth_month
John Smith,Accounting,November
Erica Meyers,IT,March
Parse CSV file using pandas library
pandas is an open source Python library that provides high-performance data analysis tools and easy-to-use data structures.
pandas read CSV file
hrdata.csv
Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8
Use pandas to read quickly.
import pandas
df = pandas.read_csv('hrdata.csv')
print(df)
             Name Hire Date   Salary  Sick Days remaining
0  Graham Chapman  03/15/14  50000.0                   10
1     John Cleese  06/01/15  65000.0                    8
2       Eric Idle  05/12/14  45000.0                   10
3     Terry Jones  11/01/13  70000.0                    3
4   Terry Gilliam  08/12/14  48000.0                    7
5   Michael Palin  05/23/13  66000.0                    8
pandas can also parse date columns and use a column as the index while reading the data.
import pandas
df = pandas.read_csv('hrdata.csv', index_col='Name', parse_dates=['Hire Date'])
print(df)
                Hire Date   Salary  Sick Days remaining
Name
Graham Chapman 2014-03-15  50000.0                   10
John Cleese    2015-06-01  65000.0                    8
Eric Idle      2014-05-12  45000.0                   10
Terry Jones    2013-11-01  70000.0                    3
Terry Gilliam  2014-08-12  48000.0                    7
Michael Palin  2013-05-23  66000.0                    8
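Once the hire dates are real datetime values, they can be filtered and decomposed like any other date. The following sketch uses three rows of hrdata.csv inlined through `io.StringIO` so that it runs on its own. It converts the column with an explicit format string, which assumes the dates are month/day/two-digit year; that is an assumption about the data, not something the file states.

```python
import io

import pandas

# Three rows of hrdata.csv, inlined so the example is self-contained.
hr_text = """Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Terry Jones,11/01/13,70000.00,3
"""

df = pandas.read_csv(io.StringIO(hr_text), index_col='Name')

# Assumed format: month/day/two-digit year.
df['Hire Date'] = pandas.to_datetime(df['Hire Date'], format='%m/%d/%y')

# Date comparisons and the .dt accessor work on the parsed column.
hired_since_2014 = df[df['Hire Date'] >= pandas.Timestamp('2014-01-01')]
hire_years = df['Hire Date'].dt.year.tolist()

print(hired_since_2014.index.tolist())
print(hire_years)
```

Spelling out the format avoids relying on pandas guessing the date layout, which can be ambiguous for values such as 05/12/14.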
pandas write to CSV file
Content read into pandas can be written directly to a new csv file.
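import pandas

# header=0 replaces the original header row with the names given in names.
df = pandas.read_csv('hrdata.csv',
                     index_col='Employee',
                     parse_dates=['Hired'],
                     header=0,
                     names=['Employee', 'Hired', 'Salary', 'Sick Days'])
# Write the renamed, date-parsed data to a new file.
df.to_csv('hrdata_modified.csv')
print(df)
hrdata_modified.csv
Employee,Hired,Salary,Sick Days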
Graham Chapman,2014-03-15,50000.0,10
John Cleese,2015-06-01,65000.0,8
Eric Idle,2014-05-12,45000.0,10
Terry Jones,2013-11-01,70000.0,3
Terry Gilliam,2014-08-12,48000.0,7
Michael Palin,2013-05-23,66000.0,8