This section covers how to filter for specific rows when reading and writing a CSV file.
Sometimes we do not need all of the data in a file. For example, we may need only the rows that contain a specific word or number, or only the rows associated with a specific date. In these cases, we can use Python to filter for the rows we want to keep.
The following describes three ways to select particular rows from the input file:
1. The value in a row meets a condition;
2. The value in a row is in a set;
3. The value in a row matches a pattern (regular expression).
In fact, the code for all three filtering methods has the same structure. The general structure is as follows:
for row in filereader:
    if value in row meets some business rule or set of rules:
        do something
    else:
        do something else
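The general structure above can be sketched as a small runnable script. This is only an illustration: the in-memory sample data, the rule (supplier name equals Supplier Z), and the names kept and skipped are assumptions made for the example, not part of the original scripts.

```python
import csv
import io

# Hypothetical in-memory data standing in for the input file.
sample = io.StringIO(
    "Supplier Name,Cost\n"
    "Supplier X,$500.00\n"
    "Supplier Z,$615.00\n"
)

kept, skipped = [], []
filereader = csv.reader(sample)
header = next(filereader)           # read past the header row
for row in filereader:
    if row[0] == "Supplier Z":      # the "business rule" for this sketch
        kept.append(row)            # do something
    else:
        skipped.append(row)         # do something else

print(kept)
print(skipped)
```

In the real scripts, "do something" is a call to filewriter.writerow(), and the else branch is simply left out so that non-matching rows are dropped.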
Let's discuss each of the three methods in detail.
Values in a row meet a condition
1. Basic Python
Continuing the previous example, suppose we want to keep the rows where the supplier name is Supplier Z or the cost is greater than $600.00, and write the result to the output file. The code is as follows:
#!/usr/bin/env python3
import csv
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file)
        filewriter = csv.writer(csv_out_file)
        header = next(filereader)
        filewriter.writerow(header)
        for row_list in filereader:
            supplier = str(row_list[0]).strip()
            cost = str(row_list[3]).strip('$').replace(',', '')
            if supplier == 'Supplier Z' or float(cost) > 600.0:
                filewriter.writerow(row_list)
Let's walk through the code above.
header = next(filereader)
filewriter.writerow(header)
These two lines use Python's built-in next() function to read the first row of the input file from the csv reader and assign the resulting list to the variable header, then use the writerow() function to write the header row to the output file.
supplier = str(row_list[0]).strip()
This line extracts the supplier name from each row and assigns it to the variable supplier. It uses list indexing to take the first value in the row, row_list[0], converts it to a string with the str() function, and then uses the strip() function to remove spaces, tabs, and newline characters from both ends of the string. The cleaned string is then assigned to the variable supplier.
cost = str(row_list[3]).strip('$').replace(',', '')
This line extracts the cost from each row and assigns it to the variable cost. It uses list indexing to take the fourth value in the row, row_list[3], converts it to a string with the str() function, uses strip('$') to remove the dollar sign from the ends of the string, and then uses replace() to remove any commas. The cleaned string is then assigned to the variable cost.
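The cleaning steps for the cost value can be tried on their own. The sketch below wraps them in a helper named parse_cost, which is a hypothetical name introduced for this example. It also strips surrounding whitespace first, which the original line does not do; that extra step is an assumption about how messy the input might be.

```python
def parse_cost(raw):
    """Convert a string such as '$6,015.00' to a float."""
    cleaned = str(raw).strip()          # assumption: drop surrounding whitespace first
    cleaned = cleaned.strip('$')        # strip('$') removes '$' only from the ends
    cleaned = cleaned.replace(',', '')  # remove thousands separators
    return float(cleaned)

print(parse_cost('$6,015.00'))   # 6015.0
print(parse_cost(' $500.00 '))   # 500.0
```

Without the replace() step, float('6,015.00') would raise a ValueError, which is why the commas have to go before the comparison with 600.0.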
if supplier == 'Supplier Z' or float(cost) > 600.0:
    filewriter.writerow(row_list)
These two lines use an if statement to test the two values in each row against the conditions. If the supplier name is Supplier Z or the cost, converted to a number with float(), is greater than 600.0, the filewriter's writerow() function writes the row to the output file.
We run this script in a command line window. Nothing is printed to the window, but we can open the output file to see the results.
2. pandas
The pandas module provides the loc indexer, which can select specific rows and columns at the same time. The pandas version of the code is as follows:
#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_csv(input_file)
# Remove the dollar sign and any thousands separators before converting to float
data_frame['Cost'] = data_frame['Cost'].str.strip('$').str.replace(',', '', regex=False).astype(float)
data_frame_value_meets_condition = data_frame.loc[(data_frame['Supplier Name'].str.contains('Z'))
                                                  | (data_frame['Cost'] > 600.0), :]
data_frame_value_meets_condition.to_csv(output_file, index=False)
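We run this script in a command line window. As before, nothing is printed to the screen, but we can open the output file to view the results. Note that str.contains('Z') keeps any supplier whose name contains the letter Z, which is a slightly looser test than the exact comparison with 'Supplier Z' in the basic Python version.
The output is omitted here.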
Values in a row are in a set
Sometimes we only need to keep rows in which a value belongs to a set. For example, we may want to keep the rows whose supplier name is in the set {Supplier X, Supplier Y}, or the rows whose purchase date is in the set {'1/20/2014', '1/30/2014'}. In this case, we can test whether the value in a row is a member of the set and select the rows that pass the test.
1. Basic Python
The code is as follows:
#!/usr/bin/env python3
import csv
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
important_dates = ['1/20/2014', '1/30/2014']
with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file)
        filewriter = csv.writer(csv_out_file)
        header = next(filereader)
        filewriter.writerow(header)
        for row_list in filereader:
            a_date = row_list[4]
            if a_date in important_dates:
                filewriter.writerow(row_list)
Let's walk through the code above.
important_dates = ['1/20/2014', '1/30/2014']
This line creates a list variable named important_dates that contains two specific dates. This list is our set.
a_date = row_list[4]
This line extracts the purchase date from each row and assigns it to the variable a_date.
if a_date in important_dates:
    filewriter.writerow(row_list)
This if statement tests whether the purchase date in a_date is in the set important_dates. If it is, the next line of code writes the row to the output file.
We run this script in a command line window and open the output file to see the results.
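The membership test can also be tried without any files. The sketch below is a variation, not the script above: it stores the dates in a Python set rather than a list (the in operator works the same way on both), and the two-column in-memory sample data is hypothetical.

```python
import csv
import io

# A set instead of a list; `in` behaves the same for this purpose.
important_dates = {'1/20/2014', '1/30/2014'}

# Hypothetical in-memory data standing in for the input file.
sample = io.StringIO(
    "Supplier Name,Purchase Date\n"
    "Supplier X,1/20/2014\n"
    "Supplier Y,2/3/2014\n"
    "Supplier Z,1/30/2014\n"
)

filereader = csv.reader(sample)
header = next(filereader)
# Keep only the rows whose purchase date is a member of the set.
matching_rows = [row for row in filereader if row[1] in important_dates]
print(matching_rows)
```

The comparison is an exact string match, so a date written as '01/20/2014' in the file would not be found in this set.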
2. pandas
The pandas version of the code is as follows:
#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_csv(input_file)
important_dates = ['1/20/2014', '1/30/2014']
data_frame_value_in_set = data_frame.loc[data_frame['Purchase Date'].isin(important_dates), :]
data_frame_value_in_set.to_csv(output_file, index=False)
The output is the same as that of the basic Python version above, so it is omitted.
Value in a row matches a pattern (regular expression)
Sometimes we only need to keep rows in which a value matches or contains a particular pattern (a regular expression). For example, we may want to keep all rows whose invoice number begins with "001-", or all rows whose supplier name contains a "Y". In this case, we can test whether the value in a row matches or contains the pattern and keep the rows that do.
1. Basic Python
The code is as follows:
#!/usr/bin/env python3
import csv
import re
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
pattern = re.compile(r'(?P<my_pattern_group>^001-.*)', re.I)
with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file)
        filewriter = csv.writer(csv_out_file)
        header = next(filereader)
        filewriter.writerow(header)
        for row_list in filereader:
            invoice_number = row_list[1]
            if pattern.search(invoice_number):
                filewriter.writerow(row_list)
Let's walk through the code above.
import re
This line imports the re module, Python's regular expression module.
pattern = re.compile(r'(?P<my_pattern_group>^001-.*)', re.I)
This line uses the re module's compile() function to create the variable pattern. The r prefix means that the pattern between the single quotes is treated as a raw string. The metacharacter ?P<my_pattern_group> captures the matched substring in a group named my_pattern_group, so that it can be printed to the screen or written to a file when needed. The actual pattern is ^001-.*. The caret ^ means that the pattern is searched for only at the beginning of the string, the period . matches any character except a newline \n, and the asterisk * repeats the preceding character zero or more times, so together .* means that any characters other than a newline can appear any number of times after "001-". Finally, the argument re.I makes the match case-insensitive.
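To see how the pieces of the pattern behave, it can be applied to a few strings directly. The invoice numbers below are made-up test values, and the second check with the text 'supplier y' is a hypothetical pattern added only to show the effect of re.I, since digits have no upper or lower case.

```python
import re

pattern = re.compile(r'(?P<my_pattern_group>^001-.*)', re.I)

# Made-up invoice numbers used only for this demonstration.
for invoice_number in ['001-1001', '50-9501', 'A001-7']:
    match = pattern.search(invoice_number)
    if match:
        # The named group holds the text matched by ^001-.*
        print(invoice_number, '->', match.group('my_pattern_group'))
    else:
        print(invoice_number, '-> no match')

# re.I ignores case, so a lowercase pattern finds mixed-case text.
print(bool(re.search(r'supplier y', 'Supplier Y', re.I)))   # True
```

The string 'A001-7' is rejected because the caret anchors the pattern to the start of the string, even though "001-" appears later in it.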
For more about regular expressions, see my earlier blog post: Python regular expression.
if pattern.search(invoice_number):
    filewriter.writerow(row_list)
Here, the pattern's search() method looks for the pattern in the value of invoice_number. If the pattern occurs in the value, the row is written to the output file.
We run this script in a command line window and get the output file shown in the following figure.
2. pandas
The pandas version of the code is as follows:
#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_csv(input_file)
data_frame_value_matches_pattern = data_frame.loc[data_frame['Invoice Number'].str.startswith("001-"), :]
data_frame_value_matches_pattern.to_csv(output_file, index=False)
In the code above, the startswith() function does the matching, so no regular expression is needed.
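The same trade-off exists in plain Python, where strings have their own startswith() method. The sketch below, using made-up invoice numbers, suggests that for a fixed prefix the string method and an anchored regular expression select the same values.

```python
import re

# Made-up invoice numbers used only for this comparison.
invoice_numbers = ['001-1001', '50-9501', '920-4803', '001-2002']

# Fixed prefix: a plain string method is enough.
with_startswith = [n for n in invoice_numbers if n.startswith('001-')]
# The equivalent anchored regular expression.
with_regex = [n for n in invoice_numbers if re.search(r'^001-', n)]

print(with_startswith)
print(with_regex == with_startswith)
```

A regular expression becomes worthwhile when the prefix is not fixed, for example when several alternative prefixes or case-insensitive matching are needed.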
The output is the same as that of the basic Python version above, so it is omitted.