Python Data Analysis of CSV Files (3)

  This section mainly discusses how to filter for specific rows when reading and writing CSV files.
  Sometimes we do not need all of the data in a file. For example, we may only need the rows that contain a specific word or number, or the rows associated with a specific date. In these cases, we can filter for the specific rows we need in Python.
  The following covers three methods for selecting particular rows from the input file:
  1. The value in the row satisfies a condition;
  2. The value in the row is in a set;
  3. The value in the row matches a pattern (regular expression).
  In fact, the code for these three filtering methods shares the same structure, which in general looks like this:

for row in filereader:
    if value in row meets some business rule or set of rules:
        do something
    else:
        do something else

  Let's discuss each of these three methods in detail.

The value in a row satisfies a condition

1. Basic Python

  Continuing with the previous examples, suppose we want to keep the rows whose supplier name is Supplier Z or whose cost is greater than $600.00, and write the result to an output file. The code is as follows:

#!/usr/bin/env python3

import csv
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file)
        filewriter = csv.writer(csv_out_file)
        header = next(filereader)
        filewriter.writerow(header)
        for row_list in filereader:
            supplier = str(row_list[0]).strip()
            cost = str(row_list[3]).strip('$').replace(',', '')
            if supplier == 'Supplier Z' or float(cost) > 600.0:
                filewriter.writerow(row_list)

  Let's walk through the code above.

        header = next(filereader)
        filewriter.writerow(header)

  These two lines use the csv module's next() function to read the first line of the input file into the list variable header, and then use the writerow() function to write that header row to the output file.

            supplier = str(row_list[0]).strip()

  This line extracts the supplier name from each row and assigns it to the variable supplier. It uses the list index row_list[0] to pull out the first value in the row, converts it to a string with the str() function, and then uses the strip() function to remove any spaces, tabs, and line breaks from both ends of the string. The cleaned string is assigned to the variable supplier.

            cost = str(row_list[3]).strip('$').replace(',', '')

  This line extracts the cost from each row and assigns it to the variable cost. It uses the list index row_list[3] to pull out the fourth value in the row, converts it to a string with the str() function, uses strip('$') to remove the $ symbol from the string, and then uses replace() to remove the commas. The cleaned string is assigned to the variable cost.
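
  As a quick illustration of this string cleanup (the value '$6,015.00' below is made up purely for illustration, not taken from the data set), the chain of calls works like this:

# a made-up cost value, used only to illustrate the cleanup steps
raw_cost = '$6,015.00'
cost = str(raw_cost).strip('$').replace(',', '')
print(cost)                  # prints: 6015.00
print(float(cost) > 600.0)   # prints: True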

            if supplier == 'Supplier Z' or float(cost) > 600.0:
                filewriter.writerow(row_list)

  These two lines create an if statement that tests whether the two values in each row meet the conditions; if they do, filewriter's writerow() function writes the row to the output file.
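
  Note that float(cost) would raise a ValueError if a cost cell were empty or otherwise unparsable. A small defensive variant of the test, written as a helper function, is sketched below; it is not part of the original example, and treating an unparsable cost as 0.0 is just one possible choice.

def keep_row(supplier, cost_text):
    """Return True if the row should be kept; treat an unparsable cost as 0.0."""
    try:
        cost_value = float(cost_text)
    except ValueError:
        cost_value = 0.0
    return supplier == 'Supplier Z' or cost_value > 600.0
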
  We run this script from the command line. Nothing is printed to the window; we can open the output file to see the results.

2. pandas

  The pandas module provides the loc function, which can select specific rows and columns at the same time. The pandas version of the code is as follows:

#!/usr/bin/env python3

import pandas as pd
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
data_frame['Cost'] = data_frame['Cost'].str.strip('$').astype(float)
data_frame_value_meets_condition = data_frame.loc[(data_frame['Supplier Name'].str.contains('Z'))
                                                  | (data_frame['Cost'] > 600.0), :]
data_frame_value_meets_condition.to_csv(output_file, index=False)

  We run the script from the command line; likewise, we see no output on the screen and can open the output file to view the results.
  The output is omitted here.
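
  One caveat: the basic Python version above also removed commas from the cost values. If the Cost column in your file contains thousands separators, the astype(float) conversion above would fail, so a similar cleanup step may be needed in pandas as well (a sketch, assuming the same 'Cost' column name):

data_frame['Cost'] = data_frame['Cost'].str.strip('$').str.replace(',', '').astype(float)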

The value in a row is part of a set

  Sometimes we only need to keep a row when one of its values belongs to a set. For example, we may want to keep the rows whose supplier name belongs to the set {Supplier X, Supplier Y}, or the rows whose purchase date belongs to the set {'1/20/2014', '1/30/2014'}. In these cases, we can test whether a value in the row belongs to the set and keep the rows for which it does.

1. Basic Python

  The code is as follows:

#!/usr/bin/env python3

import csv
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

important_dates = ['1/20/2014', '1/30/2014']
with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file)
        filewriter = csv.writer(csv_out_file)
        header = next(filereader)
        filewriter.writerow(header)
        for row_list in filereader:
            a_date = row_list[4]
            if a_date in important_dates:
                filewriter.writerow(row_list)

  Let's walk through the code above.

important_dates = ['1/20/2014', '1/30/2014']

  This line creates a list variable named important_dates, which contains two specific dates. This list serves as our set.
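
  Incidentally, important_dates is defined here as a list, but it could just as well be a Python set, which makes the in membership test faster for large collections (a minor variant, not required for this small example):

important_dates = {'1/20/2014', '1/30/2014'}   # set literal; the 'in' test works the same way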

            a_date = row_list[4]

  This line extracts the purchase date from each row and assigns it to the variable a_date.

            if a_date in important_dates:
                filewriter.writerow(row_list)

  These two lines create an if statement that tests whether the purchase date in a_date is in the important_dates set; if it is, the row is written to the output file.
  We run this script from the command line and open the output file to see the results.

2. pandas

  The pandas version of the code is as follows:

#!/usr/bin/env python3

import pandas as pd
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
important_dates = ['1/20/2014', '1/30/2014']
data_frame_value_in_set = data_frame.loc[data_frame['Purchase Date'].isin(important_dates), :]
data_frame_value_in_set.to_csv(output_file, index=False)

  The output is the same as that of the basic Python version above, so it is omitted.
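
  The same isin() pattern covers the supplier-name example mentioned earlier; a sketch, assuming the column is named 'Supplier Name' as in the first pandas example:

data_frame_value_in_set = data_frame.loc[data_frame['Supplier Name'].isin(['Supplier X', 'Supplier Y']), :]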

The value in a row matches a pattern (regular expression)

  Sometimes we only need to keep a row when one of its values matches or contains a particular pattern (regular expression). For example, we may want to keep all rows whose invoice number begins with "001", or all rows whose supplier name contains a "Y". In these cases, we can test whether a value in the row matches or contains the pattern and keep the rows that do.

1. Basic Python

  The code is as follows:

#!/usr/bin/env python3

import csv
import re
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

pattern = re.compile(r'(?P<my_pattern_group>^001-.*)', re.I)
with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file)
        filewriter = csv.writer(csv_out_file)
        header = next(filereader)
        filewriter.writerow(header)
        for row_list in filereader:
            invoice_number = row_list[1]
            if pattern.search(invoice_number):
                filewriter.writerow(row_list)

  Let's walk through the code above.

import re

  This line imports the re module, Python's regular expression module.

pattern = re.compile(r'(?P<my_pattern_group>^001-.*)', re.I)

  This line uses the re module's compile() function to create the variable pattern. The r prefix means the pattern between the quotes is treated as a raw string. The metacharacters ?P<my_pattern_group> capture the matching text in a group named my_pattern_group so that it can be printed to the screen or written to a file when needed. The actual pattern here is ^001-.*. The caret ^ means the pattern is searched for only at the beginning of the string, . matches any character other than a line break \n, and * means the preceding character may repeat zero or more times, so .* allows any number of characters to appear after "001-". Finally, the re.I argument tells the regular expression to match case-insensitively.
  For more on regular expressions, you can refer to my earlier blog post: Python regular expressions.
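
  As a quick sanity check (the invoice numbers below are made up purely for illustration), pattern.search() returns a match object for strings that begin with "001-" and None otherwise:

import re

pattern = re.compile(r'(?P<my_pattern_group>^001-.*)', re.I)
print(pattern.search('001-1001'))   # prints a match object: the string begins with '001-'
print(pattern.search('50-9501'))    # prints None: the string does not begin with '001-'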

            if pattern.search(invoice_number):
                filewriter.writerow(row_list)

  Here, the re module's search() function looks for the pattern in the value of invoice_number; if the pattern appears in that value, the row is written to the output file.
  We run the script from the command line and open the output file to see the results.

2. pandas

  The pandas version of the code is as follows:

#!/usr/bin/env python3

import pandas as pd
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
data_frame_value_matches_pattern = data_frame.loc[data_frame['Invoice Number'].str.startswith("001-"), :]
data_frame_value_matches_pattern.to_csv(output_file, index=False)

  In the code above, the startswith() function is used to search the data, so no regular expression is needed.
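
  If you do want a regular expression on the pandas side, the str accessor also supports it; for example, str.match() accepts a regex pattern that is anchored at the start of each string (a sketch equivalent to the startswith() test above):

data_frame_value_matches_pattern = data_frame.loc[data_frame['Invoice Number'].str.match(r'001-.*'), :]
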
  The output is the same as that of the basic Python version above, so it is omitted.
