pandas learning: processing large CSV files with pandas

# -*- coding: utf-8 -*-
'''
Common csv module APIs

1) csv.reader(csvfile, dialect='excel', **fmtparams): reads a CSV file and returns a
   reader object that can be used to iterate over the rows of the file.
         Parameters:
         csvfile: any object that supports the iterator protocol, usually a file object
         or a list, whose next() method returns a string each time it is called;
         dialect: defaults to 'excel', i.e. Excel-compatible formatting;
         fmtparams: keyword arguments used to override individual settings of the
         chosen dialect.

2) csv.writer(csvfile, dialect='excel', **fmtparams): used to write CSV files.

with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, dialect='excel', delimiter='|', quotechar='"',
                           quoting=csv.QUOTE_MINIMAL)
    # write one row
    csvwriter.writerow(["1/3/09 14:44", "'Product1'", "1200''", "Visa", "Gouya"])
    # The output is: 1/3/09 14:44|'Product1'|1200''|Visa|Gouya

3) csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None,
   dialect='excel', *args, **kwds): similar to reader(), except that each row read is
   mapped to a dictionary. The dictionary keys are given by fieldnames; if it is
   omitted, the first row of the CSV file is used as the keys. If a row contains more
   fields than fieldnames specifies, the extra values are stored under the restkey key;
   restval is used when a row contains fewer fields than fieldnames: its value fills in
   the remaining keys.

4) csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise',
   dialect='excel', *args, **kwds): used to write dictionaries out as CSV rows.
'''
import csv

# DictWriter: write rows from dictionaries
with open('C:\\test.csv', 'w', newline='') as csv_file:
    # set column names
    FIELDS = ['Transaction_date', 'Product', 'Price', 'Payment_Type']
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    # write the header row
    writer.writeheader()
    d = {'Transaction_date': '1/2/09 6:17', 'Product': 'Product1',
         'Price': '1200', 'Payment_Type': 'Mastercard'}
    # write one data row
    writer.writerow(d)

# DictReader: each row comes back as a dictionary keyed by the header
with open('C:\\test.csv', newline='') as csv_file:
    for d in csv.DictReader(csv_file):
        print(d)
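
# A minimal csv.reader sketch, for symmetry with 1) and 2) above: each row comes
# back as a list of strings (this re-reads the test.csv written a moment ago)
with open('C:\\test.csv', newline='') as csv_file:
    for row in csv.reader(csv_file):
        print(row)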
        
'''
pandas, the Python Data Analysis Library, is a third-party tool built for data analysis.
It not only provides rich data models but also supports a variety of file formats,
including CSV, HDF5, and HTML, and can process large data sets efficiently. The two data
structures it is built on, Series and DataFrame, are the basis of all data processing,
so they are introduced first.

Series: a one-dimensional, array-like, indexed data structure whose element types are
compatible with NumPy. If no index is specified, it defaults to 0 to N-1. The values and
the index can be obtained through obj.values and obj.index respectively. When a
dictionary is passed to Series, the index of the Series is sorted according to the keys
of the dictionary. If a dictionary is passed and the index parameter is specified at the
same time, any index entry that does not match a key in the dictionary results in
missing data, marked as NaN.

from pandas import Series, DataFrame
# Use the pandas functions isnull() and notnull() to detect missing data.

>>> obj1 = Series([1, 'a', (1, 2), 3], index=['a', 'b', 'c', 'd'])
>>> obj1  # values and index match one to one
a         1
b         a
c    (1, 2)
d         3
dtype: object
>>> obj2 = Series({"Book": "Python", "Author": "Dan", "ISBN": "011334", "Price": 25},
...               index=['book', 'Author', 'ISBM', 'Price'])
>>> obj2.isnull()
book       True   # 'book' does not match any key of the dictionary, so the value is missing
Author    False
ISBM       True   # 'ISBM' does not match any key of the dictionary, so the value is missing
Price     False
dtype: bool

DataFrame: similar to a spreadsheet. Its data is an ordered collection of columns, each
of which may hold a different data type; it resembles a two-dimensional table and
supports both row and column indexing. As with Series, an index is allocated
automatically. The most common way to construct a DataFrame is from a dictionary of
equal-length lists or NumPy arrays, and the columns can be ordered by passing the
columns parameter.

>>> data = {'OrderDate': ['1-6-10', '1-23-10', '2-9-10', '2-26-10', '3-15-10'],
...         'Region': ['East', 'Central', 'Central', 'West', 'East'],
...         'Rep': ['Jones', 'Kivell', 'Jardine', 'Gill', 'Sorvino']}
>>>
>>> DataFrame(data, columns=['OrderDate', 'Region', 'Rep'])  # built from a dictionary, columns ordered as specified
  OrderDate   Region      Rep
0    1-6-10     East    Jones
1   1-23-10  Central   Kivell
2    2-9-10  Central  Jardine
3   2-26-10     West     Gill
4   3-15-10     East  Sorvino

The functions for processing CSV files in pandas are mainly read_csv() and to_csv().
read_csv() reads the contents of a CSV file and returns a DataFrame; to_csv() is its
inverse.

1) Read only some of the columns and a limited number of rows. The code is as follows
(assuming import pandas as pd):

df = pd.read_csv("SampleData.csv", nrows=5, usecols=['OrderDate', 'Item', 'Total'])

The nrows parameter of read_csv() limits how many rows of the file are read, and usecols
names the columns to be read; if the file has no header row, the positional indices
0, 1, ..., n-1 can be used instead, as in the sketch below. Both parameters are very
useful for large files: they avoid reading the entire file and load only the parts that
are needed.
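
A minimal sketch of the positional variant; the header-less file name is an assumption
made up for illustration:

# select columns by position when the file has no header row (hypothetical file)
df = pd.read_csv("SampleData_noheader.csv", header=None, nrows=5, usecols=[0, 3, 6])
df.columns = ['OrderDate', 'Item', 'Total']  # attach names after reading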

2) Make reading the CSV file Excel-compatible. The dialect parameter can be either a
string or an instance of csv.Dialect. If the file uses "|" as the separator, the
dialect-related parameters need to be set. When error_bad_lines is set to False, records
that do not meet expectations, for example rows whose field count does not match the
file's column layout, are simply skipped. The following code reads a CSV file as
Excel-compatible with "|" as the delimiter, and error_bad_lines=False skips the
non-conforming records (with a warning, as shown below).

>>> dia = csv.excel()
>>> dia.delimiter = "|"  # set the delimiter
>>> pd.read_csv("SD.csv")  # without the dialect, "|" is not treated as a separator
OrderDate|Region|Rep|Item|Units|Unit Cost|Total
1-6-10|East|Jones|Pencil|95|1.99|189.05
1-23-10|Central|Kivell|Binder|50|19.99|999.50
...
>>> pd.read_csv("SD.csv", dialect=dia, error_bad_lines=False)
Skipping line 3: expected 7 fields, saw 10  # rows that do not match the format are skipped
OrderDate Region Rep Item Units Unit Cost Total
1-6-10 East Jones Pencil 95 1.99 189.05
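
Note that in recent pandas releases (1.3 and later) error_bad_lines has been deprecated
in favor of on_bad_lines; an equivalent call would look like this:

# pandas >= 1.3: on_bad_lines='skip' replaces error_bad_lines=False
df = pd.read_csv("SD.csv", dialect=dia, on_bad_lines='skip')

Passing sep="|" directly achieves the same delimiter change without building a dialect.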

3) Read the file in chunks and get back an iterable object. Chunked processing avoids
loading the whole file into memory: data is read in only when it is actually used. The
chunksize parameter sets the number of lines per chunk; 10 means each chunk contains 10
records. When the iterator parameter is set to True, the return value is a
TextFileReader, which is an iterable object. In the following example, with chunksize=10
and iterator=True, each call yields one chunk of 10 records.
>>> reader = pd.read_table("SampleData.csv", chunksize=10, iterator=True)
>>> reader
<pandas.io.parsers.TextFileReader object at 0x0314BE70>
>>> next(reader)  # pull the next chunk from the TextFileReader, 10 rows at a time
OrderDate,Region,Rep,Item,Units,Unit Cost,Total
1-6-10,East,Jones,Pencil,95,1.99,189.05
1-23-10,Central,Kivell,Binder,50,19.99,999.50
2-9-10,Central,Jardine,Pencil,36,4.99,179.64
2-26-10,Central,Gill,Pen,27,19.99,539.73
3-15-10,West,Sorvino,Pencil,56,2.99,167.44
4-1-10,East,Jones,Binder,60,4.99,299.40
4-18-10,Central,Andrews,Pencil,75,1.99,149.25
5-5-10,Central,Jardine,Pencil,90,4.99,449.10
5-22-10,West,Thompson,Pencil,32,1.99,63.68
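
This is the key pattern for files too large to fit in memory: process each chunk as it
arrives and combine the partial results. A minimal sketch, assuming we want to sum the
Total column of SampleData.csv:

import pandas as pd

total = 0.0
# stream the file in 10-row chunks so only one chunk is held in memory at a time
for chunk in pd.read_csv("SampleData.csv", chunksize=10):
    total += chunk['Total'].sum()
print(total)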

4) When several files share the same format, they can be merged. The following example
merges 3 files that have the same format.

>>> filelst = os.listdir("test")
>>> print(filelst)  # three files with the same format
['s1.csv', 's2.csv', 's3.csv']
>>> os.chdir("test")
>>> dfs = [pd.read_csv(f) for f in filelst]
>>> total_df = pd.concat(dfs)  # merge the files into one DataFrame
>>> total_df
OrderDate Region Rep Item Units Unit Cost Total
1-6-10 East Jones Pencil 95 1.99 189.05
1-23-10 Central Kivell Binder 50 19.99 999.5
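
Outside the interactive session, the same merge can be written with glob;
ignore_index=True gives the combined DataFrame a fresh 0..N-1 index (the path pattern
below is an assumption):

import glob
import pandas as pd

# gather every CSV under test/ (hypothetical layout) and concatenate them
files = glob.glob("test/*.csv")
total_df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)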


'''

Reprinted from: https://www.cnblogs.com/tychyg/p/4935987.html
