# -*- coding:utf-8 -*-
'''
CSV module: common API

1) csv.reader(csvfile, dialect='excel', **fmtparams): reads a CSV file and
   returns a reader object used to iterate over the rows of the file.
   Parameters: csvfile must be an object that supports iteration and yields
   strings -- typically a file object or a list; each call to its next()
   method returns a string. dialect defaults to 'excel', i.e. Excel-compatible
   formatting; fmtparams are keyword arguments used to override individual
   settings of the chosen dialect.

2) csv.writer(csvfile, dialect='excel', **fmtparams): used to write CSV files.

   with open('data.csv', 'w', newline='') as csvfile:
       csvwriter = csv.writer(csvfile, dialect='excel', delimiter='|',
                              quotechar='"', quoting=csv.QUOTE_MINIMAL)
       # write one row
       csvwriter.writerow(["1/3/09 14:44", "'Product1'", "1200''", "Visa", "Gouya"])

   The resulting line is: 1/3/09 14:44|'Product1'|1200''|Visa|Gouya

3) csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None,
   dialect='excel', *args, **kwds): like reader(), except that each row read
   is mapped to a dictionary whose keys are given by fieldnames. If fieldnames
   is omitted, the first row of the CSV file supplies the keys. If a row has
   more fields than fieldnames, the extra values are stored in a list under
   the restkey key; if a row has fewer fields than fieldnames, the missing
   keys take restval as their value.

4) csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise',
   dialect='excel', *args, **kwds): used to write rows from dictionaries.
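Since reader() and the restkey/restval behavior of DictReader are described above but not shown, here is a minimal sketch (the data strings are illustrative, reusing the pipe-delimited line from the writer example):

```python
import csv
import io

# csv.reader(): parse the pipe-delimited line that the writer example produces.
data = "1/3/09 14:44|'Product1'|1200''|Visa|Gouya\r\n"
rows = list(csv.reader(io.StringIO(data), delimiter='|', quotechar='"'))
print(rows[0])  # each row comes back as a list of strings

# csv.DictReader() with restkey: the header row supplies the keys, and any
# extra fields beyond the header are collected in a list under 'extra'.
recs = list(csv.DictReader(io.StringIO("a,b\n1,2,3\n"), restkey='extra'))
print(recs[0])  # {'a': '1', 'b': '2', 'extra': ['3']}
```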
'''
import csv

# DictWriter
with open('C:\\test.csv', 'w', newline='') as csv_file:
    # set the column names
    FIELDS = ['Transaction_date', 'Product', 'Price', 'Payment_Type']
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    # write the header row
    writer.writeheader()
    d = {'Transaction_date': '1/2/09 6:17', 'Product': 'Product1',
         'Price': '1200', 'Payment_Type': 'Mastercard'}
    # write one row
    writer.writerow(d)

with open('C:\\test.csv', newline='') as csv_file:
    for d in csv.DictReader(csv_file):
        print(d)

'''
Pandas, the Python Data Analysis Library, is a third-party tool built for
data analysis. It not only provides rich data models but also supports a
variety of file formats, including CSV, HDF5, and HTML, and can process
large data sets efficiently. Its two core data structures, Series and
DataFrame, are the foundation of its data processing. Both are introduced
below.

Series: an array-like, indexed one-dimensional data structure whose element
types are compatible with NumPy. If no index is specified, it defaults to
0 through N-1. The values and the index can be obtained through obj.values
and obj.index, respectively. When a dictionary is passed to Series, the
index of the Series is built from the dictionary's keys (sorted, in older
pandas versions). If an index parameter is also given and some of its labels
do not match keys in the dictionary, those entries are treated as missing
data and marked NaN. The pandas functions isnull() and notnull() detect
such missing data.
>>> from pandas import Series, DataFrame
>>> obj1 = Series([1, 'a', (1, 2), 3], index=['a', 'b', 'c', 'd'])
>>> obj1  # values and index match one to one
a         1
b         a
c    (1, 2)
d         3
dtype: object
>>> obj2 = Series({"Book": "Python", "Author": "Dan", "ISBN": "011334", "Price": 25},
...               index=['book', 'Author', 'ISBM', 'Price'])
>>> obj2.isnull()
book       True   # the index label matches no dictionary key, so the data is missing
Author    False
ISBM       True   # the index label matches no dictionary key, so the data is missing
Price     False
dtype: bool

DataFrame: similar to a spreadsheet -- its data is a collection of ordered
columns, each of which may hold a different data type. It is essentially a
two-dimensional data structure that supports both row and column indexing.
As with Series, indexes are assigned automatically, and the columns can be
arranged in a specified order. The most common way to construct a DataFrame
is from a dictionary of equal-length lists (or NumPy arrays); the columns
parameter then controls the column order.

>>> data = {'OrderDate': ['1-6-10', '1-23-10', '2-9-10', '2-26-10', '3-15-10'],
...         'Region': ['East', 'Central', 'Central', 'West', 'East'],
...         'Rep': ['Jones', 'Kivell', 'Jardine', 'Gill', 'Sorvino']}
>>> DataFrame(data, columns=['OrderDate', 'Region', 'Rep'])  # built from a dictionary, columns ordered as specified
  OrderDate   Region      Rep
0    1-6-10     East    Jones
1   1-23-10  Central   Kivell
2    2-9-10  Central  Jardine
3   2-26-10     West     Gill
4   3-15-10     East  Sorvino

The pandas functions for handling CSV files are mainly read_csv() and
to_csv(). read_csv() reads the contents of a CSV file and returns a
DataFrame; to_csv() is its inverse.

1) Read only some columns, and a limited number of rows, of a file.
The specific implementation is:

df = pd.read_csv("SampleData.csv", nrows=5, usecols=['OrderDate', 'Item', 'Total'])

The nrows parameter of read_csv() limits how many rows of the file are read,
and usecols names the columns to be read; if the file has no header row, the
column positions 0, 1, ..., n-1 can be used instead. Both parameters are
very useful when handling large files, because they avoid reading the whole
file and load only the parts that are needed.

2) Configure how an Excel-compatible CSV file is parsed.

The dialect parameter can be either a string or an instance of csv.Dialect.
If the file format shown in Figure 4-2 is changed to use "|" as the
separator, the dialect-related parameters must be set accordingly. With
error_bad_lines set to False, records that do not meet expectations -- for
example, rows whose number of columns does not match the file's header --
are simply skipped. The following code reads an Excel-compatible CSV file
whose delimiter is "|"; error_bad_lines=False makes read_csv() skip the
malformed records. (In recent pandas versions, error_bad_lines has been
replaced by on_bad_lines='skip'.)

>>> dia = csv.excel()
>>> dia.delimiter = "|"  # set the delimiter
>>> pd.read_csv("SD.csv")  # without the dialect, each line is parsed as a single column
  OrderDate|Region|Rep|Item|Units|Unit Cost|Total
0          1-6-10|East|Jones|Pencil|95|1.99|189.05
1    1-23-10|Central|Kivell|Binder|50|19.99|999.50
...
>>> pd.read_csv("SD.csv", dialect=dia, error_bad_lines=False)
Skipping line 3: expected 7 fields, saw 10  # malformed records are skipped
  OrderDate   Region     Rep    Item  Units  Unit Cost   Total
0    1-6-10     East   Jones  Pencil     95       1.99  189.05
...

3) Read the file in chunks, returning an iterable object. Chunked processing
avoids loading the whole file into memory: data is read in only when it is
needed. The chunksize parameter sets the number of rows per chunk; a value
of 10 means each chunk contains 10 records.
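The same idea can also be sketched without a csv.Dialect object, using the plain sep parameter; in newer pandas versions (1.3+), on_bad_lines='skip' plays the role of the older error_bad_lines=False. The data below is illustrative, and io.StringIO stands in for a real file:

```python
import io

import pandas as pd

# Pipe-delimited data with one malformed row (10 fields instead of 7).
text = (
    "OrderDate|Region|Rep|Item|Units|Unit Cost|Total\n"
    "1-6-10|East|Jones|Pencil|95|1.99|189.05\n"
    "bad|row|with|far|too|many|fields|x|y|z\n"
    "1-23-10|Central|Kivell|Binder|50|19.99|999.50\n"
)
# sep sets the delimiter; on_bad_lines='skip' drops the malformed row.
df = pd.read_csv(io.StringIO(text), sep="|", on_bad_lines="skip")
print(df.shape)  # only the well-formed rows survive
```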
When the iterator parameter is set to True, the return value is a
TextFileReader, which is an iterable object. In the following example, with
chunksize=10 and iterator=True, each iteration yields a chunk of 10 records.

>>> reader = pd.read_csv("SampleData.csv", chunksize=10, iterator=True)
>>> reader
<pandas.io.parsers.TextFileReader object at 0x0314BE70>
>>> next(reader)  # fetch the next chunk of 10 rows from the TextFileReader
  OrderDate   Region       Rep    Item  Units  Unit Cost   Total
0    1-6-10     East     Jones  Pencil     95       1.99  189.05
1   1-23-10  Central    Kivell  Binder     50      19.99  999.50
2    2-9-10  Central   Jardine  Pencil     36       4.99  179.64
3   2-26-10  Central      Gill     Pen     27      19.99  539.73
4   3-15-10     West   Sorvino  Pencil     56       2.99  167.44
5    4-1-10     East     Jones  Binder     60       4.99  299.40
6   4-18-10  Central   Andrews  Pencil     75       1.99  149.25
7    5-5-10  Central   Jardine  Pencil     90       4.99  449.10
8   5-22-10     West  Thompson  Pencil     32       1.99   63.68
...

4) Merge multiple files that share the same format. The following example
merges three files with identical formats:

>>> filelst = os.listdir("test")
>>> print(filelst)  # three files with the same format
['s1.csv', 's2.csv', 's3.csv']
>>> os.chdir("test")
>>> dfs = [pd.read_csv(f) for f in filelst]
>>> total_df = pd.concat(dfs)  # merge the files
>>> total_df
  OrderDate   Region     Rep    Item  Units  Unit Cost   Total
0    1-6-10     East   Jones  Pencil     95       1.99  189.05
1   1-23-10  Central  Kivell  Binder     50      19.99  999.50
...
'''
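to_csv() is described above as the inverse of read_csv() but never shown, so here is a minimal round-trip sketch; the column names follow the earlier examples, and io.StringIO stands in for a real file:

```python
import io

import pandas as pd

# Build a small frame, write it out with to_csv(), and read it back.
df = pd.DataFrame({'OrderDate': ['1-6-10', '1-23-10'],
                   'Rep': ['Jones', 'Kivell'],
                   'Total': [189.05, 999.50]})
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False omits the row-index column
buf.seek(0)
df2 = pd.read_csv(buf)
print(df.equals(df2))  # the round trip preserves the data
```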
Reprinted from: https://www.cnblogs.com/tychyg/p/4935987.html