CSV File Reading and Writing in Python (Part 3)

Disclaimer: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/xo3ylAF9kGs/article/details/90586392

This is original article No. 276.

[Figure: complete mind map of the read_csv parameters]


The first two parts of this series:

CSV File Reading and Writing in Python (Part 1)

CSV File Reading and Writing in Python (Part 2)



2.5 Time-related parameters

parse_dates

If a column contains dates but this parameter is not set, the column is not parsed as a time type after import:

In [5]: df = pd.read_csv('test.csv',sep='\s+',header=0,na_values=['#'])         

In [6]: df                                                                      
Out[6]: 
   id  id.1  age label       date
0   1  'gz'   10   YES  1989-12-1
1   2  'lh'   12    NO        NaN

In [7]: df.dtypes                                                               
Out[7]: 
id        int64
id.1     object
age       int64
label    object
date     object
dtype: object

Here the date column has type object. To convert it into a time type, pass parse_dates:

In [8]: df = pd.read_csv('test.csv',sep='\s+',header=0,na_values=['#'],parse_dat
   ...: es=['date'])                                                            

In [9]: df                                                                      
Out[9]: 
   id  id.1  age label       date
0   1  'gz'   10   YES 1989-12-01
1   2  'lh'   12    NO        NaT

In [11]: df.dtypes                                                              
Out[11]: 
id                int64
id.1             object
age               int64
label            object
date     datetime64[ns]

Now the date column is of type datetime64[ns].
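The effect of parse_dates can be shown in a self-contained sketch; the inline string below stands in for the article's test.csv, with a name column substituted for the duplicated id column:

```python
import io
import pandas as pd

# Inline data standing in for test.csv ('#' marks a missing value)
data = "id  name  age  label  date\n1  gz  10  YES  1989-12-1\n2  lh  12  NO  #\n"

# Without parse_dates the date column stays as object (plain strings)
df_plain = pd.read_csv(io.StringIO(data), sep=r'\s+', na_values=['#'])

# With parse_dates pandas converts the column to datetime64[ns]
df_dates = pd.read_csv(io.StringIO(data), sep=r'\s+', na_values=['#'],
                       parse_dates=['date'])

print(df_plain['date'].dtype)   # object
print(df_dates['date'].dtype)   # datetime64[ns]
```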

date_parser

The date_parser parameter customizes how date strings are parsed into time types; its use is demonstrated in detail below. Our data file:

In [82]: cat test.csv                                                           
id  id  age  label  date
1  'gz'  10  YES  26-MAY-2019
2  'lh'  12  NO  30-MAR-2019

To convert the dates into the standard date format:

In [83]: df = pd.read_csv('test.csv',sep='\s+',parse_dates=['date'],date_parser=
    ...: lambda dates: pd.datetime.strptime(dates,'%d-%b-%Y'))                  

In [84]: df                                                                     
Out[84]: 
   id  id.1  age label       date
0   1  'gz'   10   YES 2019-05-26
1   2  'lh'   12    NO 2019-03-30
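Note that pd.datetime, used in the session above, is deprecated and removed in newer pandas versions. A sketch of an equivalent modern approach (again using inline data in place of test.csv) is to read the column as text and convert it afterwards with pd.to_datetime and an explicit format string:

```python
import io
import pandas as pd

# Inline data standing in for test.csv
data = "id  name  age  label  date\n1  gz  10  YES  26-MAY-2019\n2  lh  12  NO  30-MAR-2019\n"

# Read the date column as plain text, then convert with an explicit format
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%Y')

print(df['date'].dtype)   # datetime64[ns]
```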

To convert into timetuple format:

In [85]: df = pd.read_csv('test.csv',sep='\s+',parse_dates=['date'],date_parser=
    ...: lambda dates: pd.datetime.strptime(dates,'%d-%b-%Y').timetuple())      

In [86]: df                                                                     
Out[86]: 
   id  id.1  age label                                 date
0   1  'gz'   10   YES  (2019, 5, 26, 0, 0, 0, 6, 146, -1)
1   2  'lh'   12    NO   (2019, 3, 30, 0, 0, 0, 5, 89, -1)

infer_datetime_format

infer_datetime_format: boolean, default False
If set to True and parse_dates is enabled, pandas will try to infer the format of the datetime strings and, if it can be inferred, switch to a faster parsing method. In some cases this speeds parsing up by 5 to 10 times.

2.6 Chunked reading

Read a large file into memory block by block, instead of loading it all at once.

iterator

iterator: boolean, default False
If True, return a TextFileReader object for reading the file block by block.

When a file is too large for all of its data to fit in memory, read it in batches and process each batch in turn. For the demonstration below, our data file has two rows.

First read one row; the argument 1 to get_chunk means read one line:

In [105]: chunk = pd.read_csv('test.csv',sep='\s+',iterator=True)               

In [106]: chunk.get_chunk(1)                                                    
Out[106]: 
   id  id.1  age label         date       date1
0   1  'gz'   10   YES  26-MAY-2019  4-OCT-2017

Read the next row:

In [107]: chunk.get_chunk(1)                                                    
Out[107]: 
   id  id.1  age label         date       date1
1   2  'lh'   12    NO  30-MAR-2019  2-SEP-2018

We have now reached the end of the file; reading again raises a StopIteration exception:

In [108]: chunk.get_chunk(1)  

StopIteration                             Traceback (most recent call last)
<ipython-input-108-f294b07af62c> in <module>
----> 1 chunk.get_chunk(1)
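In practice the StopIteration is what ends the processing loop. A minimal sketch, using inline data in place of a large file on disk:

```python
import io
import pandas as pd

# Inline data standing in for a large file on disk
data = "id  name  age\n1  gz  10\n2  lh  12\n3  xw  15\n"

reader = pd.read_csv(io.StringIO(data), sep=r'\s+', iterator=True)

# Pull one row at a time until get_chunk raises StopIteration
rows = []
while True:
    try:
        chunk = reader.get_chunk(1)
    except StopIteration:
        break
    rows.append(chunk)          # process each chunk here

total = sum(len(c) for c in rows)
print(total)   # 3
```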

chunksize

chunksize: int, default None
The number of rows per file chunk; like iterator=True, it makes read_csv return a TextFileReader object.
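With chunksize, the TextFileReader can simply be iterated over; each iteration yields a DataFrame of at most chunksize rows. A sketch with inline data:

```python
import io
import pandas as pd

# Inline data standing in for a large file on disk
data = "id  name  age\n1  gz  10\n2  lh  12\n3  xw  15\n4  zz  20\n"

# chunksize=2 yields DataFrames of two rows each
chunks = pd.read_csv(io.StringIO(data), sep=r'\s+', chunksize=2)
sizes = [len(chunk) for chunk in chunks]
print(sizes)   # [2, 2]
```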

2.7 Quoting, compression, and file format

compression

Value in {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
Use a compressed file directly from disk. With 'infer', files whose names end in '.gz', '.bz2', '.zip', or '.xz' are decompressed with gzip, bz2, zip, or xz respectively; otherwise no decompression is done.

If 'zip' is used, the ZIP archive must contain exactly one file. Set to None for no decompression.

In [119]: df = pd.read_csv('test.zip',sep='\s+',compression='zip')              

In [120]: df                                                                    
Out[120]: 
   id  id.1  age label         date       date1
0   1  'gz'   10   YES  26-MAY-2019  4-OCT-2017
1   2  'lh'   12    NO  30-MAR-2019  2-SEP-2018
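The same behavior can be demonstrated without a file on disk by building a one-file ZIP archive in memory; the file name test.csv inside the archive is just an assumption for the sketch:

```python
import io
import zipfile

import pandas as pd

# Build a ZIP archive containing exactly one CSV file, entirely in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('test.csv', 'a,b\n1,2\n')
buf.seek(0)

# read_csv decompresses the archive and reads the single file inside
df = pd.read_csv(buf, compression='zip')
print(df)
```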

thousands

thousands : str, default None
Thousands separator, such as ',' or '.'

The data file test.csv is shown below:

In [122]: cat test.csv                                                          
id  id  age  label  date
1  'gz'  10  YES  1,090,001
2  'lh'  12  NO  20,010

The date column holds integers with thousands separators. If we do not explicitly set the thousands parameter, the column is read in as object:

In [125]: df = pd.read_csv('test.csv',sep='\s+')                                

In [126]: df                                                                    
Out[126]: 
   id  id.1  age label       date
0   1  'gz'   10   YES  1,090,001
1   2  'lh'   12    NO     20,010

In [127]: df.dtypes                                                             
Out[127]: 
id        int64
id.1     object
age       int64
label    object
date     object
dtype: object

If we explicitly set thousands to ',', the date column is read in as a normal integer type.

In [128]: df = pd.read_csv('test.csv',sep='\s+',thousands=',')

In [132]: df                                                                    
Out[132]: 
   id  id.1  age label     date
0   1  'gz'   10   YES  1090001
1   2  'lh'   12    NO    20010


In [130]: df['date'].dtypes                                                     
Out[130]: dtype('int64')

decimal

decimal : str, default ‘.’
The decimal point character (e.g. ',' in European data). Similar to the thousands parameter above.
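The two parameters combine naturally for European-style numbers; a sketch with inline data in place of a file:

```python
import io
import pandas as pd

# European-style numbers: '.' as thousands separator, ',' as decimal point
data = "id  price\n1  1.234,56\n2  7,5\n"

df = pd.read_csv(io.StringIO(data), sep=r'\s+', thousands='.', decimal=',')
print(df['price'].tolist())   # [1234.56, 7.5]
```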

float_precision

float_precision : string, default None
Specifies the floating-point converter used by the C engine. The default is the ordinary converter; other possible values include the high-precision and round_trip converters.

lineterminator

lineterminator: str (length 1), default None
Line terminator, only used with the C parser.

quotechar

quotechar: str (length 1), optional
The quote character, marking the start and end of a quoted item; delimiters inside quotes are ignored.
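A short sketch of that behavior, using inline comma-separated data:

```python
import io
import pandas as pd

# The comma inside the quoted field is not treated as a delimiter
data = 'id,remark\n1,"hello, world"\n2,plain\n'

df = pd.read_csv(io.StringIO(data), quotechar='"')
print(df['remark'].tolist())   # ['hello, world', 'plain']
```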

quoting

quoting : int or csv.QUOTE_* instance, default 0

Controls quoting behavior via the csv module constants: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

doublequote

doublequote: boolean, default True
When a quotechar is specified and quoting is not QUOTE_NONE, two consecutive quotechar characters inside a quoted field are interpreted as a single quotechar element.
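As a sketch of how the quoting constants change parsing: with QUOTE_NONNUMERIC, any unquoted field is assumed to be numeric, so the unquoted column below comes back as floats (the column names are hypothetical):

```python
import csv
import io

import pandas as pd

# Text fields are quoted; the unquoted b column is treated as numeric
data = '"name","b"\n"x",1\n"y",2\n'

df = pd.read_csv(io.StringIO(data), quoting=csv.QUOTE_NONNUMERIC)
print(df['b'].tolist())
```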

escapechar

escapechar: str (length 1), default None
When quoting is QUOTE_NONE, a one-character string used to escape the delimiter.

comment

comment: str, default None
Marks lines that should not be parsed. If this character appears at the start of a line, the entire line is ignored.

This parameter must be a single character. Like empty lines (when skip_blank_lines=True), fully commented lines are ignored by header and skiprows. For example, parsing '#empty\na,b,c\n1,2,3' with comment='#' and header=0 returns a result with 'a,b,c' as the header.
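The example from the paragraph above can be run directly:

```python
import io
import pandas as pd

# The first line starts with '#', so it is skipped entirely,
# and 'a,b,c' becomes the header row
data = '#empty\na,b,c\n1,2,3\n'

df = pd.read_csv(io.StringIO(data), comment='#', header=0)
print(list(df.columns))   # ['a', 'b', 'c']
```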

encoding

encoding: str, default None
Specifies the character encoding, commonly 'utf-8'. See the list of Python standard encodings.

dialect

dialect: str or csv.Dialect instance, default None
Specifies the CSV dialect to use; note that a sep longer than one character is ignored here. See the csv.Dialect documentation for details.

error_bad_lines

error_bad_lines: boolean, default True
If a line contains too many columns, by default an exception is raised and no DataFrame is returned; if set to False, the bad line is dropped instead (C parser only).

We intentionally corrupt one cell value in test.csv (inserting two extra spaces, since our data file's delimiter is two spaces):

In [148]: cat test.csv                                                          
id  id  age  label  date
1  'gz'  10.8  YES  1,090,001
2  'lh'  12.31  NO  O  20,010

Reading the data file now raises an exception:

ParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6

With a small sample this error shows up quickly, but with a large data file, discovering it in the last few lines after an hour of reading is painful. To be safe, we generally set error_bad_lines to False to drop such lines, and set warn_bad_lines (covered in the next section) to True to print a message for each dropped line.

In [150]: df = pd.read_csv('test.csv',sep='\s+',error_bad_lines=False)          
b'Skipping line 3: expected 5 fields, saw 6\n'

In [151]: df                                                                    
Out[151]: 
   id  id.1   age label       date
0   1  'gz'  10.8   YES  1,090,001

The warning output reads: Skipping line 3: expected 5 fields, saw 6

warn_bad_lines

warn_bad_lines: boolean, default True
If error_bad_lines = False and warn_bad_lines = True, then every dropped "bad line" will be reported (C parser only).
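Note that newer pandas versions (1.3 and later) replace error_bad_lines and warn_bad_lines with the single on_bad_lines parameter. A sketch of the modern equivalent, with inline data containing one bad line:

```python
import io
import pandas as pd

# Line 3 has too many fields and will be skipped
data = 'a,b\n1,2\n3,4,5\n6,7\n'

# on_bad_lines='skip' drops malformed lines (pandas >= 1.3)
df = pd.read_csv(io.StringIO(data), on_bad_lines='skip')
print(len(df))   # 2
```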

The tupleize_cols parameter is deprecated and not recommended.

That covers all of the read_csv parameters, each with a corresponding demonstration.



This series is produced by the "Python and Algorithms Community" public account; please credit the source when reposting.


Python and Algorithms Community
