This is original article No. 276.
The first two parts of this series:
Reading and Writing CSV Files in Python (1)
Reading and Writing CSV Files in Python (2)
2.5 Time-related parameters
parse_dates
If certain columns should be imported as datetime, but we do not assign this parameter, then after import they are not of datetime type, as follows:
In [5]: df = pd.read_csv('test.csv',sep='\s+',header=0,na_values=['#'])
In [6]: df
Out[6]:
id id.1 age label date
0 1 'gz' 10 YES 1989-12-1
1 2 'lh' 12 NO NaN
In [7]: df.dtypes
Out[7]:
id int64
id.1 object
age int64
label object
date object
dtype: object
At this point the date column has type object. To convert it to a datetime type, pass parse_dates:
In [8]: df = pd.read_csv('test.csv',sep='\s+',header=0,na_values=['#'],parse_dat
...: es=['date'])
In [9]: df
Out[9]:
id id.1 age label date
0 1 'gz' 10 YES 1989-12-01
1 2 'lh' 12 NO NaT
In [11]: df.dtypes
Out[11]:
id int64
id.1 object
age int64
label object
date datetime64[ns]
Now the date column has type datetime64[ns].
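The same behavior can be reproduced without a file on disk; below is a minimal sketch using io.StringIO, where the inline data and the column name `name` are stand-ins for the tutorial's test.csv.

```python
import io

import pandas as pd

# Inline CSV standing in for test.csv; whitespace-delimited like the tutorial's file
data = "id name age label date\n1 'gz' 10 YES 1989-12-1\n2 'lh' 12 NO #\n"

# Without parse_dates the date column stays as object (plain strings / NaN)
df_plain = pd.read_csv(io.StringIO(data), sep=r'\s+', na_values=['#'])
print(df_plain['date'].dtype)  # object

# With parse_dates pandas converts it to datetime64[ns]; missing values become NaT
df_dates = pd.read_csv(io.StringIO(data), sep=r'\s+', na_values=['#'],
                       parse_dates=['date'])
print(df_dates['date'].dtype)  # datetime64[ns]
```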
date_parser
The date_parser parameter customizes how date strings are parsed; its use is detailed below. Our data file:
In [82]: cat test.csv
id id age label date
1 'gz' 10 YES 26-MAY-2019
2 'lh' 12 NO 30-MAR-2019
To convert the dates into standard-format datetimes:
In [83]: df = pd.read_csv('test.csv',sep='\s+',parse_dates=['date'],date_parser=
...: lambda dates: pd.datetime.strptime(dates,'%d-%b-%Y'))
In [84]: df
Out[84]:
id id.1 age label date
0 1 'gz' 10 YES 2019-05-26
1 2 'lh' 12 NO 2019-03-30
To convert them into timetuple format instead:
In [85]: df = pd.read_csv('test.csv',sep='\s+',parse_dates=['date'],date_parser=
...: lambda dates: pd.datetime.strptime(dates,'%d-%b-%Y').timetuple())
In [86]: df
Out[86]:
id id.1 age label date
0 1 'gz' 10 YES (2019, 5, 26, 0, 0, 0, 6, 146, -1)
1 2 'lh' 12 NO (2019, 3, 30, 0, 0, 0, 5, 89, -1)
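Note that pd.datetime was removed in pandas 2.0, and date_parser itself has since been deprecated in favor of the date_format argument. A version-safe way to get the same result as the transcript above is to parse after reading with pd.to_datetime; a minimal sketch with inline data standing in for test.csv:

```python
import io

import pandas as pd

data = "id name date\n1 'gz' 26-MAY-2019\n2 'lh' 30-MAR-2019\n"

df = pd.read_csv(io.StringIO(data), sep=r'\s+')
# Convert the %d-%b-%Y strings to datetime64[ns] after reading;
# month-name matching is case-insensitive, so 'MAY' is fine
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%Y')
print(df['date'].iloc[0].date())  # 2019-05-26
```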
infer_datetime_format
infer_datetime_format: boolean, default False
If set to True and parse_dates is enabled, pandas will try to infer the datetime format of the strings in the column and, if one can be inferred, switch to a faster parsing method. In some cases this speeds up parsing 5 to 10 times.
2.6 Reading in chunks
Reading a file into memory block by block
iterator
iterator: boolean, default False
If True, returns a TextFileReader object for reading the file block by block.
When a file is so large that memory cannot hold all of it, we can read it in batches and process each batch in turn. For the demonstration below, our data file contains 2 rows.
First read one row; the argument 1 to get_chunk means read one row:
In [105]: chunk = pd.read_csv('test.csv',sep='\s+',iterator=True)
In [106]: chunk.get_chunk(1)
Out[106]:
id id.1 age label date date1
0 1 'gz' 10 YES 26-MAY-2019 4-OCT-2017
Read the next row:
In [107]: chunk.get_chunk(1)
Out[107]:
id id.1 age label date date1
1 2 'lh' 12 NO 30-MAR-2019 2-SEP-2018
We have now reached the end of the file, so reading again raises a StopIteration exception:
In [108]: chunk.get_chunk(1)
StopIteration Traceback (most recent call last)
<ipython-input-108-f294b07af62c> in <module>
----> 1 chunk.get_chunk(1)
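Calling get_chunk repeatedly and catching StopIteration can be wrapped in a loop; a minimal sketch with inline data (the column names are illustrative, not the tutorial's file):

```python
import io

import pandas as pd

data = "id name\n1 'gz'\n2 'lh'\n3 'xx'\n"

# iterator=True returns a TextFileReader instead of a DataFrame
reader = pd.read_csv(io.StringIO(data), sep=r'\s+', iterator=True)

total_rows = 0
while True:
    try:
        chunk = reader.get_chunk(1)  # read one row at a time
        total_rows += len(chunk)
    except StopIteration:
        break  # end of file reached
print(total_rows)  # 3
```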
chunksize
chunksize: int, default None
The number of rows per file chunk.
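A minimal sketch of chunksize with inline data: each iteration yields a DataFrame of at most chunksize rows, so a large file can be aggregated without ever holding it whole in memory. The column names here are illustrative.

```python
import io

import pandas as pd

data = "id value\n1 10\n2 20\n3 30\n4 40\n5 50\n"

# chunksize=2 yields DataFrames of at most 2 rows each
total = 0
for chunk in pd.read_csv(io.StringIO(data), sep=r'\s+', chunksize=2):
    total += chunk['value'].sum()  # aggregate per chunk
print(total)  # 150
```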
2.7 Quoting, compression, and file format
compression
The parameter takes a value in {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of compressed files on disk. With 'infer', files whose names end in '.gz', '.bz2', '.zip', or '.xz' are decompressed with gzip, bz2, zip, or xz respectively; otherwise no decompression is done.
With 'zip', the ZIP archive must contain exactly one data file. Set to None for no decompression.
In [119]: df = pd.read_csv('test.zip',sep='\s+',compression='zip')
In [120]: df
Out[120]:
id id.1 age label date date1
0 1 'gz' 10 YES 26-MAY-2019 4-OCT-2017
1 2 'lh' 12 NO 30-MAR-2019 2-SEP-2018
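The round trip can be sketched with a gzip file written on the fly; the temp-file path and column names below are illustrative.

```python
import gzip
import os
import tempfile

import pandas as pd

csv_text = "id value\n1 10\n2 20\n"

# Write a gzip-compressed CSV; the .gz suffix also lets compression='infer' work
path = os.path.join(tempfile.mkdtemp(), 'test.csv.gz')
with gzip.open(path, 'wt') as f:
    f.write(csv_text)

# Explicit compression='gzip'; the default 'infer' would deduce it from '.gz'
df = pd.read_csv(path, sep=r'\s+', compression='gzip')
print(len(df))  # 2
```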
thousands
thousands: str, default None
The thousands separator, e.g. ',' or '.'.
The data file test.csv is shown below:
In [122]: cat test.csv
id id age label date
1 'gz' 10 YES 1,090,001
2 'lh' 12 NO 20,010
The date column holds integers with thousands separators. If we do not explicitly specify the thousands parameter, the column is read in with type object, as follows:
In [125]: df = pd.read_csv('test.csv',sep='\s+')
In [126]: df
Out[126]:
id id.1 age label date
0 1 'gz' 10 YES 1,090,001
1 2 'lh' 12 NO 20,010
In [127]: df.dtypes
Out[127]:
id int64
id.1 object
age int64
label object
date object
dtype: object
If we explicitly specify thousands=',', the date column is read in as a normal integer type:
In [128]: df = pd.read_csv('test.csv',sep='\s+',thousands=',')
In [132]: df
Out[132]:
id id.1 age label date
0 1 'gz' 10 YES 1090001
1 2 'lh' 12 NO 20010
In [130]: df['date'].dtypes
Out[130]: dtype('int64')
decimal
decimal: str, default '.'
The character recognized as the decimal point (e.g. European data uses ','). Similar to the thousands parameter above.
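A minimal sketch of decimal together with thousands for European-style numbers; the inline data and column names are illustrative.

```python
import io

import pandas as pd

# European formatting: '.' as thousands separator, ',' as decimal point
data = "id;price\n1;1.234,56\n2;7,5\n"

df = pd.read_csv(io.StringIO(data), sep=';', thousands='.', decimal=',')
print(df['price'].tolist())  # [1234.56, 7.5]
```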
float_precision
float_precision: string, default None
Specifies which converter the C engine uses for floating-point values. The default is the ordinary converter; the other options are the high-precision converter and round_trip.
lineterminator
lineterminator: str (length 1), default None
The character used to break the file into lines. Only valid with the C parser.
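A minimal sketch with an unusual line terminator; the '~'-separated inline data is illustrative.

```python
import io

import pandas as pd

# '~' is used as the line terminator instead of newline (C parser only)
data = "a,b~1,2~3,4"

df = pd.read_csv(io.StringIO(data), lineterminator='~')
print(len(df))  # 2
```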
quotechar
quotechar: str (length 1), optional
The character used to mark the start and end of a quoted item; delimiters inside the quotes are ignored.
quoting
quoting: int or csv.QUOTE_* instance, default 0
Controls field quoting behavior via the csv constants: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2), or QUOTE_NONE (3).
doublequote
doublequote: boolean, default True
When quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field should be interpreted as a single quotechar.
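A minimal sketch of quotechar and doublequote together: a delimiter inside the quoted field is not treated as a separator, and a doubled quote inside a field collapses to a single quote when doublequote=True (the default). The inline data is illustrative.

```python
import io

import pandas as pd

# The comma inside "gz,city" is protected by the quote characters;
# "" inside the second field collapses to a single "
data = 'id,name\n1,"gz,city"\n2,"say ""hi"""\n'

df = pd.read_csv(io.StringIO(data), quotechar='"')
print(df['name'].tolist())  # ['gz,city', 'say "hi"']
```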
escapechar
escapechar: str (length 1), default None
A one-character string used to escape the delimiter when quoting is QUOTE_NONE.
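A minimal sketch of escapechar with quoting disabled; the inline data is illustrative.

```python
import csv
import io

import pandas as pd

# With quoting=csv.QUOTE_NONE, a backslash escapes the delimiter,
# so '\,' in the data is kept as a literal comma inside the field
data = 'id,name\n1,gz\\,city\n'

df = pd.read_csv(io.StringIO(data), quoting=csv.QUOTE_NONE, escapechar='\\')
print(df['name'].iloc[0])  # gz,city
```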
comment
comment: str, default None
Indicates that the remainder of the line should not be parsed. If the character is found at the beginning of a line, the whole line is ignored entirely.
This parameter must be a single character. Like empty lines (when skip_blank_lines=True), fully commented lines are ignored by header and skiprows. For example, with comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 returns a result with 'a,b,c' as the header.
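The example in the paragraph above can be run directly:

```python
import io

import pandas as pd

# With comment='#' the '#empty' line is fully ignored,
# so header=0 picks up 'a,b,c' as the header row
df = pd.read_csv(io.StringIO('#empty\na,b,c\n1,2,3'), comment='#', header=0)
print(list(df.columns))  # ['a', 'b', 'c']
```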
encoding
encoding: str, default None
Specifies the character encoding, commonly 'utf-8'. See the list of Python standard encodings.
dialect
dialect: str or csv.Dialect instance, default None
If provided, specifies a particular CSV dialect, overriding the individual quoting-related parameters. See the csv.Dialect documentation for details.
error_bad_lines
error_bad_lines: boolean, default True
Lines with too many fields will by default raise an exception, and no DataFrame is returned. If set to False, these bad lines are dropped from the resulting DataFrame (only valid with the C parser).
We intentionally corrupt one cell value in test.csv by adding an extra token (separated by spaces, since whitespace is our data file's delimiter):
In [148]: cat test.csv
id id age label date
1 'gz' 10.8 YES 1,090,001
2 'lh' 12.31 NO O 20,010
Reading the data file now raises an exception:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
With a small file this error is discovered quickly, but with a large file, if the read takes an hour and the error only appears in the last few lines, that hurts. To be safe, we generally set error_bad_lines to False, dropping the bad lines, while setting warn_bad_lines (covered in the next section) to True so that each rejected line is reported.
In [150]: df = pd.read_csv('test.csv',sep='\s+',error_bad_lines=False)
b'Skipping line 3: expected 5 fields, saw 6\n'
In [151]: df
Out[151]:
id id.1 age label date
0 1 'gz' 10.8 YES 1,090,001
We can see the warning in the output: Skipping line 3: expected 5 fields, saw 6
warn_bad_lines
warn_bad_lines: boolean, default True
If error_bad_lines=False and warn_bad_lines=True, every "bad line" that gets skipped is reported (only valid with the C parser).
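Note: in pandas 1.3+, error_bad_lines and warn_bad_lines were deprecated in favor of a single on_bad_lines parameter. A minimal sketch of the newer spelling, with inline data whose third line has too many fields:

```python
import io

import pandas as pd

data = "a,b\n1,2\n3,4,5\n"  # the '3,4,5' line has too many fields

# on_bad_lines='skip' drops the bad line; 'warn' would also report it,
# and 'error' (the default) raises an exception
df = pd.read_csv(io.StringIO(data), on_bad_lines='skip')
print(len(df))  # 1
```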
The tupleize_cols parameter is deprecated, so it is not covered here.
That covers all the read_csv parameters, with demonstrations, for reading CSV files.
This topic is produced by the Python and Algorithms Community public account; please credit the source when reposting.
Python and Algorithms Community