pandas read_csv

Reads a comma-separated values (CSV) file into a DataFrame.

Also supports optionally reading only a portion of the file, or iterating over it in chunks.

See more help: http://pandas.pydata.org/pandas-docs/stable/io.html

Parameters:

filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)

The string can also be a URL; valid URL schemes include http, ftp, s3, and file. Support for reading multiple files at once is planned.

For a local file, the URL form is: file://localhost/path/to/table.csv
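
For example, a minimal sketch of the basic call (an io.StringIO buffer stands in for a path, URL, or open file handle; the column names are invented for illustration):

```python
import io
import pandas as pd

# An in-memory buffer stands in for a local path, URL, or open file handle.
csv_data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

df = pd.read_csv(csv_data)
print(df)
#    a  b  c
# 0  1  2  3
# 1  4  5  6
```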

sep : str, default ‘,’

Specifies the delimiter to use. If not given, the parser tries a comma. Separators longer than one character and different from '\s+' are interpreted as regular expressions, force the use of the Python parsing engine, and will ignore quotes in the data. Regular-expression example: '\r\t'.

delimiter : str, default None

An alternative way to specify the delimiter (if this parameter is given, sep is not used).

delim_whitespace : boolean, default False

Specifies whether whitespace (e.g. ' ' or '\t') is used as the delimiter; equivalent to setting sep='\s+'. If this option is set to True, the delimiter parameter should not be given.

New in version 0.18.1.
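
As a rough sketch (with made-up whitespace-separated data), delim_whitespace=True and sep='\s+' behave the same:

```python
import io
import pandas as pd

text = "a  b   c\n1  2   3\n4  5   6\n"

# Treat runs of whitespace as the separator via a regular expression.
df1 = pd.read_csv(io.StringIO(text), sep=r"\s+")

# Equivalent shortcut: delim_whitespace=True.
df2 = pd.read_csv(io.StringIO(text), delim_whitespace=True)

print(df1.equals(df2))  # True
```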

header : int or list of ints, default ‘infer’

Row number(s) to use as the column names, and the start of the data. If no names are passed, this defaults to 0 (the first line holds the column names); if column names are passed explicitly, it behaves like header=None. Explicitly passing header=0 replaces any existing column names. header can also be a list of ints, e.g. [0, 1, 3]: the listed rows become a multi-level column header and intervening rows are skipped (in this example row 2 is skipped, so the data starts on the fifth line of the file).

Note: if skip_blank_lines=True, this parameter ignores comment lines and blank lines, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None

List of column names to use for the result. If the data file has no header row, you should also pass header=None. Duplicate values in this list are not allowed unless mangle_dupe_cols=True is set.
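
A small sketch (invented data) showing how header and names interact:

```python
import io
import pandas as pd

# File with a header row: the first line becomes the column names.
with_header = io.StringIO("x,y\n1,2\n3,4\n")
df1 = pd.read_csv(with_header)

# File without a header row: pass header=None and supply names explicitly.
no_header = io.StringIO("1,2\n3,4\n")
df2 = pd.read_csv(no_header, header=None, names=["x", "y"])

print(list(df1.columns))  # ['x', 'y']
print(list(df2.columns))  # ['x', 'y']
```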

index_col : int or sequence or False, default None

Column(s) to use as the row labels of the DataFrame, given as a column number or column name; if a sequence is given, a MultiIndex is built from the listed columns.

If the file is malformed, with delimiters at the end of each line, setting index_col=False forces pandas not to use the first column as the row index.

usecols : array-like, default None

Return a subset of the columns. The values in the list may be either positional indices into the file's columns or strings matching the column names, e.g. usecols=[0, 1, 2] or usecols=['foo', 'bar', 'baz']. Using this parameter results in faster parsing and lower memory usage.
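
For instance, a sketch (made-up columns) that loads only two columns and uses one of them as the row index via index_col:

```python
import io
import pandas as pd

raw = io.StringIO("id,foo,bar,baz\n1,10,20,30\n2,40,50,60\n")

# Keep only 'id' and 'bar', and use 'id' as the row index.
df = pd.read_csv(raw, usecols=["id", "bar"], index_col="id")
print(df)
#     bar
# id
# 1    20
# 2    50
```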

as_recarray : boolean, default False

Deprecated: this parameter will be removed in a future version. Use pd.read_csv(...).to_records() instead.

Return a NumPy recarray instead of a DataFrame. If set to True it takes precedence over the squeeze parameter; row indexes are not available in this format, so the index_col parameter is ignored.

squeeze : boolean, default False

If the parsed data contains only one column, return a Series.

prefix : str, default None

Prefix to add to column numbers when there is no header, e.g. 'X' produces X0, X1, ...

mangle_dupe_cols : boolean, default True

Duplicate columns 'X', ..., 'X' are renamed 'X.0', ..., 'X.N'. If set to False, data in duplicate columns will overwrite each other.

dtype : Type name or dict of column -> type, default None

Data type to use for the data or for individual columns, e.g. {'a': np.float64, 'b': np.int32}.
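
A brief sketch of forcing column dtypes at read time (column names invented):

```python
import io
import numpy as np
import pandas as pd

raw = io.StringIO("a,b\n1,2\n3,4\n")

df = pd.read_csv(raw, dtype={"a": np.float64, "b": np.int32})
print(df.dtypes)
# a    float64
# b      int32
```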

engine : {‘c’, ‘python’}, optional

Parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.

converters : dict, default None

Dict of functions for converting values in certain columns. Keys can be either column names or column numbers.
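
For example, a sketch (invented columns) that cleans one column and rescales another while parsing:

```python
import io
import pandas as pd

raw = io.StringIO("name,cents\n  widget ,100\n gadget  ,250\n")

# Keys of the converters dict can be column names or column numbers;
# each function receives the raw string value of a cell.
df = pd.read_csv(
    raw,
    converters={
        "name": str.strip,
        "cents": lambda v: int(v) / 100.0,
    },
)
print(df)
#      name  cents
# 0  widget    1.0
# 1  gadget    2.5
```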

true_values : list, default None

Values to consider as True

false_values : list, default None

Values to consider as False

skipinitialspace : boolean, default False

Skip whitespace after the delimiter (default is False, i.e. do not skip).

skiprows : list-like or integer, default None

Number of lines to skip at the start of the file, or a list of line numbers (0-indexed) to skip.

skipfooter : int, default 0

Number of lines at the bottom of the file to skip (not supported by the C engine).

skip_footer : int, default 0

Deprecated: use skipfooter instead; it does the same thing.

nrows : int, default None

Number of rows of the file to read (counted from the start of the data).
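
A quick sketch (made-up file content) combining skiprows and nrows:

```python
import io
import pandas as pd

raw = io.StringIO("generated by some tool\na,b\n1,2\n3,4\n5,6\n")

# Skip the first physical line, then read at most two data rows.
df = pd.read_csv(raw, skiprows=1, nrows=2)
print(df)
#    a  b
# 0  1  2
# 1  3  4
```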

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If a dict is passed, it specifies per-column NA values. By default the following values are interpreted as NaN: '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.

keep_default_na : bool, default True

If na_values is specified and keep_default_na=False, the default NaN values are replaced by the ones you supply; otherwise the supplied values are added to the defaults.
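
For instance, a sketch (invented data) adding a custom NA marker on top of the defaults:

```python
import io
import pandas as pd

raw = io.StringIO("a,b\n1,missing\nNA,4\n")

# 'missing' is added to the default NA markers ('NA', 'NULL', 'NaN', ...).
df = pd.read_csv(raw, na_values=["missing"])
print(df)
#      a    b
# 0  1.0  NaN
# 1  NaN  4.0
```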

na_filter : boolean, default True

Detect missing-value markers (empty strings and the NA values). For large files with no missing values, setting na_filter=False can speed up reading.

verbose : boolean, default False

Print extra parser output, e.g. the number of NA values in non-numeric columns.

skip_blank_lines : boolean, default True

If True, skip blank lines rather than interpreting them as NaN values.

parse_dates : boolean or list of ints or names or list of lists or dict, default False

  • boolean. If True -> parse the index as dates.
  • list of ints or names, e.g. [1, 2, 3] -> parse columns 1, 2 and 3 each as a separate date column.
  • list of lists, e.g. [[1, 3]] -> combine columns 1 and 3 and parse the result as a single date column.
  • dict, e.g. {'foo': [1, 3]} -> combine columns 1 and 3, parse as a date, and name the resulting column 'foo' (as sketched below).
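
A sketch of two of these forms (columns invented):

```python
import io
import pandas as pd

raw = "date,year,month,day,value\n2019-08-07,2019,8,7,10\n"

# Simple form: parse the named column as datetime.
df1 = pd.read_csv(io.StringIO(raw), parse_dates=["date"])
print(df1["date"].dtype)  # datetime64[ns]

# Dict form: combine year/month/day into a single parsed column named 'ymd'.
df2 = pd.read_csv(io.StringIO(raw), parse_dates={"ymd": ["year", "month", "day"]})
print(df2["ymd"].iloc[0])  # 2019-08-07 00:00:00
```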

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas attempts to infer the datetime format of the strings; if a format can be inferred, it switches to a faster parsing method, in some cases 5 to 10 times faster.

keep_date_col : boolean, default False

If multiple columns are combined to parse a date, keep the original columns as well. Default is False.

date_parser : function, default None

Function to use for parsing dates; by default dateutil.parser.parser is used for the conversion. pandas tries to call date_parser in three different ways, advancing to the next if an exception occurs (a short sketch follows the list):

1. Pass one or more arrays (as specified by parse_dates) as arguments;

2. Concatenate the string values from multiple columns into a single array (row-wise) and pass that;

3. Call date_parser once per row with one or more strings (as specified by parse_dates) as arguments.
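
For example, a sketch (format string invented) that passes a custom parser for day-first dates:

```python
import io
import pandas as pd

raw = io.StringIO("when,value\n07/08/2019,1\n08/08/2019,2\n")

# date_parser receives the raw strings of the 'when' column (way 1 above).
df = pd.read_csv(
    raw,
    parse_dates=["when"],
    date_parser=lambda s: pd.to_datetime(s, format="%d/%m/%Y"),
)
print(df["when"].dt.month.tolist())  # [8, 8]
```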

dayfirst : boolean, default False

Parse dates with the day first, i.e. DD/MM format.

iterator : boolean, default False

Return a TextFileReader object for iterating over the file or reading it in chunks.

chunksize : int, default None

Number of rows per chunk; see the IO Tools docs for more information on iterator and chunksize.
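
A small sketch (made-up rows) of reading in chunks rather than all at once:

```python
import io
import pandas as pd

raw = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# Returns a TextFileReader; each iteration yields a DataFrame of up to 3 rows.
for chunk in pd.read_csv(raw, chunksize=3):
    print(len(chunk))  # 3, 3, 3, 1 across the four iterations
```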

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

Decompress on-disk data on the fly. With 'infer', gzip, bz2, zip or xz is used when the file name ends in '.gz', '.bz2', '.zip' or '.xz' respectively; otherwise no decompression is done. If 'zip' is used, the ZIP archive must contain exactly one data file. Set to None for no decompression.

New in version 0.18.1: support for 'zip' and 'xz'.

thousands : str, default None

Thousands separator, e.g. ',' or '.'.

decimal : str, default ‘.’

Character to recognize as the decimal point (e.g. ',' for European data).
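
As an illustration (numbers invented), reading European-style formatting where '.' separates thousands and ',' marks the decimal point:

```python
import io
import pandas as pd

raw = io.StringIO("amount\n1.234,56\n7.890,12\n")

df = pd.read_csv(raw, thousands=".", decimal=",")
print(df["amount"].tolist())  # [1234.56, 7890.12]
```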

float_precision : string, default None

Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.

lineterminator : str (length 1), default None

Character used to break the file into lines; only valid with the C parser.

quotechar : str (length 1), optional

The character used to denote the start and end of a quoted item; delimiters inside quoted fields are ignored.

quoting : int or csv.QUOTE_* instance, default 0

Control field quoting behavior per the csv.QUOTE_* constants: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

doublequote : boolean, default True

When quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field should be interpreted as a single (literal) quotechar.

escapechar : str (length 1), default None

One-character string used to escape the delimiter when quoting is QUOTE_NONE.

comment : str, default None

Indicates that the remainder of a line should not be parsed. If the character is found at the beginning of a line, the whole line is ignored. This parameter must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by header and skiprows. For example, with comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 results in 'a,b,c' being treated as the header.
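
A sketch of exactly that case:

```python
import io
import pandas as pd

raw = io.StringIO("#empty\na,b,c\n1,2,3\n")

# The fully commented first line is skipped, so header=0 picks up 'a,b,c'.
df = pd.read_csv(raw, comment="#", header=0)
print(df.columns.tolist())  # ['a', 'b', 'c']
```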

encoding : str, default None

Character encoding to use, commonly 'utf-8'. See the list of Python standard encodings.

dialect : str or csv.Dialect instance, default None

If no dialect is given, the default (Excel) behavior is used. This parameter is ignored when sep is longer than one character. See the csv.Dialect documentation for details.

tupleize_cols : boolean, default False

Leave a list of tuples on columns as-is (the default is to convert the list of tuples into a MultiIndex on the columns).

error_bad_lines : boolean, default True

Lines with too many fields will by default cause an exception to be raised and no DataFrame will be returned. If set to False, these "bad lines" are dropped from the resulting DataFrame instead (only valid with the C parser).

warn_bad_lines : boolean, default True

If error_bad_lines=False and warn_bad_lines=True, a warning is printed for each "bad line" (only valid with the C parser).
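
A sketch (made-up data; note that in newer pandas versions these flags are replaced by on_bad_lines) of dropping a malformed row instead of raising:

```python
import io
import pandas as pd

raw = io.StringIO("a,b\n1,2\n3,4,5\n6,7\n")

# The row with three fields is skipped and a warning is printed.
df = pd.read_csv(raw, error_bad_lines=False, warn_bad_lines=True)
print(len(df))  # 2
```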

low_memory : boolean, default True

Internally process the file in chunks, which lowers memory use during parsing but may result in mixed type inference. To ensure consistent types, either set low_memory=False or specify the types explicitly with the dtype parameter. Note that the whole file is still read into a single DataFrame; use the iterator or chunksize parameter to get the data back in chunks (only valid with the C parser).

buffer_lines : int, default None

Deprecated: this parameter will be removed in a future version because its value is no longer respected by the parser.

compact_ints : boolean, default False

Deprecated: this parameter will be removed in a future version.

If compact_ints=True, any column with an integer dtype is stored using the smallest integer type that fits; whether a signed or unsigned type is used depends on the use_unsigned parameter.

use_unsigned : boolean, default False

Deprecated: this parameter will be removed in a future version.

If integer columns are being compacted (i.e. compact_ints=True), specifies whether the compacted columns should be signed or unsigned.

memory_map : boolean, default False

If a file path is given for filepath_or_buffer, map the file directly into memory and access the data from there. Using this option avoids further file I/O.

Error Handling

1. Reading a file whose delimiter is a repeated character ('~~')

File content

```
{product code}[delimiter]"~~"
// Format of each line:
// code~~name~~short code~~goods tax item~~tax rate~~specification/model~~unit of measure~~unit price~~tax-included price flag~~hidden flag~~Sino-foreign cooperative oil/gas field~~tax classification code~~eligible for preferential policy~~tax classification code name~~preferential policy type~~zero-tax-rate flag~~code version number
001~~服务费~~~~~~0.06~~~~次~~0~~False~~0000000000~~False~~304060399~~否~~其他咨询服务~~~~~~33.0
002~~咨询服务费~~~~~~0.06~~~~次~~0~~False~~0000000000~~False~~304060299~~否~~其他鉴证服务~~~~~~33.0
```

Parameter settings

```python
df = pd.read_csv(path, sep='~~', encoding='gbk', header=2, skipinitialspace=True, engine='python')
# The key setting in this case is skipinitialspace=True
```

**skipinitialspace** : boolean, default False
Skip whitespace after the delimiter (default is False, i.e. do not skip).

Result

// coding name Short Code Some items tax rate Specifications Model unit of measurement unit price Tax price mark Hide logo Sino-foreign cooperation in oil and gas fields Tax Classification Code Whether to enjoy preferential policies Tax classification code name The type of incentives Identifies zero tax rate Encoded version number
0 1 Service charges NaN NaN 0.06 NaN Secondary 0 False 0 False 304060399 no Other Consulting Services NaN NaN 33.0
1 2 Consulting services NaN NaN 0.06 NaN Secondary 0 False 0 False 304060299 no Other Assurance Services NaN NaN 33.0

Source: www.cnblogs.com/jokerBi/p/11314957.html