collections
Count occurrences of list elements
from collections import Counter
test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]
counter = Counter(test_list)
# Returns: Counter({1: 7, 2: 5, 3: 3})
value = counter[2]
# Returns: 5
# The above is effectively equivalent to:
counter = {i: test_list.count(i) for i in set(test_list)}
# Returns: {1: 7, 2: 5, 3: 3}
Get the top N most common elements
from collections import Counter
test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]
counter = Counter(test_list)
result = counter.most_common(2)  # top 2 most common elements
# Returns: [(1, 7), (2, 5)]
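Counter accepts any iterable of hashable items, not just lists — a string works the same way. A minimal sketch:

```python
from collections import Counter

# Count the letters of a string; Counter treats it as an iterable of characters
letter_counts = Counter("mississippi")
print(letter_counts.most_common(2))
# → [('i', 4), ('s', 4)]
```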
Subtract element counts
from collections import Counter
test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)
test1.subtract(test2)  # modifies test1 in place
# Results:
# test1: Counter({'a': 3, 'b': 0, 'c': -3, 'e': -4, 'd': -6})
# test2: Counter({'d': 4, 'e': 4, 'c': 3, 'b': 2, 'a': 1})
Perform arithmetic operations on Counters
from collections import Counter
test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)
result1 = test1 + test2  # addition: values of common keys are added; keys unique to either side are kept
result2 = test1 - test2  # subtraction: values of common keys are subtracted (missing keys count as 0); only keys with positive values are kept
result3 = test1 & test2  # intersection: common keys only, taking the smaller value
result4 = test1 | test2  # union: all keys, taking the larger value for common keys
# Results:
# result1: Counter({'a': 5, 'b': 4, 'e': 4, 'c': 3, 'd': 2})
# result2: Counter({'a': 3})
# result3: Counter({'b': 2, 'a': 1})
# result4: Counter({'a': 4, 'd': 4, 'e': 4, 'c': 3, 'b': 2})
defaultdict
Get a key that does not exist in a dict
from collections import defaultdict
test = defaultdict(str)
test['key1'] = '1'
test['key2'] = '2'
# Accessing a missing key initializes it with the empty value of the factory type:
# str -> "" | int -> 0 | list -> list() | dict -> dict() | set -> set() | tuple -> tuple()
v = test['medusa']
# Returns:
# v: ""
# test: defaultdict(<class 'str'>, {'key1': '1', 'key2': '2', 'medusa': ''})
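defaultdict shines when accumulating into containers: with list as the factory, each missing key starts as an empty list, so grouping needs no existence check. A small sketch (the word list is made up for illustration):

```python
from collections import defaultdict

# Group words by their first letter; a missing key is initialized to []
words = ["apple", "avocado", "banana", "blueberry", "cherry"]
groups = defaultdict(list)
for word in words:
    groups[word[0]].append(word)

print(dict(groups))
# → {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```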
deque
A queue with a specified maximum length
# First-In-First-Out (FIFO)
from collections import deque
my_queue = deque(maxlen=10)
for i in range(10):
    my_queue.append(i + 1)
print(my_queue)
# Output: deque([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], maxlen=10)
print(my_queue.popleft())
# Output: 1
for i in range(5):
    my_queue.append(i + 1)
print(my_queue)
# Output: deque([6, 7, 8, 9, 10, 1, 2, 3, 4, 5], maxlen=10)
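Because a deque with maxlen silently drops the oldest element on append, it makes a natural sliding window. A sketch computing a moving average (the function name is my own):

```python
from collections import deque

def moving_average(values, n):
    # The window holds at most n values; appending drops the oldest automatically
    window = deque(maxlen=n)
    averages = []
    for v in values:
        window.append(v)
        averages.append(sum(window) / len(window))
    return averages

print(moving_average([1, 2, 3, 4, 5], 3))
# → [1.0, 1.5, 2.0, 3.0, 4.0]
```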
namedtuple
Access tuple elements by name
from collections import namedtuple
# Create a template named Person with the fields: name | description | forever | size
Person = namedtuple('Person', 'name description forever size')
# Create two instances from the template; the two methods are equivalent
Medusa = Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
You = Person._make(['You', '...', True, 'Max'])
print(Medusa)
print(You)
# Output:
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# You: Person(name='You', description='...', forever=True, size='Max')
# "Modifying" a field actually creates a new object; the original is unchanged
update_Medusa = Medusa._replace(description='https://juejin.cn/user/2805609406139950')
print(Medusa)
print(update_Medusa)
# Output:
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# update_Medusa: Person(name='Medusa', description='https://juejin.cn/user/2805609406139950', forever=True, size='Max')
# Convert to a dictionary
print(Medusa._asdict())
# Output: OrderedDict([('name', 'Medusa'), ('description', 'Medusa blog'), ('forever', True), ('size', 'Max')])
# (Python 3.8+ returns a plain dict instead of an OrderedDict)
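Since Python 3.7, namedtuple also accepts a defaults argument; the defaults are applied to the rightmost fields. A minimal sketch:

```python
from collections import namedtuple

# defaults fill the rightmost fields, here description and forever
Person = namedtuple('Person', 'name description forever',
                    defaults=['(no description)', True])

p = Person('Medusa')
print(p)
# → Person(name='Medusa', description='(no description)', forever=True)
```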
pandas + numpy
Read and write file data
import pandas as pd
df = pd.read_csv('csv_name.csv', header=1)   # read_csv already returns a DataFrame
df = pd.read_excel('xlsx_name.xlsx')
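The read/write pairs below are symmetric, so a DataFrame can round-trip through any of them. A sketch using an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Write a DataFrame to CSV in memory, then read it back
df = pd.DataFrame({'name': ['alice', 'bob'], 'score': [1.5, 2.5]})
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False drops the row-index column

buf.seek(0)
df2 = pd.read_csv(buf)
print(df2.equals(df))
# → True
```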
| Read | Write |
| --- | --- |
| read_csv | to_csv |
| read_excel | to_excel |
| read_hdf | to_hdf |
| read_sql | to_sql |
| read_json | to_json |
| read_msgpack (experimental) | to_msgpack (experimental) |
| read_html | to_html |
| read_gbq (experimental) | to_gbq (experimental) |
| read_stata | to_stata |
| read_sas | - |
| read_clipboard | to_clipboard |
| read_pickle | to_pickle |
read_csv parameter description
- `filepath_or_buffer` (str): a path string or any readable file object; URLs are also supported
- `sep` (str): the field delimiter, "," by default. A delimiter longer than one character (other than "\s+") is treated as a regular expression and forces the Python parser; note that regex delimiters are prone to ignoring quoted data
- `delimiter` (str): an alternative name for the delimiter; if specified, the sep parameter is ignored
- `delim_whitespace` (bool): use whitespace as the delimiter, equivalent to sep="\s+"; if True, the delimiter parameter is ignored
- `header` (int or list of ints): the row number(s) to use as column names. Defaults to 0 if the file has a header row, otherwise set it to None. Explicitly passing header=0 replaces any existing column names. A list of ints makes those rows column headers (a MultiIndex on the columns), and the rows in between are skipped. Note: with skip_blank_lines=True, comment lines and blank lines are not counted, so header=0 means the first line of data, not the first line of the file
- `names` (array-like): the list of column names to use. If the file has no header row, also pass header=None. Names may not contain duplicates unless mangle_dupe_cols=True
- `index_col` (int or sequence or False): column(s) to use as the row index; a sequence produces a MultiIndex. If the file is malformed with a delimiter at the end of each row, set index_col=False so pandas does not use the first column as the index
- `usecols` (array-like): return a subset of the columns. Values must be either positional indices (e.g. [0, 1, 2]) or column names from the file (e.g. ['foo', 'bar', 'baz']). Using this parameter speeds up loading and reduces memory consumption
- `as_recarray` (bool): deprecated and will be removed in a future version; use pd.read_csv(...).to_records() instead. Returns a NumPy recarray instead of a DataFrame. If True, it takes precedence over squeeze, the row index is not available, and the index column is ignored
- `squeeze` (bool): if the parsed data contains only one column, return a Series
- `prefix` (str): prefix to add to column numbers when there is no header
- `mangle_dupe_cols` (bool): rename duplicate columns as "X.0"…"X.N" instead of overwriting them
- `dtype` (type name or dict of column -> type): data type(s) for the columns
- `engine` ("c" or "python"): parser engine; the C engine is faster, while the Python engine is more feature-complete
- `converters` (dict): functions for converting values in certain columns; keys can be column names or column indices
- `true_values` (list): values to consider as True
- `false_values` (list): values to consider as False
- `skipinitialspace` (bool): skip whitespace after the delimiter
- `skiprows` (list-like or int): line numbers to skip, or the number of lines to skip at the start of the file
- `skipfooter` (int): number of lines to skip at the end of the file
- `skip_footer` (int): same as skipfooter (deprecated)
- `nrows` (int): number of rows to read from the start of the file
- `na_values` (scalar, str, list-like, or dict): additional values to recognize as NA/NaN; pass a dict to specify per-column values. By default values such as "1.#IND", "1.#QNAN", "N/A", "NA", "NULL", "NaN", "nan" are recognized
- `keep_default_na` (bool): if na_values is specified and keep_default_na=False, the default NaN values are replaced; otherwise na_values is added to them
- `na_filter` (bool): whether to detect missing values (empty strings or NA values). For large files with no missing data, na_filter=False speeds up reading
- `verbose` (bool): whether to print extra parser output
- `skip_blank_lines` (bool): if True, skip blank lines rather than recording them as NaN
- `parse_dates` (bool or list of ints or names or list of lists or dict):
  - True: parse the index
  - a list of ints or names (e.g. [1, 2, 3]): parse columns 1, 2 and 3 each as a separate date column
  - a list of lists (e.g. [[1, 3]]): combine columns 1 and 3 and parse them as a single date column
  - a dict (e.g. {"foo": [1, 3]}): combine columns 1 and 3, parse them as a date, and name the combined column "foo"
- `infer_datetime_format` (bool): if True and parse_dates is enabled, pandas attempts to infer the datetime format of the strings and, if successful, switches to a faster parsing method; in some cases this is 5-10x faster
- `keep_date_col` (bool): if parse_dates combines multiple columns, keep the original columns as well
- `date_parser` (function): the function used to parse dates; defaults to dateutil.parser.parser. Pandas tries to call it in three different ways, advancing to the next on failure:
  - with one or more arrays (as specified by parse_dates) as arguments
  - with the string values of the specified columns concatenated into a single array
  - once per row, with one or more strings (as specified by parse_dates) as arguments
- `dayfirst` (bool): parse dates as DD/MM format
- `iterator` (bool): return a TextFileReader object for processing the file chunk by chunk
- `chunksize` (int): the number of rows per chunk
- `compression` ("infer", "gzip", "bz2", "zip", "xz" or None): read a compressed file on disk directly; with "infer", files with a recognized suffix are decompressed accordingly
- `thousands` (str): thousands separator
- `decimal` (str): character to recognize as the decimal point
- `float_precision` (str): which converter the C engine uses for floating-point values: None for the ordinary converter, "high" for the high-precision converter, "round_trip" for the round-trip converter
- `lineterminator` (str): line separator; only used with the C parser
- `quotechar` (str): the character used to mark the start and end of a quoted item; delimiters inside quotes are ignored
- `quoting` (int or csv.QUOTE_* instance): quoting behavior per the csv module: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3)
- `doublequote` (bool): when quotechar is set and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field represent a single quotechar
- `escapechar` (str): a character used to escape the delimiter when quoting is QUOTE_NONE
- `comment` (str): marks the remainder of a line as not to be parsed; if found at the beginning of a line, the whole line is ignored. Must be a single character. Like blank lines (when skip_blank_lines=True), comment lines are ignored by header and skiprows, so parsing "#empty\na,b,c\n1,2,3" with comment='#' and header=0 yields 'a,b,c' as the header
- `encoding` (str): the character encoding, commonly 'utf-8'
- `dialect` (str or csv.Dialect instance): the csv dialect to use; if provided, it overrides the related formatting parameters such as delimiter, quotechar and quoting
- `tupleize_cols` (bool): leave a list of tuples on the columns as is (the default is to convert them to a MultiIndex on the columns)
- `error_bad_lines` (bool): by default, a row with too many fields raises an error and no DataFrame is returned; if False, such "bad lines" are dropped instead (C parser only)
- `warn_bad_lines` (bool): if error_bad_lines=False and warn_bad_lines=True, a warning is printed for each "bad line" (C parser only)
- `low_memory` (bool): process the file in chunks internally to reduce memory use while parsing, at the risk of mixed-type columns; set it to False, or specify the type with the dtype parameter, to avoid this. Note that the chunksize or iterator parameter reads the whole file in chunks regardless (C parser only)
- `buffer_lines` (int): deprecated; will be removed in a future version because its value is ignored by the parser
- `compact_ints` (bool): deprecated; if True, any integer column is stored in the smallest possible integer dtype, signed or unsigned depending on use_unsigned
- `use_unsigned` (bool): deprecated; when compact_ints=True, specifies whether compacted columns are unsigned
- `memory_map` (bool): map the file directly into memory and access it from there, avoiding further I/O
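Several of the parameters above combined in one call; the CSV content here is made up for illustration:

```python
import io

import pandas as pd

# A hypothetical in-memory CSV standing in for a file on disk
csv_data = io.StringIO(
    "date,name,score,comment\n"
    "2021-01-01,alice,1.5,ok\n"
    "2021-01-02,bob,2.5,fine\n"
)

df = pd.read_csv(
    csv_data,
    usecols=['date', 'name', 'score'],  # load only a subset of the columns
    dtype={'score': float},             # fix the dtype of a column up front
    parse_dates=['date'],               # parse the date column to datetime64
)
print(df.dtypes)
```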