Practical Tips for Python Libraries

 

collections

Count occurrences of list elements

from collections import Counter

test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]

counter = Counter(test_list)
# Result: Counter({1: 7, 2: 5, 3: 3})

value = counter[2]
# Result: 5

# In fact, the Counter usage above is equivalent to the following code:
counter = {i: test_list.count(i) for i in set(test_list)}
# Result: {1: 7, 2: 5, 3: 3}

Get the top N most common elements of a list

from collections import Counter

test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]

counter = Counter(test_list)
result = counter.most_common(2)  # Top 2 most common elements
# Result: [(1, 7), (2, 5)]

Subtract counts between two Counters

from collections import Counter

test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)
test1.subtract(test2)
# Result:
# test1: Counter({'a': 3, 'b': 0, 'c': -3, 'e': -4, 'd': -6})
# test2: Counter({'d': 4, 'e': 4, 'c': 3, 'b': 2, 'a': 1})

Arithmetic and set operations on Counters

from collections import Counter

test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)

result1 = test1 + test2  # Counter addition: values of common keys are added, unique keys are kept
result2 = test1 - test2  # Counter subtraction: values of common keys are subtracted (missing keys count as 0); only keys with positive values are kept
result3 = test1 & test2  # Counter intersection: keep common keys, taking the smaller value
result4 = test1 | test2  # Counter union: keep all keys, taking the larger value for common keys
# Results:
# result1: Counter({'a': 5, 'b': 4, 'e': 4, 'c': 3, 'd': 2})
# result2: Counter({'a': 3})
# result3: Counter({'b': 2, 'a': 1})
# result4: Counter({'a': 4, 'd': 4, 'e': 4, 'c': 3, 'b': 2})

defaultdict

Access a key that does not exist in a dict

from collections import defaultdict

test = defaultdict(str)
test['key1'] = '1'
test['key2'] = '2'
# Accessing a missing key initializes it with the empty value of the factory type:
# str -> "" | int -> 0 | list -> list() | dict -> dict() | set -> set() | tuple -> tuple() 
v = test['medusa']

# Result:
# v: ""
# test: defaultdict(<class 'str'>, {'key1': '1', 'key2': '2', 'medusa': ''})
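Beyond returning empty defaults, defaultdict(list) is also a convenient way to group items. A minimal sketch; the word list below is made-up sample data for illustration only:

from collections import defaultdict

# Group words by their first letter; a missing key starts as an empty list
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']
groups = defaultdict(list)
for word in words:
    groups[word[0]].append(word)

print(groups)
# Output: defaultdict(<class 'list'>, {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']})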

deque

Queue with a fixed maximum length

# First-In-First-Out (FIFO)
from collections import deque

my_queue = deque(maxlen=10)

for i in range(10):
    my_queue.append(i+1)

print(my_queue)
# Output: deque([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], maxlen=10)

print(my_queue.popleft())
# Output: 1

for i in range(5):
    my_queue.append(i+1)
print(my_queue)
# Output: deque([6, 7, 8, 9, 10, 1, 2, 3, 4, 5], maxlen=10)

namedtuple

Tuples with named fields

from collections import namedtuple

# Create a data template named Person with the fields name | description | forever | size
Person = namedtuple('Person', 'name description forever size')

# Create two instances from the template; the two approaches are equivalent
Medusa = Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
You = Person._make(['You', '...', True, 'Max'])

print(Medusa)
print(You)
# Output:
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# You: Person(name='You', description='...', forever=True, size='Max')

# "Modifying" an attribute actually creates a new object
update_Medusa = Medusa._replace(description='https://juejin.cn/user/2805609406139950')
print(Medusa)
print(update_Medusa)
# Output:
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# update_Medusa: Person(name='Medusa', description='https://juejin.cn/user/2805609406139950', forever=True, size='Max')

# Convert to a dictionary
print(Medusa._asdict())
# Output: OrderedDict([('name', 'Medusa'), ('description', 'Medusa blog'), ('forever', True), ('size', 'Max')])
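Fields of a namedtuple instance can be read by attribute name or by position, just like a regular tuple. A short sketch reusing the Person template from above:

from collections import namedtuple

Person = namedtuple('Person', 'name description forever size')
Medusa = Person(name='Medusa', description='Medusa blog', forever=True, size='Max')

# Access by field name or by position
print(Medusa.name)  # Output: Medusa
print(Medusa[1])    # Output: Medusa blog

# An instance can also be unpacked like a regular tuple
name, description, forever, size = Medusa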

pandas + numpy

Read and write file data

import pandas as pd

# read_csv / read_excel already return a DataFrame, so the extra pd.DataFrame() wrapper is optional
df = pd.DataFrame(pd.read_csv('csv_name.csv', header=1))
df = pd.DataFrame(pd.read_excel('xlsx_name.xlsx'))
Read                          Write
read_csv                      to_csv
read_excel                    to_excel
read_hdf                      to_hdf
read_sql                      to_sql
read_json                     to_json
read_msgpack (experimental)   to_msgpack (experimental)
read_html                     to_html
read_gbq (experimental)       to_gbq (experimental)
read_stata                    to_stata
read_sas                      -
read_clipboard                to_clipboard
read_pickle                   to_pickle
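
A minimal round-trip sketch for one of the pairs above; the file name demo.csv and the sample data are placeholders:

import pandas as pd

df = pd.DataFrame({'name': ['Medusa', 'You'], 'size': ['Max', 'Max']})

# Write the DataFrame to CSV without the row index
df.to_csv('demo.csv', index=False)

# Read it back into a new DataFrame
df2 = pd.read_csv('demo.csv')
print(df2)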

read_csv parameter description

  • filepath_or_buffer (str): accepts a path string, a URL, or any readable file object
  • sep (str): field separator, defaults to ",". If the separator is longer than one character and is not "\s+", the Python parser is used and commas inside the data are ignored
  • delimiter (str): alternative separator; if this parameter is specified, the sep parameter is ignored
  • delim_whitespace (bool): whether to use whitespace as the separator, equivalent to setting sep="\s+"; if set to True, the delimiter parameter is ignored
  • header (int or list of ints): row number(s) to use as the column names. Defaults to 0 if no names are passed, otherwise None. Explicitly setting header=0 replaces any existing column names. A list of ints turns those rows into a multi-level header (each column gets multiple header values) and the rows in between are skipped. Note: when skip_blank_lines=True, this parameter ignores comment lines and blank lines, so header=0 means the first row of data rather than the first line of the file
  • names (array-like): list of column names to use. If the data file has no header row, header=None must also be set. Duplicate names are not allowed unless mangle_dupe_cols=True is set
  • index_col (int or sequence or False): column number or name to use as the row index; a sequence produces multiple row indices. If the file is irregular with a trailing separator at the end of each row, set index_col=False so pandas does not use the first column as the row index
  • usecols (array-like): return a subset of the columns. Values in the list must be either positional indices or column names from the file, for example [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter can speed up loading and reduce memory consumption
  • as_recarray (bool): deprecated, this parameter will be removed in a future version; use pd.read_csv(...).to_records() instead. Returns a NumPy recarray instead of a DataFrame. If set to True it takes precedence over the squeeze parameter, the row index is no longer available, and any index column is ignored
  • squeeze (bool): return a Series if the parsed data contains only one column
  • prefix (str): prefix to add to the generated column numbers when there is no header
  • mangle_dupe_cols (bool): rename duplicate columns as "X.0"..."X.N" instead of overwriting them
  • dtype (type name or dict of column -> type): data type for each column
  • engine ("c" or "python"): parser engine to use; the C engine is faster, while the Python engine is more feature-complete
  • converters (dict): dictionary of value-conversion functions per column; keys can be column names or column indices
  • true_values (list): values to consider as True
  • false_values (list): values to consider as False
  • skipinitialspace (bool): skip whitespace after the delimiter
  • skiprows (list-like or integer): number of lines to skip at the start of the file, or a list of line numbers to skip
  • skipfooter (int): number of lines to skip at the end of the file
  • skip_footer (int): number of lines to skip at the end of the file (deprecated)
  • nrows (int): number of rows to read from the start of the file
  • na_values (scalar, str, list-like, or dict): additional values to treat as NA/NaN; pass a dict to specify per-column NA values. By default values such as "1.#IND", "1.#QNAN", "N/A", "NA", "NULL", "NaN" and "nan" are recognized
  • keep_default_na (bool): when na_values is specified, keep_default_na=False replaces the default NaN values, otherwise the custom values are added to the defaults
  • na_filter (bool): whether to check for missing values (empty strings or NA values); for large files with no missing values, setting na_filter=False can improve reading speed
  • verbose (bool): whether to print extra parser output
  • skip_blank_lines (bool): if True, skip blank lines instead of recording them as NaN
  • parse_dates (boolean or list of ints or names or list of lists or dict):
    • Passing True parses the index as dates
    • Passing a list of ints or names (e.g. [1, 2, 3]) parses columns 1, 2, and 3 each as a separate date column
    • Passing a list of lists (e.g. [[1, 3]]) combines columns 1 and 3 into a single date column
    • Passing a dict (e.g. {"foo": [1, 3]}) combines columns 1 and 3 and names the resulting column "foo"
  • infer_datetime_format (bool): if set to True and parse_dates is enabled, pandas tries to infer the datetime format and, when it can, switches to a faster parsing method; in some cases this is 5-10 times faster
  • keep_date_col (bool): when several columns are combined to parse a date, keep the original columns that took part in the combination
  • date_parser (function): function used to parse dates; dateutil.parser.parser is used by default. pandas tries three different calling conventions, moving on to the next one if an error occurs:
    • called with one or more arrays (as specified by parse_dates) as arguments
    • called with the string values of the columns specified by parse_dates concatenated into a single array
    • called once per row with one or more strings (as specified by parse_dates) as arguments
  • dayfirst (bool): parse dates in DD/MM format
  • iterator (bool): return a TextFileReader object so the file can be processed chunk by chunk
  • chunksize (int): number of rows per chunk
  • compression ("infer", "gzip", "bz2", "zip", "xz", or None): read a compressed file directly from disk; with "infer", the decompression method is chosen from the file extension
  • thousands (str): thousands separator
  • decimal (str): character to use as the decimal point
  • float_precision (str): which converter the C engine should use for floating-point values: None for the ordinary converter, "high" for the high-precision converter, "round_trip" for the round-trip converter
  • lineterminator (str): line separator, only valid with the C parser
  • quotechar (str): character marking the start and end of a quoted item; separators inside quotes are ignored
  • quoting (int or csv.QUOTE_* instance): control field quoting with the csv constants QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3)
  • doublequote (bool): when quotechar is defined and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field should be interpreted as a single quotechar
  • escapechar (str): character used to escape the separator when quoting is QUOTE_NONE
  • comment (str): marks the rest of a line as not to be parsed. If the character appears at the start of a line, the whole line is ignored; the parameter must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by header and skiprows. For example, parsing "#empty\na,b,c\n1,2,3" with comment='#' and header=0 returns 'a,b,c' as the header
  • encoding (str): character set to use, usually 'utf-8'
  • dialect (str or csv.Dialect instance): CSV dialect to use; ignored if not specified, and ignored when sep is longer than one character
  • tupleize_cols (bool): leave a list of tuples on the columns as is (the default is to convert them to a MultiIndex on the columns)
  • error_bad_lines (bool): rows with too many columns raise an error by default and no DataFrame is returned; if set to False, these "bad lines" are dropped (only valid with the C parser)
  • warn_bad_lines (bool): if error_bad_lines=False and warn_bad_lines=True, a warning is printed for each "bad line" (only valid with the C parser)
  • low_memory (bool): process the file internally in chunks to reduce memory use while parsing, at the risk of mixed type inference; to guarantee consistent types, set it to False or specify the types with the dtype parameter. Note that the entire file is still read into a single DataFrame regardless; use the chunksize or iterator parameter to get the data back in chunks (only valid with the C parser)
  • buffer_lines (int): deprecated, this parameter will be removed in a future version because its value is not respected by the parser
  • compact_ints (bool): deprecated, this parameter will be removed in a future version. If compact_ints=True, any column with an integer dtype is stored in the smallest possible integer type; whether it is signed depends on the use_unsigned parameter
  • use_unsigned (bool): deprecated, this parameter will be removed in a future version. When integer columns are compacted (compact_ints=True), specifies whether the compacted columns are signed or unsigned
  • memory_map (bool): if a file path is given, map the file directly into memory and read the data from there, avoiding any further I/O overhead
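
A minimal sketch that combines several of the parameters above; the file name sales.csv and its column names are placeholders, so adapt them to your own data:

import pandas as pd

df = pd.read_csv(
    'sales.csv',                          # placeholder path
    sep=',',                              # field separator
    header=0,                             # the first row holds the column names
    usecols=['date', 'city', 'amount'],   # load only a subset of columns
    dtype={'city': str},                  # force a column type
    na_values=['-'],                      # extra strings to treat as NaN
    parse_dates=['date'],                 # parse this column as datetime
    encoding='utf-8',
)

# Read a large file chunk by chunk with chunksize
for chunk in pd.read_csv('sales.csv', chunksize=10000):
    print(len(chunk))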
