Pandas DataFrame tips

Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy, which provides high-performance array and matrix operations, and it is used for data mining, data analysis, and data cleaning. Pandas is Python's core data analysis library, providing fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Its goal is to become an essential high-level building block for practical, real-world data analysis in Python. Its long-term goal is to become the most powerful and flexible open source data analysis tool available in any language, and after years of sustained effort it is moving steadily closer to that goal.

The tips below may help make everyday data processing with Pandas more efficient:

First create a DataFrame:

import pandas as pd
df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}]).T.reset_index()
df.columns = ['label','value']
df
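
As a cross-check, the same frame can also be built directly from a dictionary of columns. The sketch below assumes only pandas; the name direct is introduced here for illustration and is not part of the original post.

```python
import pandas as pd

# Construction used in this post: one row, transposed, index moved into a column.
df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}]).T.reset_index()
df.columns = ['label', 'value']

# Hypothetical alternative: build the two columns directly.
direct = pd.DataFrame({'label': ['a', 'b', 'c'], 'value': [1, 2, 3]})

# Both frames hold the same labels and values.
print(df.to_dict('list'))
print(direct.to_dict('list'))
```

The transposed form is convenient when the data already arrives as a single dictionary; the direct form may be easier to read when the columns are known up front.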


1 Add a progress bar (tqdm)

from tqdm import tqdm
tqdm.pandas(desc="Process Bar")
df['pair'] = df.progress_apply(lambda x: (x['label'],x['value']), axis=1)
df

tqdm can also show progress for other kinds of loops. A simple example:

for i in tqdm(range(5)):
    print(i)


2 Expand one column into multiple columns

df[['label_2','_value_2']] = df.apply(lambda x: x['pair'], axis=1, result_type='expand')
df

Although pd.DataFrame and pd.Series both have an apply method and use it in similar ways, the axis and result_type parameters are only available on pd.DataFrame.apply.
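
Because Series.apply has no result_type parameter, a column of tuples has to be expanded another way when only the Series is at hand. The sketch below shows one possible approach, rebuilding a DataFrame from the list of tuples; the sample frame and the names expanded and from_series are assumptions made for illustration.

```python
import pandas as pd

# Sample frame shaped like the one in this post, with a column of tuples.
df = pd.DataFrame({'label': ['a', 'b', 'c'], 'value': [1, 2, 3]})
df['pair'] = pd.Series(list(zip(df['label'], df['value'])), index=df.index)

# DataFrame.apply: axis and result_type are accepted here.
expanded = df.apply(lambda row: row['pair'], axis=1, result_type='expand')

# Series route: no result_type, so build a DataFrame from the tuples instead.
from_series = pd.DataFrame(df['pair'].tolist(), index=df.index)

# Either result can be assigned to two new columns.
df[['label_2', '_value_2']] = from_series
```

Both routes produce two columns named 0 and 1; the Series route avoids a row-by-row apply, which may matter on larger frames.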

3 Parallel computing with dask

import dask.dataframe as dd
from dask.diagnostics import ProgressBar
ddf = df.copy()

with ProgressBar():  # add a progress bar
    ddf[['label_3', '_value_3']] = dd.from_pandas(ddf, npartitions=8).map_partitions(
        lambda part: part.apply(lambda x: x['pair'], axis=1, result_type="expand"),
        meta={0: 'str', 1: 'f8'},
    ).compute()
ddf

dask splits a pandas DataFrame into partitions that can be computed in parallel across multiple cores, making fuller use of CPU resources to speed up batch calculations.
① ProgressBar provides a progress bar for the calculation.
② npartitions is the number of partitions the DataFrame is split into. Each partition can be processed in parallel, so it is generally set to the number of CPU cores.
③ When the function passed to dask returns expanded columns (result_type="expand"), a meta dictionary is needed to declare the data type of each output column, for example str, int, or f8.

4 Garbled characters when saving CSV

When we save a DataFrame as a CSV file and it contains Chinese text, we often find garbled characters when we open the file. For example, after saving the DataFrame above with df.to_csv('data.csv') and opening the result, the text can come out unreadable. This problem is easily solved by specifying encoding='utf_8_sig' when saving the file. This encoding writes a UTF-8 byte order mark (BOM) at the start of the file, which lets programs such as Excel recognize the file as UTF-8.

df.to_csv('data.csv', encoding='utf_8_sig')
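
To see what the encoding changes, the sketch below writes a small frame and inspects the first bytes of the file. The sample data, the temporary path, and the names sample, has_bom, and restored are assumptions made for illustration and do not come from the original post.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data containing Chinese text.
sample = pd.DataFrame({'label': ['苹果', '香蕉'], 'value': [1, 2]})

# Write to a temporary location so no existing data.csv is touched.
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
sample.to_csv(path, index=False, encoding='utf_8_sig')

# utf_8_sig puts the UTF-8 byte order mark (EF BB BF) at the start of the file.
with open(path, 'rb') as f:
    has_bom = f.read(3) == b'\xef\xbb\xbf'

# Reading with the same encoding strips the BOM again.
restored = pd.read_csv(path, encoding='utf_8_sig')
print(has_bom, restored['label'].tolist())
```

Reading the file back with encoding='utf_8_sig' keeps the BOM out of the first column name.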


Origin blog.csdn.net/qq_40039731/article/details/130136930