Pandas is a powerful toolkit for analyzing structured data. Built on NumPy (which provides high-performance array operations), it is widely used for data mining and data analysis, and it also offers data-cleaning functionality. Pandas is Python's core data analysis library, providing fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Its goal is to be the essential high-level tool for practical data analysis in Python; its longer-term ambition is to become the most powerful and flexible open-source data analysis tool available in any language, and after years of sustained effort it keeps getting closer to that goal.
Below are some tips for processing data with Pandas that may help improve efficiency in day-to-day work:
First create a DataFrame:
import pandas as pd
df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}]).T.reset_index()
df.columns = ['label','value']
df
1 Adding a progress bar (tqdm)
from tqdm import tqdm
tqdm.pandas(desc="Process Bar")
df['pair'] = df.progress_apply(lambda x: (x['label'],x['value']), axis=1)
df
tqdm can also display progress for other kinds of loops; here is a simple example:
for i in tqdm(range(5)):
    print(i)
2 Expanding one column into multiple columns
df[['label_2','_value_2']] = df.apply(lambda x: x['pair'], axis=1, result_type='expand')
df
Although pd.DataFrame and pd.Series use the apply method in a similar way, the axis and result_type parameters apply only to pd.DataFrame.
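As a sketch of this difference (the column and variable names here are illustrative, not from the original example): DataFrame.apply can expand a tuple-valued result directly, while for a Series of tuples one common workaround is to build a new DataFrame from its values.

```python
import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'c'], 'value': [1, 2, 3]})
df['pair'] = df.apply(lambda x: (x['label'], x['value']), axis=1)

# DataFrame.apply accepts result_type='expand' and returns two columns
expanded = df.apply(lambda x: x['pair'], axis=1, result_type='expand')

# Series.apply has no result_type parameter; a Series of tuples can be
# expanded by building a DataFrame from its values instead
expanded2 = pd.DataFrame(df['pair'].tolist(), index=df.index)

print(expanded.equals(expanded2))  # True
```

Both approaches yield the same two-column DataFrame; the tolist() route is often faster for large data because it avoids a row-wise apply.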
3 Parallel computing with dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
ddf = df.copy()
with ProgressBar():  # show a progress bar for the computation
    ddf[['label_3', '_value_3']] = (
        dd.from_pandas(ddf, npartitions=8)
          .map_partitions(lambda df: df.apply(lambda x: x['pair'],
                                              axis=1, result_type="expand"),
                          meta={0: 'str', 1: 'f8'})
          .compute()
    )
ddf
dask lets a pandas DataFrame run batch computations on multiple cores in parallel, making full use of CPU resources to speed things up.
① ProgressBar provides a progress bar for the computation.
② npartitions is the number of partitions the data is split into for parallel computation, generally set to the number of CPU cores.
③ When using dask's apply with result_type="expand", a meta dictionary is needed to declare the data type of each output column, for example str, int, or f8.
4 Fixing garbled characters when saving to CSV
When saving a DataFrame that contains Chinese text to a csv file with df.to_csv('data.csv'), the text often appears garbled when the file is opened (for example in Excel). The problem is easily solved by specifying encoding='utf_8_sig' when saving the file:
df.to_csv('data.csv', encoding='utf_8_sig')
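A quick round-trip sanity check (using a temporary file path here for illustration): utf_8_sig prepends a byte-order mark, which is what lets Excel recognize the file as UTF-8, and pandas can read the file back unchanged.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'label': ['中文', 'English'], 'value': [1, 2]})

path = os.path.join(tempfile.mkdtemp(), 'data.csv')
# utf_8_sig writes a BOM so Excel detects the file as UTF-8
df.to_csv(path, index=False, encoding='utf_8_sig')

# round-trip check: the data survives intact
df2 = pd.read_csv(path, encoding='utf_8_sig')
print(df.equals(df2))  # True
```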