Pandas is a foundational library for analytics, data processing, and data science. This article is a listicle of lesser-used but idiomatic Pandas features that make your code more readable, versatile, and fast.
Article directory
- Configure options and settings at interpreter startup
- Build test data with Pandas' testing module
- Pandas accessors
- Create a DatetimeIndex from component columns
- Categorical data and the CategoricalAccessor
- Self-iterating Groupby objects
- Boolean operators in Pandas
- Load data from the clipboard
- Write Pandas objects directly to compressed format
Configure options and settings at interpreter startup
To set custom Pandas options automatically when the interpreter starts, put pd.set_option() calls in a startup file (the one referenced by the PYTHONSTARTUP environment variable).
import pandas as pd

def start():
    options = {
        'display': {
            'max_columns': None,
            'max_colwidth': 25,
            'expand_frame_repr': False,   # don't wrap the repr across pages
            'max_rows': 4,                # truncate output after this many rows
            'max_seq_items': 50,          # max length of printed sequences
            'precision': 4,               # floating-point display precision
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None    # controls SettingWithCopyWarning
        }
    }
    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # Python 3.6+
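To apply the configuration in every session, the startup file can call start() and then delete it to keep the interactive namespace clean. A minimal sketch with only a couple of the options above:

```python
import pandas as pd

def start():
    # Sketch: set just a couple of display options
    pd.set_option('display.max_rows', 4)
    pd.set_option('display.precision', 4)

start()
del start  # keep the interactive namespace clean

print(pd.get_option('display.max_rows'))  # 4
```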
For example, check one of the configured settings:
pd.get_option('display.max_rows')
4
Now any printed DataFrame is automatically truncated to the configured number of rows.
import pandas as pd
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data')
cols = ['sex', 'length', 'diam', 'height', 'weight', 'rings']
abalone = pd.read_csv(url, usecols=[0, 1, 2, 3, 4, 8], names=cols)
abalone
Build test data with Pandas' testing module
Hidden in Pandas' testing module are some handy functions for quickly building quasi-realistic Series and DataFrames. (Note: pandas.util.testing has been deprecated since pandas 1.0; in newer releases these helpers live in the private pandas._testing module.)
import pandas.util.testing as tm
tm.N, tm.K = 15, 3  # default number of rows/columns
import numpy as np
np.random.seed(444)
tm.makeTimeDataFrame(freq='M').head()
tm.makeDataFrame().head()
To list all of these factory functions, use dir(); each function's name hints at what it builds.
[i for i in dir(tm) if i.startswith('make')]
['makeBoolIndex',
'makeCategoricalIndex',
'makeCustomDataframe',
'makeCustomIndex',
'makeDataFrame',
'makeDateIndex',
'makeFloatIndex',
'makeFloatSeries',
'makeIntIndex',
'makeIntervalIndex',
'makeMissingCustomDataframe',
'makeMissingDataframe',
'makeMixedDataFrame',
'makeMultiIndex',
'makeObjectSeries',
'makePeriodFrame',
'makePeriodIndex',
'makePeriodSeries',
'makeRangeIndex',
'makeStringIndex',
'makeStringSeries',
'makeTimeDataFrame',
'makeTimeSeries',
'makeTimedeltaIndex',
'makeUIntIndex',
'makeUnicodeIndex']
Pandas accessors
Kind of like getters (although getters and setters are rarely used in Python). A Pandas accessor can be thought of as a property, used as an interface for additional methods.
There are three built-in accessors:
- .str maps to StringMethods
- .dt maps to CombinedDatetimelikeProperties
- .cat maps to CategoricalAccessor
pd.Series._accessors
{'cat', 'str', 'dt'}
StringMethods lets you apply string methods element-wise:
addr = pd.Series([
'Washington, D.C. 20003',
'Brooklyn, NY 11211-1755',
'Omaha, NE 68154',
'Pittsburgh, PA 15211'
])
addr.str.upper()
0 WASHINGTON, D.C. 20003
1 BROOKLYN, NY 11211-1755
2 OMAHA, NE 68154
3 PITTSBURGH, PA 15211
dtype: object
CombinedDatetimelikeProperties exposes datetime-like operations:
daterng = pd.Series(pd.date_range('2022', periods=5, freq='Q'))
daterng
0 2022-03-31
1 2022-06-30
2 2022-09-30
3 2022-12-31
4 2023-03-31
dtype: datetime64[ns]
daterng.dt.day_name()
0 Thursday
1 Thursday
2 Friday
3 Saturday
4 Friday
dtype: object
CategoricalAccessor handles categorical data, which is covered in detail in a later section.
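Beyond the built-in three, Pandas also lets you register your own accessor as an interface for additional methods. A sketch using pd.api.extensions.register_series_accessor; the "zip_code" name and the regex are invented for illustration:

```python
import pandas as pd

# Hypothetical custom accessor; the name "zip_code" is an invented example
@pd.api.extensions.register_series_accessor("zip_code")
class ZipCodeAccessor:
    def __init__(self, series):
        self._s = series

    def extract(self):
        # Pull the 5-digit ZIP code from the end of each address string
        return self._s.str.extract(r'(\d{5})(?:-\d{4})?$')[0]

addr = pd.Series(['Washington, D.C. 20003', 'Omaha, NE 68154'])
print(addr.zip_code.extract().tolist())  # ['20003', '68154']
```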
Create a DatetimeIndex from component columns
For datetime-like data such as daterng above, a Pandas DatetimeIndex can be created from multiple component columns that together form a date or datetime.
from itertools import product
datecols = ['year', 'month', 'day']
df = pd.DataFrame(list(product([2021, 2022], [1, 2], [1, 2])), columns=datecols)
df['data'] = np.random.randn(len(df))
df
df.index = pd.to_datetime(df[datecols])
df.head()
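Once the index is a DatetimeIndex, the component columns are redundant and can be dropped, and the index supports partial-string slicing. A self-contained sketch reproducing the frame above:

```python
from itertools import product

import numpy as np
import pandas as pd

datecols = ['year', 'month', 'day']
df = pd.DataFrame(list(product([2021, 2022], [1, 2], [1, 2])), columns=datecols)
df['data'] = np.random.randn(len(df))
df.index = pd.to_datetime(df[datecols])

# The component columns are now redundant
df = df.drop(datecols, axis=1)

# Partial-string indexing selects all rows from 2022
print(df.loc['2022'])
```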
Categorical data and the CategoricalAccessor
Categorical data is generally used to save memory and speed up operations.
For example, each color below is stored as a full Python string; converting the values to small integers makes the space savings obvious.
colors = pd.Series([
'periwinkle',
'mint green',
'burnt orange',
'periwinkle',
'burnt orange',
'rose',
'rose',
'mint green',
'rose',
'navy'
])
import sys
colors.apply(sys.getsizeof)
0 59
1 59
2 61
3 59
4 61
5 53
6 53
7 59
8 53
9 53
dtype: int64
The same encoding can be done manually with a dictionary and map():
mapper = {v: k for k, v in enumerate(colors.unique())}
mapper
{'periwinkle': 0, 'mint green': 1, 'burnt orange': 2, 'rose': 3, 'navy': 4}
as_int = colors.map(mapper)
as_int
0 0
1 1
2 2
3 0
4 2
5 3
6 3
7 1
8 3
9 4
dtype: int64
as_int.apply(sys.getsizeof)
0 24
1 28
2 28
3 24
4 28
5 28
6 28
7 28
8 28
9 28
dtype: int64
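Rather than building the mapping by hand, converting the Series to the categorical dtype performs the same integer encoding internally, exposed through the .cat accessor:

```python
import pandas as pd

colors = pd.Series([
    'periwinkle', 'mint green', 'burnt orange', 'periwinkle', 'burnt orange',
    'rose', 'rose', 'mint green', 'rose', 'navy'
])

ccolors = colors.astype('category')

# Categories are stored once; each row holds only a small integer code
print(ccolors.cat.categories.tolist())
print(ccolors.cat.codes.tolist())

# The categorical version uses less memory than the object version
print(ccolors.memory_usage(deep=True) < colors.memory_usage(deep=True))  # True
```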
Self-iterating Groupby objects
Groupby aggregates data following a split-apply-combine pattern:
- Split the data into groups based on given criteria.
- Apply a function (such as sum, mean, or std) to each group independently.
- Combine the results into a single data structure.
A groupby object on its own does nothing until an aggregation is applied. For example, bucket the abalone data into ring quartiles:
abalone['ring_quartile'] = pd.qcut(abalone.rings, q=4, labels=range(1, 5))
grouped = abalone.groupby('ring_quartile')
grouped
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x11c1169b0>
You can also group by several columns and compute multiple statistics at once:
abalone.groupby(['height', 'weight']).agg(['mean', 'median'])
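As the section title promises, a groupby object is itself iterable, yielding (group key, sub-DataFrame) pairs. A self-contained sketch with a small synthetic frame (the abalone data requires a download):

```python
import pandas as pd

df = pd.DataFrame({'quartile': [1, 1, 2, 2], 'rings': [5, 7, 9, 11]})

# Iterating a DataFrameGroupBy yields (group key, sub-DataFrame) pairs
for key, frame in df.groupby('quartile'):
    print(key, frame['rings'].mean())  # 1 6.0, then 2 10.0
```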
Boolean operators in Pandas
A comparison on a column produces a boolean Series:
abalone["sex"] == "M"
0 True
1 True
2 False
3 True
4 False
...
4172 False
4173 True
4174 True
4175 False
4176 True
Name: sex, dtype: bool
This boolean mask can then be used to select the rows where it is True, here the rows whose sex is male (M):
abalone[abalone["sex"] == "M"]
Or use the ~ operator to invert the mask:
abalone[~(abalone["sex"] == "M")]
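Note that combining masks requires the element-wise bitwise operators & and |, not Python's and/or, and each comparison must be parenthesized because & binds tighter than ==. A sketch with a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['M', 'F', 'M', 'I'], 'rings': [15, 7, 9, 10]})

# Parentheses are required: & has higher precedence than the comparisons
mask = (df['sex'] == 'M') & (df['rings'] > 10)
print(df[mask])
```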
Load data from clipboard
This sounds surprising, but it works. Suppose you have a table of data sitting in Excel: copy it, then read it straight from the clipboard.
df = pd.read_clipboard(na_values=[None])
df
Write Pandas objects directly to compressed format
Pandas objects can be written directly to gzip, bz2, zip, or xz compressed formats, instead of writing an uncompressed file and compressing it separately.
abalone.to_json('df.json.gz', orient='records',
lines=True, compression='gzip')
Comparing the two files shows the uncompressed version is about 9.9 times larger:
import os.path
abalone.to_json('df.json', orient='records', lines=True)
os.path.getsize('df.json') / os.path.getsize('df.json.gz')
9.90544507008506
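Reading the data back is just as direct: read_json accepts the same compression argument, and its default compression='infer' detects the format from the file extension. A round-trip sketch with synthetic data:

```python
import os

import pandas as pd

df = pd.DataFrame({'a': list(range(1000)), 'b': ['x'] * 1000})

df.to_json('df.json', orient='records', lines=True)
df.to_json('df.json.gz', orient='records', lines=True, compression='gzip')

# read_json decompresses transparently (compression='infer' is the default)
roundtrip = pd.read_json('df.json.gz', orient='records', lines=True)
print(roundtrip.equals(df))                                        # True
print(os.path.getsize('df.json') > os.path.getsize('df.json.gz'))  # True
```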