Essential Pandas for Data Science: Unpopular but Very Practical Tips

Pandas is a foundational library for analytics, data processing, and data science. This post covers some lesser-known but idiomatic Pandas features that make code more readable, versatile, and speedy, in the spirit of a Buzzfeed listicle.


Interpreter startup configuration options and settings

To set custom Pandas options when the interpreter starts, put the pd.set_option() calls you want in an interpreter startup file.

import pandas as pd

def start():
    options = {
        'display': {
            'max_columns': None,
            'max_colwidth': 25,
            'expand_frame_repr': False,  # don't wrap wide frames across lines
            'max_rows': 4,               # truncate output to this many rows
            'max_seq_items': 50,         # max length of printed sequences
            'precision': 4,              # float display precision
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None   # controls SettingWithCopyWarning
        }
    }

    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # Python 3.6+
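
If this function lives in your interpreter startup file (for example one referenced by the PYTHONSTARTUP environment variable; the exact wiring is up to you), calling it at the bottom and deleting it keeps the interactive namespace clean:

if __name__ == '__main__':
    start()
    del start  # remove the helper from the interactive namespace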

For example, read a setting back to confirm it took effect.

pd.get_option('display.max_rows')
4

When a DataFrame is printed, the output is now automatically truncated to the configured number of rows.

import pandas as pd
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data')
cols = ['sex', 'length', 'diam', 'height', 'weight', 'rings']
abalone = pd.read_csv(url, usecols=[0, 1, 2, 3, 4, 8], names=cols)
abalone 


Make dummy data with Pandas' testing module

Hidden in Pandas' testing module are some handy functions for quickly building quasi-realistic Series and DataFrames.

import pandas.util.testing as tm
tm.N, tm.K = 15, 3  # default number of rows / columns

import numpy as np
np.random.seed(444)

tm.makeTimeDataFrame(freq='M').head()


tm.makeDataFrame().head()

To list all of these factory functions, use dir(); each method name hints at what it builds.

[i for i in dir(tm) if i.startswith('make')]

['makeBoolIndex',
 'makeCategoricalIndex',
 'makeCustomDataframe',
 'makeCustomIndex',
 'makeDataFrame',
 'makeDateIndex',
 'makeFloatIndex',
 'makeFloatSeries',
 'makeIntIndex',
 'makeIntervalIndex',
 'makeMissingCustomDataframe',
 'makeMissingDataframe',
 'makeMixedDataFrame',
 'makeMultiIndex',
 'makeObjectSeries',
 'makePeriodFrame',
 'makePeriodIndex',
 'makePeriodSeries',
 'makeRangeIndex',
 'makeStringIndex',
 'makeStringSeries',
 'makeTimeDataFrame',
 'makeTimeSeries',
 'makeTimedeltaIndex',
 'makeUIntIndex',
 'makeUnicodeIndex']

Pandas Accessor

Accessors are kind of like getters (although getters and setters are rarely used in Python). A Pandas accessor can be thought of as a property that serves as an interface to additional methods.

A Series has three such accessors, each tied to a kind of data:

  • .str maps to StringMethods
  • .dt maps to CombinedDatetimelikeProperties
  • .cat maps to CategoricalAccessor

pd.Series._accessors
{'cat', 'str', 'dt'}

StringMethods gives you vectorized string methods directly on a Series.

addr = pd.Series([
    'Washington, D.C. 20003',
    'Brooklyn, NY 11211-1755',
    'Omaha, NE 68154',
    'Pittsburgh, PA 15211'
])
addr.str.upper()

0     WASHINGTON, D.C. 20003
1    BROOKLYN, NY 11211-1755
2            OMAHA, NE 68154
3       PITTSBURGH, PA 15211
dtype: object
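
The accessor also exposes regex-based methods. As a sketch (this pattern is illustrative, not from the original dataset), the addresses can be split into named city/state/zip groups with .str.extract:

# Strip periods first so 'D.C.' matches the two-letter state pattern,
# then pull named groups out into DataFrame columns.
regex = (r'(?P<city>[A-Za-z ]+), '
         r'(?P<state>[A-Z]{2}) '
         r'(?P<zip>\d{5}(?:-\d{4})?)')
addr.str.replace('.', '', regex=False).str.extract(regex)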

CombinedDatetimelikeProperties gives direct access to datetime properties and methods.

daterng = pd.Series(pd.date_range('2022', periods=5, freq='Q'))
daterng 

0   2022-03-31
1   2022-06-30
2   2022-09-30
3   2022-12-31
4   2023-03-31
dtype: datetime64[ns]

daterng.dt.day_name()
0    Thursday
1    Thursday
2      Friday
3    Saturday
4      Friday
dtype: object
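
Other .dt properties work the same way. For example, keep only the dates that fall in the second half of the year:

# .dt.quarter returns the calendar quarter; filter with a boolean mask
daterng[daterng.dt.quarter > 2]

2   2022-09-30
3   2022-12-31
dtype: datetime64[ns]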

CategoricalAccessor handles categorical data; it is covered in detail in its own section below.

Create a DatetimeIndex from component columns

For datetime-like data like daterng above, a Pandas DatetimeIndex can be created from multiple component columns that together form a date or datetime.

from itertools import product
datecols = ['year', 'month', 'day']

df = pd.DataFrame(list(product([2021, 2022], [1, 2], [1, 2])), columns=datecols)
df['data'] = np.random.randn(len(df))
df


df.index = pd.to_datetime(df[datecols])
df.head()

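With the DatetimeIndex in place, the component columns are redundant. As an optional follow-up, drop them and squeeze the remaining single-column frame down to a Series:

# Drop year/month/day and collapse the one-column frame to a Series
df = df.drop(datecols, axis=1).squeeze()
df.head()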

CategoricalAccessor and categorical data

The main point of categorical data is to save space and time.

For example, the color data below is stored as str objects; converting the values to integers makes the space saving easy to see.

colors = pd.Series([
    'periwinkle',
    'mint green',
    'burnt orange',
    'periwinkle',
    'burnt orange',
    'rose',
    'rose',
    'mint green',
    'rose',
    'navy'
])

import sys
colors.apply(sys.getsizeof)

0    59
1    59
2    61
3    59
4    61
5    53
6    53
7    59
8    53
9    53
dtype: int64

One way to convert is a dictionary mapping passed to map().

mapper = {v: k for k, v in enumerate(colors.unique())}
mapper

{'periwinkle': 0, 'mint green': 1, 'burnt orange': 2, 'rose': 3, 'navy': 4}

as_int = colors.map(mapper)
as_int
0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
dtype: int64

as_int.apply(sys.getsizeof)
0    24
1    28
2    28
3    24
4    28
5    28
6    28
7    28
8    28
9    28
dtype: int64
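
Pandas builds the same trick into its categorical dtype, which is what the .cat accessor operates on. A minimal sketch of the conversion:

# astype('category') stores each unique value once; every element
# becomes a small integer code pointing into the categories array
ccolors = colors.astype('category')
ccolors.cat.categories  # the unique values
ccolors.cat.codes       # the integer code for each element

# compare total memory usage, including the underlying object data
colors.memory_usage(index=False, deep=True)
ccolors.memory_usage(index=False, deep=True)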

Self-iterating Groupby objects

Groupby aggregates data. For a DataFrame, the process has three steps:

  • Split the data into groups based on given criteria.
  • Apply a function (such as sum, mean, or std) to each group independently.
  • Combine the results into a single output.

The GroupBy object by itself does little; you still need to apply aggregations to get results.

abalone['ring_quartile'] = pd.qcut(abalone.rings, q=4, labels=range(1, 5))
grouped = abalone.groupby('ring_quartile')
grouped
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x11c1169b0>
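
True to this section's heading, a GroupBy object is itself iterable: each iteration yields the group key and the sub-DataFrame belonging to that group. For example, the three heaviest abalones in each quartile:

# idx is the ring_quartile label, frame the rows in that quartile
for idx, frame in grouped:
    print(f'Ring quartile: {idx}')
    print(frame.nlargest(3, 'weight'), end='\n\n')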

You can also group on several columns and apply multiple aggregations at once.

abalone.groupby(['height', 'weight']).agg(['mean', 'median'])

insert image description here

Boolean operators in Pandas

Comparing a column against a value produces a boolean Series.


abalone["sex"] == "M"
0        True
1        True
2       False
3        True
4       False
        ...  
4172    False
4173     True
4174     True
4175    False
4176     True
Name: sex, dtype: bool

The rows where the mask is True can then be selected, here the rows whose sex is 'M' (male).

abalone[abalone["sex"] == "M"]

Or use the ~ operator to invert the mask.

abalone[~(abalone["sex"] == "M")]

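Masks can also be combined with the & (and) and | (or) operators. Each comparison needs its own parentheses, because & and | bind tighter than ==; a small sketch:

# male abalones with more than 10 rings; note the parentheses
abalone[(abalone['sex'] == 'M') & (abalone['rings'] > 10)]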

Load data from clipboard

This sounds amazing, but it's actually possible.

For example, select some tabular data in Excel, copy it, and then read it directly in code.

df = pd.read_clipboard(na_values=[None])
df

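Because read_clipboard forwards its keyword arguments to read_csv, parsing options apply on the way in. A sketch with a hypothetical date column named 'd':

# hypothetical column name 'd'; kwargs pass through to read_csv
df = pd.read_clipboard(na_values=[None], parse_dates=['d'])
df.dtypes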

Pandas objects are written directly to compressed format

Pandas objects can be written directly to gzip, bz2, zip, or xz compressed files, instead of writing an uncompressed file and compressing it afterwards.

abalone.to_json('df.json.gz', orient='records',
                lines=True, compression='gzip')

The size difference after saving is about 9.9x.

import os.path
abalone.to_json('df.json', orient='records', lines=True)
os.path.getsize('df.json') / os.path.getsize('df.json.gz')
9.90544507008506
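
Reading the compressed file back is symmetric; the compression is inferred from the .gz extension:

# round-trip: read_json infers gzip compression from the file suffix
df2 = pd.read_json('df.json.gz', lines=True)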
