Pandas study notes (8) - Pandas categorical data

Preface


For more details on the code in these articles, see the author's personal website: https://www.iwtmbtly.com/


Import the required libraries and load the data file:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('data/table.csv')
>>> df.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

1. Creating categorical variables and their properties

(1) Creation of categorical variables

1. Create with Series

>>> pd.Series(["a", "b", "c", "a"], dtype="category")
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

2. Create a DataFrame with a column of the specified dtype

>>> temp_df = pd.DataFrame({'A': pd.Series(["a", "b", "c", "a"], dtype="category"), 'B': list('abcd')})
>>> temp_df.dtypes
A    category
B      object
dtype: object
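
An existing column can also be converted after the DataFrame has been built. The following is a minimal sketch, assuming a small hypothetical frame with two plain string columns (the data is illustrative, not taken from table.csv):

```python
import pandas as pd

# Hypothetical frame with two plain string columns.
temp_df = pd.DataFrame({'A': list('abca'), 'B': list('abcd')})

# Convert only column 'A'; column 'B' keeps its original dtype.
converted = temp_df.astype({'A': 'category'})

print(converted.dtypes)
print(list(converted['A'].cat.categories))
```

Passing a dictionary to astype limits the conversion to the listed columns, which is convenient when only a few columns hold repeated labels.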

3. Create with the built-in Categorical type

>>> cat = pd.Categorical(["a", "b", "c", "a"], categories=['a','b','c'])
>>> pd.Series(cat)
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

4. Create with the cut function

By default, the labels are the intervals themselves:

>>> pd.cut(np.random.randint(0,60,5), [0,10,30,60])
[(0, 10], (10, 30], (10, 30], (10, 30], (30, 60]]
Categories (3, interval[int64, right]): [(0, 10] < (10, 30] < (30, 60]]

Strings can also be specified as labels:

>>> pd.cut(np.random.randint(0,60,5), [0,10,30,60], right=False, labels=['0-10','10-30','30-60'])
['30-60', '30-60', '30-60', '10-30', '30-60']
Categories (3, object): ['0-10' < '10-30' < '30-60']
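
Because the examples above use random input, the effect of right=False is hard to see from the output. The sketch below repeats the second call with fixed, hypothetical values placed on the bin edges:

```python
import pandas as pd

# Fixed input values (hypothetical) chosen to sit on the bin edges.
values = [0, 9, 10, 29, 30, 59]

# right=False makes every bin closed on the left: [0, 10), [10, 30), [30, 60)
binned = pd.cut(values, [0, 10, 30, 60], right=False,
                labels=['0-10', '10-30', '30-60'])

print(list(binned))    # the label assigned to each value
print(binned.ordered)  # whether the resulting Categorical is ordered
```

With right=False the value 10 falls into '10-30' rather than '0-10', and 30 falls into '30-60'.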

(2) Structure of categorical variables

A categorical variable consists of three parts: the element values (values), the categories (categories), and whether it is ordered (ordered).

As the output above shows, categorical variables created with the cut function are ordered by default.

The following shows how to get or modify these properties:

1. describe method

This method summarizes a categorical Series: the number of non-missing values (count), the number of distinct values that actually occur (unique, which is not the number of categories), and the most frequent value together with its frequency (top and freq):

>>> s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
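
The difference between the values that occur and the declared categories can be checked directly. A minimal sketch using the same Series as above; value_counts is used here only for illustration and is not part of the original notes:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

n_unique = s.nunique()               # distinct values that actually occur
n_categories = len(s.cat.categories) # declared categories, used or not
counts = s.value_counts().to_dict()  # unused categories are listed with count 0

print(n_unique, n_categories)
print(counts)
```

The unused category 'd' is counted among the categories but not among the unique values, which is why describe reports unique as 3.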

2. categories and ordered attributes

View the categories and whether they are ordered:

>>> s.cat.categories
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> s.cat.ordered
False
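
Internally, each element value is stored as an integer code that points into the categories. The sketch below inspects those codes through the cat.codes attribute, which the notes above do not cover:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

codes = s.cat.codes.tolist()        # position of each value in the categories
categories = list(s.cat.categories)

print(codes)       # missing values are stored as -1
print(categories)
```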

(3) Modification of categories

1. Use set_categories to modify

This replaces the set of categories. The existing values are not renamed, and any value that is not in the new categories becomes NaN:

>>> s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
>>> s.cat.set_categories(['new_a','c'])
0    NaN
1    NaN
2      c
3    NaN
4    NaN
dtype: category
Categories (2, object): ['new_a', 'c']
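
To make the rule easier to see, the sketch below keeps two of the old categories and adds one new one; the new category list is a hypothetical choice for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

# Keep 'a' and 'c', drop 'b' and 'd', and add a new category 'e'.
result = s.cat.set_categories(['a', 'c', 'e'])

print(result.tolist())
print(list(result.cat.categories))
```

Values whose category survives ('a' and 'c') are kept, the value 'b' becomes NaN, and 'e' exists as a category without any matching value.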

2. Use rename_categories to modify

Note that this method renames the categories, so the corresponding values change along with them:

>>> s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
>>> s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])
0    new_a
1    new_b
2    new_c
3    new_a
4      NaN
dtype: category
Categories (4, object): ['new_a', 'new_b', 'new_c', 'new_d']

Use a dictionary to rename only some of the categories:

>>> s.cat.rename_categories({'a':'new_a','b':'new_b'})
0    new_a
1    new_b
2        c
3    new_a
4      NaN
dtype: category
Categories (4, object): ['new_a', 'new_b', 'c', 'd']
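
Besides a list or a dictionary, rename_categories also accepts a callable that is applied to every category. A minimal sketch, using str.upper as an arbitrary example function:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

# The callable is applied to each category, including the unused 'd'.
upper = s.cat.rename_categories(str.upper)

print(upper.tolist())
print(list(upper.cat.categories))
```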

3. Use add_categories to add

>>> s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
>>> s.cat.add_categories(['e'])
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

4. Use remove_categories to remove

>>> s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
>>> s.cat.remove_categories(['d'])
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): ['a', 'b', 'c']

5. Use remove_unused_categories to remove categories that do not appear in the values

>>> s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
>>> s.cat.remove_unused_categories()
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): ['a', 'b', 'c']
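
In the two examples above the removed category 'd' was unused, so the values stayed the same. The sketch below removes a category that is in use, to contrast it with remove_unused_categories:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

# Removing a category that is in use turns its values into NaN.
without_a = s.cat.remove_categories(['a'])

# Removing only unused categories never changes the values.
trimmed = s.cat.remove_unused_categories()

print(without_a.tolist(), list(without_a.cat.categories))
print(trimmed.tolist(), list(trimmed.cat.categories))
```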

2. Sorting of categorical variables

As mentioned earlier, categorical variables are either ordered or unordered. For example, the level of a score interval is an ordered variable, while the subject of an exam is generally treated as an unordered variable.

(1) Establishing an order

1. To convert a Series into an ordered variable, use the as_ordered method

>>> s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
>>> s
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'c' < 'd']

To convert it back into an unordered variable, use as_unordered:

>>> s.cat.as_unordered()
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a', 'c', 'd']

2. Use the ordered parameter of the set_categories method

>>> pd.Series(["a", "d", "c", "a"]).astype('category').cat.set_categories(['a','c','d'],ordered=True)
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'c' < 'd']

3. Use the reorder_categories method

This method requires the new categories to be exactly the same set as the original categories:

>>> s = pd.Series(["a", "d", "c", "a"]).astype('category')
>>> s.cat.reorder_categories(['a','c','d'],ordered=True)
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'c' < 'd']
    
# s.cat.reorder_categories(['a','c'],ordered=True) # raises ValueError
# s.cat.reorder_categories(['a','c','d','e'],ordered=True) # raises ValueError
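
The two failing calls can be run safely inside try/except. This is a minimal sketch; the errors list is only a helper for the illustration:

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')

errors = []
for new_cats in (['a', 'c'], ['a', 'c', 'd', 'e']):
    try:
        # Fails because new_cats is not the same set as the old categories.
        s.cat.reorder_categories(new_cats, ordered=True)
    except ValueError as exc:
        errors.append(str(exc))
        print(f"{new_cats}: {exc}")

# set_categories has no such restriction and can add 'e' while ordering.
extended = s.cat.set_categories(['a', 'c', 'd', 'e'], ordered=True)
print(list(extended.cat.categories))
```

When the set of categories has to change as well, set_categories is the method to use instead.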

(2) Sorting

1. Sorting by value and sorting by index

>>> s = pd.Series(np.random.choice(['perfect','good','fair','bad','awful'],50)).astype('category')
>>> s = s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],ordered=True)
>>> s.head()
0     good
1     good
2     fair
3    awful
4     good
dtype: category
Categories (5, object): ['awful' < 'bad' < 'fair' < 'good' < 'perfect']
>>> s.sort_values(ascending=False).head()
49    perfect
41    perfect
38    perfect
34    perfect
31    perfect
dtype: category
Categories (5, object): ['awful' < 'bad' < 'fair' < 'good' < 'perfect']
>>> df_sort = pd.DataFrame({'cat':s.values,'value':np.random.randn(50)}).set_index('cat')
>>> df_sort.head()
          value
cat
good  -0.492898
good  -0.800566
fair   0.355032
awful  0.299062
good  -2.321767
>>> df_sort.sort_index().head()
          value
cat
awful  0.435638
awful  1.912790
awful -0.204264
awful  0.299062
awful -0.585013
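
In the example above, the intended order of the rating labels happens to match their alphabetical order, so the effect of the category order on sorting is not visible. The sketch below uses hypothetical size labels whose logical order differs from the alphabetical one:

```python
import pandas as pd

# Hypothetical labels; alphabetical order would be large, medium, small.
sizes = pd.Series(['medium', 'small', 'large', 'small'])

plain = sizes.astype('category')
ordered = plain.cat.set_categories(['small', 'medium', 'large'], ordered=True)

# Sorting follows the order of the categories, not the order of the strings.
by_default = plain.sort_values().tolist()
by_custom = ordered.sort_values().tolist()

# The same applies when the categorical values are used as the index.
df_sizes = pd.DataFrame({'cat': ordered.values, 'value': range(4)}).set_index('cat')
by_index = df_sizes.sort_index().index.tolist()

print(by_default)
print(by_custom)
print(by_index)
```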

3. Comparison operations on categorical variables

(1) Comparison with a scalar or an equal-length sequence

1. Scalar comparison

>>> s = pd.Series(["a", "d", "c", "a"]).astype('category')
>>> s == 'a'
0     True
1    False
2    False
3     True
dtype: bool

2. Comparison with an equal-length sequence

>>> s == list('abcd')
0     True
1    False
2     True
3    False
dtype: bool

(2) Comparison with another categorical variable

1. Equality comparison (== and !=)

An equality comparison between two categorical variables requires their categories to be identical:

>>> s = pd.Series(["a", "d", "c", "a"]).astype('category')
>>> s == s
0    True
1    True
2    True
3    True
dtype: bool
>>> s != s
0    False
1    False
2    False
3    False
dtype: bool
>>> s_new = s.cat.set_categories(['a','d','e'])
>>> s == s_new  # raises an error
    raise TypeError(msg)
TypeError: Categoricals can only be compared if 'categories' are the same.
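
If two categorical variables with different categories still need to be compared element by element, one possible workaround (an assumption of this sketch, not something the notes above prescribe) is to compare their plain values after converting both to object dtype:

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s_new = s.cat.set_categories(['a', 'd', 'e'])  # 'c' is no longer a category, so it becomes NaN

try:
    s == s_new
    raised = False
except TypeError as exc:
    raised = True
    print(exc)

# Compare the underlying values instead of the categoricals.
same = (s.astype(object) == s_new.astype(object)).tolist()
print(same)
```

The third position is False because the value 'c' became NaN in s_new, and NaN never compares equal to anything.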

2. Order comparison (>=, <=, <, >)

An order comparison between two categorical variables must satisfy two conditions:

  • The categories are exactly the same

  • The ordering of the categories is exactly the same

>>> s = pd.Series(["a", "d", "c", "a"]).astype('category')
>>> s >= s  # raises an error
 raise TypeError(
TypeError: Unordered Categoricals can only compare equality or not
>>> s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.reorder_categories(['a','c','d'],ordered=True)
>>> s >= s
0    True
1    True
2    True
3    True
dtype: bool
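
An ordered categorical variable can also be compared with a scalar, and the result follows the order of the categories rather than the order of the strings. A minimal sketch, assuming a deliberately reversed order 'd' < 'c' < 'a' for illustration:

```python
import pandas as pd

# Hypothetical reversed order: 'd' < 'c' < 'a'
s = (pd.Series(["a", "d", "c", "a"])
       .astype('category')
       .cat.reorder_categories(['d', 'c', 'a'], ordered=True))

greater_than_c = (s > 'c').tolist()  # only 'a' ranks above 'c' here
smallest, largest = s.min(), s.max()

print(greater_than_c)
print(smallest, largest)
```

Under this order 'a' is the largest value and 'd' the smallest, the opposite of a plain string comparison.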

Origin blog.csdn.net/qq_43300880/article/details/125029001