Processing categorical (category) data in pandas

  Categorical (category) data: put simply, data that can take only a limited, usually fixed, number of possible values. For example: gender, blood type.

Creating categorical data by specifying the data type

dtype="category"

  Taking blood type as an example, create a categorical Series of blood types.

import numpy as np
import pandas as pd
index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type", dtype="category")
user_info
"""
name
Tom        A
Bob       AB
Mary     NaN
James     AB
Andy       O
Alice      B
Name: blood_type, dtype: category
Categories (4, object): [A, AB, B, O]
"""

Using pd.Categorical to build categorical data

import numpy as np
import pandas as pd
index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type")
# categories: custom list of categories; values not in the list ("O" here) become NaN
pd.Categorical(user_info, categories=["A", "B", "AB"])
"""
[A, AB, NaN, AB, NaN, B]
Categories (3, object): [A, B, AB]
"""
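As a small sketch of what happens to values outside the `categories` list, the snippet below wraps the Categorical back into a Series and inspects the integer codes. The variable names `blood`, `cat` and `blood_cat` are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], name="blood_type")

# "O" is not in the categories list, so it is treated as missing (NaN)
cat = pd.Categorical(blood, categories=["A", "B", "AB"])

# Wrap the Categorical in a Series to get the usual Series interface back
blood_cat = pd.Series(cat, index=blood.index, name=blood.name)

print(blood_cat.isna().sum())   # counts the original NaN and the dropped "O"
print(list(cat.codes))          # a code of -1 marks a missing value
```

Each value is stored as a position in the categories list, which is why an unknown value has nowhere to point and ends up as missing.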

Converting to categorical data

  Often we already have a Series and want to convert it to categorical data. This is what astype is for.

user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type")
user_info = user_info.astype("category")
user_info 
"""
name
Tom        A
Bob       AB
Mary     NaN
James     AB
Andy       O
Alice      B
Name: blood_type, dtype: category
Categories (4, object): [A, AB, B, O]
"""
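When the set of categories should be fixed in advance rather than inferred from the data, astype also accepts a CategoricalDtype. The following is a minimal sketch under that assumption; the name `blood_dtype` is made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], name="blood_type")

# Declare all four blood types up front, in a chosen order
blood_dtype = CategoricalDtype(categories=["A", "B", "AB", "O"])
blood_cat = blood.astype(blood_dtype)

print(blood_cat.dtype)
print(list(blood_cat.cat.categories))
```

Compared with `astype("category")`, this keeps the category list stable even if some blood type happens not to occur in a particular batch of data.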

Common Operations

.describe()

  Categorical data supports the .describe() method, and the result has the same form as for string data.

user_info.describe()
"""
count      5
unique     4
top       AB
freq       2
Name: blood_type, dtype: object
"""

"""
count: number of non-null values (5)
unique: number of distinct non-null values (4)
top: the most frequent value (AB)
freq: number of times the most frequent value occurs (2)
"""

.cat.categories  

  Use .cat.categories to get all possible values of the categorical data.

user_info.cat.categories 
"""
Index(['A', 'AB', 'B', 'O'], dtype='object')
"""

.cat.rename_categories  

  Use the .cat.rename_categories method to rename categories.

user_info.cat.rename_categories(["A+", "AB+", "B+", "O+"])
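rename_categories also accepts a dict-like mapping, which is handy when only some categories need new names. A short sketch, with the new labels chosen purely for illustration:

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# Only the keys present in the mapping are renamed; other categories are kept
renamed = blood.cat.rename_categories({"A": "A+", "O": "O+"})

print(list(renamed.cat.categories))
print(renamed.tolist())
```

The method returns a new Series, so `blood` itself is left unchanged.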

  Categories can be added and removed with .cat.add_categories and .cat.remove_categories.
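A minimal sketch of both methods follows. The extra label "Rh-null" is only a placeholder category for the demonstration.

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# Add a category that does not occur in the data yet (placeholder label)
with_extra = blood.cat.add_categories(["Rh-null"])

# Remove a category; values that belonged to it become NaN
without_o = blood.cat.remove_categories(["O"])

print(list(with_extra.cat.categories))
print(without_o.isna().sum())
```

Both calls return a new Series. Note that removing a category silently turns its values into missing values, so it is worth checking the NaN count afterwards.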

.value_counts()

  Use the value_counts method to view the distribution of the data.

user_info.value_counts() 
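One detail worth knowing is that value_counts on categorical data reports every category, including those that never occur. The sketch below assumes a category list that contains an unused value ("O"):

```python
import numpy as np
import pandas as pd

blood = pd.Series(
    pd.Categorical(["A", "AB", np.nan, "AB", "B"], categories=["A", "B", "AB", "O"])
)

# NaN is excluded by default; the unused category "O" appears with a count of 0
counts = blood.value_counts()
print(counts)
```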

.str property  

  Use the .str accessor to apply string methods to categorical data.

# Check whether each value contains the letter "A"
user_info.str.contains("A")
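Because the data contains a missing value, the result of str.contains has a missing entry too. A possible way to turn the result into a usable filter is to pass `na=False`, as in this sketch (the name `mask` is illustrative):

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# na=False makes missing entries evaluate to False instead of NaN
mask = blood.str.contains("A", na=False).astype(bool)

# The boolean mask can then be used to filter the categorical Series
print(blood[mask].tolist())
```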

pd.concat

  Use pd.concat to merge categorical data.

blood_type1 = pd.Categorical(["A", "AB"]) 
blood_type2 = pd.Categorical(["B", "O"]) 
pd.concat([pd.Series(blood_type1), pd.Series(blood_type2)]) 
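Whether the result of pd.concat stays categorical depends on the categories of the inputs. The following sketch contrasts the two cases; `all_types` and the other variable names are assumptions made for the example.

```python
import pandas as pd

s1 = pd.Series(pd.Categorical(["A", "AB"]))
s2 = pd.Series(pd.Categorical(["B", "O"]))

# Different category sets: the result falls back to a non-categorical dtype
mixed = pd.concat([s1, s2], ignore_index=True)

# Identical category sets: the categorical dtype is preserved
all_types = ["A", "AB", "B", "O"]
s3 = pd.Series(pd.Categorical(["A", "AB"], categories=all_types))
s4 = pd.Series(pd.Categorical(["B", "O"], categories=all_types))
same = pd.concat([s3, s4], ignore_index=True)

print(mixed.dtype, same.dtype)
```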

union_categoricals   

  When categorical data with different categories is merged with pd.concat, the result is converted to object dtype. If you want to keep the categorical type, you can use union_categoricals instead.

from pandas.api.types import union_categoricals 
union_categoricals([blood_type1, blood_type2])
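As a self-contained sketch, the merged result can be inspected and wrapped in a Series like this (the Series name is illustrative):

```python
import pandas as pd
from pandas.api.types import union_categoricals

blood_type1 = pd.Categorical(["A", "AB"])
blood_type2 = pd.Categorical(["B", "O"])

# The result is a single Categorical whose categories combine both inputs
merged = union_categoricals([blood_type1, blood_type2])

# Wrap it in a Series if a Series is needed downstream
merged_series = pd.Series(merged, name="blood_type")

print(merged_series.dtype)
print(list(merged.categories))
```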

Memory Usage

  The memory usage of a Categorical is proportional to the number of categories plus the length of the data.

  An object dtype, by contrast, uses a constant times the length of the data.

  When the number of categories is small, categorical data saves a lot of memory.

blood_type = pd.Series(["AB","O"]*1000) 
blood_type.nbytes   # 16000 with object dtype (2000 pointers of 8 bytes)
blood_type.astype("category").nbytes   # 2016 (2000 one-byte codes + 2 categories)

  When the number of categories approaches the length of the data, a Categorical uses nearly the same or more memory than the equivalent object representation.

blood_type = pd.Series(['AB%04d' % i for i in range(2000)]) 
blood_type.nbytes   # 16000
blood_type.astype("category").nbytes  # 20000
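The difference comes from how the integer codes are stored. The sketch below, which forces object dtype explicitly so the comparison does not depend on the pandas version's default string type, shows the code width growing with the number of categories. Variable names are illustrative.

```python
import pandas as pd

few = pd.Series(["AB", "O"] * 1000, dtype=object)
few_cat = few.astype("category")

many = pd.Series(["AB%04d" % i for i in range(2000)], dtype=object)
many_cat = many.astype("category")

# With 2 categories the codes fit in int8; with 2000 they need int16
print(few_cat.cat.codes.dtype, many_cat.cat.codes.dtype)

# nbytes covers the codes plus the categories index
print(few.nbytes, few_cat.nbytes)
print(many.nbytes, many_cat.nbytes)
```

Note that nbytes does not count the string contents of an object Series; `memory_usage(deep=True)` gives a fuller picture when that matters.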


Origin www.cnblogs.com/zry-yt/p/11803892.html