Categorical (category) data: put simply, data that can take only a limited, usually fixed, number of possible values. For example: gender, blood type.
Creating categorical data by specifying the data type
dtype="category"
Taking blood type as an example, create a categorical Series of blood types:
import numpy as np
import pandas as pd

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type", dtype="category")
user_info
"""
name
Tom        A
Bob       AB
Mary     NaN
James     AB
Andy       O
Alice      B
Name: blood_type, dtype: category
Categories (4, object): [A, AB, B, O]
"""
Using pd.Categorical to build categorical data
import numpy as np
import pandas as pd

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type")
# categories: a custom list of categories; values not in the list (here "O") become NaN
pd.Categorical(user_info, categories=["A", "B", "AB"])
"""
[A, AB, NaN, AB, NaN, B]
Categories (3, object): [A, B, AB]
"""
Converting to categorical data
Often we already have a Series. How do we convert it to categorical data? Use astype:
# index is the pd.Index created above
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type")
user_info = user_info.astype("category")
user_info
"""
name
Tom        A
Bob       AB
Mary     NaN
James     AB
Andy       O
Alice      B
Name: blood_type, dtype: category
Categories (4, object): [A, AB, B, O]
"""
Common Operations
.describe()
Categorical data supports the .describe() method, and the result has the same form as for string data:
user_info.describe()
"""
count      5
unique     4
top       AB
freq       2
Name: blood_type, dtype: object
"""
"""
count: there are 5 non-null values
unique: there are 4 distinct non-null values
top: the most frequent value is AB
freq: the most frequent value occurs 2 times
"""
.cat.categories
Use .cat.categories to get all possible values of the categorical data:
user_info.cat.categories
"""
Index(['A', 'AB', 'B', 'O'], dtype='object')
"""
.cat.rename_categories
Use .cat.rename_categories to rename the categories:
user_info.cat.rename_categories(["A+", "AB+", "B+", "O+"])
Categories can be added and removed with .cat.add_categories and .cat.remove_categories.
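As a minimal sketch of those two methods (the Series name blood and the "Unknown" category are made up for illustration), note that both return a new Series and leave the original unchanged:

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# add_categories returns a new Series; the added category has no values yet.
with_extra = blood.cat.add_categories(["Unknown"])
print(list(with_extra.cat.categories))

# remove_categories also returns a new Series; values that belonged
# to the removed category ("O" here) become NaN.
without_o = blood.cat.remove_categories(["O"])
print(list(without_o.cat.categories))
print(int(without_o.isna().sum()))  # the original NaN plus the former "O"
```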
.value_counts()
Use the value_counts method to view the distribution of the data:
user_info.value_counts()
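A small sketch of what value_counts reports for categorical data, using a stand-in Series named blood (an illustrative name, as is the "Unknown" category). By default missing values are skipped, and a category that never occurs is still listed with a count of 0:

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# Default: NaN is ignored, so the counts cover the 5 non-null values.
counts = blood.value_counts()
print(counts)

# dropna=False adds a row for the missing values.
counts_with_nan = blood.value_counts(dropna=False)
print(counts_with_nan)

# A category with no values still appears, with a count of 0.
with_unused = blood.cat.add_categories(["Unknown"]).value_counts()
print(with_unused["Unknown"])
```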
.str property
Use the .str accessor to apply string methods to categorical data:
# Check whether each value contains the letter "A"
user_info.str.contains("A")
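A short sketch of the same check on a stand-in Series named blood (an illustrative name). Passing na=False is an optional choice made here so that missing values are reported as False instead of NaN:

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# na=False treats a missing value as "does not contain the letter".
has_a = blood.str.contains("A", na=False)
print(has_a)

# The boolean result can be used as a mask to filter the original data.
print(blood[has_a.astype(bool)])
```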
pd.concat
Use pd.concat to merge categorical data:
blood_type1 = pd.Categorical(["A", "AB"])
blood_type2 = pd.Categorical(["B", "O"])
pd.concat([pd.Series(blood_type1), pd.Series(blood_type2)])
union_categoricals
When the inputs have different categories, pd.concat turns the merged categorical data into object-type data. If you want to keep the category type, use union_categoricals instead:
from pandas.api.types import union_categoricals

union_categoricals([blood_type1, blood_type2])
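The difference between the two approaches can be checked by comparing dtypes. This is a minimal sketch; the variable names concatenated and merged are illustrative, and the exact non-categorical dtype that pd.concat falls back to may depend on the pandas version:

```python
import pandas as pd
from pandas.api.types import union_categoricals

blood_type1 = pd.Categorical(["A", "AB"])
blood_type2 = pd.Categorical(["B", "O"])

# pd.concat: the two inputs have different categories,
# so the result is no longer categorical.
concatenated = pd.concat([pd.Series(blood_type1), pd.Series(blood_type2)])
print(concatenated.dtype)

# union_categoricals: the categories are combined and the result
# stays categorical; wrap it in a Series if one is needed.
merged = pd.Series(union_categoricals([blood_type1, blood_type2]))
print(merged.dtype)
print(list(merged.cat.categories))
```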
Memory Usage
The memory usage of a Categorical is proportional to the number of categories plus the length of the data.
By contrast, the memory usage of object-type data is a constant times the length of the data.
So when the number of categories is small, categorical data saves a great deal of memory:
blood_type = pd.Series(["AB", "O"] * 1000)
blood_type.nbytes  # 16000
blood_type.astype("category").nbytes  # 2016
When the number of categories approaches the length of the data, a Categorical uses about as much memory as the equivalent object representation, or even more:
blood_type = pd.Series(['AB%04d' % i for i in range(2000)])
blood_type.nbytes  # 16000
blood_type.astype("category").nbytes  # 20000
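To see where the "number of categories plus length of the data" rule comes from, the total can be split into its two parts: the integer codes (one per row) and the stored categories. This is a rough sketch; the names few and many are illustrative, and the exact byte counts depend on the pandas version and string storage, so none are asserted here:

```python
import pandas as pd

few = pd.Series(["AB", "O"] * 1000).astype("category")
many = pd.Series(["AB%04d" % i for i in range(2000)]).astype("category")

for label, s in [("few categories", few), ("many categories", many)]:
    codes_bytes = s.cat.codes.nbytes            # grows with the length of the data
    categories_bytes = s.cat.categories.nbytes  # grows with the number of categories
    # The integer type of the codes widens as the number of categories grows.
    print(label, s.cat.codes.dtype, codes_bytes, categories_bytes, s.nbytes)
```

With only two categories the codes fit in int8, while 2000 categories need int16, which is part of why the second Series gains nothing from the conversion.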