Pandas categorical data - task3

1. Create categorical data

Categorical data takes its values from a limited, usually fixed, set of categories. Binning numeric data into intervals is one common way to produce it.
To create a categorical series directly, specify dtype="category" when constructing it.

import numpy as np
import pandas as pd

pd.Series(["a", "b", "c", "a"], dtype="category")
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

pandas also has a built-in Categorical type. Build a Categorical first, then wrap it in a series.

cat = pd.Categorical(["a", "b", "c", "a"], categories=['a','b','c'])
pd.Series(cat)
Out[4]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
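
As a quick check of the two construction routes above, the following sketch compares their results and prints the integer codes that pandas stores for each value. The variable names s1 and s2 are only illustrative.

```python
import pandas as pd

# Route 1: let pandas infer the categories from the data
s1 = pd.Series(["a", "b", "c", "a"], dtype="category")

# Route 2: build a Categorical with explicit categories, then wrap it
cat = pd.Categorical(["a", "b", "c", "a"], categories=["a", "b", "c"])
s2 = pd.Series(cat)

# Both routes should give the same categorical series
print(s1.equals(s2))

# Each value is stored as an integer code pointing into the categories
print(list(s1.cat.codes))
print(list(s1.cat.categories))
```

The codes are positions in the category list, which is why a categorical column can be cheaper to store than repeated strings.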

To bin an existing column or array into categories, use the cut function. By default the intervals themselves are used as labels.

In [5]:
pd.cut(np.random.randint(0,60,5), [0,10,30,60])
Out[5]:
[(10, 30], (0, 10], (10, 30], (30, 60], (30, 60]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]

You can also specify strings as labels. With right=False the intervals are closed on the left and open on the right.

pd.cut(np.random.randint(0,60,5), [0,10,30,60],
       right=False, labels=['0-10','10-30','30-60'])
Out[6]:
[10-30, 30-60, 30-60, 10-30, 30-60]
Categories (3, object): [0-10 < 10-30 < 30-60]
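
Because the examples above use random input, the edge behavior of the bins is hard to see. The sketch below uses fixed, made-up values (the name ages is hypothetical) to show which side of each interval is closed.

```python
import pandas as pd

ages = [0, 10, 25, 30, 59]   # hypothetical sample values
bins = [0, 10, 30, 60]

# Default right=True: intervals are (0, 10], (10, 30], (30, 60]
right_closed = pd.cut(ages, bins)

# right=False: intervals are [0, 10), [10, 30), [30, 60)
left_closed = pd.cut(ages, bins, right=False,
                     labels=["0-10", "10-30", "30-60"])

# A code of -1 marks a value that fell outside every interval (NaN)
print(list(right_closed.codes))
print(list(left_closed))
print(right_closed.ordered)
```

With the default right=True, the value 0 falls outside (0, 10] and becomes NaN, while 10 and 30 land in the lower of their two neighboring bins. With right=False they move to the upper bin.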

2. Obtain information on categorical data

A categorical variable consists of three parts: values, categories, and order.
As the examples above show, a categorical variable created with the cut function is ordered by default.
Here is how to get or modify these properties.

(a) describe method
The describe method summarizes a categorical series: the number of non-missing values (count), the number of distinct values that actually occur (unique, which is not the number of declared categories), the most frequent value (top), and its frequency (freq).

In [7]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.describe()
Out[7]:
count     4
unique    3
top       a
freq      2
dtype: object

(b) categories and ordered attributes
View the declared categories and whether they are ordered.

In [8]:
s.cat.categories
Out[8]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [9]:
s.cat.ordered
Out[9]:
False
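
The difference between the unique count from describe and the number of declared categories can be checked directly. The following is a minimal sketch using the same series as above; the value_counts call is an addition for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=["a", "b", "c", "d"]))

summary = s.describe()
print(summary["count"])        # non-missing values
print(summary["unique"])       # distinct values that actually occur
print(len(s.cat.categories))   # declared categories, including unused 'd'

# value_counts lists every declared category, so unused 'd' appears with 0
counts = s.value_counts()
print(counts["d"])

# Categoricals produced by pd.cut are ordered; this hand-built one is not
cut_result = pd.cut([1, 5], [0, 3, 6])
print(s.cat.ordered, cut_result.ordered)
```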

3. Modify categories

  • Use set_categories to replace the category list. The underlying values are not relabeled: a value is kept only if it also appears in the new list, and every other value becomes NaN.
In [10]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.set_categories(['new_a','c'])
Out[10]:
0    NaN
1    NaN
2      c
3    NaN
4    NaN
dtype: category
Categories (2, object): [new_a, c]
  • Use rename_categories to relabel the categories. The displayed values change along with the category names, because no new buckets are created: the existing buckets simply get new labels, and the assignment of rows to buckets stays the same.
In [11]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])
Out[11]:
0    new_a
1    new_b
2    new_c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]
  • Pass a dictionary to rename_categories to rename only selected categories
In [12]:
s.cat.rename_categories({'a':'new_a','b':'new_b'})
Out[12]:
0    new_a
1    new_b
2        c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]
  • Use add_categories to add a new category
In [13]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.add_categories(['e'])
Out[13]:
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (5, object): [a, b, c, d, e]
  • Use remove_categories to remove categories; values that belong to a removed category become NaN
In [14]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_categories(['d'])
Out[14]:
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]
  • Use remove_unused_categories to drop categories that do not occur among the values
In [15]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_unused_categories()
Out[15]:
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]
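
To compare the methods in this section side by side, the sketch below counts the missing values each one leaves behind. The result names (replaced, renamed, and so on) are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=["a", "b", "c", "d"]))

# set_categories replaces the category list; values missing from it become NaN
replaced = s.cat.set_categories(["new_a", "c"])
print(replaced.isna().sum())

# rename_categories only relabels; no value is lost
renamed = s.cat.rename_categories({"a": "new_a", "b": "new_b"})
print(renamed.isna().sum())

# add_categories and remove_unused_categories change only the category list
added = s.cat.add_categories(["e"])
trimmed = added.cat.remove_unused_categories()
print(list(added.cat.categories))
print(list(trimmed.cat.categories))

# Removing a category that is in use turns its values into NaN
without_a = s.cat.remove_categories(["a"])
print(without_a.isna().sum())

# Each call returned a new series; s itself is unchanged
print(list(s.cat.categories))
```

Since every method returns a new series, assign the result to a variable when the change should be kept.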

4. Sorting categorical data

Categorical data is either ordered or unordered. For example, score bands form an ordinal variable, while the subject of an exam is generally treated as an unordered variable.

  • To turn a series into an ordered categorical variable, use the as_ordered method. The order follows the existing category order, which is alphabetical here.
In [16]:
s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
s
Out[16]:
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]

To revert to an unordered variable, use as_unordered

In [17]:
s.cat.as_unordered()
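
The practical effect of the ordered flag shows up in comparisons and in min/max. The following is a small sketch, assuming the same four-element series as above; the names ordered, unordered, and raised are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype("category")

# as_ordered keeps the existing category order: a < c < d
ordered = s.cat.as_ordered()
print(ordered.min(), ordered.max())
print(list(ordered > "a"))

# After as_unordered, order-dependent reductions are rejected
unordered = ordered.cat.as_unordered()
raised = False
try:
    unordered.max()
except TypeError:
    raised = True
print(raised)
```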

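Categorical data supports both value sorting and index sorting. First set the category order; the data can then be sorted by value. In the following examples s is a different series, holding 50 rating labels such as 'perfect' and 'good'. Note that set_categories returns a new series, so assign the result back to s if the new order should persist.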

s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],
                     ordered=True).head()
Out[21]:
0       good
1       fair
2        bad
3    perfect
4    perfect
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]

Sort by value

s.sort_values(ascending=False).head()
Out[22]:
29    perfect
17    perfect
31    perfect
3     perfect
4     perfect
dtype: category
Categories (5, object): [awful, bad, fair, good, perfect]
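
In the rating example the category order happens to match alphabetical order, so the effect of a custom order is easy to miss. The sketch below uses made-up labels (low, medium, high) whose logical order differs from the alphabet; the data and names are hypothetical.

```python
import pandas as pd

raw = pd.Series(["medium", "high", "low", "medium"])   # made-up data

# Plain strings sort alphabetically
print(list(raw.sort_values()))

# Assign the result back, because set_categories returns a new series
sizes = raw.astype("category").cat.set_categories(
    ["low", "medium", "high"], ordered=True)

# Categorical values sort by category order instead
print(list(sizes.sort_values()))
print(list(sizes.sort_values(ascending=False)))
```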

When the index is a category, you can sort by index

df_sort = pd.DataFrame({'cat':s.values,
                        'value':np.random.randn(50)}).set_index('cat')
df_sort.head()
	    value
cat	
good	-1.746975
fair	0.836732
bad	0.094912
perfect	-0.724338
perfect	-1.456362

Sort by index (ascending by default, following the category order)

df_sort.sort_index().head()
Out[24]:
		value
cat	
awful	0.245782
awful	0.063991
awful	1.541862
awful	-0.062976
awful	0.472542
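
Index sorting can be reproduced on a small frame with fixed values. This is a minimal sketch, assuming a hand-built CategoricalIndex with hypothetical labels rather than the random data used above.

```python
import pandas as pd

levels = ["low", "medium", "high"]   # hypothetical category order
idx = pd.CategoricalIndex(["medium", "high", "low"],
                          categories=levels, ordered=True, name="cat")
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx)

# Ascending is the default and follows the category order
asc = df.sort_index()
print(list(asc.index))

# Pass ascending=False to reverse it
desc = df.sort_index(ascending=False)
print(list(desc.index))
```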

Origin blog.csdn.net/hu_hao/article/details/106989922