100 Days to Python Proficiency (Data Analysis) - Day 71: Pandas text data processing: string/object type conversion, case conversion, text alignment, getting the length, counting occurrences, and encoding

1. Introduction to text data types

  • Pandas has two types for text data: object and string. A column that contains text, or text mixed with other values, defaults to the object type.
  • Before pandas 1.0, text data could only be stored as object; the string type was added in pandas 1.0.
  • If the type is not explicitly specified as string, text data is generally stored as object.

1) Type introduction

(1) A column that contains text, or text mixed with other values, is of object type by default:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['ee', 'ff', 'gg', np.nan],
    'C': [1, 2, 3, 4],
    'D': [5, 6, 7, np.nan]
})
print(df)
print(df.dtypes)

(2) The string type must be requested explicitly through the dtype parameter:

# Method 1: dtype='string'
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']}, dtype='string')
print(df.dtypes)

# Method 2: dtype=pd.StringDtype()
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']}, dtype=pd.StringDtype())
print(df.dtypes)

2) Type conversion

Method 1: cast to string by astype

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']})
# Before conversion
print(df)
print(df.dtypes)
# After conversion
df = df.astype("string")
print(df)
print(df.dtypes)

Method 2: Intelligent data type selection through df.convert_dtypes()

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['ee', 'ff', 'gg', np.nan],
    'C': [1, 2, 3, 4],
    'D': [5, 6, 7, np.nan]
})
print('Before type conversion')
print(df.dtypes)

df = df.convert_dtypes()  # automatically choose the best supported dtypes
print('After type conversion')
print(df.dtypes)

3) Type difference

The difference between string type and object type is as follows:

  • For the string type, accessor methods that return numeric output always return a nullable integer dtype (Int64). For the object type, they return int64 or float64, depending on whether NA values are present.
  • For the string type, methods that return boolean output return the nullable boolean dtype. For the object type, the result is bool or object, depending on whether NA values are present.

Difference 1: When counting strings

When counting with s.str.count():

  • string type: None yields <NA> and the dtype is Int64; after removing missing values with dropna(), the dtype is still Int64
  • object type: None yields NaN and the dtype is float64; after removing missing values with dropna(), the dtype is int64

string type:

import pandas as pd
import numpy as np

s = pd.Series(['小明', '小红', None], dtype='string')
print("Before dropping missing values:")
print(s)
print(s.str.count('小'))

print("After dropping missing values:")
s.dropna(inplace=True)
print(s)
print(s.str.count('小'))

object type:

import pandas as pd
import numpy as np

s = pd.Series(['小明', '小红', None], dtype='object')
print("Before dropping missing values:")
print(s)
print(s.str.count('小'))

print("After dropping missing values:")
s.dropna(inplace=True)
print(s)
print(s.str.count('小'))

Difference 2: When checking a string

When checking a string with str.isdigit():

  • string type: returns the nullable boolean dtype (dtype = boolean); the missing value is <NA>
  • object type: the values are booleans, but dtype = object; None stays None

import pandas as pd
import numpy as np

s = pd.Series(['小明', '小红', None], dtype='string')
print("string type:")
print(s.str.isdigit())

s = pd.Series(['小明', '小红', None], dtype='object')
print("object type:")
print(s.str.isdigit())

2. Python built-in string methods

Strings are a common data type: the text and JSON data we encounter all fall into this category. Python has many built-in methods for processing strings, which make it much easier to process and clean data. This section introduces case conversion, text alignment, getting the length, counting occurrences, and encoding.

1) Case conversion

Method                 Description
string.lower()         Converts all uppercase characters in the string to lowercase
string.upper()         Converts all lowercase characters in the string to uppercase
string.capitalize()    Capitalizes the first character of the string
string.title()         Capitalizes the first letter of each word in the string
string.swapcase()      Swaps the case of every character in the string
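
As a quick illustration of the methods in the table, the sketch below applies each one to a sample string. The sample text and variable names are made up for this example.

```python
# Sample text chosen only for illustration
text = "this is ABC"

lowered = text.lower()            # every character lowercased
uppered = text.upper()            # every character uppercased
capitalized = text.capitalize()   # first character upper, the rest lower
titled = text.title()             # first letter of each word upper, the rest lower
swapped = text.swapcase()         # upper becomes lower and vice versa

print(lowered)      # this is abc
print(uppered)      # THIS IS ABC
print(capitalized)  # This is abc
print(titled)       # This Is Abc
print(swapped)      # THIS IS abc
```

Note that capitalize() and title() also lowercase the remaining letters, which is why "ABC" becomes "abc" and "Abc" respectively.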

2) Text alignment

Method                  Description
string.ljust(width)     Returns a new string with the original string left-aligned and padded with spaces to length width
string.rjust(width)     Returns a new string with the original string right-aligned and padded with spaces to length width
string.center(width)    Returns a new string with the original string centered and padded with spaces to length width
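
The alignment methods can be tried on a short string as follows. This is a minimal sketch; the value "abc" and the width 8 are arbitrary choices, and repr() is used only so the padding is visible.

```python
word = "abc"  # sample value, chosen only for illustration

left = word.ljust(8)         # pad on the right with spaces
right = word.rjust(8)        # pad on the left with spaces
centered = word.center(8)    # pad on both sides with spaces
dashed = word.ljust(8, "-")  # an optional second argument sets the fill character
unchanged = word.ljust(2)    # a width shorter than the string returns it unchanged

print(repr(left))       # 'abc     '
print(repr(right))      # '     abc'
print(repr(centered))   # '  abc   '
print(repr(dashed))     # 'abc-----'
print(repr(unchanged))  # 'abc'
```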

3) Get the length

Method         Description
len(string)    Returns the length of the string

4) Get the number of occurrences

Method               Description
string.count(sub)    Returns the number of non-overlapping occurrences of the substring sub in the string

5) Encoding

Method                    Description
string.encode('utf-8')    Encodes the string to bytes using the given character encoding
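
The last three operations (length, occurrence count, and encoding) can be sketched together. The sample strings below are chosen only for illustration.

```python
text = "banana"  # sample value

length = len(text)          # number of characters
n_a = text.count("a")       # occurrences of a single character
n_ana = text.count("ana")   # occurrences are counted without overlap
n_upper = text.count("A")   # count is case-sensitive

encoded = "小明".encode("utf-8")   # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(length, n_a, n_ana, n_upper)  # 6 3 1 0
print(len(encoded))                 # 6 (each of these characters takes 3 bytes in UTF-8)
print(decoded)                      # 小明
```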

3. How does Pandas use built-in methods?

  • In day-to-day data cleaning and analysis, string data often needs to be processed. pandas provides many built-in string methods through the Series.str accessor. Once a column is accessed through .str, the common Python string methods and built-in functions become available on it, which makes string processing much more efficient.
  • Calling .str on a Series object exposes the built-in string methods (string functions in pandas are prefixed with str), so one call operates on a whole column of a DataFrame. This vectorized operation improves processing efficiency.
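
To make the idea concrete, here is a minimal sketch that compares one call on the .str accessor with an equivalent hand-written loop. The Series `names` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical column of names with one missing value
names = pd.Series(["alice", "bob", None], dtype="string")

# One call on the accessor processes the whole column; missing values stay missing
upper_vectorized = names.str.upper()

# Equivalent plain-Python loop, which has to handle the missing value by hand
upper_loop = [None if pd.isna(x) else x.upper() for x in names]

print(upper_vectorized)
print(upper_loop)
```

The accessor version is shorter and propagates missing values automatically, which is the main practical benefit shown here.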

1) Case conversion

Method                         Description
series_obj.str.lower()         Converts all uppercase characters in each string to lowercase
series_obj.str.upper()         Converts all lowercase characters in each string to uppercase
series_obj.str.capitalize()    Capitalizes the first character of each string
series_obj.str.title()         Capitalizes the first letter of each word in each string
series_obj.str.swapcase()      Swaps the case of every character in each string

Prepare data:

import pandas as pd
import numpy as np

series_obj = pd.Series(['A', 'b', 'ABC', 'Abc', 'abc', 'This is abc', np.nan], dtype='string')
print(series_obj)

1. Convert all uppercase characters in string to lowercase:

series_obj.str.lower()

2. Convert lowercase letters in string to uppercase:

series_obj.str.upper()

3. Capitalize the first character of a string:

series_obj.str.capitalize()

4. Capitalize the first letter of each word in the string (note the difference from capitalize):

series_obj.str.title()

5. Flip the case in string:

series_obj.str.swapcase()

2) Text alignment

Method                          Description
series_obj.str.ljust(width)     Returns new strings with each original string left-aligned and padded with spaces to length width
series_obj.str.rjust(width)     Returns new strings with each original string right-aligned and padded with spaces to length width
series_obj.str.center(width)    Returns new strings with each original string centered and padded with spaces to length width

1. Return a new string with the original string left-aligned and padded to length width (here the width is 8 and the fill character is '-' instead of a space):

# Left-align: width 8, remaining space filled with '-'
series_obj.str.ljust(8, fillchar='-')

2. Return a new string with the original string right-aligned and padded to length width (again with width 8 and '-' as the fill character):

# Right-align: width 8, remaining space filled with '-'
series_obj.str.rjust(8, fillchar='-')

3. Return a new string with the original string centered and padded to length width (again with width 8 and '-' as the fill character):

# Center: width 8, remaining space filled with '-'
series_obj.str.center(8, fillchar='-')

3) Get the length

Method                  Description
series_obj.str.len()    Returns the length of each string

series_obj.str.len()

4) Get the number of occurrences

Method                       Description
series_obj.str.count(pat)    Returns the number of times the pattern pat (a regular expression) occurs in each string

Count how many times A appears; count is case-sensitive:

series_obj.str.count('A')
series_obj.str.count('a')
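
Because pandas interprets the argument of Series.str.count as a regular expression, special characters need escaping, and a case-insensitive count can be requested through the flags parameter. The following is a small sketch with made-up sample data, not part of the original walkthrough.

```python
import re
import pandas as pd

# Sample data chosen only for illustration
s = pd.Series(["A", "Abc", "a.b.c", None], dtype="string")

count_upper_a = s.str.count("A")                     # case-sensitive
count_any_a = s.str.count("a", flags=re.IGNORECASE)  # case-insensitive via regex flags
count_dots = s.str.count(r"\.")                      # a literal dot must be escaped

print(count_upper_a)
print(count_any_a)
print(count_dots)
```

Without the backslash, the pattern "." would match every character rather than only dots.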

5) Encoding

Method                            Description
series_obj.str.encode('utf-8')    Encodes each string to bytes using the given character encoding

Encode each string as UTF-8:

series_obj.str.encode('utf-8')

4. Precautions

1. The .str accessor is only available on Series-like objects, not on a whole DataFrame. Besides a regular column such as df.col, it can also be used on the index objects df.index and df.columns.
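
A minimal sketch of this point, using a hypothetical DataFrame whose column labels and row labels are both strings:

```python
import pandas as pd

# Hypothetical frame; the labels are chosen only for illustration
df = pd.DataFrame({"Name": ["a", "b"], "AGE": [1, 2]}, index=["row_x", "row_y"])

# A DataFrame itself has no .str accessor
frame_has_str = hasattr(df, "str")

# Column labels and row labels are Index objects, which do support .str
df.columns = df.columns.str.lower()
df.index = df.index.str.upper()

print(frame_has_str)        # False
print(df.columns.tolist())  # ['name', 'age']
print(df.index.tolist())    # ['ROW_X', 'ROW_Y']
```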

2. Make sure the values being accessed are strings. If they are not, convert them first with astype(str); otherwise an error is raised.
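
The following sketch shows the error and the conversion on a numeric Series. The data and the variable names are assumptions made for this example.

```python
import pandas as pd

nums = pd.Series([10, 200, 3000])  # integers, not strings

# Using .str on a numeric Series raises AttributeError
try:
    nums.str.len()
    raised = False
except AttributeError:
    raised = True

# Converting to strings first makes the accessor available
lengths = nums.astype(str).str.len()

print(raised)            # True
print(lengths.tolist())  # [2, 3, 4]
```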

3. Some methods cannot be used on a string Series, for example series_obj.str.decode(), because the Series stores strings rather than bytes:

series_obj.str.decode('utf-8')
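
By contrast, decode does work once the values are bytes. A minimal sketch, assuming a small sample Series that is first encoded and then decoded again:

```python
import pandas as pd

# Sample data chosen only for illustration
s = pd.Series(["abc", "小明"], dtype="string")

# encode turns each string into bytes
encoded = s.str.encode("utf-8")

# decode is valid here because the values are now bytes
decoded = encoded.str.decode("utf-8")

print(encoded.tolist())
print(decoded.tolist())
```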

4. Accessor calls can be chained, for example series_obj.str.lower().str.title(), and the effects stack:

# Lowercase everything first, then capitalize the first letter of each word
series_obj.str.lower().str.title()

Origin blog.csdn.net/yuan2019035055/article/details/128602503