1. Introduction to text data types
Pandas has two text data types: object and string. If a column contains text, or a mix of text and other values, it defaults to the object type.
- Before pandas 1.0, text could only be stored as the object type; the string type was introduced in pandas 1.0.
- If the type is not explicitly specified as string, text is generally stored as object.
1) Type introduction
(1) A column that contains text is of object type by default, while purely numeric columns get a numeric dtype:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['a', 'b', 'c', 'd'],
'B': ['ee', 'ff', 'gg', np.nan],
'C': [1, 2, 3, 4],
'D': [5, 6, 7, np.nan]
})
print(df)
print(df.dtypes)
(2) The string type must be requested explicitly through the dtype parameter:
# Method 1: dtype='string'
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']}, dtype='string')
print(df.dtypes)
# Method 2: dtype=pd.StringDtype()
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']}, dtype=pd.StringDtype())
print(df.dtypes)
2) Type conversion
Method 1: cast to string with astype
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']})
# Before conversion
print(df)
print(df.dtypes)
# After conversion
df = df.astype("string")
print(df)
print(df.dtypes)
Method 2: let pandas choose the best dtypes automatically with df.convert_dtypes()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['a', 'b', 'c', 'd'],
'B': ['ee', 'ff', 'gg', np.nan],
'C': [1, 2, 3, 4],
'D': [5, 6, 7, np.nan]
})
print('Before type conversion')
print(df.dtypes)
df = df.convert_dtypes()  # automatically choose the best dtypes
print('After type conversion')
print(df.dtypes)
3) Type difference
The difference between string type and object type is as follows:
- For the string type, string accessor methods that return numeric output always return a nullable integer type; for the object type they return int or float, depending on whether NA values are present.
- For the string type, methods that return boolean output return the nullable boolean data type; for the object type, the result stays object when NA values are present.
Difference 1: Counting strings
When counting with s.str.count():
- For the string type, None yields <NA> and the dtype is Int64; after removing missing values with dropna(), the dtype is still Int64.
- For the object type, None yields NaN and the dtype is float64; after removing missing values with dropna(), the dtype is int64.
string type:
import pandas as pd
import numpy as np
s = pd.Series(['小明', '小红', None], dtype='string')
print("Before dropping missing values:")
print(s)
print(s.str.count('小'))
print("After dropping missing values:")
s.dropna(inplace=True)
print(s)
print(s.str.count('小'))
object type:
import pandas as pd
import numpy as np
s = pd.Series(['小明', '小红', None], dtype='object')
print("Before dropping missing values:")
print(s)
print(s.str.count('小'))
print("After dropping missing values:")
s.dropna(inplace=True)
print(s)
print(s.str.count('小'))
Difference 2: Checking strings
When checking strings with str.isdigit():
- For the string type, the result has the nullable boolean dtype and a missing value becomes <NA>.
- For the object type, the values are still booleans but the dtype is object, and None stays None.
import pandas as pd
import numpy as np
s = pd.Series(['小明', '小红', None], dtype='string')
print("string type:")
print(s.str.isdigit())
s = pd.Series(['小明', '小红', None], dtype='object')
print("object type:")
print(s.str.isdigit())
2. Python built-in string methods
Strings are a common data type: the text and JSON data we work with are all strings. Python has many built-in methods for processing strings, which make it much easier to process and clean data. This section covers case conversion, text alignment, getting the length, counting occurrences, and encoding.
1) Case conversion
Method | Description |
---|---|
string.lower() | Converts all uppercase characters in string to lowercase |
string.upper() | Converts all lowercase characters in string to uppercase |
string.capitalize() | Capitalizes the first character of string and lowercases the rest |
string.title() | Capitalizes the first letter of each word in string |
string.swapcase() | Swaps the case of every letter in string |
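As a small illustration of the table above, the sketch below applies each method to one made-up sample string. The variable names and the sample text are assumptions chosen for the example.

```python
# Plain-Python case conversion on a single, made-up sample string.
text = "hello WORLD from pandas"

lowered = text.lower()            # all cased characters become lowercase
uppered = text.upper()            # all cased characters become uppercase
capitalized = text.capitalize()   # first character upper, the rest lower
titled = text.title()             # first letter of each word upper, the rest lower
swapped = text.swapcase()         # upper becomes lower and vice versa

for name, value in [
    ("lower", lowered),
    ("upper", uppered),
    ("capitalize", capitalized),
    ("title", titled),
    ("swapcase", swapped),
]:
    print(f"{name:>10}: {value}")
```

Note that capitalize() and title() also lowercase the remaining letters, which is easy to overlook when the input contains uppercase words.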
2) Text alignment
Method | Description |
---|---|
string.ljust(width) | Returns a new string with the original string left-aligned and padded with spaces to length width |
string.rjust(width) | Returns a new string with the original string right-aligned and padded with spaces to length width |
string.center(width) | Returns a new string with the original string centered and padded with spaces to length width |
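The alignment methods can be sketched as follows. The sample label and the width of 7 are assumptions picked so that the padding splits evenly; the optional second argument is the fill character, which defaults to a space.

```python
# Plain-Python text alignment on a made-up three-character label.
label = "abc"
width = 7

left = label.ljust(width)            # pad on the right with spaces
right = label.rjust(width)           # pad on the left with spaces
centered = label.center(width)       # pad on both sides with spaces
dashed = label.center(width, "-")    # same, but with '-' as the fill character

# repr() makes the padding spaces visible when printed.
print(repr(left))
print(repr(right))
print(repr(centered))
print(repr(dashed))
```

If width is not larger than the length of the string, all three methods return the string unchanged.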
3) Get the length
Method | Description |
---|---|
len(string) | Returns the length of the string. |
4) Get the number of occurrences
Method | Description |
---|---|
string.count(sub) | Returns the number of non-overlapping occurrences of the substring sub in string |
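A minimal sketch of getting the length and counting occurrences with plain Python follows. The sample word is an assumption chosen for the example.

```python
# Length and occurrence counting on a made-up sample word.
word = "Banana"

length = len(word)               # number of characters
count_lower_a = word.count("a")  # counting is case-sensitive
count_upper_a = word.count("A")  # no uppercase 'A' in the word
count_an = word.count("an")      # substrings longer than one character work too
count_overlap = "aaaa".count("aa")  # occurrences are counted without overlap

print(length, count_lower_a, count_upper_a, count_an, count_overlap)
```

Because overlapping matches are skipped, "aaaa".count("aa") finds two occurrences rather than three.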
5) Coding
Method | Description |
---|---|
string.encode('utf-8') | Encodes string into bytes using the given character encoding |
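The sketch below shows what encoding does to a string, using one of the names that appears in this article's sample data. It assumes UTF-8 as the target encoding, as in the table above.

```python
# Encoding turns a str into bytes; decoding turns the bytes back into a str.
text = "小明"

encoded = text.encode("utf-8")     # bytes object
decoded = encoded.decode("utf-8")  # back to the original str

print(type(encoded).__name__, len(text), len(encoded))
print(decoded == text)
```

The string has two characters but six bytes, because each of these characters takes three bytes in UTF-8.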
3. How does Pandas use built-in methods?
- Day-to-day data cleaning and analysis often involves processing string data. Pandas provides many built-in string methods through the Series.str accessor. Once a column is accessed through .str, the common Python string methods and built-in functions can be applied to it, which greatly improves the efficiency of processing string data.
- Calling .str on a Series object exposes the built-in string methods (string functions in pandas start with str), so they operate on a whole column of a DataFrame at once. This vectorized operation improves processing efficiency.
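To make the relationship between the built-in methods and the .str accessor concrete, here is a minimal sketch that lowercases the same values both ways. The sample names are hypothetical, and the string dtype is chosen only to match the rest of this article.

```python
import pandas as pd

names = ["Alice", "BOB", None]  # hypothetical sample data with a missing value

# Plain Python: loop over the values and guard against the missing one.
lowered_list = [n.lower() if n is not None else None for n in names]

# pandas: .str applies the method to every element and propagates missing values.
s = pd.Series(names, dtype="string")
lowered_series = s.str.lower()

print(lowered_list)
print(lowered_series)
```

The accessor version needs no explicit loop or None check, which is the main convenience when cleaning whole columns.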
1) Case conversion
Method | Description |
---|---|
series_obj.str.lower() | Converts all uppercase characters in each string to lowercase |
series_obj.str.upper() | Converts all lowercase characters in each string to uppercase |
series_obj.str.capitalize() | Capitalizes the first character of each string and lowercases the rest |
series_obj.str.title() | Capitalizes the first letter of each word in each string |
series_obj.str.swapcase() | Swaps the case of every letter in each string |
Prepare data:
import pandas as pd
import numpy as np
series_obj = pd.Series(['A', 'b', 'ABC', 'Abc', 'abc', 'This is abc', np.nan], dtype='string')
print(series_obj)
1. Convert all uppercase characters in string to lowercase:
series_obj.str.lower()
2. Convert lowercase letters in string to uppercase:
series_obj.str.upper()
3. Capitalize the first character of a string:
series_obj.str.capitalize()
4. Capitalize the first letter of each word in the string (note the difference from capitalize):
series_obj.str.title()
5. Flip the case in string:
series_obj.str.swapcase()
2) Text alignment
Method | Description |
---|---|
series_obj.str.ljust(width) | Returns a new string with the original string left-aligned and padded with spaces to length width |
series_obj.str.rjust(width) | Returns a new string with the original string right-aligned and padded with spaces to length width |
series_obj.str.center(width) | Returns a new string with the original string center-aligned and padded with spaces to length width |
1. Return a new string with the original string left-aligned and padded to a length of width (here with '-' instead of spaces):
# Left-align: width 8, fill the remaining space with '-'
series_obj.str.ljust(8, fillchar='-')
2. Return a new string with the original string right-aligned and padded to a length of width (here with '-' instead of spaces):
# Right-align: width 8, fill the remaining space with '-'
series_obj.str.rjust(8, fillchar='-')
3. Return a new string with the original string center-aligned and padded to a length of width (here with '-' instead of spaces):
# Center-align: width 8, fill the remaining space with '-'
series_obj.str.center(8, fillchar='-')
3) Get the length
Method | Description |
---|---|
series_obj.str.len() | Returns the length of each string element |
series_obj.str.len()
4) Get the number of occurrences
Method | Description |
---|---|
series_obj.str.count(pat) | Returns the number of times the pattern pat occurs in each string element; pat is interpreted as a regular expression |
Count how many times A appears in each element; count is case-sensitive:
series_obj.str.count('A')
series_obj.str.count('a')
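The case sensitivity of count can be checked with a short sketch. It reuses the sample values from this section but leaves out the missing value so the results are easy to compare; the final two lines illustrate, with a made-up string, that the pattern is treated as a regular expression, so re.escape is one way to count a literal character.

```python
import re
import pandas as pd

# Same sample values as above, without the missing value.
s = pd.Series(["A", "b", "ABC", "Abc", "abc", "This is abc"], dtype="string")

upper_a = s.str.count("A")                 # only uppercase 'A' is counted
lower_a = s.str.count("a")                 # only lowercase 'a' is counted
any_case_a = s.str.lower().str.count("a")  # lowercase first to ignore case

# The pattern is a regular expression: '.' matches any character.
dots = pd.Series(["a.b.c"], dtype="string")
as_regex = dots.str.count(".")               # counts every character
as_literal = dots.str.count(re.escape("."))  # counts literal dots only

print(upper_a.tolist(), lower_a.tolist(), any_case_a.tolist())
print(as_regex.tolist(), as_literal.tolist())
```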
5) Coding
Method | Description |
---|---|
series_obj.str.encode('utf-8') | Encodes each string element into bytes using the given character encoding |
Encode each string as UTF-8:
series_obj.str.encode('utf-8')
4. Precautions
1. The .str accessor is only available on Series and Index objects. Besides a regular column such as df.col, it can therefore also be used on df.index and df.columns.
2. Make sure the values being accessed are strings. If they are not, convert them first with astype(str); otherwise an error is raised.
3. Some methods cannot be used on a string Series, such as series_obj.str.decode(), because the Series stores strings rather than bytes:
series_obj.str.decode('utf-8')
4. Accessor calls can be chained, for example series_obj.str.lower().str.title(), and their effects stack:
# Lowercase everything first, then capitalize the first letter of each word
series_obj.str.lower().str.title()
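Precautions 1 and 2 above can be sketched as follows. The DataFrame, its column names, and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical data: one text column and one integer column.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 42]})

# Precaution 1: .str also works on Index objects such as df.columns.
upper_cols = df.columns.str.upper()

# Precaution 2: using .str on a non-string column raises AttributeError.
try:
    df["age"].str.len()
    error_raised = False
except AttributeError:
    error_raised = True

# Converting with astype(str) first makes the accessor available.
age_lengths = df["age"].astype(str).str.len()

print(list(upper_cols))
print(error_raised)
print(age_lengths.tolist())
```

Converting with astype(str) turns every value into text, including missing values, so it is worth checking for NaN before converting real data.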