Task 7: Text Data

1. Questions
2. Exercises

1. Questions

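[Question 1] How do str accessor methods differ from DataFrame/Series methods?

Methods under the str accessor apply element-wise to the individual string values of a Series (or Index), whereas DataFrame/Series methods operate on the container as a whole. A DataFrame has no str accessor, so string methods must be called column by column.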
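To make the distinction concrete, here is a minimal sketch on a made-up Series (the values are illustrative only). The str methods return one result per element and pass missing values through, while the Series-level calls summarize the whole column.

```python
import pandas as pd

# Hypothetical sample data, not taken from the exercise files
s = pd.Series(['apple', 'Banana', None])

# .str methods run element-wise on each string and propagate missing values
upper = s.str.upper()
lengths = s.str.len()

# Series-level methods describe the container as a whole
n_non_null = s.count()   # number of non-missing entries
n_total = len(s)         # number of rows, missing or not
```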

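[Question 2] Given a column of string type, how can you tell whether each cell holds numeric data?

Use the str.isnumeric() method. Note that it returns True only when every character in the string is numeric, so strings with a sign or a decimal point, such as '-1' or '3.14', are reported as non-numeric.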
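The sketch below compares str.isnumeric() with a parsing-based check on a small made-up Series. The pd.to_numeric approach is an alternative suggested here, not something the original answer uses; it may be the better fit when negative or decimal values should count as numeric.

```python
import pandas as pd

# Hypothetical sample values
s = pd.Series(['123', '-5', '3.14', 'abc'])

# True only when every character is a numeric character,
# so the minus sign and the decimal point make '-5' and '3.14' fail
by_isnumeric = s.str.isnumeric()

# errors='coerce' turns strings that cannot be parsed into NaN
by_parsing = pd.to_numeric(s, errors='coerce').notna()
```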

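[Question 3] What does the rsplit method do, and when is it appropriate?

rsplit() splits a string on the given separator and returns a list. The default separator is any whitespace, including spaces, newlines (\n), and tabs (\t). It behaves like split(), except that it splits starting from the end of the string. The two differ only when the number of splits is limited (the n argument in pandas), which makes rsplit suitable when the part of interest sits at the end of the string, such as a file extension.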
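A short sketch of the difference, using made-up file paths: with n=1 only one cut is made, and rsplit makes it at the last separator while split makes it at the first.

```python
import pandas as pd

# Hypothetical file names
paths = pd.Series(['data/raw/a.tar.gz', 'notes.txt'])

# n=1 limits the split to one cut; rsplit cuts at the last '.'
from_right = paths.str.rsplit('.', n=1)

# split with the same limit cuts at the first '.'
from_left = paths.str.split('.', n=1)
```

When a string contains the separator only once, as in 'notes.txt', both calls give the same result.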

[Question 4] Sections 2 through 4 of this chapter introduced five kinds of string operations. Consider which scenarios each of them is suited to.

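2. Exercises

[Exercise 1] Given a dataset of strings, solve the following problems:

(a) Encode each person's information as a string (add an ID column after the person number) in the following format: “×××（名字）：×国人，性别×，生于×年×月×日” (that is, "name: national of country ×, sex ×, born on year × month × day ×").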

import pandas as pd

df_1 = pd.read_csv('../data/String_data_one.csv',index_col='人员编号')
df_1.head()


df_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 1 to 2000
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   姓名      2000 non-null   object
 1   国籍      2000 non-null   int64 
 2   性别      2000 non-null   object
 3   出生年     2000 non-null   int64 
 4   出生月     2000 non-null   int64 
 5   出生日     2000 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 109.4+ KB

# Convert every column to str so that the columns can be concatenated
df_1 = df_1.astype('str')
(df_1['姓名'] + '：' + df_1['国籍'] + '国人，' + '性别' + df_1['性别'] + '，生于' + df_1['出生年'] + '年' +
 df_1['出生月'] + '月' + df_1['出生日'] + '日').to_frame().rename(columns={0: 'ID'})
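As a side note, the same kind of concatenation can also be written with str.cat, which joins the calling Series with another one using a separator. This is only a sketch on a made-up frame; the column names name and year are illustrative and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df_1
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'year': [1974, 1980]})

# Integer columns must be converted to str before they can be joined to text
year_str = df['year'].astype(str)

# str.cat joins the caller with the other Series, element by element
label = df['name'].str.cat(year_str, sep=': ')
```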

(b) Change the birthday part of the result in (a) to Chinese numerals (for example, 一九七四年十月二十三日), leaving the rest of the format unchanged.

L_year = list('零一二三四五六七八九')
# L_one and L_two render the tens and units digits of months and days (so that 11 becomes 十一, not 一一)
L_one = [s.strip() for s in list('  二三四五六七八九')]
L_two = [s.strip() for s in list(' 一二三四五六七八九')]

# Two-digit match -> Chinese numerals: tens digit, then 十 if the tens digit is nonzero, then units digit
pat = r'(?P<one>\d)(?P<two>\d)'
to_cn = lambda x: L_one[int(x.group('one'))] + bool(int(x.group('one')))*'十' + L_two[int(x.group('two'))]

# regex=True is passed explicitly: pandas 2.0+ defaults to regex=False, which rejects a callable replacement
df_new = (df_1['姓名'] + '：' + df_1['国籍'] + '国人，' + '性别' + df_1['性别'] + '，生于' +
          df_1['出生年'].str.replace(r'\d', lambda x: L_year[int(x.group(0))], regex=True) + '年' +
          df_1['出生月'].str.zfill(2).str.replace(pat, to_cn, regex=True) + '月' +
          df_1['出生日'].str.zfill(2).str.replace(pat, to_cn, regex=True) + '日')

df_new = df_new.to_frame().rename(columns={0: 'ID'})
df_new
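The conversion rule itself is easier to check in isolation. The following is a plain-Python sketch of the same idea, not the regex-based solution used above; the function names year_to_chinese and day_to_chinese are made up for illustration.

```python
DIGITS = '零一二三四五六七八九'

def year_to_chinese(year: int) -> str:
    """Spell a year digit by digit, e.g. 1974 -> 一九七四."""
    return ''.join(DIGITS[int(ch)] for ch in str(year))

def day_to_chinese(n: int) -> str:
    """Render a month (1..12) or day (1..31) in Chinese numerals."""
    tens, units = divmod(n, 10)
    text = ''
    if tens:
        # A leading 一 is dropped: 10 is 十, not 一十
        text += ('' if tens == 1 else DIGITS[tens]) + '十'
    if units:
        text += DIGITS[units]
    return text

birthday = year_to_chinese(1974) + '年' + day_to_chinese(10) + '月' + day_to_chinese(23) + '日'
```

Years are spelled digit by digit, while months and days use the positional 十, which is why the two cases need separate handling.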


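(c) Split the ID column from (b) back into the corresponding 5 columns of the original table, and use equals to check whether the result matches.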

dic_year = dict(zip('零一二三四五六七八九', '0123456789'))
# In dic_two a bare 十 maps to '0'; a lone '0' is restored to '10' further down
dic_two = dict(zip('十一二三四五六七八九', '0123456789'))
dic_one = {'十': '1', '二十': '2', '三十': '3', None: ''}
df_res = df_new['ID'].str.extract(r'(?P<姓名>[a-zA-Z]+)：(?P<国籍>[\d])国人，性别(?P<性别>[\w])，生于(?P<出生年>[\w]{4})年(?P<出生月>[\w]+)月(?P<出生日>[\w]+)日')
# regex=True is passed explicitly: pandas 2.0+ defaults to regex=False, which rejects a callable replacement
df_res['出生年'] = df_res['出生年'].str.replace(r'\w', lambda x: dic_year[x.group(0)], regex=True)
pat = r'(?P<one>\w?十)?(?P<two>\w)'
to_digits = lambda x: dic_one[x.group('one')] + dic_two[x.group('two')]
# A lone 十 (the number 10) comes out as '0' and is then rewritten to '10'
df_res['出生月'] = df_res['出生月'].str.replace(pat, to_digits, regex=True).str.replace(r'^0$', '10', regex=True)
df_res['出生日'] = df_res['出生日'].str.replace(pat, to_digits, regex=True).str.replace(r'^0$', '10', regex=True)
df_res.head()
# The exercise also asks for an equals check; df_1 was already converted to strings in (a)
df_res.equals(df_1)
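For comparison, the reverse mapping can be sketched in plain Python with str.partition instead of the regex replacement used above. The helper name chinese_to_day is hypothetical, and the sketch only assumes well-formed numerals for 1 to 99.

```python
DIGITS = {ch: i for i, ch in enumerate('零一二三四五六七八九')}

def chinese_to_day(text: str) -> int:
    """Parse numerals such as 五, 十, 十一, 二十, 三十一 into an int."""
    if '十' not in text:
        return DIGITS[text]
    # Split around 十: the left part is the tens digit, the right part the units digit
    tens, _, units = text.partition('十')
    # A missing tens digit means 1 (十一 is 11); a missing units digit means 0 (二十 is 20)
    return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[units] if units else 0)

def chinese_to_year(text: str) -> int:
    """Parse a digit-by-digit year such as 一九七四."""
    return int(''.join(str(DIGITS[ch]) for ch in text))
```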


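[Exercise 2] Given a semi-synthetic dataset whose first column contains news headlines about the novel coronavirus, solve the following problems:

(a) Select all rows whose headline concerns Beijing (北京) or Shanghai (上海).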

df_2 = pd.read_csv('../data/String_data_two.csv')
df_2.head()


df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col1    500 non-null    object
 1   col2    500 non-null    object
 2   col3    500 non-null    object
dtypes: object(3)
memory usage: 11.8+ KB

df_2.loc[df_2['col1'].str.contains(r'北京|上海')]
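One thing to watch with patterns like this: a character class such as [北京]{2} matches any two characters drawn from the set, in any order, while the alternation 北京|上海 matches only the literal names. The headlines in this sketch are made up to show the difference.

```python
import pandas as pd

# Hypothetical headlines, not rows from String_data_two.csv
titles = pd.Series(['北京新增病例', '上海通报', '京北地区消息', '广州通报'])

# [北京]{2} accepts any two characters from the set {北, 京}, so 京北 matches too
by_class = titles.str.contains(r'[北京]{2}')

# 北京|上海 matches only the two literal city names
by_alternation = titles.str.contains(r'北京|上海')
```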


(b) Compute the mean of col2.

# info() shows that col2 is not an integer column, so it must contain malformed entries
df_2.loc[~df_2['col2'].str.match(r'^-?\d+$'), 'col2']


# Patch the three malformed entries found by the filter by hand
df_2.loc[[309, 396, 485], 'col2'] = [0, 9, 7]
df_2['col2'].astype('int').mean()

-0.984
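The find-then-fix workflow can be tried on a small made-up Series. The sketch also shows str.extract as a possible alternative to patching by hand; that is an assumption of this example rather than the approach taken above, and it is only safe once inspection confirms the stray characters are pure noise.

```python
import pandas as pd

# Hypothetical values with stray characters mixed in
col = pd.Series(['12', '-3', '0-', '/7', '9`'])

# Entries that are not a plain (optionally negative) integer
bad = col[~col.str.match(r'^-?\d+$')]

# Keep only the first integer-looking run in each string
cleaned = col.str.extract(r'(-?\d+)', expand=False).astype(int)
mean_value = cleaned.mean()
```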

(c) Compute the mean of col3.

# Filter col3 in the same way
# One pitfall: the column is actually named 'col3  ', with two trailing spaces
df_2.columns

Index(['col1', 'col2', 'col3  '], dtype='object')
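Padded column names like this can also be normalized once, since the str accessor works on an Index as well. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame reproducing the padded column name
df = pd.DataFrame({'col1': ['a'], 'col3  ': ['1.5']})

# Strip leading and trailing whitespace from every column label
df.columns = df.columns.str.strip()

value = df['col3'].astype(float)[0]
```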

df_2.loc[~df_2['col3  '].str.match(r'^-?\d+(?:\.\d+)?$'), 'col3  ']
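When writing a pattern for decimal numbers, the dot needs escaping: an unescaped . matches any character. The sketch below compares a loose pattern with a stricter one on made-up strings, using Python's re module; the variable names are illustrative.

```python
import re

# Unescaped dot matches any character, and two \d+ groups demand at least two digits
loose = re.compile(r'^-?\d+.?\d+$')

# Optional literal decimal part after the integer part
strict = re.compile(r'^-?\d+(?:\.\d+)?$')

samples = ['3.14', '-20', '7', '3x14', '.5', '12.']
loose_ok = [s for s in samples if loose.match(s)]
strict_ok = [s for s in samples if strict.match(s)]
```

The loose pattern accepts '3x14' and rejects the single digit '7'; the strict one does the opposite.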


# Patch the three malformed entries found by the filter by hand
df_2.loc[[28, 122, 332], 'col3  '] = [355.3567, 9056.2253, 3534.6554]
df_2['col3  '].astype('float').mean()

24.707484999999988
