Python: checking whether a string contains given characters (str.contains)

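We have a DataFrame, ssqdatav2, in which we need to find the entries for Shenzhen (深圳) and replace them with a single-character form.

Because of errors in the collected data, Shenzhen appears in a field where every other region is given only as a one-character province abbreviation.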
How do we find the rows of the DataFrame that contain Shenzhen?

cond=ssqdatav2['first'].str.contains('深圳')
ssqdatav2.loc[cond]

At this point, the rows whose 'first' column contains Shenzhen have been found.
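
As a minimal, self-contained sketch of the same lookup, the snippet below builds a small sample DataFrame (the values are hypothetical, since the real ssqdatav2 is not shown) and filters it with str.contains. The regex=False and na=False arguments are optional additions that are not in the original code: the first treats the pattern as a literal substring, the second counts missing values as "no match".

```python
import pandas as pd

# Hypothetical sample data; the real ssqdatav2 is not shown in the article.
df = pd.DataFrame({"first": ["粤2", "深圳1", None, "京1沪2"]})

# regex=False: match the literal substring instead of a regular expression.
# na=False: treat missing values as "no match" so the mask is purely boolean.
cond = df["first"].str.contains("深圳", regex=False, na=False)

# Boolean indexing keeps only the rows where the mask is True.
matches = df.loc[cond]
print(matches)
```

Without na=False, a missing value in the column would produce a missing value in the mask, which .loc cannot use for filtering.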

1. Extract the Chinese characters from the 'first' column

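import re

# Helper function that breaks down the first-prize field
def fpp(x):
    if len(x) <= 2:  # Check whether the value holds only Chinese characters or digits as well
        return "待定"  # Values without Chinese characters are marked "待定" (pending)
    else:  # Use a regular expression to extract the Chinese characters
        pattern = "[\u4e00-\u9fa5]"  # Unicode (not ASCII) range of common Chinese characters
        pat = re.compile(pattern)
        return ','.join(pat.findall(x))  # Use a comma to separate the provinces
# Apply fpp() to every row
ssqdatav2['fpprovince'] = ssqdatav2['first'].apply(fpp)
ssqdatav2.head()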
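
To see what the regular expression does to a value that contains Shenzhen, here is a small standalone sketch. The helper name extract_han and the sample string are hypothetical, and the length check from fpp() is left out for brevity. Because the pattern matches one character at a time, the two characters of 深圳 come out as two separate entries.

```python
import re

# Same character range as in fpp(): common Chinese characters.
HAN = re.compile("[\u4e00-\u9fa5]")

def extract_han(text):
    """Return every Chinese character in text, joined by commas."""
    return ",".join(HAN.findall(text))

# Hypothetical field value: one province abbreviation plus Shenzhen.
result = extract_han("粤2深圳1")
print(result)  # 粤,深,圳
```

This is why Shenzhen needs special handling later: it occupies two slots where every province occupies one.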

Put each province into its own column (from here on the DataFrame is called ssqdatav3; the step that derives it from ssqdatav2 is not shown):

fpnames=['p01','p02','p03','p04','p05'] 
ssqdatav3[fpnames]=ssqdatav3['fpprovince'].str.split(',',expand=True)
ssqdatav3
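
The following sketch shows, on hypothetical values, what str.split with expand=True returns. Assigning the result to a fixed list of five names expects the split to produce exactly five columns, so this variant derives the column names from the actual width instead; that naming step is an assumption of the sketch, not part of the original code.

```python
import pandas as pd

# Hypothetical comma-separated province strings.
s = pd.Series(["粤", "粤,深,圳", "京,沪"])

# expand=True returns a DataFrame with one column per split position;
# rows with fewer pieces are padded with missing values.
parts = s.str.split(",", expand=True)

# Name the columns p01, p02, ... according to how many were produced.
parts.columns = [f"p{i + 1:02d}" for i in range(parts.shape[1])]
print(parts)
```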

To get rid of the None values, split position by position so that every missing position becomes an empty string:

# Split one position at a time
ssqdatav3['p001']=ssqdatav3['fpprovince'].apply(lambda x:x if x.count(',')==0 else x.split(',')[0])
ssqdatav3['p002']=ssqdatav3['fpprovince'].apply(lambda x:x.split(',')[1] if x.count(',')>=1 else '')
ssqdatav3['p003']=ssqdatav3['fpprovince'].apply(lambda x:x.split(',')[2] if x.count(',')>=2 else '')
ssqdatav3['p004']=ssqdatav3['fpprovince'].apply(lambda x:x.split(',')[3] if x.count(',')>=3 else '')
ssqdatav3['p005']=ssqdatav3['fpprovince'].apply(lambda x:x.split(',')[4] if x.count(',')>=4 else '')

ssqdatav3.to_excel('ssqdatav3p05.xlsx',index=False)
ssqdatav3.head()
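
A shorter route to the same result, sketched here on hypothetical data, is to keep the expand=True split and replace the missing values with fillna(''). This is offered as an alternative under the assumption that empty strings are all that is wanted; it is not the approach used in the code above.

```python
import pandas as pd

# Hypothetical stand-in for ssqdatav3['fpprovince'].
df = pd.DataFrame({"fpprovince": ["粤", "粤,深,圳", "京,沪"]})

# Split into columns, then turn every missing position into an empty string.
parts = df["fpprovince"].str.split(",", expand=True).fillna("")

# Name the columns p001, p002, ... to mirror the manual version.
parts.columns = [f"p{i + 1:03d}" for i in range(parts.shape[1])]

# Attach the new columns to the original frame by index.
df = df.join(parts)
print(df)
```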


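# Use the 双色球 (Double Color Ball) draw ID as an order number: each of the 7 numbers gets the matching order number, i.e. every draw ID has 7 order numbers on separate rows
import numpy as np
import pandas as pd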
ssqdatav3['province2']=ssqdatav3['fpprovince'].apply(lambda x:x.split(','))
ssqdatav3

province2=ssqdatav3['province2'].to_list()
province2
rs=[len(r) for r in province2]
rs 
a=np.repeat(ssqdatav3['id'],rs)
a 
ssqdataprov=pd.DataFrame(np.column_stack((a,np.concatenate(province2))),columns=['ID','PROVINCE'])


# ssqdataprov=ssqdataprov[(ssqdataprov['PROVINCE']!='深')] # equivalent
# ssqdataprov=ssqdataprov[~(ssqdataprov['PROVINCE']=='深')] # equivalent
ssqdataprov=ssqdataprov[~(ssqdataprov['PROVINCE'].str.contains('深'))]

ssqdataprov
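
The repeat-and-concatenate steps above can also be sketched with DataFrame.explode, which turns each element of a list-valued column into its own row. The data below is hypothetical, and explode is a swapped-in alternative rather than the method used in this article. One likely difference: np.column_stack places the IDs and the strings in a single array, so the IDs lose their numeric dtype, whereas explode keeps it.

```python
import pandas as pd

# Hypothetical stand-in for ssqdatav3 with a draw ID and province string.
df = pd.DataFrame({"id": [2022001, 2022002],
                   "fpprovince": ["粤,深,圳", "京"]})

long_df = (
    df.assign(PROVINCE=df["fpprovince"].str.split(","))  # list per row
      .explode("PROVINCE")                               # one row per item
      .rename(columns={"id": "ID"})[["ID", "PROVINCE"]]
      .reset_index(drop=True)
)

# Drop the 深 rows so that Shenzhen is represented by 圳 alone.
long_df = long_df[long_df["PROVINCE"] != "深"].reset_index(drop=True)
print(long_df)
```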

Each province now sits on its own row. The rows containing 深 ("Shen") are deleted, so that only 圳 ("Zhen") remains to represent Shenzhen.


Origin blog.csdn.net/wxfighting/article/details/124212433