Article link: https://blog.csdn.net/Yellow_python/article/details/99084214

Contents

For web scraping
For NLP: replacing punctuation (English → Chinese)
For NLP: quantity expressions
Text segmentation

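For web scraping

pattern	description
[\u4e00-\u9fa5]	Chinese characters
\s	any whitespace character
\S	any non-whitespace character
(?=pattern)	positive lookahead
(?<=pattern)	positive lookbehind
(?!pattern)	negative lookahead
(?<!pattern)	negative lookbehind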
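
The lookaround entries in the table are easiest to grasp from a small experiment. The sketch below is only an illustration: the sample string and the variable names are made up for this example and are not part of the original snippets.

```python
import re

# Hypothetical sample: a price, a weight, and an ID
text = '价格100元，重量5kg，编号A100'

# Positive lookahead: digits that are followed by 元
followed_by_yuan = re.findall(r'\d+(?=元)', text)          # ['100']
# Positive lookbehind: digits that are preceded by A
preceded_by_a = re.findall(r'(?<=A)\d+', text)             # ['100']
# Negative lookahead: digit runs not followed by 元 (or by another digit)
not_followed_by_yuan = re.findall(r'\d+(?![\d元])', text)  # ['5', '100']
# Negative lookbehind: digit runs not preceded by A (or by another digit)
not_preceded_by_a = re.findall(r'(?<![A\d])\d+', text)     # ['100', '5']
# Runs of Chinese characters
chinese_runs = re.findall(r'[\u4e00-\u9fa5]+', text)       # ['价格', '元', '重量', '编号']
```

The extra `\d` inside the negative lookarounds keeps the engine from backtracking to a shorter digit run that happens to satisfy the condition.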

Keep only common characters

import re

# Remove every character that is not in the whitelist
lambda a: re.sub(r'[^a-zA-Z_\d\u4e00-\u9fa5\s，。？！；：、…—￥·《》【】‘’“”（）.?!:|+*/=%~-]', '', a)

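Before storing scraped data in a database

rc = re.compile(r'[^a-zA-Z_\d\u4e00-\u9fa5\s，。？！；：、…—￥·《》【】‘’“”（）.?!:|+*/=%$@#~-]')

def clean(text):
    text = re.sub(r'\s+', ' ', text.strip())  # collapse runs of whitespace
    text = re.sub(r'<[^>]*>', '', text)  # HTML tags
    text = re.sub(r'&[a-zA-Z]{2,6};', '', text)  # HTML entities such as &nbsp;
    text = text.replace('(', '（').replace(')', '）')
    # Note: ASCII " is not in the whitelist above, so rc removes it below
    text = text.replace("'", '"').replace(',', '，')
    return rc.sub('', text)  # drop everything outside the whitelist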
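
The function above deletes HTML entities outright. As an alternative, and purely as a sketch of a different technique, the standard-library `html.unescape` can decode them instead, so that `&amp;` survives as `&`. The helper name `strip_html` and the sample input are hypothetical.

```python
import html
import re

def strip_html(text):
    """Hypothetical helper: drop tags, decode entities, collapse whitespace."""
    text = re.sub(r'<[^>]*>', '', text)       # remove tags first
    text = html.unescape(text)                # &amp; -> &, &nbsp; -> U+00A0
    return re.sub(r'\s+', ' ', text).strip()  # \s also matches U+00A0

cleaned = strip_html('<p>Tom&nbsp;&amp;&nbsp;Jerry</p>\n<br> 2019')
# cleaned == 'Tom & Jerry 2019'
```

Whether decoding or deleting is preferable depends on the downstream whitelist; decoded characters that are not whitelisted would still be removed by a filter such as `rc`.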

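Collapse consecutive whitespace

def clean(text):
    # Collapse whitespace around line breaks into a single newline
    text = re.sub(r'\s*\n\s*', '\n', text.strip())
    # Collapse other runs (space, form feed, CR, tab, full-width space) into one space
    text = re.sub(r'[ \f\r\t\u3000]+', ' ', text)
    return text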
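
A short usage sketch may help show that line breaks are preserved while other whitespace is squeezed. The function is restated under the hypothetical name `normalize_whitespace` so that the block runs on its own; the sample string is invented.

```python
import re

def normalize_whitespace(text):
    """Same two-step idea as above, restated so this block is self-contained."""
    text = re.sub(r'\s*\n\s*', '\n', text.strip())
    return re.sub(r'[ \f\r\t\u3000]+', ' ', text)

raw = '  标题 \n\n  第一行\t\t内容\u3000\u3000结束 \r\n 第二行  '
normalized = normalize_whitespace(raw)
# normalized == '标题\n第一行 内容 结束\n第二行'
```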

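For NLP: replacing punctuation (English → Chinese)

def replace_punctuation(text):
    """Replace English punctuation with Chinese punctuation."""
    text = text.strip().lower()  # lowercase
    text = text.replace(',', '，')  # comma
    text = text.replace(';', '；')  # semicolon
    text = text.replace('(', '（').replace(')', '）')  # parentheses
    text = re.sub(r'[!！]+', '！', text)  # exclamation marks (runs collapsed to one)
    text = re.sub(r'[?？]+', '？', text)  # question marks (runs collapsed to one)
    text = re.sub(r'\.{3,}', '…', text)  # ellipsis
    text = re.sub(r'''['"]+''', '‘', text)  # quotation marks
    text = re.sub(r'(?<!\d)\.', '。', text)  # full stop (a dot after a digit is kept)
    return text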
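
For the one-to-one replacements (comma, semicolon, parentheses, single marks), a translation table is a possible alternative to chained `replace` calls. This is a sketch of a different technique, not the author's method; `PUNCT_TABLE` and `to_chinese_punct` are hypothetical names.

```python
# Hypothetical mapping covering only the single-character replacements
PUNCT_TABLE = str.maketrans({
    ',': '，',
    ';': '；',
    '(': '（',
    ')': '）',
    '!': '！',
    '?': '？',
})

def to_chinese_punct(text):
    """Map each ASCII punctuation mark to its Chinese counterpart in one pass."""
    return text.translate(PUNCT_TABLE)

converted = to_chinese_punct('你好,世界(测试);完成了吗?')
# converted == '你好，世界（测试）；完成了吗？'
```

Context-dependent rules, such as the ellipsis or the full stop that must not follow a digit, still need regular expressions.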

For NLP: quantity expressions

General pattern

re_mq = re.compile(
    '[0-9][0-9.+%/~-]*([a-zA-Z/两个十百千万亿几多'
    '天年月日时分秒毫厘微纳米公里尺寸'
    '亩码克斤吨升度元块行列排名号位只架辆部台轮转匹'
    '种类对包瓶箱盒支根条张片队场届双串头层阵级座'
    '岁顿件班次集页管间步镑杯粒点枚幅瓦棵卷单款℃￥°π]'  # 款
    '|小时|分钟|平方|立方|摄氏度|度电|英[寸尺里]|海里|次数|回合'
    '|[美日欧港]元|美金|块钱|人民币|[港台硬金银]币|[块角分]钱?)+')

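Decimals and percentages

# The separator-plus-digits part is optional and repeatable, so "1.23" and "1.2-1.3%" match in full
re.compile(r'[+-]?[0-9]+(?:[./~-][0-9]+)*[%π]?')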

1.23
-11.24
+12.5

13.5%
-9.5%
+1%

1/2
-1/2
+1/2

9.1-13
11~22

11-13%
11~25%

1.2-1.3%
9.1~11%
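
A quick way to validate a pattern of this kind is to run `fullmatch` over the sample strings listed above. The sketch below uses an assumed variant of the pattern in which the separator-plus-digits part may repeat; the names `re_decimal`, `samples`, `unmatched` and `found` are made up for the example.

```python
import re

# Assumed variant: digits, then zero or more "separator + digits" parts,
# then an optional % or π.
re_decimal = re.compile(r'[+-]?[0-9]+(?:[./~-][0-9]+)*[%π]?')

samples = ['1.23', '-11.24', '+12.5', '13.5%', '-9.5%', '+1%',
           '1/2', '-1/2', '+1/2', '9.1-13', '11~22',
           '11-13%', '11~25%', '1.2-1.3%', '9.1~11%']

# Samples that the pattern fails to cover completely
unmatched = [s for s in samples if not re_decimal.fullmatch(s)]  # []

# Extraction from running text
found = re_decimal.findall('增长13.5%，区间9.1~11%')  # ['13.5%', '9.1~11%']
```

Putting a bare `[0-9]+` first in an alternation would make a search stop at `1` in `1.23`, which is why the optional part is attached after the integer digits here.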

Numbers (with magnitude words)

re.compile(r'(\d+[.]\d+|\d+)[多百千万]+([0-9][多百千万]+)?')

2千多万
23万多
13.4万
4百万
4万3千多
3千2百多万
2百万多
100万
100千万
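
Because this pattern contains capturing groups, `re.findall` returns the groups rather than the whole match, which is easy to trip over. The sketch below shows the difference using `finditer`; the variable names and the sample sentence are hypothetical.

```python
import re

re_magnitude = re.compile(r'(\d+[.]\d+|\d+)[多百千万]+([0-9][多百千万]+)?')

samples = ['2千多万', '23万多', '13.4万', '4百万', '4万3千多',
           '3千2百多万', '2百万多', '100万', '100千万']
all_match = all(re_magnitude.fullmatch(s) for s in samples)  # True

# findall yields one tuple of groups per match, not the matched text
groups_only = re_magnitude.findall('4万3千多')  # [('4', '3千多')]

# finditer gives match objects, so the full text is available via group()
sentence = '全市人口约3千2百多万人，去年新增23万多'
full_matches = [m.group() for m in re_magnitude.finditer(sentence)]
# full_matches == ['3千2百多万', '23万多']
```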

Numerals (with approximation words)

re.compile('(大[概约致]|接近|近似)?[0-9][0-9.十百千万亿π]*(左右)?')

Time

# Multi-character units come first so that "分钟" is not cut short to "分"
re.compile('[0-9]+(小时|分钟|毫秒|时辰|[天年月日时分秒°])+[0-9天年月日时分秒]*')
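
Regex alternation takes the first alternative that matches, so the order of single-character and multi-character units matters. The sketch below compares the two orderings on simplified patterns and then runs a full time pattern over a sentence; all names and sample strings are invented for illustration.

```python
import re

# Single-character class listed first: "分" wins before "分钟" is tried
single_first = re.compile(r'[0-9]+(?:[天年月日时分秒°]|小时|分钟|毫秒|时辰)+')
# Multi-character units listed first
multi_first = re.compile(r'[0-9]+(?:小时|分钟|毫秒|时辰|[天年月日时分秒°])+')

short = single_first.search('耗时45分钟').group()  # '45分'
whole = multi_first.search('耗时45分钟').group()   # '45分钟'

# Full pattern with the trailing part, multi-character units first
re_time = re.compile(
    r'[0-9]+(?:小时|分钟|毫秒|时辰|[天年月日时分秒°])+[0-9天年月日时分秒]*')
sentence = '会议于2019年8月10日开始，耗时45分钟，误差5毫秒'
times = [m.group() for m in re_time.finditer(sentence)]
# times == ['2019年8月10日', '45分钟', '5毫秒']
```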

Currency

re.compile(
    '[0-9][0-9.十百千万亿k]*'
    '([美日欧港]?元|人民币|rmb|[块角分]钱?|￥|[港台硬金银]币|折扣?)',
    re.I)

Height (and weight)

# The trailing [高重]? lets samples such as "60重" match in full
re.compile('(身?高|体?重)?([0-9]+米[0-9]*|[0-9]+[.]?[0-9]+)(厘?米|c?m|公分)?[高重]?', re.I)

1米7
1.7米
1.7m
170cm
169厘米
168公分
身高170
体重52.2
高180
60重

Length

re.compile(
    '[0-9][0-9.十百千万亿]*'
    '([mcdkn]?m|[毫厘分千微纳]?米|[公英]?[里尺寸]|光年|[丈码仞咫跬步]|海里|公分)',
    re.I)

Area

re.compile(
    '[0-9][0-9.十百千万亿]*'
    '(平方([千分厘毫]?米|英[里尺寸]|公里)|英?亩|公顷)',
    re.I)

Volume

re.compile(
    '[0-9][0-9.十百千万亿]*'
    '(立方[分厘毫]?米|毫?升|m?l)',
    re.I)

Weight

re.compile(
    '[0-9][0-9.十百千万亿]*'
    '([千毫]?克|公?斤|[km]?g|[吨两磅]|盎司|克拉)',
    re.I)

Light, electricity, heat

re.compile(
    '[0-9][0-9.十百千万亿k]*'
    '(千瓦时?|度电?|焦耳?|瓦特?|伏特?|kW|库仑|欧姆|安培'
    '|(摄氏)?度|℃|开尔文|摩尔|赫兹|Hz)')  # leading | keeps 安培 and (摄氏)?度 as separate alternatives

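Mechanics

re.compile(
    '(时速|油耗)?[0-9][0-9.十百千万亿k]*'
    '(牛[·*]米|牛[顿米]?|[nN]|帕斯卡|米每秒|摩尔?)')
# 牛[·*]米 comes first so that a bare 牛 does not match before it.
# 牛米 (newton-meter): the unit of torque is the product of a force unit and a distance unit, i.e. newton × meter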
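
One way to use category patterns like these together is to merge them into a single alternation with a named group per category and read `lastgroup` from each match. The sketch below is a simplified illustration under that assumption: the unit lists are shortened, and `UNIT_PATTERNS`, `NUMBER`, `re_quantity` and `tag_quantities` are hypothetical names.

```python
import re

# Hypothetical registry: category name -> shortened unit pattern
UNIT_PATTERNS = {
    'money': r'[美日欧港]?元|人民币|[块角分]钱?',
    'length': r'[毫厘分千]?米|公里',
    'weight': r'[千毫]?克|公?斤|吨',
}
NUMBER = r'[0-9][0-9.十百千万亿]*'

# One alternation, one named group per category
re_quantity = re.compile(
    NUMBER + '(?:' + '|'.join(
        f'(?P<{name}>{units})' for name, units in UNIT_PATTERNS.items()
    ) + ')')

def tag_quantities(text):
    """Return (matched text, category) pairs in order of appearance."""
    return [(m.group(), m.lastgroup) for m in re_quantity.finditer(text)]

tagged = tag_quantities('这袋米重5公斤，售价30元，送货距离2.5公里')
# tagged == [('5公斤', 'weight'), ('30元', 'money'), ('2.5公里', 'length')]
```

Categories are tried in dictionary order, so a unit character shared by two categories (for example 分) goes to whichever category is listed first.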

Text segmentation

sep10 = lambda text: re.split(r'[\n。…；;]|(?<!\d)\.', text)  # level-1 split (sentences)
sep20 = lambda sentence: re.split(r'[,，!！?？:：]', sentence)  # level-2 split (clauses)
sep15 = lambda text: re.split(r'(?<=[^!！]{7})[!！](?=[^!！]{7})', text)  # ! with 7 non-! characters on each side
sep16 = lambda text: re.split(r'(?<!\d)\d\.(?!\d)', text)  # list markers such as 【1.】【2.】
sep30 = lambda clause: re.split(r'[^a-zA-Z\u4e00-\u9fa5]+', clause)  # level-3 split (letter and Chinese runs)
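
The three split levels can be chained into a nested structure of sentences, clauses and fragments. The sketch below is one possible arrangement, with empty pieces dropped; the compiled pattern names and `split_text` are hypothetical, and the patterns restate the level-1, level-2 and level-3 separators so the block runs on its own.

```python
import re

RE_LEVEL1 = re.compile(r'[\n。…；;]|(?<!\d)\.')     # sentence separators
RE_LEVEL2 = re.compile(r'[,，!！?？:：]')            # clause separators
RE_LEVEL3 = re.compile(r'[^a-zA-Z\u4e00-\u9fa5]+')  # anything but letters/Chinese

def split_text(text):
    """Return a list of sentences, each a list of clauses, each a list of fragments."""
    result = []
    for sentence in RE_LEVEL1.split(text):
        clauses = []
        for clause in RE_LEVEL2.split(sentence):
            fragments = [f for f in RE_LEVEL3.split(clause) if f]
            if fragments:
                clauses.append(fragments)
        if clauses:
            result.append(clauses)
    return result

tree = split_text('今天天气好，我们去公园。价格是3.5元！OK')
# tree == [[['今天天气好'], ['我们去公园']], [['价格是', '元'], ['OK']]]
```

The dot in `3.5` does not end the sentence because of the `(?<!\d)` lookbehind, but the level-3 split still discards the number itself.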
