Day 19: common modules IV (re, typing)

re module

The re module is used to find specific patterns in a string (text).

1. Metacharacters: characters that have a special meaning

  • ^ Matches at the beginning of the string
import re
a = re.findall('^abc', 'abcsds')
b = re.findall('^abc', 'aabcsds')  # does not start with abc, so an empty list is returned
print(a,b)
['abc'] []
  • $ Matches at the end of the string
a = re.findall('abc$', 'sdfabcsdsabc')
b = re.findall('abc$', 'aabcsdsbc')  # does not end with abc, so an empty list is returned
print(a,b)
['abc'] []
  • | Alternation: matches the pattern on either side, equivalent to "or"
a = re.findall('a|bc', 'sdfabbcsdsabc')  # the matches are returned as a list
print(a)
['a', 'bc', 'a', 'bc']
  • [] Character set: matches any single character listed inside the brackets
a = re.findall('[bac]', 'sdfabcsdsabc')
print(a)
['a', 'b', 'c', 'a', 'b', 'c']
  • [^] Negated character set: matches any character not listed inside the brackets; a ^ written at the start of a character set means negation
a = re.findall('[^bac]', 'sdfabcsdsabc')
print(a)
['s', 'd', 'f', 's', 'd', 's']
  • () Group: after a match is found, findall returns only the part inside the parentheses
a = re.findall('a(bc)s', 'sdfabcsdsabc')
print(a)
['bc']
  • . Matches any single character (except a newline)
a = re.findall('b.', 'sdb,sdb sdkjfbasd sdb')  # matches any character, including spaces and other symbols
print(a)
['b,', 'b ', 'ba']
  • {n} Matches the character immediately before the braces exactly n times
a = re.findall('ab{3}','abbbbsfsabbs dfbbb')
print(a)
['abbb']
  • * Matches the preceding character zero or more times
a = re.findall('sa*','fsa dsaasdf')
print(a)
['sa', 'saa', 's']
  • + Matches the preceding character one or more times
a = re.findall('a+','fsa dsaasdf')  # at least one a must be matched
print(a)
['a', 'aa']
  • ? Matches the preceding character zero or one time
a = re.findall('sa?','fsa dsaasdf')  # matches zero or one a
print(a)
['sa', 'sa', 's']
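
The metacharacters above can be combined in a single pattern. The following is a minimal sketch, not part of the original notes: the helper name is_lower_word and the sample strings are made up for illustration, and [a-z] uses a character range inside the set.

```python
import re

def is_lower_word(text):
    # Hypothetical helper for illustration.
    # ^ and $ anchor the pattern to the whole string,
    # [a-z] is a character set, and + requires one or more characters.
    return bool(re.findall('^[a-z]+$', text))

print(is_lower_word('hello'))   # True
print(is_lower_word('hello1'))  # False: the digit is not in the set
```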

2. Predefined character classes: a backslash followed by an ordinary character takes on a special function

  • \d Matches digits (0-9)
a = re.findall(r'\d', 'sda123jf 342 4sdf4')
print(a)
['1', '2', '3', '3', '4', '2', '4', '4']
  • \D Matches non-digit characters
a = re.findall(r'\D', 'sda123jf 342 4sdf4')
print(a)
['s', 'd', 'a', 'j', 'f', ' ', ' ', 's', 'd', 'f']
  • \s Matches whitespace characters (space, tab, newline)
a = re.findall(r'\s', 'sda123jf 342 4sd,f4')
print(a)
[' ', ' ']
  • \S Matches non-whitespace characters
a = re.findall(r'\S', 'sda123jf 342 4sd,f4')
print(a)
['s', 'd', 'a', '1', '2', '3', 'j', 'f', '3', '4', '2', '4', 's', 'd', ',', 'f', '4']
  • \w Matches letters, digits, and underscores
a = re.findall(r'\w', 'sd_f 34?2 4sd,f4')
print(a)
['s', 'd', '_', 'f', '3', '4', '2', '4', 's', 'd', 'f', '4']
  • \W Matches characters that are not letters, digits, or underscores
a = re.findall(r'\W', 'sd_f 34?2 4sd,f4')
print(a)
[' ', '?', ' ', ',']
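
Patterns that contain backslashes are usually written as raw strings (the r prefix), so that Python does not try to interpret the backslash before the re module sees it. Below is a small sketch that combines the predefined classes with quantifiers; the sample text is made up for illustration.

```python
import re

# Made-up sample text, not from the original notes.
text = 'order_7 shipped 2019-06-13, qty 42'

# Four digits, a dash, two digits, a dash, two digits.
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)

# \w+ grabs runs of letters, digits, and underscores;
# \s+ grabs runs of whitespace.
words = re.findall(r'\w+', text)
gaps = re.findall(r'\s+', text)

print(dates)      # ['2019-06-13']
print(words)      # ['order_7', 'shipped', '2019', '06', '13', 'qty', '42']
print(len(gaps))  # 4
```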

3. Greedy matching: keeps matching as far as possible until the pattern is no longer satisfied

a = re.findall('a.*', 'asda123456asa')
print(a)
['asda123456asa']

4. Non-greedy matching: stops as soon as a match is found; the ? after the quantifier acts as a stop marker

a = re.findall('a.*?', 'asda123456asa')
print(a)
['a', 'a', 'a', 'a']
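
The difference between the two modes is easier to see when the pattern has a closing delimiter. This is a short sketch with a made-up HTML fragment:

```python
import re

# Made-up fragment for illustration.
html = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the last </b> in the string.
greedy = re.findall('<b>(.*)</b>', html)

# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.findall('<b>(.*?)</b>', html)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```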

Common functions

  • re.compile Compiles a pattern into a reusable object, the equivalent of writing a general rule template
phone_compile = re.compile(r'1\d{10}')

email_compile = re.compile(r'\w+@\w+\.\w+')  # \. matches a literal dot

test_s = '12345678900  [email protected]  [email protected]'
res = phone_compile.findall(test_s)
print(res)

res = email_compile.findall(test_s)
print(res)
['12345678900']
['[email protected]', '[email protected]']
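
As a self-contained illustration of compiled patterns, here is a sketch that uses placeholder data (the phone number and the example.com / example.org addresses are assumptions, not values from the original post). It also shows why the dot in the email pattern should be escaped.

```python
import re

# Placeholder sample data for illustration only.
sample = 'call 13800000000 or write to alice@example.com and bob@example.org'

phone_pattern = re.compile(r'1\d{10}')
email_pattern = re.compile(r'\w+@\w+\.\w+')

phones = phone_pattern.findall(sample)
emails = email_pattern.findall(sample)
print(phones)  # ['13800000000']
print(emails)  # ['alice@example.com', 'bob@example.org']

# An unescaped . matches any character, so it accepts text without a dot.
loose = re.findall(r'\w+@\w+.\w+', 'a@b_c')
strict = re.findall(r'\w+@\w+\.\w+', 'a@b_c')
print(loose)   # ['a@b_c']
print(strict)  # []
```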
  • re.match Matches only at the beginning of the string; returns a match object, or None if the beginning does not match
a = re.match(r'\d','sdf123sdd456')
b = re.match(r'\d','123sdfa 212d')
print(a)
print(b)
None
<_sre.SRE_Match object; span=(0, 1), match='1'>
  • re.search Scans the whole string and returns a match object for the first match, including its position (span)
a = re.search(r'\d','sdfs1213hfjsf 2323')
print(a)
<_sre.SRE_Match object; span=(4, 5), match='1'>

The difference between match and search: match only looks for a match at the beginning of the string, while search scans the whole string and returns the first match it finds.
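
A match object carries the matched text and its position, which can be read with group() and span(). Here is a brief sketch that reuses the string from the search example; since both functions can return None, the result is checked before use.

```python
import re

text = 'sdfs1213hfjsf 2323'

m = re.match(r'\d+', text)   # the string starts with letters, so this is None
s = re.search(r'\d+', text)  # the first run of digits anywhere in the string

print(m)  # None
if s is not None:
    print(s.group())  # 1213
    print(s.span())   # (4, 8)
```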

  • re.split Splits the string wherever the pattern matches and returns a list of the pieces
s = 'asb sfsl sfjwo212 12312,dsfsf'
print(s.split(' '))

res = re.split(r'\d+',s)
print(res)
['asb', 'sfsl', 'sfjwo212', '12312,dsfsf']
['asb sfsl sfjwo', ' ', ',dsfsf']
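
re.split accepts a few variations that str.split does not offer. The sketch below reuses the same string; the variable names are chosen here for illustration.

```python
import re

s = 'asb sfsl sfjwo212 12312,dsfsf'

# Split on several delimiters at once by using a character set.
by_space_or_comma = re.split(r'[ ,]', s)

# maxsplit limits how many splits are performed.
first_only = re.split(r'\d+', s, maxsplit=1)

# A capturing group keeps the matched separators in the result.
with_separators = re.split(r'(\d+)', s)

print(by_space_or_comma)  # ['asb', 'sfsl', 'sfjwo212', '12312', 'dsfsf']
print(first_only)         # ['asb sfsl sfjwo', ' 12312,dsfsf']
print(with_separators)    # ['asb sfsl sfjwo', '212', ' ', '12312', ',dsfsf']
```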
  • re.sub and re.subn Both replace the matched content, similar to the built-in str.replace method; subn also returns how many replacements were made
import re

s = 'asfhf12fdgds 743wiuw22'

print(re.sub(r'\d',',',s))

print(re.subn(r'\d',',',s))  # besides replacing the content, also returns how many replacements were made
asfhf,,fdgds ,,,wiuw,,
('asfhf,,fdgds ,,,wiuw,,', 7)
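
Two common variations, sketched on the same string: the count argument caps the number of replacements, and a + quantifier replaces a whole run of digits at once. The variable names are illustrative.

```python
import re

s = 'asfhf12fdgds 743wiuw22'

# Replace only the first two digits.
first_two = re.sub(r'\d', ',', s, count=2)

# Replace each run of digits with a single marker and count the runs.
collapsed, n = re.subn(r'\d+', '#', s)

print(first_two)     # asfhf,,fdgds 743wiuw22
print(collapsed, n)  # asfhf#fdgds #wiuw# 3
```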

typing module

1. Type checking: helps prevent parameter and return-value type mismatches from showing up at run time.

2. Annotated parameter and return types serve as developer documentation, which makes the function easier for callers to use.

3. Adding the annotations does not affect how the program runs: no actual error is raised, only a reminder.

  • Note: the typing module is only available in Python 3.5 and later; PyCharm currently supports typing inspections
from typing import List, Tuple, Dict
def add(a: int, string: str, f: float,
        b: bool) -> Tuple[List, Tuple, Dict, bool]:
    list1 = list(range(a))
    tup = (string, string, string)
    d = {"a": f}
    bl = b
    return list1, tup, d, bl
print(add(5, "hhhh", 2.3, False))
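
To illustrate point 3, the following sketch shows that annotations are not enforced when the code runs. The function repeat is a made-up example, not part of the original notes.

```python
from typing import List

def repeat(word: str, times: int) -> List[str]:
    # Hypothetical example function: build a list with `times` copies of `word`.
    return [word] * times

print(repeat('ab', 2))  # ['ab', 'ab']

# This call passes an int where a str is declared, yet it still runs;
# only a type checker or IDE inspection would flag it.
print(repeat(3, 2))     # [3, 3]

# The annotations are stored on the function and can be inspected.
print(repeat.__annotations__['times'])  # <class 'int'>
```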

Crawling Audio

import re
import requests

response = requests.get('http://www.gov.cn/premier/index.htm')
data = response.text

res = re.findall(r'href="(/\w+/\w+_yp\.htm)"', data)  # () returns only the part inside the parentheses
yp_res = 'http://www.gov.cn' + res[0]

yp_response = requests.get(yp_res)
yp_data = yp_response.text

res = re.findall('<a href="(.*?)"', yp_data)
count = 0
for url in res:
    if url == 'javascript:;':
        continue
    mp3_url = 'http://www.gov.cn' + url

    mp3_response = requests.get(mp3_url)
    mp3_response.encoding = 'utf8'  # decode the response text as utf8
    mp3_data = mp3_response.text
    # print(mp3_data)

    res = re.findall('<title>(.*?)</title>|data-src="(.*?)"',mp3_data)
    title = res[0][0]
    mp3_url = res[1][1]
    if res[1][1].startswith('/home'):
        continue

    res_response = requests.get(mp3_url)
    mp3_data = res_response.content  # MP3的二进制形式

    with open(f'{title}.mp3','wb') as fw:
        fw.write(mp3_data)
        fw.flush()
    count += 1
    print(f'{count}')
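
The indexing res[0][0] and res[1][1] in the crawler relies on how findall behaves when a pattern has two groups joined by |: each match becomes a tuple, and the group that did not take part is an empty string. Here is an offline sketch with a made-up page fragment (the title and the example.com URL are placeholders):

```python
import re

# Made-up page fragment; no network access is needed.
page = '<title>Sample title</title><div data-src="http://example.com/a.mp3"></div>'

res = re.findall(r'<title>(.*?)</title>|data-src="(.*?)"', page)
print(res)  # [('Sample title', ''), ('', 'http://example.com/a.mp3')]

title = res[0][0]    # first match, first group
mp3_url = res[1][1]  # second match, second group
print(title, mp3_url)
```

If a page is missing either element, res will be shorter and the indexing will raise IndexError, so checking len(res) before indexing is a reasonable safeguard.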

Origin www.cnblogs.com/863652104kai/p/11019250.html