[Python crawler] Section 5 (Bilibili danmaku)

First of all, many thanks to the expert who wrote this article: https://www.cnblogs.com/LexMoon/p/pyspider03.html#4361286

import requests
import re
av_id = '67946325'
headers = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Accept': 'text/html',
    'Cookie': "_uuid=1DBA4F96-2E63-8488-DC25-B8623EFF40E773841infoc; buvid3=FE0D3174-E871-4A3E-877C-A4ED86E20523155831infoc; LIVE_BUVID=AUTO8515670521735348; sid=l765gx48; DedeUserID=33717177; DedeUserID__ckMd5=be4de02fd64f0e56; SESSDATA=cf65a5e0%2C1569644183%2Cc4de7381; bili_jct=1e8cdbb5755b4ecd0346761a121650f5; CURRENT_FNVAL=16; stardustvideo=1; rpdid=|(umY))|ukl~0J'ulY~uJm)kJ; UM_distinctid=16ce0e51cf0abc-02da63c2df0b4b-5373e62-1fa400-16ce0e51cf18d8; stardustpgcv=0606; im_notify_type_33717177=0; finger=b3372c5f; CURRENT_QUALITY=112; bp_t_offset_33717177=300203628285382610"

}
# Fetch the video page; its HTML contains the cid needed by the danmaku API
resp = requests.get('https://www.bilibili.com/video/av' + av_id, headers=headers)

match_rule = r'cid=(.*?)&aid'
oid = re.search(match_rule,resp.text).group().replace('cid=','').replace('&aid','')
print('oid = ' + oid)

xml_url = 'https://api.bilibili.com/x/v1/dm/list.so?oid='+oid 

resp = requests.get(xml_url, headers=headers)
# requests may report ISO-8859-1 when the response headers declare no charset,
# so look for the real encoding before decoding the body
if resp.encoding == 'ISO-8859-1':
    encodings = requests.utils.get_encodings_from_content(resp.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = resp.apparent_encoding
    encode_content = resp.content.decode(encoding, 'replace')
else:
    # The reported encoding is already usable
    encode_content = resp.text
print(encode_content)
# Which headers does the crawler need to send so that the request does not return 404? I tried writing out all seven, and that turned out not to be right.
# I have almost forgotten how regular expressions work ...
# The garbled-text (encoding) problem is finally solved
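
The two parsing steps in the script (pulling the cid out of the page and reading the returned comment list) can be tried offline. The following is only a minimal sketch: the sample strings are made up, the helper names `extract_cid` and `parse_danmaku` are hypothetical, and it assumes that the API returns XML in which each comment is a `<d>` element whose `p` attribute starts with the playback time in seconds.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample inputs, made up for illustration (not real Bilibili responses)
SAMPLE_HTML = '<script>var url = "player.html?cid=12345&aid=67946325";</script>'
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<i>'
    '<d p="1.5,1,25,16777215,1569000000,0,abcd1234,1">first comment</d>'
    '<d p="12.0,1,25,16777215,1569000001,0,efgh5678,2">second comment</d>'
    '</i>'
)


def extract_cid(html):
    """Return the text between 'cid=' and '&aid', or None if the pattern is absent."""
    match = re.search(r'cid=(.*?)&aid', html)
    # group(1) is the captured part only, so no replace() calls are needed
    return match.group(1) if match else None


def parse_danmaku(xml_text):
    """Return (seconds, text) pairs, assuming comments are <d> elements
    whose p attribute begins with the playback time."""
    root = ET.fromstring(xml_text)
    comments = []
    for d in root.iter('d'):
        fields = d.get('p', '').split(',')
        seconds = float(fields[0]) if fields[0] else 0.0
        comments.append((seconds, d.text or ''))
    return comments


if __name__ == '__main__':
    print(extract_cid(SAMPLE_HTML))
    for seconds, text in parse_danmaku(SAMPLE_XML):
        print(seconds, text)
```

In the script above, `extract_cid` would be called on the video page's HTML and `parse_danmaku` on `encode_content`. Returning `None` when nothing matches avoids the `AttributeError` that `re.search(...).group()` raises when the page layout changes.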

 


Origin www.cnblogs.com/break03/p/11575327.html