python3: crawled content containing Chinese is garbled when printed

Requirement: the user enters the name of a movie they like, and the program crawls Movie Paradise (https://www.ygdy8.com) for that movie's download links and prints them out.

Problem: the magnet links that come back contain Chinese characters, and they are garbled when printed.

 

Workaround: Manually specify the encoding:

if res.encoding == 'ISO-8859-1':
    encodings = requests.utils.get_encodings_from_content(res.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = res.apparent_encoding
else:
    encoding = res.encoding
encode_content = res.content.decode(encoding, 'replace').encode('utf-8', 'replace')
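
To see why the check on 'ISO-8859-1' matters, here is a minimal standard-library sketch of the same fallback order. It does not call requests; `pick_encoding` and its `detected` parameter (standing in for `res.apparent_encoding`) are hypothetical names used only for illustration, and the regular expression is a simplified stand-in for what `get_encodings_from_content` looks for.

```python
import re

def pick_encoding(header_encoding, body, detected="utf-8"):
    """Hypothetical helper: trust the header unless it is the ISO-8859-1 fallback."""
    if header_encoding and header_encoding.upper() != "ISO-8859-1":
        return header_encoding
    # Simplified search for a charset declared in the HTML itself (assumption:
    # a <meta charset=...> or <meta ... content="...; charset=..."> tag).
    match = re.search(rb'charset=["\']?([\w-]+)', body, flags=re.I)
    if match:
        return match.group(1).decode("ascii")
    # Nothing declared anywhere: fall back to a detected/guessed encoding.
    return detected

# A GBK-encoded page, as a Chinese site might serve it without a charset header.
html = '<html><head><meta charset="gb2312"></head><body>中文</body></html>'.encode("gbk")

wrong = html.decode("ISO-8859-1")                            # garbled
right = html.decode(pick_encoding("ISO-8859-1", html), "replace")  # readable
print(wrong)
print(right)
```

Decoding the GBK bytes as ISO-8859-1 never fails, because every byte maps to some Latin-1 character; that is why the result is silently garbled instead of raising an error.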

 

# Goal: the user enters the name of a movie they like, and the program crawls Movie Paradise (https://www.ygdy8.com) for that movie's download links and prints them out

import requests
from bs4 import BeautifulSoup
from urllib.request import pathname2url

# To get past anti-crawling checks, send a request header that looks like a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 OPR/65.0.3467.78 (Baidu Edition)'}

# Get the movie's magnet link
def getMovieDownloadLink(filmlink):
    res = requests.get(filmlink, headers=headers)
    if res.status_code == 200:

        # Handling garbled Chinese in the response:
        # 'ISO-8859-1' is what requests falls back to when the response header declares no charset,
        # so in that case look for the encoding declared in the HTML itself;
        # if none is declared either, use the encoding detected from the content
        if res.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(res.text)
            if encodings:
                encoding = encodings[0]
            else:
                encoding = res.apparent_encoding
        else:
            encoding = res.encoding
        encode_content = res.content.decode(encoding, 'replace').encode('utf-8', 'replace')

        soup = BeautifulSoup(encode_content, 'html.parser')
        Zoom = soup.select_one('#Zoom')
        fileurl = Zoom.find('table').find('a').text
        with open('./17-电影天堂磁力.txt', 'a', newline='') as file:
            file.write(fileurl + '\n')

    else:
        print('Movie link: {} request failed!'.format(filmlink))

def main():
    dyurl = 'https://www.ygdy8.com'
    # movie = input('Please enter the name of the movie: ')
    movie = 'Maleficent'
    # The search page expects the keyword as GBK bytes, percent-encoded
    movie = movie.encode('gbk')
    url = 'http://s.ygdy8.com/plus/s0.php?typeid=1&keyword={0}'.format(pathname2url(movie))
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        htmltext = res.text
        soup = BeautifulSoup(htmltext, 'html.parser')
        co_content8 = soup.find('div', class_='co_content8')
        tables = co_content8.find('ul').find_all('table')
        if len(tables) <= 0:
            print('No resources found; you can search the site directly at {0}'.format(dyurl))
        else:
            for table in tables:
                filmlink = dyurl + table.find('a')['href']
                getMovieDownloadLink(filmlink)

    else:
        print('Request failed!')

main()
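
The search step in main() encodes the title to GBK bytes and percent-encodes them with pathname2url. As a hedged alternative sketch, `urllib.parse.quote` can do both in one call through its `encoding` argument. `build_search_url` and `SEARCH_URL` are hypothetical names; the URL pattern is the one used in main(), and the assumption that the site wants GBK comes only from the `encode('gbk')` call in the listing.

```python
from urllib.parse import quote, unquote

# Search URL pattern taken from main(); the keyword is filled in below.
SEARCH_URL = "http://s.ygdy8.com/plus/s0.php?typeid=1&keyword={0}"

def build_search_url(title):
    """Hypothetical helper: percent-encode the title as GBK bytes."""
    # quote() encodes the str with the given codec, then escapes each byte.
    return SEARCH_URL.format(quote(title, encoding="gbk"))

print(build_search_url("中文"))        # non-ASCII characters become %XX pairs
print(build_search_url("Maleficent"))  # plain ASCII passes through unchanged

# Decoding with the same codec recovers the original title.
print(unquote("%D6%D0%CE%C4", encoding="gbk"))
```

quote() is meant for URL components, whereas pathname2url is meant for file-system paths, so quote() is arguably the closer fit for a query-string keyword.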

result:

reference:

https://blog.csdn.net/guoxinian/article/details/82978067

http://blog.csdn.net/a491057947/article/details/47292923

http://docs.python-requests.org/en/latest/user/quickstart/#response-content

 


Origin www.cnblogs.com/KeenLeung/p/12160712.html