Xiaojie's learning notes: crawling Maoyan top100 ranking information with requests + re (regex)

This blog describes how to crawl the rankings, URLs, and titles of the Maoyan top100 movies, using the most basic crawling technique: requests plus re (regular expression) extraction.

Sometimes when we want to watch a movie, we don't know which one is good. Usually we open a movie ranking site and click through its pages one by one. With this crawler, you can get Maoyan's movie rankings and their URLs directly. Wouldn't that be fun?

Let's first open the Maoyan top100 page: https://maoyan.com/board/4?

Then click to the second page: https://maoyan.com/board/4?offset=10

Click on the third page: https://maoyan.com/board/4?offset=20

Then we find that changing the offset value at the end of the URL is all it takes to turn pages. Since indices in programming start from 0, we can write a loop over i * 10 to page through.
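For example, the ten page URLs can be generated like this:

# Each page lists 10 movies, so offsets 0, 10, ..., 90 cover the top 100.
urls = ['https://maoyan.com/board/4?offset=' + str(i * 10) for i in range(10)]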

Code:

import requests
from requests.exceptions import RequestException
from my_fake_useragent import UserAgent  # pip install my-fake-useragent


def get_one_page(url):
    # Use a random User-Agent so the requests look less like a bot.
    headers = {
        'User-Agent': UserAgent().random()
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("ok!")  # for now, just confirm the page can be requested
        return None
    except RequestException:
        return None


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    get_one_page(url)


if __name__ == '__main__':
    for i in range(10):
        main(i * 10)

Output:

C:\Users\User\AppData\Local\Programs\Python\Python37\python.exe G:/Python/code/requeats/try.py
ok!
ok!
ok!
ok!
ok!
ok!
ok!
ok!
ok!
ok!

It seems all ten URLs can be requested, so we can proceed to the next step.

When writing regexes, I like to work from the page source, because the source is the real HTML returned by the request. The neatly formatted markup you see in the F12 developer tools has been post-processed by the browser's rendering, so the two can differ slightly.

We right-click to view the source, then Ctrl+F to find the top-ranked "Farewell My Concubine" (霸王别姬). We can see this snippet:

 <i class="board-index board-index-1">1</i>
    <a href="/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">

Then we open the page for Farewell My Concubine itself; its URL is https://maoyan.com/films/1203, and we find that the trailing part, /films/1203, also appears in the snippet above.

So we can construct a regex, changing each part we want to extract into a (.*?) capture group:

>(.*?)</i>\s*<a href="(.*?)" title="(.*?)" class="image-link

Because the HTML breaks onto a new line between the two tags, we add \s* to absorb the newline and indentation.
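To sanity-check the pattern before wiring it into the crawler, we can run it against the snippet above (a quick sketch; the pattern is the same one used in the full code below):

import re

# The snippet copied from the page source above.
html = '''<i class="board-index board-index-1">1</i>
    <a href="/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">'''

pattern = re.compile(r'>(.*?)</i>\s*<a href="(.*?)" title="(.*?)" class="image-link')
print(pattern.findall(html))
# [('1', '/films/1203', '霸王别姬')]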

PS: I won't cover the specifics of regular expressions here.

Finally, write the results to a TXT file.

import requests, re, json
from requests.exceptions import RequestException
from my_fake_useragent import UserAgent  # pip install my-fake-useragent


def get_one_page(url):
    # Use a random User-Agent so the requests look less like a bot.
    headers = {
        'User-Agent': UserAgent().random()
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    # The captured groups are the rank, the film page path, and the title.
    pattern = re.compile(r'>(.*?)</i>\s*<a href="(.*?)" title="(.*?)" class="image-link')
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'url': 'https://maoyan.com' + item[1],  # the href is the film page path
            'title': item[2]
        }


def write_to_file(content):
    # One JSON object per line; ensure_ascii=False keeps Chinese titles readable.
    with open('maoyan.txt', 'a', encoding='UTF-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed, skip this page
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(i * 10)
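Since each line of maoyan.txt holds one standalone JSON object, reading the results back in is simple (a quick sketch):

import json

with open('maoyan.txt', encoding='UTF-8') as f:
    movies = [json.loads(line) for line in f]
print(movies[0])  # e.g. {'index': '1', 'url': 'https://maoyan.com/films/1203', 'title': '霸王别姬'}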

 

In general, my rule of thumb is: use XPath when I can, then re, then bs4. bs4 is written much like XPath, locating elements layer by layer. But for a typical crawler, regex need not be complicated: find the spot to scrape in the page source, copy it, and change the parts you want to extract into (.*?). That fully meets the needs of an ordinary crawler, and it is fast too. A sketch of the XPath version follows below.
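For comparison, here is a sketch of the same extraction done with lxml's XPath; parse_with_xpath is illustrative rather than part of the original script, and it assumes the class names shown in the snippet above:

from lxml import etree  # pip install lxml


def parse_with_xpath(html):
    # Same three fields via XPath; assumes each rank <i class="board-index...">
    # is paired with a film link <a class="image-link">, as in the snippet above.
    tree = etree.HTML(html)
    indexes = tree.xpath('//i[contains(@class, "board-index")]/text()')
    hrefs = tree.xpath('//a[@class="image-link"]/@href')
    titles = tree.xpath('//a[@class="image-link"]/@title')
    for index, href, title in zip(indexes, hrefs, titles):
        yield {'index': index, 'url': 'https://maoyan.com' + href, 'title': title}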

Above I only wrote to a TXT file; if needed, this can be changed to CSV format or written to a database.
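For example, here is a minimal CSV variant using only the standard library; write_to_csv is a hypothetical helper, not part of the original script:

import csv


def write_to_csv(items, path='maoyan.csv'):
    # Same three fields as the TXT version, one row per movie.
    with open(path, 'w', newline='', encoding='UTF-8') as f:
        writer = csv.DictWriter(f, fieldnames=['index', 'url', 'title'])
        writer.writeheader()
        writer.writerows(items)

It could be called with the generator from parse_one_page directly, e.g. write_to_csv(parse_one_page(html)).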

Here I only extracted the ranking, the URL, and the movie title. If you want other content, add it yourself.
