[Crawler project] Crawling Top100 movie list data with Python and saving it to a CSV file (with source code)

Foreword

What I will introduce today is how to crawl the Top100 movie list data with Python and save it to a CSV file. I will share the code with anyone who needs it, along with a few tips.

First of all, before crawling you should disguise the request as a normal browser as much as possible so that it is not recognized as a crawler. The basic step is to add request headers, but since many people crawl this kind of plain-text data, you should also consider rotating proxy IPs and randomly switching request headers while crawling the Top100 movie list, as sketched below.
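For example, here is a minimal sketch of randomly switching the User-Agent (and, if needed, routing through a proxy) between requests. The User-Agent strings and the commented-out proxy address are illustrative placeholders, not values from this article:

import random
import requests

# Candidate User-Agent strings to rotate through (placeholders; use your own list)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
# proxies = {'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'}  # hypothetical proxy entry
response = requests.get('https://maoyan.com/board/4', headers=headers, timeout=10)
print(response.status_code)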

Every time, before writing any crawler code, the first and most important step is to analyze the target web page.

Through analysis we also found that crawling is fairly slow, so we can further improve the crawler's speed by disabling image loading, JavaScript, and so on in Chrome, as sketched below.
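If you render pages with Chrome via Selenium (not required for this article, which uses plain requests), a minimal sketch of blocking images and JavaScript looks like this:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Chrome content settings: a value of 2 means "block"
prefs = {
    'profile.managed_default_content_settings.images': 2,
    'profile.managed_default_content_settings.javascript': 2,
}
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(options=options)
driver.get('https://maoyan.com/board/4')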

![Movie Top100 list](https://img-blog.csdnimg.cn/710df550f5fb4156bd711f82cf122245.png)

Development tools

Python version: 3.6

Related modules:

requests module

time module

parsel module

csv module

Environment setup

Install Python and add it to the PATH environment variable, then use pip to install the required modules (for example, pip install requests parsel).

The complete code and files for this article can be obtained by leaving a comment or message.

Idea analysis

Open the page we want to crawl in the browser and press F12 to open the developer tools, then look for where the Top100 movie list data lives. In this case the data we need is right in the page's HTML.

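As a quick sanity check (a minimal sketch, assuming the board page returns its full HTML to a plain request), you can confirm that each movie entry sits in a dd element under .board-wrapper:

import requests
import parsel

url = 'https://maoyan.com/board/4?offset=0'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
html = requests.get(url, headers=headers).text
selector = parsel.Selector(html)
# Each board page lists 10 movies, one per dd element
print(len(selector.css('.board-wrapper dd')))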

Code

import csv
import time

import parsel
import requests

# Note: csv_write was not defined in the original snippet; the writer setup below
# is a minimal addition so the code runs end to end (the output filename is just an example).
f = open('maoyan_top100.csv', mode='w', encoding='utf-8-sig', newline='')
# Column names: 电影名字 = movie title, 主演 = stars, 上映时间 = release date, 评分 = score
csv_write = csv.DictWriter(f, fieldnames=['电影名字', '主演', '上映时间', '评分'])
csv_write.writeheader()

for page in range(0, 100, 10):  # offsets 0, 10, ..., 90 cover the Top100
    time.sleep(2)  # pause between pages to avoid hammering the site
    url = 'https://maoyan.com/board/4?offset={}'.format(page)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
        'Cookie': '__mta=20345351.1670903159717.1670903413872.1670903436333.5; uuid_n_v=v1; uuid=A8065B807A9811ED82C293D7E110319C9B09821067E1411AB6F4EC82889E1869; _csrf=916b8446658bd722f56f2c092eaae35ea3cd3689ef950542e202b39ddfe7c91e; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1670903160; _lxsdk_cuid=1850996db5dc8-07670e36da28-26021151-1fa400-1850996db5d67; _lxsdk=A8065B807A9811ED82C293D7E110319C9B09821067E1411AB6F4EC82889E1869; __mta=213622443.1670903327420.1670903417327.1670903424017.4; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1670903436; _lxsdk_s=1850996db5e-8b2-284-88a%7C%7C18',
        'Host': 'www.maoyan.com',
        'Referer': 'https://www.maoyan.com/films/1200486'
    }
    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)
    li_s = selector.css('.board-wrapper dd')
    for li in li_s:
        name = li.css('.name a::text').get()
        star = li.css('.star::text').get()
        star_string = star.strip()
        releasetime = li.css('.releasetime::text').get()
        data_time = releasetime.strip()
        follow = li.css('.score i::text').getall()
        score = ''.join(follow)
        dit = {
            '电影名字': name,
            '主演': star_string,
            '上映时间': data_time,
            '评分': score,
        }
        csv_write.writerow(dit)
        print(dit)

f.close()

Cookie acquisition

Copy the Cookie value from the request headers shown in the browser's developer tools (F12, Network tab, then refresh the page) and paste it into the headers dictionary in the code above.
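A minimal sketch of turning the raw Cookie string copied from the developer tools into a dict that requests can send through its cookies= parameter (the value below is a placeholder; paste your own):

import requests

raw_cookie = 'uuid_n_v=v1; uuid=...; _csrf=...'  # placeholder: paste your own Cookie header value
cookies = {}
for pair in raw_cookie.split('; '):
    if '=' in pair:
        key, value = pair.split('=', 1)
        cookies[key] = value

response = requests.get('https://maoyan.com/board/4', cookies=cookies, timeout=10)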

Show results


Finally

To thank the readers, I would like to share some of my favorite practical programming resources as a way of giving back, and I hope they help you.

There are practical tutorials suitable for beginners~

Come and grow together with Xiaoyu!

① More than 100 Python PDFs (covering the mainstream and classic books)

② Python standard library (the most complete Chinese version)

③ Source code of crawler projects (forty or fifty interesting, classic practice projects with source code)

④ Videos on basics of Python, crawlers, web development, and big data analysis (suitable for beginners)

⑤ Python learning roadmap (say goodbye to aimless learning)


Source: blog.csdn.net/Modeler_xiaoyu/article/details/128299879