Day 01: Python web crawler basics

Today's content:

Crawler course:

Part 1: The basic principles of a crawler

Part 2: The Requests library

Part 1: The basic principles of a crawler

1. What is a crawler?

A crawler is a program that crawls (fetches) data from the Internet.

2. What is the Internet?

A collection of network devices that connects individual computers to one another is called the Internet.

3. Why was the Internet built?

To transfer and share data.

4. What is data?

For example:

Product information on e-commerce platforms (Taobao, Jingdong, Amazon)

Housing listings on rental platforms (Lianjia, Ziroom)

Equity and securities investment information (Eastmoney, Xueqiu)

12306 train ticket information (for ticket grabbing)

5. What does it mean to go online?

Ordinary user: open the browser ---> enter the URL ---> send a request to the target host ---> the response data is returned ---> the browser renders the data

Crawler: simulate a browser ---> send a request to the target host ---> the response data is returned ---> parse and extract the valuable data ---> save the data (write it to a local file or persist it to a database)
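To make "simulate a browser" a little more concrete, here is a minimal sketch that builds (but does not send) a request carrying browser-like headers, using the standard library's urllib.request. The URL and the header values are placeholder assumptions, not values from this course. With the Requests library, the same headers dict would be passed as requests.get(url, headers=headers).

```python
from urllib.request import Request

# Placeholder values for illustration only (assumptions, not from the course).
URL = "http://example.com/"
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html",
}


def build_browser_like_request(url, headers=None):
    """Build a GET request whose headers resemble those a browser sends."""
    return Request(url, headers=headers or BROWSER_HEADERS, method="GET")


# Nothing is sent over the network here; we only inspect the request object.
request = build_browser_like_request(URL)
print(request.get_method(), request.full_url)
print(request.header_items())
```

Many sites answer differently (or not at all) when the User-Agent header is missing, which is why a crawler usually sets it explicitly.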

6. The whole crawling process

1. Send the request (libraries: Requests / Selenium)

2. Get the response data

3. Parse the data (parsing library: BeautifulSoup4)

4. Save the data (storage: a file / MongoDB)
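The four steps above can be sketched as one small pipeline. This is only an illustration under stated assumptions: it swaps in the standard library (urllib for the request, re for parsing, a JSON file for storage) instead of Requests, BeautifulSoup4 and MongoDB so that it runs without extra packages, and the names fetch, parse_links, save and crawl are hypothetical. The demo at the bottom uses a stand-in fetcher with fixed HTML, so nothing is actually requested.

```python
import json
import re
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Steps 1 and 2: send the request and return the response body as text."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_links(html):
    """Step 3: pull every double-quoted href value out of the page."""
    return re.findall(r'href="(.*?)"', html)


def save(links, path):
    """Step 4: persist the extracted data to a local JSON file."""
    Path(path).write_text(json.dumps(links, indent=2), encoding="utf-8")


def crawl(url, path, fetcher=fetch):
    """Run the four steps in order; fetcher can be replaced for offline use."""
    html = fetcher(url)
    links = parse_links(html)
    save(links, path)
    return links


# Offline demo: a stand-in fetcher returns fixed HTML, so no network is needed.
SAMPLE_HTML = '<a href="/a.html">A</a> <a class="x" href="/b.html">B</a>'
output_path = Path(tempfile.gettempdir()) / "crawl_demo_links.json"
links = crawl("http://example.com/", output_path, fetcher=lambda url: SAMPLE_HTML)
print(links)
```

Passing the fetcher in as a parameter keeps the parsing and saving steps testable without touching a real site.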

Summary: We can compare the data on the Internet to a treasure; a crawler is, in effect, digging up that treasure.

 

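import re
import uuid

import requests


def get_page(url):
    """Send a GET request and return the response object."""
    response = requests.get(url)
    return response


def parse_index(html):
    """Extract the detail-page URLs from an index page."""
    detail_urls = re.findall(
        '<div class="items"><a class="imglink" href="(.*?)"', html, re.S)
    print(detail_urls)
    return detail_urls


def parse_detail(html):
    """Return the first video URL on a detail page, or None if there is none."""
    movie_url = re.findall('<source src="(.*?)">', html, re.S)
    if movie_url:
        return movie_url[0]
    return None


def save_video(content):
    """Write the video bytes to a uniquely named .mp4 file."""
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(content)
    print('Download complete')


if __name__ == '__main__':
    # Crawl index pages 0 through 5.
    for line in range(6):
        url = f'http://www.xiaohuar.com/list-3-{line}.html'
        # Steps 1 and 2: send the request and get the index page.
        response = get_page(url)
        # Step 3: parse out the detail-page URLs.
        detail_urls = parse_index(response.text)
        for detail_url in detail_urls:
            print(detail_url)
            detail_res = get_page(detail_url)
            movie_url = parse_detail(detail_res.text)
            if movie_url:
                print(movie_url)
                # Step 4: download the video and save it to disk.
                movie_res = get_page(movie_url)
                save_video(movie_res.content)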
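Before running the script against the live site, the two regular expressions can be checked offline. The sketch below applies the same patterns to hand-written HTML fragments; the fragments and the example.com URLs are made up for illustration and are only assumed to resemble the real pages.

```python
import re

# Hypothetical fragments shaped the way the patterns expect (not real page source).
INDEX_HTML = '''
<div class="items"><a class="imglink" href="http://example.com/detail/1.html" target="_blank"></a></div>
<div class="items"><a class="imglink" href="http://example.com/detail/2.html" target="_blank"></a></div>
'''
DETAIL_HTML = '<video controls><source src="http://example.com/video/1.mp4"></video>'

# The same patterns used by parse_index and parse_detail in the script.
INDEX_PATTERN = r'<div class="items"><a class="imglink" href="(.*?)"'
DETAIL_PATTERN = r'<source src="(.*?)">'

detail_urls = re.findall(INDEX_PATTERN, INDEX_HTML, re.S)
movie_urls = re.findall(DETAIL_PATTERN, DETAIL_HTML, re.S)
no_match = re.findall(DETAIL_PATTERN, "<p>no video here</p>", re.S)

print(detail_urls)  # one URL per matching index entry
print(movie_urls)   # the video source URL
print(no_match)     # empty list when the page has no <source> tag
```

The empty list in the last case is why the script checks the result of parse_detail before downloading anything.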

 

 

 


Origin www.cnblogs.com/zhoujie333/p/11114076.html