Today's content:
Crawler course:
1. The basic principles of web crawlers
2. Making requests with the requests library

The basic principles of web crawlers
1. What is a web crawler?
A crawler is a program that crawls (fetches and extracts) data.
2. What is the Internet?
A collection of network devices that connect individual computers together into one network, which we call the Internet.
3. The purpose of establishing the Internet
Data transfer and data sharing.
4. What is data?
For example:
Product information on e-commerce platforms (Taobao, JD.com, Amazon); property listings on rental platforms such as Lianjia and Ziru
Equity and securities investment information (East Money, Xueqiu)
Train ticket information on 12306 (ticket grabbing)
5. What is the whole process of surfing the Internet?
Ordinary user: open a browser ---> enter a URL ---> send a request to the target host ---> the target host returns response data ---> the browser renders the data
Crawler: simulate a browser ---> send a request to the target host ---> the target host returns response data ---> parse and extract valuable data ---> save the data (write to a local file, or persist to a database)
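The crawler flow above can be sketched in a few lines of Python. The HTML string below is a hypothetical stand-in for what a target host would return, so the sketch runs without network access:

```python
import re

# Hypothetical response data, standing in for what a target host would return
html = '<html><head><title>Example Page</title></head><body><a href="/detail/1">item</a></body></html>'

# Parse and extract valuable data (here: the page title) with a regex
title = re.findall(r'<title>(.*?)</title>', html, re.S)[0]
print(title)
```

In a real crawler, the hardcoded string would instead come from sending a request, e.g. `requests.get(url).text`.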
6. The full workflow of a crawler
1. Send a request (libraries: requests / Selenium)
2. Get the response data
3. Parse the data (parsing library: BeautifulSoup4)
4. Save the data (storage: local files / MongoDB)
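The four steps above can be sketched as one small pipeline. This is a minimal sketch with assumptions: the request step is stubbed with a canned response so it runs offline (a real crawler would call `requests.get`), a stdlib regex stands in for BeautifulSoup4, and the file path and HTML snippet are made up:

```python
import os
import re
import tempfile

def send_request(url):
    # Steps 1-2: send a request and get the response data.
    # Stubbed with a canned response here; a real crawler would
    # return requests.get(url).text instead.
    return '<div class="items"><a href="/movie/1">A</a><a href="/movie/2">B</a></div>'

def parse_data(html):
    # Step 3: parse the data (a regex stands in for BeautifulSoup4)
    return re.findall(r'href="(.*?)"', html, re.S)

def save_data(links, path):
    # Step 4: save the data to a local file (MongoDB is the other option)
    with open(path, 'w') as f:
        f.write('\n'.join(links))

html = send_request('http://example.com/list')
links = parse_data(html)
out_path = os.path.join(tempfile.gettempdir(), 'links.txt')
save_data(links, out_path)
print(links)
```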
Summary: we can liken the data on the Internet to buried treasure; crawling is really just digging up that treasure.
import re
import uuid

import requests


def get_page(url):
    # Send a GET request and return the response object
    response = requests.get(url)
    return response


def parse_index(html):
    # Extract the detail-page URLs from an index page
    detail_urls = re.findall(
        '<div class="items"><a class="imglink" href="(.*?)"', html, re.S)
    print(detail_urls)
    return detail_urls


def parse_detail(html):
    # Extract the video URL from a detail page
    movie_url = re.findall('<source src="(.*?)">', html, re.S)
    if movie_url:
        return movie_url[0]


def save_video(content):
    # Save the video bytes under a random file name
    with open(f'{uuid.uuid4()}.mp4', 'wb') as f:
        f.write(content)
    print('Download complete')


if __name__ == '__main__':
    for line in range(6):
        url = f'http://www.xiaohuar.com/list-3-{line}.html'
        response = get_page(url)
        detail_urls = parse_index(response.text)
        for detail_url in detail_urls:
            print(detail_url)
            detail_res = get_page(detail_url)
            movie_url = parse_detail(detail_res.text)
            if movie_url:
                print(movie_url)
                movie_res = get_page(movie_url)
                save_video(movie_res.content)
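The parsing functions in the script rely on `re.findall` with the `re.S` (DOTALL) flag, which lets `.` match newlines so `(.*?)` can span line breaks inside the HTML. A small sketch with a made-up snippet shows why the flag matters:

```python
import re

# Made-up HTML where the attribute value happens to span a line break
html = '<source src="http://example.com/\nvideo.mp4">'

without_flag = re.findall('<source src="(.*?)">', html)        # '.' stops at '\n'
with_flag = re.findall('<source src="(.*?)">', html, re.S)     # '.' matches '\n' too

print(without_flag)
print(with_flag)
```

Without `re.S` the pattern finds nothing, because the non-greedy group cannot cross the newline; with it, the full URL is captured.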