Crawling Douban Movies Top 250 and Douban Books

Douban Top 250 URL: https://movie.douban.com/top250?start=0&filter=

First press F12 to open the browser's developer tools, then press Ctrl+Shift+C to inspect the movie title element.

 

You can see that the movie title sits inside an <a> tag, so we just need to locate that tag. Walking up from it, its top-level ancestor is a <div> tag with the attribute class="item". Now we have a rough plan, so let's write the code.
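To make the structure concrete, here is a minimal sketch of the markup that the selectors in this post rely on. The HTML below is inferred from those selectors, not copied from Douban, so treat it as an illustration only:

from bs4 import BeautifulSoup

# Stand-in for one entry of the real page; structure inferred from the
# selectors used later ('.item', 'div.info a', 'div.star', 'p.quote')
sample = '''
<div class="item">
  <div class="info">
    <a href="#">Example Title</a>
    <div class="star">9.0  2,000,000 ratings</div>
    <p class="quote">An example one-line review.</p>
  </div>
</div>
'''
soup = BeautifulSoup(sample, 'lxml')
print(soup.select_one('div.info a').text)            # Example Title
print(soup.select_one('div.star').text.strip()[:3])  # 9.0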

Step 1: set the request headers.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69'}

Then we send a GET request and parse the response with BeautifulSoup:

html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
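It's also worth checking that the request actually succeeded; Douban tends to block clients that don't present a browser-like User-Agent, which is exactly why we set the header above. Continuing from the snippet above (my addition, not in the original post):

html.raise_for_status()  # raises an exception on a 4xx/5xx response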

Next we use BeautifulSoup's select method to grab every movie entry:

movies = soup.select('.item')
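Each page of the Top 250 lists 25 movies, so a quick sanity check (my addition) confirms the selector matched them all:

print(len(movies))  # expect 25 on a full page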

With the entries in hand, we iterate over them, drilling down step by step to the fields we need:

for movie in movies:
    movie_name = movie.select_one('div.info a').text.strip().replace(' ', '').replace('\n', '')
    movie_star = movie.select_one('div.star').text.strip()
    movie_quote = movie.select_one('p.quote').text.strip()
    # the first three characters of the star block are the rating, e.g. "9.7"
    print(movie_name, movie_star[0:3])
    print(movie_quote, '\n', '-' * 80)
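One caveat (my addition): a handful of movies in the Top 250 have no one-line quote, in which case select_one('p.quote') returns None and calling .text raises an AttributeError. A defensive variant of that line:

# some entries lack <p class="quote">, so check for None first
quote_tag = movie.select_one('p.quote')
movie_quote = quote_tag.text.strip() if quote_tag else ''

The complete code at the end uses this guard.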

This code runs, but it only fetches the first page. How do we page through the results automatically?

Click through to the second page and look at its URL: https://movie.douban.com/top250?start=25&filter= . The third page's URL is https://movie.douban.com/top250?start=50&filter= . A pattern emerges.

Between pages, only the number in the URL's start=x parameter changes, so a for loop that varies this number is all we need to build each page's URL, as sketched below.
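For instance, this small sketch (my own illustration) prints every page's URL; start steps through 0, 25, 50, ..., 225 for the ten pages:

for start in range(0, 250, 25):
    print('https://movie.douban.com/top250?start={}&filter='.format(start))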

Run the code and the data is crawled down successfully.

 

 The complete code is as follows:

import requests
from bs4 import BeautifulSoup

def get_movies(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69'}
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    movies = soup.select('.item')
    for movie in movies:
        movie_name = movie.select_one('div.info a').text.strip().replace(' ', '').replace('\n', '')
        movie_star = movie.select_one('div.star').text.strip()
        # guard against entries that have no one-line quote
        quote_tag = movie.select_one('p.quote')
        movie_quote = quote_tag.text.strip() if quote_tag else ''
        print(movie_name, movie_star[0:3])
        print(movie_quote, '\n', '-' * 80)

if __name__ == '__main__':
    # ten pages of 25 movies each cover the whole Top 250
    for each in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(each)
        get_movies(url)

Once we can crawl Douban Movies, other similar sites yield to the same approach. Next, let's crawl Douban Books.

The method is the same as above: send a GET request to fetch the page source, parse it with BeautifulSoup, and then select and iterate over the entries. The complete Douban Books crawler is as follows:

import requests
import time
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69'
}

def get_html(url):
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    books = soup.select('li.subject-item')
    for book in books:
        title = book.select_one('div.info h2 a').text.strip().replace('\n', '').replace(' ', '')
        pub = book.select_one('div.pub').text.strip().replace('\n', '')
        shuping = book.select_one('p').text  # the short blurb under each entry
        ratingnum = book.select_one('.rating_nums').text
        print(title, pub, 'Rating:', ratingnum)
        print(shuping)
        print('-' * 80)

if __name__ == '__main__':
    for each in range(0, 100, 20):
        url = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={}&type=T'.format(each)
        print('Crawling page {}'.format(each // 20 + 1))
        get_html(url)
        time.sleep(1)  # brief pause so we don't hammer the server
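One last note: the %E5%B0%8F%E8%AF%B4 in the book URL is simply the percent-encoded tag 小说 ("fiction"). To crawl a different tag, urllib can build the URL for you; a small sketch with example tags of my own choosing:

from urllib import parse

# percent-encode a Chinese tag name into the URL path,
# e.g. 小说 -> %E5%B0%8F%E8%AF%B4
for tag in ['小说', '历史']:
    print('https://book.douban.com/tag/{}?start=0&type=T'.format(parse.quote(tag)))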

 
