Douban Top 250 URL: https://movie.douban.com/top250?start=0&filter=
First press F12 to open the browser's developer tools, then press Ctrl+Shift+C to inspect the page and locate the movie title.
You can see that the movie title sits inside an <a> tag, so we just need to locate that tag. Working up from it, we find that the top-level ancestor of the <a> tag is a <div> with the class attribute "item". Now we have a rough plan, so let's write the code.
Step 1: set the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69'}
Then we send a GET request and parse the response with BeautifulSoup:
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
Next we pick out the items with BeautifulSoup's select method:
movies = soup.select('.item')
Now that we have targeted the div tags, we iterate over them, drilling down step by step to what we need:
for movie in movies:
    movie_name = movie.select_one('div.info a').text.strip().replace(' ','').replace('\n','')
    movie_star = movie.select_one('div.star').text.strip()
    movie_quote = movie.select_one('p.quote').text.strip()
    print(movie_name, movie_star[0:3])
    print(movie_quote, '\n', '-'*80)
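To see how these selectors behave, here is a minimal sketch run against a small hand-written HTML fragment that imitates the structure of a Douban list item (the fragment itself is invented for illustration, not copied from the live site):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking one div.item from the Top 250 list
html = '''
<div class="item">
  <div class="info">
    <a href="https://movie.douban.com/subject/1292052/">肖申克的救赎</a>
  </div>
  <div class="star">9.7 2,500,000人评价</div>
  <p class="quote">希望让人自由。</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')   # html.parser avoids the lxml dependency
for movie in soup.select('.item'):          # select: every element with class "item"
    name = movie.select_one('div.info a').text.strip()   # select_one: first match only
    star = movie.select_one('div.star').text.strip()
    quote = movie.select_one('p.quote').text.strip()
    print(name, star[0:3])                  # star[0:3] keeps just the rating digits
    print(quote)
```

Note that select always returns a list of matches, while select_one returns only the first match (or None), which is why the loop uses the former and the per-field lookups use the latter.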
The code runs, but it only fetches the content of the first page. How do we page through the rest automatically?
Click through to the second page and look at its URL: https://movie.douban.com/top250?start=25&filter= . The third page's URL is https://movie.douban.com/top250?start=50&filter= . A pattern emerges.
Paging only changes the number in the URL's start= parameter, so a for loop that steps this number is enough to generate each page's URL.
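The paging rule can be written out directly: a small sketch that just builds all ten Top 250 page URLs (25 movies per page) without sending any requests:

```python
# start= goes 0, 25, 50, ... 225 across the ten pages of the Top 250
urls = ['https://movie.douban.com/top250?start={}&filter='.format(n)
        for n in range(0, 250, 25)]

for u in urls[:3]:   # show the first three generated URLs
    print(u)
```

The main script below uses range(0, 100, 25), i.e. only the first four pages; widening the range to 250 as above would cover the full list.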
Running the code now, the data is crawled successfully.
The complete code is as follows:
import requests
from bs4 import BeautifulSoup

def get_movies(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69'}
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    movies = soup.select('.item')
    for movie in movies:
        movie_name = movie.select_one('div.info a').text.strip().replace(' ','').replace('\n','')
        movie_star = movie.select_one('div.star').text.strip()
        movie_quote = movie.select_one('p.quote').text.strip()
        print(movie_name, movie_star[0:3])
        print(movie_quote, '\n', '-'*80)

if __name__ == '__main__':
    for each in range(0, 100, 25):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(each)
        get_movies(url)
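As an aside, instead of formatting start= into the URL string by hand, requests can encode the query string for you through its params argument. A sketch using a prepared request, so the URL can be inspected without anything actually being sent:

```python
import requests

# Build the request object offline; prepare() assembles the final URL
req = requests.Request('GET', 'https://movie.douban.com/top250',
                       params={'start': 25, 'filter': ''})
prepared = req.prepare()
print(prepared.url)   # the query string is URL-encoded for us
```

In the crawler itself this would be `requests.get(url, headers=headers, params={'start': each, 'filter': ''})`; both approaches produce the same request.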
Having crawled Douban movies, we can crawl other similar sites the same way. Next, let's crawl Douban Books.
The method is the same as above: first send a GET request to obtain the page source, parse it with BeautifulSoup, then iterate with the select method.
The complete Douban Books crawling code is as follows:
import requests, time
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69'
}

def get_html(url):
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    books = soup.select('li.subject-item')
    for book in books:
        title = book.select_one('div.info h2 a').text.strip().replace('\n','').replace(' ','')
        pub = book.select_one('div.pub').text.strip().replace('\n','')
        shuping = book.select_one('p').text
        ratingnum = book.select_one('.rating_nums').text
        print('《', title, '》', pub, "评分", ratingnum, "分")
        print(shuping)
        print('-'*80)

if __name__ == '__main__':
    for each in range(0, 100, 20):
        url = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={}&type=T'.format(each)
        print('Crawling page {}'.format(each // 20 + 1))
        get_html(url)
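Both scripts above request page after page with no pause, which is hard on the server. A polite crawler sleeps between requests; here is a minimal sketch of the books crawler's paging arithmetic plus a delay (the one-second pause is an arbitrary choice, not something the site mandates):

```python
import time

pages = []
for each in range(0, 100, 20):
    page = each // 20 + 1      # start=0 -> page 1, start=20 -> page 2, ...
    pages.append(page)
    print('Crawling page {}'.format(page))
    time.sleep(1)              # pause so we do not flood the server
```

In the full script, the `time.sleep(1)` call would go just before `get_html(url)` inside the loop; the time module is already imported there.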