Page analysis
Search for "Joey Wong" on Douban Movie, open her homepage, and click "All photos" to enter the photo list page.
On this page, click "Next" and watch how the URL in the browser changes:
https://movie.douban.com/celebrity/1166896/photos/?type=C&start=30&sortby=like&size=a&subtype=a
Analyzing the URL with Postman again, it is easy to see that start is the paging parameter, advancing in steps of 30: the first page is start=0, the second page start=30, the third page start=60, and so on.
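The paging rule can be sketched as a small helper (the helper name and the page count parameter are mine, purely for illustration):

```python
# Build the paged photo-list URLs: `start` advances in steps of 30.
BASE_URL = ('https://movie.douban.com/celebrity/1166896/photos/'
            '?type=C&start=%d&sortby=like&size=a&subtype=a')

def page_urls(pages):
    """Return the list-page URLs for the first `pages` pages."""
    return [BASE_URL % (i * 30) for i in range(pages)]
```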
Details page analysis
Use the Network panel to inspect the photo information on the page:
Here we get two pieces of information:
- the link in each photo's a tag leads to its review page;
- the link in the img tag can be used to save the goddess's posters.
The function below collects both kinds of URL and returns them:
import requests
from bs4 import BeautifulSoup

def get_posters():
    comment_url_list = []
    picture_list = []
    for i in range(0, 40000, 30):
        url = 'https://movie.douban.com/celebrity/1166896/photos/?type=C&start=%s&sortby=like&size=a&subtype=a' % str(i)
        req = requests.get(url).text
        content = BeautifulSoup(req, "html.parser")
        # Stop once the "next page" link disappears, i.e. the last page is reached
        check_point = content.find('span', attrs={'class': 'next'}).find('a')
        if check_point is not None:
            data = content.find_all('div', attrs={'class': 'cover'})
            for k in data:
                ulist = k.find('a')['href']    # link to the photo's review page
                plist = k.find('img')['src']   # link to the poster image itself
                comment_url_list.append(ulist)
                picture_list.append(plist)
        else:
            break
    return comment_url_list, picture_list
After that, you can download the posters.
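A minimal download sketch, assuming the last path segment of each image URL is a unique filename (the helper name and the save directory are mine, not from the original):

```python
import os
import requests

def download_posters(picture_list, save_dir='posters'):
    """Save each image URL in picture_list into save_dir."""
    os.makedirs(save_dir, exist_ok=True)
    for url in picture_list:
        filename = url.split('/')[-1]  # last path segment, e.g. p123456.jpg
        resp = requests.get(url)
        if resp.status_code == 200:
            with open(os.path.join(save_dir, filename), 'wb') as f:
                f.write(resp.content)
```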
Getting the comments
Next, manually jump to a poster's detail page and continue inspecting the comment information.
The comments are easy to extract with BeautifulSoup, and can then be saved to MongoDB.
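The extraction rule itself can be shown on a tiny inline sample (the HTML below is made up for illustration; the real pages use the same class names):

```python
from bs4 import BeautifulSoup

# Each comment sits in a <div class="comment-item">; its text is in a <p> tag.
sample = ('<div class="comment-item"><p>Beautiful still!</p></div>'
          '<div class="comment-item"><p>A classic.</p></div>')
soup = BeautifulSoup(sample, 'html.parser')
comments = [div.find('p').text
            for div in soup.find_all('div', attrs={'class': 'comment-item'})]
```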
import pymongo
import requests
from bs4 import BeautifulSoup

def get_comment(comment_l):
    client = pymongo.MongoClient('mongodb://douban:[email protected]:49744/douban')
    db = client.douban
    mongo_collection = db.comment
    comment_list = []
    comment = []
    print("Save to MongoDB")
    # First collect the comment nodes from every detail page
    for i in comment_l:
        response = requests.get(i).text
        content = BeautifulSoup(response, "html.parser")
        tmp_list = content.find_all('div', attrs={'class': 'comment-item'})
        comment_list = comment_list + tmp_list
    # Then extract the text and insert each comment into MongoDB once
    for k in comment_list:
        tmp_comment = k.find('p').text
        mongo_collection.insert_one({'comment': tmp_comment})
        comment.append(tmp_comment)
    print("Save Finish!")