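Scraping Douban Movies with Python: The Shawshank Redemption

This article follows the BeautifulSoup4 documentation (link: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/index.html).

Target URL: https://movie.douban.com/subject/1292052/

1. Get the page source (you can also view it directly with the browser's Inspect Element tool):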

import requests

url = 'https://movie.douban.com/subject/1292052/'
# Douban may reject requests that lack a browser-like User-Agent, so send one.
headers = {'User-Agent': 'Mozilla/5.0'}
data = requests.get(url, headers=headers, timeout=10).text

# Print the raw HTML of the page
print(data)
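
The fetch step can be made a little more defensive. The following is a minimal sketch, not part of the original post: the helper name fetch_html, the User-Agent string, and the 10-second timeout are all illustrative assumptions. It only wraps the same requests.get call and raises an exception on an HTTP error status instead of silently returning an error page.

```python
import requests

# Assumption: any browser-like User-Agent string; this exact value is illustrative.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_html(url, timeout=10):
    """Return the page source as text (hypothetical helper).

    Raises requests.HTTPError if the server answers with a 4xx/5xx status,
    and other requests exceptions on connection problems or timeouts.
    """
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.text
```

With this helper, data = fetch_html('https://movie.douban.com/subject/1292052/') would take the place of the direct requests.get call.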

2. Get the movie title (see the "Kinds of objects" section of the documentation):

Purpose of this code: get the title tag and its contents.

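import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/1292052/'
# Douban may reject requests that lack a browser-like User-Agent, so send one.
headers = {'User-Agent': 'Mozilla/5.0'}
data = requests.get(url, headers=headers, timeout=10).text
soup = BeautifulSoup(data, 'lxml')
# soup.title is the <title> Tag object, not just its text
title1 = soup.title

print(title1)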

 

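The output includes the surrounding <title> tags, which we don't want. To drop them, use the get_text() method.

So change the line that assigns title1 to:

title1 = soup.title.get_text(strip=True)
# get_text() returns the text content of the tag
# strip=True removes leading and trailing whitespace, including newlines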
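
To see the difference without hitting the network, here is a small sketch that parses an inline HTML string. The markup is a simplified stand-in for the real page (an assumption), and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real page; the actual markup is larger.
SAMPLE_HTML = "<html><head><title>\n  肖申克的救赎 (豆瓣)\n</title></head><body></body></html>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tag = soup.title                          # a Tag object; printing it shows the <title> markup
raw_text = tag.get_text()                 # text only, surrounding whitespace kept
clean_text = tag.get_text(strip=True)     # text only, surrounding whitespace removed

print(repr(str(tag)))
print(repr(raw_text))
print(repr(clean_text))
```

The first print shows the tag with its markup, the second shows the text with its newlines and indentation, and the third shows only the title text.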

3. Get the movie info:

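import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/1292052/'
# Douban may reject requests that lack a browser-like User-Agent, so send one.
headers = {'User-Agent': 'Mozilla/5.0'}
data = requests.get(url, headers=headers, timeout=10).text
soup = BeautifulSoup(data, 'lxml')
# select() returns a list of matches; take the first one and read its text
info = soup.select("#info")[0].text

print(info)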

select() returns every tag that matches a CSS selector; here it matches the tag whose id is info (use . to match a class and # to match an id).
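
The selector syntax can be tried on a small inline document. The markup below is hypothetical and much simpler than the real Douban page; the class names pl and genre are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
SAMPLE_HTML = """
<div id="info">
  <span class="pl">Director</span>: <span class="attrs">Frank Darabont</span><br/>
  <span class="pl">Genre:</span> <span class="genre">Drama</span> / <span class="genre">Crime</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "#info" matches the element whose id is "info"; select() returns a list.
info_blocks = soup.select("#info")

# ".pl" matches every element whose class list contains "pl".
labels = [tag.get_text(strip=True) for tag in soup.select(".pl")]

# Selectors can be combined: spans with class "genre" inside #info.
genres = [tag.get_text(strip=True) for tag in soup.select("#info span.genre")]

# select_one() returns the first match directly, or None if nothing matches.
missing = soup.select_one("#no-such-id")

print(len(info_blocks), labels, genres, missing)
```

Because select() always returns a list, an index such as [0] is needed to reach a single tag; select_one() avoids the index but returns None when there is no match.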

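With select(), take the first match and read its text with [0].text.

With find_all(), which also returns a list, call get_text() on an element to read its text. (.text and get_text() return the same string, so either one works with either method.)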
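
A short sketch comparing these extraction routes on an inline document follows. The markup is again a hypothetical, simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is more complex.
SAMPLE_HTML = (
    '<div id="info"><span class="pl">Director</span>: '
    '<span class="attrs">Frank Darabont</span></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Three routes to the same tag, each followed by text extraction.
via_select = soup.select("#info")[0].text
via_find_all = soup.find_all("div", id="info")[0].get_text()
via_find = soup.find(id="info").get_text()

# get_text() also accepts a separator and strip=True, which .text does not.
compact = soup.find(id="info").get_text(" ", strip=True)

print(via_select)
print(via_select == via_find_all == via_find)
print(compact)
```

Note that the separator is placed between every text fragment, so with " " as the separator a space also appears before the colon.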


Reposted from blog.csdn.net/qq_41755143/article/details/88591915