Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/haoyuexihuai/article/details/82353773
1. BeautifulSoup
Extract the content under each li and print the entries whose score is greater than 3.
Key points: zip, stripped_strings
from bs4 import BeautifulSoup
data = []
path = './web/new_index.html'
with open(path, 'r') as f:
    Soup = BeautifulSoup(f.read(), 'lxml')
titles = Soup.select('ul > li > div.article-info > h3 > a')
pics = Soup.select('ul > li > img')
descs = Soup.select('ul > li > div.article-info > p.description')
rates = Soup.select('ul > li > div.rate > span')
cates = Soup.select('ul > li > div.article-info > p.meta-info')
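# Use zip to walk the selected tag lists in parallel and store each group in a dict.
# zip(iter1, iter2, ...) takes any number of iterables and packs the elements at the same position into tuples.
# In Python 3 it returns a lazy iterator rather than a list, and it stops at the shortest input.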
for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
    info = {
        'title': title.get_text(),
        'pic': pic.get('src'),
        'descs': desc.get_text(),
        'rate': rate.get_text(),
        # New to me: .stripped_strings yields every string under the tag with the
        # surrounding whitespace stripped and whitespace-only strings skipped.
        # Here it collects all the text under the p tag into a list.
        'cate': list(cate.stripped_strings),
    }
    data.append(info)
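for i in data:
    # Compare the numeric score; len(i['rate']) would only measure the string length.
    # This assumes the span text is a plain number such as '4.5'.
    if float(i['rate']) > 3:
        print(i['title'], i['cate'])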
Output
Sardinia’s top 10 beaches ['fun', 'Wow']
How to get tanned ['butt', 'NSFW']
How to be an Aussie beach bum ['sea']
Summer’s cheat sheet ['bay', 'boat', 'beach']
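The two key points of this section, zip and .stripped_strings, can be tried without the local file. The sketch below is only an illustration: the HTML string, the helper name parse_articles, and the use of the built-in 'html.parser' (instead of 'lxml') are assumptions made so the example is self-contained, and the markup merely imitates the structure that the selectors above expect.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the selectors above assume.
HTML = """
<ul>
  <li>
    <div class="article-info">
      <h3><a href="#">First post</a></h3>
      <p class="meta-info"> <span>fun</span>  <span>Wow</span> </p>
    </div>
    <div class="rate"><span>4.5</span></div>
  </li>
  <li>
    <div class="article-info">
      <h3><a href="#">Second post</a></h3>
      <p class="meta-info"><span>sea</span></p>
    </div>
    <div class="rate"><span>2.0</span></div>
  </li>
</ul>
"""


def parse_articles(html):
    """Return one dict per <li>, pairing the selected tags with zip."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = soup.select('ul > li > div.article-info > h3 > a')
    rates = soup.select('ul > li > div.rate > span')
    cates = soup.select('ul > li > div.article-info > p.meta-info')
    return [
        {
            'title': title.get_text(),
            # Convert the score text to a number so it can be compared.
            'rate': float(rate.get_text()),
            # stripped_strings skips whitespace-only strings and strips the rest.
            'cate': list(cate.stripped_strings),
        }
        for title, rate, cate in zip(titles, rates, cates)
    ]


articles = parse_articles(HTML)
high_rated = [a for a in articles if a['rate'] > 3]
print(high_rated)

# zip stops at the shortest input, so a tag missing from one list
# silently drops the trailing items of the others.
print(list(zip([1, 2, 3], 'ab')))
```

Because zip pairs items purely by position, it is worth checking that every select call returned the same number of tags before trusting the combined records.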
2. BeautifulSoup
Fetching a real web page
from bs4 import BeautifulSoup
import requests
url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text,'lxml')
# Locate the elements in the page and select each group of data
titles = soup.select('div.listing_title > a[target="_blank"]')
imgs = soup.select('img[width="180"]')
cates = soup.select('div.p13n_reasoning_v2')
# print(cates)
for title, img, cate in zip(titles, imgs, cates):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
        'cate': list(cate.stripped_strings),
    }
    print(data)
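The img addresses obtained above are not the real ones. A small trick: when a page loads its content dynamically with JavaScript, you can imitate a mobile browser by sending a mobile User-Agent, because the anti-scraping measures on mobile pages are sometimes less strict.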
User-Agent : Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Mobile Safari/537.36
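One way to send that User-Agent with requests is through the headers argument. The sketch below only prepares the request and does not send it, so it runs without network access; the helper name build_mobile_request is hypothetical, and whether the mobile page actually exposes the real image addresses is not verified here.

```python
import requests

# The mobile User-Agent string quoted above.
MOBILE_UA = ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/67.0.3396.79 Mobile Safari/537.36')


def build_mobile_request(url):
    """Prepare (but do not send) a GET request that carries the mobile User-Agent."""
    headers = {'User-Agent': MOBILE_UA}
    return requests.Request('GET', url, headers=headers).prepare()


url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
prepared = build_mobile_request(url)
print(prepared.headers['User-Agent'])

# To actually fetch the page (needs network access), either send the prepared request:
#     with requests.Session() as session:
#         wb_data = session.send(prepared, timeout=10)
# or pass the headers directly:
#     wb_data = requests.get(url, headers={'User-Agent': MOBILE_UA}, timeout=10)
```

The mobile page may use different markup from the desktop page, so the selectors used earlier would probably need to be adjusted after inspecting the response.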