Supplementary Notes on Web Scraping, with Some Python Along the Way

Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/haoyuexihuai/article/details/82353773

1. BeautifulSoup
Extract the content under each li and print the entries with a rating greater than 3.

Key points: zip, stripped_strings

from bs4 import BeautifulSoup

data = []
path = './web/new_index.html'

with open(path, 'r') as f:
    Soup = BeautifulSoup(f.read(), 'lxml')
    titles = Soup.select('ul > li > div.article-info > h3 > a')
    pics = Soup.select('ul > li > img')
    descs = Soup.select('ul > li > div.article-info > p.description')
    rates = Soup.select('ul > li > div.rate > span')
    cates = Soup.select('ul > li > div.article-info > p.meta-info')

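# Use zip to combine the selected tag lists so each group can be stored in a dict.
# zip(iter1, iter2, ...) takes any number of iterables and packs their corresponding elements into tuples.
# In Python 3 it returns a lazy iterator over those tuples (Python 2 returned a list) and stops at the shortest input.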
for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
    info = {
        'title': title.get_text(),
        'pic': pic.get('src'),
        'descs': desc.get_text(),
        'rate': rate.get_text(),
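        # First time I've seen this one: .stripped_strings yields every text
        # fragment under the p tag with surrounding whitespace removed;
        # list() collects the fragments into a list.
        'cate': list(cate.stripped_strings),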
    }
    data.append(info)

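# get_text() returns a string, so convert it to a number before comparing
# (this assumes the rate text is a plain number such as '4.5').
for i in data:
    if float(i['rate']) > 3:
        print(i['title'], i['cate'])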

Result

Sardinia’s top 10 beaches ['fun', 'Wow']
How to get tanned ['butt', 'NSFW']
How to be an Aussie beach bum ['sea']
Summer’s cheat sheet ['bay', 'boat', 'beach']
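
To make the two key points concrete, here is a minimal sketch run against a small inline HTML snippet. The snippet is hypothetical (it only imitates the structure the selectors above expect, not the real new_index.html), and it uses the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates one <li> of the page parsed above
html = """
<ul>
  <li>
    <div class="article-info">
      <h3><a href="#">Sample title</a></h3>
      <p class="meta-info">
        <span>fun</span>
        <span>Wow</span>
      </p>
    </div>
    <div class="rate"><span>4.5</span></div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
meta = soup.select('ul > li > div.article-info > p.meta-info')[0]

# get_text() keeps the newlines and indentation between the <span> tags
raw_text = meta.get_text()

# stripped_strings skips whitespace-only fragments and strips the rest
tags = list(meta.stripped_strings)
print(tags)

# zip pairs elements by position and stops at the shortest input
pairs = list(zip(['a', 'b', 'c'], [1, 2]))
print(pairs)
```

Because zip stops at the shortest input, a selector that matches fewer elements than the others will silently drop the trailing items, so it is worth comparing the lengths of the selected lists when results look incomplete.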

2. BeautifulSoup
Scraping a real web page

from bs4 import BeautifulSoup
import requests

url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
wb_data = requests.get(url)
soup      = BeautifulSoup(wb_data.text,'lxml')
# Locate the elements on the page and select each group of data
titles    = soup.select('div.listing_title > a[target="_blank"]')
imgs      = soup.select('img[width="180"]')
cates     = soup.select('div.p13n_reasoning_v2')
# print(cates)

for title,img,cate in zip(titles,imgs,cates):
    data = {
        'title'  :title.get_text(),
        'img'    :img.get('src'),
        'cate'   :list(cate.stripped_strings),
        }
    print(data)

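The img URLs obtained above are not the real image addresses. A small trick: if the page is loaded dynamically by JS, you can imitate a mobile browser by sending a mobile User-Agent header, because the anti-scraping measures on mobile pages are sometimes less strict.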

User-Agent : Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Mobile Safari/537.36
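
A minimal sketch of sending that User-Agent with requests follows. The helper name fetch_mobile and the timeout value are assumptions made for illustration; the headers parameter of requests.get is the standard way to set request headers. The last line only builds a request without sending it, so the outgoing header can be inspected offline.

```python
import requests

# The mobile User-Agent string quoted above (Chrome on a Nexus 5)
MOBILE_UA = ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/67.0.3396.79 Mobile Safari/537.36')
HEADERS = {'User-Agent': MOBILE_UA}


def fetch_mobile(url, timeout=10):
    """Hypothetical helper: GET the url while presenting a mobile User-Agent."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


# Build the request without sending it, to check which header would go out
prepared = requests.Request('GET', 'https://cn.tripadvisor.com/',
                            headers=HEADERS).prepare()
print(prepared.headers['User-Agent'])
```

The returned text can then be passed to BeautifulSoup(fetch_mobile(url), 'lxml') as before. The mobile page may use different markup, so the selectors used for the desktop page will probably need to be adjusted.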
