Python crawler: Ctrip Shanghai

The New Year is coming up and I have no appetite for research. As it happens, a lovely friend is coming to visit Shanghai. With nothing better to do, I figured I would scout the place out with a crawler first; I'm too lazy for an on-site survey.

First, the travel notes. I noticed the page URL is http://you.ctrip.com/travels/shanghai2.html and got curious: why is the very first page shanghai2??? Then what is shanghai1? Curiosity got the better of me and I clicked through to http://you.ctrip.com/travels/shanghai1.html

(⊙o⊙)… It turns out to be Beijing travel notes. Color me stunned; kudos to Ctrip's naming scheme. Okay, end of digression.

Flip to page 2 and the URL becomes http://you.ctrip.com/travels/shanghai2/t3-p2.html,
so it is safe to guess that p is the page number, and -p1, -p2, -p3… are the pages we want to crawl. Let's start with 20 pages.

urls=['http://you.ctrip.com/travels/shanghai2/t3-p'+str(i)+'.html' for i in range(1,21)]

Ctrip has only the most basic anti-scraping check, so we simply put on a disguise and add a headers field:

headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
res=requests.get(url,headers=headers)
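
If Ctrip ever starts throttling, being a bit gentler with the requests also helps. A minimal sketch, assuming a shared requests.Session and a short pause between pages (the fetch helper is my own addition, not part of the original script):

import time
import requests

session = requests.Session()
session.headers.update(headers)   # reuse the User-Agent defined above

def fetch(url, delay=1.0):
    # hypothetical helper: pause briefly between requests and fail loudly on HTTP errors
    time.sleep(delay)
    res = session.get(url, timeout=10)
    res.raise_for_status()
    return res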

Parse the page with bs4 to get each travel note's link, using the first page as an example:

soup = BeautifulSoup(res.content, 'html.parser')
detail_url = []
# each travel note is an <a class="journal-item cf" target="_blank"> element
tmp = soup.find_all('a', attrs={'class': 'journal-item cf', 'target': '_blank'})
for t in tmp:
    detail_url.append(t.get('href'))
['/travels/shanghai2/3333236.html',
 '/travels/shanghai2/3534134.html',
 '/travels/shanghai2/3635663.html',
 '/travels/shanghai2/3742279.html',
 '/travels/tibet100003/1755676.html',
 '/travels/shanghai2/1560853.html',
 '/travels/shanghai2/1816039.html',
 '/travels/shanghai2/1578243.html',
 '/travels/shanghai2/1885378.html',
 '/travels/huangshan19/2189034.html']

Something rather remarkable has snuck in (a Tibet entry and a Huangshan entry). Ctrip really is a magical site that embraces everything. Add a check for 'shanghai':

if 'shanghai' in t.get('href'):
    detail_url.append(t.get('href'))

Next, extract the Chinese characters from the note body.
The text sits inside p tags, with XPaths like

/html/body/div[3]/div[4]/div[1]/div[1]/div[2]/p[2]/text()
/html/body/div[3]/div[4]/div[1]/div[1]/div[2]/p[3]/text()

The next ones should be p[4], p[5]…, which would line everything up neatly. But lxml.etree kept failing to resolve these XPaths, which is awkward; my skills only go so far. So back to the trusty bs4 (a relative-XPath sketch with lxml follows the bs4 code below). The body text turns out to live inside class ctd_content; then check whether each character is Chinese and keep it if so.

def isContainChinese(s):
    # True if any character falls in the CJK Unified Ideographs range
    for c in s:
        if '\u4e00' <= c <= '\u9fa5':
            return True
    return False

def get_detail_content(url):
    res = requests.get('http://you.ctrip.com' + url, headers=headers)
    soup = BeautifulSoup(res.content, 'html.parser')
    tmp = soup.find_all('div', attrs={'class': 'ctd_content'})
    s = str(tmp[0])
    # walk the raw HTML character by character and keep only the Chinese ones,
    # which drops tags and punctuation along the way
    contain = ''
    for c in s:
        if isContainChinese(c):
            contain += c
    return contain
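
For the record, the absolute XPath above is brittle anyway; a relative XPath anchored on the ctd_content class might fare better with lxml. A rough sketch, not verified against the live page (get_detail_content_xpath is my own name for it):

from lxml import etree

def get_detail_content_xpath(url):
    # untested alternative: let a relative XPath collect all paragraph text
    res = requests.get('http://you.ctrip.com' + url, headers=headers)
    tree = etree.HTML(res.content)
    paragraphs = tree.xpath('//div[contains(@class, "ctd_content")]//p//text()')
    return ''.join(p.strip() for p in paragraphs)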

Save the result to a txt file.
Finally, speed it up with multithreading (then again, it is only 20 pages, so threads are hardly necessary; a sketch follows the full code anyway).
The full code:

import requests
from bs4 import BeautifulSoup
from lxml import etree
import os
urls=['http://you.ctrip.com/travels/shanghai2/t3-p'+str(i)+'.html' for i in range(1,21)]
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
path=os.getcwd()
def isContainChinese(s):
    for c in s:
        if ('\u4e00' <= c <= '\u9fa5'):
            return True
    return False
def get_detail_url(urls):
    detail_url=[]
    for url in urls:
        res=requests.get(url,headers=headers)
        soup = BeautifulSoup(res.content,'html.parser')
        tmp=soup.find_all('a',attrs={'class':'journal-item cf','target':'_blank'})
        for t in tmp:
            if 'shanghai' in t.get('href'):detail_url.append(t.get('href'))
    return detail_url

def get_detail_content(url):
    print(url)
    res=requests.get('http://you.ctrip.com'+url,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    tmp=soup.find_all('div',attrs={'class':'ctd_content'})
    s=str(tmp[0])
    contain=''
    for c in s:
        if isContainChinese(c):
            contain+=c
    return contain

detail_url=get_detail_url(urls)
txt=''
for url in detail_url:
    txt+=get_detail_content(url)
with open(path + '/shanghai.txt', 'a', encoding='utf-8') as f:  # utf-8 so the later read matches
    f.write(txt)
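
As noted, 20 pages hardly justify threads, but if the crawl grows, a sketch with concurrent.futures built on the functions above might look like this (max_workers=5 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as pool:
    # fetch the travel notes in parallel instead of the serial loop above
    txt = ''.join(pool.map(get_detail_content, detail_url))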

And with that we have the body text of the travel notes (looks like I forgot to grab the photos; I'll update another time). Now that we have data, let's process it, starting simple with a word frequency count (a quick tally sketch appears after the word cloud code below).

Segment the Chinese text with jieba and set up my own stopwords (there are so many of them, what a pain):

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
from os import path
import jieba
from scipy.misc import imread  # removed in newer SciPy releases; imageio.imread is a drop-in replacement
d = path.dirname(__file__)
ciyun1=''
lists=''
remove=['点击','查看','原图','信息','相关','一个','可以','因为','这个','一下','这里','很多',
        '我们','没有','自己','还是','还有','就是','最后','觉得','开始','现在','里面','看到',
        '而且','一些','一种','一样','所以','如果','不过','时候','大家','附近','这样']
with open(path.join(d, "shanghai.txt"), 'r', encoding='utf-8') as f1:
    lists = f1.read()
word1=jieba.cut(lists)
ciyun1 = ",".join(word1)
text=ciyun1

alice_coloring = imread(path.join(d, "气球.png"))

wc = WordCloud(background_color="white",  # background color
               max_words=2000,            # maximum number of words shown in the cloud
               mask=alice_coloring,       # mask image that shapes the cloud
               font_path='simkai.ttf',    # a Chinese font, otherwise the characters render as boxes
               stopwords=remove,
               max_font_size=40,          # maximum font size
               random_state=42)
# Generate the cloud: generate() takes the full text (Chinese is hard for it to tokenize,
# hence jieba above), or compute the frequencies ourselves and use generate_from_frequencies()
wc.generate(text)
# wc.generate_from_frequencies(txt_freq)
# txt_freq would look like [('词a', 100), ('词b', 90), ('词c', 80)]; newer wordcloud versions expect a dict instead
# derive the color values from the mask image
image_colors = ImageColorGenerator(alice_coloring)

# the code below displays the images
plt.imshow(wc)
plt.axis("off")
# new figure for the recolored word cloud
plt.figure()
# recolor wordcloud and show
# we could also give color_func=image_colors directly in the constructor
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")
# new figure for the mask image that supplies the colors
plt.figure()
plt.imshow(alice_coloring, cmap=plt.cm.gray)
plt.axis("off")
#plt.show()
# save the word cloud to a file
wc.to_file(path.join(d, "上海.png"))

(Image: 气球.png, the balloon mask)

(Image: 上海.png, the generated word cloud)
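
Since the stated goal was a word frequency count, a quick tally alongside the cloud can use collections.Counter on the same jieba output and stopword list. A minimal sketch:

from collections import Counter

# reuse the text read into `lists` and the `remove` stopword list above
words = [w for w in jieba.cut(lists) if len(w) > 1 and w not in remove]
for word, count in Counter(words).most_common(20):
    print(word, count)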

That's it for today; more next time. Leave a comment below if there is something you'd like me to scrape.

Reposted from blog.csdn.net/Neekity/article/details/85161197