Python Crawler in Practice: Scraping Qiushibaike Jokes [Shared by Huawei Cloud]

Have you heard of Qiushibaike (糗事百科)? Its users post plenty of embarrassing and funny jokes, and this time we will try to crawl them with a spider.

Friendly reminder

Qiushibaike was redesigned a while ago, which broke the earlier code: the program produced no output and drove CPU usage up, because the regular expressions no longer matched anything.

The code has now been updated and personally tested to work. I had been too busy to refresh the earlier screenshots and explanations; I hope you will forgive me!

Qiushibaike keeps getting redesigned, and I no longer have the heart to re-match the page again and again. If the program runs for a long time with no output but no error, please check the latest comments, where enthusiastic readers have shared updated regular expressions.

Goals of this article

1. Crawl the popular jokes from Qiushibaike

2. Filter out the jokes that contain pictures

3. Each time Enter is pressed, display one joke's publish time, publisher, content, and number of likes.

Qiushibaike does not require login, so there is no need to handle cookies. Also, some of the jokes come with pictures; since we cannot easily display pictures after crawling them, we will try to filter out the jokes that contain pictures.

Alright, let's try to crawl the popular jokes from Qiushibaike, showing one joke each time Enter is pressed.

1. Determine the URL and fetch the page source

First we settle on the URL: http://www.qiushibaike.com/hot/page/1, where the last number is the page index; passing in different values gives us the jokes on the corresponding page.
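
For example, the URL pattern for the first few pages is just string concatenation; this is a throwaway illustration, not part of the crawler itself:

# hypothetical quick check of the URL pattern for the first three pages
for page in range(1, 4):
    print 'http://www.qiushibaike.com/hot/page/' + str(page)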

We first write the following code to fetch the first page and print its source, to see whether the basic approach works.

# -*- coding:utf-8 -*-
import urllib
import urllib2


page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

Run the program... oh no, it throws an error. Just our luck!

line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

This looks like a request header verification problem; let's add a User-Agent header and try again. The code becomes:

# -*- coding:utf-8 -*-
import urllib
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

This time it runs normally and prints the HTML of the first page; try running the code yourself. The output is too long to paste here.
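
If the full page floods your terminal, one optional tweak of my own (not part of the original script) is to replace the final print with a status check and a short preview:

# optional sanity check: confirm the fetch worked without dumping the whole page
print response.getcode()       # 200 means the request succeeded
print response.read()[:500]    # peek at the first 500 characters of the HTML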

2. Extract all the jokes on a page

Now that we have the HTML, let's analyze how to extract all the jokes on a page.

First, inspect the elements: press F12 in the browser and look at the page structure.

We can see that each joke is wrapped in a <div class="article block untagged mb15" id="...">...</div> element.

Now we want to extract the publisher, publish date, joke content, and number of likes. Note that some jokes come with pictures; since displaying pictures in the console is unrealistic, we will simply weed out the jokes with pictures and keep only the text-only ones.

We use a regular expression for the matching, and re.findall to find all the matches. The details of how it works are covered in the earlier introduction to regular expressions.

Our matching regular expression is written as follows; append this code to the previous program.

# note: this snippet also requires `import re` at the top of the file
content = response.read().decode('utf-8')
pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'+
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',re.S)
items = re.findall(pattern,content)
for item in items:
    print item[0],item[1],item[2],item[3],item[4]

A brief explanation of the regular expression (a small standalone demo follows after these points):

1) .*? is a common combination: .* matches an unlimited number of arbitrary characters, and adding ? makes the match non-greedy, i.e. it matches as little as possible. We will use .*? a lot.

2) (.*?) denotes a capture group. There are five groups in this regular expression; when we later iterate over item, item[0] holds the content captured by the first (.*?), item[1] the second, and so on.

3) The re.S flag puts the dot into "match anything" mode, so . can also match a newline.
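
Here is a minimal, self-contained demo of these three points (the sample string is made up and has nothing to do with the Qiushibaike page):

import re

html = "<b>first</b><b>second</b>\n<i>third</i>"

# greedy: .* grabs as much as possible, so one match spans both <b> tags
print re.findall(r'<b>(.*)</b>', html)        # ['first</b><b>second']

# non-greedy: .*? stops as early as possible, giving one group per tag
print re.findall(r'<b>(.*?)</b>', html)       # ['first', 'second']

# without re.S the dot cannot cross the newline before <i>third</i>
print re.findall(r'<b>.*?<i>(.*?)</i>', html)        # []
print re.findall(r'<b>.*?<i>(.*?)</i>', html, re.S)  # ['third']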

In this way we capture the publisher, the publish time, the joke content, an optional picture block, and the number of likes.

Note that printing a joke that comes with a picture would be complicated, so here we only keep the jokes without pictures.

So we need to filter out the jokes that contain pictures.

We can observe that a joke with a picture contains HTML similar to the snippet below, while a joke without a picture does not. Our regular expression's item[3] captures exactly this part, so if a joke has no picture, item[3] is empty.

<div class="thumb">

<a href="/article/112061287?list=hot&amp;s=4794990" target="_blank">
<img src="http://pic.qiushibaike.com/system/pictures/11206/112061287/medium/app112061287.jpg" alt="但他们依然乐观">
</a>
</div>

So we only need to check whether item[3] contains an img tag.

We then change the for loop above to the following:

for item in items:
    haveImg = re.search("img",item[3])
    if not haveImg:
        print item[0],item[1],item[2],item[4]

The complete code at this point is as follows:

# -*- coding:utf-8 -*-
import urllib
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'+
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',re.S)
    items = re.findall(pattern,content)
    for item in items:
        haveImg = re.search("img",item[3])
        if not haveImg:
            print item[0],item[1],item[2],item[4]
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

Run it and take a look at the result.

The jokes with pictures have been removed. Pretty satisfying, isn't it?

3. Polish the interaction with an object-oriented design

The core part is done; what remains is tidying up the details. Our goal is:

Press Enter to read one joke and display its publisher, publish date, content, and number of likes.

In addition, we refactor the code in an object-oriented style, introducing a class and methods, and do some encapsulation and optimization. The final code is shown below.

__author__ = 'CQC'
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re

# Qiushibaike spider class
class QSBK:

    # initialization: define some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # initialize headers
        self.headers = { 'User-Agent' : self.user_agent }
        # holds the jokes; each element is the list of jokes on one page
        self.stories = []
        # flag that controls whether the program keeps running
        self.enable = False

    # pass in a page index and get that page's source
    def getPage(self,pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # build the request
            request = urllib2.Request(url,headers = self.headers)
            # fetch the page with urlopen
            response = urllib2.urlopen(request)
            # decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode

        except urllib2.URLError, e:
            if hasattr(e,"reason"):
                print u"Failed to connect to Qiushibaike, reason:",e.reason
                return None


    # pass in a page's source and return that page's jokes without pictures
    def getPageItems(self,pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Failed to load the page...."
            return None
        pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'+
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',re.S)
        items = re.findall(pattern,pageCode)
        # stores the jokes of this page
        pageStories = []
        # iterate over the matches of the regular expression
        for item in items:
            # does this joke contain a picture?
            haveImg = re.search("img",item[3])
            # if there is no picture, add it to the list
            if not haveImg:
                replaceBR = re.compile('<br/>')
                text = re.sub(replaceBR,"\n",item[1])
                # item[0] is the publisher, item[1] the content, item[2] the publish time, item[4] the number of likes
                pageStories.append([item[0].strip(),text.strip(),item[2].strip(),item[4].strip()])
        return pageStories

    # load and extract a page's content and append it to the list
    def loadPage(self):
        # if fewer than 2 unread pages remain, load a new page
        if self.enable == True:
            if len(self.stories) < 2:
                # fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # store this page's jokes in the global list
                if pageStories:
                    self.stories.append(pageStories)
                    # increment the page index so the next fetch reads the next page
                    self.pageIndex += 1

    # print one joke each time Enter is pressed
    def getOneStory(self,pageStories,page):
        # iterate over the jokes of one page
        for story in pageStories:
            # wait for user input
            input = raw_input()
            # each time Enter is pressed, check whether a new page needs to be loaded
            self.loadPage()
            # if the user types Q, quit
            if input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublisher: %s\tTime: %s\tLikes: %s\n%s" %(page,story[0],story[2],story[3],story[1])

    # entry point
    def start(self):
        print u"Reading Qiushibaike; press Enter for a new joke, or Q to quit"
        # set the flag to True so the program can run
        self.enable = True
        # load the first page
        self.loadPage()
        # local variable tracking which page we are currently reading
        nowPage = 0
        while self.enable:
            if len(self.stories)>0:
                # take one page of jokes from the global list
                pageStories = self.stories[0]
                # increment the current page counter
                nowPage += 1
                # remove the first element from the global list, since it has been taken out
                del self.stories[0]
                # print this page's jokes
                self.getOneStory(pageStories,nowPage)


spider = QSBK()
spider.start()
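
A quick aside: urllib2 exists only in Python 2. If you are running Python 3, the fetch step would look roughly like the sketch below, assuming the URL and page structure are unchanged (which may well not hold anymore):

import urllib.request
import urllib.error

def get_page(page_index):
    # rough Python 3 sketch of getPage; URL pattern assumed unchanged
    url = 'http://www.qiushibaike.com/hot/page/' + str(page_index)
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    try:
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        print("Failed to connect to Qiushibaike:", getattr(e, 'reason', e))
        return None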

Alright, give it a try: each press of Enter prints one joke, including the publisher, publish time, content, and number of likes. Doesn't that feel great!

That wraps up our first hands-on crawler project; we will continue next time. Stay tuned, and keep it up, everyone!

Author: 程序员私房菜

Source: www.cnblogs.com/huaweicloud/p/12018290.html