Sesame HTTP: Python Crawler in Action — Scraping Qiushibaike (the Embarrassing Things Encyclopedia)

First of all, everyone has heard of Qiushibaike (the Embarrassing Things Encyclopedia), right? It is full of funny jokes posted by its users, and this time we will try to grab them with a crawler.

Friendly reminder

Qiushibaike was redesigned some time ago, which broke the previous code: the regular expressions no longer match, so the program produces no output and spins with high CPU usage.

The blogger has now revised the program, and the code has been personally tested and works; screenshots and explanations are included. I was busy and did not get around to updating it sooner. I hope you enjoy it!

Update time: 2015/8/2

Qiushibaike has been redesigned again and again, and the blogger no longer has the heart to keep chasing the markup. If the program runs for a long time with no output and no error, please refer to the latest comments, where enthusiastic readers have provided corrections.

Update time: 2016/3/27

The goals of this article

1. Grab the popular jokes from Qiushibaike

2. Filter out jokes that contain pictures

3. Each time Enter is pressed, display one joke: its publication time, publisher, content, and number of likes

Qiushibaike does not require login, so there is no need to handle cookies. Also, some jokes come with attached pictures; since pictures cannot be displayed on the console, we will try to filter out the jokes that contain them.

Okay, now let's try to grab the popular jokes from Qiushibaike: each time we press Enter, one joke will be displayed.

1. Determine the URL and grab the page code

First, we determine the URL of the page: http://www.qiushibaike.com/hot/page/1, where the trailing 1 is the page number. We can pass in different values to get the jokes on different pages.

We first build a minimal program that prints the page source, to check whether the most basic page fetch succeeds:

# -*- coding:utf-8 -*-
import urllib
import urllib2


page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

Run the program and... oh no, it actually reports an error. What bad luck:

line 373, in _read_status
 raise BadStatusLine(line)
httplib.BadStatusLine: ''

Well, it should be a problem with the request headers: the site rejects requests that do not look like they come from a browser. Let's add a User-Agent header and try again. Modify the code as follows:

# -*- coding:utf-8 -*-
import urllib
import urllib2
 
page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

Hey, this time the program finally runs normally and prints the HTML of the first page. You can run the code yourself and try it; the output is too long to paste here.

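A side note: the code in this article targets Python 2 (urllib2). For readers on Python 3, where urllib2 was folded into urllib.request, a rough equivalent of the fetch above would look like the sketch below. The network call itself is left commented out, since the live site may have changed since this was written.

```python
# -*- coding:utf-8 -*-
import urllib.request

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Build the request with a browser-like User-Agent, just as in the
# Python 2 version; without it the site refuses the request.
request = urllib.request.Request(url, headers=headers)

# The actual fetch (commented out here, as the live site may have changed):
# response = urllib.request.urlopen(request)
# print(response.read().decode('utf-8'))
```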
2. Extract all the jokes on a page

Well, now that we have the HTML, let's analyze how to extract all the jokes on a page.

First, let's inspect an element: press F12 in the browser. The screenshot is shown below.

(Screenshot: the browser's developer tools showing the markup of a single joke.)

We can see that each joke is wrapped in <div class="article block untagged mb15" id="…">…</div>.

Now we want to extract the publisher, the publication date, the joke content, and the number of likes. Note, however, that some jokes include pictures. Displaying pictures on the console is unrealistic, so we simply drop the jokes that contain pictures and keep only the text-only ones.

So we add the following regular expression to match them, using re.findall to find all matching content. For details on this method, see the earlier introduction to regular expressions.

Well, our regular expression is written as follows; add this code to the previous program (it also needs import re):

content = response.read().decode('utf-8')
pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'+
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',re.S)
items = re.findall(pattern,content)
for item in items:
    print item[0],item[1],item[2],item[3],item[4]

A few notes on this regular expression:

1) .*? is a very common combination: . matches any character and * means any number of repetitions, while the trailing ? makes the match non-greedy, i.e. as short as possible. We will use the .*? combination a lot.

2) (.*?) denotes a capture group. This regular expression has five groups, so when we later iterate over items, item[0] holds the content of the first (.*?), item[1] the second, and so on.

3) The re.S flag makes the dot match any character including newlines; without it, . does not match a newline.
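To make points 1) and 3) concrete, here is a small stand-alone demonstration of greedy vs. non-greedy matching and of the re.S flag:

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy .* grabs as much as possible, spanning both tags in one match.
greedy = re.findall('<b>(.*)</b>', html)      # ['one</b><b>two']

# Non-greedy .*? stops at the earliest possible </b>.
lazy = re.findall('<b>(.*?)</b>', html)       # ['one', 'two']

# Without re.S, the dot does not match newlines...
text = '<b>line1\nline2</b>'
no_s = re.findall('<b>(.*?)</b>', text)       # []

# ...but with re.S it does.
with_s = re.findall('<b>(.*?)</b>', text, re.S)  # ['line1\nline2']
```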

In this way, we get the publisher, the publication time, the joke content, any attached image markup, and the number of likes.

Note that if a joke comes with a picture, printing it directly is complicated, so here we keep only the jokes without pictures.

Therefore, we need to filter out the jokes with pictures.

Looking at the page, a joke with a picture contains markup like the following, while one without a picture does not. Group item[3] of our regular expression captures exactly this part, so if a joke has no picture, item[3] is empty.

<div class="thumb">
 
<a href="/article/112061287?list=hot&s=4794990" target="_blank">
<img src="http://pic.qiushibaike.com/system/pictures/11206/112061287/medium/app112061287.jpg" alt="But they are still optimistic">
</a>
 
</div>

So we only need to check whether item[3] contains an img tag.

Ok, let's change the for loop in the code above to the following:

for item in items:
        haveImg = re.search("img",item[3])
        if not haveImg:
            print item[0],item[1],item[2],item[4]
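As a sanity check, we can run the pattern and the img filter against a small synthetic snippet shaped like the markup above. (The real qiushibaike markup has changed repeatedly over the years, so this fragment is purely illustrative.)

```python
import re

# A synthetic fragment in the shape the regex expects; purely illustrative,
# not the site's actual current markup.
joke_with_pic = (
    '<div class="author"><a href="/u/1"><img src="a.jpg">Alice</a></div>'
    '<div class="content">Funny joke<!--1438000000--></div>'
    '<div class="thumb"><img src="pic.jpg"></div>'
    '<div class="stats"><i class="number">123</i></div>'
)

pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                     'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats'
                     '.*?class="number">(.*?)</i>', re.S)

items = re.findall(pattern, joke_with_pic)
item = items[0]

# Group 4 (item[3]) captures everything between the content div and the
# stats div, so a picture joke carries an <img> tag there:
have_img = re.search('img', item[3])
```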

Now, the overall code is as follows



# -*- coding:utf-8 -*-
import urllib
import urllib2
import re
 
page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'+
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',re.S)
    items = re.findall(pattern,content)
    for item in items:
        haveImg = re.search("img",item[3])
        if not haveImg:
            print item[0],item[1],item[2],item[4]
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

Run it and see the effect:

(Screenshot: sample program output, with picture jokes filtered out.)

Well, the jokes with pictures have been filtered out. Isn't that delightful?
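One more note on the except branch used above: urllib2 raises HTTPError (which carries a .code) for HTTP-level failures, and a plain URLError (which only carries a .reason) for connection-level failures, which is why the code probes with hasattr. The same hierarchy exists in Python 3's urllib.error, which this small illustration uses:

```python
import urllib.error

# HTTP-level failure: the server answered, so an HTTP status code exists.
http_err = urllib.error.HTTPError('http://example.com', 404, 'Not Found',
                                  hdrs=None, fp=None)

# Connection-level failure: no response at all, so only a reason exists.
conn_err = urllib.error.URLError('connection refused')

has_code = hasattr(http_err, 'code')        # True
conn_has_code = hasattr(conn_err, 'code')   # False
```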

3. Improve the interaction and design it in an object-oriented style

Ok, the core part is done; what remains is tidying up the rough edges. The goal we want to achieve is:

Press Enter to read one joke, showing its publisher, publication date, content, and number of likes.

In addition, we redesign the code in an object-oriented style, introducing a class and methods to encapsulate everything. The final code is as follows:



__author__ = 'CQC'
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re
import thread
import time
 
# embarrassing thing encyclopedia reptiles
class QSBK:
 
    #Initialization method, define some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        #initialize headers
        self.headers = { 'User-Agent' : self.user_agent }
        #Variables to store paragraphs, each element is the paragraphs of each page
        self.stories = []
        #The variable that stores whether the program continues to run
        self.enable = False
    #Pass in the index of a page to get the page code
    def getPage(self,pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            #build the request request
            request = urllib2.Request(url,headers = self.headers)
            #Use urlopen to get the page code
            response = urllib2.urlopen(request)
            #Convert the page to UTF-8 encoding
            pageCode = response.read().decode('utf-8')
            return pageCode
 
        except urllib2.URLError, e:
            if hasattr(e,"reason"):
                print u"Failed to connect to Qiushibaike, error reason:", e.reason
                return None
 
 
    #Pass in a page code, return the list of paragraphs without pictures on this page
    def getPageItems(self,pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page load failed...."
            return None
        pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'+
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',re.S)
        items = re.findall(pattern,pageCode)
        #Used to store the paragraphs of each page
        pageStories = []
        # Traverse the information matched by the regular expression
        for item in items:
            #Does it contain pictures
            haveImg = re.search("img",item[3])
            #If there is no picture, add it to the list
            if not haveImg:
                replaceBR = re.compile('<br/>')
                text = re.sub(replaceBR,"\n",item[1])
                #item[0] is the publisher of a paragraph, item[1] is the content, item[2] is the release time, and item[4] is the number of likes
                pageStories.append([item[0].strip(),text.strip(),item[2].strip(),item[4].strip()])
        return pageStories
 
    #Load and extract the content of the page and add it to the list
    def loadPage(self):
        #If the number of unread pages is less than 2, load a new page
        if self.enable == True:
            if len(self.stories) < 2:
                # get new page
                pageStories = self.getPageItems(self.pageIndex)
                #Store the paragraphs of this page in the global list
                if pageStories:
                    self.stories.append(pageStories)
                    #After the acquisition, the page number index is incremented by one, indicating that the next page is read next time
                    self.pageIndex += 1
    
    #Call this method, print out a paragraph each time you hit enter
    def getOneStory(self,pageStories,page):
        # Traverse a page of paragraphs
        for story in pageStories:
            #Wait for user input
            input = raw_input()
            #Every time you enter a carriage return, determine whether to load a new page
            self.loadPage()
            #If you enter Q, the program ends
            if input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublished by:%s\tPublished time:%s\tLike:%s\n%s" %(page,story[0],story[2],story[3],story[1])
    
    #start method
    def start(self):
        print u"Reading embarrassing things encyclopedia, press Enter to view the new paragraph, Q to exit"
        #Make the variable True, the program can run normally
        self.enable = True
        #Load a page of content first
        self.loadPage()
        #Local variable, control the current page read
        nowPage = 0
        while self.enable:
            if len(self.stories)>0:
                #Get a page of paragraphs from the global list
                pageStories = self.stories[0]
                #Add one to the number of pages currently read
                nowPage += 1
                #Delete the first element in the global list because it has been taken out
                del self.stories[0]
                # output the paragraph of the page
                self.getOneStory(pageStories,nowPage)
 
 
spider = QSBK()
spider.start()
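One detail worth highlighting from getPageItems: the joke body uses <br/> tags for line breaks, so before storing the text the code substitutes real newlines and strips the result. In isolation, that step works like this:

```python
import re

# Replace HTML line breaks with real newlines, as getPageItems does
# before stripping and storing the joke text.
replaceBR = re.compile('<br/>')
raw = 'First line<br/>Second line<br/>Third line '
text = re.sub(replaceBR, "\n", raw).strip()
```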

Okay, let's test it. Pressing Enter outputs one joke, including the publisher, publication time, content, and number of likes. Doesn't it feel great?

 
