Everyone has heard of Qiushibaike, the "Encyclopedia of Embarrassing Things", right? I have read plenty of funny jokes my friends posted there, and this time we will try to grab them with a crawler.
Friendly reminder
Qiushibaike was redesigned a while ago, which broke the previous code: because the regular expressions no longer matched, the program produced no output and pegged the CPU.
I have now revised the program, and the updated code has been tested personally, with screenshots and explanations included. I was too busy to update it earlier; I hope everyone enjoys it!
Update time: 2015/8/2
Qiushibaike has been redesigned again and again, and I no longer have the heart to re-match the pattern every time. If the program runs for a long time with no results and no error, please refer to the latest comments, where enthusiastic readers have provided fixes.
Update time: 2016/3/27
The goal of this article
1. Grab the popular jokes of the embarrassing thing encyclopedia
2. Filter paragraphs with pictures
3. Each time Enter is pressed, display one joke: its publication time, publisher, content, and number of likes.
Qiushibaike does not require login, so there is no need to handle cookies. Also, some jokes come with pictures, which are awkward to display on a console after grabbing, so we will try to filter out the jokes that contain pictures.
Okay, now let's grab the popular jokes from Qiushibaike; every time we press Enter, one joke will be displayed.
1. Determine the URL and grab the page code
First, we determine that the URL of the page is http://www.qiushibaike.com/hot/page/1, where the trailing number 1 is the page number; by passing in different values we can fetch the jokes of other pages.
We start by building the following code to print the page content and see whether the most basic page fetch succeeds:
# -*- coding:utf-8 -*-
import urllib
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
Run the program and... it actually reports an error. What bad luck!
line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''
This is most likely a header-validation problem: the site rejects requests that do not look like they come from a browser. Let's add a User-Agent header and try again. The modified code is as follows:
# -*- coding:utf-8 -*-
import urllib
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
This time it runs normally and prints the HTML of the first page. You can run the code yourself to try it; the output is too long to paste here.
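A side note: the code in this article uses urllib2, which exists only in Python 2. In Python 3 it was merged into urllib.request, so if you are on Python 3, a minimal sketch of the same header trick looks like this (same URL and User-Agent string as above; the `fetch` helper name is my own):

```python
# Python 3 sketch of the same request-with-headers pattern.
from urllib import request, error

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

# The header is attached to the Request object before any network I/O happens.
req = request.Request(url, headers=headers)
print(req.get_header('User-agent'))  # urllib stores header names capitalized

def fetch(req):
    """Fetch the page, returning its HTML or None on failure."""
    try:
        with request.urlopen(req) as resp:
            return resp.read().decode('utf-8')
    except error.URLError as e:
        print('fetch failed:', getattr(e, 'reason', e))
        return None
```

Everything that follows sticks to the article's original Python 2 code.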
2. Extract all paragraphs of a page
Now that we have the HTML, let's analyze how to extract all the jokes on a page.
First, inspect the elements: press F12 in the browser. The screenshot is as follows.
We can see that each joke is wrapped in <div class="article block untagged mb15" id="…">…</div>.
We want to extract the publisher, the publication date, the joke's content, and the number of likes. Note, however, that some jokes include pictures; displaying pictures on a console is unrealistic, so we will simply drop the jokes with pictures and keep only the text-only ones.
To do this we add a regular expression and use re.findall to find all matching content. For details on the method, see the earlier introduction to regular expressions.
Our matching pattern is written as follows; add this code on top of the previous version:
import re  # remember to add this to the imports at the top

content = response.read().decode('utf-8')
pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                     'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>', re.S)
items = re.findall(pattern, content)
for item in items:
    print item[0], item[1], item[2], item[3], item[4]
A few notes on the regular expression:
1) .*? is a very common combination: . and * together match any number of any characters, and the trailing ? makes the match non-greedy, i.e. as short as possible. We will use the .*? combination a lot.
2) (.*?) denotes a capture group. This pattern contains five groups, so when we traverse items, item[0] holds what the first (.*?) captured, item[1] what the second captured, and so on.
3) The re.S flag makes the dot match any character, including newlines; without it, . does not match line breaks.
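The three points above can be demonstrated in isolation. The HTML snippet below is made up purely for illustration (it is not real Qiushibaike markup):

```python
import re

# A made-up two-line snippet, just for illustration.
html = '<div class="content">first</div>\n<div class="content">second</div>'

# Greedy (.*) matches as much as possible, running past the first </div>;
# non-greedy (.*?) stops at the earliest point that still allows a match.
greedy = re.search('content">(.*)</div>', html, re.S).group(1)
lazy = re.findall('content">(.*?)</div>', html, re.S)
print(greedy)  # first</div>\n<div class="content">second
print(lazy)    # ['first', 'second']

# re.S lets . match newlines; without it, a pattern cannot span the line break.
assert re.search('first.*second', html) is None
assert re.search('first.*second', html, re.S) is not None
```

This is why the article's pattern uses (.*?) throughout and compiles with re.S: the target fields are short, and the page source spans many lines.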
In this way we obtain the publisher, the publication time, the content, the attached picture (if any), and the number of likes.
Note that if a joke comes with a picture, outputting it directly is complicated, so here we keep only the jokes without pictures. That means we need to filter out the jokes that contain them.
Jokes with pictures contain markup like the block below, while text-only jokes do not. So item[3] of our pattern captures that block, and for a picture-free joke item[3] is essentially empty.
<div class="thumb">
    <a href="/article/112061287?list=hot&s=4794990" target="_blank">
        <img src="http://pic.qiushibaike.com/system/pictures/11206/112061287/medium/app112061287.jpg" alt="But they are still optimistic">
    </a>
</div>
So we only need to check whether item[3] contains an img tag.
Ok, let's change the for loop in the above code to the following
for item in items:
    haveImg = re.search("img", item[3])
    if not haveImg:
        print item[0], item[1], item[2], item[4]
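You can sanity-check this filter offline. The two tuples below are invented examples shaped like the pattern's output, i.e. (author, content, comment, thumb-html, likes):

```python
import re

# Two made-up item tuples: one joke with a picture, one text-only.
with_pic = ('user_a', 'a joke with a photo', '',
            '<div class="thumb"><img src="x.jpg"></div>', '123')
text_only = ('user_b', 'a text-only joke', '', '\n', '456')

# Keep only items whose thumb field (item[3]) contains no img tag.
kept = [item for item in (with_pic, text_only) if not re.search("img", item[3])]
print(len(kept))  # 1: only the text-only joke passes the filter
```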
Now, the overall code is as follows
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        haveImg = re.search("img", item[3])
        if not haveImg:
            print item[0], item[1], item[2], item[4]
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
Run it to see the effect
Well, the jokes with pictures have been filtered out. Fun, isn't it?
3. Improve interaction and design object-oriented mode
Ok, the core part is done; what remains is polishing the details. The goal we want to achieve is:
Press Enter to read a paragraph, and display the publisher, release date, content and number of likes of the paragraph.
In addition, we will restructure the code in an object-oriented style, introducing a class and methods to encapsulate and clean things up. The final code is as follows:
__author__ = 'CQC'
# -*- coding:utf-8 -*-

import urllib
import urllib2
import re
import thread
import time

# Qiushibaike crawler
class QSBK:

    # Initialization method: define some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize headers
        self.headers = {'User-Agent': self.user_agent}
        # Stores the jokes; each element holds the jokes of one page
        self.stories = []
        # Records whether the program should keep running
        self.enable = False

    # Pass in a page index and return that page's source code
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib2.Request(url, headers=self.headers)
            # Use urlopen to get the page code
            response = urllib2.urlopen(request)
            # Decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Qiushibaike, reason:", e.reason
                return None

    # Pass in a page index and return a list of that page's jokes without pictures
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page load failed...."
            return None
        pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                             'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # Stores the jokes of this page
        pageStories = []
        # Traverse the matches
        for item in items:
            # Does it contain a picture?
            haveImg = re.search("img", item[3])
            # If not, add it to the list
            if not haveImg:
                replaceBR = re.compile('<br/>')
                text = re.sub(replaceBR, "\n", item[1])
                # item[0] is the publisher, item[1] the content,
                # item[2] the publication time, item[4] the number of likes
                pageStories.append([item[0].strip(), text.strip(), item[2].strip(), item[4].strip()])
        return pageStories

    # Load and extract a page's content, appending it to the list
    def loadPage(self):
        # If fewer than 2 unread pages remain, load a new page
        if self.enable == True:
            if len(self.stories) < 2:
                # Get a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store this page's jokes in the global list
                if pageStories:
                    self.stories.append(pageStories)
                    # Increment the page index so the next page is read next time
                    self.pageIndex += 1

    # Print one joke each time Enter is pressed
    def getOneStory(self, pageStories, page):
        # Traverse one page of jokes
        for story in pageStories:
            # Wait for user input
            input = raw_input()
            # Each time Enter is pressed, decide whether to load a new page
            self.loadPage()
            # If Q is entered, the program ends
            if input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublisher: %s\tTime: %s\tLikes: %s\n%s" % (page, story[0], story[2], story[3], story[1])

    # Start method
    def start(self):
        print u"Reading Qiushibaike; press Enter to view a new joke, or Q to quit"
        # Set the flag to True so the program can run
        self.enable = True
        # Load the first page
        self.loadPage()
        # Local variable tracking the current page being read
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Get one page of jokes from the global list
                pageStories = self.stories[0]
                # Increment the count of pages read
                nowPage += 1
                # Delete the first element of the global list, since it has been taken out
                del self.stories[0]
                # Output this page's jokes
                self.getOneStory(pageStories, nowPage)

spider = QSBK()
spider.start()
Okay, let's test it. Pressing Enter outputs one joke, including the publisher, publication time, content, and number of likes. Doesn't it feel cool?