The goal is to crawl all the data on the http://www.gg4493.cn/ homepage and get the name, time, source, and body text of each news item.
Let's break the goal down and tackle it step by step.
Step 1: Crawl out all the links on the home page and write them into a file.
Python makes fetching HTML very convenient; a few lines of code are enough for what we need.
The code is as follows:
import urllib

def getHtml(url):
    page = urllib.urlopen(url)  # open the URL (Python 2 urllib)
    html = page.read()          # read the raw HTML
    page.close()
    return html
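The getHtml helper above targets Python 2. For readers on Python 3, where urllib.urlopen moved to urllib.request.urlopen and read() returns bytes, a rough equivalent (a sketch, not part of the original tutorial) is:

```python
# Python 3 sketch of the getHtml helper above.
# urllib.urlopen moved to urllib.request.urlopen, and read()
# now returns bytes, so we decode to get a str.
import urllib.request

def get_html(url):
    with urllib.request.urlopen(url) as page:
        return page.read().decode('utf-8', errors='replace')
```

Since Python 3.4 urlopen also accepts data: URLs, which is handy for trying the function without touching the network.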
We all know that the HTML tag for a link is "a" and the link itself lives in its "href" attribute; in other words, we need to collect the value of href for every a tag in the HTML.
After consulting the documentation, I originally planned to use HTMLParser, and wrote a version with it. But it has a problem: it cannot handle Chinese characters.
The code is as follows:
import HTMLParser

class parser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href':
                    print value  # Python 2 print statement
After switching to SGMLParser, this problem goes away.
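SGMLParser solved the encoding problem on Python 2, but it was removed in Python 3. For modern readers, a hedged sketch of the same idea using html.parser.HTMLParser, which handles Chinese and other non-ASCII text natively because it operates on str:

```python
# Python 3 sketch: collect the href attribute of every <a> tag,
# analogous to the SGMLParser-based URLParser in this tutorial.
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href' and value:
                    self.urls.append(value)
```

Feeding it a page that mixes links with Chinese text fills the urls list without any encoding trouble.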
The code is as follows:
import re
import urllib
from sgmllib import SGMLParser

class URLParser(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

With SGMLParser you overload the handler for a given tag; here start_a puts every link it sees into the urls list of the class. The code is as follows:

lParser = URLParser()
socket = urllib.urlopen("http://www.gg4493.cn/")  # open the homepage
fout = file('urls.txt', 'w')      # the file the links will be written to
lParser.feed(socket.read())       # parse the page
reg = 'http://www.gg4493.cn/.*'   # regular expression matching qualifying links
pattern = re.compile(reg)
for url in lParser.urls:          # the links collected by the parser
    if pattern.match(url):
        fout.write(url + '\n')
fout.close()
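Putting Step 1 together for Python 3 (a hedged sketch: SGMLParser and file() no longer exist there, so this uses html.parser and re; the site URL is the one from the original):

```python
# Python 3 sketch of Step 1: pull the links out of a page's HTML
# and keep only those under http://www.gg4493.cn/.
import re
from html.parser import HTMLParser

class URLCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href' and v)

def site_links(html, site='http://www.gg4493.cn/'):
    parser = URLCollector()
    parser.feed(html)
    pattern = re.compile(re.escape(site) + '.*')
    return [u for u in parser.urls if pattern.match(u)]
```

The filtered list can then be written out one link per line, e.g. with open('urls.txt', 'w') in place of the Python 2 file().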
Source: http://www.m4493.cn.