Python collection example 1

The goal is to crawl the http://www.gg4493.cn/ homepage and get the name, time, source and body text of each news item.
Next, break the goal down and work through it step by step.
Step 1: Extract all the links on the home page and write them to a file.
Python makes fetching HTML very convenient; just a few lines of code give us what we need.
The code is as follows:
import urllib

def getHtml(url):
    # fetch a page and return its raw HTML (Python 2: urllib.urlopen)
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html
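
As a quick sanity check (this just assumes the homepage is reachable), the helper can be called directly:

html = getHtml("http://www.gg4493.cn/")
print len(html)  # a non-zero length means something was fetched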
We all know that the HTML tag for a link is "a" and that the link target lives in its "href" attribute; in other words, the task is to collect the href value of every a tag in the HTML.
After consulting some references I originally planned to use HTMLParser, and wrote a version with it. But it has a problem: it cannot handle Chinese characters.
The code is as follows:


import HTMLParser

class parser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        # print the target of every link (a tag, href attribute)
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href':
                    print value
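
A short sketch of how this parser would be driven, reusing the getHtml helper above (the variable name is just for illustration):

p = parser()
p.feed(getHtml("http://www.gg4493.cn/"))  # handle_starttag fires for each opening tag
p.close()

It is during this feed step that the Chinese-character problem mentioned above shows up.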


Switching to SGMLParser, this problem goes away.
The code is as follows:


import re
import urllib
from sgmllib import SGMLParser

class URLParser(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # collect the href value of every a tag
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

SGMLParser requires overriding its handler for a given tag; here start_a stores every link it sees in the class's urls list.
The code is as follows:

lParser = URLParser()                             # the parser
socket = urllib.urlopen("http://www.gg4493.cn/")  # open the webpage
fout = file('urls.txt', 'w')                      # the file the links are written to
lParser.feed(socket.read())                       # parse the page
reg = 'http://www.gg4493.cn/.*'                   # regular expression for qualified links
pattern = re.compile(reg)
for url in lParser.urls:                          # the links are stored in urls
    if pattern.match(url):
        fout.write(url + '\n')
fout.close()
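
As a side note, sgmllib (and with it SGMLParser) was removed in Python 3. A minimal sketch of the same Step 1 under Python 3, using html.parser and urllib.request instead (an assumption about how the port would look, not part of the original code):

import re
import urllib.request
from html.parser import HTMLParser

class URLParser(HTMLParser):
    # Python 3 sketch: html.parser stands in for the removed SGMLParser
    def reset(self):
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href' and v)

parser = URLParser()
html = urllib.request.urlopen("http://www.gg4493.cn/").read().decode('utf-8', 'ignore')
parser.feed(html)
pattern = re.compile(r'http://www\.gg4493\.cn/.*')
with open('urls.txt', 'w') as fout:
    for url in parser.urls:
        if pattern.match(url):
            fout.write(url + '\n')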

 

Source: http://www.m4493.cn.
