Python collection example 2

The previous article described collecting data from http://www.gg4493.cn/. Continuing from there:

Step 2: For each link, get its web content.

It is very simple: just open the urls.txt file and read it line by line.
Going through a file might seem superfluous here, but I have a strong preference for decoupling, so I decided to write the links to a file first. Later, if you switch to object-oriented programming, this makes refactoring very convenient.
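A minimal sketch of that reading step (the helper name `read_urls` and the blank-line filtering are my own additions, assuming urls.txt holds one link per line):

```python
def read_urls(path='urls.txt'):
    # One URL per line; skip any blank lines
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```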
Getting the content of each web page is relatively simple, but you then need to save that content into a folder.
There are several new usages here:


import os

os.getcwd()    # get the current working directory
os.path.sep    # the path separator of the current system: "\" on Windows, "/" on Linux

# Create the folder if it does not already exist
if not os.path.exists('newsdir'):
    os.makedirs('newsdir')

# str() converts a number to a string
i = 5
str(i)  # '5'
With these methods, saving strings to separate files in a folder is no longer difficult.
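Putting those pieces together, one way to sketch the saving step (the function name `save_page` and the `.html` suffix are my own choices; fetching itself would use whatever HTTP client the previous article used):

```python
import os

def save_page(content, index, outdir='newsdir'):
    # Create the output folder if it does not exist yet
    if not os.path.exists(outdir):
        os.makedirs(outdir)
    # str() turns the running index into a file name like "5.html"
    path = os.path.join(outdir, str(index) + '.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
    return path
```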
Step 3: Enumerate each web page and obtain target data according to regular matching.
The following idiom is used to traverse a folder:


import os

# os.walk yields (parent, dirnames, filenames) for every directory under the root
for parent, dirnames, filenames in os.walk('newsdir'):
    for dirname in dirnames:
        print(parent, dirname)
    for filename in filenames:
        print(parent, filename)
Traverse, read, match, and the results come out.
The regular expression I use for data extraction is this:
Copy code The code is as follows:


reg = '<div class="hd">.*?<h1>(.*?)</h1>.*?<span class="pubTime">(.*?)</span>.*?<a .*?>(.*?)</a>.*?<div id="Cnt-Main-Article-QQ" .*?>(.*?)</div>'
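As a quick sanity check of how this pattern behaves (the sample HTML below is fabricated for illustration; note that re.S is needed so .*? can span line breaks in real pages):

```python
import re

reg = ('<div class="hd">.*?<h1>(.*?)</h1>.*?<span class="pubTime">(.*?)</span>'
       '.*?<a .*?>(.*?)</a>.*?<div id="Cnt-Main-Article-QQ" .*?>(.*?)</div>')

sample = ('<div class="hd"><h1>Title</h1>'
          '<span class="pubTime">2012-01-01</span>'
          '<a href="#">Source</a></div>'
          '<div id="Cnt-Main-Article-QQ" class="text">Body text</div>')

m = re.search(reg, sample, re.S)
print(m.groups())  # ('Title', '2012-01-01', 'Source', 'Body text')
```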
In fact, this does not match everything: the news pages come in two formats with slightly different tags, so only one of the two can be extracted this way.
Another point: extraction with regular expressions is definitely not the mainstream approach. If you need to collect other websites, you have to rewrite the regular expressions, which is troublesome.
After extraction, the body part is always mixed with irrelevant fragments such as "<script>...</script>", "<p></p>" and so on, so I split the text again with a regular expression.
The code is as follows:


import re

def func(str):  # (whoever picked this name...)
    # Split on styles, scripts, numeric character entities, IE conditional
    # comments, and any remaining tags; the alternatives are separated by "|"
    strs = re.split(r'<style>.*?</style>|<script.*?>.*?</script>'
                    r'|&#[0-9]+;|<!--\[if !IE\]>.+?<!\[endif\]-->|<.*?>',
                    str, flags=re.S)
    # Combine the split pieces back together
    ans = ''
    for each in strs:
        ans += each
    return ans
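To see what the cleanup does, here it is applied to a fabricated snippet (redeclared as `strip_markup` so the example is self-contained):

```python
import re

def strip_markup(text):
    # Same alternatives as func above: styles, scripts, numeric entities,
    # IE conditional comments, and any remaining tags
    parts = re.split(r'<style>.*?</style>|<script.*?>.*?</script>'
                     r'|&#[0-9]+;|<!--\[if !IE\]>.+?<!\[endif\]-->|<.*?>',
                     text, flags=re.S)
    return ''.join(parts)

html = '<p>Hello<script type="text/javascript">var x = 1;</script> world&#160;</p>'
print(strip_markup(html))  # Hello world
```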
With this, basically all of the body text on the page can be extracted.
That concludes the whole collection process.
