Python web crawler (7): crawling static data explained

Purpose

Crawl data from http://seputu.com/ and store it in a CSV file.

Import libraries

lxml: parses the HTML source of a page to extract data. Reference: https://www.cnblogs.com/zhangxinqi/p/9210211.html

requests: requests the web page.

chardet: detects the character encoding of the page.

csv: stores the text.

re: regular expressions.

from lxml import etree
import requests
import chardet
import csv
import re

Get the page

Pass a header to requests.get to simulate a browser. The request headers can be found in the browser console, under the Network tab.

user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
headers={'User-Agent':user_agent}
r=requests.get('http://seputu.com/',headers=headers)

Detect the encoding and transcode

r.encoding=chardet.detect(r.content)['encoding']
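chardet.detect returns a dict with the guessed encoding and a confidence score, which is why the code above indexes ['encoding']. A small sketch on a plain byte string (assuming chardet is installed; the sample text is only an illustration):

```python
import chardet

# repeat the bytes so the detector has enough evidence to work with
sample = '盗墓笔记'.encode('utf-8') * 20
guess = chardet.detect(sample)
print(guess['encoding'], guess['confidence'])
```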

Parse the page

html=etree.HTML(r.text)
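etree.HTML parses a string into an element tree that supports XPath queries, which is what the extraction below relies on. A tiny self-contained example (a stand-in snippet, not the real seputu.com page):

```python
from lxml import etree

# a minimal stand-in document mimicking the structure crawled below
doc = etree.HTML(
    '<div class="mulu"><h2>Volume 1</h2>'
    '<a href="/1.html" title="[2012-5-23] ch1">ch1</a></div>'
)
print(doc.xpath('//div[@class="mulu"]/h2/text()'))  # ['Volume 1']
print(doc.xpath('//a/@href'))                       # ['/1.html']
```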

Extract information from the page

Open the site in a browser, find the tags to be extracted, and inspect the elements to work out how to pull the content out of the HTML text.

Here we extract h2_title, href, and the title content. The title is split into regex groups, and the data is extracted from the groups.

Note: Python's regular expressions do not support variable-width look-behind (zero-width assertion) syntax; use grouping instead to avoid the error.

The following code raises an error (look-behind requires a fixed-width pattern):

import re
box_title = '[2012-5-23 21:14:42] Tomb articles truth New Year'
pattern = re.compile(r'(?<=\[.*\]\s).*')
result1 = re.search(pattern, box_title)
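A grouping-based pattern avoids the look-behind restriction. A minimal sketch on the same example title:

```python
import re

box_title = '[2012-5-23 21:14:42] Tomb articles truth New Year'
# capture the bracketed date and the remaining title as separate groups
pattern = re.compile(r'\s*\[(.*)\]\s+(.*)')
result1 = re.search(pattern, box_title)
print(result1.group(1))  # 2012-5-23 21:14:42
print(result1.group(2))  # Tomb articles truth New Year
```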

rows holds the two-dimensional data for writing to the CSV file.

div_mulus=html.xpath('.//*[@class="mulu"]')
rows=[]
pattern=re.compile(r'\s*\[(.*)\]\s+(.*)')
for div_mulu in div_mulus:
    # chapter heading for this block
    div_h2=div_mulu.xpath('./div[@class="mulu-title"]/center/h2/text()')
    if len(div_h2)>0:
        h2_title=div_h2[0]
        a_s=div_mulu.xpath('./div[@class="box"]/ul/li/a')
        for a in a_s:
            href=a.xpath('./@href')[0]
            box_title=a.xpath('./@title')[0]
            # group 1 is the date, group 2 the real title
            result1=re.search(pattern, box_title)
            rows.append([h2_title,result1.group(2),href,result1.group(1)])

Store the data

Build a one-dimensional header list that matches the two-dimensional rows, open the file in 'w' mode, and use the writer's writerow/writerows methods to write the one-dimensional header and then the two-dimensional rows.

The final print marks normal completion.

headers=['title','real_title','href','date']
with open('text.csv','w',newline='') as f:  # newline='' avoids blank lines on Windows
    f_csv=csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)
print('finished')
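To check the result, the file can be read back with csv.reader. A minimal, self-contained sketch using stand-in data (the real rows come from the crawl above):

```python
import csv

# stand-in data; the real rows come from the crawl above
headers = ['title', 'real_title', 'href', 'date']
rows = [['Chapter 1', 'The truth', 'http://seputu.com/x.html', '2012-5-23 21:14:42']]

with open('text.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)

# read it back to verify the write
with open('text.csv', newline='') as f:
    data = list(csv.reader(f))
print(data[0])  # the header row
print(data[1])  # the first data row
```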


Origin www.cnblogs.com/bai2018/p/10988788.html