Purpose
Crawl http://seputu.com/ and store the extracted data in a CSV file.
Import libraries
lxml parses the HTML source of a page so that data can be extracted from it. A reference: https://www.cnblogs.com/zhangxinqi/p/9210211.html
requests fetches web pages.
chardet detects the character encoding of a page.
csv writes the data to a CSV file.
re provides regular expressions.
from lxml import etree
import requests
import chardet
import csv
import re
Get the page
Passing request headers to requests.get lets the script simulate a browser. The header values can be copied from the Network tab of the browser's developer console.
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
headers = {'User-Agent': user_agent}
r = requests.get('http://seputu.com/', headers=headers)
Detect the encoding and decode
r.encoding=chardet.detect(r.content)['encoding']
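The sketch below does not call chardet. It only illustrates, with the standard library, why the encoding has to be set before reading r.text: the same bytes decode to different text depending on the codec. The sample string and the choice of GBK are assumptions made for illustration, not facts about the crawled site.

```python
# Hypothetical sample text, encoded the way a GBK-serving site might send it.
text = '盗墓笔记'
raw = text.encode('gbk')

# Decoding with the wrong codec yields replacement characters, not the text.
wrong = raw.decode('utf-8', errors='replace')

# Decoding with the codec that matches the bytes recovers the original text.
right = raw.decode('gbk')

print(repr(wrong))
print(repr(right))
```

Assigning the detected name to r.encoding plays the role of the second decode call: it tells requests which codec to use when building r.text.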
Parse the page
html=etree.HTML(r.text)
Extract information from the page
Open the site in a browser and use Inspect Element to locate the tags to extract; the XPath expressions that pull out the text content follow from that structure.
The content extracted here is h2_title, href, and title. The title is split into its parts with regex groups.
Note: Python's re module does not support variable-width look-behind assertions (a look-behind must match a fixed number of characters), so capturing groups are used instead to avoid the error.
The following code fails:
import re
box_title = '[2012-5-23 21:14:42] Tomb articles truth New Year'
pattern = re.compile(r'(?<=\[.*\]\s).*')
result1 = re.search(pattern, box_title)
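A minimal sketch of the failure and of the grouping workaround, using only the re module. The variable names lookbehind_error, date, and real_title are introduced here for illustration. The behavior shown is that of CPython's re module at the time of writing.

```python
import re

box_title = '[2012-5-23 21:14:42] Tomb articles truth New Year'

# A look-behind that contains `.*` has no fixed width, so re.compile
# rejects the pattern before any matching takes place.
try:
    re.compile(r'(?<=\[.*\]\s).*')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# Capturing groups avoid the look-behind entirely:
# group(1) is the bracketed timestamp, group(2) is the rest of the title.
match = re.search(r'\s*\[(.*)\]\s+(.*)', box_title)
date, real_title = match.group(1), match.group(2)

print(lookbehind_error)
print(date, '|', real_title)
```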
rows holds the two-dimensional data that will be written to the CSV file.
div_mulus = html.xpath('.//*[@class="mulu"]')
# Compile once: group(1) is the bracketed timestamp, group(2) is the real title
pattern = re.compile(r'\s*\[(.*)\]\s+(.*)')
rows = []
for div_mulu in div_mulus:
    div_h2 = div_mulu.xpath('./div[@class="mulu-title"]/center/h2/text()')
    if len(div_h2) > 0:
        h2_title = div_h2[0]
        a_s = div_mulu.xpath('./div[@class="box"]/ul/li/a')
        for a in a_s:
            href = a.xpath('./@href')[0]
            box_title = a.xpath('./@title')[0]
            result1 = re.search(pattern, box_title)
            # Skip titles that do not have the "[date] title" form
            if result1 is None:
                continue
            rows.append([h2_title, result1.group(2), href, result1.group(1)])
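To make the traversal concrete without a network request, here is a rough sketch that runs the same walk over a small inline fragment. It swaps in the standard library's xml.etree.ElementTree for lxml, so it handles only well-formed markup and a subset of XPath (no text() or @attr steps, hence .text and .get). The fragment, the example.com URLs, and the function name extract_rows are all hypothetical; the real page's markup may differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the structure the
# XPath expressions expect.
SAMPLE = """<html><body>
  <div class="mulu">
    <div class="mulu-title"><center><h2>Volume 1</h2></center></div>
    <div class="box"><ul>
      <li><a href="http://example.com/1.html" title="[2012-5-23 21:14:42] Chapter One">Chapter One</a></li>
      <li><a href="http://example.com/2.html" title="no timestamp here">Chapter Two</a></li>
    </ul></div>
  </div>
  <div class="mulu"><div class="box"><ul></ul></div></div>
</body></html>"""

TITLE_RE = re.compile(r'\s*\[(.*)\]\s+(.*)')


def extract_rows(markup):
    """Return [section title, real title, href, date] for each matching link."""
    root = ET.fromstring(markup)
    rows = []
    for block in root.findall('.//*[@class="mulu"]'):
        h2 = block.find('./div[@class="mulu-title"]/center/h2')
        if h2 is None:
            continue  # a block without a heading is skipped
        for a in block.findall('./div[@class="box"]/ul/li/a'):
            match = TITLE_RE.search(a.get('title', ''))
            if match is None:
                continue  # title is not in the "[date] title" form
            rows.append([h2.text, match.group(2), a.get('href'), match.group(1)])
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

The second link and the second block are dropped, which shows the two guard conditions at work: a title without a bracketed timestamp, and a block without an h2 heading.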
Store the data
Define the one-dimensional header row, open the file in write mode ('w'), and use the csv.writer methods writerow and writerows to write the header and then the two-dimensional rows.
The final print statement signals that the script completed normally.
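headers = ['title', 'real_title', 'href', 'date']
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('text.csv', 'w', newline='', encoding='utf-8') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)
print('finished')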
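A small round-trip sketch of the writing step, assuming a single hypothetical row and a temporary directory instead of the working directory. It writes the header and rows with csv.writer and then reads them back with csv.reader to check the result.

```python
import csv
import os
import tempfile

header = ['title', 'real_title', 'href', 'date']
# Hypothetical sample row in the same column order as the header.
rows = [['Volume 1', 'Chapter One', 'http://example.com/1.html', '2012-5-23 21:14:42']]

path = os.path.join(tempfile.mkdtemp(), 'text.csv')

# newline='' lets the csv module control line endings itself.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header)   # one-dimensional data: a single row
    writer.writerows(rows)    # two-dimensional data: a list of rows

# Read the file back to confirm what was written.
with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(read_back)
```

Opening the file with newline='' is the form the csv module documentation recommends; an explicit encoding keeps non-ASCII titles from depending on the platform default.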