Python crawler data collection

Recently, a project needed to collect pages from a few sites. I had previously done this with PHP, but Python is now very popular for scraping, so I did some research and am recording my notes here.

Data collection boils down to fetching the contents of a web page and then filtering the required data out of that content.

Python's strengths are speed, multi-threading, and high concurrency, which make it well suited to collecting large amounts of data; that is where PHP falls short by comparison. On the other hand, Python's ready-made packages did not seem as complete as PHP's code libraries for my purposes, and installing Python was a bit of a hassle; I fiddled with it for quite a while.
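As a rough illustration of the multi-threading point (this example is mine, not from the original post), a few pages can be fetched concurrently with requests and the standard library's thread pool:

import requests
from concurrent.futures import ThreadPoolExecutor

# Illustration only: fetch a few pages concurrently.
urls = [
    'https://www.cnblogs.com/mengzhilva/',
    'https://www.cnblogs.com/',
]

def fetch(url):
    return requests.get(url, timeout=10).text  # return the page body

with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))

print([len(p) for p in pages])  # rough check that each page downloaded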

For installing Python 3, see the following link:

https://www.cnblogs.com/mengzhilva/p/11059329.html

Editor:

PyCharm: an excellent dedicated Python editor; it can run code directly and supports Windows.

Python libraries used for collection:

requests: used to fetch page content; supports HTTPS and user login information (cookies/sessions), very powerful.

lxml: used to parse the fetched HTML content; very easy to use and flexible, with plenty of usage examples and API documentation that is easy to find.

pymysql: connects to and operates MySQL; not much needs to be said here, it is used to store the collected information in the database (see the sketch after this list).

These three libraries are basically enough to support collecting pages.
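To show how the three fit together, here is a minimal sketch of my own (not from the original post) that fetches the same blog list page, parses the titles with lxml, and writes them to MySQL with pymysql. The connection credentials and the posts table are placeholders for illustration; the requests Session shows where login cookies would be carried.

# coding=utf-8
# Minimal fetch -> parse -> store sketch. The credentials and the
# `posts` table are placeholders, not from the original post.
import pymysql
import requests
from lxml import etree

session = requests.Session()  # a Session keeps cookies, e.g. after logging in
resp = session.get('https://www.cnblogs.com/mengzhilva/')
html = etree.HTML(resp.text)
titles = html.xpath('//div[@class="day"]//div[@class="postTitle"]//a/text()')

conn = pymysql.connect(host='localhost', user='root', password='secret',
                       database='spider', charset='utf8mb4')
try:
    with conn.cursor() as cur:  # pymysql cursors work as context managers
        cur.executemany('INSERT INTO posts (title) VALUES (%s)',
                        [(t.strip(),) for t in titles])
    conn.commit()
finally:
    conn.close()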

Installation:

Install each library with pip:

pip install pymysql
pip install requests
pip install lxml

Data collection:

The collection code, with its printed output:

# coding=utf-8  # declare the source encoding so Chinese text is not garbled
import re
import pymysql
import requests
from lxml import etree
# from mydb import *  # the author's local MySQL helper module (not shown)

# Pretend to be a browser when visiting the site
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
# Fetch the page with requests
response = requests.get('https://www.cnblogs.com/mengzhilva/', headers=headers)
content = response.text  # the page source
html = etree.HTML(content)  # parse it with lxml
result = etree.tostring(html, encoding='utf-8')  # the parsed document, for inspection
titles = html.xpath('//div[@class="day"]//div[@class="postTitle"]//a/text()')  # post titles
urls = html.xpath('//div[@class="day"]//div[@class="postTitle"]//a/@href')  # post links
print(titles)
print(urls)
i = 1
for val in titles:
    # look up the link inside the i-th "day" block
    url = html.xpath('//div[@class="day"][' + format(i) + ']//div[@class="postTitle"]//a/@href')
    print(val)
    print(url)
    # a separate function could be called here to fetch each post's content
    i += 1
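The loop's comment mentions calling a separate function to fetch each post's page. A minimal sketch of what such a function might look like follows; the cnblogs_post_body div id is my assumption about the post-page markup, not something stated in the original:

def fetch_post(post_url, headers):
    # Fetch one post page and return its body HTML.
    # The div id below is an assumption about cnblogs post pages.
    resp = requests.get(post_url, headers=headers)
    page = etree.HTML(resp.text)
    body = page.xpath('//div[@id="cnblogs_post_body"]')
    return etree.tostring(body[0], encoding='unicode') if body else ''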
