1. A basic introduction to Python crawlers

Before learning about Python crawlers, I think you should first get familiar with the basic ways of extracting data, such as regular expressions (regex is one kind of data-filtering expression), so let's start with the re module. I hope that after reading this blog, Python crawlers will no longer feel strange to you, and you will be able to do some simple data crawling.


Regular expressions are typically used to match, search, replace, and split text that conforms to a certain pattern (rule).

Case one:

import re   # import the regular expression module

str = 'Beijing Olympic Games will be held in 2018, I still remember'

pat = re.compile("[0-9]{4}")   # compile() builds a regular expression object from a pattern string and optional flags; the object has a series of methods for matching and replacement
year = pat.findall(str)   # find all matching data
print(year)
# Result: ['2018']

We can also write it another way:

import re

str = 'Beijing Olympic Games will be held in 2018, I still remember'
res = re.findall("[0-9]{4}", str)
print(res)
# Result: ['2018']

For more on regex syntax, you can open this blog post: https://www.cnblogs.com/12james/p/11787305.html

The commonly used regular expression functions are explained in detail here: https://www.cnblogs.com/12james/p/11787462.html

compile(), match(), search(), findall(), finditer(), split(), sub(), etc.
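
A quick illustration of these functions (a minimal sketch; the sample string is made up for demonstration):

import re

text = "abc123def456"

pat = re.compile("[0-9]+")           # compile a pattern object for reuse
print(pat.match(text))               # None - match() only succeeds at the start of the string
print(pat.search(text).group())      # '123' - search() finds the first match anywhere
print(pat.findall(text))             # ['123', '456'] - all matches as a list of strings
for m in pat.finditer(text):         # finditer() yields match objects one by one
    print(m.group(), m.span())       # '123' (3, 6), then '456' (9, 12)
print(re.split("[0-9]+", text))      # ['abc', 'def', ''] - split on runs of digits
print(re.sub("[0-9]+", "#", text))   # 'abc#def#' - replace each run of digits with '#'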

Case two: applying regex in practice

We have a local index.html file:

<!DOCTYPE html>
<html>
<head>
    <title>Regular expression example</title>
</head>
<body>
    <h2>Useful links</h2>
    <ul>
        <li><a href="https://www.python.org">Python official website</a></li>
        <li><a href="https://www.djangoproject.com">Django official website</a></li>
        <li><a href="https://www.baidu.com">Baidu search engine</a></li>
        <li><a href="https://blog.csdn.net">CSDN official website</a></li>
        <li><a href="https://edu.csdn.net/">CSDN College</a></li>
    </ul>
</body>
</html>

Use Python to crawl the content of this web page:

import re

f = open("./index.html", "r")   # open the local index.html file
content = f.read()   # read the file content
f.close()   # close the file

title = re.search("<title>(.*?)</title>", content)   # match the title from the content
if title:
    print(title)
    print("title: " + title.group())

alist = re.findall('<a href="(.*?)">(.*?)</a>', content)   # match each site's link and name

for ov in alist:
    print(ov[1] + ":" + ov[0])

Python crawler application scenarios are as follows:

  • Crawler technology can do many things in scientific research, web security, product development, public opinion monitoring, and other areas.

  • In data mining, machine learning, and image processing research, if no data is at hand, it can be fetched from the Internet with a crawler;

  • In web security, crawlers can be used to batch-verify whether websites have exploitable vulnerabilities;

  • In product development, the prices of items in every store can be collected to offer users the lowest market price;

  • In public opinion monitoring, we can crawl and analyze Sina Weibo data to identify whether a user is a paid poster ("internet water army").

Technical preparation for learning Python crawlers:

  • (1) Python language basics: basic syntax, operators, data types, flow control, functions, objects and modules, file operations, multithreading, network programming, and so on

  • (2) W3C standards: HTML, CSS, JavaScript, XPath, JSON

  • (3) HTTP standard: the HTTP request process, request methods, the meaning of status codes, header information, and Cookie state management (a short sketch follows this list)

  • (4) Databases: SQLite, MySQL, MongoDB, Redis, ...
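
To make item (3) concrete, here is a minimal sketch (assuming the requests library, which we install below, and network access) that inspects the status code, request method, headers, and cookies of a response:

import requests

res = requests.get("http://www.baidu.com")

print(res.status_code)               # status code: 200 means OK, 404 means Not Found
print(res.request.method)            # the request method that was used: GET
print(res.headers["Content-Type"])   # one field of the response header information
print(res.cookies)                   # cookies the server set for Cookie state management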

How web crawlers work

Data request and fetching:

  • In crawler implementation, apart from the scrapy framework, Python has many related libraries available. For data fetching, these include: urllib2 (urllib3), requests, mechanize, selenium, splinter;

  • Among them, urllib2 (urllib3), requests, and mechanize are used to obtain the raw response content for a URL, while selenium and splinter drive a browser to load the page and obtain the response content after the browser has rendered it, which simulates a real user to a higher degree.

  • Considering efficiency, prefer urllib2 (urllib3), requests, or mechanize wherever possible and avoid selenium and splinter, since the latter are less efficient because they need to load a browser.

  • For data fetching, the process is to simulate a browser sending a constructed HTTP request to the server; the common types are GET/POST (see the sketch after this list).
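
A minimal sketch of the two request types with requests (httpbin.org is a public echo service used here purely for illustration; it is not part of the original examples):

import requests

# simulate a browser by sending a User-Agent header
headers = {"User-Agent": "Mozilla/5.0"}

# GET: parameters travel in the URL query string
res = requests.get("https://httpbin.org/get",
                   params={"key": "python"}, headers=headers)
print(res.url)           # https://httpbin.org/get?key=python

# POST: parameters travel in the request body
res = requests.post("https://httpbin.org/post",
                    data={"key": "python"}, headers=headers)
print(res.status_code)   # 200 if the request succeeded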

Data parsing:

  • For data parsing, the relevant libraries include: lxml, beautifulsoup4, re, pyquery.

  • Data parsing mainly means extracting the required data from the response pages; the commonly used methods are XPath path expressions, CSS selectors, and regular expressions.

  • Among them, XPath path expressions and CSS selectors are mainly used to extract structured data, while regular expressions are mainly used to extract unstructured data (a side-by-side sketch follows this list).
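
To see the three methods side by side, here is a minimal sketch that extracts the links from the index.html sample of Case two with each of them (assuming lxml and beautifulsoup4 are installed via pip):

import re
from lxml import etree
from bs4 import BeautifulSoup

with open("./index.html", "r") as f:
    content = f.read()

# XPath path expression (lxml)
tree = etree.HTML(content)
print(tree.xpath('//li/a/@href'))                    # all link addresses

# CSS selector (beautifulsoup4)
soup = BeautifulSoup(content, "html.parser")
print([a.get_text() for a in soup.select("li a")])   # all link names

# regular expression (re)
print(re.findall('<a href="(.*?)">(.*?)</a>', content))   # (href, name) pairs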

To learn more about urllib, you can open the official urllib documentation: https://docs.python.org/3/library/urllib.html
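
For comparison, fetching a page with the standard-library urllib looks like this (a minimal sketch; no third-party install is needed, though requests, used below, is usually more convenient):

from urllib import request

# fetch a page with the standard library
with request.urlopen("http://www.baidu.com") as res:
    print(res.status)                    # the HTTP status code
    data = res.read().decode("utf-8")    # the response body as text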

Today we mainly use the requests library:

  • Installation: install requests using the pip command

  pip install requests

Using a GET request with requests:

import requests
import re

url = "http://www.baidu.com"
# fetch the page
res = requests.get(url)
# get the HTTP status code
print("Status: %d" % res.status_code)
# get the response body
data = res.content.decode("utf-8")
# parse the result with a regex
print(re.findall("<title>(.*?)</title>", data))
Another example, carrying query parameters with the GET request:

# define the data carried by the request
data = {
    'key': 'python',
    'final': 1,
    'jump': 1,
}
# define the request URL
url = "http://bj.58.com/job/"
# send a GET request
res = requests.get(url, params=data)
# decode the fetched page data
html = res.content.decode('utf-8')
# define the regular expression
pat = r'<span class="address" >(.*?)</span> \| <span class="name">(.*?)</span>'
# match the data
dlist = re.findall(pat, html)
for v in dlist:
    print(v[0] + " | " + v[1])

See the next post for more on requests.
