Python web crawling and information extraction (with examples)

This post brings you an introduction to Python web crawling and information extraction, explained with examples. It is shared here as a reference; follow along below.
Course outline:

1. The Requests library: automatic fetching of HTML pages and automatic submission of network requests

2. robots.txt: the Robots Exclusion Standard for web crawlers

3. The BeautifulSoup library: parsing HTML pages

4. The re module: regular expressions for extracting key information from pages

5. The Scrapy framework: web crawler principles and an introduction to a professional crawling framework

Guiding principle: The Website is the API.

Commonly used IDE tools for the Python language

Text-editor IDEs:

IDLE, Notepad++, Sublime Text, Vim & Emacs, Atom, Komodo Edit

Integrated-tool IDEs:

PyCharm, Wing, PyDev & Eclipse, Visual Studio, Anaconda & Spyder, Canopy

· IDLE is the entry-level editor that ships with Python itself; it offers both an interactive shell and a file-editing mode. Suitable for shorter programs.

· Sublime Text is a third-party tool built specifically for programmers; it improves the programming experience and supports a variety of coding styles.

· Wing is a paid IDE from Wingware, with rich debugging features plus version control and version synchronization; it is suited to multi-person development and to writing large programs.

· Visual Studio is maintained by Microsoft; with the PTVS plug-in configured it can be used to write Python. It is mainly Windows-oriented and has rich debugging features.

· Eclipse is an open-source IDE; it can be used for Python after configuring PyDev, but the configuration process is somewhat complex and requires some development experience.

· PyCharm comes in a Community Edition and a Professional Edition. The Community Edition is free, simple, and highly integrated, and is suitable for writing more complex projects.

IDEs suited to scientific computing and data analysis:

· Canopy is a paid tool maintained by Enthought; it supports nearly 500 third-party libraries and targets scientific-computing application development.

· Anaconda is free and open source, and supports nearly 800 third-party libraries.

Getting started with the Requests library

Installing Requests:

The Requests library is widely recognized as the best third-party Python library for web crawling; it is simple and concise.

Official website: http://www.python-requests.org

Find "cmd.exe", run as administrator, type in the command line: "\ Windows \ System32 C" : "pip install requests" to run. Here Insert Picture Description
Testing the Requests library in IDLE:

>>> import requests
>>> r = requests.get("http://www.baidu.com")  # fetch the Baidu home page
>>> r.status_code                             # 200 if the connection succeeded
>>> r.encoding = 'utf-8'
>>> r.text

The seven main methods of the Requests library are requests.request(), requests.get(), requests.head(), requests.post(), requests.put(), requests.patch(), and requests.delete(); their signatures are listed later in this section.
The get() method

r = requests.get(url)

The get() method constructs a Request object asking the server for the resource at url; the server returns a Response object containing that resource.

requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch

params: extra parameters appended to the URL, as a dictionary or byte stream; optional

**kwargs: 12 optional access-control parameters

Two important objects in the Requests library

· Request: the object representing the request sent to the server

· Response: the object containing the content returned by the crawl
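
As a quick illustration, here is a minimal sketch (assuming http://www.baidu.com is reachable) showing that a single call produces both objects: the Response that is returned, and the request that was actually sent, available as its .request attribute.

import requests

r = requests.get("http://www.baidu.com")
print(type(r))          # the Response object returned by get()
print(type(r.request))  # the request that was sent (a prepared Request)
print(r.request.url)    # the URL the request was sent to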

Response object attributes

r.status_code: the HTTP status code of the response; 200 indicates success, 404 (or any other code) indicates failure

r.text: the string content of the HTTP response, i.e. the page content at the URL

r.encoding: the content encoding guessed from the HTTP headers

r.apparent_encoding: the content encoding inferred from the content itself (an alternative encoding)

r.content: the binary (bytes) form of the HTTP response content

r.encoding: if no charset appears in the headers, the encoding is assumed to be ISO-8859-1.

r.apparent_encoding: the encoding determined by analysing the page content; it can be regarded as a fallback for r.encoding.

Response encoding:

r.encoding: the encoding guessed from the HTTP response headers; if no charset header is present, the encoding is assumed to be ISO-8859-1, and r.text renders the page content according to r.encoding

r.apparent_encoding: the encoding inferred from analysing the page content; it can be used as a fallback for r.encoding
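
A minimal sketch of this encoding fallback in practice, assuming http://www.baidu.com is reachable and sends no charset header:

import requests

r = requests.get("http://www.baidu.com")
print(r.encoding)           # often 'ISO-8859-1' when no charset header is present
print(r.apparent_encoding)  # e.g. 'utf-8', inferred from the page content
r.encoding = r.apparent_encoding  # adopt the inferred encoding so r.text displays correctly
print(r.text[:200])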

A generic code framework for crawling pages

Exceptions in the Requests library: the main ones include requests.ConnectionError (network connection errors such as DNS failure or a refused connection), requests.HTTPError (an invalid HTTP response), requests.Timeout (the request timed out), and requests.TooManyRedirects (too many redirects).
Exceptions on the Response object

r.raise_for_status(): if the status code is not 200, raises requests.HTTPError

This method checks internally whether r.status_code equals 200, so no extra if statement is needed; it is convenient for try-except exception handling.

import requests
 
def getHTMLText(url):
  try:
    r = requests.get(url, timeout=30)
    r.raise_for_status() # raise HTTPError if the status code is not 200
    r.encoding = r.apparent_encoding
    return r.text
  except: 
    return "An exception occurred"
 
if __name__ == "__main__":
  url = "http://www.baidu.com"
  print(getHTMLText(url))

This generic code framework makes a user's web crawling more efficient, stable, and reliable.

HTTP protocol

HTTP stands for Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol based on a request-response model.

HTTP uses URLs to identify and locate network resources.

URL format: http://host[:port][path]

· host: a legal Internet host domain name or IP address
· port: the port number; the default is 80
· path: the path of the requested resource

Understanding HTTP URLs:

A URL is the Internet path for accessing a resource via HTTP; each URL corresponds to one data resource.

HTTP operations on resources:

· GET: request the resource at the URL
· HEAD: request only the header information of the resource at the URL
· POST: append new data to the resource at the URL
· PUT: store a resource at the URL, replacing the original resource
· PATCH: partially modify the resource at the URL
· DELETE: delete the resource stored at the URL
Understanding the difference between PATCH and PUT

Suppose the URL location holds a set of data, UserInfo, containing 20 fields including UserID and UserName.

Requirement: the user modifies UserName and leaves everything else unchanged.

* With PATCH, only a partial update request containing UserName is submitted to the URL.

* With PUT, all 20 fields must be submitted to the URL; any field not submitted is deleted.

The main advantage of PATCH: it saves network bandwidth.
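
To make the contrast concrete, here is a minimal sketch of the two approaches. It uses httpbin.org, a public request-echo service, and a hypothetical user record, so the field names and values are illustrative only:

import requests

# Hypothetical record with many fields (only three shown here)
full_record = {'UserID': '42', 'UserName': 'alice', 'Email': 'alice@example.com'}

# PATCH: submit only the field that changed
r1 = requests.patch('https://httpbin.org/patch', data={'UserName': 'alice_new'})

# PUT: submit the complete record, since unsubmitted fields would be treated as deleted
full_record['UserName'] = 'alice_new'
r2 = requests.put('https://httpbin.org/put', data=full_record)

print(r1.status_code, r2.status_code)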

The main methods of the Requests library, explained

requests.request(method, url, **kwargs)

· method: the request method, corresponding to the 7 kinds such as GET/PUT/POST

Example: r = requests.request('OPTIONS', url, **kwargs)

· url: the URL of the page to fetch

· **kwargs: access-control parameters, 13 in total, all optional

params: a dictionary or byte sequence added to the URL as parameters;

kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('GET', 'http://python123.io/ws',params=kv)
print(r.url)
'''
http://python123.io/ws?key1=value1&key2=value2
'''

data: a dictionary, byte sequence, or file object used as the body of the Request;

json: JSON-formatted data used as the body of the Request;
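
A brief sketch contrasting data and json, again using httpbin.org (which simply echoes the request body back) as an assumed test endpoint:

import requests

# data= sends a form-encoded body
r1 = requests.request('POST', 'https://httpbin.org/post', data={'key1': 'value1'})

# json= sends a JSON body
r2 = requests.request('POST', 'https://httpbin.org/post', json={'key1': 'value1'})

print(r1.json()['form'])  # {'key1': 'value1'}
print(r2.json()['json'])  # {'key1': 'value1'}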

headers: a dictionary of custom HTTP headers;

hd = {'user-agent':'Chrome/10'}
 
r = requests.request('POST','http://www.yanlei.shop',headers=hd)

cookies: a dictionary or CookieJar, the cookies for the Request;

auth: a tuple, supporting HTTP authentication;

files: a dictionary, for file transfer;

fs = {'file':open('data.xls', 'rb')}
 
r = requests.request('POST','http://python123.io/ws',files=fs)

timeout: the timeout, in seconds;

proxies: a dictionary that sets proxy servers for the request, optionally with login authentication;

allow_redirects: True/False, default True; switch for following redirects;

stream: True/False, default False; when False the response content is downloaded immediately, when True it is streamed and fetched on demand;

verify: True/False, default True; switch for SSL certificate verification;

cert: path to a local SSL client certificate
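
Several of these parameters can be combined in one call. A minimal sketch follows; the proxy addresses are placeholders rather than real servers, so the request is expected to fail unless they are replaced:

import requests

# Placeholder proxy addresses -- replace with real proxies before running
pxs = {'http': 'http://user:pass@10.10.10.1:1234',
       'https': 'https://10.10.10.1:4321'}

try:
    r = requests.request('GET', 'http://www.baidu.com',
                         proxies=pxs,           # route the request through the proxies
                         timeout=10,            # give up after 10 seconds
                         allow_redirects=True)  # follow redirects (the default)
    print(r.status_code)
except requests.RequestException as e:
    print("Request failed:", e)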

# Methods and their parameters
requests.get(url, params=None, **kwargs)
requests.head(url, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.delete(url, **kwargs)

Problems caused by web crawlers

Performance harassment:

Limited by the skill and purpose of their authors, web crawlers can impose a huge resource overhead on web servers.

Legal risks:

Data on a server has property rights; using a crawler to obtain that data for profit brings legal risk.

Privacy leakage:

Web crawlers may be able to break through simple access controls and obtain protected data, disclosing personal privacy.

Limits on web crawlers

· Source review: restriction based on the User-Agent

The server inspects the User-Agent field in the headers of incoming HTTP requests and only responds to visits from browsers or friendly crawlers.

· Announcement: the Robots protocol

The website tells all crawlers its crawling policy, and crawlers are expected to comply with the requirements.

The Robots protocol

Robots Exclusion Standard: the exclusion standard for web crawlers

Role: the website informs web crawlers which pages may be crawled and which may not.

Form: a robots.txt file in the root directory of the site.

Example: the Robots protocol of JD.com (Jingdong)

http://www.jd.com/robots.txt

# Note: * means all crawlers, / means the root directory
User-agent: * 
Disallow: /?* 
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider 
Disallow: / 
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /

Using the Robots protocol

Crawler: automatically or manually identify robots.txt, then crawl the content accordingly.
  Binding: the Robots protocol is advisory rather than binding; a web crawler may choose not to follow it, but doing so carries legal risk.
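
One way to perform this identification automatically is the standard library's urllib.robotparser, which does simple prefix matching against robots.txt rules (it does not understand wildcard patterns). A minimal sketch, assuming the JD robots.txt shown above is reachable:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.jd.com/robots.txt")
rp.read()  # download and parse robots.txt

print(rp.can_fetch("*", "https://www.jd.com/"))           # True: ordinary crawlers may fetch the home page
print(rp.can_fetch("EtaoSpider", "https://www.jd.com/"))  # False: EtaoSpider is disallowed from the whole site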

Web crawling in practice with the Requests library

1. A JD.com (Jingdong) product page

import requests
url = "https://item.jd.com/5145492.html"
try:
 r = requests.get(url)
 r.raise_for_status()
 r.encoding = r.apparent_encoding
 print(r.text[:1000])
except:
 print("Crawl failed")

2. An Amazon product page

# Fetching an Amazon product page directly gets access denied, so a 'user-agent' header must be added
import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
 kv = {'user-agent':'Mozilla/5.0'} # masquerade as a browser via the User-Agent header
 r = requests.get(url, headers = kv)
 r.raise_for_status()
 r.encoding = r.apparent_encoding
 print(r.text[1000:2000])
except:
 print("Crawl failed")

3. Submitting keywords to Baidu / 360 search

Search engine keyword submission interfaces

· Baidu keyword interface:

http://www.baidu.com/s?wd=keyword

· 360 keyword interface:

http://www.so.com/s?q=keyword

# Baidu
import requests
keyword = "Python"
try:
 kv = {'wd':keyword}
 r = requests.get("http://www.baidu.com/s",params=kv)
 print(r.request.url)
 r.raise_for_status()
 print(len(r.text))
except:
 print("Crawl failed")
# 360
import requests
keyword = "Python"
try:
 kv = {'q':keyword}
 r = requests.get("http://www.so.com/s",params=kv)
 print(r.request.url)
 r.raise_for_status()
 print(len(r.text))
except:
 print("Crawl failed")

4. Crawling and storing images from the web

Web image link format:

http://www.example.com/picture.jpg

National Geographic:

http://www.nationalgeographic.com.cn/

Pick an image link:

http://image.nationalgeographic.com.cn/2017/0704/20170704030835566.jpg

Full code for image crawling:
import requests
import os
url = "http://image.nationalgeographic.com.cn/2017/0704/20170704030835566.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]
try:
 if not os.path.exists(root):
  os.mkdir(root)
 if not os.path.exists(path):
  r = requests.get(url)
  with open(path, 'wb') as f:
   f.write(r.content)  # write the binary image content to disk
  print("File saved successfully")
 else:
  print("File already exists")
except:
 print("Crawl failed")

5. Automatic lookup of an IP address's location

IP lookup at www.ip138.com:

http://ip138.com/ips138.asp?ip=ipaddress

http://m.ip138.com/ip.asp?ip=ipaddress

import requests
url = "http://m.ip138.com/ip.asp?ip="
ip = "220.204.80.112"
try:
 r = requests.get(url + ip)
 r.raise_for_status()
 r.encoding = r.apparent_encoding
 print(r.text[1900:])
except:
 print("Crawl failed")
# Using IDLE
>>> import requests
>>> url ="http://m.ip138.com/ip.asp?ip="
>>> ip = "220.204.80.112"
>>> r = requests.get(url + ip)
>>> r.status_code
>>> r.text

That is the full content of this walkthrough of Python web crawling and information extraction with examples.
