Python crawler learning - the use of urllib and urllib2 libraries

1. Example of using the urllib2 library

>>> import urllib2
>>> response = urllib2.urlopen('http://www.baidu.com')  # fetch the page
>>> print response.read()  # print the page content

First of all, we call the urlopen method in the urllib2 library and pass in a URL. This URL is the Baidu homepage, and the protocol is HTTP. Of course, you can also replace HTTP with FTP, FILE, HTTPS, etc.; the scheme only indicates which access protocol is used.

2. Analyze program information

(1) The first line imports the urllib2 library

(2) In the second line, we mainly look at the urlopen function

Role: opens a web page and returns its information

Format:

urlopen(url,data,timeout)

Parameters: three in total, one required parameter (url) and two optional parameters (data, timeout)

  • The first parameter, url, is the URL to open
  • The second parameter, data, is the data to transmit when accessing the URL
  • The third parameter, timeout, sets the timeout

The second and third parameters are optional: data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.

The first parameter, url, must be passed. In this example, we pass Baidu's URL. After executing the urlopen method, a response object is returned, and the fetched information is stored in it.
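As a side note for readers on Python 3, where urllib2 was merged into urllib.request: urlopen kept the same first three parameters described above. A quick check, runnable without any network access:

```python
import inspect
import urllib.request

# urllib2's urlopen became urllib.request.urlopen in Python 3;
# its first three parameters are still url, data, and timeout.
sig = inspect.signature(urllib.request.urlopen)
print(list(sig.parameters)[:3])  # ['url', 'data', 'timeout']
```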

(3) The third sentence prints out the web page information obtained by urlopen:

response.read()

The response object has a read method that returns the fetched web page content.

Without calling read, printing the response outputs the object's description directly, so remember to call the read method:

<addinfourl at 139728495260376 whose fp = <socket._fileobject object at 0x7f1513fb

3. Construct a request

import urllib2
request = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(request)
print response.read()
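For readers on Python 3, where urllib2 no longer exists, the same request-building step can be sketched with urllib.request. Constructing the Request object needs no network; only urlopen does:

```python
from urllib.request import Request

# Same basic crawler flow: first build a Request object for the target URL.
request = Request("http://www.baidu.com")
print(request.full_url)      # http://www.baidu.com
print(request.get_method())  # GET, since no data was attached
```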

It is recommended that you write it this way; this program reflects the basic flow of a crawler: construct a request, open it, and read the response.

Scraping dynamic web pages

The above programs demonstrate the most basic web scraping. However, most websites today serve dynamic pages: you need to pass parameters to them, and they respond accordingly. So, when accessing them, we need to send data.

There are two ways to transmit data: POST and GET. The most important difference is that the GET method accesses the resource directly via a link, and the link contains all the parameters. If the parameters include a password, this is an insecure choice, but you can see at a glance what you have submitted. POST does not display the parameters on the URL, which is safer, but less convenient if you want to inspect directly what was submitted. Choose as appropriate.

POST method:

1. First construct a URL
2. Pass the dynamic data for the URL
3. Construct a request
4. Construct a response
# Example: log in to the Zhihu client
import urllib
import urllib2
values = {"username": "1590110xxxx", "password": "xx12341234"}
data = urllib.urlencode(values)  # note: urlencode lives in urllib, not urllib2
url = "https://www.zhihu.com/"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()
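In Python 3, the same POST construction uses urllib.parse.urlencode and urllib.request.Request. A minimal sketch, with no network needed to see the method switch:

```python
from urllib.parse import urlencode
from urllib.request import Request

values = {"username": "1590110xxxx", "password": "xx12341234"}
data = urlencode(values).encode("utf-8")  # the POST body must be bytes in Python 3

# Attaching data switches the request method from GET to POST.
request = Request("https://www.zhihu.com/", data=data)
print(request.get_method())  # POST
```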

GET method:

import urllib
import urllib2
values = {'username': '159xxxxxxxx', 'password': 'xx12341234'}
data = urllib.urlencode(values)  # note: urlencode lives in urllib, not urllib2
url = 'https://www.zhihu.com/'
geturl = url + '?' + data
request = urllib2.Request(geturl)  # the parameters travel in the URL itself
response = urllib2.urlopen(request)
print response.read()
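The Python 3 counterpart of the GET flow, showing that the encoded parameters end up visible in the URL itself:

```python
from urllib.parse import urlencode
from urllib.request import Request

values = {'username': '159xxxxxxxx', 'password': 'xx12341234'}
url = 'https://www.zhihu.com/'
geturl = url + '?' + urlencode(values)  # parameters are appended to the URL

request = Request(geturl)
print(request.get_method())  # GET, since no data was attached
print(request.full_url)      # the query string is part of the URL
```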

This article refers to Cui Qingcai's blog.

http://python.jobbole.com/81336/
