Python urllib Basics: Hands-On Crawler Examples

 

Today let's work through a few small hands-on crawler examples.

First, open IDLE; we'll fetch the page title of CSDN.

>>> import urllib
>>> import urllib.request
>>> data=urllib.request.urlopen("https://www.csdn.net").read().decode("utf-8","ignore")
>>> len(data)
385957
>>> import re
>>> pat="<title>(.*?)</title>"
>>> re.compile(pat,re.S).findall(data)
['CSDN-专业IT技术社区']
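The key detail in the session above is the `re.S` flag: without it, `.` does not match newlines, so a `<title>` that spans several lines would be missed. A minimal sketch below demonstrates this on an inline sample page (hypothetical HTML, used so the example runs without network access):

```python
import re

# A small sample page standing in for a downloaded document;
# note the title is split across two lines.
data = "<html><head><title>Example Page\nTitle</title></head><body></body></html>"

# re.S (DOTALL) makes "." match newlines too, so multi-line titles
# are still captured by the non-greedy group.
pat = "<title>(.*?)</title>"
titles = re.compile(pat, re.S).findall(data)
print(titles)
```

Without `re.S`, `findall` would return an empty list here, because the newline inside the title breaks the match.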


# Crawl the page into a file on disk

>>> urllib.request.urlretrieve("http://www.jd.com",filename="local path (including the filename to save)") 
>>> urllib.request.urlretrieve("https://www.csdn.net",filename="E:\\IDLE文件\\csdn.html")
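`urlretrieve` works the same way for any scheme urllib can open, so we can sketch the save-to-disk step offline by retrieving a local file through a `file://` URL (the temp-file setup is only scaffolding for the example; with a real page you would pass the `http://` or `https://` URL directly):

```python
import tempfile
import urllib.request
from pathlib import Path

# Write a small local HTML file to stand in for a remote page.
src = Path(tempfile.mkdtemp()) / "page.html"
src.write_text("<html><title>demo</title></html>", encoding="utf-8")

# urlretrieve copies the resource at the URL into the given filename,
# exactly as with "https://www.csdn.net" in the session above.
dest = src.parent / "saved.html"
urllib.request.urlretrieve(src.as_uri(), filename=str(dest))
print(dest.read_text(encoding="utf-8"))
```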



# Disguising the request as a browser (user-agent spoofing)

>>> opener=urllib.request.build_opener()
>>> UA=("user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36")
>>> opener.addheaders=[UA]
>>> urllib.request.install_opener(opener)
>>> data=urllib.request.urlopen("https://www.csdn.net").read().decode("utf-8","ignore")
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    data=urllib.request.urlopen("https://www.csdn.net").read().decode("utf-8","ignore")
  File "D:\python\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "D:\python\lib\urllib\request.py", line 523, in open
    req = meth(req)
  File "D:\python\lib\urllib\request.py", line 1268, in do_request_
    for name, value in self.parent.addheaders:
ValueError: too many values to unpack (expected 2)

Why does the browser-disguise attempt above fail with "too many values to unpack"?
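The cause is in how `UA` was built: `opener.addheaders` must be a list of `(name, value)` two-item tuples, but the session assigned a single string wrapped in parentheses (which is just a string, not a tuple). When urllib later runs `for name, value in self.parent.addheaders`, it tries to unpack that long string into two variables and fails. A corrected sketch:

```python
import urllib.request

opener = urllib.request.build_opener()

# A (name, value) tuple -- note the comma separating the two strings.
UA = ("User-Agent",
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36")
opener.addheaders = [UA]
urllib.request.install_opener(opener)

# Every header entry now unpacks cleanly into a name and a value,
# so the loop inside urllib no longer raises ValueError.
for name, value in opener.addheaders:
    print(name, ":", value[:40])
```

After this, `urllib.request.urlopen("https://www.csdn.net")` goes through the installed opener and sends the spoofed User-Agent header.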

Origin blog.csdn.net/qq_39530692/article/details/104229341