Crawling the Baidu homepage with the standard library's urllib.request.urlopen method

The urllib.request module defines functions and classes suitable for opening URLs (mainly HTTP) in various complex situations, such as basic authentication, digest authentication, redirection, cookies, and more.
The official documentation recommends the third-party Requests package for a higher-level HTTP client interface.
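For comparison, a minimal sketch of the same kind of request using Requests (this assumes the package has been installed with pip install requests; it is not part of the original example):

import requests  # third-party package: pip install requests

response = requests.get("http://www.baidu.com", timeout=3)
print(response.status_code)                 # HTTP status code, e.g. 200
print(response.headers.get("Content-Type"))  # one response header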
Use the urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None) method from the standard library. A brief introduction to its parameters follows.
The url parameter is the network address to request (the full address, starting with the protocol scheme and optionally including a port, for example http://192.168.1.1:80).
The data parameter is optional and not used very often. If you want to send data, it must be in a byte-stream encoding, that is, the bytes type; a string can be converted with the bytes() function or str.encode(). Note that if you pass the data parameter, the request method is no longer GET but POST.
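A short sketch of such a POST request (httpbin.org is used here only as an illustrative echo endpoint; it is not part of the original example):

import urllib.parse
import urllib.request

# urlencode() builds the form string; encode() converts it to bytes.
data = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")
# Because `data` is passed, this request is sent as POST, not GET.
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode("utf-8"))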
timeout is an optional parameter that specifies a timeout in seconds for blocking operations such as the connection attempt (if not specified, the global default timeout setting is used).
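A minimal sketch of handling a timeout (the 0.1-second value is deliberately tiny, chosen only to provoke the failure path):

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen("http://www.baidu.com", timeout=0.1)
except urllib.error.URLError as e:
    # A timed-out connection surfaces as URLError wrapping socket.timeout.
    if isinstance(e.reason, socket.timeout):
        print("request timed out")
    else:
        raise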
If context is specified, it must be an instance of ssl.SSLContext describing various SSL options. See HTTPSConnection for more details.
Other parameters: the optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, while capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().
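Since cafile and capath are deprecated in newer Python versions in favor of context, here is a hedged sketch of supplying a custom CA bundle through an ssl.SSLContext (the certificate path is a hypothetical placeholder):

import ssl
import urllib.request

# Build an SSLContext instead of passing cafile/capath directly.
context = ssl.create_default_context()
# "/path/to/ca_bundle.pem" is a hypothetical CA bundle path.
context.load_verify_locations(cafile="/path/to/ca_bundle.pem")
response = urllib.request.urlopen("https://www.baidu.com", context=context)
print(response.status)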

The cadefault parameter is ignored.
See the official documentation for details.

# Crawl the Baidu homepage
import os
import platform
import sys
import time
import urllib.error
import urllib.request

def clear():
    """Clear the screen (a helper; not called in this example)."""
    print('The content is long; the screen will clear after 3 seconds')
    time.sleep(3)
    if platform.system() == 'Windows':
        os.system('cls')
    else:
        os.system('clear')

def linkBaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url, timeout=3)
        result = response.read().decode('utf-8')
    except urllib.error.URLError:
        print("Bad network address")
        sys.exit(1)
    with open('baidu.txt', 'w', encoding='utf-8') as f:
        f.write(result)
    print("URL info: response.geturl(): %s" % response.geturl())
    print("Return code: response.getcode(): %s" % response.getcode())
    print("Response info: response.info(): %s" % response.info())
    print("The fetched page has been saved to baidu.txt in the current directory; check it yourself")

if __name__ == '__main__':
    linkBaidu()

A simpler example: fetching specific response information

import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
print(response.status)        # print the response status code
print(response.getheaders())  # print all response headers
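Building on this, a small sketch that pulls one specific piece of information, the page's <title> tag, out of the response body with the standard re module (a naive regex is enough for this single tag; it is not a general HTML parser):

import re
import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
html = response.read().decode("utf-8")

match = re.search(r"<title>(.*?)</title>", html)
if match:
    print(match.group(1))  # the page title text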

For more examples, see the detailed introduction in the urllib.request documentation.
