Getting started with Python crawlers (1): crawling the source code of an entire web page

One, Source code

Use the third-party requests library to crawl a web page:

# -*- coding: utf-8 -*-  # Python 3 source files default to UTF-8

import requests

def get_html(url):  # fetch the page source
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/52.0.2743.116 Safari/537.36'
    }  # pretend to be a browser
    response = requests.get(url, headers=headers)  # request the site
    response.encoding = response.apparent_encoding  # set the character encoding
    html = response.text  # the page source as text
    return html  # return the page source

r = get_html('https://www.baidu.com/')
print(r)  # print the page source
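A quick aside on why the function sets response.encoding: requests guesses the encoding from the HTTP headers, while apparent_encoding re-guesses it from the response body, which is usually more reliable for Chinese pages. The effect is the same as decoding raw bytes with the right codec, which this small offline sketch (using a made-up byte string, not a real response) illustrates:

```python
# GBK-encoded bytes for the Chinese word "百度" (Baidu)
raw = "百度".encode("gbk")

# Decoding with the wrong codec silently produces mojibake
wrong = raw.decode("latin-1")
# Decoding with the right codec recovers the original text
right = raw.decode("gbk")

print(right)  # 百度
```

Setting `response.encoding = response.apparent_encoding` makes `response.text` use the "right" codec in the same way.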

Two, Code analysis

Digression

The Python language is very popular because it is simple and has a huge number of third-party libraries. If you have a programming background, you will pick up the ideas behind Python quickly; if not, it is best to buy a book and study. I recommend "Python Programming: From Entry to Practice" (the Chinese edition of *Python Crash Course*); both my roommate and I bought it.

1. Import the module

import requests

The import statement loads the requests module so that the code below can call its functions. Of course, this assumes you have installed requests (e.g. with `pip install requests`). The requests library was written by Kenneth Reitz; a link to his GitHub repository for requests is attached, and his ideas and code style are well worth studying.
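If you are not sure whether a module is installed, the standard library can tell you without triggering an ImportError. A minimal sketch using `importlib.util.find_spec` (the non-existent module name below is just an example):

```python
import importlib.util

def is_installed(module_name):
    # find_spec returns None when the module cannot be found
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))      # True: json is in the standard library
print(is_installed("requests"))  # True only if you have installed requests
```

Checking this way is safer than a bare `import requests` followed by a crash, and it is handy in scripts that want to print a friendly "please pip install requests" message.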

2. The function

def get_html(url):  # fetch the page source
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/52.0.2743.116 Safari/537.36'
    }  # pretend to be a browser
    response = requests.get(url, headers=headers)  # request the site
    response.encoding = response.apparent_encoding  # set the character encoding
    html = response.text  # the page source as text
    return html  # return the page source

We use the get method of the requests library to obtain the source code of the web page. Note that this is the raw source: if you want to extract specific information from it, further processing is needed.
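As an illustration of that further processing, here is a sketch that pulls the page title out of the raw source using only the standard library's html.parser (the sample HTML string is made up for the example; in practice you would feed it the `html` returned by `get_html`):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

sample_html = "<html><head><title>百度一下</title></head><body></body></html>"
p = TitleParser()
p.feed(sample_html)
print(p.title)  # 百度一下
```

For anything more complex than a single tag, a dedicated parsing library such as BeautifulSoup is the usual next step.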

Three, More on the requests library

See my GitHub Python crawler notes (continuously updated).


Origin: blog.csdn.net/Bob_ganxin/article/details/108720602