Getting to know Python crawlers: scrape all of my blog's article information in 5 minutes

Steps for crawling the site:

  1. Set the crawl target
    Target website: my own blog, https://blog.csdn.net/csyifanZhang/article/list/1
    Target data: every blog article - link, title, tags
  2. Analyze the target site
    Pages to crawl: https://blog.csdn.net/csyifanZhang/article/list/1 through https://blog.csdn.net/csyifanZhang/article/list/5
    Data to extract: the hyperlinks and titles under the h2 elements with class entry-title, plus the tag list
  3. Batch-download the HTML
    Download with the requests library; official docs: https://2.python-requests.org//zh_CN/latest/user/quickstart.html
  4. Parse the HTML to extract the target data
    Parse with the BeautifulSoup library; official docs: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
  5. Store the result data
    The data can be serialized with json.dumps and saved

First, an introduction to the Python libraries

1: The requests library

The requests library provides seven methods, described below:

requests.request(): constructs a request; the base method that underpins all of the methods below

requests.get(): the main method for fetching an HTML page, corresponding to HTTP GET

requests.head(): fetches the header information of an HTML page, corresponding to HTTP HEAD

requests.post(): submits a POST request to an HTML page, corresponding to HTTP POST

requests.put(): submits a PUT request to an HTML page, corresponding to HTTP PUT

requests.patch(): submits a partial-modification request to an HTML page, corresponding to HTTP PATCH

requests.delete(): submits a delete request to an HTML page, corresponding to HTTP DELETE
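
A minimal example of the most common of these, requests.get, fetching the first listing page of the blog (the 10-second timeout is an illustrative choice, not from the original post):

import requests

# Fetch one listing page; the timeout guards against a hung connection
r = requests.get("https://blog.csdn.net/csyifanZhang/article/list/1", timeout=10)
print(r.status_code)  # 200 on success
print(r.text[:200])   # first 200 characters of the returned HTML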

2: BeautifulSoup

BeautifulSoup: "We called him Tortoise because he taught us."

The quote plays on the homophones "Tortoise" and "taught us". The name BeautifulSoup itself comes from "Alice in Wonderland", where the Mock Turtle sings of "beautiful soup" (turtle soup). The illustration on the official site is likewise taken from "Alice in Wonderland", so the author presumably really likes the novel and drew the name from it.

So, what is BeautifulSoup, really?

BeautifulSoup is a Python module built specifically for parsing HTML/XML, which makes it very well suited to projects such as crawlers. The following features are what make it powerful:

  • Beautiful Soup provides a handful of simple, Python-style functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need to capture; because it is simple, a complete application takes very little code to write.
  • Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings at all, unless the document does not declare one; in that case Beautiful Soup cannot detect the encoding automatically, and you only have to state the original encoding.
  • Beautiful Soup has become as excellent a Python parser interface as lxml and html6lib, letting you flexibly choose among different parsing strategies or trade for raw speed.
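
A tiny sketch of that parse-tree navigation (the HTML snippet here is made up purely for illustration):

from bs4 import BeautifulSoup

html = '<html><body><h4><a href="/post/1">First post</a></h4></body></html>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find("a")               # first <a> tag in the parse tree
print(a["href"])                 # -> /post/1
print(a.get_text())              # -> First post
print(len(soup.find_all("h4")))  # -> 1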

Beautiful Soup basic usage

Downloading the HTML for later parsing


import requests as req

def downLoadHtmls():
    # Download the HTML of every listing page, for later parsing
    htmls = []
    for idx in range(5):  # my blog has only 5 listing pages in total
        # Build the URL of each listing page
        url = f"https://blog.csdn.net/csyifanZhang/article/list/{idx+1}"
        print("crawl html: ", url)
        # Request the corresponding HTML
        r = req.get(url)
        if r.status_code != 200:  # 200 is the success status code
            raise Exception("error")
        # Add the page to the result list
        htmls.append(r.text)
    return htmls
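
Real sites sometimes reject bare requests, so in practice it can help to send a browser-like User-Agent and a timeout. A hedged variant of the same download step (the header value and timeout are illustrative assumptions, not part of the original code):

import requests as req

# Illustrative: a browser-like User-Agent plus a timeout makes the download more robust
HEADERS = {"User-Agent": "Mozilla/5.0"}

def downloadPage(url):
    r = req.get(url, headers=HEADERS, timeout=10)
    r.raise_for_status()  # raise for any non-2xx status instead of checking manually
    return r.text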

Parsing a single HTML page

# Parse a single HTML page
from bs4 import BeautifulSoup

def parseHtml(html):
    """
    Parse a single HTML page and extract the data.
    @return list of {"link", "title", [tags]}
    """

    # Parse with the built-in html.parser
    soup = BeautifulSoup(html, 'html.parser')
    # The page title lives in the <title> tag inside <head>
    title = soup.find("head").find("title").get_text()
    # Every article entry sits inside an <h4> tag
    articles = soup.find_all("h4")
    datas = []
    for article in articles:
        # The hyperlink is the <a> tag inside the <h4>
        title_node = article.find("a")
        # Pull out the href attribute
        link = title_node["href"]
        # The tag text lives inside the same <a> node, so reuse it
        tags = [title_node.get_text()]
        datas.append(
            {"title": title, "link": link, "tags": tags}
        )
    return datas

Partial results (see the screenshot in the original post).

Saving the results

import json

htmls = downLoadHtmls()
# Parse the first listing page and write one JSON record per line
datas = parseHtml(htmls[0])
with open("data.text", "w") as fout:
    for data in datas:
        fout.write(json.dumps(data, ensure_ascii=False)+"\n")
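
To verify the file, each line can be read back with json.loads, mirroring the write loop above (a minimal sketch):

import json

with open("data.text", "r") as fin:
    records = [json.loads(line) for line in fin]
print(len(records), "records restored")
print(records[0]["title"])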


Origin blog.csdn.net/csyifanZhang/article/details/105276795