Steps to crawl the site:
- Set the crawling goals
  target website: my own blog, https://blog.csdn.net/csyifanZhang/article/list/1
  target data: all blog articles - link, title, tags
- Analyze the target site
  pages to crawl: https://blog.csdn.net/csyifanZhang/article/list/1 ~ https://blog.csdn.net/csyifanZhang/article/list/5
  data to crawl: the hyperlinks and titles under the heading elements of each article entry, plus the tag list
- Batch-download the HTML
  download with the requests library, official docs: https://2.python-requests.org//zh_CN/latest/user/quickstart.html
- Parse the HTML to extract the target data
  parse with the BeautifulSoup library, official docs: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
- Store the resulting data
  the data can be serialized with json.dumps for storage
First, an introduction to the Python libraries
1: the requests library
The requests library provides seven methods, described below:
requests.request(): constructs a request; the base method that underpins all of the methods below
requests.get(): the main method for fetching an HTML page; corresponds to HTTP GET
requests.head(): fetches an HTML page's header information; corresponds to HTTP HEAD
requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch(): submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete(): submits a delete request to an HTML page; corresponds to HTTP DELETE
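The mapping between these methods and HTTP verbs can be illustrated offline (a sketch: the request below is prepared but never sent, and the query parameter is invented for demonstration):

```python
import requests

# Build a GET request without sending it, to inspect the verb and final URL.
# The query parameter "t" is purely illustrative.
prepared = requests.Request(
    "GET",
    "https://blog.csdn.net/csyifanZhang/article/list/1",
    params={"t": "1"},
).prepare()

print(prepared.method)  # GET
print(prepared.url)     # the target URL with ?t=1 appended
```

Calling `requests.get(url)` performs exactly this preparation and then sends the request over the network.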
2: BeautifulSoup
BeautifulSoup: "We called him Tortoise because he taught us."
The name BeautifulSoup comes from "Alice's Adventures in Wonderland", where it refers to turtle soup; the line above is the Mock Turtle's pun on "Tortoise" and "taught us". The illustration on the official site is also taken from the novel, so presumably the library's author is fond of it, hence the name.
So what is BeautifulSoup, really?
BeautifulSoup is a Python module built specifically for parsing HTML/XML, which makes it a great fit for projects such as crawlers. Its strengths are:
- Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to capture; because it is so simple, a complete application takes very little code.
- Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings, unless the document does not declare one and Beautiful Soup cannot detect it automatically; in that case you only need to state the original encoding.
- Beautiful Soup works with excellent parsers such as lxml and html5lib, giving users the flexibility to choose between different parsing strategies and raw speed.
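These points can be seen on a tiny inline document (the HTML snippet below is invented; no network access is needed):

```python
from bs4 import BeautifulSoup

# A one-line HTML fragment standing in for a real page.
html = '<h4><a href="https://example.com/post/1">Hello Soup</a></h4>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find("a")        # navigate the parse tree to the hyperlink
print(a["href"])          # https://example.com/post/1
print(a.get_text())       # Hello Soup
```

The same `find`/`find_all` calls are all the crawler below needs.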
Downloading the HTML for later parsing
import requests as req

def downLoadHtmls():
    # Download the HTML of every list page for later parsing
    htmls = []
    for idx in range(5):  # my blog has only 5 list pages in total
        # build the URL of the (idx+1)-th list page
        url = f"https://blog.csdn.net/csyifanZhang/article/list/{idx+1}"
        print("crawl html: ", url)
        # request the corresponding HTML
        r = req.get(url)
        if r.status_code != 200:  # 200 is the success status code
            raise Exception("error")
        # add the page to the result list
        htmls.append(r.text)
    return htmls
Parsing a single HTML page
from bs4 import BeautifulSoup

# parse a single html page
def parseHtml(html):
    """
    Parse a single HTML page and extract the data.
    @return list of {"title", "link", "tags"} dicts
    """
    # parse with the built-in html.parser
    soup = BeautifulSoup(html, 'html.parser')
    # the page title lives in the head's title tag
    title = soup.find("head").find("title").get_text()
    # every article title sits inside an h4 tag
    articles = soup.find_all("h4")
    datas = []
    for article in articles:
        # find the hyperlink
        title_node = article.find("a")
        if title_node is None:
            continue  # skip h4 elements that contain no link
        # pull out the href attribute
        link = title_node["href"]
        # the sub-title text inside the h4's a tag doubles as the tag list
        tags = [title_node.get_text().strip()]
        datas.append(
            {"title": title, "link": link, "tags": tags}
        )
    return datas
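To check what this extraction produces without hitting the network, the same logic can be run on a handcrafted page (the HTML below and its links are invented stand-ins for a real CSDN list page):

```python
from bs4 import BeautifulSoup

# Invented stand-in for one blog list page.
html = """
<html><head><title>My Blog - page 1</title></head><body>
<h4><a href="https://blog.csdn.net/csyifanZhang/article/details/1">First post</a></h4>
<h4><a href="https://blog.csdn.net/csyifanZhang/article/details/2">Second post</a></h4>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
page_title = soup.find("head").find("title").get_text()
datas = []
for article in soup.find_all("h4"):
    node = article.find("a")
    if node is None:
        continue
    datas.append({"title": page_title,
                  "link": node["href"],
                  "tags": [node.get_text().strip()]})

print(len(datas))  # 2
```

Each dict carries the page title, the article link, and the link text as its tag list, matching the shape written out below.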
Partial results (screenshot in the original post):
Saving the data
import json

htmls = downLoadHtmls()
datas = parseHtml(htmls[0])
with open("data.text", "w", encoding="utf-8") as fout:
    for data in datas:
        fout.write(json.dumps(data, ensure_ascii=False)+"\n")
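The file written above contains one JSON object per line ("JSON Lines"), so it can be read back with json.loads line by line. A round-trip sketch using an invented record and a temporary directory instead of data.text:

```python
import json
import os
import tempfile

# One invented record in the same shape the crawler writes.
records = [{"title": "demo", "link": "https://example.com/1", "tags": ["示例"]}]

path = os.path.join(tempfile.mkdtemp(), "data.text")
with open(path, "w", encoding="utf-8") as fout:
    for rec in records:
        fout.write(json.dumps(rec, ensure_ascii=False) + "\n")

with open(path, encoding="utf-8") as fin:
    loaded = [json.loads(line) for line in fin]

print(loaded == records)  # True
```

Note that ensure_ascii=False keeps Chinese text readable in the file, which is why the file must be opened with an explicit utf-8 encoding.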