---
title: CSDN article crawling
date: 2019-06-09 13:17:26
tags:
- CSDN
- Python
category: Technology
---

First published at: 喜欢二福的沧月君's personal blog
## Plan

Some time ago I set up a new personal blog, and I wanted to migrate my CSDN posts over to it. CSDN offers no one-click migration, so I decided to simply crawl the articles and repost them.

Time budget: 3 hours

Expected result: all blog posts saved locally
## Implementation process

- Find the article list pages and extract each article's URL.
- Parse each article page and extract its content.
- Save the content locally.
- Try to preserve the article's styling (see the sketch after this list).
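
One way to preserve styling (the final code below saves plain text only) is to write out the article's inner HTML instead of its text. A minimal sketch, assuming a pyquery `content` object extracted as in the Coding section; the `save_as_html` name is my own:

```python
def save_as_html(title, content):
    # Wrap the article's inner HTML in a minimal page shell.
    path = "article/" + title + ".html"
    with open(path, 'w', encoding='utf-8') as f:
        f.write("<html><head><meta charset='utf-8'></head><body>"
                + content.html() + "</body></html>")
```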
## Technology

Implemented in Python, using the requests library to fetch pages and pyquery to parse them.
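
For readers new to pyquery: it wraps an HTML document and lets you query it with jQuery-style CSS selectors. A tiny self-contained example (the HTML string is made up for illustration):

```python
from pyquery import PyQuery as pq

# Build a document from an HTML string and query it with a CSS selector.
doc = pq('<div class="post"><h1 class="title">Hello</h1></div>')
print(doc('.title').text())  # prints: Hello
```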
## Coding

- Analyze the article page and extract its content:
```python
article = doc('.blog-content-box')
# Article title
title = article('.title-article').text()
# Article body
content = article('.article_content')
```
- Save the article to disk:
```python
path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
with open(path, 'a', encoding='utf-8') as file:
    file.write(title + '\n' + content.text())
```
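
One caveat: the title goes straight into a Windows file path, so a title containing characters such as `?`, `:` or `/` would make `open` fail. A hedged sketch of a fix (the `sanitize_filename` helper is my own addition, not part of the original code):

```python
import re

def sanitize_filename(title):
    # Replace characters that are invalid in Windows file names.
    return re.sub(r'[\\/:*?"<>|]', '_', title)
```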
- Extract the article URLs from a list page:
```python
urls = doc('.article-list .content a')
return urls
```
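
The return value is a pyquery collection of `<a>` elements; as the complete code below shows, it can be iterated with `.items()`, reading each link via `.attr('href')`:

```python
for a in urls.items():
    print(a.attr('href'))
```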
- Crawl the list pages one by one:
```python
for i in range(3):
    print(i)
    main(offset=i + 1)
```
## Code integration

The complete code:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/6/8 11:00 PM
# @Author  : 喜欢二福的沧月君([email protected])
# @FileName: CSDN.py
# @Software: PyCharm
import requests
from pyquery import PyQuery as pq


def find_html_content(url):
    """Fetch a page and return its HTML text."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    html = requests.get(url, headers=headers).text
    return html


def read_and_write_blog(html):
    """Parse an article page and save its title and text content."""
    doc = pq(html)
    article = doc('.blog-content-box')
    # Article title
    title = article('.title-article').text()
    # Article body
    content = article('.article_content')
    try:
        path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
        with open(path, 'a', encoding='utf-8') as file:
            file.write(title + '\n' + content.text())
    except Exception:
        print("Saving failed")


def geturls(url):
    """Return the <a> elements of every article on a list page."""
    content = find_html_content(url)
    doc = pq(content)
    urls = doc('.article-list .content a')
    return urls


def main(offset):
    url = 'blog list address goes here' + str(offset)  # placeholder kept from the original
    urls = geturls(url)
    for a in urls.items():
        a_url = a.attr('href')
        print(a_url)
        html = find_html_content(a_url)
        read_and_write_blog(html)


if __name__ == '__main__':
    for i in range(3):
        print(i)
        main(offset=i + 1)
```
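
A possible refinement, not in the original: a short pause between list pages keeps the crawler polite and less likely to be throttled by CSDN.

```python
import time

for i in range(3):
    print(i)
    main(offset=i + 1)
    time.sleep(1)  # pause between list pages
```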