CSDN article crawling

This article was first published on 喜欢二福的沧月君's personal blog.


title: CSDN article crawling
date: 2019-06-09 13:17:26
tags:
  - CSDN
  - Python
category: Technology
---

Plan

Some time ago I set up a new personal blog, and I wanted to migrate my CSDN articles over to it. There is no one-click migration feature, so I decided to crawl the articles directly and then repost them.
Time budget: 3 hours
Expected result: the blog posts saved locally

Implementation process

  1. Fetch the article list pages and extract each article's URL.
  2. Crawl each article page and parse out its content.
  3. Save the content locally.
  4. Try to preserve the article's styling (a sketch for this is included after the complete code).

Technology used

The crawler is written in Python; pages are fetched with requests and parsed with the pyquery library.
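As a quick reference, here is a minimal, self-contained pyquery sketch; the HTML snippet and class names simply mirror the CSDN page structure assumed later in this post:

from pyquery import PyQuery as pq

# A tiny HTML snippet shaped like the CSDN article markup used below
html = '''
<div class="blog-content-box">
  <h1 class="title-article">Hello CSDN</h1>
  <div class="article_content"><p>First paragraph.</p></div>
</div>
'''

doc = pq(html)                        # parse the HTML string
title = doc('.title-article').text()  # CSS selectors, jQuery style
body = doc('.article_content').text()
print(title)  # Hello CSDN
print(body)   # First paragraph.

pyquery accepts an HTML string and exposes jQuery-style CSS selectors, which is all the crawler below relies on.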

Coding

  1. Analyze the article page; the code that extracts the content is as follows:
   article = doc('.blog-content-box')
   # Article title
   title = article('.title-article').text()
   # Article content
   content = article('.article_content')
  2. Save the articles:
   path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
   with open(path, 'a', encoding='utf-8') as file:
       file.write(title + '\n' + content.text())
  3. Extract the article URLs:
   urls = doc('.article-list .content a')
   return urls
  4. Crawl the article list page by page:
   for i in range(3):
       print(i)
       main(offset = i+1)
  5. Code integration

    The complete code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#@Time    :2019/6/8 11:00 PM
#@Author  :喜欢二福的沧月君([email protected])
#@FileName: CSDN.py

#@Software: PyCharm

import requests
from pyquery import PyQuery as pq

def find_html_content(url):
    # Fetch a page and return its HTML text
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    html = requests.get(url, headers=headers).text
    return html
def read_and_write_blog(html):
    # Parse an article page and save its title and text locally
    doc = pq(html)

    article = doc('.blog-content-box')
    # Article title
    title = article('.title-article').text()
    # Article content
    content = article('.article_content')

    try:
        path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
        with open(path, 'a', encoding='utf-8') as file:
            file.write(title + '\n' + content.text())
    except Exception:
        print("Failed to save the article")


def geturls(url):
    # Collect the links to individual articles from one list page
    content = find_html_content(url)
    doc = pq(content)
    urls = doc('.article-list .content a')
    return urls

def main(offset):
    # Placeholder: replace with the CSDN article-list URL (paged by offset)
    url = 'blog-list-url-goes-here' + str(offset)
    urls = geturls(url)
    for a in urls.items():
        a_url = a.attr('href')
        print(a_url)
        html = find_html_content(a_url)
        read_and_write_blog(html)

if __name__ == '__main__':
    for i in range(3):
        print(i)
        main(offset = i+1)
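Step 4 of the plan (preserving the article's styling) is not covered by the script above, because content.text() discards the markup. Below is a minimal sketch of one possible extension, not part of the original script: it keeps the inner HTML of .article_content and strips characters that Windows does not allow in file names from the title. The save_as_html helper name and the output directory are assumptions for illustration.

import os
import re

def save_as_html(title, content,
                 out_dir="F:/python-project/SpiderLearner/CSDNblogSpider/article/"):
    # Remove characters that are not valid in Windows file names
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'
    path = os.path.join(out_dir, safe_title + '.html')
    # content is the pyquery selection for '.article_content';
    # .html() returns its inner HTML, which keeps formatting that .text() drops
    body = content.html() or ''
    with open(path, 'w', encoding='utf-8') as file:
        file.write('<h1>{}</h1>\n<div class="article_content">{}</div>'.format(title, body))

With such a helper, the try block in read_and_write_blog could call save_as_html(title, content) instead of writing plain text.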

Origin www.cnblogs.com/miria-486/p/10993272.html