07 Hot-word classification, analysis, and interpretation in the information field. Step 1: crawling recommended news content from the cnblogs news site

Functional requirements:

      1. Data acquisition: periodically crawl hot words related to the information field from the web.

      2. Data cleaning: clean the hot-word data, then automatically classify and count it to generate a directory of hot words in the information field.

      3. Hot-word explanation: automatically attach a Chinese explanation to each hot-word term (referring to Baidu Baike or Wikipedia).

      4. Hot-word citations: mark each hot word with the recent news articles that reference it and generate a hyperlinked directory the user can click through.

      5. Data visualization: ① display word heat with a word cloud or word/character chart (a minimal sketch follows this list); ② display how closely hot words are related with a relationship graph.

      6. Data reporting: export all hot-word directories and explanations as a report in Word format.
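
As a rough illustration of requirement 5 ①, the sketch below shows one way the crawled text could be turned into a word cloud. It is not part of this post's implementation; it assumes the third-party packages jieba and wordcloud are installed, that the crawler output file is news.txt (as produced below), and that a Chinese-capable font file such as simhei.ttf is available.

import jieba
from wordcloud import WordCloud

# Read the crawled articles produced by the crawler in this post
with open("news.txt", encoding="utf-8") as f:
    text = f.read()

# Segment the Chinese text into words so WordCloud can count their frequencies
segmented = " ".join(jieba.cut(text))

wc = WordCloud(font_path="simhei.ttf",   # a font that can render Chinese glyphs (assumed to exist)
               width=800, height=600, background_color="white")
wc.generate(segmented)                   # build word frequencies from the space-separated text
wc.to_file("hotwords_cloud.png")         # save the rendered word cloud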

This post completes the first part of the functionality: crawling the titles and content of the recommended news posts on the cnblogs news site and writing them into a text file.

Approach: observe the pattern between successive pages. The page link changes simply by changing the page number in the URL, and the href of each news link on the list page is the address of the corresponding news detail page. So the crawler loops over the collected href values to reach the specific address of each article. The specific code is as follows:

import requests
from lxml import etree
# The imports below are not used in this step; they are reserved for later steps
# of the project (database storage, scheduling, etc.).
import time
import pymysql
import datetime
import urllib
import json


def getDetail(href, title):
    """Fetch one news detail page and append "title content" to news.txt."""
    print(title)
    # Cookie copied from a logged-in browser session; replace it with your own,
    # otherwise frequent requests get redirected to the login page.
    head = {
        'Cookie': 'Cookies=CfDJ8Nf-Z6tqUPlNrwu2nvfTJEgfH-Wr7LrYHIrX6zFY2UqlCesxMAsEz9JpAIbaPlpJgugnPrXvs5KuTOPnzbk1pa_VZIVlfx1x5ufN55Z8sb63ACHlNKd4JMqI93TE2ONBD5KSWd-ryP2Tq1WfI9e_uTiJIIO9vlm54pfLY0fIReGGtqJkQ5E90ahfHtJeDTgM1RHXRieqriLUIXRciu-3QYwk8x5vLZfJIEUMO5g_seeG6G6FW2kbd6Uw3BfRkkIi-g2O_LSlBqj0DdbJFlNmd-TnPmckz5AENnX9f3SPVVhfmg7zINi4G2SSUcYWSvtVqdUtQ8o9vbBKosXoFOTUNH17VXX_IX8V0ODbs8qQfCkPFaDjS8RWSRkW9KDPOmXyqrtHvRXgGRydee52XJ1N8V-Mu0atT0zMwqzblDj2PDahV1R0Y7nBvzIy8uit15vGtR_r0gRFmFSt3ftTkk63zZixWgK7uZ5BsCMZJdhqpMSgLkDETjau0Qe1vqtLvDGOuBZBkznlzmTa-oZ7D6LrDhHJubRpCICUGRb5SB6WcbaxwOqE1um40OSyila-PgwySA; .CNBlogsCookie=9F86E25644BC936FAE04158D0531CF8F01D604657A302F62BA92F3EB0D7BE317FDE7525EFE154787036095256D48863066CB19BB91ADDA7932BCC3A2B13F6F098FC62FDA781E0FBDC55280B73670A89AE57E1CA5E1269FC05B8FFA0DD6048B0363AF0F08; _gid=GA1.2.1435993629.1581088378; __utmc=66375729; __utmz=66375729.1581151594.2.2.utmcsr=cnblogs.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=66375729.617656226.1563849568.1581151593.1581161200.3; __utmb=66375729.6.10.1581161200'
    }
    url2 = "https://news.cnblogs.com" + href
    r2 = requests.get(url2, headers=head)
    html = r2.content.decode("utf-8")
    html1 = etree.HTML(html)
    # The article body is inside <div id="news_body">
    content1 = html1.xpath('//div[@id="news_body"]')
    if len(content1) == 0:
        print("Exception: news_body not found")
    else:
        # string(.) concatenates all text nodes inside the div
        content2 = content1[0].xpath('string(.)')
        # Remove whitespace so each article takes exactly one line in the file
        content = content2.replace('\r', '').replace('\t', '').replace('\n', '').replace(' ', '')
        print(content)
        with open("news.txt", "a+", encoding='utf-8') as f:
            f.write(title + ' ' + content + '\n')


if __name__ == '__main__':
    # Crawl the first 100 pages of recommended news
    for i in range(0, 100):
        print("***********************************")
        print(i)
        page = i + 1
        url = "https://news.cnblogs.com/n/recommend?page=" + str(page)
        r = requests.get(url)
        print("Status code:", r.status_code)
        html = r.content.decode("utf-8")
        html1 = etree.HTML(html)
        # Each list entry's title link sits in <h2 class="news_entry">
        href = html1.xpath('//h2[@class="news_entry"]/a/@href')
        title = html1.xpath('//h2[@class="news_entry"]/a/text()')
        print(len(href))
        # Loop over the entries actually found on this page instead of a hard-coded count
        for h, t in zip(href, title):
            getDetail(h, t)
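
news.txt now stores one article per line in the form "title content", with the title and body separated by the first space. A small sketch (not from the original post, assuming only that file format) of how a later cleaning step could load the data back:

# Read news.txt back in: split each line on the first space only
records = []
with open("news.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        title, _, content = line.partition(" ")
        records.append((title, content))

print(len(records), "articles loaded")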
       

  

The crawl results are shown below (partial display; each line of data is one crawled article, with the title and the content separated by a space):

(Screenshot of the crawled data in news.txt.)

 Summary: I ran into anti-crawling measures during the crawl. Once too many consecutive visits are detected, the site jumps straight to the login page; after looking it up, adding the Cookie to the request header solved the problem.
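
A minimal sketch of that fix, assuming a cookie string copied from your own logged-in cnblogs session; the User-Agent header and the one-second pause between pages are extra precautions of my own, not something the code above relies on:

import time
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",   # plain browser-like user agent
    # Replace with the Cookie value from your own logged-in session
    "Cookie": "<cookie string copied from the browser>",
})

for page in range(1, 101):
    url = "https://news.cnblogs.com/n/recommend?page=" + str(page)
    r = session.get(url)
    # ...parse r.content with lxml exactly as in the crawler above...
    time.sleep(1)   # pause so consecutive requests look less like a burst of crawling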

 


Origin www.cnblogs.com/xcl666/p/12285733.html