Functional requirements:
1. Data acquisition: periodically crawl hot words in the information-technology field from the web.
2. Data cleaning: clean the collected hot-word data, then automatically classify and count it to generate a hot-word directory for the information field, with each word's heat.
3. Hot-word explanation: automatically attach a Chinese explanation to each hot-word term (drawing on Baidu Encyclopedia or Wikipedia).
4. Hot-word citations: link each hot word to recent news articles that reference it, generating a hyperlinked directory the user can click through.
5. Data visualization: (1) show each word's heat with a word cloud; (2) show how closely hot words are related with a relationship diagram.
6. Data reporting: export the full hot-word directory and glossary as a report in Word format.
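The counting step in requirement 2 can be sketched with the standard library alone. This is a minimal illustration, not the project's actual implementation; the sample terms are invented:

```python
from collections import Counter

# Given the list of terms extracted from crawled titles/bodies,
# tally each term's frequency ("heat") and sort the hottest first.
words = ["cloud", "AI", "cloud", "5G", "AI", "cloud"]
heat = Counter(words)
directory = heat.most_common()  # [(term, count), ...] sorted by descending heat
print(directory)  # → [('cloud', 3), ('AI', 2), ('5G', 1)]
```

`most_common()` already returns the (term, count) pairs in descending order, which maps directly onto the "hot-word directory" the requirement asks for.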
As a first step toward this functionality, I crawl the titles and contents of recommended news posts on cnblogs (博客园) into a text file.
Idea: observing the listing pages shows that the page number appears in the URL, so changing the number switches pages. Each entry's href attribute is the address of the corresponding news detail page, so the crawler loops over the listing pages, collects the href links, and then fetches each article at its specific address. The code is as follows:
import requests
from lxml import etree

def getDetail(href, title):
    print(title)
    # The Cookie header (reconstructed from the original post) is what lets the
    # crawler get past the "too many consecutive accesses" login redirect.
    head = {
        'Cookie': 'CfDJ8Nf-Z6tqUPlNrwu2nvfTJEgfH-Wr7LrYHIrX6zFY2UqlCesxMAsEz9JpAIbaPlpJgugnPrXvs5KuTOPnzbk1pa_VZIVlfx1x5ufN55Z8sb63ACHlNKd4JMqI93TE2ONBD5KSWd-ryP2Tq1WfI9e_uTiJIIO9vlm54pfLY0fIReGGtqJkQ5E90ahfHtJeDTgM1RHXRieqriLUIXRciu-3QYwk8x5vLZfJIEUMO5g_seeG6G6FW2kbd6Uw3BfRkkIi-g2O_LSlBqj0DdbJFlNmd-TnPmckz5AENnX9f3SPVVhfmg7zINi4G2SSUcYWSvtVqdUtQ8o9vbBKosXoFOTUNH17VXX_IX8V0ODbs8qQfCkPFaDjS8RWSRkW9KDPOmXyqrtHvRXgGRydee52XJ1N8V-Mu0atT0zMwqzblDj2PDahV1R0Y7nBvzIy8uit15vGtR_r0gRFmFSt3ftTkk63zZixWgK7uZ5BsCMZJdhqpMSgLkDETjau0Qe1vqtLvDGOuBZBkznlzmTa-oZ7D6LrDhHJubRpCICUGRb5SB6WcbaxwOqE1um40OSyila-PgwySA; .CNBlogsCookie=9F86E25644BC936FAE04158D0531CF8F01D604657A302F62BA92F3EB0D7BE317FDE7525EFE154787036095256D48863066CB19BB91ADDA7932BCC3A2B13F6F098FC62FDA781E0FBDC55280B73670A89AE57E1CA5E1269FC05B8FFA0DD6048B0363AF0F08; _gid=GA1.2.1435993629.1581088378; __utmc=66375729; __utmz=66375729.1581151594.2.2.utmcsr=cnblogs.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=66375729.617656226.1563849568.1581151593.1581161200.3; __utmb=66375729.6.10.1581161200'
    }
    # The entry's href is site-relative, so prepend the site root.
    url2 = "https://news.cnblogs.com" + href
    r2 = requests.get(url2, headers=head)
    html = r2.content.decode("utf-8")
    html1 = etree.HTML(html)
    content1 = html1.xpath('//div[@id="news_body"]')
    if len(content1) == 0:
        print("Exception: article body not found")
    else:
        # string(.) concatenates all the text inside the body node.
        content2 = content1[0].xpath('string(.)')
        content = content2.replace('\r', '').replace('\t', '').replace('\n', '').replace(' ', '')
        print(content)
        with open("news.txt", "a+", encoding='utf-8') as f:
            f.write(title + ' ' + content + '\n')

if __name__ == '__main__':
    for i in range(0, 100):
        print("***********************************")
        print(i)
        page = i + 1
        url = "https://news.cnblogs.com/n/recommend?page=" + str(page)
        r = requests.get(url)
        html = r.content.decode("utf-8")
        print("Status code:", r.status_code)
        html1 = etree.HTML(html)
        href = html1.xpath('//h2[@class="news_entry"]/a/@href')
        title = html1.xpath('//h2[@class="news_entry"]/a/text()')
        print(len(href))
        # Each recommend page lists 18 entries.
        for a in range(0, 18):
            getDetail(href[a], title[a])
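The chain of replace calls in getDetail can be factored into one small helper. A minimal sketch; the helper name is my own, and stripping the full-width space (U+3000), common in Chinese article bodies, is an added assumption not present in the original script:

```python
def clean_text(raw):
    """Strip the characters the crawler removes before saving an article:
    carriage returns, tabs, newlines and spaces. The full-width space
    (U+3000) is an extra assumption beyond the original chain."""
    for ch in ('\r', '\t', '\n', ' ', '\u3000'):
        raw = raw.replace(ch, '')
    return raw

print(clean_text("Hello \r\n\tWorld"))  # → HelloWorld
```

With this helper, the write line becomes `f.write(title + ' ' + clean_text(content2) + '\n')`, and adding or removing stripped characters is a one-line change.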
The crawling results are shown below (an excerpt; in each row the title and content are separated by a space, one crawled article per row):
Summary: one pitfall came up during crawling: the site detects too many consecutive accesses and redirects straight to the login page. After some investigation, adding a Cookie to the request header solved the problem.
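One way the cookie fix could be packaged, sketched under my own assumptions (the function names, the one-second delay, and the session structure are not from the original post): a single requests.Session carries the Cookie header on every request, and a delay between pages reduces the chance of tripping the consecutive-access check.

```python
import time
import requests

BASE = "https://news.cnblogs.com/n/recommend"

def page_url(page):
    # Build the listing URL for a 1-based page number, as in the main loop.
    return BASE + "?page=" + str(page)

def crawl(pages, cookie, delay=1.0):
    session = requests.Session()
    # Set the Cookie once; the Session reuses it on every request it makes.
    session.headers["Cookie"] = cookie
    for page in range(1, pages + 1):
        r = session.get(page_url(page))
        print(page, r.status_code)
        time.sleep(delay)  # throttle so the site is less likely to redirect to login

# Usage (requires network access and a valid cookie string):
#   crawl(3, cookie="<paste the Cookie string here>")
```

A Session also captures any Set-Cookie headers the server sends back, so cookies refreshed mid-crawl are carried forward automatically.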