Python crawler (1)

        What is a web crawler? A crawler can be understood as a program that scrapes information from the web according to a set of rules; the search engines we use every day are, at their core, web crawlers. Because Python makes writing crawlers efficient, many crawlers are developed in Python.

        A crawler simulates the browser's interaction with the server automatically. In ordinary browsing, you open the browser, the browser sends a request to the server, the server responds, and the browser renders the page. We can reproduce this process in code. Taking downloading a novel as an example, the code framework is as follows (the regular expressions shown are illustrative placeholders and must be adapted to the target page's actual HTML):

import requests
import re

# Table-of-contents page of the novel
url = 'http://www.jingcaiyuedu.com/book/15401.html'

# Fetch the page source and set the encoding so Chinese text displays correctly
response = requests.get(url)
response.encoding = 'utf-8'
html = response.text

# NOTE: the original regular expressions were lost when this article was published;
# the patterns below are placeholders and must be adjusted to the site's actual HTML.
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]

fb = open('%s.txt' % title, 'w', encoding='utf-8')

# Grab the block that holds the chapter list, then the (url, title) pairs inside it
download = re.findall(r'<dl id="list">(.*?)</dl>', html, re.S)[0]
chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', download)

for chapter_info in chapter_info_list:
    chapter_url, chapter_title = chapter_info
    # Chapter links on the list page are relative, so prepend the domain
    chapter_url = 'http://www.jingcaiyuedu.com%s' % chapter_url

    # Request each chapter page and extract its text
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="htmlContent">(.*?)</div>', chapter_html, re.S)[0]

    # Clean leftover HTML entities and tags out of the chapter text
    chapter_content = chapter_content.replace('&nbsp;', '')
    chapter_content = chapter_content.replace('<br/>', '')

    fb.write(chapter_title)
    fb.write(chapter_content)
    fb.write('\n')

    print(chapter_url)

fb.close()

          The requests module provides get, post, and other methods for sending network requests. The idea: first fetch the page source of this url into response and fix the encoding; then use re.findall with regular expressions to pull out the fields we need, saving the novel's title, the chapter titles, and the chapter content; finally clean the data and write it to a file.
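
          For reference, here is a minimal, self-contained sketch of the requests.get + re.findall pattern described above. The URL and patterns here are stand-ins (example.com), not the novel site used earlier, and only illustrate how findall returns strings or tuples depending on the number of capture groups.

import requests
import re

# Fetch any page; example.com is only a stand-in target for demonstration
response = requests.get('http://example.com')
response.encoding = 'utf-8'
html = response.text

# findall with one capture group returns a list of strings;
# with two capture groups it returns a list of (group1, group2) tuples,
# which is why the chapter loop above can unpack the url and title together.
titles = re.findall(r'<title>(.*?)</title>', html, re.S)
links = re.findall(r'<a href="(.*?)"[^>]*>(.*?)</a>', html, re.S)

print(titles)
for href, text in links:
    print(href, text)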

 
