Python crawler - 1 - Hello

 

1. What is a crawler? A crawler is a program used to scrape data from websites.
2. What are crawlers used for? Crawled data is the raw material that big-data analysis is built on.
3. What language are crawlers usually written in? Python, because Python has a rich set of libraries that support crawler programs.
4. Why is it called a crawler? Probably because a python is a snake, and snakes crawl (laughs).
 
 
The crawler workflow
Get the page -> parse the page -> save the data
 
 
Getting the page: requests, urllib, selenium (simulates a browser); also multithreading, logging in, getting around IP bans, and crawler servers
Parsing the page: re (regular expressions), BeautifulSoup, lxml; also fixing garbled Chinese text
Storing the data: txt, csv, MySQL, MongoDB
For now it is enough just to have seen these technologies; there is no need to rush off and look up how each one works, as they will be introduced later.
The most important thing is the three-step crawler workflow.
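To make the three-step workflow concrete before any of these libraries are introduced, here is a minimal sketch using only the standard library. The "get" step is replaced by a hardcoded HTML string (so no network is needed), the parse step uses re, and the save step writes a csv file; the HTML content and the file name are made up for illustration.

```python
import csv
import re

# "Get the page" -- hardcoded here instead of downloaded, so no network is needed
html = '<html><body><a href="/post/1">First post</a><a href="/post/2">Second post</a></body></html>'

# "Parse the page" -- pull out every link and its text with a regular expression
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# "Save the data" -- write the results to a csv file
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'text'])
    writer.writerows(links)

print(links)
```

In a real crawler the first step downloads the page over the network and the parsing is usually done with BeautifulSoup or lxml rather than raw regular expressions, but the three stages stay the same.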
 
Preparation
Use Python.
Install requests, selenium, bs4 (which includes BeautifulSoup), and lxml.
Installation is not covered here.
 
Getting started

import requests

link = "https://www.cnblogs.com/jawide/"             # URL to access
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}       # pretend to be a regular browser

response = requests.get(link, headers=headers)       # send a request to the website and receive the response

with open('1.html', 'w') as file:         # save the page locally
    file.write(response.text)

requests is a library for sending requests to websites.

link is the URL of the site to crawl.

headers is the request header, used here to disguise the request as coming from a regular browser.

response is the response object received back from the request.

response.text is the content of the whole page.

requests.get() sends a GET request to the website asking for the page.
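One note on the "garbled Chinese text" problem mentioned earlier: it usually happens when the bytes of the response are decoded with the wrong encoding. The snippet below illustrates this offline with plain byte strings (no network involved); with requests you would instead set response.encoding to the page's real encoding before reading response.text.

```python
# Bytes of the Chinese text "你好" (hello), as a GBK-encoded page would send them
raw = "你好".encode('gbk')

# Decoding with the wrong encoding produces mojibake rather than an error
wrong = raw.decode('latin-1')
print(wrong)          # garbled characters

# Decoding with the encoding the page actually used recovers the text
right = raw.decode('gbk')
print(right)          # 你好
```

This is why a page can download without any error yet still display as gibberish: the bytes are fine, only the decoding step chose the wrong encoding.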


Origin www.cnblogs.com/jawide/p/11483421.html