1. What is a crawler? A crawler is a program used to crawl data from websites.
2. What are crawlers used for? Crawling is the foundation of big-data analysis.
3. Can crawlers be written in any language? Yes, but Python is preferred, because Python has a rich set of libraries that support crawler programs.
4. Why are crawlers called "爬虫" (crawling bugs)? Probably because Python in Chinese also means "python snake" (laughs).
The crawler workflow
Obtain the page -> parse the page -> save the data
Obtaining the page: requests, urllib, selenium (simulated browser); more advanced: multithreading, logging in, getting around IP bans, crawling from a server
Parsing the page: re (regular expressions), BeautifulSoup, lxml; fixing garbled Chinese text
Storing the data: txt, csv, MySQL, MongoDB
For now, just skim the technologies listed above; there is no need to exhaustively search for how each one works, as they will be introduced later.
The most important thing is the three-step crawler workflow.
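The three steps above can be sketched in a single short script. This is a minimal sketch, not the article's own code: the "obtain" step is stubbed with a hard-coded HTML string so it runs without network access (a real crawler would use requests.get(url).text), and the regex patterns and data.csv filename are illustrative assumptions.

```python
import csv
import re

# Step 1: obtain the page (stubbed here; a real crawler would fetch it
# with requests.get(url).text)
html = "<html><body><h1>Hello</h1><a href='/post/1'>First post</a></body></html>"

# Step 2: parse the page with a regular expression (re)
title = re.search(r"<h1>(.*?)</h1>", html).group(1)
links = re.findall(r'href=[\'"]([^\'"]+)', html)

# Step 3: save the data to a csv file
with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerow([title] + links)
```

For real pages, regular expressions quickly become fragile; that is why BeautifulSoup and lxml, listed above, are usually preferred for the parsing step.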
Preparation
Use Python.
Install requests, selenium, bs4 (which includes BeautifulSoup), and lxml.
Installation is not covered here.
Start
import requests

link = "https://www.cnblogs.com/jawide/"  # URL to crawl
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}  # pretend to be a browser
response = requests.get(link, headers=headers)  # send a request to the website and receive the response
with open('1.html', 'w') as file:  # save the page locally
    file.write(response.text)
requests is a library used to send requests to websites.
link is the URL of the site to crawl.
headers is the request header; its User-Agent field makes the request look like it comes from a browser.
response is the response object received back from the request.
response.text is the content of the entire page.
requests.get() sends a GET request to the website for the page.