1. Jupyter basic usage
Two cell modes: code and markdown
(1) In code mode you can write Python code directly.
(2) In markdown mode you can write Markdown formatting directly.
(3) Double-click a cell to edit it again.
(4) Shortcut summary:
Insert cell: a (above), b (below)
Delete cell: x (cuts the selected cell)
Switch cell mode: m (markdown), y (code)
Run cell: Shift + Enter
Auto-complete: Tab
Open the help text: Shift + Tab
(5) An ipynb file behaves like a cache: once a cell has run, its results stay in memory, so cells can be executed in any order (the caching mechanism).
2. A second way to open it through Anaconda:
(1) See Fig. 1.
(2) See Fig. 2.
(3) See Fig. 3. The two paths at the bottom of the figure also open the content in the browser.
Opening it this way does not require configuring environment variables.
2. Basic concepts and HTTP review
1. What is a crawler?
The one we use most often is the browser itself.
Concept: a crawler is a program that simulates a browser accessing the Internet and then fetches data from it.
2. Classification of crawlers
(1) General-purpose crawler: fetches the data of entire pages. This is the crawling system behind search engines such as Baidu, 360, and Sogou.
(2) Focused crawler: fetches only specified parts of a page according to specified requirements.
(3) Incremental crawler: monitors a site for data updates and crawls only the newly updated data.
(4) Distributed crawler: covered later, after Scrapy has been explained.
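The difference between fetching a whole page and focusing on part of it can be sketched as follows. This is only a minimal illustration using the standard library: the PAGE string is a made-up page standing in for downloaded data, and LinkExtractor and extract_links are hypothetical names, not part of any crawler framework.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects only the href values of <a> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Made-up page content; a general-purpose crawler would store all of it.
PAGE = (
    "<html><body><h1>News</h1>"
    '<a href="/a.html">A</a><p>text</p><a href="/b.html">B</a>'
    "</body></html>"
)

def extract_links(html):
    """Focused step: keep just the links from the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

print(extract_links(PAGE))  # ['/a.html', '/b.html']
```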
3. Anti-crawling
Anti-crawling mechanism: technical measures or strategies a website adopts to stop crawler programs from crawling its data.
Anti-anti-crawling strategy: techniques a crawler uses to get past the anti-crawling mechanism and obtain the data.
4. Protocols
(1) robots protocol (compliance is optional): an anti-crawling convention that specifies which data may be crawled and which may not. It only works when both sides abide by it.
It is a gentleman's agreement: it stops the well-behaved, not the ill-intentioned.
https://www.taobao.com/robots.txt
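A crawler that wants to respect the robots protocol could check a URL before fetching it. The sketch below uses the standard library's urllib.robotparser on a made-up robots.txt (the ROBOTS_TXT rules, the example.com URLs, and the "MyCrawler" name are assumptions for illustration, not Taobao's actual rules), and it makes no network request.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "https://example.com/private/data"))
```

Nothing forces a crawler to run this check, which is why the protocol is only a convention.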
(2) HTTP (Hypertext Transfer Protocol): a form of data exchange between client and server (learn to summarize it in your own words).
HTTPS: the secure version of HTTP.
In essence, it works much like an exchange of information between people.
- Commonly used header information:
request headers:
- User-Agent: identifies the carrier of the request (a browser or a crawler; a crawler can disguise itself through this header).
For example, if we have Google Chrome installed and visit Baidu, the request carrier is "Google Chrome".
- Connection: keep-alive or close
close: after the request succeeds, the corresponding connection is closed immediately.
keep-alive: after the request succeeds, the corresponding connection is still closed, but not immediately.
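As a rough sketch of how a crawler sets these request headers, the snippet below builds a request object with urllib.request without sending it. The CHROME_UA string is an illustrative Chrome-style value (an assumption, copy a real one from your own browser), and build_request is a hypothetical helper.

```python
from urllib.request import Request

# Illustrative Chrome-style User-Agent; the exact value is an assumption.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

def build_request(url, user_agent):
    """Create a request that presents itself as a browser; nothing is sent yet."""
    headers = {"User-Agent": user_agent, "Connection": "close"}
    return Request(url, headers=headers)

request = build_request("https://www.baidu.com", CHROME_UA)
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(request.get_header("User-agent"))
print(request.get_header("Connection"))
```

Passing the object to urllib.request.urlopen would actually send the request.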
response headers:
- Content-Type: can be JSON, text, JS, and so on. Purpose: tells the client the format or type of the data the server sends back.
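To see what a client does with the response headers, here is a small sketch that parses a hand-written header block with the standard library's http.client.parse_headers. The RAW_HEADERS bytes are made up for illustration and do not come from a real server.

```python
import io
from http.client import parse_headers

# Hand-written response headers for illustration; a blank line ends the block.
RAW_HEADERS = (
    b"Content-Type: application/json; charset=utf-8\r\n"
    b"Connection: keep-alive\r\n"
    b"\r\n"
)

def read_headers(raw):
    """Parse raw header bytes into a message object with dict-like access."""
    return parse_headers(io.BytesIO(raw))

headers = read_headers(RAW_HEADERS)
print(headers.get_content_type())     # application/json
print(headers.get_content_charset())  # utf-8
print(headers["Connection"])          # keep-alive
```

A crawler would typically branch on the content type here, for example decoding JSON versus saving plain text.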
5. HTTPS: the secure HTTP protocol
What is certificate key encryption?
Before looking at it, we first need to understand "symmetric key encryption" and "asymmetric key encryption".
A preliminary overview:
Three encryption modes: symmetric key encryption, asymmetric key encryption, certificate key encryption
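(1) SSL encryption:
The encryption technique SSL uses here is called "shared-key encryption", also known as "symmetric key encryption": the same key encrypts and decrypts, and the client sends that key to the server along with the ciphertext.
Disadvantage: if a third party intercepts the transmission, it obtains both the key and the ciphertext, so the data can be decrypted.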
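The weakness is easy to see with a toy cipher. The sketch below uses a repeating-key XOR purely as a stand-in for a real symmetric algorithm (it is not what SSL actually uses and is not secure); xor_cipher and the sample key and message are assumptions for illustration.

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: the same call with the same key encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

shared_key = b"secret"
ciphertext = xor_cipher(b"hello server", shared_key)

# A third party that intercepts both the key and the ciphertext
# can recover the message exactly as the server would.
intercepted = xor_cipher(ciphertext, shared_key)
print(intercepted)  # b'hello server'
```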
(2) Asymmetric key encryption
Disadvantages: (1) it is relatively inefficient; (2) the client cannot tell whether the public key it received was really sent by the server.
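The idea behind asymmetric encryption can be sketched with textbook RSA on tiny numbers: anyone may encrypt with the public key, but only the holder of the private key can decrypt. The primes, exponent, and function names below are chosen only for illustration; real keys use primes hundreds of digits long plus padding, which this sketch omits.

```python
# Toy key pair built from the small primes 61 and 53.
P, Q = 61, 53
N = P * Q                          # modulus, shared by both keys
E = 17                             # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

def encrypt(message: int, public_key=(E, N)) -> int:
    """Anyone holding the public key can encrypt."""
    e, n = public_key
    return pow(message, e, n)

def decrypt(ciphertext: int, private_key=(D, N)) -> int:
    """Only the holder of the private key can decrypt."""
    d, n = private_key
    return pow(ciphertext, d, n)

secret = 65
ciphertext = encrypt(secret)
print(decrypt(ciphertext) == secret)  # True
```

Nothing in this exchange proves who generated (E, N), which is the second disadvantage listed above.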
(3) Certificate key encryption: solves the problem in asymmetric encryption that the client cannot trust the public key it receives.
Third party: a certificate authority (CA).
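How a certificate authority vouches for a server's public key can be sketched with a toy signature scheme. Everything here is an assumption for illustration: the CA key pair is built from two small Mersenne primes (far too small for real use), the byte strings stand in for real keys, and ca_sign and client_verify are hypothetical names. Real certificates use X.509 and much larger keys.

```python
import hashlib

# Toy CA key pair; the client is assumed to already trust (CA_E, CA_N).
P, Q = 2**31 - 1, 2**61 - 1
CA_N = P * Q
CA_E = 65537
CA_D = pow(CA_E, -1, (P - 1) * (Q - 1))  # CA private exponent (Python 3.8+)

def digest(data: bytes) -> int:
    """Hash the data and reduce it into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % CA_N

def ca_sign(server_public_key: bytes) -> int:
    """The CA signs the server's public key with the CA private key."""
    return pow(digest(server_public_key), CA_D, CA_N)

def client_verify(server_public_key: bytes, signature: int) -> bool:
    """The client checks the signature using the CA public key."""
    return pow(signature, CA_E, CA_N) == digest(server_public_key)

server_key = b"hypothetical-server-public-key"
signature = ca_sign(server_key)
print(client_verify(server_key, signature))              # genuine key
print(client_verify(b"attacker-public-key", signature))  # swapped key
```

A key swapped in by an interceptor no longer matches the CA's signature, so the client can reject it.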
Reference blog: https://www.cnblogs.com/bobo-zhang/p/9645715.html