day01, crawlers and data

1.1, the origin of the data

Data from enterprise data platforms and open government data sources plays almost no real role in enterprise applications.
Truly useful data usually still has to be collected by engineers writing crawlers.

1.2, what is a crawler

1, definition of a crawler

  A script or program ---> a program that automatically fetches information from the World Wide Web.

2, classification of crawlers

  • General-purpose crawler
  • Focused crawler

3, what crawlers are used for

  • Solving the cold-start problem.
  • Powering basic search engines. To build a search engine, you must use crawlers.

  • Helping machine learning build knowledge graphs.
    • Machine learning ultimately comes down to the training set, and training sets can be collected by crawlers.
  • Building comparison (e.g. price-comparison) software.

1.3, crawler development engineers

1, junior engineer

  • Web front-end knowledge: HTML, CSS, JavaScript, DOM, DHTML, Ajax, jQuery, JSON and so on;
  • Regular expressions: able to extract the information you want (such as specific text and links) from ordinary web pages; know what lazy and greedy matching are;
  • Able to use XPath to get node information from the DOM structure;
  • Know what the depth-first and breadth-first crawling algorithms are, and how to apply them in practice;
  • Able to analyze a site with a simple structure, and to use the urllib or requests library for simple data capture (a sketch follows below).

 When working through a web project problem, the flow is as follows:

  Front end ---> JavaScript ---> Python ---> SQL query ---> Database
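
As a minimal sketch of the last bullet above (the target URL is a placeholder; assumes a plain, static page), a simple fetch with the requests library might look like this:

import requests

# Placeholder URL; any simple static page would do.
url = 'http://www.example.com/index.html'

# Many sites reject the default client identification, so send a browser-like User-Agent.
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers, timeout=10)
response.encoding = response.apparent_encoding  # guess the page's encoding
print(response.status_code)  # e.g. 200
print(response.text[:200])   # first 200 characters of the HTML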

2, mid-level engineers

  • Know what a hash is, and be able to use the MD5 and SHA1 hash algorithms for data storage and deduplication

  • Be familiar with the basics of the HTTP and HTTPS protocols; understand the GET and POST methods and HTTP header information, including status codes, encodings, User-Agent, cookies, sessions, and so on

  • Be able to set the User-Agent when crawling data, and to configure proxies

  • Know what a request and a response are; be able to use tools such as Fiddler to capture and analyze network packets; for dynamic crawling, learn to analyze Ajax requests and to simulate POST request packets; be able to grab the client's session information, so that some simple sites can be logged into automatically by simulating their packets.

  • For sites that are hard to crawl, learn to use PhantomJS + Selenium to grab dynamic page content

  • Concurrent downloading: accelerate data downloads by crawling in parallel, using multiple threads.

    Multithreading makes fuller use of the computer's CPU, raising CPU utilization and speeding the program up. A sketch follows.
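
A minimal sketch of parallel downloading with threads, using the standard library's concurrent.futures (the URL list is hypothetical):

import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of pages to download in parallel.
urls = ['http://www.example.com/page%d.html' % i for i in range(10)]

def fetch(url):
    # Network I/O releases the GIL, so threads genuinely overlap here.
    response = requests.get(url, timeout=10)
    return url, response.status_code

# Five worker threads share the download work.
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)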

3, senior engineer

  • Able to use Tesseract, Baidu AI, HOG + SVM, CNN and similar libraries to recognize verification codes.
  • Able to use data mining techniques and classification algorithms to avoid dead links.
  • Able to use common databases, such as MongoDB and Redis, for data storage and queries; know how to avoid re-downloading through caching (see the sketch after this list).
  • Able to use machine learning techniques to dynamically adjust the crawling strategy, avoiding IP bans and other blocks.
  • Able to use open-source frameworks such as Scrapy and scrapy-redis to build a distributed crawler, and to deploy and control large-scale distributed crawling.
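
As a sketch of the deduplication idea mentioned above: hash each URL with MD5 and remember the fingerprints. Here an in-memory set stands in for the Redis set a real distributed crawler would use:

import hashlib

seen = set()  # in-memory stand-in for a Redis set

def is_new(url):
    # A fixed-length digest is cheaper to store and compare than the raw URL.
    fingerprint = hashlib.md5(url.encode('utf-8')).hexdigest()
    if fingerprint in seen:
        return False  # already downloaded, skip it
    seen.add(fingerprint)
    return True

print(is_new('http://www.example.com/a.html'))  # True: first visit
print(is_new('http://www.example.com/a.html'))  # False: duplicate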

1.4, search engines

1, definition of a search engine

A search engine is a system that runs certain strategies and algorithms to obtain information from web pages on the Internet, saves and processes that information, and then provides users with a search service.

2, the composition of a search engine

The main component is a general-purpose crawler.

3, general-purpose crawlers

  • Definition of a general-purpose crawler:
    a program that crawls web pages on the Internet wholesale and saves them locally.

  • Why search engines can get all the pages

    A search engine saves the pages behind all the different URLs to local storage.

    Where a search engine's URLs come from:

    • New websites proactively submit their URLs to the search engine.
    • External links placed on other websites are added to the search engine's URL queue.
    • The search engine cooperates with DNS resolvers: when a new domain is registered, the search engine can obtain its URL.

4, the working process of a search engine

  • (1) Use a general-purpose crawler to crawl web pages.
  • (2) Store the data.
    First deduplicate the page content, then save it.
  • (3) Preprocess the data:
    • Extract the text
    • Chinese word segmentation
    • Eliminate noise (ads, navigation, copyright text)
    • Build the index
  • (4) Rank the sites and provide users with the search service.

5, defects of general-purpose crawlers:
(1) They can only crawl pages as-is, yet typically 90% of a page's content is useless.
(2) They cannot meet the differing needs of different industries and different people.
(3) They can only get text content, not video, audio and other files.
(4) They only support keyword queries, not semantic queries.
6, focused crawlers:

When crawling web pages, a focused crawler screens the content, trying to ensure that only data relevant to the need is crawled.

1.5, crawler preparations

1, the robots protocol
Definition: the exclusion standard for web crawlers.
Purpose: tells search engines which pages may be crawled and which may not (a sketch of checking it follows below).
2, sitemap: a site map that helps us understand the structure of a website.
3, estimating the size of a site:
site:www.taobao.com
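
Python's standard library can read a site's robots.txt and answer the "may I crawl this?" question; a minimal sketch (the crawler name is made up):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')
rp.read()  # download and parse the robots.txt file

# Ask whether a given user agent may fetch a given URL;
# prints True or False depending on the site's rules.
print(rp.can_fetch('MyCrawler', 'https://www.taobao.com/market/'))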

1.6, HTTP and HTTPS

1, what is the HTTP protocol?
It is a norm ---> a standard that constrains the publishing and receiving of HTML pages.

2, HTTP port number: 80
HTTPS port number: 443

3, characteristics of the HTTP protocol:
(1) It is an application-layer protocol.
(2) Connectionless
Before HTTP/1.1, each HTTP transmission opened a separate connection; since HTTP/1.1, setting the Connection request header is enough to get a persistent connection.
(3) Stateless.
The HTTP protocol records no state: if a later request needs the content of an earlier one, that content must be sent again. To solve this problem, the techniques called cookie and session emerged.

4, URL: Uniform Resource Locator

(1) Role

Used to locate any resource on the Internet.
(2) Why can a URL locate any resource?

http://127.0.0.1:8000/index.html

scheme: the protocol (http)

netloc: the network address (127.0.0.1:8000)

A computer on the Internet is located mainly by ip:port.

path: the relative path of the resource on the server.

The netloc contained in a URL locates a specific computer, and the path then leads to the resource you want on that computer.

(3) Special symbols in a URL.
?: the question mark precedes the parameters of a GET request.
&: multiple GET request parameters are joined with &.
#: an anchor ---> when we visit the link, the page jumps to the anchor's position.
Note: in a crawler, when a crawled URL contains an anchor, remember to remove it (see the sketch below).
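
urllib.parse can strip the anchor for us; a minimal sketch:

from urllib import parse

url = 'http://127.0.0.1:8000/index.html#bottom'

# urldefrag splits off the fragment; keep only the clean URL for crawling.
clean_url, fragment = parse.urldefrag(url)
print(clean_url)  # http://127.0.0.1:8000/index.html
print(fragment)   # bottom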

(4) The parse module under Python's urllib can help us parse a URL:

from urllib import parse

# Example URL with query parameters and a fragment (anchor).
url = 'https://localhost:9999/bin/index.html?usename=zhangsanY&password=123#bottom'

# urlparse splits the URL into its six components.
parse_result = parse.urlparse(url)
print(parse_result)
print(parse_result.netloc)  # network address, including the port
print(parse_result.path)    # path of the resource on the server

Output:
ParseResult(
    scheme='https', 
    netloc='localhost:9999', 
    path='/bin/index.html', 
    params='', 
    query='usename=zhangsanY&password=123', 
    fragment='bottom')

localhost:9999
/bin/index.html

5, the HTTP working process:
(1) Address resolution: the client parses each part of the URL.

(2) Assemble the HTTP request packet.
(3) Encapsulate it into TCP packets and establish a TCP connection via the three-way handshake.
(4) The client sends the request.
(5) The server sends the response.
(6) Close the TCP connection.

6, points in the HTTP working process that deserve special attention:
1, when we enter a URL in the browser, the first three steps happen first; the request is not sent straight away.
2, understand where the client request and the server response sit in the process.
7, when we enter a URL in the browser, how does the client load the whole page?
(1) The client parses the URL, assembles the packet, establishes a connection, and sends a request to the server.
(2) The server parses from the packet which page the client wants, e.g. index.html, wraps that page into a packet, and passes it layer by layer down to the client; the client parses the packet and gets the index.html page.
(3) The client checks the index.html page for static resources such as JS files, image files, and CSS files, and, if there are any, requests each of them from the server.
(4) After the client has obtained all the resources, it renders index.html completely according to the HTML syntax.
8, the client request

  • (1) Consists of:
    • Request line: protocol version, address, request method
    • Request headers
    • Blank line
    • Request data
  • (2) Important request headers (see the sketch after this list):

    • User-Agent: identifies the client.
    • Accept: the file types the client is willing to receive, e.g.
    • Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

    • Referer: indicates which page the request came from.
      Anti-hotlinking: Sina, for example, adds a check to all its pages; when a news page is requested, it inspects the Referer header to see whether the request came from a Sina page. If it did, the access path is legitimate and the data is returned; if not, the request is hotlinking and the data is not returned.

    • Cookie: if the requests module cannot get a page's data and the returned content mentions cookies, you generally need to include a Cookie header in the request; cookies also play a very big role in the login process.
    • Content-Type: the type of the data in a POST request.
    • Content-Length: the length of the POST request data.
    • X-Requested-With: XMLHttpRequest (xhr) ---> when we request data from an Ajax interface, make sure to include this header; it marks the request as an Ajax request.
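
A sketch of sending these headers with the requests library (the URL and all header values are placeholders):

import requests

url = 'http://www.example.com/news/1.html'

headers = {
    # Pretend to be a normal browser.
    'User-Agent': 'Mozilla/5.0',
    # Claim the request came from a page on the same site (passes anti-hotlinking checks).
    'Referer': 'http://www.example.com/',
    # Mark the request as Ajax, as some interfaces require.
    'X-Requested-With': 'XMLHttpRequest',
    # Session cookie copied from a logged-in browser (placeholder value).
    'Cookie': 'sessionid=xxx',
}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)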

9, the server response

  • (1) Consists of:

    • Status line: HTTP protocol version, status code
    • Response headers
    • Blank line
    • Response body: if the request was for an HTML page, the response body is the HTML file.
  • (2) Response headers:

    • Content-Type: the type of the returned data; tells the client what kind of resource file is being returned.
  • (3) Status codes (a handling sketch follows this list)

    • 1XX -> 100~199: the server has successfully received part of the request; the client must send the rest for the whole process to complete.

    • 2XX: the server successfully accepted the request and processed the whole thing.
      200 OK

    • 3XX: to complete the request, the client must take further action.
      If the requested resource has moved to another location, 302 is used to redirect.
      304: use the cached resource.

    • 4XX: client request error. 404: the server could not find the requested content.
      403: the server refused access; insufficient permissions.

    • 5XX: an error occurred on the server.
      502: bad gateway.
      500: the request was not completed; the server encountered an unexpected condition.

      A common interview question: name some common status codes.
      The way to answer: first describe the classes, then list a few examples.
      https://blog.csdn.net/qq_35689573/article/details/82120851
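
A sketch of branching on the status code in a crawler. Note that requests follows redirects by itself; pass allow_redirects=False to see the 3XX codes yourself:

import requests

response = requests.get('http://www.example.com/', timeout=10,
                        allow_redirects=False)

if response.status_code == 200:
    print('OK, parse the page')
elif response.status_code in (301, 302):
    print('Redirected to', response.headers.get('Location'))
elif response.status_code == 404:
    print('Page not found, drop this URL')
elif response.status_code >= 500:
    print('Server error, retry later')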

1.7, hash algorithms

1, definition

Hash: via a function of the key, data is mapped to a storage location for access. This process is called hashing, the mapping function is called a hash function, and the array recording the storage is called a hash table (Hash Table).

Simply put, a hash function is an important function in cryptography, generally written h = H(m). It compresses a piece of data of any length (the data is generally called the "message") into a fixed-length string (the output is generally called the "digest"). The hash function needs to satisfy the following conditions:

  • Determinism: the hash function is a deterministic algorithm whose execution introduces no randomness, which means that hashing the same message necessarily gives the same result.
  • Efficiency: for any given message m, H(m) can be computed quickly.


  • Target collision resistance: given any message m1, it is hard to find another message m2 such that H(m1) = H(m2).
  • Generalized collision resistance: it is hard to find two unequal messages m0 and m1 such that H(m0) = H(m1).

2, advantages

Classify first, then search: hashing narrows the range by computation and so speeds up lookup.

3, what hashes are used for

  • Digital signatures: fingerprinting data

    For example, when we download a file, it passes through many network servers and routers in transit. How do we make sure it is the file we wanted? We cannot check every byte, and the file name and file size are very easy to fake, so we need a fingerprint to check the file's reliability. That fingerprint is produced by a hash algorithm (also called a digest algorithm).

  • Password storage

    When a user logs in to a website, if the server stores the password directly, then an attacker who compromises the server obtains the user's password. The most notorious example is the CSDN plaintext-password incident. To solve this problem, the server can store only the hash of the user's password. When the user enters login information, the server hashes the submitted password and compares it with the stored hash; if the results match, the user is allowed to log in. Since the server does not store the password itself, even a compromised server does not leak it. This is also why the "retrieve password" feature asks us to set a new password instead of sending the original one back to us. A minimal sketch follows.
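
A minimal sketch of that flow: store only a salted hash, never the password. (Real systems use slow, dedicated schemes such as bcrypt or PBKDF2; plain SHA-256 with a salt is shown here only to illustrate the idea.)

import hashlib
import os

def hash_password(password, salt=None):
    # A random salt makes identical passwords hash to different values.
    salt = salt or os.urandom(16)
    digest = hashlib.sha256(salt + password.encode('utf-8')).hexdigest()
    return salt, digest

# Registration: store only (salt, digest), never the password itself.
salt, stored = hash_password('123456')

# Login: hash the submitted password with the same salt and compare.
_, attempt = hash_password('123456', salt)
print(attempt == stored)  # True: allow the login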

4, characteristics of hash algorithms

  • Fast forward computation: given the plaintext and the hash algorithm, the hash value can be computed in limited time with limited resources.

  • Hard to reverse: given a hash value, it is difficult (essentially impossible) to recover the plaintext in finite time.

  • Input sensitivity: modify the original input even slightly, and the resulting hash value should look completely different (demonstrated below).

  • Collision resistance: it is hard to find two different plaintexts whose hash values agree (a collision).

    That is, for any two different blocks of data, it is very unlikely that their hash values are the same, and for a given block it is extremely difficult to find another block with the same hash. Because of this irreversibility, hash algorithms are often used to protect information: you cannot recover the original file from its hash fingerprint, and you cannot easily construct a file whose fingerprint matches a given target fingerprint.
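
Input sensitivity is easy to see in practice: change one character of the message and the digest changes beyond recognition (the avalanche effect):

import hashlib

# Two messages differing in a single character.
print(hashlib.sha256(b'hello world').hexdigest())
print(hashlib.sha256(b'hello world!').hexdigest())
# The two digests share no visible resemblance.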

5, hash construction methods (for understanding)

(1) Direct addressing: take the key, or a linear function of the key, as the hash address.

(2) Digit analysis: when the set of keys is known in advance and the keys have more digits than the address, select the digit positions where the keys' digits are most evenly distributed.

Take six digits for Hash(Key):

Column:  1   (2)   3   (4)   5   (6)   (7)   8   (9)   10   11   12   (13)

key1:    5    2    4    2    7    5     8    5    3     6    5    1     3

key2:    5    4    4    8    7    7     7    5    4     8    9    5     1

key3:    3    1    5    3    7    8     5    4    6     3    5    5     2

key4:    5    3    6    4    3    2     5    4    5     3    2    6     4


Columns (2, 4, 6, 7, 9, 13) contain no duplicate digits and are fairly evenly distributed, so these six columns are taken as the value of Hash(Key).

Hash(Key1) :225833

Hash(Key2):487741

Hash(Key3):138562

Hash(Key4):342554

(3) Middle-square method: square the key and take several middle digits of the result as the hash address.

(4) Folding method: split the key into parts with the same number of digits (the last part may have fewer), then add the parts together (discarding any carry) and take the sum as the hash address. This scheme suits keys with many digits whose digits are fairly evenly distributed.

(5) Random method: probe with a pseudo-random re-hash.

Implementation: build a pseudo-random number generator and take Hash(Key) = random(Key) as the hash address.

(6) Division-remainder method: take the remainder of the key divided by some number p not larger than the hash table length as the hash address.

I.e. H(Key) = Key % p;
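
A tiny sketch of the division-remainder method (the table length and keys are made up; p is usually chosen as a prime not larger than the table length):

# Hash table of length 13; choose p = 13.
p = 13

def h(key):
    # Division-remainder method: the address is the remainder mod p.
    return key % p

for key in (28, 41, 54):
    print(key, '->', h(key))
# 28 -> 2, 41 -> 2, 54 -> 2: equal remainders collide,
# which is why hash tables also need a collision-handling strategy.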

6, popular hash algorithms

Popular hash algorithms include MD5, SHA-1 and SHA-2.

  • MD4 (RFC 1320) was designed in 1990 by Ronald L. Rivest of MIT; MD is an abbreviation of Message Digest. Its output is 128 bits. MD4 has been shown to be insecure.
  • MD5 (RFC 1321) is Rivest's 1991 improvement of MD4. The input is still processed in 512-bit blocks, and the output is 128 bits. MD5 is more complex than MD4, a little slower to compute, and safer. MD5 has been shown not to have "strong collision resistance".
  • SHA (Secure Hash Algorithm) is a family of hash functions; the first algorithm in the family was published in 1993 by NIST (the National Institute of Standards and Technology). The well-known SHA-1 appeared in 1995; it outputs a 160-bit hash value and therefore resists exhaustive search better. SHA-1 is designed on the same principles as MD4 and imitates that algorithm. SHA-1 has been shown not to have "strong collision resistance".

  • To improve security, NIST also designed the SHA-224, SHA-256, SHA-384 and SHA-512 algorithms (collectively called SHA-2), whose principles are similar to SHA-1's. SHA-3-related algorithms have also been proposed.
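
All of these are available in Python's hashlib; a quick sketch comparing their output lengths:

import hashlib

message = b'hello'

# Digest size in bits = number of hex characters * 4.
for name in ('md5', 'sha1', 'sha256', 'sha512'):
    digest = hashlib.new(name, message).hexdigest()
    print(name, len(digest) * 4, 'bits:', digest[:16] + '...')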

7, what a hash "collision" is

If two keys produce the same hash value after being processed by the hash function, the situation is called a collision of the hash algorithm.

Modern hash algorithms earn their keep by making irreversibility hold with high probability: the probability of finding a collision is small, and the smaller it is, the more the algorithm can be relied upon.

7.1, an MD5 collision case

import hashlib

# Two hex byte strings; note they differ only slightly
a = bytearray.fromhex("0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef")

b = bytearray.fromhex("0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef")

# Print the MD5 digests: the results are identical
print(hashlib.md5(a).hexdigest())
print(hashlib.md5(b).hexdigest())

There are many such examples. For this reason, MD5 stopped being recommended as an application hash algorithm years ago, replaced by the SHA family of algorithms (Secure Hash Algorithm, abbreviated SHA).

7.2, the SHA family and SHA1 collisions

The SHA family contains many algorithms, such as SHA0, SHA1, SHA256, SHA384 and so on; they differ in how they are computed and in their speed. SHA1 is the most widely used among them. Version-control tools including GitHub, as well as various cloud-sync services, use SHA1 to distinguish files, and many security certificates and signatures use SHA1 to guarantee uniqueness. For a long time people considered SHA1 very safe; at least no collision cases had been found.

A classic article about hashing on Zhihu: https://www.zhihu.com/question/56234281/answer/148349930
