Web crawlers (spiders)

Today's notes on web crawlers:

We are now moving from the mobile Internet era into the big data era, in which data is the core resource.

By how it is acquired, data falls into the following categories:

(1) User-generated data: large Internet companies have massive numbers of users, so they have a natural advantage in accumulating data, for example the Baidu Index, the Ali Index, and the Sina Weibo Index.

(2) Data management consulting firms: usually only large companies have a data collection team. They collect data through market research, surveys, and model testing, cooperate with peer companies across all industries, and build up databases of each type.

(3) Publicly available government / organization data: open government data is consolidated from reports across the country, such as the data of the National Bureau of Statistics of the People's Republic of China.

(4) Data bought from third-party data platforms: AI now needs large amounts of face data and behavior and action data, and there are dedicated platforms that sell it, such as the Guiyang data exchange.

  1. HTTP & HTTPS

    1. A URL such as the Baidu home page, https://www.baidu.com/, begins with http or https. This is the protocol used to access the resource. URLs can begin with other protocols as well, but the pages a crawler fetches normally use http or https
    2. HTTP stands for Hypertext Transfer Protocol. It is the network protocol used to transfer hypertext data from a web server to the client's local browser
    3. HTTPS is the secure version of HTTP: it is HTTP with an SSL layer added underneath
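As a small illustration of the protocol prefix described above, the sketch below reads the scheme from a URL with Python's standard urllib.parse. The helper names get_scheme and is_crawlable are hypothetical and not part of the original notes.

```python
from urllib.parse import urlsplit


def get_scheme(url):
    """Return the protocol part of a URL, for example 'https'."""
    return urlsplit(url).scheme


def is_crawlable(url):
    """Hypothetical helper: accept only the two protocols a crawler usually fetches."""
    return get_scheme(url) in ("http", "https")


print(get_scheme("https://www.baidu.com/"))
print(is_crawlable("ftp://example.com/file.txt"))
```

A crawler could apply a check like this before queuing a link, so that mailto: or ftp: links are skipped.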
  2. HTTP request process

    1. When we type a URL into the browser and press Enter, the corresponding content appears. In this process the browser sends a request to the web server; the server receives the request, processes it, and returns a response to the browser. The response contains the source code of the page, which the browser parses and displays to the user
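The request and response round trip described above can be sketched with Python's standard library alone. This is a minimal sketch under stated assumptions: the tiny local server, the HelloHandler class, and the fetch_once function are all made up for illustration so that the example runs without Internet access; a real crawler would talk to a remote web server instead.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Plays the role of the web server: receives a request, returns a page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_once():
    """Start a local server, send one GET request, return (status, page source)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The client side: this is what the browser does when Enter is pressed.
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        status = response.status
        html = response.read().decode("utf-8")
        conn.close()
        return status, html
    finally:
        server.shutdown()
        server.server_close()


status, html = fetch_once()
print(status)
print(html)
```

The string in html is the page source that a browser would go on to parse and render; a crawler parses it to extract data instead.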

  3. Request

    1. A request is sent by the client to the server and can be divided into four parts

      • Request method <request method>

        • The two common request methods are GET and POST <entering a URL in the browser and pressing Enter sends a GET request>
        • A GET request carries its parameters in the URL, so the data is visible in the URL. The URL of a POST request does not contain the data, because the data is sent as a form in the request body. In addition, the data a GET request can carry is bounded by the URL length limits of browsers and servers (often cited as 1024 bytes), while POST has no such limit
      • Requested URL <request url>

        • The requested URL uniquely identifies the resource we want to request
      • Request headers <request headers>

        • The request headers carry additional information for the server to use, such as Cookie and User-Agent
      • Request body <request body>

        • The request body generally holds the form data of a POST request
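The four parts listed above can be inspected without sending anything by building request objects with Python's standard urllib.request. This is only a sketch: the search parameter, the example.com login URL, the form fields, and the cookie value are invented for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters travel in the URL itself, and there is no request body.
get_request = Request(
    "https://www.baidu.com/s?" + urlencode({"wd": "python"}),  # illustrative query
    headers={"User-Agent": "Mozilla/5.0"},
    method="GET",
)

# POST: the URL stays clean and the form data travels in the request body.
post_request = Request(
    "https://example.com/login",  # hypothetical URL
    data=urlencode({"user": "alice", "password": "secret"}).encode("utf-8"),
    headers={"User-Agent": "Mozilla/5.0", "Cookie": "sessionid=abc123"},
    method="POST",
)

for req in (get_request, post_request):
    print("method :", req.get_method())
    print("url    :", req.full_url)
    print("headers:", dict(req.header_items()))
    print("body   :", req.data)
```

Passing either object to urllib.request.urlopen would actually send it; that step is left out here so the example runs offline.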

  4. The response status code

  5. Automated crawling with selenium

    1. selenium is a tool for testing web applications. It runs directly in the browser, just as a real user would operate it. Supported browsers include IE, Chrome, etc.,
      but whichever browser is used, the corresponding browser driver <the webdriver> must be downloaded
    2. selenium driver object operations
      1. get(url) visits the address passed in
      2. close() closes the current browser window
      3. quit() exits the current webdriver and closes all windows
      4. current_url gets the URL of the current page
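The driver operations above might be combined as in the following sketch. The function names visit_and_report and main are hypothetical. main assumes that the selenium package, Chrome, and a matching driver are installed, and it is deliberately not called here, so the block itself runs without a browser.

```python
def visit_and_report(driver, url):
    """Open url with the given webdriver and return the URL of the resulting page."""
    try:
        driver.get(url)            # visit the address passed in
        return driver.current_url  # may differ from url after a redirect
    finally:
        # quit() ends the webdriver session and closes every window;
        # close() would close only the current window.
        driver.quit()


def main():
    # Assumption: selenium and a Chrome driver are available on this machine.
    from selenium import webdriver

    driver = webdriver.Chrome()
    print(visit_and_report(driver, "https://www.baidu.com/"))
```

Calling quit() in a finally block is a design choice so that the browser is shut down even if loading the page raises an exception.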

            

    

 

 

 

 

 

 

 

end.....

 

  

 

 


Origin www.cnblogs.com/xiaolizikj/p/11654172.html