Python crawler interview question selection, episode 01

Basic networking topics

Five-layer network model

  • Application layer: HTTP, FTP, DNS, NFS

  • Transport layer: TCP, UDP

  • Network layer: IP, ICMP, IGMP

  • Data link layer

  • Physical layer: transmission media

What is 2MSL?

  • 2MSL is twice the MSL. TCP's TIME_WAIT state is also called the 2MSL wait state. When one end of a TCP connection initiates an active close and sends the last ACK of the four-way close (the ACK that answers the peer's FIN), it enters the TIME_WAIT state and must stay there for twice the MSL. The main reason for waiting 2MSL is that the last ACK may never reach the other side; if so, the peer will retransmit its FIN after a timeout, and the actively closing end can then send the ACK again. While the connection is in TIME_WAIT, its port pair cannot be reused until the 2MSL period ends, and any segments that arrive late for this connection are simply discarded. In practice, however, you can set the SO_REUSEADDR socket option so that you do not have to wait for the 2MSL period to end before reusing the port.
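
A minimal sketch of the SO_REUSEADDR option mentioned above, using Python's standard socket module (the address and port are placeholders):

```python
import socket

# Create a TCP socket and allow the local address to be reused,
# so bind() does not fail while an old connection is still in TIME_WAIT.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 8000))  # placeholder address/port
server.listen(5)
```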

TCP server creation process

1. socket: create a socket

2. bind: bind the IP and port

3. listen: make the socket able to accept connections passively

4. accept: wait for a client connection

5. recv/send: receive and send data

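A minimal echo-style sketch of the five steps above with Python's socket module (address, port and buffer size are arbitrary choices):

```python
import socket

# 1. Create a socket
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# 2. Bind the IP and port
server.bind(("127.0.0.1", 8000))
# 3. Make the socket passively accept connections
server.listen(5)
# 4. Wait for a client connection
conn, addr = server.accept()
# 5. Receive and send data
data = conn.recv(1024)
conn.send(data)   # echo the data back
conn.close()
server.close()
```
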
What are TTL, MSL, RTT?

(1) MSL (Maximum Segment Lifetime): the longest time any segment may exist on the network; a segment older than this is discarded.

(2) TTL: TTL is short for "time to live". It is an initial value set by the source host, but it is not a wall-clock time; it is the maximum number of routers an IP datagram may pass through. Each router that handles the datagram decreases the value by 1, and when it reaches 0 the datagram is discarded and an ICMP message is sent to notify the source host. RFC 793 specifies an MSL of 2 minutes; in practice 30 seconds, 1 minute and 2 minutes are all commonly used. TTL and MSL are related but not simply equal; MSL must be greater than or equal to TTL.

(3) RTT: the round-trip time from the client to the server. TCP contains an algorithm for dynamically estimating the RTT, and it keeps re-estimating the RTT of a given connection, because the RTT changes as the degree of network congestion changes.

The difference between HTTP and HTTPS

  • The difference between HTTPS and HTTP:

    (1) The HTTPS protocol requires applying to a CA for a certificate; free certificates are rare, so a fee is usually involved.

    (2) HTTP is the hypertext transfer protocol and transmits information in plain text, while HTTPS is a secure, SSL-encrypted transfer protocol.

    (3) HTTP and HTTPS use completely different connection methods and different ports: the former uses 80 and the latter 443.

    (4) An HTTP connection is very simple and stateless.

    (5) HTTPS is a network protocol built from SSL + HTTP that supports encrypted transmission and identity authentication, and it is safer than HTTP.

  • Application occasions:

    (1) HTTP: suitable for applications with low requirements on transmission speed and security that need rapid development, such as web applications and small mobile games.

    (2) HTTPS: HTTPS should be used in any scenario!

How HTTPS realizes secure data transmission

  • HTTPS essentially adds an encryption layer, TLS/SSL, between HTTP and TCP. SSL is the encryption suite responsible for encrypting the HTTP data, and TLS is the upgraded version of SSL; when HTTPS is discussed today, the encryption suite basically means TLS. Originally the application layer handed data directly to TCP for transmission; now the application layer hands the data to TLS/SSL, which encrypts it and then passes it to TCP for transmission.
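
A small sketch of "hand the data to TLS/SSL, which encrypts it and then passes it to TCP", using Python's standard ssl module to wrap a plain TCP socket (the host name is a placeholder):

```python
import socket
import ssl

host = "example.com"                      # placeholder host
context = ssl.create_default_context()    # verifies the server certificate by default

# Plain TCP connection first, then the TLS layer is wrapped around it.
with socket.create_connection((host, 443)) as tcp_sock:
    with context.wrap_socket(tcp_sock, server_hostname=host) as tls_sock:
        # The HTTP request is written to the TLS layer, which encrypts it
        # before it travels over TCP.
        tls_sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        print(tls_sock.recv(4096)[:200])
```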

Where HTTPS security certificates come from

  • How do you apply for one, and which third-party organizations at home and abroad provide certificate services?

    • Domestic:

      • WoSign
      • CFCA, the financial certification authority established by the People's Bank of China together with 12 banks
      • China Telecom Certification Center (CTCA)
      • Customs Certification Center (SCCA)
      • Guofu'an CA Security Certification Center, established by the EDI Center of the Ministry of Foreign Trade
      • The UCA interoperable certification system headed by SHECA (Shanghai CA)
    • Foreign:

      • StartSSL
      • GlobalSign
      • GoDaddy
      • Symantec

Common HTTP status codes

  • 200 status code: the server handled the request normally.

  • 301 status code: The requested resource has been permanently moved to a new location. When the server returns this response (response to a GET or HEAD request), it will automatically redirect the requester to the new location.

  • 302 status code: the requested resource temporarily responds to the request from a different URI, but the requester should continue to use the original location for future requests

  • 401 status code: the request requires authentication. The server may return this response for pages that require login.

  • 403 status code: the server understood the request but refuses to execute it. Unlike a 401 response, authenticating will not help, and the request should not be repeated.

  • 404 status code: The request failed, and the requested resource was not found on the server.

  • 500 status code: the server encountered an unexpected condition that prevented it from completing the request. Generally, this happens when there is an error in the server-side program code.

  • 503 status code: The server is currently unable to process the request due to temporary server maintenance or overload.

Basic crawler topics

What is a crawler?

  • Crawlers are automated programs that request websites and extract data

The difference between urllib and urllib2 in Python 2.x

  • Similarities and differences: both perform URL request operations, but the differences are clear.

  • urllib2 can accept an instance of the Request class, which lets you set headers on the URL request, while urllib only accepts a URL. This means you cannot disguise your User-Agent string (i.e. pretend to be a browser) with the urllib module alone.

  • urllib provides the urlencode method for generating GET query strings, while urllib2 does not; this is one reason urllib is often used together with urllib2.

  • urllib2's comparative advantage is that urllib2.urlopen can accept a Request object as a parameter, which makes it possible to control the headers of the HTTP request.

  • However, functions such as urllib.urlretrieve and the quote/unquote family (urllib.quote and so on) were never added to urllib2, so urllib's help is sometimes still needed.
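
A short Python 2 sketch of the usual division of labour described above: urllib.urlencode builds the query string, while urllib2.Request carries custom headers (the URL and parameters are placeholders):

```python
# -*- coding: utf-8 -*-
# Python 2.x only: urllib and urllib2 were merged into urllib.* in Python 3.
import urllib
import urllib2

params = urllib.urlencode({"q": "python", "page": 1})  # urllib builds the query string
url = "http://example.com/search?" + params            # placeholder URL

req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})  # urllib2 sets headers
resp = urllib2.urlopen(req)
print(resp.read()[:200])
```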

What is the robots protocol?

  • The robots protocol (also called the crawler protocol, crawler rules, robot exclusion protocol, etc.) is expressed in robots.txt. Through the robots protocol a website tells search engines which pages may be crawled and which may not.

  • The robots protocol is a common ethical convention for websites in the Internet community. Its purpose is to protect website data and sensitive information and to ensure that users' personal information and privacy are not infringed. Because it is not an enforceable command, search engines need to follow it voluntarily.
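
A small sketch of checking robots.txt before crawling, using Python 3's built-in urllib.robotparser (the site, path and user-agent name are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()                                     # fetch and parse robots.txt

# Ask whether a given user agent may fetch a given URL.
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))
```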

Request and Response

  • The browser sends a Request to the server; the server returns a Response based on the request, and the page is then displayed in the browser.

    1. The browser sends a message to the server where the URL is located. This process is called an HTTP Request.

    2. After the server receives the message sent by the browser, it processes it according to the message's content and sends a message back to the browser. This process is called an HTTP Response.

    3. After the browser receives the Response from the server, it processes the information accordingly and then displays it.

What is included in the request?

1. Request method: there are mainly two methods, GET and POST. The parameters of a POST request are not included in the URL.
2. Request URL: a uniform resource locator; a web document, an image, a video and so on can each be uniquely identified by a URL.
3. Request headers, including User-Agent (the browser's request header), Host, Cookie information, and so on.
4. Request body: generally absent in a GET request; in a POST request the body usually contains form data.

Why requests need to carry headers

  • To simulate a browser and fool the server into returning content consistent with what a browser would receive. headers takes the form of a dictionary:

  • headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
    (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

  • Usage: requests.get(url, headers=headers)

What information is contained in the response?

1. Response status: the status code, e.g. 200 for a normal response, 3xx for redirection.
2. Response headers: content type, content length, server information, cookie settings, and so on.
3. Response body: the page source code, binary image data, and so on.
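
A small sketch with the requests library showing where the three parts above live on a Response object (the URL is a placeholder):

```python
import requests

r = requests.get("https://example.com", timeout=10)  # placeholder URL

print(r.status_code)              # response status, e.g. 200, 301/302, 404 ...
print(r.headers["Content-Type"])  # response headers: content type, length, cookies ...
print(r.text[:200])               # response body as decoded text (page source)
print(len(r.content))             # response body as raw bytes (e.g. image data)
```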

What are the contents of HTTP requests and responses

  • HTTP request headers:

    Accept: the content types the browser can handle
    Accept-Charset: the character sets the browser can display
    Accept-Encoding: the compression encodings the browser can handle
    Accept-Language: the language currently set in the browser
    Connection: the type of connection between the browser and the server
    Cookie: any cookies set for the current page
    Host: the domain of the requested page
    Referer: the URL of the page that issued the request
    User-Agent: the browser's user-agent string

    HTTP response headers:

    Date: the time the message was sent (the time format is defined by RFC 822)
    Server: the server name
    Connection: the type of connection between the browser and the server
    Content-Type: the MIME type of the returned document
    Cache-Control: controls HTTP caching
The difference between get and post requests

  • Differences:
    • GET:
      • Gets data from the specified server.
      • GET requests can be cached.
      • GET requests are saved in the browser's browsing history.
      • A URL requested with GET can be saved as a browser bookmark.
      • GET requests have a length limit.
      • GET requests are mainly used to obtain data.
    • POST:
      • POST requests cannot be cached.
      • POST requests are not saved in the browser's browsing history.
      • A URL requested with POST cannot be saved as a browser bookmark.
      • POST requests have no length limit.
      • A POST request places the request data in the body of the HTTP request packet. POST is more secure than GET, and it may modify resources on the server.
  • Typical uses:
    • POST:
      • The result of the request has persistent side effects (e.g. new rows are added to a database).
      • Using GET would make the URL too long because of the form data.
      • The data to be transmitted is not encoded as 7-bit ASCII.
    • GET:
      • The request looks up a resource, and the HTML form data only helps with the search.
      • The result of the request has no persistent side effects.
      • The total length of the collected data and the input field names in the HTML form does not exceed 1024 characters.
  • What information is sent to the backend server in an HTTP request?
    • Request line (request method, resource path and HTTP protocol version): POST /demo/login HTTP/1.1
    • Request headers
    • Message body (also called entity content): username=xxxx&password=1234
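
A small requests sketch of the two methods: GET carries its parameters in the URL query string, while POST carries them in the request body (URLs and form fields are placeholders):

```python
import requests

# GET: parameters end up in the URL (?q=python&page=1) and can be cached or bookmarked.
r1 = requests.get("https://example.com/search",
                  params={"q": "python", "page": 1}, timeout=10)
print(r1.url)

# POST: the data goes into the message body, not the URL.
r2 = requests.post("https://example.com/demo/login",
                   data={"username": "xxxx", "password": "1234"}, timeout=10)
print(r2.status_code)
```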

What are the commonly used techniques for python crawlers?

  • Scrapy, Beautiful Soup, urllib, urllib2, requests

The basic process of crawling

1. Initiate a request to the target site through an HTTP library, i.e. send a Request, which can carry extra headers and other information, and wait for the server to respond.
2. If the server responds normally, you get a Response; the content of the Response is the content of the requested page.
3. Parse the content: with regular expressions, a page-parsing library, JSON, and so on.
4. Save the data: to a text file or a database.

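A minimal end-to-end sketch of the four steps, using requests plus a regular expression and saving the result to a text file (the URL and pattern are placeholders):

```python
import re
import requests

# 1. Send a Request, optionally with extra headers.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get("https://example.com", headers=headers, timeout=10)

# 2. A normal response gives us the page content.
html = resp.text

# 3. Parse the content, here with a simple regular expression for <title>.
titles = re.findall(r"<title>(.*?)</title>", html, re.S)

# 4. Save the data to a text file (a database would also work).
with open("result.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))
```
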
Ways to implement simulated login

① Use a cookie that carries a logged-in state and send it along with the request headers; a plain GET request can then access pages that are only reachable after logging in.

② First send a GET request to the login page and extract from its HTML any data needed for logging in (if necessary), combine it with the account and password, and then send a POST request to log in. Then, using the cookie information obtained, continue to visit the next pages.

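A sketch of approach ②, assuming a hypothetical login form with username/password fields; requests.Session keeps the cookies between the POST login and later GETs (all URLs and fields are placeholders):

```python
import requests

session = requests.Session()                  # cookies are kept across requests

login_page = session.get("https://example.com/login", timeout=10)  # placeholder URL
# ... extract any hidden token from login_page.text here if the site requires it ...

payload = {"username": "xxxx", "password": "1234"}  # hypothetical form fields
session.post("https://example.com/login", data=payload, timeout=10)

# The session now carries the logged-in cookie, so protected pages can be fetched.
profile = session.get("https://example.com/profile", timeout=10)
print(profile.status_code)
```
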
Automated crawler interview topics

How to handle dynamically loaded content that requires high timeliness?

  • Selenium + PhantomJS

  • Try not to use sleep; use WebDriverWait instead.
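
A sketch of waiting explicitly instead of sleeping, using Selenium's WebDriverWait; Chrome in headless mode stands in for PhantomJS here, as the notes further below suggest, and the URL and selector are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")        # run without a visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com")     # placeholder URL
    # Wait up to 10 seconds for the dynamically loaded element instead of sleeping.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#content"))  # placeholder selector
    )
    print(element.text)
finally:
    driver.quit()                         # close the browser process explicitly
```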

Understanding of Selenium and PhantomJS

  • Selenium is an automated testing tool for the web. Driven by our instructions, it can make the browser automatically load pages, obtain the required data, take screenshots of pages, or check whether certain actions on a website have occurred. Selenium does not come with a browser of its own and cannot render pages by itself; it must be used together with a third-party browser. Sometimes we need it to run invisibly inside our code, and for that we can use a tool called PhantomJS in place of a real browser. The Selenium library has an API called WebDriver. WebDriver is a bit like a browser in that it can load a website, but it can also be used like BeautifulSoup or other Selector objects to find page elements, interact with elements on the page (send text, click, and so on), and perform the other actions needed to run a web crawler.

  • PhantomJS is a WebKit-based "headless" browser. It loads the website into memory and executes the JavaScript on the page. Because it does not display a graphical interface, it runs more efficiently than a full browser and consumes fewer resources than traditional Chrome or Firefox.

  • If we combine Selenium and PhantomJS, we get a very powerful web crawler that can handle JavaScript, cookies, headers and anything else a real user would do. Note that after the main program exits, Selenium does not guarantee that the PhantomJS process exits as well; it is best to close the PhantomJS process manually, otherwise multiple PhantomJS processes may keep running and eat memory. WebDriverWait can reduce delays, but some versions have bugs (various errors); in that case sleep can be used instead. PhantomJS is slow at fetching data, so multithreading can be considered. If you find that some pages work while others do not, try switching from PhantomJS to Chrome.

What are the commonly used anti-crawler strategies? And what are the coping strategies?

  • Anti-crawling based on request headers

  • Anti-crawling based on user behavior:

    • The same IP visits the same page many times in a short period.
    • The same account performs the same operation many times in a short period.
  • Anti-crawling via dynamic pages

    • The data is obtained through Ajax requests
    • or generated by JavaScript.
  • Encrypting part of the data

    • Only part of the data can be captured directly; the rest is encrypted and appears as garbled characters.
  • Countermeasures

    • For basic web pages, customize the headers, add header data, and use proxies.
    • Some websites only return complete data after login, so simulated login is required.
    • Where the crawl frequency is limited, set the crawl frequency lower.
    • Where IPs are restricted, use multiple proxy IPs and poll through them (see the sketch after this list).
    • For dynamic pages, Selenium + PhantomJS can be used, although it is slow; if the site exposes a search/data interface, that can be crawled instead.
    • For partially encrypted data, you can use Selenium to take a screenshot and then recognize it with the pytesseract OCR library; the slowest but most direct method is to reverse-engineer the encryption.
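
A sketch of two of the countermeasures above, rotating User-Agent headers and polling through a pool of proxy IPs with requests (all values are placeholders):

```python
import itertools
import random
import time

import requests

user_agents = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",   # placeholder UA strings
               "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ..."]
proxy_pool = itertools.cycle(["http://1.2.3.4:8080",              # placeholder proxies
                              "http://5.6.7.8:8080"])

def fetch(url):
    proxy = next(proxy_pool)                      # poll the proxies in turn
    headers = {"User-Agent": random.choice(user_agents)}
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)

for page in range(1, 4):
    resp = fetch("https://example.com/list?page=%d" % page)
    print(resp.status_code)
    time.sleep(2)                                 # keep the crawl frequency low
```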

Scrapy interview topics

Scrapy advantages and disadvantages:

  • Advantages:

    • Scrapy is asynchronous.

    • It uses the more readable XPath instead of regular expressions.

    • It has a powerful statistics and logging system.

    • It can crawl different URLs at the same time.

    • It supports a shell mode, which is convenient for standalone debugging.

    • Middleware can be written, which makes it easy to add unified filters.

    • Data is stored in a database through pipelines.

  • Disadvantages: it is a Python-based crawler framework with limited extensibility.

    • It is based on the twisted framework: a runtime exception will not kill the reactor, and the asynchronous framework will not stop other tasks after an error, which makes data errors hard to detect.
  • A broader description

    • Scrapy is a Python crawler framework with very high crawling efficiency and a high degree of customization, but it does not support distribution by itself. scrapy-redis is a set of components based on the Redis database that runs on top of the Scrapy framework and gives Scrapy support for a distributed strategy: the slave side shares the item queue, request queue and request fingerprint set stored in the Redis database on the master side.

    • Redis is chosen because it supports master-slave synchronization and caches data in memory, so a Redis-based distributed crawler is very efficient at high-frequency reading of requests and data.

    • Scrapy is a fast, high-level Python web crawler framework.

      It is used to download and parse web pages; the parse -> yield item -> pipeline flow is the inherent pattern of every crawler.

    • A Scrapy project is mainly made up of the spider, pipelines.py, items.py, middlewares.py and settings.py.
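
A minimal spider sketch of the parse -> yield item -> pipeline pattern mentioned above (the spider name, URL and selectors are placeholders):

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"                                  # placeholder spider name
    start_urls = ["https://example.com/list"]      # placeholder start URL

    def parse(self, response):
        # Extract items with XPath and hand them to the pipeline.
        for row in response.xpath("//div[@class='item']"):
            yield {"title": row.xpath(".//a/text()").get()}

        # Extract the next-page URL and feed it back to the scheduler.
        next_page = response.xpath("//a[@class='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```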

What are the main problems that distributed crawlers solve?

(1) IP

(2) Bandwidth

(3) CPU

(4) I/O

Scrapy framework operating mechanism?

  • The engine takes the first batch of URLs from start_urls and sends the requests. The engine hands each request to the scheduler, which puts it into the request queue. When a request is dispatched, the scheduler passes it from the request queue to the downloader, which fetches the response corresponding to the request and hands the response to the parsing method you wrote for extraction:

  • If the required data is extracted, it is handed to the pipeline file for processing;
    if URLs are extracted, the previous steps are repeated (send the URL request, the engine hands it to the scheduler to be queued, and so on) until there are no more requests in the request queue and the program ends.

Principle of distributed crawler?

  • scrapy-redis implements distribution, and the principle is actually very simple. For convenience, we call our core server the master, and the machines that run the crawler programs the slaves.

  • To crawl web pages with the Scrapy framework, we first need to give it some start_urls. The crawler first visits the URLs in start_urls and then, according to our specific logic, parses the elements inside them or crawls further second- and third-level pages. To make this distributed, we only need to work on these start_urls.

  • We set up a Redis database on the master (note: this database is only used for URL storage; it does not store the crawled data itself and should not be confused with MongoDB or MySQL mentioned later) and open a separate list field for each type of website to be crawled. On the slaves, scrapy-redis is configured so that the address from which URLs are obtained is the master's address. The result is that although there are multiple slaves, there is only one place they all get URLs from: the Redis database on the master. Moreover, thanks to scrapy-redis's own queue mechanism, the links taken by different slaves do not conflict with one another. After each slave finishes its crawling task, the results are aggregated on the server (the storage at this point is no longer Redis, but a database for the actual content such as MongoDB or MySQL). A further advantage is that the program is highly portable: as long as path issues are handled properly, moving the slave program to another machine is basically a matter of copy and paste.
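
A sketch of the slave-side settings.py changes that point scrapy-redis at the master's Redis; the host address is a placeholder and the class paths follow the scrapy-redis documentation:

```python
# settings.py on each slave

# Use the scrapy-redis scheduler and request-fingerprint dupefilter,
# so the request queue and fingerprint set live in the master's Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True          # keep the queue between runs

# Address of the Redis database on the master (placeholder host).
REDIS_URL = "redis://master-host:6379"
```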

Asynchronous processing of scrapy

  • The asynchronous mechanism of the Scrapy framework is based on the twisted asynchronous networking framework. The concurrency value can be set in the settings.py file (the default is 16 concurrent requests).
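
The concurrency value referred to above is set in settings.py; a small sketch:

```python
# settings.py
CONCURRENT_REQUESTS = 32            # default is 16 concurrent requests
# Optional finer-grained limits:
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same site
```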

Source: blog.csdn.net/weixin_38640052/article/details/107481491