Common Python crawler interview questions to win over your interviewer


Do you understand synchronous and asynchronous threads?

Thread synchronization: when multiple threads access the same resource at the same time, each must wait until the others have finished with it. This waiting costs time and lowers efficiency.
Thread asynchrony: while a thread is idle and waiting for one resource, it goes on to access other resources in the meantime.
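The waiting described under thread synchronization can be sketched with a lock from the standard threading module. This is only a minimal illustration: the names add_many and run_workers and the shared counter are assumptions made up for this example, not part of any crawler.

```python
import threading


def add_many(counter, lock, times):
    """Increment counter["value"] `times` times, one thread at a time."""
    for _ in range(times):
        with lock:  # other threads wait here until the lock is released
            counter["value"] += 1


def run_workers(num_threads=4, times=10_000):
    """Start several threads that all update the same counter."""
    counter = {"value": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=add_many, args=(counter, lock, times))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_workers())
```

Because every increment happens while the lock is held, the final count equals num_threads * times; the price is that threads spend time waiting for each other.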
Do you understand synchronous and asynchronous network requests in a multithreaded setting?
Synchronous: submit the request -> wait for the server to process it -> processing completes. The client browser cannot do anything else during this period.
Asynchronous: the request is triggered by an event -> the server processes it (the browser can still do other things in the meantime) -> processing completes.
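The difference between the two request flows can be sketched with asyncio. The network call is simulated with asyncio.sleep, and fake_request, run_sequentially, run_concurrently, and timed are hypothetical names chosen for this illustration.

```python
import asyncio
import time


async def fake_request(name, delay):
    """Stand-in for a network request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_sequentially(jobs):
    """Synchronous style: wait for each request before sending the next."""
    return [await fake_request(name, delay) for name, delay in jobs]


async def run_concurrently(jobs):
    """Asynchronous style: start every request, then wait for all of them."""
    results = await asyncio.gather(
        *(fake_request(name, delay) for name, delay in jobs)
    )
    return list(results)


def timed(coroutine_function, jobs):
    """Run one of the coroutines above and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coroutine_function(jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [("a", 0.2), ("b", 0.2), ("c", 0.2)]
    print(timed(run_sequentially, jobs))
    print(timed(run_concurrently, jobs))
```

Both versions return the same results in the same order; the concurrent one should finish sooner because the waits overlap.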
What are the advantages and disadvantages of linked list storage and sequence table storage?
1. Sequence table storage
    Principle: a sequence table stores its data elements in one contiguous block of memory, so access is fast and efficient. However, its length cannot grow dynamically.
    Advantages: fast access; elements are read directly by subscript (index).
    Disadvantages: 1. insertion and deletion are relatively slow; 2. the length cannot be increased.
    Example: when an element is inserted or deleted, the elements after it have to be moved to keep the table in order.
2. Linked list storage
    Principle: a linked list allocates space dynamically while the program runs. As long as memory is still available, storage overflow does not occur.
    Advantages: insertion and deletion are fast, and the other elements keep their original physical positions. For example, inserting or deleting an element only requires changing pointers.
    Disadvantages: lookup is slow, because a search has to walk through the list node by node.
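The trade-off above can be sketched in a few lines. Node, insert_after, find, and to_list are hypothetical helpers written for this illustration; the built-in list stands in for the sequence table, although unlike a fixed-length sequence table it can grow.

```python
class Node:
    """One element of a singly linked list."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def insert_after(node, value):
    """Insert by re-pointing one link; no other element is moved."""
    node.next = Node(value, node.next)


def find(head, value):
    """Search by walking the list node by node; returns None if absent."""
    node = head
    while node is not None and node.value != value:
        node = node.next
    return node


def to_list(head):
    """Collect the values in order, for display."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


if __name__ == "__main__":
    head = Node(1, Node(2, Node(4)))
    insert_after(find(head, 2), 3)  # only pointers change
    print(to_list(head))

    seq = [1, 2, 4]
    seq.insert(2, 3)  # elements after index 2 are shifted along
    print(seq[2])     # direct access by subscript
```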
How do you deal with network delays and network exceptions when using redis to build a distributed system?
Because network exceptions exist, a request in a distributed system has three possible outcomes (the "three states"): "success", "failure", and "timeout (unknown)".
When a "timeout" occurs, you can issue a read of the data to verify whether the RPC succeeded (this is what banking systems do, for example).
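The three states and the read-to-verify step can be sketched as follows. FakeRemoteStore is a hypothetical in-memory stand-in for a remote service, and its mode values are invented for this example; a real system would make RPC calls over the network and would likely need more careful handling.

```python
import enum


class Outcome(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout (unknown)"


class FakeRemoteStore:
    """Hypothetical stand-in for a remote service (assumption, not a real API)."""

    def __init__(self, mode="ok"):
        self.mode = mode  # "ok", "refused", "request_lost" or "reply_lost"
        self.data = {}

    def write(self, key, value):
        if self.mode == "refused":
            raise ConnectionError("server rejected the request")
        if self.mode == "request_lost":
            raise TimeoutError("request never arrived")
        self.data[key] = value
        if self.mode == "reply_lost":
            raise TimeoutError("write applied, but the reply was lost")

    def read(self, key):
        return self.data.get(key)


def call_write(store, key, value):
    """Map one write attempt onto the three states."""
    try:
        store.write(key, value)
        return Outcome.SUCCESS
    except TimeoutError:
        return Outcome.TIMEOUT
    except ConnectionError:
        return Outcome.FAILURE


def write_and_verify(store, key, value):
    """On timeout, read the data back to find out what really happened."""
    outcome = call_write(store, key, value)
    if outcome is Outcome.TIMEOUT:
        applied = store.read(key) == value
        outcome = Outcome.SUCCESS if applied else Outcome.FAILURE
    return outcome


if __name__ == "__main__":
    for mode in ("ok", "refused", "request_lost", "reply_lost"):
        print(mode, write_and_verify(FakeRemoteStore(mode), "balance", 100))
```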
Another simple approach when designing a distributed protocol is to make the execution steps retryable, that is, to give them so-called "idempotency".
What is a data warehouse?
A data warehouse is a subject-oriented, integrated, non-volatile (stable) collection of data that reflects changes over time. It mainly supports the decision analysis of managers.
The data warehouse collects historical data, such as the data sources and archive files of the enterprise's internal and external business systems, and ultimately turns it into the strategic decision-making information the enterprise needs.
Features:

  1. Subject-oriented: content is divided according to the different businesses;
  2. Integrated: source data from different businesses has different characteristics, so when it enters the data warehouse it is loaded in a unified encoding format to keep the data in the warehouse consistent;
  3. Non-volatile: the data warehouse keeps every historical state of the data and does not update stored data in place;
  4. Time-variant (historical): the data carries a timestamp field that records the state of each record at different points in time.

Suppose a crawler fetches data from the network quickly but writes it to local storage slowly. Which data structure is the better choice?
    Look this one up online (o°ω°o)
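The article leaves this question open. One common answer, offered here as an assumption rather than as the author's, is a FIFO queue used as a buffer between the fast fetcher and the slow writer (the producer-consumer pattern). A minimal sketch with the standard queue and threading modules follows; crawl, write, and run are hypothetical names, and the slow disk write is simulated with time.sleep.

```python
import queue
import threading
import time


def crawl(pages, buffer):
    """Fast producer: push every fetched page into the buffer."""
    for page in pages:
        buffer.put(page)  # blocks only when the buffer is full
    buffer.put(None)      # sentinel: nothing more to write


def write(buffer, sink, delay=0.01):
    """Slow consumer: take pages out in arrival order and store them."""
    while True:
        page = buffer.get()
        if page is None:
            break
        time.sleep(delay)  # simulated slow local write
        sink.append(page)


def run(pages, max_buffered=100):
    """Run one producer and one consumer; return what was written."""
    buffer = queue.Queue(maxsize=max_buffered)
    sink = []
    producer = threading.Thread(target=crawl, args=(pages, buffer))
    consumer = threading.Thread(target=write, args=(buffer, sink))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return sink


if __name__ == "__main__":
    print(run([f"page-{i}" for i in range(5)]))
```

The bounded maxsize keeps memory in check: once the buffer is full, the fetcher pauses until the writer catches up.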
Do you know Google's headless browser?
A headless browser is a browser without a graphical interface. Because it is still a browser, it has everything a browser should have; you simply cannot see its window.
PhantomJS, which Python's selenium module can drive, is one such headless browser: it is based on QtWebKit. Google Chrome can also run in headless mode.
Do you know the storage engines of the MySQL database?
InnoDB:
InnoDB is a robust transactional storage engine that many Internet companies use, and it gives users a powerful solution for working with very large data stores.
InnoDB is ideal in the following situations:

1. Update-intensive tables. The InnoDB storage engine is particularly suitable for handling many concurrent update requests.
2. Transactions. The InnoDB storage engine is the standard MySQL storage engine that supports transactions.
3. Automatic crash recovery. Unlike other storage engines, InnoDB tables can recover automatically after a crash.
4. Foreign key constraints. Among the storage engines shipped with standard MySQL Server, InnoDB is the one that supports foreign keys.
5. AUTO_INCREMENT columns. InnoDB supports the auto-incrementing column attribute AUTO_INCREMENT.
In general, if transaction support is required and there is a high frequency of concurrent reads and writes, InnoDB is a good choice.
MEMORY:
The reason to use the MySQL Memory storage engine is speed. To get the fastest response time, it stores table data in system memory.
That speed comes with drawbacks: although keeping table data in memory does give high performance, all Memory data is lost when the mysqld daemon crashes.
The Memory storage engine is generally used in the following situations:
1. The target data is small and accessed very frequently. Data stored in memory consumes RAM, so use the max_heap_table_size parameter to limit the maximum size of a Memory table.
2. The data is temporary and must be available immediately; such data can be stored in a Memory table.
3. Suddenly losing the data stored in the Memory table would not have a substantial negative impact on the application service.
What data structures does the redis database have?
Five data structures.
string
When you use a string, redis mostly does not interpret or parse its contents: whether you store JSON, XML, or plain text, to redis it is just a string. Apart from generic string operations such as strlen and append, nothing further can be done with the contents. The basic commands are set, get, strlen, getrange, and append.
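To make the behaviour of these five commands concrete without needing a running server, here is a toy in-memory model written in plain Python. StringStore is a hypothetical class, not the redis client library, and it only approximates how redis treats these commands (for example, getrange takes inclusive offsets and accepts negative ones).

```python
class StringStore:
    """Toy model of a few redis string commands (assumption: not a real client)."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _to_bytes(value):
        # Values are kept as opaque bytes; their meaning is never parsed.
        return value.encode() if isinstance(value, str) else bytes(value)

    def set(self, key, value):
        self._data[key] = self._to_bytes(value)
        return True

    def get(self, key):
        return self._data.get(key)  # None when the key does not exist

    def strlen(self, key):
        return len(self._data.get(key, b""))

    def append(self, key, value):
        """Append to the value (creating it if missing); return the new length."""
        self._data[key] = self._data.get(key, b"") + self._to_bytes(value)
        return len(self._data[key])

    def getrange(self, key, start, end):
        """Return the bytes from start to end, both inclusive."""
        value = self._data.get(key, b"")
        size = len(value)
        if start < 0:
            start = max(size + start, 0)
        if end < 0:
            end = max(size + end, 0)
        end = min(end, size - 1)
        if size == 0 or start > end:
            return b""
        return value[start:end + 1]


if __name__ == "__main__":
    store = StringStore()
    store.set("greeting", "Hello")
    print(store.append("greeting", " World"))
    print(store.strlen("greeting"))
    print(store.getrange("greeting", 0, 4))
    print(store.getrange("greeting", -5, -1))
```

A JSON document stored this way is handled exactly like any other value: the store can measure it, extend it, or slice it by byte offset, but it never looks inside.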

Origin http://43.154.161.224:23101/article/api/json?id=325528405&siteId=291194637