Several common Python crawler interview questions that helped me land a 20k offer


Do you understand thread synchronization and asynchrony?

Thread synchronization: multiple threads access the same resource, and each thread has to wait until the current access has finished before it can proceed. The waiting wastes time, so it is inefficient.
Thread asynchrony: while one access is idle or waiting, the thread goes on to access other resources instead of blocking.
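The synchronized case can be sketched with a lock around a shared counter. This is a minimal illustration, not a pattern taken from the original; the names counter, add_many and run are made up for the example.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, one thread at a time."""
    global counter
    for _ in range(n):
        # Synchronized access: other threads block here until the lock is free.
        with lock:
            counter += 1

def run(num_threads=4, n=10_000):
    """Start num_threads workers and return the final counter value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run())  # 4 threads x 10,000 increments, none lost
```

The lock keeps the result correct, and the price is exactly the waiting described above: only one thread at a time can be inside the with block.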
Do you understand synchronous and asynchronous network requests when implementing a multi-threading mechanism?
Synchronous: submit the request -> wait for the server to process it -> processing completes. The client browser cannot do anything else during this period.
Asynchronous: the request is triggered by an event -> the server processes it (the browser can still do other things during this time) -> processing completes.
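The asynchronous flow can be sketched with asyncio. This is only an illustration: fetch is a hypothetical stand-in for a real network request, and asyncio.sleep simulates the time the server spends processing.

```python
import asyncio

async def fetch(name, delay, log):
    """Stand-in for a network request; asyncio.sleep simulates server time."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)   # control returns to the event loop while waiting
    log.append(f"done {name}")
    return name

async def crawl():
    log = []
    # All three requests are in flight at once; none blocks the others.
    results = await asyncio.gather(
        fetch("a", 0.06, log),
        fetch("b", 0.04, log),
        fetch("c", 0.02, log),
    )
    return results, log

results, log = asyncio.run(crawl())
print(results)   # gather returns results in the order the requests were passed in
print(log)       # all three "start" entries appear before any "done" entry
```

In a synchronous version each request would have to finish before the next one started, so the log would alternate start and done entries.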
What are the advantages of linked-list storage and of sequence-table storage?
1. Sequence table storage
    Principle: a sequence table stores its data elements in one contiguous block of memory, so access is efficient and fast, but the length cannot grow dynamically.
    Advantages: fast access; elements are read directly by subscript (index).
    Disadvantages: 1. insertion and deletion are relatively slow; 2. the length cannot be increased.
    Example: when an element is inserted or deleted, the table has to be traversed and the following elements moved to restore the order.
2. Linked list storage
    Principle: a linked list allocates space dynamically while the program runs. As long as memory is still available, there is no storage overflow problem.
    Advantages: fast insertion and deletion that leave the other elements where they are physically stored. For example, inserting or deleting an element only requires changing where the pointers point.
    Disadvantages: searching is slow, because finding an element requires looping through the linked list node by node.
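The trade-off can be sketched in Python. The Node and LinkedList classes are written for this example only. Python's built-in list is a dynamic array, so unlike the fixed-length sequence table described above it resizes automatically; it stands in here only to show contiguous, index-based storage.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # O(1): only the head pointer changes, no element is moved.
        self.head = Node(value, self.head)

    def insert_after(self, node, value):
        # O(1) once the node is known: rewire the pointers around it.
        node.next = Node(value, node.next)

    def find(self, value):
        # O(n): the chain has to be walked node by node.
        node = self.head
        while node is not None and node.value != value:
            node = node.next
        return node

    def to_list(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.value)
            node = node.next
        return out

# Sequence table: direct access by subscript, but inserting shifts elements.
seq = [10, 20, 30]
print(seq[1])        # read by index in constant time
seq.insert(0, 5)     # every existing element moves one slot to the right

# Linked list: insertion rewires pointers instead of shifting elements.
ll = LinkedList()
for v in (30, 20, 10):
    ll.push_front(v)
ll.insert_after(ll.find(20), 25)
print(ll.to_list())
```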
How do you deal with network delays and network exceptions when using Redis to build a distributed system?
Because network exceptions exist, the result of a request in a distributed system has three possible states: "success", "failure", and "timeout (unknown)".
When a "timeout" occurs, you can issue a read of the data to verify whether the RPC actually succeeded (banking systems do this, for example).
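The read-to-verify step can be sketched as follows. Everything here is an assumption made for illustration: FlakyStore is a hypothetical in-memory stand-in for a remote store such as Redis, and the key name is invented. The sketch also assumes the verifying read itself succeeds.

```python
import enum

class Result(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"   # outcome unknown: the write may or may not have landed

class FlakyStore:
    """Hypothetical in-memory stand-in for a remote store."""
    def __init__(self, fault=None):
        self.data = {}
        self.fault = fault   # None, "lost_request", or "lost_reply"

    def write(self, key, value):
        if self.fault == "lost_request":
            raise TimeoutError("request never reached the server")
        self.data[key] = value
        if self.fault == "lost_reply":
            raise TimeoutError("server applied the write but the reply was lost")
        return True

    def read(self, key):
        return self.data.get(key)

def try_write(store, key, value):
    """Classify a single write attempt as one of the three states."""
    try:
        return Result.SUCCESS if store.write(key, value) else Result.FAILURE
    except TimeoutError:
        return Result.TIMEOUT

def write_and_verify(store, key, value):
    """On timeout, read the data back to find out what really happened."""
    result = try_write(store, key, value)
    if result is Result.TIMEOUT:
        result = Result.SUCCESS if store.read(key) == value else Result.FAILURE
    return result

print(try_write(FlakyStore("lost_reply"), "balance:42", "100"))
print(write_and_verify(FlakyStore("lost_reply"), "balance:42", "100"))
print(write_and_verify(FlakyStore("lost_request"), "balance:42", "100"))
```

From the client's side both faults look identical (a timeout), which is why the extra read is needed to tell them apart.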
Another simple approach when designing a distributed protocol is to make the execution steps retryable, that is, to give them so-called "idempotency".
What is a data warehouse?
A data warehouse is a subject-oriented, integrated, non-volatile collection of data that reflects historical changes over time. Its main purpose is to support managers' decision analysis.
The data warehouse collects historical data from the data sources and archived files of the enterprise's various internal and external business systems, and ultimately converts it into the strategic decision-making information the enterprise needs.
Features:

  1. Subject-oriented: content is divided according to business subject;
  2. Integrated: because source data from different businesses has different characteristics, data entering the warehouse must be loaded in a unified encoding format so that the same data has one consistent representation in the warehouse;
  3. Non-volatile: the data warehouse does not update data in place; it preserves each historical state of the data;
  4. Time-variant: the data carries a timestamp field that records the state of each record at different times.
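The non-volatile and time-variant features can be illustrated with a tiny append-only table. This is a toy sketch, not how a real warehouse is implemented; HistoryTable, its methods, and the key names are hypothetical.

```python
class HistoryTable:
    """Append-only store: rows are never updated in place."""
    def __init__(self):
        self.rows = []   # each row is (key, value, timestamp)

    def record(self, key, value, ts):
        # A change is stored as a new timestamped row; old rows stay untouched.
        self.rows.append((key, value, ts))

    def as_of(self, key, ts):
        """Return the value the key had at time ts, or None if it did not exist yet."""
        candidates = [r for r in self.rows if r[0] == key and r[2] <= ts]
        return max(candidates, key=lambda r: r[2])[1] if candidates else None

table = HistoryTable()
table.record("customer:1:city", "Beijing", ts=1)
table.record("customer:1:city", "Shanghai", ts=5)
print(table.as_of("customer:1:city", 3))   # state at time 3
print(table.as_of("customer:1:city", 9))   # latest state
print(len(table.rows))                     # both historical rows are kept
```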

Suppose a crawler fetches data from the network quickly but writes data to local storage slowly. Which data structure is better to use?
    Look this one up online (o°ω°o)
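A common answer to this kind of question is a queue used as a buffer between the fast fetcher and the slow writer (the producer-consumer pattern). The sketch below is one way to illustrate that idea, not the author's answer; the fetcher and writer functions, the example.com URLs, and the fake page content are all stand-ins.

```python
import queue
import threading
import time

def fetcher(urls, buffer):
    """Fast producer: pushes fetched pages into the buffer."""
    for url in urls:
        buffer.put(f"<html for {url}>")   # stand-in for a real download
    buffer.put(None)                      # sentinel: no more data

def writer(buffer, sink):
    """Slow consumer: drains the buffer at its own pace."""
    while True:
        page = buffer.get()
        if page is None:
            break
        time.sleep(0.01)                  # simulate a slow local write
        sink.append(page)

buffer = queue.Queue(maxsize=100)         # bounded, so the fetcher cannot outrun memory
sink = []
urls = [f"https://example.com/page/{i}" for i in range(5)]

t_fetch = threading.Thread(target=fetcher, args=(urls, buffer))
t_write = threading.Thread(target=writer, args=(buffer, sink))
t_fetch.start()
t_write.start()
t_fetch.join()
t_write.join()
print(len(sink))   # every fetched page was eventually written, in FIFO order
```

The bound on the queue matters: when the buffer is full, put blocks, which slows the fetcher down to the writer's pace instead of letting memory grow without limit.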
Do you know Google's headless browser?
A headless browser is a browser without a graphical interface. Because it is still a browser, it has everything a browser should have; you just cannot see the interface. Google Chrome itself also ships a headless mode.
PhantomJS, which can be driven through Python's selenium module, is one such headless browser: it is based on QtWebKit.
Do you know some of the MySQL storage engines?
    InnoDB:
    InnoDB is a robust transactional storage engine that many Internet companies use; it gives users a powerful solution for working with very large data stores.
Using InnoDB is ideal in the following situations:

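1. Update-intensive tables. The InnoDB storage engine is particularly well suited to handling many concurrent update requests.
2. Transactions. InnoDB is the standard MySQL storage engine that supports transactions.
3. Automatic crash recovery. Unlike other storage engines, InnoDB tables can recover automatically from a crash.
4. Foreign key constraints. Among the standard MySQL storage engines, only InnoDB supports foreign keys.
5. Support for the AUTO_INCREMENT column attribute.
In general, if you need transaction support and have a high frequency of concurrent reads, InnoDB is a good choice.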
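    MEMORY:
The point of using the MySQL Memory storage engine is speed. To get the fastest response time, the storage medium it uses is system memory.
Although storing table data in memory does deliver high performance, all Memory data is lost when the mysqld daemon crashes.
The speed comes with some drawbacks.
The Memory storage engine is generally used in the following situations:
1. The target data is small and is accessed very frequently. Because the data lives in memory, it consumes memory; the max_heap_table_size parameter limits the maximum size of a Memory table.
2. The data is temporary and must be available immediately, so it can be kept in a Memory table.
3. Losing the data stored in a Memory table unexpectedly would have no real negative impact on the application service.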
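What data structures does the Redis database have?
Five data structures
string
When you use a string, Redis in most cases does not understand or parse its meaning. Whether you store JSON, XML, or plain text, Redis sees the same thing: just a string. Only generic string operations such as strlen and append are available, and the content cannot be operated on any further. Its basic commands are set, get, strlen, getrange, and append.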
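The behavior of these five commands can be sketched without a running server. StringCommands below is a hypothetical in-memory stand-in written for this example, not the Redis client. The redis-py client (redis.Redis) exposes methods with the same names, so the calls at the bottom should look the same against a real server, assuming a client created with decode_responses=True.

```python
class StringCommands:
    """Hypothetical stand-in mimicking five Redis string commands (ASCII only)."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = str(value)
        return True

    def get(self, key):
        return self._data.get(key)          # None when the key does not exist

    def strlen(self, key):
        return len(self._data.get(key, ""))  # 0 when the key does not exist

    def getrange(self, key, start, end):
        # GETRANGE treats `end` as inclusive and accepts negative offsets.
        s = self._data.get(key, "")
        n = len(s)
        if start < 0:
            start = max(n + start, 0)
        if end < 0:
            end = max(n + end, 0)
        end = min(end, n - 1)
        return s[start:end + 1] if start <= end else ""

    def append(self, key, value):
        self._data[key] = self._data.get(key, "") + str(value)
        return len(self._data[key])          # APPEND returns the new length

r = StringCommands()
r.set("greeting", "hello")
print(r.append("greeting", " world"))   # new length of the string
print(r.get("greeting"))
print(r.getrange("greeting", 0, 4))     # inclusive range
print(r.getrange("greeting", -5, -1))   # negative offsets count from the end

r.set("doc", '{"a": 1}')                # JSON is stored as an opaque string
print(r.strlen("doc"))                  # only its length is known, not its fields
```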

