Data Analysis Interview Questions (2023.09.08)

Data analysis process

It is generally divided into four layers: the requirements layer, the data layer, the analysis layer and the conclusion layer.

1. Statistical issues

1. Restate Bayesian formula and explain application scenarios

  • Formula: P(A|B)= P(B|A)*P(A) / P(B)
  • Application scenario: For example, search query error correction, assuming A is the correct word and B is the input word, then:

      a. P(A|B) represents the probability that input word B is actually A.

      b. P(B|A) represents the probability that word A is mistakenly entered as B, which can be calculated based on the similarity of AB (such as edit distance)

      c. P(A) is the frequency of word A, obtained statistically

      d. P(B) is the same for all candidate A, so it can be omitted
     

  • Naive Bayes is a method for inferring the most likely cause from an observed result when some prior probabilities are known; "naive" means that the features (events) are assumed to be independent of each other. (A worked form of the scoring rule follows below.)
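
Written out as a worked equation, the error-correction scoring above picks the candidate with the largest posterior (a sketch of the standard noisy-channel formulation):

    \hat{A} = \arg\max_{A} P(A \mid B)
            = \arg\max_{A} \frac{P(B \mid A)\,P(A)}{P(B)}
            = \arg\max_{A} P(B \mid A)\,P(A),

since P(B) is the same for every candidate A and can be dropped.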

2. Parameter estimation

Parameter estimation refers to a method of estimating unknown parameters contained in a population distribution based on samples drawn from the population. It is a basic form of statistical inference and an important branch of mathematical statistics. It is divided into two parts: point estimation and interval estimation.

  • Point estimate: using a statistic (a function of the sample) to estimate an unknown parameter, or a function of the unknown parameters, of the population distribution.

  • Interval estimation (confidence interval estimation): based on the drawn sample and the required accuracy and confidence, construct an appropriate interval as an estimate of the range in which the true value of the unknown parameter (or a function of the parameters) of the population distribution lies. For example, the familiar statement that we are a certain percentage sure that a value lies within a certain range is the simplest application of interval estimation.

3. Maximum likelihood estimation

 Maximum likelihood estimation uses known sample results to infer the parameter values that are most likely (have the greatest probability) to produce those results.
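
A small worked example (the standard coin-tossing case, added here as an illustration): suppose we observe k heads in n independent tosses of a coin with unknown head probability p. Then

    L(p) = p^{k}(1-p)^{n-k}, \qquad
    \ell(p) = k\ln p + (n-k)\ln(1-p), \qquad
    \frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0
    \;\Longrightarrow\; \hat{p} = \frac{k}{n},

i.e. the sample frequency is the maximum likelihood estimate of p.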

4. Hypothesis testing

Parameter estimation and hypothesis testing are two components of statistical inference. They both use samples to make some inferences about the population, but from different angles.

  • Parameter estimation discusses the method of using samples to estimate population parameters. The population parameter μ is unknown before estimation.
  • Hypothesis testing is to first formulate a hypothesis about the value of μ, and then use sample information to test whether the hypothesis is true. 

5. What is the P value? 

The P value is a quantity used to decide the result of a hypothesis test. Alternatively, depending on the distribution involved, the test statistic can be compared against the rejection region of that distribution.

The P value is the probability, computed under the assumption that the null hypothesis is true, of obtaining the observed sample result or a more extreme one. If the P value is very small, it means that such a result would be very unlikely if the null hypothesis were true; since it did occur, the principle of small probability gives us reason to reject the null hypothesis. The smaller the P value, the stronger the grounds for rejecting the null hypothesis. In short, the smaller the P value, the more significant the result. However, whether a result counts as "significant", "moderately significant" or "highly significant" must be decided by ourselves based on the size of the P value and the actual problem.

6. Confidence and confidence intervals

  • Confidence interval: the interval, computed from the sample, within which the true value is estimated to lie.
  • Confidence (confidence level): how confident we are that the true value lies within the interval we computed. (A formula sketch follows this list.)
  • Example: ① We are 95% sure that the real value is within the range we calculated. 95% is the confidence level, and the calculated range is the confidence interval. ② If the confidence level is 95%, then 100 samples are taken to estimate the population mean. Among the 100 intervals constructed from 100 samples, approximately 95 intervals contain the population mean.
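
As a concrete sketch (the textbook large-sample case, assuming the standard deviation σ is known or well estimated), the 95% confidence interval for a population mean is

    \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
    = \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \qquad (\alpha = 0.05),

where x̄ is the sample mean and n the sample size; 95% is the confidence level and the resulting interval is the confidence interval.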

7. The difference and connection between covariance and correlation coefficient

  • Covariance: covariance measures how two variables vary together, unlike variance, which concerns the variation of a single variable. If the two variables tend to move in the same direction, i.e. when one is above its expected value the other also tends to be above its expected value, their covariance is positive. If they tend to move in opposite directions, i.e. when one is above its expected value the other tends to be below its expected value, their covariance is negative.
  • Correlation coefficient: a quantity measuring the degree of linear correlation between two variables, with values in [-1, 1]. The correlation coefficient can be viewed as a standardized covariance: a covariance from which the units (dimensions) of the two variables have been removed. (Formulas below.)
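
In formula form (standard definitions):

    \operatorname{Cov}(X,Y) = E\big[(X - E[X])(Y - E[Y])\big], \qquad
    \rho_{XY} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\,\sigma_Y} \in [-1, 1],

so the correlation coefficient is the covariance divided by the product of the two standard deviations, which removes the units.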

8. Central limit theorem

  • Definition: ① the mean of a sufficiently large sample is approximately equal to the mean of the population it is drawn from; ② regardless of the population's distribution, the means of repeated samples are approximately normally distributed around the population mean.
  • Function: ① When there is no way to obtain all the data of the population, we can use samples to estimate the population; ② Based on the mean and standard deviation of the population, judge whether a sample belongs to the population.

2. Probability issues

1. 54 playing cards are divided into 2 parts. Find the probability that both parts have 2 Aces.

Split the 54 cards into two piles of 27 each. M counts the favorable splits, in which the first pile contains exactly 2 of the 4 Aces: M = C(4,2)·C(50,25) (choose 2 of the 4 Aces and 25 of the 50 non-Aces for the first pile).

N counts all possible splits: N = C(54,27).

The probability is M/N = C(4,2)·C(50,25)/C(54,27), which simplifies (see below) to 351/901 ≈ 0.39.
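
Carrying out the simplification explicitly:

    P = \frac{\binom{4}{2}\binom{50}{25}}{\binom{54}{27}}
      = \frac{6\,(27 \cdot 26)^2}{54 \cdot 53 \cdot 52 \cdot 51}
      = \frac{351}{901} \approx 0.39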

2. The click-through rate for boys increases, and the click-through rate for girls increases. Why does the overall click-through rate decrease?

Because the click-through rates of men and women may differ greatly, and at the same time the share of the group with the lower click-through rate has grown (Simpson's paradox).

For example, there are 20 men and 1 click; there are 100 women and 99 clicks, and the total click rate is 100/120

Now there are 100 men, 6 clicks; 20 women, 20 clicks, the total click rate is 26/120

3. Database

1. What is a database, database management system, database system, database administrator?

  • Database: a database (DB) is an organized collection of data, managed by a database management system.
  • Database management system: A database management system is a large-scale software that manipulates and manages databases. It is usually used to establish, use and maintain databases.
  • Database System: A database system usually consists of software, a database, and a database administrator.
  • Database Administrator: A database administrator is responsible for overall management and control of the database system.

2. What are tuples, keys (codes), candidate keys, primary keys, foreign keys, prime attributes, and non-prime attributes?

  • Tuple: Tuple is the basic concept in relational database. A relation is a table. Each row in the table (that is, each record in the database) is a tuple, and each column is an attribute. In a two-dimensional table, Tuples are also called rows.
  • Key (code): a key is an attribute, or group of attributes, whose value uniquely identifies an entity (a tuple); it corresponds to one or more columns of the table.
  • Candidate key: if the value of an attribute or attribute group uniquely identifies a tuple, but no proper subset of it does, that attribute group is a candidate key. Among student entities, the student number uniquely identifies a student; assuming the combination of name and class is also sufficient to distinguish students, then {student number} and {name, class} are both candidate keys.
  • Primary key: the primary key is chosen from among the candidate keys. An entity set has only one primary key, but it may have several candidate keys.
  • Foreign key: if an attribute of one relation is the primary key of another relation, that attribute is a foreign key of the first relation.
  • Prime attributes: attributes that appear in some candidate key are called prime attributes. For example, in worker (employee number, ID number, name, gender, department), both the employee number and the ID number uniquely identify a tuple, so both are candidate keys and both attributes are prime attributes. If a candidate key is an attribute group, every attribute in that group is a prime attribute.
  • Non-prime attributes: attributes that do not appear in any candidate key. For example, in student (student number, name, age, gender, class), the primary key is the student number, so name, age, gender and class are non-prime attributes.

3. What is the difference between primary key and foreign key?

  • Primary key: used to uniquely identify a tuple; it cannot be duplicated and cannot be null. A table can have only one primary key.
  • Foreign key: used to establish relationships with other tables; a foreign key is the primary key of another table. Foreign keys may repeat and may be null, and a table can have multiple foreign keys.

4. Database paradigm 

  • First normal form (1NF): attributes (the fields of the corresponding table) cannot be subdivided; each field holds a single value and cannot be split into several other fields (atomicity). 1NF is the most basic requirement of all relational databases: any table created in a relational database must satisfy the first normal form.
  • Second normal form (2NF): built on 1NF, 2NF eliminates partial functional dependencies of non-prime attributes on the key. On top of 1NF the table has a primary key, and every non-prime attribute must depend on the whole primary key, not just part of it.
  • Third normal form (3NF): built on 2NF, 3NF eliminates transitive functional dependencies of non-prime attributes on the key, which addresses excessive data redundancy as well as insertion, update and deletion anomalies. For example, in the relation R (student number, name, department name, department chair), student number → department name and department name → department chair, so the non-prime attribute department chair is transitively dependent on the student number, and the table design does not satisfy 3NF (see the SQL sketch after this list).
  • Summary: 1NF: attributes cannot be subdivided. 2NF: on top of 1NF, eliminates partial functional dependencies of non-prime attributes on the key. 3NF: on top of 2NF, eliminates transitive functional dependencies of non-prime attributes on the key.
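
A minimal sketch (hypothetical table and column names) of the split referred to above, removing the transitive dependency student number → department name → department chair:

    -- the department chair now depends only on the department table's own key
    CREATE TABLE department (
        dept_id    INT PRIMARY KEY,
        dept_name  VARCHAR(50) NOT NULL,
        dept_chair VARCHAR(50)
    );

    -- every non-key attribute of student depends directly on student_no
    CREATE TABLE student (
        student_no INT PRIMARY KEY,
        name       VARCHAR(50) NOT NULL,
        dept_id    INT,
        FOREIGN KEY (dept_id) REFERENCES department(dept_id)
    );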

5. What is functional dependency? Partial functional dependency? Full functional dependency? Transitive functional dependencies?

  • Functional dependency: if, in a table, whenever the value of attribute (or attribute group) X is fixed the value of attribute Y is also determined, then Y is functionally dependent on X, written X → Y.
  • Partial functional dependency: if X → Y and there is a proper subset X0 of X such that X0 → Y, then Y is partially functionally dependent on X. For example, in the student table R (student number, ID number, name), the student number is unique, so (student number, ID number) → (name), (student number) → (name) and (ID number) → (name) all hold; therefore name is partially functionally dependent on (student number, ID number).
  • Full functional dependency: in a relation, if a non-prime attribute depends on the whole key and not on any proper part of it, it is fully functionally dependent on the key. For example, in the student table R (student number, class, name), assume student numbers may repeat across classes but are unique within a class. Then (student number, class) → (name) holds, while (student number) → (name) and (class) → (name) do not, so name is fully functionally dependent on (student number, class).
  • Transitive functional dependency: in the relation schema R(U), let X, Y and Z be distinct subsets of the attribute set U. If X → Y and Y → Z, where Y is not contained in X, Y does not determine X, and (X ∪ Y) ∩ Z is empty, then Z is transitively functionally dependent on X. Transitive functional dependencies lead to data redundancy and anomalies; the Y and Z attributes usually describe the same thing, so they can be moved into a table of their own. For example, in the relation R (student number, name, department name, department chair), student number → department name and department name → department chair, so the non-prime attribute department chair is transitively dependent on the student number.

6. What is a stored procedure?

We can think of a stored procedure as a collection of SQL statements with some control-flow logic added in between. Stored procedures are very practical when the business logic is complex: instead of sending a long series of SQL statements to complete one operation, we can write a stored procedure, which is also convenient to call again later. Once debugged, a stored procedure runs stably. In addition, using a stored procedure is usually faster than sending the equivalent SQL statements one by one, because the stored procedure is precompiled.

However, some companies do not use stored procedures much because stored procedures are difficult to debug and expand, are not portable, and consume database resources.
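
A minimal MySQL-style sketch (hypothetical orders table and procedure name, only to show the shape of a stored procedure):

    DELIMITER //
    CREATE PROCEDURE get_recent_orders(IN p_user_id INT)
    BEGIN
        -- the query we would otherwise send from the application each time
        SELECT order_id, amount, created_at
        FROM orders
        WHERE user_id = p_user_id
          AND created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY);
    END //
    DELIMITER ;

    -- invoked like any routine:
    CALL get_recent_orders(42);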

7. What is the difference between drop, delete and truncate?

  • drop (remove the table): DROP TABLE table_name deletes the table itself.
  • truncate (clear the data): TRUNCATE TABLE table_name deletes only the data in the table; afterwards the auto-increment ID starts again from 1. It is used to empty a table.
  • delete (delete rows): DELETE FROM table_name WHERE column = value deletes the rows matching the condition. Without a where clause the effect is similar to TRUNCATE TABLE table_name.
  • Summary: truncate, delete without a where clause, and drop all remove the data in the table, but truncate and delete remove only the data, not the table's structure (definition); after drop is executed the table's structure is removed as well, i.e. the table no longer exists.
  • truncate and drop are DDL (data definition language) statements: they take effect immediately, the original data is not written to the rollback segment, they cannot be rolled back, and they do not fire triggers. delete is a DML (data manipulation language) statement: the operation is written to the rollback segment and takes effect only after the transaction is committed.
  • Execution speed: drop > truncate > delete (see the side-by-side sketch below).
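
Side by side, on a hypothetical table t:

    DELETE FROM t WHERE id = 10;  -- DML: removes matching rows; can be rolled back
    DELETE FROM t;                -- removes all rows one by one; the auto-increment counter keeps counting
    TRUNCATE TABLE t;             -- DDL: empties the table; the auto-increment counter restarts from 1
    DROP TABLE t;                 -- DDL: removes the data and the table definition itself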

8. The difference between DML statements and DDL statements.

  • DML (Data Manipulation Language) refers to operations on the records of database tables, mainly insert, update, delete and select; these are the operations developers use most frequently in day-to-day work.
  • DDL (Data Definition Language) refers to statements that create, delete and modify objects inside the database. The biggest difference from DML is that DML only operates on the data inside tables and does not involve table definitions, structural changes or other database objects. DDL statements are used more by database administrators (DBAs) and rarely by ordinary developers.

9. What are the usual steps of database design?

  • Requirements analysis: Analyze user needs, including data, functional and performance requirements.
  • Conceptual structure design: mainly using ER model for design, including drawing ER diagram.
  • Logical structure design: realize the conversion from ER model to relational model by converting ER diagram into table.
  • Physical structure design: mainly selecting the appropriate storage structure and access path for the designed database.
  • Database implementation: including programming, testing and commissioning.
  • Database operation and maintenance: system operation and daily maintenance of the database.

10. What are the ACID characteristics of transactions?

  • Atomicity: A transaction is the smallest unit of execution and does not allow splitting. The atomicity of transactions ensures that actions either complete completely or have no effect at all.
  • Consistency: The data remains consistent before and after a transaction is executed, and the results of multiple transactions reading the same data are the same.
  • Isolation: When accessing the database concurrently, a user's transaction will not be interfered by other transactions, and the database is independent between concurrent transactions.
  • Durability: once a transaction is committed, its changes to the database are permanent; even if the database subsequently fails, the committed changes should not be lost.

11. What problems do concurrent transactions bring?

In a typical application, multiple transactions run concurrently and often operate on the same data to complete their respective tasks (multiple users operating on the same data). Concurrency is necessary, but it may cause the following problems:

  • Dirty read: a transaction is accessing and modifying data, but the modification has not yet been committed to the database, when another transaction also accesses and then uses that data. Because the data has not been committed, what the second transaction reads is "dirty data", and operations based on dirty data may be incorrect. (The other transaction reads the uncommitted modification; the operations involved are modify then query.)
  • Lost modification (lost update): one transaction reads a piece of data while another transaction also accesses it; after the first transaction modifies the data, the second transaction modifies it as well, so the first transaction's modification is lost. For example: transaction 1 reads A = 20 from a table, transaction 2 also reads A = 20; transaction 1 sets A = A - 1 and transaction 2 also sets A = A - 1, so the final result is A = 19 and transaction 1's modification is lost. (Two transactions modify at the same time and the later modification overwrites the earlier one; the operations involved are modify then modify.)
  • Unrepeatable read: a transaction reads the same data several times, and before it finishes, another transaction accesses and modifies that data, so the two reads inside the first transaction return different results. (Between the first transaction's reads, the second transaction's modification makes the reads unrepeatable; the operations involved are query, modify, query.)
  • Phantom read: similar to unrepeatable read. It happens when one transaction (T1) reads a few rows and then another concurrent transaction (T2) inserts some rows; in later queries the first transaction finds extra records that did not exist before, as if seeing an illusion, hence the name phantom read. (During the first transaction's reads, the second transaction inserts or deletes rows, making the first transaction's results inconsistent; the operations involved are query, insert/delete, query.)

12. What is the difference between non-repeatable reading and phantom reading?

The focus of non-repeatable reading is modification, and the focus of phantom reading is addition or deletion.

13. What are the isolation levels of transactions? What is MySQL's default isolation level?

The SQL standard defines four isolation levels:

  • Read-uncommitted: the lowest isolation level; it allows reading uncommitted data changes, which may lead to dirty reads, non-repeatable reads and phantom reads.
  • Read-committed: Allows reading of data that has been committed by concurrent transactions and prevents dirty reads. However, non-repeatable reads and phantom reads may still occur.
  • Repeatable-read: The results of multiple reads of the same field are consistent. Unless the data is modified by its own transaction, dirty reads and non-repeatable reads can be prevented, and phantom reads may still occur.
  • Serializable (serializable): the highest isolation level, fully compliant with the ACID isolation level, all transactions are executed one by one, so that there is no interference between transactions, and dirty reads, non-repeatable reads and phantom reads can be prevented.
  • The default isolation level of MySQL (InnoDB) is repeatable read; the snippet below shows how to inspect or change it.
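
In MySQL the isolation level can be inspected and changed per session, for example:

    SELECT @@transaction_isolation;                          -- MySQL 8.0 (use @@tx_isolation on 5.7)
    SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;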

14. The difference between optimistic locking and pessimistic locking

  • Pessimistic locking: always assume the worst case. Every time you read the data, you assume someone else will modify it, so you lock it each time; anyone else who wants the data then blocks until they obtain the lock (the shared resource is used by only one thread at a time, other threads block, and the resource is handed over after use). Traditional relational databases use many such lock mechanisms, e.g. row locks, table locks, read locks and write locks, all acquired before the operation. Exclusive locks such as synchronized and ReentrantLock in Java are implementations of the pessimistic-locking idea.
  • Optimistic locking: always assume the best case. Every time you read the data, you assume nobody else will modify it, so you do not lock; when updating, you check whether anyone else has updated the data in the meantime. It can be implemented with a version-number mechanism or the CAS algorithm. Optimistic locking suits read-heavy applications and can improve throughput; the write_condition mechanism provided by some databases is in fact an optimistic lock. In Java, the atomic variable classes under the java.util.concurrent.atomic package are implemented with CAS, an implementation of optimistic locking.
  • Usage scenarios of the two locks: optimistic locking suits read-heavy scenarios; pessimistic locking suits write-heavy scenarios.

15. Two ways to implement optimistic locking.

  • Version number mechanism: a version field is usually added to the data table to record how many times the row has been modified; each modification increments version by one. When thread A wants to update the row, it reads the version value along with the data; when it submits the update, the update is applied only if the version it read equals the version currently in the database, otherwise the update is retried until it succeeds. (The submitted version must be greater than the recorded version for the update to be performed; see the SQL sketch after this list.)

    Let's take a simple example. Suppose the account table in the database has a version field whose current value is 1, and a balance field whose current value is $100. Operator A reads the row (version = 1) and deducts $50 from the balance, leaving $50. While A is working, operator B also reads the row (version = 1) and deducts $20, leaving $80. Operator A finishes first, increments the version to 2 and submits the update together with the new balance (balance = $50); because the submitted version (2) is greater than the current version of the database record (1), the update is applied and the record's version becomes 2. Operator B then also increments the version to 2 and tries to submit balance = $80, but the submitted version (2) is not greater than the current record version (2), so the rule "the submitted version must be greater than the current version of the record for the update to be performed" is not met and B's submission is rejected. This prevents B from overwriting A's result with a modification based on the stale version = 1 data.

  • CAS algorithm: CAS (compare and swap) is a well-known lock-free algorithm. Lock-free programming means synchronizing variables between multiple threads without using locks, i.e. achieving variable synchronization without blocking threads, so it is also called non-blocking synchronization. CAS involves three operands: ① the memory value V to be read and written, ② the expected value A to compare against, and ③ the new value B to be written. If and only if the value of V equals A, CAS atomically updates V to the new value B; otherwise it does nothing (compare-and-replace is a single atomic operation). It is generally used as a spin operation, i.e. retried continuously until it succeeds.
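
A SQL sketch of the version-number mechanism described above (hypothetical account table):

    -- 1. read the row together with its version
    SELECT balance, version FROM account WHERE id = 1;   -- e.g. balance = 100, version = 1

    -- 2. update only if nobody has bumped the version in the meantime
    UPDATE account
       SET balance = balance - 50,
           version = version + 1
     WHERE id = 1 AND version = 1;

    -- if the statement reports 0 affected rows, another transaction got there first:
    -- re-read the row and retry.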

16. Disadvantages of optimistic locking.

  • ABA problem: if a variable V had value A when it was first read and still has value A when we are about to assign to it, can we conclude that no other thread modified it in between? No: its value may have been changed to something else and then changed back to A, in which case CAS will wrongly conclude that it was never modified. This is known as the "ABA" problem of CAS. Since JDK 1.5 the AtomicStampedReference class addresses it: its compareAndSet method first checks whether the current reference equals the expected reference and whether the current stamp equals the expected stamp, and only if both match does it atomically set the reference and the stamp to the given new values.
  • Long spin time is expensive: spin CAS (looping until it succeeds) imposes a very large execution cost on the CPU if it keeps failing for a long time. If the JVM can use the pause instruction provided by the processor, efficiency improves somewhat. The pause instruction has two effects: first, it delays the pipelined execution of instructions (de-pipelining) so the CPU does not consume excessive execution resources; the delay depends on the implementation and is zero on some processors. Second, it avoids the CPU pipeline flush caused by memory-order violations when leaving the loop, which improves CPU efficiency.
  • Only atomic operations on a single shared variable are guaranteed: CAS works for one shared variable and does not help when an operation involves several shared variables. However, since JDK 1.5 the AtomicReference class guarantees atomicity between reference objects, so several variables can be placed in one object and updated with a single CAS; alternatively a lock can be used. In other words, several shared variables can be merged into one shared variable to operate on.

17. What is a database index?

  • Database index: It is a sorted data structure in the database management system to assist in quickly querying and updating data in database tables.
  • Implementation: indexes are usually implemented with B-trees or B+ trees. They speed up data access because the storage engine no longer scans the whole table to find the required data; instead it starts from the root node, whose entries store pointers to child nodes, and follows those pointers down to locate the data quickly.

As an illustration (figure omitted): take a data table with two fields, col1 and col2, and 15 records, and a B-tree index built on col2. Each index node holds an index value and a pointer to the corresponding row's address, so the row can be reached through the B-tree in O(log n) time, which speeds up retrieval significantly. Setting up an index has a price: it increases the database's storage space, and inserting and modifying data takes more time because the index must be updated accordingly.

18. What are the advantages and disadvantages of indexing?

  • Advantages: ① creating an index can greatly improve system performance; ② a unique index guarantees the uniqueness of each row of data in the table; ③ it can greatly speed up data retrieval, which is the main reason for creating indexes; ④ it can speed up joins between tables, which is particularly meaningful for enforcing referential integrity of the data; ⑤ when grouping and sorting clauses are used for retrieval, it can significantly reduce the time spent on grouping and sorting in queries; ⑥ by using indexes, the query optimizer can be exploited during query processing, improving system performance.
  • Disadvantages: ① creating and maintaining indexes takes time, and the time grows with the amount of data; ② indexes occupy physical space in addition to the space taken by the data table, and a clustered index takes even more; ③ when data in the table is added, deleted or modified, the indexes must be maintained dynamically, which slows data maintenance down.

19. Generally speaking, on which columns do you need to create an index?

  • You can speed up searches on frequently searched columns.
  • On primary key columns, to enforce the uniqueness of the column and to organize the arrangement of the data in the table.
  • On the columns that are often used in connections, these columns are mainly foreign keys, which can speed up the connection.
  • Create indexes on columns that frequently need to be retrieved based on ranges, because the index is sorted and its specified range is contiguous.
  • Create indexes on columns that often need to be sorted, because the index is already sorted, so that queries can take advantage of the index's sorting and speed up sorted query times.
  • Create indexes on columns that are frequently used in where clauses to speed up condition judgment.

20. Generally speaking, on which columns should indexes not be created?

  • Columns that are rarely used or referenced in queries should not be indexed.
  • Indexes should not be added to columns with few data values.
  • Indexes should not be added to columns defined with the text, image or bit data types, because these columns either hold very large amounts of data or have very few distinct values.
  • Indexes should not be created when modification performance matters far more than retrieval performance. Modification performance and retrieval performance conflict: adding indexes improves retrieval but reduces modification performance, while removing indexes improves modification but reduces retrieval performance. Therefore, when modification performance requirements far outweigh retrieval performance requirements, indexes should not be created.

21. Three types of indexes can be created in the database designer: unique indexes, primary key indexes and clustered indexes.

  • Unique index: an index that does not allow any two rows to have the same index value. Most databases refuse to save a newly created unique index with the table when duplicate key values already exist in the data, and they may also prevent inserting new data that would create duplicate key values in the table. For example, if a unique index is created on the employee's last name (lname) in the employee table, no two employees can share a last name.
  • Primary key index: There is often a column or combination of columns in a database table, the value of which uniquely identifies each row in the table. This column is called the primary key of the table. Defining a primary key for a table in a database diagram automatically creates a primary key index, which is a specific type of unique index. The index requires each value in the primary key to be unique. It also allows fast access to data when using a primary key index in a query.
  • Clustered index: In a clustered index, the physical order of the rows in the table is the same as the logical (index) order of the key values. A table can only contain one clustered index. If an index is not a clustered index, the physical order of the rows in the table does not match the logical order of the key values. Clustered indexes generally provide faster data access than nonclustered indexes.

22. Indexes of the two storage engines MyISAM and InnoDB

  • MyISAM index implementation: MyISAM uses a B+ tree as the index structure, and the data field of each leaf node stores the address of the data record. For example, for a three-column table with COL1 as the primary key, the primary index works as follows (figure omitted): the MyISAM index file only stores the addresses of the data records.

  • Auxiliary index: in MyISAM there is no structural difference between the primary index and a secondary index, except that the primary index requires keys to be unique while a secondary index allows duplicate keys. An auxiliary index created on COL2 (figure omitted) is likewise a B+ tree whose data fields store the addresses of the data records. Index retrieval in MyISAM therefore works as follows: search the index with the B+ tree search algorithm; if the key exists, take the value of its data field, and then use that value as the address to read the corresponding data record.
  • MyISAM's indexing method is also called a "non-clustered index", a name chosen to distinguish it from InnoDB's clustered index.
  • InnoDB index: B+ tree is also used as the index structure. The first major difference from MyISAM is that the InnoDB data file itself is an index file. As we know from the above, the MyISAM index file and data file are separated, and the index file only saves the address of the data record. In InnoDB, the table data file itself is an index structure organized by B+Tree, and the leaf node data field of this tree saves complete data records. The key of this index is the primary key of the data table, so the InnoDB table data file itself is the primary index. This type of index is called a clustered index because the InnoDB data files themselves are clustered by primary key.
  • InnoDB requires that the table have a primary key (MyISAM does not). If none is specified explicitly, MySQL automatically selects a column that uniquely identifies the data records as the primary key; if no such column exists, MySQL automatically generates a hidden field for the InnoDB table to serve as the primary key, of type long. At the same time, try to use an auto-increment field as the primary key of an InnoDB table. Because the InnoDB data file itself is a B+ tree, a non-monotonic primary key causes the data file to be split and adjusted frequently to keep the B+ tree balanced when new records are inserted, which is very inefficient, so an auto-increment primary key is a good choice. With an auto-increment primary key, each new record is appended after the current index node in order, and when a page fills up a new page is opened automatically. This produces a compact index structure that is filled roughly sequentially; since existing data never has to be moved on insert, it is very efficient and adds little overhead to index maintenance.
  • The second difference from the MyISAM index is that the data field of InnoDB's secondary index stores the value of the corresponding record's primary key rather than its address. In other words, all secondary indexes in InnoDB use the primary key as the data field (for example, a secondary index defined on Col3; figure omitted).

23. The difference between B-tree and B+ tree

  • B-tree: a multi-way balanced search tree. Each node holds at most m-1 keys (key-value pairs); the root node holds at least 1 key, and a non-root node holds at least m/2 keys. The keys in each node are arranged in ascending order; for each key, all keys in its left subtree are smaller than it and all keys in its right subtree are greater than it. All leaf nodes are on the same level, i.e. the path from the root to every leaf has the same length. Every node stores both indexes and data, that is, both keys and values. Therefore the number of keys in the root node ranges over 1 <= k <= m-1, and in a non-root node over m/2 <= k <= m-1. In addition, a B-tree is described by its order, which is the maximum number of children a node may have; the letter m is generally used for the order.
  • The same points between B+ tree and B tree: ① The root node has at least one element; ② The range of non-root node elements: m/2<=k<=m-1.
  • The differences between a B+ tree and a B-tree: ① a B+ tree has two kinds of nodes, internal nodes (also called index nodes) and leaf nodes; internal nodes are the non-leaf nodes, they store only indexes and no data, and the data is stored in the leaf nodes. ② The keys of an internal node are arranged in ascending order; for a key of an internal node, all keys in its left subtree are smaller than it and all keys in its right subtree are greater than or equal to it, and the records in the leaf nodes are also arranged by key. ③ Each leaf node stores a pointer to the adjacent leaf node, and the leaf nodes themselves are linked in ascending key order. ④ A parent node stores the index of the first element of its right child.
  • The advantages of a B+ tree over a B-tree: ① the non-leaf nodes of a B+ tree store only keys and take very little space, so the nodes on each level can index a wider range of data; in other words, each I/O operation can cover more data, which makes it better suited as the underlying data structure of MySQL. ② Every query must reach a leaf node, so query performance is stable; in a B-tree, data can be found at any node, so performance is less stable. ③ All leaf nodes form an ordered linked list, which makes searching easier. ④ Adjacent leaf nodes are linked, which matches the read-ahead behavior of disks. For example, if a leaf node stores 50 and 55 and has a pointer to the leaf node storing 60 and 62, then when we read the data for 50 and 55 from disk, the read-ahead feature also brings in the data for 60 and 62; this is a sequential read rather than a seek, which speeds things up. ⑤ Range queries are supported, and partial range queries are very efficient. Each node indexes a larger and more precise range, which also means a single disk I/O of a B+ tree carries more information than that of a B-tree, so I/O efficiency is higher. ⑥ Data is stored at the leaf level, with pointers between leaf nodes, so a range query only needs to traverse the leaf level rather than the whole tree. ⑦ Because of the gap between disk and memory speeds, disk I/O must be minimized to improve efficiency. Disks are rarely read strictly on demand; instead they read ahead: after reading the requested data, the disk sequentially reads a certain amount of the following data into memory. The theoretical basis is the well-known principle of locality in computer science: when a piece of data is used, nearby data is usually used soon after, and the data a running program needs is usually clustered.

24. The difference between B+ tree and hash index

  • A hash index uses a hash algorithm to convert the key value into a hash value. On retrieval there is no need to descend level by level from the root to the leaves as in a B+ tree; a single hash computation locates the corresponding position immediately, which is very fast.
  • For an equality query, a hash index has an obvious advantage, because the corresponding key can be found with a single computation, provided the key values are unique. If the key values are not unique, you must first find the position of the key and then scan along the linked list until the required data is found.
  • For a range query, a hash index is of no use: key values that were originally ordered may become discontinuous after hashing, so the index cannot be used to complete the range query.
  • Hash indexes cannot use indexes to complete sorting, as well as partial fuzzy queries like 'xxx%' (this kind of partially fuzzy query is actually a range query in nature)
  • Hash indexes also do not support the leftmost matching rule for multi-column joint indexes.
  • The keyword retrieval efficiency of the B+ tree index is relatively average and does not fluctuate as much as the B-tree. When there are a large number of duplicate key values, the efficiency of the hash index is also extremely low because of the so-called hash collision problem.

25. What are the differences between MyISAM and InnoDB?

  • InnoDB supports transactions , but MyISAM does not. This is one of the reasons why MySQL chooses InnoDB as the default storage engine.
  • InnoDB supports foreign keys , but MyISAM does not. If a table contains foreign keys and the storage engine is InnoDB, converting it to MyISAM will fail.
  • InnoDB uses clustered indexes and MyISAM uses non-clustered indexes . The files of the clustered index are stored on the leaf nodes of the primary key index, so InnoDB must have a primary key, and indexing through the primary key is very efficient. However, the auxiliary index requires two queries, first to query the primary key, and then to query the data through the primary key. Therefore, the primary key should not be too large, because if the primary key is too large, other indexes will also be large. In the case of a non-clustered index, the data files are separated, and the index saves the pointer to the data file. Primary key indexes and secondary indexes are independent.
  • InnoDB does not save the specific number of rows in the table . When executing select count(*) from table, a full table scan is required. However, MyISAM uses a variable to save the number of rows in the entire table. When executing the above statement, you only need to read the variable, which is very fast.
  • InnoDB's minimum lock granularity is row-level locking , and MyISAM 's minimum lock granularity is table-level locking . An update statement will lock the entire table, causing other queries and updates to be blocked, so concurrent access is greatly restricted.

26. How to choose a storage engine?

  • Whether to support transactions . If so, please choose InnoDB. If not, consider MyISAM.
  • If most of the tables are only read queries , you can consider MyISAM . If both reads and writes are quite frequent, use InnoDB .
  • After a system crash, MyISAM is harder to recover; if that is not acceptable, choose InnoDB.
  • Starting with MySQL version 5.5, InnoDB has become the default engine for MySQL.

27. How is MySQL master-slave replication done?

  • Master-slave replication mainly involves three threads: the binlog thread, the I/O thread and the SQL thread. The process relies on the close cooperation of these three threads.
  • Binlog thread: Responsible for writing data changes on the main server to the binary log (Binary log).
  • I/O thread: Responsible for reading binary logs from the master server and writing to the relay log (Relay log) of the slave server.
  • SQL thread: Responsible for reading the relay log, parsing out the data changes that have been performed by the master server and replaying them (Replay) on the slave server.
  • Flow: master server → binary log (on the master) → relay log (on the slave) → replay on the slave.

28. How to optimize large tables?

When the number of records in a single MySQL table is too large, the CRUD performance of the database will significantly decrease. Some common optimization measures are as follows:

  • Limit the scope of data: Be sure to prohibit query statements without any conditions that limit the scope of data. For example: when users query order history, we can control it within a month.
  • Read/write separation: A classic database splitting scheme, where the master database is responsible for writing and the slave database is responsible for reading.
  • Vertical partitioning: split tables according to the correlation of the data columns. For example, if the user table contains both the user's login information and the user's basic information, it can be split into two separate tables, or even placed in separate databases (vertical sharding). Simply put, vertical splitting means splitting a table by columns, turning a table with many columns into several tables. Advantages: column data becomes smaller, fewer blocks are read per query, and the number of I/Os is reduced; vertical partitioning also simplifies the table structure and makes it easier to maintain. Disadvantages: the primary key becomes redundant, redundant columns must be managed, and join operations are introduced (which can be handled by joining at the application layer); in addition, vertical partitioning makes transactions more complex to handle.
  • Horizontal partitioning: keep the table structure unchanged and shard the data according to some strategy, so that each shard of the data is dispersed into a different table or database, achieving distribution. Horizontal splitting means splitting a table by rows: when a table exceeds roughly 2 million rows it slows down, and at that point its data can be split across several tables. For example, the user information table can be split into several user information tables to avoid the performance impact of too much data in a single table. Horizontal splitting can support very large data volumes. Note that splitting into tables alone only solves the problem of a single table being too large; since the data still resides on the same machine, it does little to improve MySQL's concurrency, so horizontal splitting is best done across databases (sharding). Horizontal sharding supports very large data volumes and requires few changes on the application side, but cross-shard transactions are hard to solve, cross-node joins perform poorly, and the logic is complex.

29. What is the difference between balanced binary search tree and red-black tree?

AVL and RBT are both optimizations of binary search trees, and their performance is much better than binary search trees.

  • Structural comparison: AVL's structure is highly balanced, RBT's structure is basically balanced, and the balance degree is AVL>RBT.
  • Search comparison: AVL search time complexity is the best, and the worst case is O(logN); RBT search time complexity is best O(logN), and the worst case is slightly worse than AVL.
  • Insertion/deletion comparison: insertions and deletions in an AVL tree easily unbalance the tree structure, while an RBT has looser balance requirements, so under heavy insertion an RBT needs rotations and recoloring to restore balance less often than an AVL tree.
  • Rebalancing: an RBT has a recoloring operation that an AVL tree lacks; recoloring has time complexity on the order of O(logN), but because the operation is simple it is very fast in practice.
  • When inserting a node causes imbalance in the tree, both AVL and RBT require at most 2 rotation operations. However, after deleting a node to cause imbalance, AVL requires at most logN rotation operations, while RBT only requires at most 3 rotation operations. Therefore, the cost of inserting a node is almost the same between the two, but the cost of deleting a node RBT is lower.
  • The insertion and deletion costs of AVL and RBT are mainly consumed in finding the node to be operated, so the time complexity is basically proportional to O(logN).
  • A large amount of data has been proven in practice that the overall statistical performance of RBT is better than that of balanced binary trees.

30. The difference between Redis and Memcached

  • Data type: Redis supports richer data types (supports more complex scenarios): Redis not only supports simple k/v type data, but also provides storage of data structures such as list, set, zset, and hash . memcached only supports simple string types.
  • Data persistence: Redis supports persistence; it can keep in-memory data on disk and load it again after a restart. Memcached keeps all its data in memory only and does not support persistence.
  • Cluster mode: Redis currently supports cluster mode natively. Memcached does not have a native cluster mode and needs to rely on the client to write data into shards in the cluster.
  • Threads: Redis uses a single-threaded multi-channel I/O reuse model. Memcached is a multi-threaded, non-blocking I/O multiplexing network model.

31. Analysis of common data structures and usage scenarios of Redis

  • String: The String data structure is a simple Key-value type. In fact, value can be not only String, but also a number. It can be used as a regular key-value cache, or it can be used for counting, such as recording the number of Weibo and fans.
  • Hash: Hash is a mapping table of string type fields and values. Hash is particularly suitable for storing objects. During subsequent operations, you can directly modify only the value of a certain field in this object. For example, we can use hash data structures to store user information, product information, etc.
  • List: a list is a linked list, one of the most important Redis data structures with many application scenarios. For example, Weibo's following list, fan list and message list can all be implemented with Redis' list structure. The underlying implementation is a doubly linked list, which supports lookup and traversal in both directions and is convenient to operate on, at the cost of some extra memory overhead.
  • Set: The external function provided by set is similar to the function of list. The special thing is that set can automatically eliminate duplicates. When you need to store a list of data and do not want duplicate data, set is a good choice, and set provides an important interface for determining whether a member is in a set collection, which list cannot provide. Intersection, union, and difference operations can be easily implemented based on set. In the Weibo application, all the followers of a user can be stored in a collection, and all the fans can be stored in a collection. Redis can very conveniently implement functions such as common following, common fans, and common preferences. This process is also the process of finding intersection.
  • Sorted Set: compared with set, a sorted set adds a weight parameter, score, so that the elements of the set can be ordered by score. For example, in a live-streaming system, real-time ranking information such as the list of online users in a room, gift leaderboards and bullet-screen messages (which can be seen as a message ranking along the message dimension) is well suited to the Sorted Set structure in Redis.

32. What is AOF rewriting?

AOF rewriting can generate a new AOF file. This new AOF file has the same database status as the original AOF file, but is smaller in size. When executing the BGREWRITEAOF command, the Redis server will maintain an AOF rewrite buffer, which will record all write commands executed by the server during the creation of a new AOF file by the child process. After the child process completes the work of creating a new AOF file, the server will append all the contents in the rewrite buffer to the end of the new AOF file, so that the database states saved in the old and new AOF files are consistent. Finally, the server replaces the old AOF file with the new AOF file to complete the AOF file rewriting operation.

33. How to solve cache avalanche and cache penetration?

  • Cache avalanche: refers to the cache failure in a large area at the same time, so that subsequent requests will fall on the database, causing the database to withstand a large number of requests in a short period of time and collapse.
  • Solution: ① Beforehand: try to ensure the high availability of the whole Redis cluster and repair failed machines as soon as possible, and choose an appropriate memory eviction policy. ② During the incident: local cache + Hystrix rate limiting and degradation, to keep MySQL from being overwhelmed. ③ Afterwards: use Redis' persistence mechanism to restore the saved data into the cache as soon as possible.
  • Cache penetration: Hackers deliberately request data that does not exist in the cache, causing all requests to fall on the database, causing the database to withstand a large number of requests in a short period of time and collapse.
  • Solution: there are several effective ways to deal with cache penetration. The most common is a Bloom filter: hash all data that could possibly exist into a sufficiently large bitmap, so that a query for data that definitely does not exist is intercepted by the bitmap, avoiding query pressure on the underlying storage system. A simpler, cruder method is to cache the empty result even when a query returns nothing (whether because the data does not exist or because of a system failure), but with a very short expiration time of no more than five minutes.

34. The difference between database and data warehouse

  • In a data warehouse, multiple databases are organized together. A data warehouse emphasizes the speed of query and analysis and is optimized for read operations; its main purpose is fast querying over large amounts of data. A data warehouse loads new data periodically and does not overwrite the existing data; instead it tags the data with timestamps. Data warehouses generally use columnar storage. Their characteristics are: subject-oriented, integrated, relatively stable, reflecting historical change, and storing historical data. The two basic elements of a data warehouse are dimension tables and fact tables.
  • The database emphasizes paradigm and reduces redundancy as much as possible. The database uses row storage. The database is transaction-oriented and stores online transaction data.

35. SQL data types

  • String: char, varchar, text
  • Binary string: binary, varbinary
  • Boolean type: boolean
  • Numeric types: integer, smallint, bigint, decimal, numeric, float, real, double
  • Time type: date, time, timestamp, interval

36. What are the differences between left join, right join, inner join and full join?

  • inner join: when two tables are joined, only the rows that match in both tables are kept.
  • left join: all rows from the left table are returned, even if there is no matching record in the right table (the unmatched right-side columns are NULL).
  • right join: all rows from the right table are returned, even if there is no matching record in the left table.
  • full join: returns all rows from both the left and right tables, matched where possible and NULL-padded where not (MySQL has no native full join; it is usually emulated with a UNION of a left join and a right join). A small pandas illustration follows below.
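
For illustration only (independent of any particular database), the four join types can be reproduced with pandas.merge; the tables and columns here are made up:

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    right = pd.DataFrame({"id": [2, 3, 4], "price": [10, 20, 30]})

    inner   = pd.merge(left, right, on="id", how="inner")   # ids 2 and 3 only
    left_j  = pd.merge(left, right, on="id", how="left")    # all left ids; price is NaN for id 1
    right_j = pd.merge(left, right, on="id", how="right")   # all right ids; name is NaN for id 4
    full    = pd.merge(left, right, on="id", how="outer")   # ids 1-4, NaN wherever unmatched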

37. What is the difference between having and where?

  • The essential difference: where filters on columns that already exist in the table, before grouping and before the select list is evaluated; having filters on the result set after selection and grouping, so it can reference aliases and aggregate values.
  • Scenarios where both where and having can be used:
    select goods_price,goods_name from sw_goods where goods_price>100
    
    select goods_price,goods_name from sw_goods having goods_price>100
    

    Both work because goods_price, used as the condition, also appears in the select list, so having can see it.

  • Only where can be used, not having scenarios:

    select goods_name,goods_number from sw_goods where goods_price>100
    
    select goods_name,goods_number from sw_goods having goods_price>100(X)
    

    The reason is that goods_price is used as a filter condition but does not appear in the select list, so the having version reports an error. having selects first and then filters, while where filters first and then selects.

  • You can only use having but not where:

    select goods_category_id, avg(goods_price) as ag from sw_goods group by goods_category_id having ag > 1000
    
    select goods_category_id, avg(goods_price) as ag from sw_goods where ag > 1000 group by goods_category_id (X) -- error: the table has no column named ag
    

    Aggregate functions are generally not used in where clauses.

38. What is the difference between not in and not exists?

  • If the query uses not in, both the inner and outer tables get a full table scan and no index is used.
  • A not exists subquery can still use the index on the table, so regardless of table size, not exists generally performs better than not in.

39. Set row number in mysql?

SET @row_number = 0; 
SELECT (@row_number:=@row_number + 1) AS num FROM table

40. What is the difference between null and '' in SQL?

  • null means empty, expressed by is null
  • '' represents an empty string, use ='' to judge

41. What is the difference between views and tables in mysql?

  • Essence: the table is the content, the view is a window onto it. A view is a stored, compiled SQL statement, a virtual table based on the result set of that statement; a table is not.
  • The table is a real table in the global schema, while the view is a virtual table in the local schema.
  • Views do not have physical records, while tables do.
  • Tables occupy physical space; views do not, since a view exists only as a logical concept. A table can be modified directly at any time, while a view can only be changed by altering its defining statement.
  • The creation and deletion (drop) of a view only affects the view itself and does not affect the corresponding basic table.
  • A view is a way to view a data table. It can query data composed of certain fields in the data table. It is just a collection of SQL statements. From a security perspective, views prevent users from touching the data table, so users do not know the table structure. 

42. What is the relationship between views and tables?

A view is a table built on a basic table. Its structure (i.e., defined columns) and content (i.e., all records) come from the basic table, and it exists based on the existence of the basic table. A view can correspond to one basic table or multiple basic tables. Views are abstractions of basic tables and new relationships established in a logical sense.

43. How to write SQL to find the median, mean and mode (other than using count)?

  • Median:

    Option 1 (ignoring the even-count case):

    set @m = (select count(*)/2 from table)
    
    select column from table order by column limit @m, 1

    Option 2 (handles an even count: the median is the average of the middle two values):

    set @idx := -1;

    select avg(t.column)
    from (select @idx := @idx + 1 as rn, column
          from table order by column) as t
    where t.rn in (floor(@idx / 2), ceiling(@idx / 2));
  • Mean:

    select avg(column) from table
  • Mode:

    select column, count(*) as cnt from table group by column order by cnt desc limit 1

4. Machine Learning

1. How to avoid overfitting of decision trees?

  • Limit tree depth, prune, limit number of leaf nodes
  • Regularization, adding data, data augmentation (adding noisy data)
  • bagging (row subsampling, feature subsampling, low-dimensional projections), early stopping (a minimal sklearn sketch of the depth/leaf limits follows below)
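
As an illustration of the first point, a minimal scikit-learn sketch (the dataset and parameter values are arbitrary) that limits depth, leaf size and leaf count and applies cost-complexity pruning:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(
        max_depth=4,          # limit tree depth
        min_samples_leaf=5,   # every leaf must cover at least 5 samples
        max_leaf_nodes=20,    # cap the number of leaf nodes
        ccp_alpha=0.01,       # cost-complexity (post-)pruning strength
        random_state=0,
    ).fit(X, y)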

2. Please explain some reasons why random forests are more stable than general decision trees?

  • Bagging method, multiple tree voting improves generalization ability
  • Randomness (parameters, samples, features, spatial mapping) is introduced in bagging to avoid overfitting of a single tree and improve the overall generalization ability.

3. What are the advantages and disadvantages of SVM?

  • Advantages: ① can handle non-linearly separable cases (via kernels); ② the final classifier is determined only by the support vectors, so the complexity depends on the number of support vectors rather than the dimension of the sample space, which helps avoid the curse of dimensionality; ③ it is robust, because only the small set of support vectors captures the key samples while redundant samples are ignored; ④ it performs well in high-dimensional, small-sample settings such as text classification.
  • Disadvantages: ① The model training complexity is high ② It is difficult to adapt to multi-classification problems ③ There is no good methodology for kernel function selection.

4. What are the common kernel functions of SVM?

  • Linear Kernel: the default kernel in most SVM implementations, best when the data is linearly separable. Its form is K(x, y) = x · y, the inner product of the input feature vectors x and y.
  • Polynomial Kernel: can handle some nonlinear problems by introducing polynomial terms. Its form is K(x, y) = (a x · y + c)^d, where a, c and d are user-defined parameters and d is the degree of the polynomial.
  • Radial Basis Function (RBF) Kernel, also called the Gaussian kernel: the most commonly used nonlinear kernel in SVM. Its form is K(x, y) = exp(-γ * ||x - y||^2), where γ is a user-defined parameter and ||x - y|| is the Euclidean distance between the feature vectors x and y.
  • Bhattacharyya Kernel: usually used for classification problems over probability distributions. Its form is K(x, y) = exp(-D(x, y)), where D(x, y) is the Bhattacharyya distance between x and y.
  • Sigmoid Kernel: has the form K(x, y) = tanh(a x · y + c), where a and c are user-defined parameters; it can be used for some nonlinear classification problems.
  • Custom Kernel: in addition to the common kernel functions above, SVM also supports custom kernels; as long as the conditions for a positive definite kernel are met, it can be used in SVM. (An illustrative scikit-learn sketch of several of these kernels follows below.)
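
For illustration only, a scikit-learn sketch of fitting the same toy data with several of these kernels; the dataset and parameter values are arbitrary:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(noise=0.2, random_state=0)
    linear = SVC(kernel="linear", C=1.0).fit(X, y)
    poly   = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)   # degree = d, coef0 = c
    rbf    = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)              # gamma = γ
    sig    = SVC(kernel="sigmoid", gamma=0.1, coef0=0.0).fit(X, y)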

5. Principle of K-means

  • Initialize k points
  • Classify points into k categories according to distance
  • Update the class centers of k classes
  • Repeat the assignment and update steps until the centers converge or the iteration limit is reached (a minimal NumPy sketch of these steps follows below)
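
A minimal NumPy sketch of these four steps (illustration only; it initializes centers from the data and does not handle empty clusters):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]      # 1. initialize k centers
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                           # 2. assign points to the nearest center
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3. update centers
            if np.allclose(new_centers, centers):                   # 4. stop once the centers converge
                break
            centers = new_centers
        return labels, centers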

6. Principle and improvement of K-Means algorithm, what to do when encountering outliers? What are the metrics for evaluating algorithms?

  • Kmeans principle: given a value K and K initial cluster centers, assign each point (data record) to the cluster of its nearest center. Once all points are assigned, recompute each cluster's center as the mean of the points in it, then iterate the assign-and-update steps until the cluster centers barely change or the specified number of iterations is reached.
  • Improve:

    a. kmeans++: choose the initial centers to be as far apart as possible to avoid falling into a poor local solution. Concretely, when the (n+1)-th center is chosen, points that are farther from the n centers already chosen are selected with higher probability.

    b. mini batch kmeans: each iteration uses only a subset (mini-batch) of the data to reassign points and recompute the centers, which speeds up training.

    c. ISODATA: use this method when K is hard to determine. The idea is to eliminate a cluster when it contains too few samples and to split a cluster when it contains too many.

    d. Kernel kmeans: kmeans uses Euclidean distance to calculate similarity, and can also use kernel to map to high-dimensional space and then cluster.

  • Handling outliers (an illustrative sketch appears after this answer):

    a. If possible, first use density-based clustering or a soft clustering method to cluster the data and remove the outliers. However, the whole point of using kmeans is speed, so this somewhat works against that goal.

    b. Local outlier factor LOF: If the density of point p is significantly smaller than the density of its neighbor points, then point p may be an outlier
    (reference: https://blog.csdn.net/wangyibo0201/article/details/51705966)

    c. Multivariate Gaussian distribution outlier detection

    d. Use PCA or an autoencoder for outlier detection: treat the reduced dimensions as the new feature space; the dimensionality-reduction result can be considered to suppress the influence of outliers (because the process keeps the projection directions that maximize the variance after projection).

    e. Isolation forest: the basic idea is to build tree models; the shallower the depth at which a point is isolated, the easier it is to separate from the rest of the sample space, so the more likely it is to be an outlier. It is an unsupervised method that repeatedly draws random subsamples and randomly picks a feature and a split value.
    (Reference: https://blog.csdn.net/u013709270/article/details/73436588)

    f. winsorize: for simplicity, clip the extreme upper and lower values of a single dimension (e.g., at chosen percentiles).

  • Metrics for evaluating clustering algorithms:

    a. External methods (require labels): Jaccard coefficient, purity

    b. Internal methods (no labels): within-cluster sum of squares (WSS) and between-cluster sum of squares (BSS)

    c. In addition, the time and space complexity of the algorithm, clustering stability, etc. must also be considered.
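
For the outlier-handling ideas above, a small scikit-learn sketch (illustration only; the data and parameters are made up) that flags outliers first and then runs kmeans on the remaining points:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (200, 2)), [[8, 8], [9, -7]]])   # two obvious outliers

    mask = IsolationForest(random_state=0).fit_predict(X) == 1       # -1 marks outliers
    # sklearn.neighbors.LocalOutlierFactor().fit_predict(X) == 1 works the same way
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])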

7. Principal component analysis PCA method

Principal component analysis is a dimensionality-reduction method. The idea is to transform the samples from the original feature space into a new feature space in which the projection variance of the samples along each new coordinate axis is as large as possible, so that the most important information in the samples is retained.

  • Feature normalization
  • Find the covariance matrix A of sample features
  • Find the eigenvalues and eigenvectors of A, i.e., solve AX = λX
  • Sort the eigenvalues from large to small and keep the top-K; the corresponding eigenvectors are the new coordinate axes (a minimal NumPy sketch follows below)
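
A minimal NumPy sketch of these steps (illustration only; it centers the features and keeps the top-k eigenvectors):

    import numpy as np

    def pca(X, k):
        Xc = X - X.mean(axis=0)                    # feature normalization (centering)
        cov = np.cov(Xc, rowvar=False)             # covariance matrix of the features
        eigvals, eigvecs = np.linalg.eigh(cov)     # solve A x = λ x
        order = np.argsort(eigvals)[::-1][:k]      # sort eigenvalues from large to small, keep top-k
        components = eigvecs[:, order]             # the new coordinate axes
        return Xc @ components                     # project the samples onto them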

8. What are the data preprocessing procedures?

  • Missing value processing: deletion, insertion
  • Outlier handling
  • Feature transformation: e.g., sine/cosine (cyclical) encoding of time features (a short sketch follows after this list)
  • Standardization: maximum and minimum normalization, z normalization, etc.
  • Normalization: for text or score features there may be large overall differences between samples; for example, if text a has 20 words in total and text b has 30,000, the raw frequency of every dimension in b is likely to be much higher than in a, so the values should be normalized (e.g., by document length).
  • Discretization: one-hot, binning, etc.
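
As a small illustration of the cyclical (sine/cosine) encoding mentioned under feature transformation, assuming an hour-of-day feature:

    import numpy as np

    hour = np.arange(24)                        # raw hour-of-day feature
    hour_sin = np.sin(2 * np.pi * hour / 24)    # 23:00 and 00:00 end up close together
    hour_cos = np.cos(2 * np.pi * hour / 24)    # both components are needed so distinct hours stay distinct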

9. In data cleaning, what is the method to deal with missing values?

Due to survey, coding and entry errors, the data may contain invalid values and missing values, which need to be handled appropriately. Commonly used methods are: estimation, casewise deletion, variable deletion and pairwise deletion.

(1) Estimation. The simplest way is to replace invalid and missing values with the sample mean, median or mode of the variable. This method is simple, but it does not make full use of the information already in the data, and the error may be large. Another approach is to estimate the value through correlation analysis between variables or logical inference, based on the respondent's answers to other questions. For example, ownership of a certain product may be related to household income, so the likelihood of owning the product can be inferred from the respondent's household income.

(2) Casewise deletion eliminates samples that contain missing values. Since many questionnaires may have missing values, this approach can sharply reduce the effective sample size and fail to make full use of the data already collected. It is therefore only suitable when key variables are missing, or when the proportion of samples containing invalid or missing values is very small.

(3) Variable deletion. If a variable has many invalid and missing values and is not particularly important to the problem being studied, consider deleting it. This reduces the number of variables available for analysis without changing the sample size.

(4) Pairwise deletion uses a special code (usually 9, 99, 999, etc.) to represent invalid and missing values while retaining all variables and samples in the data set, but only samples with complete answers are used in each specific calculation. The effective sample size therefore differs across analyses because different variables are involved. This is a conservative approach that retains the maximum amount of information available in the data set. (A small pandas sketch of these options follows below.)
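
A small pandas sketch of these options (illustration only; the column names and values are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [3000, np.nan, 5200, 4100],
                       "owns_product": [1, 0, np.nan, 1]})

    df["income"] = df["income"].fillna(df["income"].median())   # estimation: impute with the median
    df_casewise = df.dropna()                                   # casewise deletion: drop rows with any missing value
    df_vardrop = df.drop(columns=["owns_product"])              # variable deletion: drop a sparse, unimportant column
    corr = df.corr()                                            # pandas uses pairwise-complete observations here,
                                                                # similar in spirit to pairwise deletion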

10. What is the difference between cosine distance and Euclidean distance in finding similarity?

  • Euclidean distance reflects the absolute difference in the numerical values of individual features, so it is used more in analyses that need to capture differences in magnitude across dimensions, such as using user-behavior metrics to analyze similarity or difference in user value.
  • Cosine distance distinguishes differences in direction and is insensitive to absolute magnitude. It is used more for comparing users' interests based on their content ratings, and it also corrects for the fact that rating scales may not be consistent across users (precisely because cosine distance is insensitive to absolute values).
  • In general, Euclidean distance reflects absolute differences in value, while cosine distance reflects relative differences in direction.

    (1) For example, when counting two users' viewing behavior on two shows, user A's viewing vector is (0,1) and user B's is (1,0); the cosine distance between them is very large while the Euclidean distance is small. When we analyze the two users' preferences across different videos and care more about relative differences, the cosine distance is clearly the right choice.
    (2) When we analyze user activity using login count (unit: times) and average viewing duration (unit: minutes) as features, cosine distance considers the two users (1,10) and (10,100) to be very close, yet their activity levels obviously differ enormously. Here we care about the absolute difference in values and should use Euclidean distance. (A small NumPy check of both examples follows below.)
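
A quick NumPy check of both examples (illustration only):

    import numpy as np
    from numpy.linalg import norm

    def cosine_distance(a, b):
        return 1 - a @ b / (norm(a) * norm(b))

    a, b = np.array([0, 1]), np.array([1, 0])
    print(cosine_distance(a, b), norm(a - b))    # 1.0 (far apart) vs about 1.41 (close)

    u, v = np.array([1, 10]), np.array([10, 100])
    print(cosine_distance(u, v), norm(u - v))    # about 0.0 (close) vs about 90.4 (far apart)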

11. GBDT (Gradient Boosting Tree)?

  • First, recall the AdaBoost tree, a boosting ensemble of trees. The basic idea is to train multiple trees in sequence, and when training each tree, to up-weight the samples that were misclassified previously. In tree models this weighting is effectively a weighting of each sample's sampling probability: when sampling with replacement, the misclassified samples are more likely to be drawn.
  • GBDT is an improvement on the AdaBoost tree. Each tree is a CART (classification and regression tree) that outputs a numerical value at its leaf nodes. The error is the residual: the true value minus the leaf output. What GBDT does is use gradient descent on the loss to keep reducing this error.
  • In a GBDT iteration, suppose the strong learner obtained in the previous round is ft−1(x) and its loss is L(y, ft−1(x)). The goal of this round is to find a CART regression tree ht(x), the weak learner, that minimizes the loss of this round: L(y, ft(x)) = L(y, ft−1(x) + ht(x)). In other words, each round finds a decision tree that makes the sample loss as small as possible.
  • The idea of GBDT can be explained with a simple example. If a person is 30 years old, we first fit with 20 and find the residual is 10; we then fit the residual with 6 and find a gap of 4 remains; in the third round we fit the remaining gap with 3, leaving a gap of only 1. If the iteration rounds are not over, we can keep going, and the fitting error in age shrinks with every round. The final prediction is the sum of the outputs of all the trees (an illustrative sklearn sketch follows below).
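
An illustrative scikit-learn sketch of gradient boosting on synthetic data (parameter values are arbitrary); each new tree fits the residual left by the previous ones, and the prediction is the sum of all trees scaled by the learning rate:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
    gbdt = GradientBoostingRegressor(n_estimators=100,   # number of sequential trees
                                     learning_rate=0.1,  # shrinkage applied to each tree's contribution
                                     max_depth=3,
                                     random_state=0).fit(X, y)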

5. Business scenario issues

1. What is the general idea when dealing with requirements?

  • Clarify the requirement: what is the requester actually trying to achieve
  • Break the task down
  • Develop an executable plan
  • Push the work forward
  • Acceptance and review

2. How to analyze the problem of the decline in retention rate the next day?

  • Two-layer model: segment by user portrait, channel, product, behavioral funnel, etc. to pinpoint where the next-day retention rate has dropped.
  • Indicator breakdown: next-day retention rate = number of users retained the next day / number of users today
  • Reason analysis :

     (1) Internal:

         a. Operational activities

         b. Product changes

         c. Technical failure

          d. Design loopholes (e.g., a design that can be exploited for promotion abuse, i.e. "wool-pulling")

    (2) External:

          a. Competing products

         b. User preferences

         c. Holidays

         d. Social events (such as generating public opinion)
     

3. Given an unordered array, how can we sample it reasonably?

  • "Unordered" is relative to "ordered"; an unordered array is not the same as a random one. What we need to do is shuffle the unordered array to obtain a uniformly random permutation.
  • For an array of n elements there are n! possible orderings. A shuffle algorithm is correct if it can produce all n! results with equal probability. Method: for i in range(len(arr)): swap(arr[i], arr[random(i, n-1)]). This randomly decides the element at position i, then repeats the same process on the rest of the array, which yields each of the n! orderings with equal probability (a runnable Fisher–Yates sketch follows below).
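
A runnable sketch of the idea (the Fisher–Yates shuffle), for illustration:

    import random

    def shuffle(arr):
        """Each of the n! orderings is produced with equal probability."""
        n = len(arr)
        for i in range(n - 1):
            j = random.randint(i, n - 1)         # pick a position uniformly from [i, n-1]
            arr[i], arr[j] = arr[j], arr[i]      # swap it into slot i
        return arr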

4. How to analyze if the user retention rate drops by 5% the next day?

  • First, use the "two-layer model" analysis: segment users, including new and old, channels, activities, portraits and other dimensions, and then calculate the next-day retention rate of different users under each dimension. Use this method to locate the user groups that are causing the retention rate to decline.
  • As for why the target group's next-day retention declined, it has to be analyzed case by case. A useful frame is "internal vs. external" factors. Internal factors include customer acquisition (low channel quality, activities attracting non-target users), meeting needs (a new feature change annoys certain types of users), and activation (sign-in and other engagement measures failing to achieve their goal, or the product's natural usage cycle being long so that many newly acquired users simply have no short-term need, etc.). External factors can be analyzed with PEST: politics (policy impact), economy (in the short term mainly the competitive environment, e.g. campaigns by competitors), society (shifts in preference such as public-opinion pressure, changes in user lifestyle, consumer psychology, or values), and technology (emergence of innovative solutions, changes in distribution channels, etc.). (to be continued...)
