Java interview must-test points--Lecture 05: Data structures and algorithms

The topic of this lesson is data structures and algorithms. There is a popular saying in the industry: program = data structure + algorithm. Although it is a bit of an exaggeration, it shows how important data structures and algorithms are. This lesson focuses on four knowledge points:

  1. From search trees to B+ trees, explain the data structures related to trees;

  2. Questions related to string matching;

  3. TopK questions frequently examined in algorithm interviews;

  4. Several common problem-solving methods for algorithmic problems.

Data structure knowledge points

First, let’s look at the knowledge points of data structure, as shown in the figure below.

  1. Queues and stacks are frequently used data structures, and you need to understand their characteristics. The queue is first in, first out, and the stack is last in, first out.

  2. Tables come in many types, including arrays that occupy contiguous memory, singly and doubly linked lists connected by pointers, circular linked lists joined end to end, and hash tables.

  3. Graphs are used in specific fields, for example in routing algorithms. Graphs are divided into directed graphs, undirected graphs and weighted graphs. For this part you need to master depth-first and breadth-first traversal of graphs and understand shortest-path algorithms.

  4. Trees. A tree is generally used as an auxiliary structure for searching and sorting. The remaining two parts both concern trees: one is the binary tree and the other is the multi-way tree.

    1. Multi-way trees include the B-tree family (B-tree, B+ tree and B* tree), which is well suited to file retrieval, and the dictionary tree (trie), which is suited to multi-pattern string matching.
    2. Binary trees include balanced binary trees, red-black trees, Huffman trees, and heaps, which are suitable for data search and sorting. In this part, you need to understand the implementation of the construction, insertion, and deletion operations of the binary tree, and you need to master the pre-order, in-order, and post-order traversal of the binary tree.
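
The three traversal orders mentioned above differ only in when the current node is visited relative to its subtrees. The following is a minimal recursive sketch; the class and method names (`TreeTraversal`, `Node`, `preOrder`, and so on) are made up for illustration and do not come from the lecture.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative names only; not taken from the lecture.
class TreeTraversal {
    static final class Node {
        final int value;
        final Node left;
        final Node right;

        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    // Pre-order: node, then left subtree, then right subtree.
    static List<Integer> preOrder(Node root) {
        List<Integer> out = new ArrayList<>();
        walk(root, out, 0);
        return out;
    }

    // In-order: left subtree, then node, then right subtree.
    static List<Integer> inOrder(Node root) {
        List<Integer> out = new ArrayList<>();
        walk(root, out, 1);
        return out;
    }

    // Post-order: left subtree, then right subtree, then node.
    static List<Integer> postOrder(Node root) {
        List<Integer> out = new ArrayList<>();
        walk(root, out, 2);
        return out;
    }

    // visitAt selects the moment the node itself is recorded: 0 = before, 1 = between, 2 = after.
    private static void walk(Node node, List<Integer> out, int visitAt) {
        if (node == null) {
            return;
        }
        if (visitAt == 0) out.add(node.value);
        walk(node.left, out, visitAt);
        if (visitAt == 1) out.add(node.value);
        walk(node.right, out, visitAt);
        if (visitAt == 2) out.add(node.value);
    }
}
```

On a binary search tree the in-order traversal yields the values in ascending order, which is a quick way to sanity-check a tree implementation.
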
Algorithm knowledge points

Let’s look at the summary of knowledge points in the algorithm part, as shown in the figure below.

  1. Commonly used problem-solving methods for algorithmic problems.

  2. Complexity is one of the criteria for judging the quality of an algorithm. You need to master how to calculate the time complexity and space complexity of an algorithm. Time complexity is generally calculated by finding the statement that executes the most times, working out the order of magnitude of its execution count, and expressing the result in big-O notation.

  3. Commonly used string matching algorithms and understand the matching ideas of different algorithms.

  4. Sorting is also examined frequently. Sorting algorithms are divided into five categories: insertion, exchange, selection, merge, and radix. Among them, quick sort and heap sort are examined most often, and you need to be able to write them by hand.

  5. Commonly used search algorithms include binary search, binary search trees, B-trees, hashing, Bloom filters, etc. You need to understand their applicable scenarios. For example, binary search suits in-memory lookups over small data sets, B-trees suit file indexing, hashing, with its constant-time complexity, suits scenarios that demand high lookup efficiency, and Bloom filters suit existence filtering on large data sets.
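
Since item 4 singles out quick sort as something to be able to write by hand, here is one possible hand-written version. It is a sketch that assumes the Lomuto partition scheme with the last element as pivot; other pivot choices and partition schemes are equally valid, and the class name `QuickSort` is illustrative.

```java
// A minimal in-place quick sort sketch (Lomuto partition, last element as pivot).
class QuickSort {
    static void sort(int[] values) {
        if (values == null) {
            return;
        }
        sort(values, 0, values.length - 1);
    }

    private static void sort(int[] values, int low, int high) {
        if (low >= high) {
            return; // zero or one element: already sorted
        }
        int pivotIndex = partition(values, low, high);
        sort(values, low, pivotIndex - 1);
        sort(values, pivotIndex + 1, high);
    }

    // Moves everything smaller than the pivot to its left and returns the pivot's final index.
    private static int partition(int[] values, int low, int high) {
        int pivot = values[high];
        int boundary = low; // first index that is not known to be < pivot
        for (int i = low; i < high; i++) {
            if (values[i] < pivot) {
                swap(values, boundary, i);
                boundary++;
            }
        }
        swap(values, boundary, high);
        return boundary;
    }

    private static void swap(int[] values, int i, int j) {
        int tmp = values[i];
        values[i] = values[j];
        values[j] = tmp;
    }
}
```

With this pivot choice the average cost is O(N log N), but an already sorted input triggers the O(N^2) worst case, which is a common follow-up question.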

Detailed explanation of binary search trees
Binary search tree

As shown in the figure below, in a binary search tree each node contains a value and has at most two subtrees. Every value in a node's left subtree is less than the node's own value, and every value in its right subtree is greater than the node's own value.

The lookup time complexity of a well-balanced binary search tree is O(log N), but as nodes are continually inserted and deleted, the height of the tree may keep growing. When every node has only a left subtree or only a right subtree, the tree degenerates into a linked list and its search performance degrades to linear, O(N).
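
The degradation is easy to reproduce. The sketch below is an unbalanced binary search tree with illustrative names (`BinarySearchTree`, `insert`, `contains`, `height`); it assumes duplicate keys are simply ignored.

```java
// A plain (unbalanced) binary search tree, just enough to observe its height.
class BinarySearchTree {
    private static final class Node {
        final int key;
        Node left;
        Node right;

        Node(int key) {
            this.key = key;
        }
    }

    private Node root;

    void insert(int key) {
        if (root == null) {
            root = new Node(key);
            return;
        }
        Node current = root;
        while (true) {
            if (key < current.key) {
                if (current.left == null) {
                    current.left = new Node(key);
                    return;
                }
                current = current.left;
            } else if (key > current.key) {
                if (current.right == null) {
                    current.right = new Node(key);
                    return;
                }
                current = current.right;
            } else {
                return; // assumption: duplicates are ignored
            }
        }
    }

    boolean contains(int key) {
        Node current = root;
        while (current != null) {
            if (key == current.key) {
                return true;
            }
            current = key < current.key ? current.left : current.right;
        }
        return false;
    }

    // Number of nodes on the longest root-to-leaf path (0 for an empty tree).
    int height() {
        return height(root);
    }

    private static int height(Node node) {
        return node == null ? 0 : 1 + Math.max(height(node.left), height(node.right));
    }
}
```

Inserting the keys 4, 2, 6, 1, 3, 5, 7 gives a tree of height 3, while inserting the same keys in ascending order 1 to 7 gives height 7: every node has only a right child, so a lookup walks all of them.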

Balanced binary tree

A balanced binary tree solves the above problem. It guarantees that the absolute height difference between the left and right subtrees of every node does not exceed 1; the AVL tree is an example. The AVL tree is a strictly balanced binary tree, so inserting or deleting data often requires rotations to restore balance. It is better suited to scenarios with relatively few insertions and deletions.

Red-black tree

The red-black tree is a more practical, non-strictly balanced binary tree. It focuses on local balance rather than overall balance and guarantees that no root-to-leaf path is more than twice as long as any other, so it stays close to balanced while avoiding many unnecessary rotations. As mentioned earlier, Java 8's HashMap uses red-black trees to handle lookups when hash collisions occur, and TreeMap also uses a red-black tree to keep its keys ordered.
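
The ordering that TreeMap maintains can be observed directly through the standard `java.util.TreeMap` API. The wrapper class and method names below (`TreeMapOrderDemo`, `keysInOrder`, `floorKey`) and the `"value-"` strings are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Small demo of the ordering guarantees that TreeMap provides.
class TreeMapOrderDemo {
    private static TreeMap<Integer, String> build(int... keys) {
        TreeMap<Integer, String> map = new TreeMap<>();
        for (int key : keys) {
            map.put(key, "value-" + key);
        }
        return map;
    }

    // Keys come back in ascending order regardless of insertion order.
    static List<Integer> keysInOrder(int... keys) {
        return new ArrayList<>(build(keys).keySet());
    }

    // Largest stored key that is <= probe, or null if there is none.
    static Integer floorKey(int probe, int... keys) {
        return build(keys).floorKey(probe);
    }
}
```

For example, `keysInOrder(51, 37, 59, 42)` returns the keys as 37, 42, 51, 59, and `floorKey(50, 51, 37, 59, 42)` returns 42. A HashMap offers neither ordered iteration nor this kind of nearest-key query.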

In addition to the characteristics of a binary search tree, a red-black tree also has the following rules, as shown in the figure below.

  1. Each node is either red or black.

  2. The root node is black.

  3. Each leaf node is a black empty (null) node, such as the black triangles in the figure.

  4. Both child nodes of a red node are black.

  5. Every path from any node to its leaf node contains the same number of black nodes.
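
The rules above can be checked mechanically. The following sketch validates rules 2, 4 and 5 on a hand-built tree, treating `null` children as the black empty leaves of rule 3. It does not check the binary-search ordering, and the `RedBlackCheck` and `Node` names are hypothetical.

```java
// Validates the colour rules of a red-black tree; key ordering is not checked here.
class RedBlackCheck {
    static final class Node {
        final int key;
        final boolean red;
        final Node left;
        final Node right;

        Node(int key, boolean red, Node left, Node right) {
            this.key = key;
            this.red = red;
            this.left = left;
            this.right = right;
        }
    }

    static boolean isValid(Node root) {
        if (root != null && root.red) {
            return false; // rule 2: the root must be black
        }
        return blackHeight(root) != -1;
    }

    // Returns the number of black nodes on every path from node down to a null leaf
    // (counting the null leaf itself), or -1 if rule 4 or rule 5 is violated below node.
    private static int blackHeight(Node node) {
        if (node == null) {
            return 1; // rule 3: empty leaves count as black
        }
        if (node.red && (isRed(node.left) || isRed(node.right))) {
            return -1; // rule 4: a red node must not have a red child
        }
        int left = blackHeight(node.left);
        int right = blackHeight(node.right);
        if (left == -1 || right == -1 || left != right) {
            return -1; // rule 5: both sides must contain the same number of black nodes
        }
        return left + (node.red ? 0 : 1);
    }

    private static boolean isRed(Node node) {
        return node != null && node.red;
    }
}
```

Rule 1 is enforced by the `boolean red` field itself, since a node cannot be anything other than red or black.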

Detailed explanation of B-tree
B-tree

A B-tree is a multi-way tree, also called a multi-way search tree. Each node in a B-tree can store multiple elements, which makes it well suited to file indexes because it effectively reduces the number of disk I/O operations. The maximum number of child nodes that any node in a B-tree may have is called the order of the B-tree. The figure below shows a B-tree of order 3, also called a 2-3 tree.

An m-order B-tree has the following characteristics:

  1. Non-leaf nodes have at most m subtrees;

  2. The root node has at least two subtrees (unless it is itself a leaf), and every non-root, non-leaf node has at least ⌈m/2⌉ subtrees;

  3. The number of keywords stored in a non-leaf node is one less than its number of subtrees. That is, if a node has 3 subtrees, it must contain 2 keywords;

  4. The keywords in a non-leaf node are stored in sorted order. For example, the two elements 37 and 51 in the left node in the above figure are in order;

  5. For each keyword in a node, all keywords in its left subtree are smaller than it and all keywords in its right subtree are greater than it. In the figure above, the left subtree of keyword 51 contains 42 and 49, both less than 51, and its right subtree contains the node 59, which is greater than 51;

  6. All leaf nodes are on the same level.

A search in a B-tree starts from the root node and performs a binary search on the ordered keyword sequence within the node. If the keyword is found, the search ends. Otherwise it descends into the subtree covering the range that the query keyword falls in, and continues until it reaches a leaf node.
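
The search procedure can be sketched as follows. This is a read-only illustration, not a full B-tree: insertion, splitting and disk pages are left out, nodes are built by hand, and the `BTreeSearch` and `Node` names are hypothetical. It relies on `Arrays.binarySearch`, which returns `-(insertionPoint) - 1` when the key is absent; the insertion point is exactly the index of the child subtree to descend into.

```java
import java.util.Arrays;

// Lookup in a hand-built B-tree: binary search inside a node, then descend.
class BTreeSearch {
    static final class Node {
        final int[] keys;      // sorted keywords stored in this node
        final Node[] children; // null for a leaf, otherwise keys.length + 1 subtrees

        Node(int[] keys, Node[] children) {
            this.keys = keys;
            this.children = children;
        }

        boolean isLeaf() {
            return children == null;
        }
    }

    static boolean contains(Node root, int key) {
        Node node = root;
        while (node != null) {
            int index = Arrays.binarySearch(node.keys, key);
            if (index >= 0) {
                return true; // hit; this may happen in a non-leaf node
            }
            if (node.isLeaf()) {
                return false; // reached a leaf without finding the key
            }
            int childIndex = -(index + 1); // insertion point = subtree covering the key's range
            node = node.children[childIndex];
        }
        return false;
    }
}
```

With a root holding 37 and 51 and leaves such as {42, 49} and {59} (values borrowed from the figure description; any others are made up), a lookup of 51 stops at the root, while a lookup of 49 descends into the middle leaf.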

In summary:

  • The keywords of the B-tree are distributed throughout the tree, and a keyword only appears in one node;

  • The search may stop at non-leaf nodes;

  • B-trees are generally used in file systems.

B+ tree

The figure below is a variant of B-tree, called B+ tree.

The definition of B+ tree is basically the same as B-tree, except for the following characteristics.

  1. The number of keywords in a node is the same as the number of subtrees. For example, if there are 3 keywords in a node, then there are 3 subtrees;

  2. The nodes in the subtree corresponding to the keyword are all greater than or equal to the keyword, and the subtree includes the keyword itself;

  3. All keywords appear in leaf nodes;

  4. All leaf nodes have pointers to the next leaf node.

Unlike the B-tree, a B+ tree search never stops at a non-leaf node; it always continues down to a leaf. The leaf nodes act as the data storage layer and hold the data associated with each keyword, while the non-leaf nodes store only keywords and pointers to child nodes, not the data itself. As a result, for the same number of keywords, the non-leaf nodes of a B+ tree are much smaller than those of a B-tree.

The B+ tree is better suited to indexing systems, and MySQL's database indexes are implemented with B+ trees. There are three reasons:

  1. Since there are pointers connecting leaf nodes, the B+ tree is more suitable for range retrieval;

  2. Since non-leaf nodes store only keywords and pointers, a non-leaf node of the same size can hold more keywords in a B+ tree, which lowers the tree height and reduces the disk read cost of a query;

  3. The query efficiency of B+ tree is relatively stable. Any keyword search must take a path from the root node to the leaf node. The path length of all keyword queries is the same and the efficiency is equivalent.
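
The first reason, range retrieval through linked leaves, can be illustrated with the leaf layer alone. The sketch below assumes the starting leaf has already been located by descending the tree for the lower bound; the `BPlusLeafScan` and `Leaf` names are hypothetical and no real storage engine is modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Range scan over the linked leaf layer of a B+ tree (leaf layer only).
class BPlusLeafScan {
    static final class Leaf {
        final int[] keys; // sorted keywords in this leaf
        Leaf next;        // pointer to the next leaf, null at the end

        Leaf(int... keys) {
            this.keys = keys;
        }
    }

    // Collects every key in [from, to], walking leaf to leaf without going back up the tree.
    static List<Integer> range(Leaf start, int from, int to) {
        List<Integer> result = new ArrayList<>();
        for (Leaf leaf = start; leaf != null; leaf = leaf.next) {
            for (int key : leaf.keys) {
                if (key > to) {
                    return result; // leaves are globally ordered, so nothing further can match
                }
                if (key >= from) {
                    result.add(key);
                }
            }
        }
        return result;
    }
}
```

In a plain B-tree the same query would need an in-order traversal that repeatedly moves up and down between levels, because the keys are spread over all levels of the tree.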

Finally, it is enough to know that there is also a B* tree variant, which adds to each non-leaf node of the B+ tree a pointer to the next non-leaf node in the same layer.

Detailed explanation of string matching
String matching problem

String-related questions often appear as algorithm questions in interviews. Let's start with a frequently asked one: “Determine whether the brackets in a given string match.”

Interview questions are usually described quite briefly. Before answering, you can clarify the requirements and details with the interviewer. For this question, you can confirm the range of brackets (whether only curly braces, square brackets and parentheses count, or angle brackets as well), whether there are requirements on the function's input parameters and return value, and whether operating on large files needs to be considered.

We assume the refined requirements are: only curly braces, square brackets and parentheses are considered; large files are not considered; the input parameter is a string and the return value is a Boolean; a string with no brackets also counts as a match. The approach is then as follows.

  • Character matching problems can be handled by using the features of the stack.

  • When a left bracket is encountered, it is pushed onto the stack. When a right bracket is encountered, it is popped out of the stack and compared to see if it is a paired bracket.

  • When the traversal is complete, an empty stack means the brackets match; otherwise there are more left brackets than right brackets.

String code

Let’s look at the actual implementation code, as shown in the figure below.

Following the idea above, the string needs to be traversed, so the trigger conditions for the stack operations must be defined first: the bracket pairs are defined so that pushes and pops can be matched. Note that coding style and conventions matter when you write the implementation. For example, variable names must have a clear meaning; do not use meaningless names such as a and b.

We first define the map of brackets. The key is all the right brackets and the value is the corresponding left bracket. This definition makes it easier to compare whether the brackets are in pairs when popping the stack.

Next, look at the logic of the matching function. As a utility function it needs to be defensively robust, so the input parameter must first be checked for null.

Then we define a stack of characters and start traversing the input string.

If the current character is one of the values in the bracket map, that is, a left bracket, it is pushed onto the stack. Note that looking up a value in a map is O(N); because there are very few bracket types in this question, this approach is used to keep the code concise. If the current character is not a left bracket, containsKey is used to determine whether it is a right bracket. If it is, we need to check whether it matches. If the stack is empty, there are more right brackets than left brackets; if the stack is not empty but the left bracket popped from it does not pair with the current character, the brackets are mismatched. In both cases the brackets in the string do not match.

When the traversal completes, the string matches if no left brackets remain on the stack.
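
The walkthrough above can be reconstructed roughly as follows. This is a sketch written from the description, not the original code from the figure: the names `BracketMatcher`, `BRACKETS` and `isMatch` are invented, `ArrayDeque` is used as the stack, and returning `false` for a `null` input is an assumption, since the text only says that null must be checked.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Bracket matching with a stack, following the steps described in the text.
class BracketMatcher {
    // Key: right bracket, value: the left bracket it pairs with.
    private static final Map<Character, Character> BRACKETS = new HashMap<>();

    static {
        BRACKETS.put(')', '(');
        BRACKETS.put(']', '[');
        BRACKETS.put('}', '{');
    }

    static boolean isMatch(String text) {
        if (text == null) {
            return false; // assumption: a null input is treated as "no match"
        }
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            char current = text.charAt(i);
            if (BRACKETS.containsValue(current)) {
                // Left bracket: remember it. containsValue is O(N), acceptable for three pairs.
                stack.push(current);
            } else if (BRACKETS.containsKey(current)) {
                // Right bracket: there must be a left bracket waiting, and it must be the right kind.
                if (stack.isEmpty() || !stack.pop().equals(BRACKETS.get(current))) {
                    return false;
                }
            }
            // Any other character is ignored.
        }
        return stack.isEmpty(); // leftover left brackets mean no match
    }
}
```

`isMatch("{[()]}")` and `isMatch("no brackets")` return true, while `isMatch("([)]")`, `isMatch("(")` and `isMatch(")")` return false. Comparing with `equals` rather than `!=` avoids comparing boxed `Character` references.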

Finally, to emphasize: besides the problem-solving idea, coding questions also test your programming style and handling of details.

String problem solving ideas

Next, let’s summarize the problem-solving skills for string matching problems.

  • First, read the question carefully so that you do not answer the wrong question. Determine whether it is a single-pattern or a multi-pattern matching problem, and whether there can be multiple hits.

  • Then determine whether there are any additional requirements on algorithm time complexity or memory usage.

  • Finally, it is necessary to clarify what the expected return value is. For example, when there are multiple hit results, should the first hit be returned, or all of them should be returned.

As for problem-solving ideas:

  • For single-pattern matching, consider the BM (Boyer-Moore) or KMP algorithm.

  • For multi-pattern matching, consider solving it with a trie.

  • When implementing the matching algorithm, you can consider using prefix or suffix matching.

  • Finally, you can consider whether data structures such as stacks, binary trees, or multi-trees can be used to assist in solving the problem.

It is recommended to understand the ideas behind the common single-pattern and multi-pattern string matching algorithms.
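
As one concrete single-pattern example, here is a compact KMP sketch. It precomputes, for each prefix of the pattern, the length of the longest proper prefix that is also a suffix, and uses that table to avoid re-reading text characters after a mismatch. The names `KmpSearch`, `indexOf` and `buildFailureTable` are illustrative, and returning -1 for `null` arguments is an assumption.

```java
// Knuth-Morris-Pratt search: index of the first occurrence of pattern in text, or -1.
class KmpSearch {
    static int indexOf(String text, String pattern) {
        if (text == null || pattern == null) {
            return -1; // assumption: null input means "not found"
        }
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] failure = buildFailureTable(pattern);
        int matched = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            while (matched > 0 && text.charAt(i) != pattern.charAt(matched)) {
                matched = failure[matched - 1]; // fall back to the longest reusable prefix
            }
            if (text.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == pattern.length()) {
                return i - matched + 1;
            }
        }
        return -1;
    }

    // failure[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix.
    private static int[] buildFailureTable(String pattern) {
        int[] failure = new int[pattern.length()];
        int length = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (length > 0 && pattern.charAt(i) != pattern.charAt(length)) {
                length = failure[length - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(length)) {
                length++;
            }
            failure[i] = length;
        }
        return failure;
    }
}
```

The text index never moves backwards, so the search runs in O(N + M) for a text of length N and a pattern of length M, compared with O(N * M) for the naive approach.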

Detailed explanation of TopK
TopK questions

The TopK problem is a typical problem that comes up often in real business. For example, Weibo's hot search ranking is a TopK problem.

TopK generally asks for the smallest or largest K values in a set of N numbers, where N is usually very large. It can be solved by sorting, but the time complexity is high: a full comparison sort costs O(N log N). Here we look at a more efficient method.

As shown in the figure below, first take the first K elements to build a max-heap, then traverse the remaining N-K elements. If an element is smaller than the element at the top of the heap, replace the top element with it and re-adjust the heap. When the traversal is complete, the K elements in the heap are the smallest K values.

The time complexity of this algorithm is O(N log K). Its advantage is that it does not need to read all elements into memory at once, so it can be applied to very large data sets.
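
A minimal version of this approach can be written with `java.util.PriorityQueue`, using `Collections.reverseOrder()` to turn it into a max-heap. The class and method names (`TopK`, `smallestK`) are illustrative, and returning the result in ascending order is a choice made here, not a requirement of the algorithm.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Smallest K values of an array using a max-heap that never holds more than K elements.
class TopK {
    static int[] smallestK(int[] numbers, int k) {
        if (numbers == null || k <= 0) {
            return new int[0];
        }
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        for (int number : numbers) {
            if (maxHeap.size() < k) {
                maxHeap.offer(number); // still filling the first K slots
            } else if (number < maxHeap.peek()) {
                maxHeap.poll();        // evict the current largest of the K candidates
                maxHeap.offer(number);
            }
        }
        // Polling a max-heap yields descending order, so fill the result from the back.
        int[] result = new int[maxHeap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = maxHeap.poll();
        }
        return result;
    }
}
```

Because only the heap is kept in memory, the same loop works when the numbers arrive from a stream or a file reader instead of an array. For the largest K values, the mirror image applies: a min-heap of size K whose top is replaced by any larger element.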

TopK variant problem

The TopK variant asks for the smallest or largest K values from N ordered queues. The difference is that the problem now spans multiple data sets. Since each queue is already sorted, there is no need to traverse all elements of the N queues, so the key to solving it is reducing the number of elements that must be visited.

The problem-solving idea is shown in the figure below.

  1. The first step is to take the head elements of the N queues, that is, the smallest element of each queue, and build a min-heap from them (one element per queue, N elements in total). The heap is built the same way as in TopK.

  2. The second step is to obtain the top value of the heap, which is the smallest element in all queues.

  3. The third step is to take the next value from the queue that the top element came from, place it at the top of the heap, and then adjust the heap.

  4. Finally, repeat the second and third steps until K numbers have been obtained.

There is also a small optimization here: when a new value is placed at the top of the heap in the third step, compare it with the maximum value in the heap; if it is already greater than that maximum, the loop can be terminated early. Because the heap holds one head element per queue, building it takes at most N heap insertions and the loop then performs K-1 replacements, each costing O(log N), so the time complexity is O((N+K-1) · log N). Note that this is independent of the length of the queues.
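
The basic steps (without the early-termination optimization) can be sketched like this. The sorted queues are modelled as ascending `int[]` arrays for simplicity, which is an assumption of this example, and the names `SortedQueuesTopK` and `smallestK` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Smallest K values across several ascending-sorted queues, using a min-heap of queue heads.
class SortedQueuesTopK {
    static List<Integer> smallestK(List<int[]> sortedQueues, int k) {
        // Each heap entry is {value, queue index, position inside that queue}.
        PriorityQueue<int[]> minHeap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int q = 0; q < sortedQueues.size(); q++) {
            int[] queue = sortedQueues.get(q);
            if (queue.length > 0) {
                minHeap.offer(new int[] {queue[0], q, 0}); // step 1: heads of all queues
            }
        }
        List<Integer> result = new ArrayList<>();
        while (result.size() < k && !minHeap.isEmpty()) {
            int[] top = minHeap.poll();                    // step 2: smallest remaining value
            result.add(top[0]);
            int[] source = sortedQueues.get(top[1]);
            int nextPosition = top[2] + 1;
            if (nextPosition < source.length) {            // step 3: refill from the same queue
                minHeap.offer(new int[] {source[nextPosition], top[1], nextPosition});
            }
        }
        return result;
    }
}
```

For the queues {1, 4, 9}, {2, 3, 10} and {5, 6} with K = 4, only seven elements are ever placed in the heap and the result is 1, 2, 3, 4; the tails of the queues are never read.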

Detailed explanation of commonly used algorithms

There are many knowledge points about algorithms. To improve your ability to solve algorithm problems you need a reasonable amount of practice, but practice alone is not enough. You need to master several commonly used problem-solving ideas and methods so that you can handle whatever variation comes up. Here are five commonly used problem-solving methods: divide and conquer, dynamic programming, greedy, backtracking, and branch and bound. Let's see what scenarios each suits and how to apply it.

Divide and conquer

The idea of divide and conquer is to split a complex or large problem that is hard to solve directly into a number of smaller problems of the same kind and then solve them one by one. Quick sort and merge sort, for example, both apply divide and conquer.

Scenarios suitable for using the divide-and-conquer method need to meet three requirements:

  1. Can be decomposed into sub-problems;

  2. Solutions to subproblems can be combined into solutions to the original problem;

  3. The sub-problems are independent of each other.

The general steps for solving problems using the divide-and-conquer method are shown in the table below.

  1. The first step is to find a solution to the minimum subproblem;

  2. The second step is to find a way to combine solutions to sub-problems;

  3. The third step is to find the recursion termination condition.
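
Merge sort maps onto these three steps quite directly. In the sketch below (illustrative names `MergeSort`, `sort`, `merge`), the smallest sub-problem is an array of length 0 or 1, the combination step is merging two sorted halves, and the recursion terminates when that smallest size is reached.

```java
import java.util.Arrays;

// Merge sort as a divide-and-conquer example; returns a new sorted array.
class MergeSort {
    static int[] sort(int[] values) {
        if (values.length <= 1) {
            return values; // smallest sub-problem and recursion termination condition
        }
        int middle = values.length / 2;
        int[] left = sort(Arrays.copyOfRange(values, 0, middle));              // independent sub-problem
        int[] right = sort(Arrays.copyOfRange(values, middle, values.length)); // independent sub-problem
        return merge(left, right);                                             // combine the two solutions
    }

    // Merges two ascending arrays into one ascending array.
    private static int[] merge(int[] left, int[] right) {
        int[] merged = new int[left.length + right.length];
        int i = 0;
        int j = 0;
        int k = 0;
        while (i < left.length && j < right.length) {
            merged[k++] = left[i] <= right[j] ? left[i++] : right[j++];
        }
        while (i < left.length) {
            merged[k++] = left[i++];
        }
        while (j < right.length) {
            merged[k++] = right[j++];
        }
        return merged;
    }
}
```

Taking from the left half on ties (`<=`) keeps equal elements in their original relative order, which is why merge sort is a stable sort.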

Dynamic programming

Dynamic programming, similar to the divide-and-conquer method, also decomposes the problem into multiple sub-problems. Unlike the divide-and-conquer method, the solutions to sub-problems are related. The solution to the former sub-problem provides useful information for the solution to the latter sub-problem. The dynamic programming method solves each sub-problem in turn. When solving each sub-problem, all local solutions are listed, and those local solutions that are likely to reach the global optimum are retained through decision-making. The solution to the last subproblem is the solution to the initial problem.

Scenarios using dynamic programming need to meet three conditions:

  1. Subproblems must be solved sequentially;

  2. There are correlations between adjacent sub-problems;

  3. The solution to the last subproblem is the solution to the initial problem.

The steps for solving a problem with dynamic programming, shown in the second row of the table above, are as follows.

  1. The first step is to analyze the properties of the optimal solution;

  2. The second step is to recursively define the optimal solution;

  3. The third step is to record the optimal values at different stages;

  4. The fourth step is to select the global optimal solution based on the stage optimal solution.
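
A small example that follows these steps is the minimum-coin problem, chosen here for illustration and not taken from the lecture: find the fewest coins that add up to a target amount. The optimal answer for an amount is defined recursively as one coin plus the optimal answer for a smaller amount, and the table `fewest` records the optimum of every stage. The sketch assumes positive coin values and a non-negative amount.

```java
import java.util.Arrays;

// Minimum number of coins needed to form an amount, or -1 if it cannot be formed.
class CoinChange {
    static int minCoins(int[] coins, int amount) {
        int[] fewest = new int[amount + 1]; // fewest[s] = optimal coin count for sum s
        Arrays.fill(fewest, Integer.MAX_VALUE);
        fewest[0] = 0;                      // zero coins are needed for sum 0
        for (int sum = 1; sum <= amount; sum++) {
            for (int coin : coins) {
                // Reuse the recorded optimum of the smaller sub-problem (sum - coin).
                if (coin <= sum && fewest[sum - coin] != Integer.MAX_VALUE) {
                    fewest[sum] = Math.min(fewest[sum], fewest[sum - coin] + 1);
                }
            }
        }
        return fewest[amount] == Integer.MAX_VALUE ? -1 : fewest[amount];
    }
}
```

With coins {1, 3, 4} and amount 6 the result is 2 (3 + 3). Always taking the largest coin first would give 4 + 1 + 1, three coins, which shows why the sub-problem results have to be recorded and compared rather than decided locally.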

Greedy algorithm

Third is the greedy algorithm. Because it only considers locally optimal solutions, the greedy algorithm cannot obtain the globally optimal solution for every problem. The key to the greedy algorithm is the choice of greedy strategy. The strategy must have no aftereffects: the process after a given state does not affect earlier states and depends only on the current state.

A scenario suited to the greedy algorithm must satisfy two conditions:

  1. Locally optimal solutions can produce a globally optimal solution;

  2. The greedy strategy must have no aftereffects, as described above.

As shown in the figure below, the general steps for using the greedy algorithm to solve problems are:

  1. The first step is to decompose it into sub-problems;

  2. The second step is to calculate the local optimal solution of each sub-problem according to the greedy strategy;

  3. The third step is to merge local optimal solutions.
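
Interval scheduling is a standard problem where a greedy strategy does produce the global optimum; it is used here as an illustration and is not from the lecture. The strategy is to always pick the interval that ends earliest among those that do not overlap what has already been picked. The names `ActivitySelection` and `maxNonOverlapping` are illustrative, and intervals that merely touch (one ends where the next starts) are assumed not to overlap.

```java
import java.util.Arrays;

// Maximum number of non-overlapping intervals, each given as {start, end}.
class ActivitySelection {
    static int maxNonOverlapping(int[][] intervals) {
        int[][] byEnd = intervals.clone(); // shallow copy so the caller's ordering is untouched
        Arrays.sort(byEnd, (a, b) -> Integer.compare(a[1], b[1]));
        int count = 0;
        int lastEnd = Integer.MIN_VALUE;
        for (int[] interval : byEnd) {
            // Greedy choice: the earliest-ending interval compatible with the previous pick.
            if (interval[0] >= lastEnd) {
                count++;
                lastEnd = interval[1];
            }
        }
        return count;
    }
}
```

Each choice depends only on the end of the last selected interval, which is the "no aftereffects" property: later decisions never force an earlier one to be revisited.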

Backtracking algorithm

The backtracking algorithm is essentially a depth-first search that moves forward according to a selection condition. When the exploration reaches a step where the earlier choice turns out not to be optimal or cannot reach the goal, it steps back and makes a different choice. This approach of going back and trying again when a path fails is the backtracking method.

The backtracking method is suitable for situations where depth-first search is possible and all solutions of the solution space need to be obtained, such as maze problems.

As shown in the figure above, the general problem-solving steps of the backtracking method are:

  1. The first step is to determine the solution space of the given problem;

  2. The second step is to determine the expanded search rules for the node;

  3. The third step is to search the solution space in a depth-first manner, and use pruning functions to avoid invalid searches during the search process.
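
The N-queens problem is a common way to practice these steps; it is used here as an illustration and is not from the lecture. The solution space is one column choice per row, the expansion rule is "try every column in the next row", and the pruning function rejects any column or diagonal that is already attacked. The class name `NQueens` is illustrative.

```java
// Counts all ways to place n queens on an n x n board so that none attack each other.
class NQueens {
    static int countSolutions(int n) {
        if (n <= 0) {
            return 0;
        }
        return place(0, n, new boolean[n], new boolean[2 * n - 1], new boolean[2 * n - 1]);
    }

    private static int place(int row, int n, boolean[] columns, boolean[] diagonals, boolean[] antiDiagonals) {
        if (row == n) {
            return 1; // every row has a queen: one complete solution
        }
        int count = 0;
        for (int col = 0; col < n; col++) {
            int diagonal = row - col + n - 1;
            int antiDiagonal = row + col;
            if (columns[col] || diagonals[diagonal] || antiDiagonals[antiDiagonal]) {
                continue; // pruning: this square is attacked, skip the whole subtree
            }
            columns[col] = diagonals[diagonal] = antiDiagonals[antiDiagonal] = true;
            count += place(row + 1, n, columns, diagonals, antiDiagonals); // go one level deeper
            columns[col] = diagonals[diagonal] = antiDiagonals[antiDiagonal] = false; // backtrack
        }
        return count;
    }
}
```

Undoing the three flags after the recursive call is the "go back and choose again" step; without it the state of one branch would leak into the next.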

Branch and bound method

Last is the branch and bound method. Its goal differs from that of backtracking: backtracking aims to find all solutions that satisfy the constraints, while branch and bound aims to find one solution that satisfies the constraints.

The branch-and-bound method suits breadth-first search where obtaining any one solution in the solution space is enough, for example when solving integer programming problems.

As shown in the figure above, the general problem-solving steps of the branch and bound method are:

  1. The first step is to determine the characteristics of the solution;

  2. The second step is to determine the child node search strategy, such as first in, first out, or first in, last out;

  3. The third step is to find the solution through breadth-first traversal.
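
As a rough illustration of these steps, the sketch below applies a first-in, first-out branch and bound to the 0/1 knapsack problem, an example chosen here and not taken from the lecture. Each queue entry is a partial decision over the first few items; a branch is dropped when even taking every remaining item could not beat the best value found so far. This bound is deliberately simple (the sum of remaining values), non-negative weights and values are assumed, and all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// 0/1 knapsack solved with a FIFO (breadth-first) branch and bound.
class KnapsackBranchAndBound {
    private static final class State {
        final int level;  // number of items already decided
        final int weight; // total weight of the items taken so far
        final int value;  // total value of the items taken so far

        State(int level, int weight, int value) {
            this.level = level;
            this.weight = weight;
            this.value = value;
        }
    }

    static int maxValue(int[] weights, int[] values, int capacity) {
        int n = weights.length;
        // remaining[i] = sum of values[i..n-1], an optimistic bound on what can still be gained.
        int[] remaining = new int[n + 1];
        for (int i = n - 1; i >= 0; i--) {
            remaining[i] = remaining[i + 1] + values[i];
        }
        int best = 0;
        Queue<State> queue = new ArrayDeque<>();
        queue.add(new State(0, 0, 0));
        while (!queue.isEmpty()) {
            State state = queue.poll(); // first in, first out: breadth-first expansion
            best = Math.max(best, state.value);
            if (state.level == n || state.value + remaining[state.level] <= best) {
                continue; // bound: this branch cannot improve on the best known value
            }
            int item = state.level;
            if (state.weight + weights[item] <= capacity) {
                // Branch 1: take the item (only if it still fits).
                queue.add(new State(item + 1, state.weight + weights[item], state.value + values[item]));
            }
            // Branch 2: skip the item.
            queue.add(new State(item + 1, state.weight, state.value));
        }
        return best;
    }
}
```

Replacing the FIFO queue with a priority queue ordered by the bound would expand the most promising branch first, which is another common child-node search strategy.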

Inspection points and bonus points
Inspection point

The above covers the key content on data structures and algorithms. Next, from the interviewer's perspective, here is a summary of what interviews examine:

  1. Understand the basic data structures and their characteristics, for example which kinds of binary tree exist and what characteristics each has;

  2. You must be proficient in tables, stacks, queues, and trees, and have a deep understanding of the usage scenarios of different types of implementations. For example, red-black trees are suitable for searching, and B+ trees are suitable for indexing;

  3. Understand the commonly used search and sorting algorithms, along with their complexity and stability. In particular, you must master the implementation of quick sort and heap sort;

  4. Understand the commonly used string processing algorithms and the ideas behind them. For example, the BM algorithm matches strings starting from the suffix;

  5. Be able to analyze the complexity of algorithm implementation, especially the time complexity, such as the time complexity calculation of the TopK problem;

  6. You need to understand the five commonly used problem-solving methods, the ideas for solving problems and the types of problems to be solved, as well as the steps to solve problems.

Bonus points

To get extra points from the interviewer on algorithm-related questions, keep the following points in mind:

  1. Ability to combine data structures with actual usage scenarios, for example, when introducing red-black trees, combine them with the implementation of TreeMap; when introducing B+ trees, combine them with the index implementation in MySQL, etc.;

  2. Be able to know the applications of different algorithms in business scenarios, such as the application of the TopK algorithm in popular sorting;

  3. Ability to proactively communicate and confirm conditions and boundaries when faced with ambiguous questions. For example, the details listed in the bracket matching problem introduced earlier can be reconfirmed with the interviewer;

  4. Before writing the algorithm code, first talk about the problem-solving ideas, rather than burying your head in writing as soon as you start. Generally, when there are problems with problem-solving ideas, the interviewer will provide appropriate guidance;

  5. Be able to spot problems in your own solution and suggest improvements. For example, because interview time is limited, you may choose a more conservative approach that is not necessarily optimal. In that case, after answering, you can point out the weaknesses of the current algorithm and ideas for improving it, such as using multi-threading to improve performance.

Summary of real questions

Finally, let’s look at common real interview questions. The first part is summarized as follows.

  • Questions 1 and 2 are all basic algorithms, which must be firmly mastered. Some questions require remembering the implementation of recursion and non-recursion, such as tree traversal, quick sort, etc.;

  • For questions like Question 5 that limit the use of memory, consider using the divide-and-conquer idea for decomposition;

  • Question 6, array deduplication, can be solved by sorting or by hashing.

The second part of the real questions is summarized as follows.

  • Question 9, the idiom chain game (idiom solitaire), can be solved with depth-first search;

  • Question 10: To find the common ancestor of two nodes, we can consider two methods: recursive and non-recursive.

This lesson completes the basic knowledge module. The next lesson begins the applied knowledge module, and its topic is the commonly used tool set.

Origin blog.csdn.net/g_z_q_/article/details/129826265