A review and summary of my Meituan and Ali interviews

Foreword

The Meituan interview was on March 22, and the Ali interview was on March 24.
The two interviewers had completely different styles. Meituan mainly wanted to see whether my fundamentals were solid and did not keep drilling into any one topic; once the interviewer felt I had basically grasped the principle, he jumped straight to the next question. Ali, on the other hand, felt much more free-form: the interviewer kept following up on whatever I said, and the questions were often not phrased in the standard way.
Relatively speaking, my Meituan interview went fairly well. I answered basically every question and added a lot of extensions; the interviewer often cut me off and said that was enough. The Ali interviewer would not let me talk for long (heh, because I answer questions from my own understanding and then put them into plain words, I can go on for quite a while); he would often pick up on one point I mentioned, interrupt me, and then dig deeper along that point.
The Meituan interview lasted an hour, and the Ali interview an hour and a half. Because I don't have much interview experience, I don't know whether the questions I was asked are common ones.

PS: I am a class-of-2023 undergraduate from a double non (neither 985 nor 211) school, looking for a summer internship.

1. Interview content

Here is the list of questions first.

Meituan interview

Self-introduction
Usual learning methods
JVM memory areas and garbage collection
CMS garbage collector recycling process
Spring IoC
Bean initialization (life cycle)
Whether the project uses anything distributed
Redis data types
Redis persistence methods
Redis deployment methods
How threads are used
Thread pool principle
MySQL: the principle of clustered and non-clustered indexes
B+ tree: how data storage and lookup are implemented
Auto-increment vs. custom primary keys (either is fine)
Do you have any questions?



PS: On the whole it was quite standard. The questions are all common interview questions, and the way they were asked was also very conventional.

Ali interview

Self-introduction
Java memory model (free play)
Where is the cache? (I was interrupted while introducing the hardware and asked about the cache)
Why was CAS proposed, and how is synchronization achieved when the cache is changed? (in fact asking about the specific steps of how caches stay consistent)
The specific steps of CAS
Why does CAS need the compare step?
The ABA problem of CAS (I extended to this myself)
What kinds of thread pools are there?
ThreadPoolExecutor parameters
Scenarios for bounded queues vs. unbounded queues
The differences between HashMap in 1.7 and 1.8
In what scenario does the HashMap infinite-loop bug occur?
What is the MySQL insert buffer (change buffer) used for?
What does doublewrite do?
Why does MySQL write logs instead of writing data directly? Aren't you afraid the database goes down at the moment the data is being saved, before it reaches the disk?
Do you know the four "commits"? (After going around in circles it turned out to be about transaction isolation levels; the default isolation level was also asked later)
The principle behind transaction isolation levels (MVCC)
Indexes in MySQL (after I said a little, he moved on to the next question)
What is the difference between a B+ tree and a B tree?
How does a composite (joint) index lookup work in MySQL?
Why is it useless to query only on the middle field of a composite index?
A public-opinion analysis problem from a real scenario they are working on themselves (100w+ texts of 5-100 words each; 20W+ new records per hour; the texts have a certain similarity; labels, location and time (24 hours) can be used; the data needs deduplication)
Algorithm question: find all the "redundant words" in a string, then optimize it (he changed the problem on the spot...)

PS: There are quite a few questions I answered badly here, and they exposed some blind spots in my knowledge. It was the Ali interview that gave me the idea of writing this blog post.

2. Questions I answered badly

1. Meituan interview

1.1 CMS garbage collector recycling process

I said straight out that I didn't understand this, and sorted out the relevant knowledge points after the interview.

1.2 Object Creation in Spring

I guess the interviewer wanted to ask about the IoC bean life cycle.
For Spring's object creation, my first answer was the three-level cache. After confirming with the interviewer, I answered the life cycle instead, but I really could not remember the whole bean initialization process. I only listed a few steps roughly and then said it is mainly Spring doing some initialization work for its own features. Because I have no real hands-on practice with it and my rote memory is poor, unless I understand the specific details and the reasons behind them, I really cannot remember them.
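To help myself remember the order, here is a minimal sketch of the externally visible life-cycle hooks using standard Spring callbacks (my own illustration of the callback order, not a description of the container's internals; the bean and class names are made up):

```java
import org.springframework.beans.factory.InitializingBean;
import org.springframework.beans.factory.config.BeanPostProcessor;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Rough visible order: constructor -> property injection
// -> BeanPostProcessor#postProcessBeforeInitialization
// -> InitializingBean#afterPropertiesSet (and/or init-method)
// -> BeanPostProcessor#postProcessAfterInitialization -> bean ready -> destroy callbacks.
@Configuration
class LifecycleDemo {

    static class MyBean implements InitializingBean {
        MyBean() { System.out.println("1. constructor"); }
        @Override public void afterPropertiesSet() { System.out.println("3. afterPropertiesSet"); }
        public void customInit() { System.out.println("4. init-method"); }
    }

    static class LoggingPostProcessor implements BeanPostProcessor {
        @Override public Object postProcessBeforeInitialization(Object bean, String name) {
            if (bean instanceof MyBean) System.out.println("2. before initialization");
            return bean;
        }
        @Override public Object postProcessAfterInitialization(Object bean, String name) {
            if (bean instanceof MyBean) System.out.println("5. after initialization");
            return bean;
        }
    }

    @Bean(initMethod = "customInit")
    MyBean myBean() { return new MyBean(); }

    @Bean
    static LoggingPostProcessor loggingPostProcessor() { return new LoggingPostProcessor(); }

    public static void main(String[] args) {
        new AnnotationConfigApplicationContext(LifecycleDemo.class).close();
    }
}
```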

1.3 B+ tree principle

At the time it was still fairly vague to me; I only had a rough picture in my head, so my answer was not great.
After the interview I spent half the night understanding the B+ tree and organized it into my knowledge mind map.

2. Ali interview

2.1 What kinds of thread pools are there?

I knew beforehand that the Executors class has different factory methods that wrap the creation of different thread pools, but that was all I knew; I didn't know the specifics. I sorted it out after the interview.
Most of them are just wrappers around ThreadPoolExecutor.
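As a reminder to myself, here is a rough sketch of what those Executors factory methods boil down to (simplified equivalents, not the JDK source):

```java
import java.util.concurrent.*;

// Simplified equivalents of the common Executors factory methods (not the JDK source).
public class ThreadPoolDemo {
    public static void main(String[] args) {
        // Executors.newFixedThreadPool(4): core == max == 4, unbounded LinkedBlockingQueue
        ExecutorService fixed = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        // Executors.newSingleThreadExecutor(): a fixed pool of size 1
        // (the JDK additionally wraps it so it cannot be reconfigured)
        ExecutorService single = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        // Executors.newCachedThreadPool(): core 0, max Integer.MAX_VALUE, 60s keep-alive,
        // SynchronousQueue so each task is handed straight to a (possibly new) thread
        ExecutorService cached = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());

        // newScheduledThreadPool(n) returns a ScheduledThreadPoolExecutor,
        // which itself extends ThreadPoolExecutor.

        fixed.submit(() -> System.out.println("hello from " + Thread.currentThread().getName()));

        fixed.shutdown();
        single.shutdown();
        cached.shutdown();
    }
}
```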

2.2 In what scenario does the HashMap infinite-loop bug occur?

Well, I didn't know; I told the interviewer that I had never run into this bug.

Afterwards I read a number of blog posts and updated my knowledge map in my own words.
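A simplified sketch of the JDK 1.7-style resize, which is the part that can form a cycle when two threads resize concurrently (illustrative only, not the actual JDK source):

```java
// Illustrative sketch of JDK 1.7's resize/transfer (not the actual JDK source).
// Head insertion into the new table reverses each bucket's list, and under
// concurrent resizing two threads can leave nodes pointing at each other.
public class Jdk7ResizeSketch {
    static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    // Roughly what 1.7's transfer() does for each bucket.
    static <K, V> void transfer(Entry<K, V>[] oldTable, Entry<K, V>[] newTable) {
        for (Entry<K, V> e : oldTable) {
            while (e != null) {
                Entry<K, V> next = e.next;                  // (1) thread A can be suspended here
                int i = Math.abs(e.key.hashCode()) % newTable.length; // simplified indexFor
                e.next = newTable[i];                       // (2) head insertion
                newTable[i] = e;
                e = next;
            }
        }
        // If thread B completes a full resize while A is parked at (1), the bucket list
        // that A holds has already been reversed. When A resumes and keeps head-inserting,
        // two entries can end up with e1.next == e2 and e2.next == e1, so a later get()
        // walking that bucket never terminates. JDK 1.8 preserves insertion order during
        // resize (lo/hi tail-linked lists), which removes this particular loop.
    }
}
```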

2.3 What is the MySQL insert buffer (change buffer) used for?

I really didn't know this one; I can only blame myself for not having dug deep enough. So afterwards I crammed databases and read a lot of blog posts.

When the interviewer asked me at the time, I really didn't know what the insert buffer was, but the name sounded a bit like a database cache, so I asked the interviewer whether it was the database cache.

In fact the two are quite different, although they do serve somewhat similar purposes: both use a cache to speed up reads and writes. The insert buffer sits at the storage-engine level and exists to avoid too much random IO, while the database cache (the query cache) sits at the Server layer to avoid repeated lookups (it was removed in version 8 because it was not very useful).

To sum up, the insert buffer (inside the buffer pool) mainly reduces disk IO with a cache-like idea: updates to non-unique secondary indexes are first recorded in memory (in the insert buffer object), then at a suitable time the corresponding index pages are fetched and the buffered changes are merged into them, and the pages are flushed to disk at a suitable time. This turns random writes into the sequential-style writes of flushing data pages, which is especially effective on mechanical hard disks.
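To make the idea concrete for myself, here is a toy in-memory sketch of "buffer the secondary-index change now, merge it when the page is read anyway"; it only illustrates the concept and has nothing to do with InnoDB's real data structures (all names here are made up):

```java
import java.util.*;

// Toy illustration of the change-buffer idea: instead of touching a (slow) index
// page for every insert, buffer the change in memory and merge it the next time
// that page has to be read anyway. Not InnoDB's real implementation.
public class ChangeBufferSketch {
    // pageId -> index entries already "on disk"
    private final Map<Integer, List<String>> diskPages = new HashMap<>();
    // pageId -> buffered (not yet merged) changes
    private final Map<Integer, List<String>> changeBuffer = new HashMap<>();

    void insertIntoSecondaryIndex(int pageId, String entry) {
        // For a non-unique index no uniqueness check is needed, so we can simply buffer.
        changeBuffer.computeIfAbsent(pageId, k -> new ArrayList<>()).add(entry);
    }

    List<String> readPage(int pageId) {
        List<String> page = diskPages.computeIfAbsent(pageId, k -> new ArrayList<>());
        List<String> pending = changeBuffer.remove(pageId);
        if (pending != null) {
            page.addAll(pending);   // merge buffered changes when the page is read
        }
        return page;
    }

    public static void main(String[] args) {
        ChangeBufferSketch cb = new ChangeBufferSketch();
        cb.insertIntoSecondaryIndex(7, "name=alice -> rowid=1");
        cb.insertIntoSecondaryIndex(7, "name=bob   -> rowid=2");
        System.out.println(cb.readPage(7)); // changes merged on first read
    }
}
```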

Among the articles I read, "Mysql buffer pool detailed explanation" - Yi Feng Blog - Blog Park (cnblogs.com) is very good.

After going through a lot of articles, I finally sorted it out.

From here I also corrected my understanding of the redo log. I had always thought it was mainly about transactions and about speeding up writes. I even wondered a little: when the redo log backlog gets large (the Geek Time MySQL column mentions it can be at most 4GB), with that much data, isn't the data actually stored on disk several versions behind? How old is the data that MVCC reads when it walks the version chain? I really did have these doubts while reviewing. I had read in some articles that such buffers exist, but I never managed to fit them into my knowledge system and didn't pay much attention, until this time the interviewer stumped me.

The redo log does play a part in guaranteeing transactions, but its original purpose is just crash safety, that is, to ensure that records which were already committed are not lost when the database restarts abnormally.

In fact, I had always been unsure about the write order and persistence of the undo log and redo log within a specific transaction.
By reading various articles I finally found that the undo log is written before the transaction's actual update, and the redo log is recorded afterwards. The more interesting point is that, in MySQL's eyes, the undo log is itself data: when an undo record is generated, a corresponding redo record is generated as well, so the undo log is not persisted on its own; its persistence relies on the redo log.

But it doesn't end there. I still couldn't connect the buffer pool with the undo log and redo log; in my knowledge system the two were separate.

This raised some questions. For example, I wondered: why does the undo log need to be persisted at all?
In my original understanding, data is not flushed to disk until the transaction commits; it stays in memory, so nothing should need to be persisted.

I finally found the answer in a Zhihu answer: "Why does the undo log need to be persisted?" - Zhihu (zhihu.com)

The undo log persistence steps are roughly as follows (a small code sketch of this ordering appears after the bullet points below):

1. Write the redo for the undo record into the redo log buffer
2. Modify the in-memory page
3. Write the redo for the dirty page into the redo log buffer
4. Flush the redo in the redo log buffer to disk
5. Only now does the dirty page get a chance to be flushed to disk
6. Follow-up operations such as commit

So why does the undo log need to be persisted?

  • Before the transaction commits, the data page may already have been flushed, so if we crash, the persisted undo log is needed to roll the transaction back correctly
  • When memory is insufficient, the undo log may be swapped out to disk first
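Here is a tiny write-ahead-logging sketch of that ordering (log records go into a log buffer, and the log must be flushed at least up to a page's LSN before that dirty page may be written out). It is only meant to make the ordering concrete; it is not InnoDB code, and the names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Toy WAL ordering sketch: every change (including the undo record itself) first
// produces a redo record in the log buffer; the log must be flushed at least up to
// a page's LSN before that dirty page may be written out. Not InnoDB's real code.
public class WalOrderingSketch {
    static class Page { long pageLsn; String data = ""; }

    private final List<String> logBuffer = new ArrayList<>();
    private long nextLsn = 1;
    private long flushedLsn = 0;   // how far the log has been forced to disk

    long appendLog(String record) {          // steps 1 and 3: redo for the undo / for the page
        logBuffer.add(record);
        return nextLsn++;
    }

    void modifyPage(Page p, String newData, long lsn) {   // step 2: change the in-memory page
        p.data = newData;
        p.pageLsn = lsn;
    }

    void flushLog() {                         // step 4: force the redo log to disk
        flushedLsn = nextLsn - 1;
        logBuffer.clear();
    }

    void flushPage(Page p) {                  // step 5: the dirty page may go out only now
        if (p.pageLsn > flushedLsn) {
            throw new IllegalStateException("WAL violated: flush the log first");
        }
        // write p.data to "disk" here
    }

    public static void main(String[] args) {
        WalOrderingSketch wal = new WalOrderingSketch();
        Page page = new Page();
        long undoLsn = wal.appendLog("redo-of-undo: old value of row 1");
        wal.modifyPage(page, "row 1 = new value", undoLsn);
        long redoLsn = wal.appendLog("redo: row 1 = new value");
        page.pageLsn = redoLsn;
        wal.flushLog();       // without this, flushPage would throw
        wal.flushPage(page);  // safe: the log already covers this page
    }
}
```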


2.4 Why does MySQL write logs instead of writing data directly? Aren't you afraid the database goes down at the moment the data is being saved, before it reaches the hard disk?

I didn't answer this well at the time, because I didn't know about the buffer pool and had a wrong idea of the redo log.
To answer it again here:
writing the redo log is mainly to guarantee crash safety, that is, to ensure that already committed records are not lost when the database goes down.
At the moment the data is written, the dirty pages have not yet been flushed to the hard disk, but before the data was modified, the corresponding undo log and redo log had already been written sequentially to the redo log file (unless you change the database's default redo log persistence strategy, in which case up to one second of records can be lost). So when the database recovers, MySQL replays the redo log to guarantee the data is correct, and in the end no committed data is lost.

2.5 Do you know the four commits?

I only realized later that he meant read committed and read uncommitted; I didn't catch the English the first time and didn't realize it was about transaction isolation levels. When I described the stages of a transaction commit (the several states of a transaction), he thought that was off the mark and assumed I didn't know some transaction commit mechanism; only when he finally asked me to talk about transaction isolation levels did I react.

I answered the specifics reasonably well, so I won't write more here.

2.6 Why is it useless to query only on the middle field of a composite index?

I answered the principle, but it was hard to put into words. The interviewer could tell I understood and moved on to the next question.

In hindsight: because a composite index is sorted by its fields in order, the ordering of a later field is only meaningful when the earlier fields are equal. MySQL sorts by the earlier fields first, and only when they are equal does it sort by the later fields, so a condition on the middle field alone cannot use the index's order. (Which feels similar to what I answered in the interview ~_~!)
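A rough analogy in Java (my own illustration, not how MySQL stores anything): think of a composite index on (a, b) as a structure sorted by "a then b". Filtering by a alone can use the order; filtering by b alone cannot:

```java
import java.util.TreeMap;

// Illustration only: a composite index on (a, b) behaves like a map sorted by "a then b".
public class LeftmostPrefixDemo {
    public static void main(String[] args) {
        TreeMap<String, Integer> index = new TreeMap<>(); // key = a + "|" + b -> row id
        index.put("beijing|alice", 1);
        index.put("beijing|bob", 2);
        index.put("hangzhou|alice", 3);
        index.put("hangzhou|carol", 4);

        // WHERE a = 'hangzhou': all matching keys are contiguous, so a range scan works.
        System.out.println(index.subMap("hangzhou|", "hangzhou|\uffff"));

        // WHERE b = 'alice': matches are scattered, the sort order gives no help,
        // so we can only walk every entry (the equivalent of a full scan).
        index.forEach((k, v) -> {
            if (k.endsWith("|alice")) System.out.println(k + " -> " + v);
        });
    }
}
```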

2.7 Scenario question: a public-opinion analysis problem from a real scenario, denoising public opinion data

100w+ (over a million) texts of 5-100 words each (my guess is that they are article titles). The texts have a certain degree of similarity: "An epidemic occurred in Hangzhou", "Original title: an epidemic occurred in Hangzhou" and "An epidemic occurred in Hangzhou" are the same thing.
Labels can be extracted from the text, such as "Hangzhou" and "epidemic". Every epidemic item has two dimensions: label and location.
Data arrives at the hourly level, with 20W+ (200,000+) new records per hour, and there is a lot of duplicate data within each hour's batch.
How do you deduplicate?

In fact, when I first heard the problem, the first thing I wanted to know was: how do you judge that two items are duplicates?
Although he gave examples ("An epidemic occurred in Hangzhou", "Original title: an epidemic occurred in Hangzhou", "An epidemic occurred in Hangzhou"), there was no clear criterion.
So I asked, "How do you define a duplicate?"
The interviewer added that one criterion is time (at the day or hour level); the time is effectively an attribute of the text. The definition of a duplicate is "the label, the location and the time must all match", and he gave an example: "an epidemic occurred in Hangzhou on March 21" and "an epidemic occurred in Hangzhou on March 27" are two different events.
So I confirmed the definition of a duplicate with the interviewer: "label, location and time must all match, and an event that happens on the next day counts as a different thing."
The interviewer said, "Not necessarily; there is a trade-off ratio between missed deduplication and over-deduplication."

PS: Both sides (the interviewer and I) handled this part a bit poorly. As the person setting the question, the interviewer needs to clarify its boundaries, such as the definition of a duplicate here, so that I can look for a solution that matches what he has in mind, and it also buys me some time to think. In fact, though, in a real project this question should not need to be asked: in a real scenario (the interviewer did say at the end that this is a public-opinion deduplication scenario) we should already understand what a duplicate means, and how to define "duplicate" in computer terms with a quantifiable indicator is itself a problem we should think about and solve.
In the interview setting, however, we don't know which scenario the problem is solving. Even though we can roughly guess it from the examples, the boundaries and the conditions we are allowed to use are very vague. What the interviewer did was indeed to constrain the conditions, suggesting that label, location and time can be used to judge duplication, which pins the problem down to a specific scenario and simplifies the original question; so what was really being asked was a concrete implementation of that solution.

In fact, I could already roughly guess at this point that whether something counts as a duplicate is really a question the designer has to decide, and there is no absolutely correct answer. The interviewer's working definition is that the label, location and time are the same, but when I confirmed this condition with him his answer was "not necessarily", which means the interviewer himself knows there are exceptions and that it cannot serve as a definitely correct criterion.

This is why I later confirmed with the interviewer: "whether two items are duplicates is somewhat subjective; we need the computer to make the judgment, and the evaluation standard itself is the subjective part."

Actually, "subjective" is not quite the right word here.
Let's think about it carefully: for two news headlines, or sentences that came from who knows where (we don't know how the data was obtained, so let's just make assumptions), what exactly would the criterion be for saying they describe the same thing? Is it hard to define? From a human point of view we would need to understand what each sentence is talking about, which involves semantics, and that naturally makes us think of AI/NLP.
But in this problem scenario we do not need to analyze the semantics precisely, and we do not need perfectly accurate deduplication. As engineers, what we need to do is design a solution that weighs the benefits against the costs.


Well, after all that talk, everything above is behind-the-scenes thinking, purely in hindsight.

As for the actual plan, I really don't have any particularly good method or criterion for judging duplicates; the conditions the interviewer gave are indeed a decent solution.

My answering approach at the time was not great. After all, I was thinking about a scenario I had never touched within a limited amount of time, and there were many things I had not considered. For example, the labels are extracted by an NLP model, which means label extraction itself has a performance cost, something I had not considered at all.

That said, the interviewer was very nice. As I kept confirming things with him, he also revealed a lot of information to me, including some of their own solutions.

Note that everything below is purely hindsight, written after I reorganized my thoughts. Just read it for fun.

First of all, when designing a solution, especially for this kind of big-data processing, the first thing to do is to observe the characteristics of the data. Just like when I took part in the Huawei Software Elite Challenge, you have to find ideas in the data's characteristics and design the scheme to suit them.

If the sample data contains a lot of literal repetition, for example many records that are literally "An epidemic occurred in Hangzhou", then it is worth a first pass that removes the exactly identical ones. But if such data is rare, we don't need this pass, because maintaining such a data structure (for example a Map) in memory is expensive, especially at this data volume.

Deduplicating by labels is a good approach, thanks to the NLP model. But we also need one more piece of information: is extracting labels with the NLP model efficient? If it is, we can put it early in the pipeline, because in many cases whether texts are duplicates depends largely on their labels; if it is not, we need to push label extraction later and first filter with cheaper attributes such as location and time (although in practice that probably filters out very little; of course it depends on the characteristics of the data set, and I suspect it would not help much).

For the specific design, we roughly know that the same labels, the same location, and a time within some range may indicate the same event. In other words, we can treat a text as a combination of labels and places (a place can also be treated as a label), and a duplicate means the same combination.
For convenience we can assign a number to each label and location, sort the numbers when forming the combination, and join them with ",": for example, "An epidemic occurred in Hangzhou" could become "3,9", where 3 is the number for Hangzhou and 9 is the number for the "epidemic" label. Then put it into a map for deduplication, with the converted "serial number" string as the key and the original text as the value (so that the non-duplicate texts can be queried later). A small sketch of this idea follows.
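Here is a minimal sketch of that idea, with made-up label IDs and with the NLP step assumed to have already produced the labels:

```java
import java.util.*;

// Toy deduplication by a "sorted label/location IDs" key. Label extraction (the NLP part)
// is assumed to have happened already; the IDs and texts here are made up.
public class DedupSketch {
    public static void main(String[] args) {
        // text -> extracted label/location IDs (e.g. 3 = "Hangzhou", 9 = "epidemic")
        Map<String, int[]> incoming = new LinkedHashMap<>();
        incoming.put("An epidemic occurred in Hangzhou", new int[]{3, 9});
        incoming.put("Original title: an epidemic occurred in Hangzhou", new int[]{9, 3});
        incoming.put("An epidemic occurred in Shanghai", new int[]{5, 9});

        Map<String, String> seen = new HashMap<>(); // key -> first text with that key
        for (Map.Entry<String, int[]> e : incoming.entrySet()) {
            String key = normalize(e.getValue());
            seen.putIfAbsent(key, e.getKey()); // later texts with the same key are duplicates
        }
        seen.forEach((key, text) -> System.out.println(key + " -> " + text));
    }

    // Sort the IDs so the same set of labels always produces the same key.
    static String normalize(int[] labelIds) {
        int[] sorted = labelIds.clone();
        Arrays.sort(sorted);
        StringBuilder key = new StringBuilder();
        for (int id : sorted) {
            if (key.length() > 0) key.append(',');
            key.append(id);
        }
        return key.toString();
    }
}
```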

Alternatively, a trie (dictionary tree) can be used. Unlike the usual usage, we still need to sort the numbers before inserting, because what we care about is the same combination, not the same permutation.

The above handles deduplication on location and labels.

Next comes the handling of time. Since we cannot be sure whether news reported at different times is the same news, the time dimension is largely just a reference and cannot be the decisive factor, so the design here should consider what effect we want to achieve. Would we rather kill a thousand by mistake than let one slip through, or do we pursue keeping as many genuinely distinct items as possible?

For the sake of generality, we can introduce a validity window when deduplicating on time: when two items both say "an epidemic in Hangzhou", how far apart must they be reported before they count as different news?
Or we can design something a bit more elaborate, borrowing the idea of an aging algorithm: keep a map/trie per day or per hour, and the older a map/trie is, the more easily its entries are judged to be different news. Each deduplication check then walks the buckets in time order to decide whether there is a duplicate. A sketch of the simpler windowed variant follows.
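Here is a small sketch of the simpler fixed-window variant (my own illustration; the 24-hour window is an arbitrary choice): keep one key set per hour, drop buckets older than the window, and treat a key as a duplicate only if some bucket inside the window already contains it:

```java
import java.util.*;

// Toy time-windowed deduplication: one key set per hour, buckets older than
// WINDOW_HOURS are discarded, and a key is a duplicate only if some bucket in
// the window already contains it. The window size is a made-up parameter.
public class WindowedDedupSketch {
    private static final int WINDOW_HOURS = 24;
    private final Deque<Set<String>> buckets = new ArrayDeque<>(); // newest first

    // Called once per hour when a new batch starts.
    void startNewHour() {
        buckets.addFirst(new HashSet<>());
        while (buckets.size() > WINDOW_HOURS) {
            buckets.removeLast(); // age out old hours
        }
    }

    // Returns true if the key was not seen inside the window (i.e. keep this text).
    boolean addIfNew(String key) {
        for (Set<String> bucket : buckets) {
            if (bucket.contains(key)) return false; // duplicate within the window
        }
        buckets.peekFirst().add(key);
        return true;
    }

    public static void main(String[] args) {
        WindowedDedupSketch dedup = new WindowedDedupSketch();
        dedup.startNewHour();
        System.out.println(dedup.addIfNew("3,9")); // true, first time
        System.out.println(dedup.addIfNew("3,9")); // false, duplicate this hour
        dedup.startNewHour();
        System.out.println(dedup.addIfNew("3,9")); // false, still inside the 24h window
    }
}
```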

In short, there is not much point in empty talk here. However much is said, it is only on paper; it all depends on the characteristics of the data set, and many of the ideas come precisely from those characteristics.

Summary

First of all, I am very grateful to Meituan and Ali for giving an undergraduate from a double non school the chance to interview, which made me see my own shortcomings. Secondly, I am very grateful to the Ali interviewer for his advice: think more, and find a direction.
Many times it is really hard for me to calm down and think about problems, especially when I am rushing a project; even though I run into plenty of problems, I just want to get the feature done. Actually, I think I reflect on my learning more than many of my peers, and many of the blog posts I write record my own thinking and summaries, but that is not enough. Learning should be a process of continuous trial and error and lifelong exploration.

Looking back on my self-study journey, I know I am not smart and not better than others. What I have is just persistence and stubbornness: I can stay in the lab studying and exploring all day, enjoying the fun and fulfillment that knowledge and projects bring me, in a way others might not manage. That's all.


Finished on the night of 2022.3.26

May we ride our dreams like horses and live up to our youth.
Let us encourage each other!

Origin blog.csdn.net/qq_46101869/article/details/123756852