100 billion plain-text records and highly concurrent MD5 queries: how do you handle a business with this much data?

==Question from a planet reader==
Hello, Mr. Shen, I would like to ask a question about ID card information retrieval.

The company has a business with 50,000 concurrent queries per second, (hypothetically) looking up ID-card information by the MD5 of the ID number. There are currently 100 billion records stored as plain text. I saw your piece on LevelDB a few days ago; could this business use LevelDB as an in-memory KV store/database? Are there any other optimization options?
Voiceover: the reference is to LevelDB as an "in-memory KV cache/database".
==End of problem description==

The previous planet reader asked about paged back-office queries over 3.6 billion log entries; this one follows up with MD5 queries over 100 billion text records. This business needs to solve at least:
(1) the query problem;
(2) the performance problem;
(3) the storage problem.

1. The query problem

Searching and filtering raw text is very inefficient, so the first problem to solve is turning text filtering into a structured query.

Since the query condition is the MD5, the data can be structured as:
(MD5, data)
This can then be served as a KV lookup, or as an index query in a database.

Note that an MD5 is usually stored as a string, and a string index performs worse than an integer one. You can convert the MD5 string into two uint64_t values for storage to improve indexing efficiency:

(md5_high, md5_low, data)
The two integers form a composite index, or a composite key in a KV store.
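
To make the conversion concrete, here is a minimal Python sketch (my own illustration, not from the original article) that recomputes the MD5 from a hypothetical ID number and splits the 128-bit digest into two unsigned 64-bit integers to use as the composite key:

import hashlib

def md5_to_uint64_pair(raw: str) -> tuple[int, int]:
    # Split the 128-bit MD5 digest into two unsigned 64-bit integers,
    # which index far better than a 32-character hex string.
    digest = hashlib.md5(raw.encode("utf-8")).digest()  # 16 bytes
    md5_high = int.from_bytes(digest[:8], "big")        # first 8 bytes
    md5_low = int.from_bytes(digest[8:], "big")         # last 8 bytes
    return md5_high, md5_low

# The pair (md5_high, md5_low) becomes the composite index / KV key.
print(md5_to_uint64_pair("110101199001011234"))  # hypothetical ID number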

This business has one strong characteristic: every query is a single-row lookup by primary key. Regardless of the data volume, and even without any cache, a traditional relational database can sustain at least 10,000 (1W) such queries per second on a single machine.
Voiceover: In reality the data cannot fit on a single machine; more on that later.
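
To illustrate the single-row primary-key lookup, here is a small sketch using SQLite purely as a stand-in for "a traditional relational database" (the table name, columns, and demo values are all assumptions; a real deployment would use something like unsigned BIGINT columns):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE id_info (
        md5_high INTEGER NOT NULL,
        md5_low  INTEGER NOT NULL,
        data     TEXT,
        PRIMARY KEY (md5_high, md5_low)   -- composite primary key
    )
""")
conn.execute("INSERT INTO id_info VALUES (?, ?, ?)", (12345, 67890, "demo record"))

# Every request is a single-row lookup on the composite primary key.
row = conn.execute(
    "SELECT data FROM id_info WHERE md5_high = ? AND md5_low = ?",
    (12345, 67890),
).fetchone()
print(row)  # ('demo record',)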

2. The performance problem

The concurrency is 50,000 (5W) queries per second, which is a very large throughput, so the second thing to solve is performance.

The ID-card query business has two strong characteristics:

(1) The data being queried is fixed;

(2) There are only read requests, no write requests.

It is easy to see that a cache is a great fit for this scenario. Better still, the data can be loaded into memory in advance to avoid cache warm-up entirely.
Voiceover: Design around the business characteristics. Any architecture design divorced from the business is just hooliganism.

If memory is large enough and the data is preloaded, the cache hit rate can reach 100%. Even without preloading, each record incurs at most one cache miss; once a record enters the cache it never needs to be invalidated, since there are no write requests.
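
As a minimal sketch of that idea (load_all_records and fetch_from_storage are hypothetical placeholders, not from the original text): a preloaded, read-only cache where entries never need invalidation because nothing is ever written.

class ReadOnlyCache:
    def __init__(self, loader):
        self._store = {}        # key -> record, kept in memory
        self._loader = loader   # fallback for records that were not preloaded

    def preload(self, records):
        # Warm the cache up front so there is no cold-start miss penalty.
        for key, value in records:
            self._store[key] = value

    def get(self, key):
        if key not in self._store:        # at most one miss per record
            self._store[key] = self._loader(key)
        return self._store[key]           # never evicted: no writes, no invalidation

# Usage (with hypothetical data sources):
# cache = ReadOnlyCache(fetch_from_storage)
# cache.preload(load_all_records())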

Does the premise of sufficient memory hold?

Assuming each ID-card record is about 0.5 KB, 100 billion records come to roughly:
100 billion * 0.5 KB = 50,000 GB = 50 TB
Voiceover: The math checks out, right?
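
A quick back-of-the-envelope check of that estimate in Python (0.5 KB per record is the assumption from the question):

records = 100_000_000_000            # 100 billion ID-card records
bytes_per_record = 512               # ~0.5 KB each (assumed)
total_bytes = records * bytes_per_record
print(total_bytes / 1000**4)         # 51.2 -> about 50 TB, as estimated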

Seen this way, unless you have an extravagant memory budget, the cache cannot hold all the data and can only hold the hot data.
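
If only the hot data fits in memory, the simplest sketch is an LRU cache in front of the storage layer (fetch_from_storage below is a hypothetical placeholder, and maxsize would be tuned to the available memory):

from functools import lru_cache

def fetch_from_storage(md5_high: int, md5_low: int) -> str:
    # Placeholder: a real implementation would query the backing store.
    return f"record for {md5_high}:{md5_low}"

@lru_cache(maxsize=10_000_000)       # bounded: only the hottest keys stay cached
def get_record(md5_high: int, md5_low: int) -> str:
    return fetch_from_storage(md5_high, md5_low)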

Is 50,000 (5W) QPS of throughput a bottleneck?

There are many ways to scale capacity linearly:
(1) Replicate the site and service layers into 10 or more copies;
(2) Split the storage (single-row primary-key queries) horizontally into 10 or more shards (see the routing sketch below).
Clearly, 50,000 QPS of concurrency is not a problem.
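
A small sketch of the routing idea for item (2) (the shard count and function name are illustrative): because MD5 output is uniformly distributed, taking the key modulo the number of shards spreads both data and query load evenly.

NUM_SHARDS = 16   # illustrative; the text suggests 10+ storage shards

def shard_for(md5_high: int) -> int:
    # MD5 is uniformly distributed, so a simple modulo keeps shards balanced.
    return md5_high % NUM_SHARDS

# Each query computes its shard first, then hits only that storage instance.
print(shard_for(0xDEADBEEF12345678))   # -> 8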

3. The storage problem

As analyzed above, 100 billion ID-card records amount to about 50 TB of data, which is simply too much: traditional relational databases and single-machine (in-memory) databases such as LevelDB are not a good fit, and with manual horizontal sharding the number of instances would be very large and hard to maintain.

Instead, use a storage technology designed for massive data volumes, such as HBase.
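
For illustration only, here is what the lookup could look like on HBase, assuming the happybase Python client, a hypothetical Thrift host, and a table named id_info (none of which come from the original article); the 16-byte MD5 itself is used as the row key, so every query stays a single-row point read:

import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
table = connection.table("id_info")                      # hypothetical table name

def get_by_md5(md5_hex: str) -> dict:
    row_key = bytes.fromhex(md5_hex)    # 16-byte binary row key
    return table.row(row_key)           # single-row point read (HBase GET)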

Finally, the recommendations drawn from this example:
(1) Never retrieve by raw text; always structure the data;
(2) For single-row, read-only queries, cache + redundancy + horizontal sharding can greatly improve throughput;
(3) Use storage technology suited to massive data volumes.

My experience is limited; everyone is welcome to contribute more and better solutions.
The thinking matters more than the conclusion.
Everyone is welcome to keep the questions coming; I will answer them all.

Previously answered reader questions:

"How does MQ achieve smooth migration? "
"3 billion logs, retrieval + paging + background display"

Homework exercise:

With 100 billion records, different ID numbers could produce duplicate MD5s (collisions). What should be done?

Source: blog.51cto.com/jyjstack/2548557