3 billion logs, with retrieval, paging, and back-office display: have you ever encountered a more exotic requirement?


Hello Mr. Shen, I would like to ask a question about querying logs from a database and displaying them on a web page.

Requirements:

(1) Query logs according to specific search conditions;
(2) Query and display the relevant log information through a front-end web page;
(3) The search conditions include specific fields such as user, time period, type, etc.

Desired properties:

(1) Queries should be as fast as possible;
(2) Paged queries should be supported.

Current plan:

Log information is stored in Oracle, partitioned by date: one partition table is generated per day, each holding about 10 million rows. Indexes are created on the commonly queried fields such as user and type, to satisfy queries along different dimensions.

Potential problem:

Cross-partition queries, and counting the total number of records (needed when paging), take 3-4 minutes. Is there any way to optimize this?

==End of problem description==

This requirement is indeed unusual. Normally logs are filtered/structured/aggregated and loaded into a data warehouse, wide business tables are built on top, and queries against wide tables rarely need to fetch raw records row by row.

To support retrieval and row-by-row display in a web back office, at least these problems must be solved:
(1) the storage problem;
(2) the retrieval problem;
(3) the scalability problem (growth in data volume and in search fields).

1. The storage problem

Can a relational database be used to store logs?

If the log format is fixed and the retrieval conditions are fixed, it is possible.

For example:
2019-08-11 23:19:20 uid=123 action=pay type=wechat money=12
can be converted into a table:
t_log(date, time, uid, action, type, money)
and the relevant fields can then be indexed to meet the back-office query and display requirements.
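
To make this concrete, here is a minimal DDL sketch in Oracle-flavored SQL. It is an illustration under assumptions, not from the original post: uid is renamed user_id (UID is reserved in Oracle), and the date/time columns are renamed log_day/log_time for the same reason.

    -- Structured log table; the column renames noted above are assumptions.
    CREATE TABLE t_log (
        log_day   DATE          NOT NULL,  -- 2019-08-11
        log_time  TIMESTAMP     NOT NULL,  -- 2019-08-11 23:19:20
        user_id   NUMBER        NOT NULL,  -- uid=123
        action    VARCHAR2(32),            -- pay
        type      VARCHAR2(32),            -- wechat
        money     NUMBER(12,2)             -- 12
    );

    -- Index the fields that appear in search conditions (user, type, time).
    CREATE INDEX idx_log_uid_time  ON t_log (user_id, log_time);
    CREATE INDEX idx_log_type_time ON t_log (type, log_time);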

The data volume is very large; how do we handle that?

According to the problem description, the daily data volume is about 10 million rows, so one year holds roughly 3.6 billion rows.

With Oracle and one 10-million-row partition table per day, a year needs 365 partitions; cross-partition query performance is poor, so this is not a good fit.


Switching to one partition per month: a single partition holds about 300 million records, and most partitions see no write operations (insert, update, delete), only read operations on the index, so read/write performance is basically manageable. With 12 partitions a year, performance is much better than with 365.
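
A minimal sketch of what monthly partitioning could look like, again in Oracle-flavored SQL; the partition names and bounds are illustrative assumptions:

    -- One range partition per month; roughly 300 million rows each.
    CREATE TABLE t_log (
        log_time  TIMESTAMP    NOT NULL,
        user_id   NUMBER       NOT NULL,
        action    VARCHAR2(32),
        type      VARCHAR2(32),
        money     NUMBER(12,2)
    )
    PARTITION BY RANGE (log_time) (
        PARTITION p2019_07 VALUES LESS THAN (TIMESTAMP '2019-08-01 00:00:00'),
        PARTITION p2019_08 VALUES LESS THAN (TIMESTAMP '2019-09-01 00:00:00')
        -- ...and so on, 12 partitions per year
    );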

Although the logs in this example can be structured (converted into a table), the data volume is so large that a relational database is in fact not a good fit for storage. ES or Hive, which are designed for larger data volumes, are better choices.

2. The retrieval problem

If the log format and the retrieval conditions are fixed and you store the data in a relational database or Hive, indexes on the relevant fields can satisfy the query requirements.
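
On the original pain point, the 3-4 minute paging queries: much of that cost typically comes from COUNT(*) plus OFFSET-style paging. One common alternative (my addition, not from the original post) is keyset or "seek" pagination, which remembers the sort key of the last row shown and avoids both. A sketch against the t_log table above, assuming Oracle 12c+ row-limiting syntax:

    -- Page N+1 starts where page N ended; no OFFSET scan, no full COUNT(*).
    -- :uid, :start_time, :end_time, :last_time are bind variables.
    SELECT log_time, user_id, action, type, money
    FROM   t_log
    WHERE  user_id  = :uid
    AND    log_time >= :start_time
    AND    log_time <  :end_time
    AND    log_time <  :last_time   -- sort key of the last row already shown
    ORDER BY log_time DESC
    FETCH FIRST 20 ROWS ONLY;       -- Oracle 12c+ row-limiting clause

In practice a unique tie-breaker column (e.g. an id) should be added to the sort key so that rows sharing the same log_time are not skipped.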

If you store the data in ES, it maintains an inverted index internally and supports retrieval natively.

3. The scalability problem

Data volume expansion

Whether you use Oracle, ES, or Hive for storage, the difference is only how much data a single instance/cluster can hold. If the data volume grows without bound, the essential solution is horizontal sharding.

Note that you should avoid relying on the database's built-in "partition tables" for this expansion; do the splitting in the business layer instead.
Voiceover: see "Why don't Internet companies use partition tables?".

Search field expansion

If the logs are not standardized and the search fields are not fixed, things get harder: this turns into a "search engine" problem.

ES is the better fit in that case, but combined with an unbounded data volume, you may eventually have to build the search-engine storage yourself (think Baidu: unlimited storage capacity and non-fixed search fields).
Voiceover: see "The principles, architecture, implementation, and practice of search".

To sum up:

For this example, with a large log volume and a fixed model, the suggestions are:
(1) Most recommended: use Hive for storage to meet the back-office log retrieval requirements (a sketch follows this list);
(2) If scalability requirements are somewhat higher, use ES for both storage and retrieval, and scale horizontally to handle larger data volumes.
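
For suggestion (1), a minimal HiveQL sketch; the table layout, the dt partition key, and the ORC format are my assumptions, and note that Hive in practice relies on partition pruning rather than secondary row-level indexes:

    -- Day-partitioned log table; queries filtering on dt scan only matching partitions.
    CREATE TABLE t_log (
        log_time  TIMESTAMP,
        user_id   BIGINT,
        action    STRING,
        type      STRING,
        money     DECIMAL(12,2)
    )
    PARTITIONED BY (dt STRING)   -- e.g. '2019-08-11'
    STORED AS ORC;

    -- Back-office retrieval, pruned to an 11-day window.
    SELECT log_time, user_id, action, type, money
    FROM   t_log
    WHERE  dt BETWEEN '2019-08-01' AND '2019-08-11'
    AND    user_id = 123
    AND    type = 'wechat';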

I hope the ideas above are helpful to the planet members. My experience is limited, and everyone is welcome to contribute more and better solutions. Ideas matter more than conclusions.
I have just started running a Knowledge Planet; questions are welcome.
Voiceover:
(1) Every question will be answered carefully;
(2) The 200 places in the first phase are full, and 300 places are now open for the second phase (with too many people, not every question can be answered);

An answer to another question from a planet member:

"How does MQ achieve smooth migration?"

Source: blog.51cto.com/jyjstack/2548561