"Quasi-real time data warehouse" design

         Today's data warehouses can be roughly divided into offline and real-time warehouses. An offline warehouse usually runs its ETL at T+1; a real-time warehouse typically runs ETL at minute or even sub-minute latency. A real-time warehouse generally extracts the upstream business database's binlog (or other change feeds) into Kafka in real time and then performs real-time ETL. Mainstream real-time warehouses can be further split into two flavors. One is the standard real-time warehouse, in which every ETL step is handled by a streaming engine such as Spark or Flink: data flows from the binlog into Kafka, and each subsequent ETL layer reads from Kafka, computes, and writes back to Kafka in a chain, which satisfies the definition of a complete, layered warehouse. The other is a simplified real-time warehouse with only a limited number of ETL layers: after the binlog lands in Kafka, Spark or Flink reads from Kafka, computes the metrics, and writes them to external storage such as HBase for analysis; the metric computation can of course also be done with Druid or Kylin.

         So what exactly is a "quasi-real-time data warehouse"?

         In fact, a "quasi-real-time data warehouse" is a simple upgrade of an offline warehouse: it shortens the offline, day-level ETL cycle to half an hour or an hour, while at the same time letting the ODS layer serve real-time queries to external systems. Shortening the offline warehouse's cycle is relatively simple: every hour or half hour, extract data incrementally and MERGE it into the ODS layer; the subsequent ETL is exactly the same as in the offline warehouse.

         "Foreign ODS layer provides real-time data query" What is the use of scenario?

         Internet companies generally use MySQL for their business databases. When data volume is large, the databases and tables are sharded, and each business line usually has its own MySQL instances. Querying data across products therefore becomes very troublesome. Who needs to query data across products? The customer service system. A customer service system typically needs to look up everything about a user; if that user's information is spread across different MySQL instances, databases, and tables, the query involves sharding-jdbc, and if every product uses a different sharding field and a different sharding algorithm, the query will certainly be slow and complicated. At this point you need one database that aggregates all of this data, and the ODS layer of the warehouse is well suited for the job.

         "Quasi-real time data warehouse" two feature involves two technical difficulties: 1 delta) data extraction and incremental MERGE; 2) provide real-time query interface. The following technical difficulties for these two were introduced corresponding solutions.

         Incremental data extraction and MERGE.

         Incremental extraction is actually fairly simple: extract data based on an incremental field, which can be an ever-increasing ID or an update time. What is the difference between the two?

         Extraction by ID is suitable for tables that are append-only and never updated. Is it enough to record the maximum ID of the last extraction and start the next extraction from that ID? Of course not.

         Consider the following scenario: the USER_LOGIN_HISTORY table is split into 10 shard tables, USER_LOGIN_HISTORY_0 ~ USER_LOGIN_HISTORY_9, and the ID is auto-incrementing (for example an auto_increment column). Suppose three concurrent transactions are assigned the IDs 1/2/3, but before they commit, another three concurrent transactions start and are assigned 4/5/6. If the latter three commit first and incremental extraction happens to run at that moment, the current maximum ID is 6, but unfortunately the rows with IDs 1/2/3 have not been committed yet and cannot be extracted. The data of those three transactions will then never be extracted, because the next extraction sees a recorded maximum ID of 6 and starts from 7!

         Instead, extraction should start from the minimum ID of the previous batch's extraction. Although this produces duplicate data, it guarantees that no data is lost. In other words, if the current batch is batch 3, extraction should start from the point where batch 2 started, i.e., the maximum ID recorded at the end of batch 1. Why? See below.

[Figure: incremental extraction by ID across three batches (Batch1/Batch2/Batch3 with maximum IDs 1001/2001/4002)]

         When Batch1/Batch2/Batch3 run, the current maximum IDs in the MySQL table are 1001/2001/4002 respectively. Batch2 should extract the data with IDs from 1001 to 2001, but unfortunately the 10 rows with IDs 1990 to 1999 have not been committed yet at that moment, so they are missed. Where, then, should Batch3 start extracting? 1001, 1990, or 2001?

         Ideally it should start from 1990, because only the data from 1990 to 1999 is missing. But how do you know which data is missing? You can't, because at extraction time you do not know which transactions had not yet committed. So clearly it should start from 1001. In practice, each extraction simply covers the last two batches of data, which is enough to sidestep the effect of in-flight transactions. Why exactly two batches? Setting it to two batches assumes that the longest transaction lasts less than the interval between batches. For a quasi-real-time warehouse, a batch interval is generally an hour or half an hour, longer than the longest transaction, so two batches are enough. If the batch interval is short, simply push the starting point back a few more batches. A minimal sketch of this two-batch lookback is shown below.
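
A rough sketch of the two-batch lookback, using the USER_LOGIN_HISTORY shard tables from the example above; the user_id and login_time columns are made up for illustration:

```java
import java.sql.*;

public class IncrementalIdExtractor {
    // Extract rows whose ID is >= the starting ID recorded two batches ago,
    // so that rows committed late by long-running transactions are not lost.
    public static void extractBatch(Connection mysql, long startIdTwoBatchesAgo) throws SQLException {
        String sql = "SELECT id, user_id, login_time "
                   + "FROM USER_LOGIN_HISTORY_0 "   // repeat for each shard table _0 .. _9
                   + "WHERE id >= ?";
        try (PreparedStatement ps = mysql.prepareStatement(sql)) {
            ps.setLong(1, startIdTwoBatchesAgo);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // write the row to the incremental/staging table;
                    // duplicates are acceptable and are resolved by the MERGE step
                }
            }
        }
    }
}
```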

         Once the incremental data has been extracted, the MERGE is relatively simple: FULL JOIN the incremental data with the ODS full table, and let the incremental data win. One thing to consider is whether the business database allows physical deletes. In our case physical deletes are not allowed, so a FULL JOIN is all we need. Physical deletes are more troublesome, because incremental extraction cannot see rows that have been deleted. What can be done? You can extract the DELETE events from the business database's binlog into a separate table and use it to clean up the ODS full table. A sketch of the MERGE follows.
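
A minimal sketch of the MERGE as a Spark SQL FULL OUTER JOIN, assuming hypothetical ods.user_info and stage.user_info_inc tables keyed on id; the CASE expressions make the incremental row win whenever it exists:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OdsMerge {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ods-merge")
                .enableHiveSupport()
                .getOrCreate();

        // FULL OUTER JOIN the ODS full table with the incremental table;
        // when a key exists on both sides, take the incremental side.
        Dataset<Row> merged = spark.sql(
            "SELECT COALESCE(inc.id, ods.id) AS id, "
          + "       CASE WHEN inc.id IS NOT NULL THEN inc.name        ELSE ods.name        END AS name, "
          + "       CASE WHEN inc.id IS NOT NULL THEN inc.update_time ELSE ods.update_time END AS update_time "
          + "FROM ods.user_info ods "
          + "FULL OUTER JOIN stage.user_info_inc inc ON ods.id = inc.id");

        // write the merged result out as the next ODS snapshot
        merged.write().mode("overwrite").saveAsTable("ods.user_info_new");
    }
}
```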

         Extraction based on update time is similar to the scheme above, but with one caveat: when extracting, only bound the time from below, never from above. For example, suppose a batch starts and the "current time" is captured as "2019-02-22 18:00:00.153". Because the actual extraction may lag that captured time by a few milliseconds or even seconds, data updated inside that gap could be lost if the captured time were also used as an upper bound. See the sketch below.
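
A minimal sketch of update-time extraction, assuming a hypothetical user_info table with an update_time column; note that only the lower bound is applied:

```java
import java.sql.*;

public class IncrementalTimeExtractor {
    // Extract everything updated since the start of the previous batch.
    public static void extractSince(Connection mysql, Timestamp previousBatchStart) throws SQLException {
        // Do not add an upper bound such as "update_time < ?" with a captured "now":
        // the captured time can lag the real extraction moment, and rows updated
        // in that gap could be missed.
        String sql = "SELECT id, name, update_time FROM user_info WHERE update_time >= ?";
        try (PreparedStatement ps = mysql.prepareStatement(sql)) {
            ps.setTimestamp(1, previousBatchStart);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // stage the row for the subsequent MERGE
                }
            }
        }
    }
}
```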

         Real-time query interface.

         Systems that need real-time cross-product, cross-database, cross-table queries are generally back-office systems. Their characteristics are: many query data sources, but small result sets; a query usually returns data for one user or a handful of users. This can be implemented with sharding-jdbc or with ElasticSearch full-text search. sharding-jdbc has its quirks, but it is the simpler of the two: configure the sharding rules and write the SQL, as in the sketch below.
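
A rough sketch of the sharding-jdbc route, assuming the sharding rules have already been configured elsewhere and exposed as a DataSource (configuration omitted); the application simply issues ordinary SQL against the logical table:

```java
import java.sql.*;
import javax.sql.DataSource;

public class CustomerServiceQuery {
    // shardingDataSource is assumed to be a sharding-jdbc DataSource whose rules
    // map the logical table USER_LOGIN_HISTORY onto the physical shard tables _0 .. _9.
    public static void printLogins(DataSource shardingDataSource, long userId) throws SQLException {
        String sql = "SELECT id, user_id, login_time FROM USER_LOGIN_HISTORY WHERE user_id = ?";
        try (Connection conn = shardingDataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getTimestamp("login_time"));
                }
            }
        }
    }
}
```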

ElasticSearch full-text search is more trouble, because ES has no complete SQL interface, so the required data has to be assembled ahead of time, which in turn raises the problem of joining and aggregating multiple tables in real time. Suppose a query result involves three upstream tables that must be joined on different conditions and aggregated in real time: if one table's data has not arrived yet, the other two cannot be written; you can only cache the partial data and aggregate once everything has arrived, which is quite hard to implement.

         Of course, you could also extract all of the required business databases into a single MySQL database for querying. With a large amount of data, this approach performs very poorly.

         Is there a better plan? We can extract the data of a logical table (the logical table obtained by merging the sharded databases and tables) into Phoenix in real time via the binlog, and let the front-end business system query Phoenix in real time through its JDBC interface. Since Phoenix supports indexes, we can query Phoenix much as we would query MySQL, although some SQL optimization may be required. A minimal sketch of the query path follows.
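
A minimal sketch of the front-end query path through the Phoenix JDBC driver; the ZooKeeper quorum, table, and column names here are assumptions, and USER_ID would need to be the primary key or carry a secondary index:

```java
import java.sql.*;

public class PhoenixRealtimeQuery {
    public static void main(String[] args) throws Exception {
        // The Phoenix thick-client JDBC URL points at the HBase ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT USER_ID, NAME, UPDATE_TIME FROM USER_INFO WHERE USER_ID = ?")) {
            ps.setLong(1, 10001L);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("USER_ID") + " " + rs.getString("NAME"));
                }
            }
        }
    }
}
```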

         Since Phoenix sits on top of HBase, it can handle massive reads and writes; and Phoenix tables can be mapped into Hive for offline queries. So real-time query and offline analysis needs are unified in one store! Is there any problem with this plan? One thing to consider is how to guarantee the accuracy of the real-time extraction. In other words, what do we do if an update event in the binlog is lost on its way to Phoenix?

         Obviously we can use incremental extraction to backfill the data into Phoenix. So is it enough to backfill following the incremental extraction logic above? A sketch of the backfill step is shown below.
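
A rough sketch of that backfill: batching UPSERTs of the incremental rows through Phoenix JDBC (table and column names assumed). This is exactly the step where the overwrite problem discussed next shows up:

```java
import java.sql.*;

public class PhoenixBackfill {
    // Upsert the incremental rows into Phoenix in batches.
    public static void upsertIncrement(Connection phoenix, ResultSet incrementalRows) throws SQLException {
        phoenix.setAutoCommit(false);
        String upsert = "UPSERT INTO USER_INFO (USER_ID, NAME, UPDATE_TIME) VALUES (?, ?, ?)";
        try (PreparedStatement ps = phoenix.prepareStatement(upsert)) {
            int pending = 0;
            while (incrementalRows.next()) {
                ps.setLong(1, incrementalRows.getLong("user_id"));
                ps.setString(2, incrementalRows.getString("name"));
                ps.setTimestamp(3, incrementalRows.getTimestamp("update_time"));
                ps.executeUpdate();
                if (++pending % 1000 == 0) {
                    phoenix.commit();   // flush to HBase in batches
                }
            }
            phoenix.commit();
        }
    }
}
```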

         Not quite: there is still a problem, and it is once again a transaction problem, this time during the backfill. The incremental data set is usually large, so the backfill takes a while, say 3 minutes. During that window, can data that was just updated in real time be overwritten by the backfill? Obviously it can. You might say: during the backfill, for rows with the same primary key, compare the update times and keep the latest. That is still problematic, because Phoenix does not enable transactions by default: the incremental row may be the latest at the moment you check, but by the time it is written to Phoenix it may no longer be, because a real-time update can land in between.

         So just enable Phoenix transactions? That should solve the problem, but Phoenix's transaction support is still in beta, and it can also introduce performance problems and deadlocks.

         So how do we stop the real-time data and the offline incremental data from overwriting each other? Is there a way to have the best of both worlds?

         Anyone familiar with HBase knows that HBase has the concept of a timestamp and supports multi-version queries by timestamp. When data is inserted through Phoenix, the timestamp of every column is the RegionServer's current time; in other words, for the same ID, later inserts get ever larger timestamps, and a query returns only the latest version. What does this have to do with the problem above? The sketch below makes the timestamp behavior concrete.
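
A small sketch against the plain HBase client API (table, column family, and qualifier names are assumptions): each write carries an explicit cell timestamp, and a default read returns only the cell with the largest timestamp, regardless of write order:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("USER_INFO"))) {

            byte[] row = Bytes.toBytes("10001");
            byte[] cf  = Bytes.toBytes("0");
            byte[] col = Bytes.toBytes("NAME");

            // The "older" business version carries the smaller explicit timestamp...
            Put older = new Put(row);
            older.addColumn(cf, col, 1000L, Bytes.toBytes("old-name"));
            // ...and the "newer" business version carries the larger one.
            Put newer = new Put(row);
            newer.addColumn(cf, col, 2000L, Bytes.toBytes("new-name"));

            // Even though the older version is written last, a default Get returns
            // the cell with the largest timestamp: "new-name".
            table.put(newer);
            table.put(older);
            Result r = table.get(new Get(row));
            System.out.println(Bytes.toString(r.getValue(cf, col)));
        }
    }
}
```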

         If we map the data's update time to the timestamp of the HBase table underlying Phoenix, doesn't that solve the transaction problem perfectly? It is very simple: map the update time to the HBase timestamp, and then both the real-time data and the incremental data can simply be upserted through Phoenix; a Phoenix query will naturally return only the latest data!

         However, the ideal is beautiful and the reality is harsh: Phoenix currently does not support mapping an arbitrary field to the HBase timestamp!

There was no way around it except to modify the source code ourselves. By modifying the Phoenix source code, we added support for a special field type, ROW_TS: the value of a column of this type is written as the HBase timestamp, which means you can freely specify the timestamp when inserting data through Phoenix. Below are the results after the modification; clearly, they match expectations.

[Figure: query results after the ROW_TS modification]

This concludes the introduction to the "quasi-real-time data warehouse" approach; the architecture diagram below gives a brief summary.

[Figure: overall architecture of the quasi-real-time data warehouse]

1) Upstream MySQL binlogs are captured by Debezium (or Canal) and written to Phoenix in real time through Flume. When a new field is added upstream, the Phoenix table structure is modified in real time.

2) Based on the data's update time, incremental data is extracted from MySQL every hour and MERGEd into Phoenix in batches.

3) Hive external tables are created automatically over the Phoenix tables every day (Hive external tables can also be created over Phoenix's underlying HBase tables), and the subsequent ETL proceeds as in the offline warehouse.

4) The real-time query front end connects to Phoenix through JDBC and queries data in real time by primary key or by index.

 

Origin www.cnblogs.com/gabry/p/10422046.html