[Iceberg+Alluxio] Accelerating Data Channels (Part 2)

Beinan Wang is a software engineer at Alluxio and a committer of PrestoDB. Before joining Alluxio, Dr. Wang was the technical lead of Twitter's Presto team, where he built large-scale distributed SQL systems for Twitter's data platform. He has twelve years of experience in performance optimization, distributed caching, and big data. Dr. Wang received his Ph.D. in computer engineering from Syracuse University, where his research focused on symbolic model checking and runtime verification of distributed systems.
Shouwei Chen is a software engineer at Alluxio, mainly responsible for data lake solution integration, structured data support, and high-availability optimization. Dr. Chen received his Ph.D. from the Department of Electrical and Computer Engineering at Rutgers University, specializing in performance and stability optimization of large-scale distributed systems.

Query the Iceberg table

Create Table

Many people who use Presto may only ever use the Hive connector. The Iceberg connector is actually quite similar to Hive: the two reference each other in both functionality and implementation, and the Iceberg connector reuses a lot of the Hive connector's underlying code. Creating a table works the same way. For example, we can take a few columns from the customer table of the TPC-DS dataset and create a table, specifying the data format (Parquet or ORC) and the partitioning. Partitioning by birth month is convenient here, because there are only 12 months and therefore at most 12 partitions.

After the table is created, you can query it just like a Hive table with select * from the table (test1 is the table name, under the iceberg.test schema). By appending a dollar sign ($) and "partitions" to the table name, you can list all of the table's partitions, each with its own statistics. For example, in the row for August you can see how many rows and files the partition contains and its total size. For the customer_sk column you can see the minimum and maximum values; for the birth day column, August shows a minimum of 1, a maximum of 31, and a count of null values, while September, being a shorter month, shows a maximum of 30. Every partition carries its own statistics, and predicate pushdown, which we will discuss later, uses them to skip many partitions. Hive actually has a similar capability, but the statistics live in the Hive metastore, and if that metadata is missing the feature cannot be used. In Iceberg the statistics are embedded in the table itself, which makes them much more useful.
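Below is a minimal sketch of what the example above might look like in Presto SQL. The catalog, schema, and column names (iceberg.test.test1, c_customer_sk, and so on) are illustrative assumptions, not the exact statements from the talk:

```sql
-- Create an Iceberg table from a few TPC-DS customer columns, partitioned by birth month
CREATE TABLE iceberg.test.test1 (
    c_customer_sk   bigint,
    c_first_name    varchar,
    c_birth_day     integer,
    c_birth_month   integer
)
WITH (
    format = 'PARQUET',                    -- could also be 'ORC'
    partitioning = ARRAY['c_birth_month']  -- at most 12 partitions, one per month
);

-- Query it like a Hive table
SELECT * FROM iceberg.test.test1;

-- List every partition with its statistics (row count, file count, total size,
-- and per-column min/max/null counts)
SELECT * FROM iceberg.test."test1$partitions";
```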

Insert

As mentioned earlier, Iceberg has transaction support, so let's try adding a row to this table. The customer SK is 1000 and the birth day is 40, and I deliberately inserted a month that cannot exist: month 13. This effectively creates a new partition, but in fact a new snapshot is generated whether or not a new partition is created. In Presto, by appending a dollar sign ($) and "snapshots" to the table name in select * from, you can list all of the table's snapshots. You can see two snapshots here: one generated when the table was created and another generated after inserting the row. The manifest list has two Avro files. The second snapshot is based on the first, so the parent ID of the second snapshot is the first snapshot's ID; we will use snapshot IDs for time travel later.
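A sketch of the insert and the snapshot listing described above (names follow the earlier example; the snapshot IDs in the output will of course differ):

```sql
-- Insert one row with a deliberately impossible birth day (40) and month (13);
-- this creates a new partition and, more importantly, a new snapshot
INSERT INTO iceberg.test.test1 (c_customer_sk, c_birth_day, c_birth_month)
VALUES (1000, 40, 13);

-- List all snapshots of the table: one from CREATE TABLE, one from the INSERT.
-- Each row includes the snapshot ID, its parent ID, and the manifest list (an Avro file).
SELECT * FROM iceberg.test."test1$snapshots";
```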

What happens on the file system after we add a partition? Take a look at the directory: Iceberg's layout is very simple. We specify a root directory, and it creates a test1 directory under it containing two folders, data and metadata. The data is partitioned by month, and there are now 14 partitions: the 12 months, plus a null partition, plus the newly added month 13. Under each partition directory are the Parquet files.
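An illustrative sketch of that layout (the root path is whatever you configured, and file names are abbreviated; the exact naming can differ by Iceberg version):

```
<warehouse>/test1/
├── data/
│   ├── c_birth_month=1/     ...parquet
│   ├── ...
│   ├── c_birth_month=12/    ...parquet
│   ├── c_birth_month=13/    ...parquet   (the newly inserted month)
│   └── c_birth_month=null/  ...parquet
└── metadata/
    ├── *.metadata.json      (table metadata)
    ├── snap-*.avro          (manifest lists, one per snapshot)
    └── *.avro               (manifests)
```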

Query

So what happens when we query?

In practice you write a SQL query with a condition, such as select * from test1 where the month is 13, and it returns the record I just inserted. Now let me introduce time travel. Notice the @ symbol that can follow the table name test1: I can append a snapshot ID after it. If I use the second snapshot, the query finds the record; if I use the first snapshot, the record is not there. Why? Because the first snapshot was taken before the record was inserted. This is quite useful, because sometimes I want to see what my table looked like yesterday. There is a downside, though: if you insert data frequently, a large number of snapshots will be generated, along with a large amount of Avro metadata. So should we expire and discard old snapshots? That is another optimization point; Presto does not have it yet, but I think we will add it in the future.
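A sketch of the partition filter and the @-snapshot time-travel syntax described above; the snapshot IDs are placeholders you would copy from the $snapshots output:

```sql
-- Return the newly inserted record
SELECT * FROM iceberg.test.test1 WHERE c_birth_month = 13;

-- Time travel: query the table as of a given snapshot by appending @<snapshot_id>.
-- Against the second snapshot the row is found; against the first snapshot
-- (taken before the insert) it is not.
SELECT * FROM iceberg.test."test1@<second_snapshot_id>" WHERE c_birth_month = 13;
SELECT * FROM iceberg.test."test1@<first_snapshot_id>"  WHERE c_birth_month = 13;
```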

In addition, some of you may ask: since the Iceberg connector supports transactions, can it replace MySQL and handle OLTP workloads with online transactional data? The answer is yes, but it cannot be used the way MySQL is used. Frequent inserts bring problems of their own and require deeper optimization; used naively like this, it will generate a large number of small files and snapshots. There are ways to solve these issues, and we will iterate on them over time.

Schema Evolution

This is a screenshot my former colleague Chunxu took while working on schema evolution. This is another highlight of Iceberg: a table originally has a certain set of columns, and I can add or change a column. That by itself is not hard, since the original Hive tables can do it too, but can the table still be queried afterwards? Iceberg's answer is yes: the table can still be queried after it has been changed. There are tricky cases, of course, and the data in the changed column is not complete, but whatever comes back is never wrong. You change the table first, and the old queries still work. I think this feature is quite practical, because every company's tables keep changing, and after the changes the tables can still be queried from the Presto side.
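A minimal sketch of what such a change might look like in Presto SQL; the added column name is an assumption for illustration:

```sql
-- Add a column to the existing Iceberg table
ALTER TABLE iceberg.test.test1 ADD COLUMN c_email_address varchar;

-- Old queries still work; existing rows simply return NULL for the new column
SELECT c_customer_sk, c_birth_month, c_email_address FROM iceberg.test.test1;
```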

Iceberg Connector Update

Next, I will talk about some of the community's contributions over the past two months, which I hope will be helpful to everyone. I am mainly talking about PrestoDB here; Trino is a separate story, though we try to take it into account as well.

New Features

Over the past two or three months, several new features have landed that unlocked a number of things for us.

1. The first credit goes to Jack Ye of Amazon AWS. He contributed native folder support, which is called the Hadoop catalog in Iceberg. It revitalizes many of our features and solves many of our pain points.

2. In addition, Baolong from Tencent added local cache support. The Iceberg connector can now work with RaptorX, that is, the cache used by the Hive connector, and enjoy the same local caching and speedup. Of course, it is not quite out of the box and may require some configuration, which I will talk about in detail later.

3. Next is Xinli Shang from Uber, with whom we upgraded Parquet. Xinli Shang is the chair of the Parquet community; after he upgraded Parquet, we took it and put it into Presto. The upgrade work lasted about half a year. After upgrading to the new Parquet, we also unlocked Iceberg 1.12, which brings more new features, including support for v2 Iceberg tables.

4. There is also predicate pushdown, which Beinan (Alluxio) will explain in detail later; it is a feature that can optimize queries.

Iceberg Native Catalog

This is the native catalog by Jack that I just mentioned; in Iceberg it was originally called the Hadoop catalog. Iceberg data is stored in S3, HDFS, or GCS, and every table already carries both its metadata and its data, so why do we need a Hive metastore and fetch metadata from it? The original Hive catalog still depends on Hive's metadata: we need to find the table's path from the metastore, load Iceberg's own metadata from that path, and only then query with Presto. With Jack's change we can support the Hadoop catalog: you simply give it a path, and all the tables live under that path. It scans the path and picks up every table, such as table1, table2, table3, and each table's metadata, so we no longer need the Hive metastore. With this native catalog, the integration of Presto and Iceberg is complete. We previously still relied on an additional metadata store, but now we can use the native catalog directly, which solves a lot of pain points.
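For reference, a sketch of what an Iceberg catalog file using this Hadoop (native, file-system-based) catalog might look like in PrestoDB. The warehouse path is an assumption, and the property names should be verified against the connector version you run:

```properties
# etc/catalog/iceberg.properties
connector.name=iceberg
# Use Iceberg's Hadoop catalog instead of the Hive metastore;
# all tables live directly under the warehouse path
iceberg.catalog.type=HADOOP
iceberg.catalog.warehouse=s3://my-bucket/warehouse/
```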

Iceberg Local Cache

This is the local cache that a friend asked about earlier. The feature was merged perhaps only two weeks ago; Baolong from Tencent is very capable and completed it in a few days. Why could it be done so quickly? It comes down to how the Iceberg connector is implemented: the Iceberg connector and the Hive connector share the same underlying code, including the same Parquet and ORC readers. In the RaptorX project we had already built a local cache under the Hive connector, and that cache has achieved good results at Facebook, Toutiao, and Uber. Moving that local cache over to Iceberg directly gives a very good result.

There is a key point I have to mention. This local cache is each worker's own private cache. It is not like the Alluxio cache, which is distributed and elastic and can be deployed on, say, 100 or 200 nodes and scaled horizontally. Here, instead, each worker gets a nearby, small-capacity local cache: a 500 GB or 1 TB local disk used as a cache.

This creates a problem: when Presto builds a plan, splits are assigned randomly. A large table may have 10,000 partitions and perhaps a million files; if a given file can land on any worker, no single worker can cache that much data. So we have soft affinity scheduling, somewhat like the sticky affinity feature in load balancing: if a file is assigned to worker1 once, it will keep going to worker1 in the future. Then worker1 only needs to cache that file, and the cache hit rate increases when the file is accessed again. This affinity feature must be turned on.

If you find that the local cache hit rate is very low, check whether affinity is configured correctly and whether your nodes are being added or removed frequently. Even without tuning anything else, simply turning on soft affinity with a 500 GB or 1 TB local cache should not give a low hit rate: with a large amount of data you should see a hit rate of 60 to 70%, and with a small amount of data it may even reach 100%.
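A sketch of the kind of configuration involved, borrowing the property names used for the Hive connector's RaptorX local cache and soft affinity scheduling; treat these names and values as assumptions and verify them against your PrestoDB version:

```properties
# In the Iceberg catalog properties file
# Soft affinity scheduling: the same file keeps going to the same worker
hive.node-selection-strategy=SOFT_AFFINITY

# Local data cache on each worker's disk (e.g. 500 GB to 1 TB per worker)
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///mnt/flash/cache
cache.alluxio.max-cache-size=500GB
```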

In fact, Presto hands planning of the scan over to Iceberg. After receiving a SQL request, Presto parses and decomposes the SQL and tells Iceberg what to read; Iceberg generates a plan saying which files to scan; Presto then distributes those files to specific workers via soft affinity, and the workers scan them. If the local cache is hit, the local copy is read; if not, the remote file is read. Alluxio acts as a second-level store here: if there is no local hit, the read goes to Alluxio, and if Alluxio misses as well, it goes to the third-level storage.

We will also have a semantic cache in the future, which is mainly for Hive, but as I mentioned earlier, because the underlying implementations of Iceberg and Hive come from the same source, both can use it. Here is one piece of recent news, which Jack from the AWS Iceberg team just shared: we may stop using Presto's Hive implementation for Iceberg. This will be optional; you can keep using Presto's Hive implementation or switch to Iceberg's native implementation. That way, new Iceberg features will no longer depend on Hive, which is a good thing, and it lets us introduce more vectorization. This is a long-term plan, and you may see it next year.

Iceberg Predicate Pushdown

Take a Presto query such as select * from table where city = 'Beijing' and profile.age > 18. It normally generates the plan shown in the three boxes on the left: first a scan, then the scan output goes to a filter, and the filter output is returned. In other words, the whole table is scanned and only then are the rows that meet the conditions kept, but that is not necessary. For example, if the table is partitioned by city, there is no need to scan the entire table; and each file carries statistics, so a file in which every age is below 18 can simply be skipped without scanning.

There is a connector optimizer in Presto. This is unique to PrestoDB, and it can optimize per connector. Why optimize per connector? Because many people may use an Iceberg table to mirror a Hive table, and the scans under those two tables come from different data sources, so you have to decide which conditions to push to which. The simplest rule is this: Iceberg currently does not support pushdown of nested fields such as profile.age, so I push down the city predicate, merging the city filter with the table scan into a single plan node that both filters and scans, and push that down to Iceberg. The age > 18 predicate stays behind and is applied as a filter after the scan completes. This is not the optimal solution, but it is the most basic rule.
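The query from the example above might look like this (table and column names are illustrative); the comments show which predicate can be merged into the Iceberg table scan and which stays in a separate filter:

```sql
SELECT *
FROM iceberg.test.users
WHERE city = 'Beijing'      -- pushed down into the Iceberg table scan (partition/file pruning)
  AND profile.age > 18;     -- nested field: not pushed down, evaluated by a filter after the scan
```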

Predicate Pushdown Resource Usage

Let's look at the effect. This is not a formal benchmark; it is the table I built earlier, with one new record added where the month equals 13. Without pushdown (the left side), the query scans about 2 million records and reads more than 200 KB of input data; with pushdown enabled, it scans only one record, and both time and data improve dramatically, because only the specific partition is scanned. You would not encounter a query quite like this in reality, where the conditions are more complex and each partition may hold many more files. This example is a bit extreme, with only a single file, so the effect is actually understated: the more files you have, the bigger the benefit. I recommend you give it a try.

I mentioned native Iceberg IO earlier: we will use Iceberg's own reader and writer to replace the Hive implementation, either separating them completely or supporting both. This is something we will be working on, so stay tuned.

Ongoing Work

Another item is materialized views, which my former colleague Chunxu is working on. This is also an important feature, allowing temporary tables to store data alongside the base tables. It is not that simple, and colleagues at Facebook are working on it as well; I won't say more here, since Facebook will soon publish a new blog post about materialized views.

I will continue working on deletes for v2 tables. Currently deletion can only be done by partition; you cannot delete a specific row or a few rows. A delete operation works like an insert: it generates new files that mark which rows are deleted, and those marks are merged in when results are actually produced. This is not supported yet, so there are two ways to support it: one is to use native Iceberg IO, and the other is to mark the rows to be deleted in the Parquet reader, indicating that those rows should no longer be shown. These are some of our plans and ideas for the end of this year and the beginning of next year. Thank you all.

For more high-quality content, please check the Alluxio think tank.
