Hadoop on the Cloud: Storage-Compute Separation Architecture Design and Migration Practice

Yidian Data's original technical architecture was a big data cluster built with CDH in an on-premises data center. The company has maintained rapid growth every year since its founding, and this business growth has brought a sharp increase in data volume.

In the past few years, we expanded the hardware on a planned one-to-two-year cycle, yet often had to expand again within half a year, and each expansion took a great deal of effort.

To solve these problems, including long expansion cycles, mismatched computing and storage resources, and high operation and maintenance costs, we decided to transform the data architecture and migrate the data to the cloud, adopting a storage-compute separated design. In this case study, we walk through the whole process of architecture design, instance selection, component evaluation, and data migration involved in moving Hadoop to the cloud.

Today, based on JuiceFS, we have realized a storage-compute separated architecture: total storage capacity has doubled, performance shows no obvious change, and operation and maintenance costs have dropped significantly. First-hand operational experience with Alibaba Cloud EMR and JuiceFS is attached at the end of the article. We hope this case study provides a useful reference for peers facing similar problems.

01 Old Architecture and Challenges

To meet business needs, Yidian Data crawls data from hundreds of large websites at home and abroad, more than 500 at present, and has accumulated a large amount of raw, intermediate, and result data. As the number of sites we crawl and the client base we serve keep growing, data volume is growing rapidly, so we set out to scale the platform to meet the demand.

The original architecture used CDH to build a big data cluster in the on-premises data center. As shown in the figure below, we mainly use components such as Hive, Spark, and HDFS. Upstream of CDH there are a variety of data production systems; only Kafka is shown here because it is relevant to JuiceFS, but besides Kafka there are other sources such as TiDB, HBase, and MySQL.

In terms of data flow, upstream business systems and data collection systems write their data to Kafka, and a Kafka Connect cluster then syncs the data to HDFS.

On top of this architecture, we use OneWork, a self-developed data development platform, to develop and manage various tasks. These tasks are sent to the task queue and scheduled through Airflow.

Challenges

Business and data grow fast, while the expansion cycle is long. The company deployed the CDH cluster in the on-premises data center in 2016, and by 2021 it was storing and processing PB-level data. Since its founding the company has roughly doubled every year, and the data volume of the Hadoop cluster has grown even faster than the business. In the past few years, hardware planned for one to two years often had to be expanded again after half a year because data growth exceeded expectations. Each expansion cycle can take up to a month: besides the effort spent following administrative and technical processes, the business side also has to invest extra man-days to keep data volume under control. And purchasing hard drives and servers for expansion means a relatively long lead time.

Storage and computing are coupled, so capacity planning is difficult and easy to get wrong. In the traditional Hadoop architecture, storage and computing are tightly coupled and cannot be planned or expanded independently of each other. For example, to expand storage we have to buy a batch of new hard drives, and the servers they come with also bring computing resources that may not be needed yet, which leads to investing in capacity ahead of time.

The CDH version is old and we did not dare to upgrade it. Because the cluster was built early on, we held off upgrading for the sake of stability.

Operation and maintenance costs are high (the whole company has only one full-time operations engineer). At that time the company had more than 200 people but a single operations engineer, so the O&M workload was very heavy. We therefore wanted a more stable and simpler architecture.

The data center is a single point of risk. In the long run, keeping all data in one data center carries risk: for example, if a fiber optic cable is cut, which happens more often than one might expect, the single data center becomes a single point of failure.

02 New architecture and selection

Selection considerations

Considering these factors and challenges, we decided to make a change. Below are the main dimensions we considered for the architecture upgrade.

  • Move to the cloud for elastic scaling and flexible O&M. Using cloud services simplifies operation and maintenance. For example, for storage, although HDFS itself is a stable and mature solution, we would rather spend our time at the business level than on underlying operations work, so a cloud service may be simpler. Cloud resources also let us scale elastically without waiting out long cycles of hardware procurement and system configuration.
  • Separate storage and computing. We want to decouple storage and computing for better flexibility and performance.
  • Prefer open source components and avoid vendor lock-in. Although we chose to move to the cloud, we do not want to depend too heavily on any particular cloud service. We do use cloud-native solutions when serving customers, such as AWS Redshift, but for our own business we prefer open source components.
  • Stay compatible with existing solutions as much as possible, to control the cost and risk of changes. We want the new architecture to be compatible with existing solutions, avoiding extra development cost and impact on the business.

New Architecture: Alibaba Cloud EMR + OSS + JuiceFS

The final choice was "Alibaba Cloud EMR + JuiceFS + Alibaba Cloud OSS" to build a big data platform with storage-compute separation, and to gradually migrate the business from the off-cloud data center to the cloud.

This architecture replaces HDFS with object storage and uses JuiceFS as the protocol layer, because JuiceFS is compatible with the POSIX and HDFS protocols. On top of that sits EMR, a semi-managed Hadoop service on the cloud that includes many Hadoop ecosystem components such as Hive, Impala, Spark, and Presto/Trino.

Alibaba Cloud vs other public clouds

The first decision was which cloud vendor to use. Due to business needs, we already use AWS, Azure, and Alibaba Cloud. After comprehensive consideration, we judged Alibaba Cloud to be the most suitable, for these reasons:

  • Physical distance: Alibaba Cloud has an availability zone in the same city as our data center, so the dedicated line has low latency and low cost.
  • Complete set of open source components: Alibaba Cloud EMR includes many open source components. Besides Hive, Impala, Spark, and Hue, which we use heavily, it can also easily integrate Presto, Hudi, Iceberg, and so on. During our research we found that only Alibaba Cloud EMR ships Impala out of the box; AWS and Azure either provide older versions or require you to install and deploy it yourself.

JuiceFS vs JindoFS

Alibaba Cloud EMR also offers its own storage-compute separation solution based on JindoFS, but we ultimately chose JuiceFS for the following reasons:

  • JuiceFS uses Redis and object storage as the underlying storage. The client is completely stateless and can access the same file system from different environments, which makes the solution more flexible. JindoFS metadata is stored on the local disks of the EMR cluster, which makes it harder to maintain, upgrade, and migrate.
  • JuiceFS supports a rich set of storage backends and online migration between them, which improves the portability of the solution. JindoFS block data only supports OSS.
  • JuiceFS is backed by an open source community, supports all public cloud environments, and makes it easy to expand to a multi-cloud architecture later.

About JuiceFS

Quoting directly from the official documentation:

JuiceFS is a high-performance shared file system designed for cloud native environments, released under the Apache 2.0 open source license. It provides full POSIX compatibility, can present almost any object storage locally as a massive local disk, and can be mounted and read on different hosts across platforms and regions at the same time.

JuiceFS separates "data" and "metadata" storage, thereby realizing a distributed file system design. When storing data with JuiceFS, the data itself is persisted in object storage (for example, Amazon S3), and the corresponding metadata can be persisted on demand in various databases such as Redis, MySQL, TiKV, and SQLite.

In addition to POSIX, JuiceFS is fully compatible with the HDFS SDK; combined with object storage, it can fully replace HDFS and realize the separation of storage and computing.
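To make this split concrete, here is a minimal sketch of creating a JuiceFS volume with OSS holding the data and Redis holding the metadata, using the community edition CLI; the bucket, keys, and Redis address are hypothetical placeholders, not our production values:

# Hypothetical example: create a JuiceFS volume named "emr"
# with OSS as the data store and Redis as the metadata engine
juicefs format \
    --storage oss \
    --bucket https://myjfs-data.oss-cn-hangzhou.aliyuncs.com \
    --access-key <ACCESS_KEY> \
    --secret-key <SECRET_KEY> \
    redis://:[email protected]:6379/1 \
    emr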

PoC design for migrating Hadoop to the cloud

The purpose of the PoC was to quickly verify the feasibility of the solution, with several specific goals:

  • Verify the feasibility of the overall solution of EMR + JuiceFS + OSS
  • Check compatibility of component versions such as Hive, Impala, Spark, Ranger, etc.
  • Evaluate and compare performance, using TPC-DS test cases and some real internal business scenarios; the comparison was not very precise, but it was enough to confirm that business needs can be met
  • Evaluate the type and number of node instances required for the production environment (calculate the cost)
  • Explore Data Synchronization Solutions
  • Verify the integration of the cluster with the self-developed ETL platform, Kafka Connect, and other systems

During this period we did a lot of testing, document research, internal and external discussions (with Alibaba Cloud and the JuiceFS team), source code reading, and tool adaptation, and finally decided to proceed.

03 Implementation

We began exploring options for migrating Hadoop to the cloud in October 2021; in November we did extensive research and discussion and largely settled on the solution; in December 2021 and January 2022, before the Spring Festival, we ran the PoC tests; and in March, after the Spring Festival, we started building the production cloud environment and arranging the migration. To avoid interrupting the business, the migration is carried out in phases at a relatively slow pace. After the migration, the data volume of the EMR cluster on the cloud is expected to exceed 1 PB (single copy).

Architecture design

Once the technology selection was completed, the architecture design could be settled quickly. Since part of the business will continue to run on the Hadoop cluster in the data center, the result is in fact a hybrid cloud architecture.

The overall architecture is roughly as shown in the figure above: on the left is the on-premises data center, running the traditional CDH architecture and some Kafka clusters; on the right is the EMR cluster deployed on Alibaba Cloud. The two sides are connected by a high-speed dedicated line. At the top are Airflow and OneWork, both of which support distributed deployment and can therefore be scaled horizontally with ease.

Data Migration Challenges

Challenge 1: Hadoop 2 upgrade to Hadoop 3

Our CDH version was old and we had not dared to upgrade it, but since we were migrating anyway, we wanted the new cluster to run a newer version. During the migration you need to pay attention to the differences between HDFS 2 and HDFS 3: interface protocols and file formats may change. JuiceFS is fully compatible with both HDFS 2 and 3, which absorbs this challenge well.

Challenge 2: Upgrade from Spark 2 to Spark 3

The Spark upgrade had a relatively large impact on us because it contains many incompatible changes, which means code originally written for Spark 2 has to be modified before it can run on the new version.

Challenge 3: Hive on Spark does not support Spark 3

In the data center environment, the Hive on Spark engine that ships with CDH was used by default, but the bundled Spark version was only 1.6 at the time. On the cloud we use Spark 3, and Hive on Spark does not support Spark 3, so we could no longer use the Hive on Spark engine.

After research and testing, we switched from Hive on Spark to Hive on Tez. The change was relatively easy because Hive abstracts over the underlying execution engine, so the changes to our upper-level code were small. Hive on Tez may be slightly slower than Spark. We are also keeping an eye on Kyuubi, a computing engine open sourced by NetEase, which is compatible with Hive and offers some new features.
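For reference, the engine switch itself is just a standard Hive setting; a minimal sketch follows (whether it is set per session or in the cluster-wide hive-site.xml managed through the EMR console depends on the environment):

# Per Hive session:
SET hive.execution.engine=tez;

# Or as the cluster-wide default (hive-site.xml / EMR console):
hive.execution.engine = tez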

Challenge 4: Hive 1 is upgraded to Hive 3, and the metadata structure has changed

For the Hive upgrade, one of the biggest impacts is the change in the Metastore schema, so the metadata structures need to be converted during migration. Since Hive itself cannot handle this migration directly, we had to develop our own programs to perform the conversion.

Challenge 5: Permission management switched from Sentry to Ranger

This was a relatively small issue: we previously used Sentry for permission management, but its community is not very active and EMR does not integrate it, so we replaced it with Ranger.

In addition to technical challenges, the bigger challenge comes from the business side.

Business challenge 1: There are many businesses involved, and the delivery cannot be affected

We run multiple business lines across different sites, clients, and projects. Since business delivery cannot be interrupted, the migration has to proceed business line by business line, in a gradual manner. During the migration, data changes affect many parts of the company, such as the ETL data warehouse, data analysts, testing, and product development, so good communication and coordination, along with project plans and schedules, are essential.

Business challenge 2: Many data tables, metadata, files, and much code

In addition to data, we also have a lot of business code in the upper layers, including data warehouse code, ETL code, and application code such as BI applications that query this data.

Data migration: existing files & incremental files

The data to be migrated consists of two parts: Hive Metastore metadata and files on HDFS. Since the business cannot be interrupted, we migrate using full synchronization of the existing data plus incremental synchronization (double writing); after synchronization, consistency checks are required.

Full synchronization of existing data

For synchronizing existing files, the full-featured sync subcommand provided by JuiceFS enables efficient migration. juicefs sync supports both single-node and multi-machine concurrent synchronization. In practice we found that a single node running multiple threads can already saturate the dedicated line bandwidth, with low CPU and memory usage and very good performance. Note that sync writes cache files to the local file system during synchronization, so it is best to place that cache on an SSD for better performance.
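For illustration, a minimal sketch of this step, assuming the JuiceFS volume is mounted locally; the NameNode address, paths, and thread count are hypothetical placeholders:

# Mount the JuiceFS volume, then copy existing HDFS data into it with 40 threads
juicefs mount -d redis://:[email protected]:6379/1 /jfs
juicefs sync --threads 40 \
    hdfs://cdh-namenode:8020/user/hive/warehouse/ \
    /jfs/user/hive/warehouse/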

The data synchronization of Hive Metastore is relatively troublesome:

  • The two Hive versions differ and their Metastore table structures are different, so MySQL's export and import functions cannot be used directly
  • After migration, the database, table, and partition storage paths need to be rewritten (i.e. DB_LOCATION_URI in the DBS table and LOCATION in the SDS table)

Therefore, we developed a set of script tools that synchronize metadata at table and partition granularity, which is very convenient to use.
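As a rough illustration of the path rewrite, here is a sketch of the kind of SQL involved; it only covers the location step, the real tool also handles schema differences, and the host, database name, and URI prefixes are placeholders:

# Hypothetical sketch: rewrite storage locations in the new Hive Metastore (MySQL)
mysql -h emr-metastore-db -u hive -p hivemeta -e "
  UPDATE DBS SET DB_LOCATION_URI =
    REPLACE(DB_LOCATION_URI, 'hdfs://cdh-namenode:8020', 'jfs://emr');
  UPDATE SDS SET LOCATION =
    REPLACE(LOCATION, 'hdfs://cdh-namenode:8020', 'jfs://emr');
"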

Incremental synchronization

Incremental data mainly comes from two sources, Kafka Connect HDFS Sink and the ETL programs; for both we use a double-write mechanism.

All Kafka Connect Sink tasks can simply be duplicated; the configuration method is described in the appendix below. ETL tasks are all developed on OneWork and scheduled by Airflow underneath, so usually we only need to copy the relevant DAGs and change the cluster address. In practice this step ran into the most problems and took the most time to resolve, mainly because differences in the Spark, Impala, and Hive versions led to task failures or inconsistent data, requiring changes to business code. These issues were not covered by the PoC and the early migrations, which is a lesson learned.

Data validation

To give the business confidence in the new architecture, data verification is essential. After the data is synchronized, consistency checks are performed at three levels:

  • Files are consistent. During the full synchronization phase the usual check is a checksum. The original juicefs sync command did not support checksums; after our suggestion and discussion, the JuiceFS team quickly added the feature (issue, pull request). Besides checksums, you can also compare file attributes: make sure the number of files, modification times, and attributes in the two file systems match. This is slightly less reliable than a checksum but lighter and faster.

  • Metadata is consistent. There are two approaches: compare the data in the Metastore databases directly, or compare the output of Hive DDL commands on both sides.

  • Query results are consistent. That is, run some queries with Hive/Impala/Spark and compare the results on both sides, for example: the row count of a table or partition, results ordered by a certain field, the max/min/average of a numeric field, aggregations commonly used by the business, and so on.

The verification logic is also wrapped into scripts, making it easy to spot data problems quickly.
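As an example of the result-level check, a minimal sketch follows; the table, partition, and HiveServer2 addresses are placeholders, and the real script loops over a configurable list of tables and checks:

# Hypothetical sketch: compare a partition's row count between the old and new clusters
Q="SELECT COUNT(*) FROM dw.orders WHERE dt='2022-03-01'"
OLD=$(beeline -u jdbc:hive2://cdh-hiveserver:10000 --silent=true --showHeader=false --outputformat=tsv2 -e "$Q")
NEW=$(beeline -u jdbc:hive2://emr-hiveserver:10000 --silent=true --showHeader=false --outputformat=tsv2 -e "$Q")
[ "$OLD" = "$NEW" ] && echo "OK: $Q" || echo "MISMATCH: old=$OLD new=$NEW"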

Tiered storage

After the migrated business had been running stably, we began to look at tiered storage. Tiering is a common concern for databases and storage systems: data has hot and cold portions and storage media differ in price, so we want to keep cold data on cheaper media to control costs.

On the old HDFS we had already implemented a tiered storage strategy: we purchased two types of hard disks and kept hot data on fast disks and cold data on slow disks.

However, the block-based layout JuiceFS uses to optimize performance limits this kind of tiering. In JuiceFS, a file is logically split into chunks, slices, and blocks, and what actually lands in the object store are the blocks.

So if we look in the object store, we cannot find the file itself, only the small blocks it has been split into. Even though OSS provides lifecycle management, we cannot configure lifecycles at the table, partition, or file level.

Going forward, we plan to solve this as follows.

  • Two buckets: standard (JuiceFS) + infrequent access (OSS). Create two buckets: one used by JuiceFS, with all data in the standard storage class, plus an additional OSS bucket using the Infrequent Access storage class.

  • Configure storage policies for tables/partitions/files based on business logic. We can define policies at the table, partition, or file level, and write scheduled tasks that scan for and execute them.

  • Export infrequently accessed files from JuiceFS to OSS with juicesync and modify the Hive metadata, as sketched below. After a file is transferred from JuiceFS to OSS it is deleted from JuiceFS, and the complete file becomes visible on OSS, so lifecycle rules can be applied to it. After transferring the files, the Hive metadata must be updated promptly, changing the location of the Hive table or partition to the new OSS address. EMR components such as Hive/Impala/Spark natively support OSS, so the application layer is largely unaffected (note that reading infrequent-access files incurs extra cost).
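A minimal sketch of this step for a single cold partition; the table, paths, and bucket are hypothetical, OSS credentials and endpoints are omitted, and a real job would add ordering and validation around these commands:

# Copy the cold partition out of the mounted JuiceFS volume into the low-frequency OSS bucket
juicesync /jfs/user/hive/warehouse/dw.db/orders/dt=2021-01-01/ \
          oss://cold-bucket/warehouse/dw.db/orders/dt=2021-01-01/

# Point the Hive partition at the new OSS location
hive -e "ALTER TABLE dw.orders PARTITION (dt='2021-01-01')
         SET LOCATION 'oss://cold-bucket/warehouse/dw.db/orders/dt=2021-01-01';"

# Finally remove the partition data from JuiceFS to free space and metadata
rm -rf /jfs/user/hive/warehouse/dw.db/orders/dt=2021-01-01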

Besides the cost savings from tiered storage, this has an extra benefit: it reduces the amount of JuiceFS metadata. The moved files no longer belong to JuiceFS but are managed directly by OSS, so the number of inodes in JuiceFS drops, the pressure on metadata management drops, and the request volume and memory footprint on Redis shrink as well. From a stability point of view, this is good for the system.

04 Benefits of architecture upgrade & follow-up plan

Benefits of storage-compute separation

Total storage capacity has doubled while computing resources remain basically unchanged; temporary task nodes are only spun up occasionally. In our scenario, the data volume grows very fast while query demand is relatively stable. Data volume has tripled from 2021 to the present, but computing resources were hardly changed in the initial stage; only when a particular business need calls for faster processing do we spin up elastic resources and temporary task nodes to speed things up.

Performance changes

  • Overall, no obvious change is perceived. A simple TPC-DS test during the PoC showed little difference, and ad-hoc Impala query responses actually became faster.
  • There are many influencing factors: HDFS -> JuiceFS, component version upgrades, Hive computing engine changes, cluster load, etc.

Our workload is mainly batch processing and offline big data computing, which is generally not sensitive to latency. We ran some simple tests during the PoC, but it is hard for them to be precise, because many factors changed at once: the storage system switched from HDFS to JuiceFS, component versions were upgraded, the Hive engine changed, and cluster load cannot be kept exactly the same. Overall, in our scenario the performance of the new architecture is not significantly different from the CDH cluster previously deployed on physical servers.

Usability & Stability

  • No problems with JuiceFS itself
  • We hit some minor problems using EMR; overall, CDH felt more stable and easier to use

Implementation complexity

  • In our scenario, incremental double-writing and data verification took the most time (in hindsight we over-invested in verification, and it could be streamlined);
  • There are many influencing factors: business scenarios (offline/real-time, number of tables/tasks, upper-layer applications), component versions, supporting tools and reserves.

When evaluating the complexity of a similar architecture or migration, many factors come into play: differences in business scenarios, sensitivity to latency, and the volume of table data. In our case we have a large number of tables and databases with a relatively high file count. The characteristics of upper-layer applications, and the number of services and programs involved, also affect complexity. Another important factor is how far the component versions diverge: if you do a like-for-like migration and keep versions the same, the effect of the components can basically be eliminated.

Supporting tools and prior groundwork are another important factor. Data warehouse and ETL tasks can be implemented in many ways, such as hand-written Hive SQL files, Python or Java programs, or common scheduling tools, but whichever method is used, the programs have to be copied and modified, because double writing is required.

We use OneWork, our self-developed development platform, whose task configuration is quite complete: users configure tasks in a web interface for unified management, and Spark jobs are submitted to the YARN cluster automatically without logging in to servers. This greatly simplifies configuring and modifying code. We wrote a script to copy the task configurations, and with some manual tweaks reached roughly 80 to 90 percent automation, so these tasks could run smoothly.

There are several directions for follow-up plans:

  • Continue to complete the cloud migration of the remaining business
  • Explore the hot/cold tiered storage strategy of JuiceFS + OSS. JuiceFS files are completely broken into blocks on OSS, so storage classes cannot be applied at the file level. The current idea is to move cold data from JuiceFS to OSS, set it to the Archive storage class, and modify the LOCATION of the Hive tables or partitions without affecting usage.
  • JuiceFS currently uses Redis as its metadata engine. If data volume grows and Redis comes under pressure, we may switch to TiKV or another engine.
  • Explore the elastic computing instance of EMR, and strive to reduce the cost of use while meeting the business SLA

05 Appendix

Deployment and configuration

About the dedicated line between the IDC and Alibaba Cloud:

Many providers offer dedicated line services, including the IDC, Alibaba Cloud, and telecom operators. We mainly weighed line quality, cost, and lead time, and in the end chose the IDC's offering. The IDC already cooperates with Alibaba Cloud and provisioned the dedicated line quickly. If you run into problems here, you can get support from both the IDC and Alibaba Cloud. Besides the leased-line rental fee, Alibaba Cloud also charges for downstream traffic (from Alibaba Cloud to the IDC). The intranet IPs at the two ends of the line are fully interoperable, which requires some routing configuration on both the Alibaba Cloud and IDC sides.

Regarding the selection of EMR Core/Task node types:

JuiceFS can use local hard disks as cache, which further reduces OSS bandwidth requirements and improves EMR performance; larger local storage gives a higher cache hit ratio.

Alibaba Cloud's local SSD instance is a more cost-effective SSD storage solution (compared to cloud disks), and it is suitable for use as a cache. JuiceFS Community Edition does not support distributed caching, which means that each node needs a cache pool, so you should choose as large a node as possible.

Based on these considerations and configuration comparisons, we chose ecs.i2.16xlarge: each node has 64 vCPUs, 512 GiB of memory, and 8 x 1.8 TB local SSDs.

About the EMR version:

On the software side, the work mainly involves choosing component versions, creating the cluster, and modifying configuration. Our data center runs CDH 5.14 with Hadoop 2.6, and the closest match on Alibaba Cloud is EMR 3.38. However, during the investigation we found that this version's Impala is not compatible with Ranger (our data center actually uses Sentry for permission management, but Sentry is not available on EMR). After evaluation and comparison, we decided to go straight to the latest EMR 5, which bumps the major versions of almost all components (including Hadoop 3, Spark 3, and Impala 3.4). We also use an external MySQL database for Hive Metastore, Hue, and Ranger.

About the JuiceFS configuration:

We basically followed the JuiceFS official document "Accessing JuiceFS through a Java client in Hadoop" to complete the configuration. In addition, we set the following parameters:

  • Cache related: the most important is juicefs.cache-dir, the cache directory. This parameter supports wildcards, which is very friendly for instances with multiple disks; setting it to /mnt/disk*/juicefs-cache (the directories need to be created manually, or in the EMR node bootstrap script) uses all the local SSDs as cache. Also pay attention to the juicefs.cache-size and juicefs.free-space parameters.
  • juicefs.push-gateway: Set up a Prometheus Push Gateway to collect metrics from the JuiceFS Java client.
  • juicefs.users, juicefs.groups: set each to a file stored in JuiceFS (e.g. jfs://emr/etc/users and jfs://emr/etc/groups) to solve the problem of UIDs and GIDs not being consistent across nodes.
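Put together, a sketch of how these properties might look; the values are illustrative placeholders (cache size is in MiB, free-space is the minimum free ratio to keep on the cache disks, and the Push Gateway address is hypothetical):

juicefs.cache-dir = /mnt/disk*/juicefs-cache
juicefs.cache-size = 102400
juicefs.free-space = 0.1
juicefs.push-gateway = 10.0.0.8:9091
juicefs.users = jfs://emr/etc/users
juicefs.groups = jfs://emr/etc/groups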

About Kafka Connect using JuiceFS:

After some testing, we confirmed that JuiceFS works perfectly with the Kafka Connect HDFS Sink plugin (we also contributed the configuration method to the official documentation). Compared with writing to HDFS, writing to JuiceFS requires adding or modifying the following configuration items:

  • Distribute the JuiceFS Java SDK JAR to the HDFS Sink plugin directory on each Kafka Connect node. For the Confluent Platform the plugin path is: /usr/share/java/confluentinc-kafka-connect-hdfs/lib

  • Write a core-site.xml containing the JuiceFS configuration into an arbitrary directory and distribute it to each Kafka Connect node. It must include these configuration items:

fs.jfs.impl = io.juicefs.JuiceFileSystem
fs.AbstractFileSystem.jfs.impl = io.juicefs.JuiceFS
juicefs.meta = redis://:[email protected]:6379/1

See the configuration documentation for the JuiceFS Java SDK.

Kafka Connector task settings:

hadoop.conf.dir=<directory containing core-site.xml>
store.url=jfs://<JuiceFS volume name>/<path>
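For context, a sketch of what a full HDFS Sink connector configuration might look like with these two settings in place; the connector name, topic, directory, and flush size are hypothetical, and the remaining properties follow the Confluent HDFS Sink connector's usual options:

name=jfs-sink-example
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=example-topic
flush.size=10000
hadoop.conf.dir=/etc/juicefs-conf
store.url=jfs://emr/kafka-connect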

First-hand operation and maintenance experience

Throughout the implementation we hit a number of pitfalls and accumulated some experience, which we share here for reference.

Alibaba Cloud EMR and related components

Compatibility

  • The Hive and Spark versions of EMR 5 are not compatible, and Hive on Spark cannot be used. You can change the default engine to Hive on Tez.
  • After Impala stats are synchronized from the old version to the new one, some tables cannot be queried due to IMPALA-10230. The solution is to change num_nulls=-1 to num_nulls=0 while synchronizing the metadata; the CatalogObjects.thrift file may need to be consulted. (See the sketch after this list.)
  • The original cluster had a small number of TextFile-format files compressed with Snappy that the new version of Impala could not read, reporting the error Snappy: RawUncompress failed, possibly caused by IMPALA-10005. The workaround is to avoid Snappy compression for TextFile files.
  • Compared with Impala 2.11, the behavior of CONCAT_WS changed in Impala 3.4: the old version returns NULL for CONCAT_WS('_', 'abc', NULL), while the new version returns 'abc'.
  • Impala 3.4 is stricter about references to reserved keywords in SQL; they must be quoted with backticks. A good habit is simply not to use reserved keywords in business code.
  • Make the coverage of the PoC or pre-migration tests as complete as possible, and run real business code. We used relatively few component features in the PoC and the first batch of migrated business, mostly the most common and compatible ones, so things went smoothly; however, many problems surfaced during the second batch of migrations. They were all resolved in the end, but the extra diagnosis and troubleshooting took a lot of time and disrupted the rhythm.
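Regarding the stats issue above, here is a sketch of the kind of fix the metadata sync can apply. It assumes the stats live in the standard Hive Metastore column-statistics tables; the host and database name are placeholders, and the actual schema should be verified before running anything like this:

# Hypothetical sketch: reset invalid null counts in the new Hive Metastore (MySQL)
mysql -h emr-metastore-db -u hive -p hivemeta -e "
  UPDATE TAB_COL_STATS  SET NUM_NULLS = 0 WHERE NUM_NULLS = -1;
  UPDATE PART_COL_STATS SET NUM_NULLS = 0 WHERE NUM_NULLS = -1;
"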

Performance

  • EMR 5's Impala 3.4 carries the IMPALA-10695 patch, which supports setting a separate number of I/O threads for the oss:// and jfs:// schemes (the original intent was to support JindoFS, but JuiceFS also uses the jfs scheme by default). Add or modify the Impala configuration item num_oss_io_threads on the EMR console.
  • Alibaba Cloud OSS has an account-level bandwidth limit, 10 Gbps by default, which can easily become a bottleneck as the business grows; it can be raised by contacting Alibaba Cloud.

Operation and maintenance

  • EMR can be associated with a Gateway cluster, which is usually used to deploy business programs. If you want to submit Spark jobs in client mode from the Gateway, you first need to add the Gateway machine's IP to the hosts file of the EMR nodes. Cluster mode works by default.
  • EMR 5 starts a Spark ThriftServer, so you can write Spark SQL directly in Hue, which is very convenient. However, there is a pitfall in the default configuration: it writes a huge volume of logs (the path is roughly /mnt/disk1/log/spark/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-emr-header-1.cluster-xxxxxx.out), eventually filling up the disk. There are two solutions: configure logrotate, or clear the spark.driver.extraJavaOptions configuration (recommended by Alibaba Cloud technical support). A logrotate sketch follows this list.
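If you take the logrotate route, a minimal sketch follows; the path glob and retention are illustrative and should be adjusted to the actual EMR log layout:

# /etc/logrotate.d/spark-thriftserver (illustrative)
/mnt/disk1/log/spark/*HiveThriftServer2*.out {
    daily
    rotate 7
    compress
    missingok
    copytruncate
}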

JuiceFS related

  • JuiceFS requires consistent UIDs and GIDs on every node, otherwise permission problems appear easily. There are two approaches: modify the operating system users (better suited to new machines without historical baggage), or maintain a user mapping table in JuiceFS. We previously shared an article on troubleshooting JuiceFS + HDFS permission problems that discusses this in detail. Users that typically need mapping include impala, hive, hadoop, etc.; if you build Kafka Connect with the Confluent Platform, the cp-kafka-connect user also needs to be configured.
  • With the default JuiceFS I/O configuration, Hive on Tez and Spark were much faster than Impala for the same write query (whereas Impala was faster in the data center). We eventually found that raising juicefs.memory-size from the default 300 (MiB) to 1024 doubled Impala's write performance.
  • When diagnosing and analyzing JuiceFS problems, the client logs are very useful; note that the POSIX client and the Java SDK write logs to different places. For details, see JuiceFS Troubleshooting and Analysis | JuiceFS Document Center.
  • Pay attention to monitoring Redis space usage: if Redis becomes full, the entire JuiceFS cluster cannot be written to. This point deserves special attention.
  • When using juicefs sync to copy data from the data center to the cloud, run it on a machine with SSDs for better performance.

If this article helps you, please follow our project  Juicedata/JuiceFS ! (0ᴗ0✿)
