Amazon Cloud Technology enables DHgate Group's move to the cloud: the value of the new cloud architecture

 

Founded in 2004, DHgate.com is a leading B2B cross-border e-commerce trading platform. DHgate has established competitive advantages along four dimensions: brand, technology, operations, and users. As cross-border e-commerce matures, the business scope keeps expanding, product categories and channels multiply, and new technologies such as AIGC are widely applied to improve operational efficiency. Deeply mining, understanding, and using the big data accumulated over nearly 20 years brings DHgate new challenges in cost, computing power, efficiency, and security.

In the past, the traditional IDC big data clusters suffered from increasingly serious problems: high maintenance costs, no elastic scaling, tightly coupled compute and storage, and long expansion cycles whenever compute capacity hit a bottleneck. They could not keep pace with rapid business growth.

 

Goals expected from the cloud migration

● Intelligent lake house architecture

Build an intelligent lake house architecture that seamlessly connects the entire process of data collection, transmission, storage, analysis, and application; centralize data storage and management; and improve data transfer efficiency, data quality, reliability, and security. Perform deep mining, intelligent tiering, and hot/cold analysis on the data to increase its value and utilization.

● Refined operating cost control

Establish a refined operation and cost-control system for cloud resources to improve resource utilization and reduce costs. Scale resources elastically with the business to improve flexibility and response speed. Use cloud-native intelligent tiering and automated management and O&M capabilities to improve O&M efficiency and quality.

● One-stop data platform base

Build a one-stop big data platform that integrates data integration, data development, data asset management, and data services; deliver a "fast, accurate, complete, and stable" data warehouse system; and achieve data-driven decision-making and algorithm-driven business growth. The platform provides data visualization and report analysis tools that help business staff better understand and use data, improving the accuracy and efficiency of business decisions.

 

Data architecture and technical solutions

Technical components and architecture of DHgate's big data platform (IDC)


 

The IDC big data environment is built on CDH, open-source big data ecosystem components, and commercial and self-developed tools.

Data sources: hundreds of MySQL, Oracle, and NoSQL database instances; tens of thousands of source tables (sharded across databases and tables); dozens of terabytes of data.

Data buffer: billions of database incremental records and user behavior log events are sent to the Kafka cluster in real time every day. Kafka ensures high data availability and meets the needs of both offline and real-time large-scale data analysis and processing.
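As a rough illustration of this buffering layer, the sketch below builds a CDC-style change event and serializes it into the byte payload a Kafka producer would send. The field names, topic, and broker address are assumptions for illustration, not DHgate's actual schema; the kafka-python producer call is shown in comments since it needs a live broker.

```python
import json
import time

def make_change_event(database, table, op, row):
    """Build a CDC-style change event as it might be published to Kafka.
    The field names here are illustrative, not DHgate's actual schema."""
    return {
        "db": database,
        "table": table,
        "op": op,                       # "insert" | "update" | "delete"
        "ts_ms": int(time.time() * 1000),
        "data": row,
    }

event = make_change_event("order_db_03", "orders_0127", "insert",
                          {"order_id": 981234, "buyer_id": 4521, "amount": "39.90"})
payload = json.dumps(event).encode("utf-8")   # Kafka message value as bytes

# With kafka-python, the payload would be published like this (broker and
# topic names are placeholders):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# producer.send("binlog.order_db",
#               key=str(event["data"]["order_id"]).encode(),
#               value=payload)
```

Keying messages by order id keeps all changes to one order in a single partition, preserving their order for downstream consumers.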

Offline and real-time computing clusters: the big data clusters are built on CDH 6.x. Cloudera Manager makes it easy to manage and deploy the Hadoop clusters and provides visual monitoring and fault diagnosis, delivering stable and reliable offline and real-time computing engine services.

OLAP engines: Elasticsearch, ClickHouse, and StarRocks query engines are deployed according to the requirements of different application scenarios, providing online query services for buyers, sellers, and business operations.

Business applications: commonly used reporting and visualization tools include Hue, Tableau, and BO, along with the self-developed EOS system and connected service interfaces.

Data security: Kerberos + Sentry + LDAP are integrated to provide unified user authentication and authorization, ensuring data security. Kerberos provides the underlying authentication protocol, Sentry provides fine-grained authorization control, and LDAP manages user and group information. Combined, these technologies greatly improve the security and management efficiency of the big data clusters.

Data development platform: DHgate's data development platform combines open-source and self-developed technologies. Task scheduling is implemented with DolphinScheduler, and data integration is built on DataX with visual configuration added on top. DHgate has also carried out targeted development around data lineage, metadata, and life-cycle management.
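Since data integration is built on DataX, each sync task is described by a JSON job config pairing a reader plugin with a writer plugin; a visual configuration layer like the one described above would generate JSON of this shape. The sketch below assembles a minimal MySQL-to-HDFS example in Python. All hosts, credentials, paths, and columns are placeholders.

```python
import json

# Minimal DataX job config: mysqlreader -> hdfswriter.
# All connection details below are placeholders for illustration.
datax_job = {
    "job": {
        "setting": {"speed": {"channel": 4}},  # 4 parallel transfer channels
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "etl_user",
                    "password": "******",
                    "column": ["order_id", "buyer_id", "amount", "created_at"],
                    "connection": [{
                        "jdbcUrl": ["jdbc:mysql://mysql-host:3306/order_db"],
                        "table": ["orders"],
                    }],
                },
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "defaultFS": "hdfs://nameservice1",
                    "fileType": "orc",
                    "path": "/warehouse/ods/orders/dt=2023-06-01",
                    "fileName": "orders",
                    "writeMode": "append",
                    "fieldDelimiter": "\t",
                    "column": [
                        {"name": "order_id", "type": "BIGINT"},
                        {"name": "buyer_id", "type": "BIGINT"},
                        {"name": "amount", "type": "STRING"},
                        {"name": "created_at", "type": "STRING"},
                    ],
                },
            },
        }],
    }
}

job_json = json.dumps(datax_job, indent=2)  # written to a .json file for datax.py
```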

 

The value that the new architecture on the cloud can bring

● Elastic scaling: based on Amazon EMR's storage-compute separated architecture, the compute layer can flexibly schedule different amounts of computing power per data analysis task, with minute-level scaling of compute instances. This eliminates the long lead time of IDC resources from procurement through deployment to go-live, and the resource waste caused by provisioning compute capacity in advance.

● Performance improvement: the Spark runtime on Amazon EMR is about 1.7 to 2 times faster than open-source Spark, so jobs finish sooner with the same resources. The Presto runtime has also been optimized and is about 2.7 times faster than the open-source version. Interactive query and analysis through the connected OLAP engines will benefit as well.

● Cost savings: Amazon EMR can flexibly scale clusters as computing requirements change, adding instances at workload peaks and removing them afterward. Amazon EMR also supports running multiple instance groups: On-Demand Instances in one group can guarantee processing power, while Spot Instances in another group speed up task completion and reduce cost by taking advantage of Spot pricing. Applying S3 Intelligent-Tiering to automatically manage the data life cycle greatly reduces storage costs compared with IDC, without affecting data read and write performance.

● Development efficiency: Amazon EMR is a fully managed cloud data platform that supports both long-running and transient cluster modes, fitting routine daily offline tasks as well as temporary analysis and ad-hoc tasks. Its ability to spin up clusters quickly makes it easy to integrate with the existing big data platform, eliminating the daily maintenance workload of traditional self-built clusters and letting the big data team spend more time on technology exploration.

● Platform-based data foundation: applying Amazon Cloud Technology's intelligent lake house architecture provides a unified, shareable data foundation that avoids data movement between traditional data lakes and data warehouses, and consolidates raw data, processed and cleaned data, and models. It serves high-concurrency, precise, high-performance queries over historical and real-time data for the business, and also carries analytical workloads such as reports, batch processing, and data mining.
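The minute-level elastic scaling described above maps naturally onto EMR managed scaling, which grows and shrinks a cluster within declared bounds. The sketch below builds such a policy as a plain dict; the capacity numbers and cluster id are assumptions, and the boto3 call (which needs AWS credentials) is shown in comments.

```python
# EMR managed scaling: EMR adds and removes instances within these bounds
# based on cluster load. The numbers below are illustrative.
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "Instances",
        "MinimumCapacityUnits": 3,           # baseline nodes kept for routine jobs
        "MaximumCapacityUnits": 50,          # cap growth during nightly batch peaks
        "MaximumOnDemandCapacityUnits": 10,  # beyond this, capacity comes from Spot
        "MaximumCoreCapacityUnits": 10,      # limit HDFS-bearing core nodes
    }
}

# Applying it with boto3 (cluster id and region are placeholders):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.put_managed_scaling_policy(ClusterId="j-XXXXXXXXXXXX",
#                                ManagedScalingPolicy=managed_scaling_policy)
```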
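The S3 Intelligent-Tiering point above can be expressed as a bucket lifecycle rule that moves objects into the Intelligent-Tiering storage class, after which S3 shifts them between access tiers based on observed access patterns. A minimal sketch, assuming a hypothetical bucket named dhgate-datalake and a warehouse/ prefix:

```python
# Lifecycle rule: transition warehouse data to S3 Intelligent-Tiering
# immediately on creation (Days: 0), so S3 handles hot/cold placement.
lifecycle_config = {
    "Rules": [{
        "ID": "warehouse-to-intelligent-tiering",
        "Filter": {"Prefix": "warehouse/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
    }]
}

# Applying it with boto3 (bucket name is a placeholder):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="dhgate-datalake",
#     LifecycleConfiguration=lifecycle_config)
```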
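The transient-cluster mode mentioned under development efficiency corresponds to an EMR run_job_flow request that terminates itself once its steps finish. The sketch below assembles such a request; the release label, instance types, IAM role names, and the S3 script path are assumptions for illustration.

```python
# A transient EMR cluster: spin up, run one Spark step, terminate.
# KeepJobFlowAliveWhenNoSteps=False is what makes the cluster transient.
transient_cluster = {
    "Name": "adhoc-report",
    "ReleaseLabel": "emr-6.10.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge",
             "InstanceCount": 4, "Market": "SPOT"},  # Spot to cut cost
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the steps finish
    },
    "Steps": [{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/daily_report.py"],
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Launching it with boto3:
# import boto3
# boto3.client("emr").run_job_flow(**transient_cluster)
```

A long-running cluster for daily routine jobs would instead set KeepJobFlowAliveWhenNoSteps to True and receive steps as they are scheduled.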


Origin blog.csdn.net/m0_71839360/article/details/130987319