"Smart lake warehouse" integrated solution!


As the new round of technological revolution and industrial transformation deepens, the digital economy is becoming a key force in reorganizing global factor resources, reshaping the global economic structure, and changing the global competitive landscape.

In this process, the cloud serves as the digital foundation: no longer just infrastructure, it is a key support for continuous innovation and lean operation in the enterprise.

Whether an enterprise can extract more value from the cloud will determine whether it can secure a place in the digital age and continue to lead.

Innovation in cloud-native infrastructure has triggered a series of "butterfly effects".

As cloud-native technology has blossomed in the Internet, finance, and other industries, cloud-native infrastructure not only enriches the practice of diverse digital application scenarios, but also provides sustainable, endogenous momentum for enterprise development. The "smart lake warehouse" is one of its most important technical architectures.

As the proposer of the "smart lake warehouse" architecture, Amazon Cloud Technology continues to iterate and innovate on cloud-native data infrastructure.

On March 14, 2023 (Pi Day), Amazon's cloud-native data lake foundation, S3, celebrated its 17th birthday, and Amazon Cloud Technology took the occasion to review S3's development and how it continues to unlock greater value from data.

 

01 From "data warehouse" to "data lake"

According to IDC, global data volume is expected to grow roughly tenfold, from 16.1 ZB in 2016 to 163 ZB by 2025.

The sheer volume and diversity of data make it increasingly difficult to extract useful value from it; and if no benefit can be drawn from data, its value is moot.

Today, data delivers value along two paths. One is timely discovery and real-time analysis, which quickly drive the business forward. The other is long-term storage: data is accumulated so that the patterns hidden within it can be mined and analyzed holistically to inform business decisions.

These new sources of data value bring enterprises more intelligent and innovative applications, such as growth hacking, recommendation systems, user behavior analysis, and the many new models enabled by AIoT, which in turn demand changes in IT infrastructure.

Traditional data processing is like a gentle stream: data trickles in from business systems such as ERP and CRM, and users design a "river channel" with databases at the bottom.

The data is then organized into a data warehouse in the middle layer and presented through business intelligence (BI) tools.

In the digital age, however, video, mobile, and other data pour in like a torrent, forming massive datasets that users have no time to organize before they need to use them.

A new idea then broadened people's horizons: suppose there were a basin into which all data could first be poured, to be queried and processed later with effective tools. That is the data lake.

According to a recent report from the international research firm MarketsandMarkets, the global data lake market will reach 20.1 billion US dollars by 2024, a compound annual growth rate of 20.6%.

With the surge in data governance and application requirements, it is now beyond dispute that the data lake has become an important approach to data management.

The emergence of data lakes solved a series of problems in data warehouse construction and simplified data management into two stages: data ingestion into the lake, and data analysis.

Data lakes are generally built on maintenance-free, highly reliable object storage, which can hold data of every type.
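To make the two-stage idea concrete, here is a minimal boto3 sketch of the ingestion stage, with a hypothetical bucket (`example-lake`) standing in for the lake; structured, semi-structured, and unstructured data all land in the same object store, to be organized and queried later:

```python
import boto3

# Minimal sketch of "store everything first": raw data of any shape lands
# in one S3 bucket acting as the lake. Bucket and keys are hypothetical.
s3 = boto3.client("s3")

# Structured export from a business system such as ERP or CRM (CSV).
s3.upload_file("orders_2023-03-14.csv", "example-lake",
               "raw/erp/orders_2023-03-14.csv")

# Semi-structured clickstream events (JSON lines).
s3.upload_file("events.jsonl", "example-lake",
               "raw/clickstream/events.jsonl")

# Unstructured video from edge devices.
s3.upload_file("dock_cam_0142.mp4", "example-lake",
               "raw/video/dock_cam_0142.mp4")
```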

For users, modern data lake solutions not only eliminate the data silos of the past, but also remain compatible with traditional data warehouses and analysis methods.

Most importantly, they suit modern application patterns, such as pairing the lake with machine learning for predictive analytics.

02 "Smart lake warehouse" has become a new trend of technology

With the rise of the data lake concept, the industry has been comparing, and even debating, data warehouses versus data lakes.

Some call the data lake the next-generation big data platform; major cloud vendors have launched their own data lake solutions, and some cloud data warehouse products have added features that link to the data lake.

In our view, however, data lakes and data warehouses are not substitutes but complements. On that basis, the "smart lake warehouse" can fully realize a virtuous interaction between the two, and it will be one of the important data technology trends of the future.

By introducing the governance capabilities of the data warehouse, the "smart lake warehouse" not only addresses the data lake construction problems described above, but also lays the groundwork for better mining the value of the data in the lake, combining the efficiency of the warehouse with the flexibility of the lake.

Unlike traditional data warehouse management, the "smart lake warehouse" greatly improves the efficiency of data development and lowers the difficulty of data management.

In the past, processing data required senior data architects to plan the warehouse, from its layering and metric definitions to the design of data mart models; the plan was then handed to professional data engineers for development and finally verified by business staff. The process was disciplined but cumbersome. With the "smart lake warehouse", enterprises can develop quickly around the business and flexibly adjust and plan their own data management.

The "smart lake warehouse" also makes collaboration among the various roles in a data organization smoother: under its data management philosophy, different data roles can cooperate better and grow together.

For example, data scientists can easily integrate their own data and manage it through standard warehouse processes, while business analysts can implement their own data requirements.

03 "Smart lake warehouse" breaks the isolated island and outlines the future of data value

Any discussion of the "smart lake warehouse" must mention its most important foundation: Amazon S3 (Simple Storage Service).

Seventeen years ago, Amazon Cloud Technology launched Amazon S3, the service that first defined object storage; S3 has since become the de facto standard, an achievement of epoch-making significance. By 2015, Amazon S3 stored trillions of objects, handled peaks averaging 1.5 million requests per second, and was designed for 99.999999999% ("11 nines") data durability.

In 2022 came another milestone: Amazon S3 now holds more than 200 trillion objects and handles tens of millions of requests per second.

Amazon Cloud Technology has also released auto-copy from Amazon S3 for Amazon Redshift, connecting the data lake and the data warehouse at the storage level.
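As a rough sketch of what this looks like in practice (assuming the `COPY ... JOB CREATE` syntax introduced with the auto-copy preview; the cluster, table, role ARN, and bucket below are hypothetical placeholders, not an official sample), a job can be registered through the Redshift Data API so that new files landing under an S3 prefix are loaded automatically:

```python
import boto3

# Hedged sketch: register an auto-copy job via the Redshift Data API so that
# new objects under the S3 prefix load into the warehouse table automatically.
# All identifiers (cluster, database, table, role, bucket) are placeholders.
client = boto3.client("redshift-data", region_name="us-east-1")

copy_job_sql = """
COPY sales
FROM 's3://example-lake/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS CSV
JOB CREATE sales_autocopy_job
AUTO ON;
"""

resp = client.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_job_sql,
)
print(resp["Id"])  # statement id; poll describe_statement(Id=...) for status
```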

Today, tens of thousands of customers worldwide use the Amazon Redshift analytics database, across the gaming, finance, healthcare, consumer, and Internet industries.

Over more than a decade of development, Redshift has iterated continuously, with many of its features born from real enterprise business needs.

Specifically, customer data warehouse scenarios fall into four main blocks:

First, routine business operations and BI analysis; second, real-time data warehouse analysis; third, queries, reporting, and data analysis; fourth, machine learning and predictive analytics.

For enterprises that want to build data pipelines quickly, Amazon Redshift is a key piece of the underlying infrastructure.

Through Amazon Redshift's seamless integration with other data analysis services, users get a more complete analytics experience.

For example, it can store data in high-performance formats, scale storage cost-effectively, separate storage from compute, and let users choose freely among analytics and machine learning engines.

As early as 2017, Redshift achieved lake and warehouse integration: Redshift Spectrum can query data in open formats directly on S3, and can also write data back into the lake, enabling seamless data flow between the data warehouse and the data lake.
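A minimal sketch of the Spectrum pattern, assuming a Glue Data Catalog database and an IAM role with the required permissions (all identifiers are hypothetical): an external schema maps lake data into Redshift so that warehouse tables and S3 files can be joined in a single query.

```python
import boto3

# Sketch: expose S3 data to Redshift through an external schema, then join
# it with a local warehouse table. All names are hypothetical placeholders.
client = boto3.client("redshift-data")

create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG
DATABASE 'example_lakedb'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';
"""

query_sql = """
SELECT d.region, SUM(f.amount) AS total
FROM lake.clickstream f      -- external table backed by files on S3
JOIN dim_region d ON d.id = f.region_id
GROUP BY d.region;
"""

for sql in (create_schema_sql, query_sql):
    client.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```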

2022 marked the 10th anniversary of Redshift's launch. In this special year, Amazon Cloud Technology, uncharacteristically, did not announce a major upgrade at its annual conference.

Instead, it introduced a series of new Redshift capabilities, from tighter data integration and streaming data analysis to enhanced access security, striving to make Redshift an enterprise data hub: one that serves modern applications of every kind, collects and organizes all types of data, and feeds AI analysis and downstream applications, turning this new generation of warehouse architecture that welcomes all data types into a key product for accelerating enterprise data modernization.

Overall, Amazon S3, as a cornerstone of Amazon Cloud Technology, continues to nourish its technical innovation. The "smart lake warehouse" builds the data lake on Amazon S3 as the central repository and surrounds it with a purpose-built "ring" of data services: data warehousing, machine learning, big data processing, log analytics, and more. Tools such as Amazon Lake Formation, Amazon Glue, Amazon Athena, and Redshift Spectrum then handle lake construction, data movement, and management.
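For instance, the "query in place" leg of this ring can be sketched with Athena running SQL directly over files in the S3 lake (the database, table, and locations below are hypothetical; Athena writes its results back to S3):

```python
import boto3

# Sketch: run SQL over lake files in place with Athena. The database,
# table, and S3 locations are hypothetical placeholders.
athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "example_lakedb"},
    ResultConfiguration={"OutputLocation": "s3://example-lake/athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution() for completion
```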

The "smart lake warehouse" architecture can be regarded as a "hub", which seamlessly integrates data services of Amazon cloud technology, opens up data movement and access between data lakes and data warehouses, and further realizes data in data lakes, data warehouses, And on-demand movement among various specially constructed services such as data query, data analysis, and machine learning, so as to form a unified and continuous whole to meet the different needs of customers in various actual business scenarios.

Enterprises at any stage can quickly benefit from this agile architecture, easily breaking down data and skills silos and gaining analytical agility in an iterative, incremental way, which shortens the innovation cycle of extracting value from data.

The architecture takes full advantage of the security, reliability, high performance, and near-unlimited scalability of cloud services. It helps enterprises eliminate data silos, create a unified data foundation, connect the complete path from data acquisition to data application, and deeply integrate data with intelligence in the cloud, so that data's value is fully realized.

Today, Amazon Cloud Technology has helped 1.5 million customers become data-driven enterprises.

Take supply chain digitalization as an example: SF Express built its data lake on Amazon S3, Amazon Cloud Technology's massively scalable object storage service, and aggregates into it the data collected by large numbers of front-end sensing devices across its logistics parks, including cameras, IoT devices, geomagnetic sensors, and multi-mode sensors.

The near-unlimited storage capacity of Amazon S3 provides a solid data foundation for these data-driven operations.

Using Amazon Cloud Technology services for compute, storage, data analytics, containers, machine learning, and security, SF Express improved its park operations and efficiency: daily vehicle throughput in the parks rose 40%-60%, employee efficiency rose 30%, and the workload of dispatchers and security inspectors fell 50%.

Another Amazon Cloud Technology customer, Nasdaq, also powers its data management with Amazon S3.

As automated trading platforms flooded the market, trading speed and volume kept growing. In 2014, to scale up, improve performance, and cut operating costs, Nasdaq migrated from its legacy on-premises data warehouse to one powered by an Amazon Redshift cluster. Over time, growing trade volumes drove a massive increase in data, and Nasdaq began planning a new architecture to keep meeting the performance standards and operational excellence its ecosystem expects.

In 2018, Nasdaq chose to build its new data lake on Amazon S3, allowing the company to separate compute from storage and scale each independently. By integrating IAM policies with Amazon S3, Nasdaq also gained comprehensive access control across multiple Amazon Cloud Technology accounts. In addition, Nasdaq stores critical financial data in Amazon S3 and moves it to Amazon S3 Glacier for lower-cost archiving.
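That archiving step can be expressed as an S3 lifecycle rule. The sketch below is illustrative only (the bucket, prefix, and 90-day threshold are hypothetical, not Nasdaq's actual policy):

```python
import boto3

# Sketch: a lifecycle rule that moves aging objects to the Glacier storage
# class for cheaper long-term retention. All values are hypothetical.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-financial-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-records",
                "Filter": {"Prefix": "records/"},
                "Status": "Enabled",
                # After 90 days, transition objects to Glacier.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```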

In January 2019, Nasdaq took part in an Amazon Cloud Technology Data Lab. During the four-day engagement, Nasdaq redesigned how it delivers analytics, with Amazon Redshift as the compute layer, and adopted Amazon Redshift Spectrum, the feature that enables a lake-warehouse architecture by directly querying data stored in the warehouse and in the Amazon S3 data lake.

This minimized time to insight and empowered its economic research teams to analyze data and run complex queries on it. What began as a performance-focused solution has become a multipurpose data lake shared across teams.

With the new smart lake warehouse architecture built on Amazon S3 and Amazon Redshift, the number of records Nasdaq can process per day jumped easily from 30 billion to 70 billion, and data loading now reaches 90% completion five hours earlier than before. Additionally, by optimizing its data warehouse, Nasdaq runs Amazon Redshift queries 32% faster.

On the strength of this experience, in 2022 Nasdaq migrated the core trading system of Nasdaq MRX, one of its six US options markets, to Amazon Cloud Technology, an important milestone in Nasdaq's journey to build next-generation technology infrastructure for global capital markets.

From data infrastructure to unified analytics to business innovation, and from connecting lakes and warehouses to cross-database and cross-domain sharing, the enterprise practice of Amazon Cloud Technology's "smart lake warehouse" architecture charts a path for building a modern data platform. Together with Amazon S3, Amazon Redshift, and more technologies and products, it will further modernize the underlying data architecture and bring greater value to enterprises and the industry at large.

 
