Cloud native + big data full-stack solution!

We often say that today's society has entered the era of big data. This phrasing is easy to misread as implying that data used to be far removed from everyday life, or that earlier eras had only small data, or even no data at all.

In fact, ever since humans existed, data has carried our records and expressions of everything in nature, production, and daily life. From ancient knotted-cord records to today's smart dashboards, from traditional data to network data, from small data to big data, the only things that have changed are the carriers of data and the methods and technologies for using it. The codes of the world have always been there, unchanged since ancient times: everything can generate data, and everything can be digitized.

Nowadays, people often compare data to oil or gold mines. In my view, this merely explains the value of data from an economic perspective; its value goes far beyond that. Data contains not only the mysteries of the universe and the starry sky, but also every facet of human society. Whoever controls the data controls the future!

Activating the potential of data and releasing its value has become an important consensus across all walks of life. As a result, more and more organizations now collect, store, manage, and use data as a vital resource.

01 The development history of big data technology

As people pay ever more attention to data, the technologies for storing and processing big data have also developed rapidly. Note that "data" here refers to data that has been digitized; oracle bone inscriptions on tortoise shells, cuneiform on clay tablets, and modern paper documents do not count. By the author's observation, data storage and processing technology has gone through four main stages of development:

1. Traditional SQL database

Traditional SQL databases are also called SMP-architecture databases (SMP stands for Symmetric Multi-Processing). The core principle is that multiple processors share a unified memory and disk, and the typical deployment is a single machine. Familiar products such as Oracle, MySQL, SQL Server, and DB2 all belong to the SMP architecture. This architecture has dominated data storage for forty years and remains a mainstay in the field of "small data" management.

2. MPP data architecture

With the advent of the DT (data technology) era, data across society has exploded. Enterprises routinely need to process terabytes of data, and shared-resource storage architectures such as SMP struggle more and more with massive data processing. As a result, a large-scale, distributed data storage architecture emerged: MPP (Massively Parallel Processing). An MPP system distributes a query across different nodes for parallel execution, significantly improving query performance, and offers an excellent solution for data warehouses and data analytics platforms. Representative MPP products include Redshift, Teradata, Greenplum, and Vertica.
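
The scatter-gather idea behind MPP can be sketched in a few lines of Python: a coordinator splits a query across data partitions (the "nodes"), each node computes a partial result in parallel, and the coordinator merges the partials. This is a toy illustration of the principle only, not how any particular MPP product is implemented:

```python
from concurrent.futures import ThreadPoolExecutor

def node_partial_sum(partition):
    # Each "node" aggregates only its own data partition.
    return sum(partition)

def mpp_sum(partitions):
    """Scatter the query to all nodes in parallel, then gather and merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(node_partial_sum, partitions)
        return sum(partials)
```

A real MPP engine applies the same pattern to joins and group-bys, with a query planner deciding how to split the work across nodes.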

3. Hadoop data architecture

The surge in data volume has driven the transformation of data architectures. Open-source products represented by Hadoop and Spark have had a huge impact on traditional SQL databases. Hadoop's distinguishing feature is that it can not only store and process structured data, but also collect, store, manage, and use semi-structured and unstructured data. Hadoop is not a single product but a vast software ecosystem; deploying it usually requires proficiency with a whole series of tools, including HDFS, YARN, Spark, Impala, Hive, Flume, ZooKeeper, Kafka, and more.

4. Cloud Native Data Architecture

The emergence and development of cloud computing has moved enterprise IT infrastructure and applications onto the cloud, and databases with cloud-native architectures have appeared in the industry. Their core idea is to fully exploit the elastic resources of cloud infrastructure: by separating computing from storage, resource allocation becomes more efficient, and both compute and storage can scale elastically and be allocated on demand, giving customers a very high ROI. Take Amazon Cloud Technology's cloud-native data warehouse Redshift as an example: Redshift adopts a cloud-native architecture that separates storage from computing and delivers highly concurrent computing capability. It is also worth mentioning that Redshift supports machine learning: users can create machine learning models directly in SQL, which makes data analysis and mining much easier.

02 Major Challenges Facing Big Data

There is no doubt that "big data contains great value". Yet for enterprises, even though big data has developed considerably and a certain amount of technology and business experience has accumulated, many problems still await solutions. Chief among them are three challenges: operations and maintenance, cost, and security.

1. Operation and maintenance challenges of big data

The development of cloud computing and big data technology, and especially the adoption of the open-source Hadoop ecosystem, has made data operations and maintenance genuinely hard. First, enterprises generally lack professionals who have mastered big data technology; in many companies, data staff must handle requirements, development, and operations all at once. In better-off organizations, development and operations are separated, but faced with the sprawling Hadoop ecosystem and its constant evolution, big data operations only get harder. Second, as data volumes and data applications surge, more and more must be delivered into operations. Big data O&M is not just the shallow work of starting and stopping services, monitoring, alerting, and job scheduling; it also includes performance tuning, resource scaling, fault handling, and other tasks needed to keep the platform running stably.

2. The cost challenge of big data

For enterprises, deploying a big data project sometimes brings not "big value" but "big cost". First, there is new hardware to consider: machine rooms, servers, storage, and power consumption. Second, on the software side, even though an enterprise can build its big data platform on the open-source Hadoop ecosystem, it still has to pay for design, development, and day-to-day operations. Some enterprises instead choose cloud data architectures, but when buying cloud services they carry over the old "over-provisioning" mindset of on-premises deployment, leading to over-allocated resources, unnecessary capacity, poor visibility into the environment, and ultimately cloud costs spiraling out of control.

3. Security challenges of big data

The security of big data has always troubled the industry. The volumes stored are enormous, which makes them a prime target for hackers. Moreover, in a distributed data architecture the data is spread across many nodes, which complicates data protection and gives attackers more opportunities to exploit vulnerabilities and carry out illegal operations. Some companies even hold the misconception that "open source equals security" and embrace open-source software uncritically; in fact, open-source software has plenty of vulnerabilities, and the data security and leakage incidents that result are increasing year by year.

03 Seek solutions and actively respond to challenges

The three major challenges of big data are issues every enterprise must face today. But how should they be dealt with? Is there a concrete solution?

Recently, with this question in mind, the author had an in-depth exchange with his friend Will, a data architect at Amazon Cloud Technology. The solution Will offered was serverless data. He believes that serverless data, built on a cloud-native foundation, will become the new normal for next-generation data service architectures.

To be honest, I had heard of Serverless, but I was at a loss as to what a Serverless data service actually is and whether it can really solve the challenges of big data, so I could only humbly ask for advice. After a few drinks and an in-depth exchange, I finally gained a real understanding of this brand-new data architecture, and I can't wait to share it with you.

1. First, what is Serverless?

Serverless, also known as serverless computing, is a cloud computing architecture pattern. The term first appeared in an article around 2012, whose author, Ken Fromm, explained it as follows: the word serverless does not mean that servers are no longer involved; it simply means that developers no longer need to think about physical capacity or take on as much infrastructure resource management. By removing the complexity of back-end infrastructure, serverless lets developers shift their attention from the server level to the task level.

Serverless is an event-driven computing model: developers do not need to care about the underlying servers and infrastructure; they only write the processing logic and upload it to the cloud provider's platform. This model offers high scalability, flexibility, reliability, and low cost, and suits complex, high-concurrency application scenarios.
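
In this model the unit of deployment is just a function that receives an event. Here is a minimal Python sketch of such a handler; the event shape (a batch of records with amounts) is invented for illustration, though the `(event, context)` signature matches what AWS Lambda's Python runtime expects:

```python
import json

def handler(event, context=None):
    """Entry point invoked by the platform for each event; no server code anywhere."""
    total = sum(record["amount"] for record in event.get("records", []))
    return {"statusCode": 200, "body": json.dumps({"total": total})}

# Locally, the platform's invocation can be simulated by calling the function directly:
response = handler({"records": [{"amount": 3}, {"amount": 4}]})
```

Everything outside this function, including capacity, scaling, patching, and failover, is the platform's responsibility.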

According to my friend, Amazon Cloud Technology is the pioneer and leader of Serverless technology. I know that Huawei, Alibaba, and Tencent have also launched their own serverless products, but my friend insists Amazon Cloud Technology leads the field. Without hard evidence to the contrary, I won't argue with him, hahaha!

2. Next, what is Serverless data?

Serverless data refers to data processing built on a serverless architecture: it uses the cloud provider's infrastructure and platform services to execute and manage data processing tasks in an event-driven way. Developers do not need to worry about server deployment or operations; they only write the processing logic, deploy it on the cloud provider's platform, and let event triggers kick off execution. This approach offers high scalability, high reliability, and low cost, and suits large-scale, complex data processing tasks. Taking a serverless database as an example, its working principle is shown in the figure below.

3. Problems that Serverless data can solve

After listening to Will's introduction to Serverless data, I broadly understood that it does have real advantages in data computing and storage. But Will also brought up Amazon Cloud Technology's full line of Serverless data services, which is said to effectively solve the various challenges and problems enterprises face in data management and application.

"What is full-line serverless data? What problems can it solve?" Under my persistent questioning, Will gave the following answer:

First, serverless data requires no operations and maintenance. With serverless data, users do not need to care about underlying details such as servers, operating systems, or network configuration. They only write code; the database or data analysis service is fully managed, and the platform automatically provides a highly available runtime with elastically scaling capacity, avoiding the server O&M costs and risks of traditional architectures.

Second, serverless data can effectively reduce IT costs. With serverless data, users start services only when needed; there is no pre-provisioning of resources for future traffic peaks, and no paying for idle resources. Put simply, serverless data does not require users to reserve cluster capacity: resources scale automatically with the workload, improving utilization and cost-effectiveness, avoiding waste, and effectively reducing enterprise IT costs.
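
The scale-to-zero, pay-per-use idea can be made concrete with a toy capacity model (the numbers, unit sizes, and function names here are all invented for illustration): capacity follows demand each interval, drops to zero when idle, and billing sums only what was actually allocated.

```python
import math

def required_units(demand, unit_capacity=100, max_units=64):
    """Allocate just enough compute units for the current demand: zero when idle."""
    if demand <= 0:
        return 0
    return min(max_units, math.ceil(demand / unit_capacity))

def billed_unit_intervals(demand_samples, unit_capacity=100):
    # Pay-per-use: the bill tracks actual demand per interval,
    # not a fixed, pre-provisioned cluster size.
    return sum(required_units(d, unit_capacity) for d in demand_samples)
```

Compare this with pre-provisioning for the peak: a fixed cluster sized for the busiest interval is billed for every interval, idle or not.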

Third, serverless data supports real-time processing. Through event triggers and schedulers, serverless data can automatically trigger and schedule processing tasks, respond to events as they happen, process real-time data streams, and produce real-time results.
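
A common shape for such event-driven, real-time jobs is windowed aggregation: as records arrive, they are grouped into fixed time windows and summarized. A minimal sketch follows; the event format (timestamp, key) is invented for illustration:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per key within fixed (tumbling) time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs, the kind of
    stream an event trigger would feed to a processing function as it arrives.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to the window boundary
        counts[(window_start, key)] += 1
    return dict(counts)
```

In a serverless setting, a function like this would be invoked automatically per batch of arriving records rather than run on an always-on cluster.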

Fourth, serverless data supports data governance. By writing data processing logic to clean, convert, and format data, enterprises can turn raw data resources into data assets. Cloud vendors also provide a series of serverless governance and management tools for managing, monitoring, and maintaining enterprise data, and their security systems help safeguard data security and privacy.
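
The clean/convert/format step mentioned above can be sketched as a pair of small functions (the field names and validation rules are invented for illustration): invalid records are rejected, and the rest are normalized into a consistent shape.

```python
def clean_record(raw):
    """Validate and normalize one raw record; return None if it should be dropped."""
    name = (raw.get("name") or "").strip()
    if not name:
        return None  # reject records with a missing or blank name
    try:
        age = int(raw.get("age"))
    except (TypeError, ValueError):
        return None  # reject records whose age is not an integer
    return {"name": name.title(), "age": age}

def clean_batch(rows):
    # Keep only the records that survive cleaning.
    return [r for r in map(clean_record, rows) if r is not None]
```

Deployed behind an event trigger, a pipeline like this could run automatically on each newly arrived batch of raw data.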

Fifth, serverless data also supports data analysis and mining. Cloud providers, Amazon among them, offer a series of serverless products and tools for analytics; with these, large volumes of data can be analyzed, mined, and visualized to surface the valuable information within, letting data empower the business and helping enterprises achieve digital transformation.

Sixth...

Seventh...

Okay, okay, stop right there... &*#¥#@~##!

That's Will for you, a true technology enthusiast who can't stop once he gets talking about tech. I like to think my own skills are decent, but next to him I feel humbled...

04 Amazon Cloud Technology Serverless Data

This exchange with Will taught me a lot and deepened my understanding of Serverless and serverless data. Before parting, he left me an introduction to Amazon Cloud Technology's Serverless services. Let's take a look at what serverless offerings Amazon Cloud Technology, which claims to lead the development of serverless technology, provides in the data field, and what value they can bring to enterprises!

Amazon Cloud Technology Serverless Data Development History

This picture gives an overview of Amazon Cloud Technology's journey in serverless data. It is not hard to see from the picture that the company began building serverless data services as early as 2012.

In 2012, Amazon DynamoDB, a database service with a serverless architecture, was released, and the product's performance and experience are said to be genuinely good. DynamoDB is a key-value and document database with millisecond latency, supporting petabyte-scale data and tens of millions of read/write requests per second.

In 2013, Amazon Kinesis was released: a serverless service for message stream processing. As an extensible, scalable message streaming service supporting stream computing scenarios, it can collect, process, and analyze real-time data streams and forward them to multiple targets, powering all kinds of real-time applications and tools.

In 2014, the famous Amazon Lambda service was launched. Lambda is a highly available, serverless, event-driven compute service that lets you run code without provisioning or managing servers or clusters. Not long after Lambda launched, similar services from Google, Microsoft, and others entered the market, and "Serverless" gradually became an industry buzzword.

Panorama of Amazon Cloud Technology Serverless Data

In 2016, Amazon QuickSight was launched: a business intelligence service built on a serverless architecture that helps users easily draw insights from a variety of data sources and analyze and report on data with visualization tools and dashboards. QuickSight can also automate the data visualization and analysis process to improve productivity and accuracy.

In 2018, Amazon Aurora Serverless was released: a fully managed, on-demand, auto-scaling relational database service. It automatically scales compute and memory capacity to the application's needs, making it suitable for unpredictable or highly variable workloads, and it bills only for the database resources actually used, with no capacity planning or upfront investment. This model greatly reduces the cost of buying cloud services.

In 2019, Amazon Lake Formation, a serverless data lake management tool, was released, which can help enterprises quickly configure their own data lakes.

In 2021, Amazon Cloud Technology released three more data analysis services with serverless architecture, namely streaming data pipeline Amazon MSK Serverless, big data analysis platform Amazon EMR Serverless, and data warehouse Amazon Redshift Serverless.

In 2022, Amazon OpenSearch Serverless was released: a serverless log analytics engine. Its release means that in the field of data analysis, Amazon Cloud Technology has achieved a "full-line serverless architecture".

Really amazing!!

On March 30, 2023, Amazon Cloud Technology will hold a technology innovation conference, running simultaneously online and offline. What surprises will it bring us this time in serverless data? Let us wait and see!


Amazon Cloud Technology Innovation Conference: Fully Embrace the Serverless Era


Origin blog.csdn.net/kuangfeng88588/article/details/129793071