The six stages of big data practice: why it is now imperative to migrate the big data platform to Kubernetes

Original author | Dr. Peng Feng
Translated | Wang Longfei

In the "Gartner Data Management Technology Maturity Curve" report in 2018, the concept of DataOps was first proposed. Gartner marked it as "very nascent" at present, and it is estimated that it will take 5-10 years to reach a technical maturity period. In the latest Gartner2022 report, the concept of DataOps has entered the second stage of "explosive growth".


In Gartner's model, a technology goes through the following five stages from emergence to widespread adoption:

In the "very early" stage, a potential technology breakthrough kicks things off. Early proof-of-concept stories and media interest trigger significant publicity. Often no usable product exists, and commercial viability is unproven.

In the "explosive growth" phase, early publicity produces a number of success stories, often accompanied by many failures. Some companies take action; most do not.

The "Trough of Disillusionment" stage where early gains fade as experiments and implementations fail to materialize. The producers of the technology abandon the technology or declare failure. Investment will continue only if surviving vendors improve their products to meet the needs of early adopters.

The "Slope of Enlightenment" phase More examples of how technology can benefit businesses begin to become clear and more widely understood. Later iterations of products come from technology providers. More corporate funding begins to inject capital into pilot projects; conservative firms remain cautious.

The "plateau of productivity" stage technology begins to gain widespread acceptance. Criteria for assessing the viability of technology providers are clearer. The broad market applicability and relevance of the technology is clearly paying off. If the technology is more than just a niche market, then it will continue to grow.

By this definition, the Gartner report places DataOps squarely in the "explosive growth" stage: many companies have begun to act on it, and it is considered one of the potentially market-disrupting technologies, much as Spark and stream processing were a few years ago. So what exactly does DataOps mean, and why has it emerged only now, roughly ten years after Hadoop kicked off the big data wave?

We'll try to answer these questions by describing the six phases of a big data project and seeing what DataOps really brings to the table.

Phase 1: Technology Testing

At this stage your team will probably install a Hadoop cluster and Hive (possibly with Sqoop) in order to move some data into the cluster and run some queries. In recent years, components such as Kafka and Spark have also become common choices. If you want to perform log analysis, you might also install the ELK stack (Elasticsearch, Logstash, Kibana).

However, most of these systems are complex distributed systems, some of which also require database support. While many offer single-node modes for you to try, your team still needs to be familiar with common DevOps tools such as Ansible, Puppet, Chef, and Fabric.

Using these tools and prototyping should be feasible for most engineering teams thanks to the hard work of the open source community. If you have some good engineers on your team, you can probably have a system up and running in a few weeks, depending on how many components you have to install.
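
As a concrete illustration of this prototyping stage, here is a minimal PySpark sketch of the kind of first query a team might run to verify that data actually landed in the cluster. It assumes a local Spark installation with Hive support and a hypothetical `web_logs` table partitioned by `dt`; it is not part of the original article's setup.

```python
# A minimal prototype query at the "technology testing" stage.
# Assumes a Spark build with Hive support and a hypothetical
# `web_logs` table already loaded into the warehouse.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("phase1-prototype")
    .enableHiveSupport()
    .getOrCreate()
)

# Count page views per day -- the sort of first query teams use to
# confirm that data actually arrived in the cluster.
daily_views = spark.sql("""
    SELECT dt, COUNT(*) AS page_views
    FROM web_logs
    GROUP BY dt
    ORDER BY dt
""")
daily_views.show(10)

spark.stop()
```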

Phase 2: Automation

At this stage, you already have a basic big data system, and your next needs may include:

  • Some Hive queries that run periodically, say hourly or daily, to generate business intelligence reports;

  • Some Spark jobs that run machine learning pipelines to build user-analysis models, so that your product can provide personalized services;

  • Some crawlers that extract data from remote sites from time to time;

  • Or some streaming programs that feed real-time dashboards displayed on a big screen.

To implement these requirements, you need a job scheduling system that runs jobs based on time or data availability. Workflow systems such as Oozie, Azkaban, and Airflow let you specify when to run programs (much like cron jobs on a Linux machine).

Functionality varies widely between workflow systems. For example, some provide dependency management, letting you specify scheduling logic such as "job A runs only after jobs B and C complete" (see the sketch below); some can manage only Hadoop programs, while others support more types of workflows. You have to decide which one best meets your requirements.
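
To make the dependency idea concrete, below is a minimal Airflow sketch (one of the schedulers named above) in which a hypothetical job A runs only after jobs B and C complete; the DAG id, schedule, and commands are placeholders, not a prescribed pipeline.

```python
# Minimal Airflow sketch of dependency management: job_a runs only
# after job_b and job_c both succeed. DAG id, schedule and commands
# are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_bi_report",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",   # run once a day, like a cron entry
    catchup=False,
) as dag:
    job_b = BashOperator(task_id="job_b", bash_command="echo extract data")
    job_c = BashOperator(task_id="job_c", bash_command="echo load dimensions")
    job_a = BashOperator(task_id="job_a", bash_command="echo build report")

    # job_a is scheduled only when both upstream jobs have completed.
    [job_b, job_c] >> job_a
```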

Besides the workflow system, other tasks need to be automated. For example, if some data on HDFS should only be kept for one year, then on day 366 you need to delete the oldest day of data in the dataset. This is called a data retention policy. You will need to write a program that specifies and enforces the retention policy for each data source, or your disks will quickly fill up.
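
The following is a rough sketch of what enforcing such a retention policy might look like, assuming date-partitioned HDFS paths such as `/data/events/dt=YYYY-MM-DD`; the layout, retention window, and paths are illustrative assumptions rather than a prescribed implementation.

```python
# A rough sketch of enforcing a 1-year retention policy on
# date-partitioned HDFS paths such as /data/events/dt=2023-01-01.
# The layout and retention window are assumptions for illustration.
import subprocess
from datetime import datetime, timedelta

RETENTION_DAYS = 365
DATA_ROOT = "/data/events"          # hypothetical data source root

def expired_partitions(root: str, days: int) -> list[str]:
    cutoff = datetime.utcnow() - timedelta(days=days)
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", root],
        capture_output=True, text=True, check=True,
    ).stdout
    paths = []
    for line in out.splitlines():
        path = line.split()[-1] if line.strip() else ""
        if "/dt=" in path:
            dt = datetime.strptime(path.rsplit("dt=", 1)[1], "%Y-%m-%d")
            if dt < cutoff:
                paths.append(path)
    return paths

for path in expired_partitions(DATA_ROOT, RETENTION_DAYS):
    # -skipTrash frees space immediately; drop it to keep a safety net.
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path], check=True)
```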

Phase 3: Going into Production

Now that you have an automated data pipeline, data can finally flow through it! Are you done? The reality is that your production environment will run into the following thorny problems:

  • A 5.1% hard drive failure rate in the first year (similar to the first-year server failure rate)
  • An 11% server failure rate by year four
  • Heavily used open source programs that still have plenty of bugs
  • Your own programs, which may also have bugs
  • External data sources that arrive late
  • Database downtime
  • Network errors
  • Someone typing an extra space while running "sudo rm -rf /usr/local/"

These problems occur much more often than you might think. Suppose you have 50 machines with 8 hard drives each, 400 drives in total: at a roughly 5% annual failure rate, about 20 drives will fail per year, or about 2 a month. After months of struggling through manual fixes, you finally realize that you desperately need:

  • Monitoring system: a monitoring program that watches the hardware, operating system, resource usage, and running programs;
  • System probes: the system needs to expose its operating metrics so that it can be monitored;
  • Alerting system: when a problem occurs, operations engineers need to be notified (a minimal probe-and-alert sketch follows this list);
  • No SPOF: if you don't want to be woken up at 3 a.m., it is best to have no single point of failure in the system;
  • Backups: back up important data as soon as possible; don't rely on Hadoop's three replicas, they can easily be wiped out by a few extra spaces;
  • Recovery: if you don't want to handle every error by hand, failures should be recovered from automatically as much as possible.
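
As referenced in the alerting item above, here is a bare-bones probe-and-alert sketch: it checks disk usage on a node and posts to a hypothetical webhook when a threshold is crossed. A real deployment would more likely use Prometheus, Grafana, or a similar stack; this only illustrates the idea.

```python
# A bare-bones probe-and-alert loop: check disk usage on a node and
# notify someone when a threshold is crossed. The webhook URL and
# threshold are hypothetical placeholders.
import json
import shutil
import urllib.request

ALERT_WEBHOOK = "https://alerts.example.com/hook"   # placeholder endpoint
DISK_USAGE_THRESHOLD = 0.85

def check_disk(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def send_alert(message: str) -> None:
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

ratio = check_disk("/")
if ratio > DISK_USAGE_THRESHOLD:
    send_alert(f"Disk usage at {ratio:.0%}, above the {DISK_USAGE_THRESHOLD:.0%} threshold")
```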

At this stage you realize that building an enterprise-grade system is not as easy as installing some open source programs; quite a bit more work is needed.

Phase 4: Data Management

An enterprise-grade big data system has to deal not only with the hardware and software failures common to any standard system operation, but also with data-related issues. For a truly data-driven IT system, you need to ensure that your data is complete, correct, on time, and able to evolve.

So what do these mean?

  • You need to know that no data is lost at any step of the pipeline. Therefore, you need to monitor the amount of data each program processes in order to detect anomalies as soon as possible;

  • You need mechanisms for testing data quality, so that you receive alerts if unexpected values appear in the data (see the sketch after this list);

  • You need to monitor the running time of your applications, so that each data source has a predefined ETA and delayed data sources trigger alerts;

  • You need to manage data lineage, so you understand how each data source was generated and, when something goes wrong, which data and results are affected;

  • The system should automatically handle legitimate metadata changes and should detect and report illegitimate metadata changes immediately;

  • You need to version-control your applications and associate versions with the data, so that when a program changes you know how the related data changes accordingly.
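
Here is the data-quality sketch referenced above: it compares today's record count against a baseline and checks the null rate of a key column. The table name, column, partition, and thresholds are hypothetical illustrations of the kind of checks described, not a specific recommended tool.

```python
# Sketch of row-count and data-quality checks: compare today's record
# count against a baseline and flag null rates in a key column.
# Table, column, partition and thresholds are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()

today = spark.table("web_logs").filter(F.col("dt") == "2023-04-01")
baseline = 10_000_000          # e.g. a trailing 7-day average, stored elsewhere

row_count = today.count()
if row_count < 0.5 * baseline or row_count > 2.0 * baseline:
    print(f"ALERT: row count {row_count} deviates from baseline {baseline}")

null_rate = today.filter(F.col("user_id").isNull()).count() / max(row_count, 1)
if null_rate > 0.01:
    print(f"ALERT: user_id null rate {null_rate:.2%} exceeds the 1% threshold")
```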

Also, at this stage you may need to provide a separate test environment for data scientists to test their code, along with convenient and safe tools so that they can quickly validate their ideas and easily release them to production.

Phase 5: Security

At this stage, big data is already inseparable from your business: customer-facing products are driven by data, and company management relies on real-time business analytics to make major decisions. The security of your data assets becomes critical. Can you be sure that only the right people can access your data? Does your system have an authentication and authorization scheme?

A simple example is Hadoop's Kerberos authentication. If you run Hadoop without Kerberos integration, anyone with access to the cluster can impersonate the Hadoop superuser and access all data. Other tools such as Kafka and Spark also need Kerberos for authentication. Because setting these systems up with Kerberos is complex (and usually only commercial distributions offer support), most systems we see simply skip Kerberos integration.
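
For illustration, this is roughly what submitting a Spark job on a Kerberos-secured cluster looks like, passing a principal and keytab so Spark can obtain and renew its delegation tokens; the principal, keytab path, and job file are placeholders, not values from the article.

```python
# Illustrative only: submitting a Spark job on a Kerberos-secured
# cluster with a principal and keytab so Spark can obtain and renew
# delegation tokens. Principal, keytab path and job file are
# placeholders.
import subprocess

subprocess.run(
    [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--principal", "etl_user@EXAMPLE.COM",                 # hypothetical principal
        "--keytab", "/etc/security/keytabs/etl_user.keytab",   # hypothetical keytab
        "daily_report.py",
    ],
    check=True,
)
```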

Aside from authentication issues, here are some issues you'll need to deal with at this stage:

  • Auditing: the system must audit all operations, for example who accessed which content (a toy audit sketch follows this list);
  • Multi-tenancy: the system must support multiple users and groups sharing the same cluster, with resource isolation and access control, so that they can process and share their data safely and securely;
  • End-to-end security: all tools in the system must implement proper security measures, for example Kerberos integration for all Hadoop-related components and HTTPS/TLS for all network traffic;
  • Single sign-on: all users should have a single identity across all tools, which is very important for enforcing security policies.
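
As a toy illustration of the auditing item above, the sketch below wraps data-access functions so every call records who accessed which resource and when. Real systems would rely on tools such as Apache Ranger or the audit logs built into each component; this only shows the shape of the requirement, and all names are hypothetical.

```python
# Toy auditing sketch: wrap data-access functions so each call logs
# who accessed which resource and when. Resource names are hypothetical.
import functools
import getpass
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def audited(resource: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            audit_log.info(
                "user=%s resource=%s action=%s time=%s",
                getpass.getuser(), resource, func.__name__,
                datetime.now(timezone.utc).isoformat(),
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@audited("hive://warehouse/web_logs")
def read_web_logs(day: str):
    ...   # the actual data access would go here
```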

Since most open source tools do not offer these features in their free versions, it is not surprising that many projects take a haphazard approach to security. We agree that the value of security differs from project to project, but you must be aware of the potential problems and take an appropriate approach.

Phase 6: Big Data on Cloud Infrastructure

At this stage, as the business keeps growing, more and more applications are added to the big data system. Besides traditional big data systems like Hadoop/Hive/Spark, you now need TensorFlow to run deep learning, InfluxDB for time-series analysis, Heron to process streaming data, or a Tomcat application to serve a data API. Every time you need to run a new program, you find that configuring machines and setting up the production deployment is tedious and full of pitfalls. In addition, you sometimes need machines temporarily for extra analysis work, for example a proof of concept or training a model on a relatively large dataset.

These issues are exactly why you want to run big data systems on cloud infrastructure in the first place. Cloud platforms like Mesos provide great support for analytical as well as general workloads, along with all the benefits of cloud computing: easy configuration and deployment, elastic scaling, resource isolation, high resource utilization, resilience, and automatic recovery.

Another reason to run big data systems in a cloud computing environment is the evolution of big data tools themselves. Traditional distributed systems such as MySQL clusters, Hadoop, and MongoDB clusters tend to handle their own resource management and distributed coordination. With the emergence of distributed resource managers and schedulers such as Mesos and YARN, more and more distributed systems (such as Spark) rely on the underlying framework to provide the resource allocation and coordination primitives needed for distributed operation. Running them in such a unified framework greatly reduces complexity and improves operational efficiency.
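
To show what "relying on the underlying resource manager" looks like in practice, here is a sketch of the same spark-submit pointed at a Kubernetes API server instead of YARN; the API server address, namespace, and container image are placeholders, not settings from the article.

```python
# Sketch of submitting the same Spark job to a Kubernetes cluster
# instead of YARN. API server address, namespace and image are
# hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "spark-submit",
        "--master", "k8s://https://k8s-apiserver.example.com:6443",
        "--deploy-mode", "cluster",
        "--name", "daily-report",
        "--conf", "spark.kubernetes.namespace=data-platform",
        "--conf", "spark.kubernetes.container.image=example.com/spark:3.4.0",
        "--conf", "spark.executor.instances=4",
        "local:///opt/spark/jobs/daily_report.py",   # file baked into the image
    ],
    check=True,
)
```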

Summary

We have seen real big data projects at all of these stages. More than ten years after Hadoop was first adopted, most of the projects we see are still stuck in stage 1 or stage 2. The main problem is that implementing a stage 3 system requires a great deal of expertise and investment. A Google study showed that only 5% of the time spent building a machine learning system goes into the actual machine learning code; the other 95% goes into getting the infrastructure right. Since data engineers are difficult and expensive to train (they need a good understanding of distributed systems), most companies are unfortunately not on the fast track to the big data era.

Like DevOps, DataOps is an ongoing process that requires the right tools and the right mindset. The goal of DataOps is to make it easier to implement big data projects in the right way to get the most value from the data with less work. Companies like Facebook and Twitter have long pushed DataOps-like practices internally. However, their methods are often tied to their internal tools and existing systems, making it difficult to generalize for others.

Over the past few years, the standardization of big data operations has become possible through technologies such as Mesos and Docker. Combined with broader adoption of a data-driven culture, DataOps is finally ready for mainstream adoption. We believe this movement will lower the barriers to implementing big data projects, making it easier for every business and institution to obtain maximum value from its data.

Commentary: running big data platforms on Kubernetes has become a new trend in the industry, and the discussion in this article remains instructive.

Years ago, Dr. Peng Feng's "Six Stages of Big Data Practice" inspired many data-driven enterprises. Today, with Kubernetes in wide use, rereading this classic still offers lessons.

Looking back two years: in March 2021, Apache Spark made its Kubernetes support generally available, and in May of the same year Kafka publicly embraced Kubernetes as well, marking the point where the core big data components all support Kubernetes. This laid the groundwork for bringing big data components and data applications under Kubernetes management and standardizing system operations.

Today most enterprises in China already use Kubernetes widely, but they face the same problem: K8s is mostly used to schedule general cloud workloads, while for big data they still maintain a separate, complicated system, the traditional big data platform.

This brings many disadvantages and headaches, so companies that run both a traditional big data platform and K8s often wonder whether they can migrate the big data platform onto Kubernetes, an approach known as Data on Kubernetes.

If you are considering migrating your big data platform to Kubernetes, the Kubernetes Data Platform (KDP), independently developed by Zhilingyun (LinkTime Cloud) as the first containerized cloud-native big data platform on the market that can be fully deployed on Kubernetes, can help enterprises going through the sixth stage of big data solve this problem.

KDP is often described as a "true" cloud-native big data platform. The word "true" is emphasized because every component of the platform has been rebuilt as containers and brought under standard K8s management, not just some of them.

The value of this is obvious: across different environments, as long as the underlying infrastructure is Kubernetes, the big data platform can be deployed smoothly without repeatedly configuring physical infrastructure or modifying code.

In addition, the cloud-native big data platform is underpinned by a globally shared resource pool: users can migrate existing systems into it to achieve higher resource utilization. Meanwhile, the cloud-native storage-compute separation architecture can manage hot and cold data separately, choosing different storage media (mechanical hard disks, SSDs, object storage) for different application scenarios to reduce storage costs.

KDP also lets users remove the dependency on Hadoop entirely and run all workloads directly in the K8s environment, unifying resource management, simplifying multi-tenant billing, and greatly reducing operations and maintenance costs.

As a forward-looking big data platform, KDP can help enterprises quickly explore new paths in an era of competition driven by the value of data.
