Zero-ETL, Big Models, and the Future of Data Engineering

Editor's note: This article explores future trends and challenges in data engineering, a field that is constantly being reinvented. Big data always has a ceiling on performance and capacity; each technological advance raises that ceiling, but we soon bump into it again and stay there until the next leap.

The following is the translation. Enjoy!

Author | Barr Moses

Translated by | Yue Yang

[Image courtesy of the author]

If you don't like embracing change, data engineering is not for you. In this field, almost nothing escapes being reinvented.

The most typical and recent example: Snowflake and Databricks upended the concept of the traditional database and ushered in the era of the modern data stack.

As part of this disruptive movement, Fivetran and dbt fundamentally changed the data pipeline from ETL to ELT. Hightouch tried to shift the center of gravity to the data warehouse, interrupting SaaS's march to "eat the world." Monte Carlo also joined the fray, arguing that "hand-writing unit tests may not be the best way to ensure data quality [1]."

Today, data engineers still tending hard-coded pipelines and on-premises servers continue their march toward the peak of the modern data stack.

It seems almost unfair, then, that new ideas are already springing up to disrupt the disruptors:

  • Zero-ETL has data ingestion in its sights
  • AI and large language models could change how data transformation is done
  • Data product containers aim to become the core building block of data

Are we going to have to rebuild everything (again)? The body of the Hadoop era isn't even cold yet.

The answer is yes, of course data systems must be reinvented, probably several times over the course of each of our careers. The real questions are why, when, and how.

I don't claim to have all the answers or a crystal ball. But this article takes a closer look at some of the most prominent ideas that may become part of the data stack of the future, and explores their potential impact on data engineering.

01 Practical considerations and trade-offs

[Image credit: Tingey Injury Law Firm on Unsplash]

The modern data stack did not rise to prominence because it is better than previous data stacks in every respect. In fact, there are real trade-offs: data volumes are larger and data arrives faster, but it is also messier and harder to govern, and the jury is still out on cost efficiency.

The modern data stack reigns supreme because it supports use cases and unlocks value from data in ways that were previously very difficult or nearly impossible. Machine learning has gone from buzzword to revenue-generating tool. Analytics and experimentation can now go deeper and support more important decisions.

The same is true for each of the trends listed below. Each has pros and cons, but what will drive widespread adoption is whether it unlocks new ways to leverage data. There may also be "dark horse" ideas we haven't seen yet that help move the data landscape forward. Next, we examine each trend in detail.

02 Zero-ETL


What it is: Zero-ETL is something of a misnomer; the data pipeline still exists.

Today, data is typically generated by services and written to a transactional database. An automated pipeline is deployed that not only moves the raw data into the analytical data warehouse but also modifies it slightly along the way.

For example, an API exports data in JSON format, and the ingestion pipeline not only needs to transfer the data but also apply light transformations to ensure it lands in a tabular format that can be loaded into the data warehouse. Other transformations common during the ingestion phase include data formatting and deduplication.
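To make that concrete, here is a minimal Python sketch of the kind of light-touch processing an ingestion pipeline performs: flattening a nested JSON payload into tabular rows, normalizing a timestamp, and deduplicating on a key. The payload shape and field names are hypothetical, purely for illustration.

```python
from datetime import datetime, timezone

def flatten_record(raw: dict) -> dict:
    """Flatten a nested JSON record into a tabular row (illustrative fields)."""
    return {
        "order_id": raw.get("id"),
        "customer_id": raw.get("customer", {}).get("id"),
        "amount_usd": float(raw.get("amount", 0)),
        # Normalize timestamps to UTC ISO-8601 so the warehouse parses them consistently
        "created_at": datetime.fromtimestamp(raw["created"], tz=timezone.utc).isoformat(),
    }

def ingest(raw_records: list[dict]) -> list[dict]:
    """Transfer plus light transformation: flatten, format, and deduplicate on the key."""
    seen, rows = set(), []
    for raw in raw_records:
        row = flatten_record(raw)
        if row["order_id"] in seen:  # deduplication during the ingest phase
            continue
        seen.add(row["order_id"])
        rows.append(row)
    return rows  # ready to load into a warehouse table

# Example: two API payloads, one of them a duplicate
payloads = [
    {"id": 1, "customer": {"id": 42}, "amount": "19.99", "created": 1700000000},
    {"id": 1, "customer": {"id": 42}, "amount": "19.99", "created": 1700000000},
]
print(ingest(payloads))
```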

While it is possible to apply more sophisticated transformations by hard-coding pipelines in Python, and some advocate doing so [2] in order to land pre-modeled data in the warehouse, most data teams choose not to for reasons of expediency and visibility/quality.

Zero-ETL changes this ingestion process by having the transactional database itself perform the data cleaning and normalization before the data is automatically loaded into the data warehouse. It's important to note that the data is still in a relatively raw state.

At the moment, this tight integration is possible because most Zero-ETL architectures require the transactional database and the data warehouse to come from the same cloud provider.

Pros: Reduced latency. No duplicate data storage. One less possible source of failure.

Cons: Less ability to customize how data is processed during the ingestion phase. Some degree of cloud-provider lock-in.

Who is driving Zero-ETL: AWS is the driving force behind Zero-ETL (Aurora to Redshift [3]), but GCP (BigTable to BigQuery [4]) and Snowflake (Unistore [5]) offer similar capabilities. Snowflake (Secure Data Sharing [6]) and Databricks (Delta Sharing [7]) are also pursuing so-called "no-copy data sharing." That process does not actually involve ETL; instead, it provides expanded access to the data where it is already stored.

Practicality and value potential: On the one hand, with the backing of the tech giants and capabilities that already ship, Zero-ETL seems like only a matter of time. On the other hand, I have observed data teams continuing to decouple their operational and analytical databases rather than integrating them more tightly, precisely so that an unexpected schema change cannot bring down the entire operation.

This innovation could further reduce software engineers' visibility and accountability for the data their services produce. Why should they care about the schema when the data is already on its way to the data warehouse shortly after the code is committed?

Since streaming and micro-batch approaches currently seem sufficient for the vast majority of "real-time" data needs, I think the main business driver for this innovation is infrastructure simplification. While that is nothing to scoff at, the possibility of no-copy data sharing removing lengthy security reviews as an obstacle could lead to wider adoption in the long run.

03 One Big Table and large language models

What it is: Today, business stakeholders have to describe their requirements, metrics, and logic to data professionals, who then translate them into SQL queries or even a dashboard. Even when all the data already lives in the warehouse, this process takes time, not to mention that ad-hoc data requests rank somewhere between a root canal and paperwork on the list of data teams' favorite activities.

A number of startups aim to harness the power of large language models like GPT-4 to automate this process, letting consumers "query" the data in natural language through a smooth interface.

[Image: English, at least until binary becomes the new official language]

This would radically simplify the self-service analytics process and further democratize data. However, solving this problem will be difficult given the complexity of the data pipelines behind more advanced analytics, beyond basic "metric fetching."
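As a rough sketch of how such a natural-language query layer might work, the snippet below assumes an LLM hidden behind a simple completion function. The schema, prompt, and `llm_complete` callable are hypothetical stand-ins, not the API of any product mentioned in this article, and the generated SQL would still deserve human review.

```python
SCHEMA_HINT = """
Table orders(order_id INT, customer_id INT, amount_usd FLOAT,
             created_at TIMESTAMP, region TEXT, status TEXT)
"""

def question_to_sql(question: str, llm_complete) -> str:
    """Ask an LLM to translate a business question into SQL over the warehouse schema.

    `llm_complete` is a stand-in for any chat/completion client (e.g. a GPT-4 wrapper);
    it takes a prompt string and returns the model's text response.
    """
    prompt = (
        "You translate business questions into a single ANSI SQL query.\n"
        f"Schema:\n{SCHEMA_HINT}\n"
        f"Question: {question}\n"
        "Return only the SQL."
    )
    return llm_complete(prompt).strip()

# Usage (with a fake model so the sketch runs on its own):
fake_llm = lambda _prompt: (
    "SELECT region, SUM(amount_usd) AS revenue "
    "FROM orders WHERE status = 'complete' "
    "GROUP BY region ORDER BY revenue DESC;"
)
sql = question_to_sql("Which regions drove the most revenue?", fake_llm)
print(sql)  # hand the generated SQL to the warehouse; a human should still review it
```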

However, that complexity could be reduced if all raw data were stored in one big table, an idea proposed by Benn Stancil [8], one of the best and most forward-looking writers and founders in the data field. No one has imagined [9] the death of the modern data stack [10] more.

As a concept, it's not that far away. Some data teams already leverage a one big table (OBT) strategy, which has both proponents and detractors [11].
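For illustration, here is a tiny pandas sketch of what a one-big-table strategy means in practice: rather than a constellation of joined models, raw entities are denormalized into a single wide table. The tables and columns are made up.

```python
import pandas as pd

# Raw, normalized source tables (illustrative data)
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [42, 42, 7],
    "amount_usd": [19.99, 5.00, 120.00],
})
customers = pd.DataFrame({
    "customer_id": [42, 7],
    "region": ["EMEA", "AMER"],
    "plan": ["pro", "free"],
})

# One Big Table: denormalize everything into a single wide table that both
# humans and an LLM-backed query layer can hit without knowing the join graph.
one_big_table = orders.merge(customers, on="customer_id", how="left")
print(one_big_table)
```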

Using a large language model seems to overcome one of the biggest challenges of the one big table approach: its complete lack of organization makes data discovery, exploration, and pattern recognition hard for people. Humans benefit from a table of contents and clearly labeled chapters for their story, but AI doesn't care.

Pros: Perhaps the promise of self-service data analytics [12] is finally realized. Faster time to insight. The data team can spend more time unlocking data value and building the business rather than responding to ad-hoc queries.

Cons: Is it too much freedom? Data professionals are familiar with the painful quirks of data (time zones [13]! what exactly counts as an "account"?) to a degree most business stakeholders are not. Do we benefit from a representative rather than a direct data democracy?

Who is driving it: Ultra-early startups like Delphi [14] and GetDot.AI [15], and startups like Narrator [16]. There are also more established players offering this kind of capability, such as Amazon QuickSight [17], Tableau Ask Data [18], and ThoughtSpot.

Practicality and potential: Refreshingly, this is not a technology in search of a use case [19]. The value and efficiency are obvious, but so are the technical challenges. This vision is still being built and will take more time to develop. Perhaps the biggest hurdle to adoption will be the infrastructure disruption such an approach requires, which may be too risky for more established companies.

04 Data product containers

What it is: A data table is the basic building block of a data product; in fact, many data leaders consider their production tables to be their data products [20]. However, for a data table to be treated as a product, it needs many additional capabilities, including access management, data discovery, and data reliability.

Containerization was critical to the microservices trend in software engineering: containers enhance portability, abstract away infrastructure, and ultimately let organizations scale microservices. The data product container concept envisions applying the same containerization to data tables.

Data product containers may prove to be an effective mechanism for making data more reliable and governable, especially if they can better surface the semantic definitions, data lineage [21], and quality metrics associated with the underlying data.
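As a rough illustration of the concept (not Nextdata's actual implementation), a data product container can be imagined as a table bundled with the metadata that turns it into a product: ownership, semantics, lineage, access policy, and quality checks. All field names below are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataProductContainer:
    """A table packaged with the metadata that turns it into a product.

    Illustrative only: real implementations define their own spec.
    """
    name: str
    owner: str                              # accountable domain team
    semantic_definition: str                # what the data means to the business
    upstream_lineage: list[str]             # where the data comes from
    access_policy: dict[str, list[str]]     # role -> allowed operations
    quality_checks: list[Callable[[], bool]] = field(default_factory=list)

    def is_healthy(self) -> bool:
        """Run the attached quality checks; the container carries its own reliability contract."""
        return all(check() for check in self.quality_checks)

orders_product = DataProductContainer(
    name="orders",
    owner="payments-team",
    semantic_definition="One row per completed customer order, net of refunds.",
    upstream_lineage=["postgres.public.orders", "stripe.charges"],
    access_policy={"analyst": ["SELECT"], "service": ["SELECT", "INSERT"]},
    quality_checks=[lambda: True],          # e.g. freshness or null-rate checks
)
print(orders_product.is_healthy())
```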

Pros: Data product containers appear to be a way to better package and enforce the four data mesh [22] principles (domain-oriented decentralized data ownership and architecture, data as a product, self-serve data platform, and federated computational governance).

Cons: Will this concept make it easier or harder for companies to scale their data products? Another fundamental question, as with many of these futuristic data trends, is whether the by-products of data pipelines (code, data, metadata) contain value for data teams that is worth preserving.

Who's driving it: Nextdata [23], the startup founded by data mesh creator Zhamak Dehghani. Nexla [24] is also playing in this space.

Practicality and potential: While Nextdata has only recently emerged and data product containers are still evolving, many data teams are already seeing proven results from data mesh implementations. The future of the data table will depend on the exact shape and execution of these containers.

05 Endless Refactoring of the Data Lifecycle

[Photo by zero take on Unsplash]

To envision the future of data, we can look to data's past and present. Past, present, future: data infrastructure lives in a constant state of disruption and rebirth (although perhaps we need more chaos [25]).

The meaning of the data warehouse has changed dramatically since Bill Inmon introduced the term in the 1990s. ETL pipelines have become ELT pipelines. The data lake is no longer the vague thing it was just two years ago.

With all these innovations brought about by the modern data stack, data engineers still play a central technical role in determining how data flows and how data consumers access it.

The term Zero-ETL seems threatening because it (inaccurately) implies the death of the pipeline, and if there is no pipeline, do we still need data engineers?

Despite the hype around ChatGPT's ability to generate code, the process still relies heavily on data engineers to review and debug that code. The scarier possibility with large language models is that they could fundamentally change data pipelines, or our relationship with data consumers (and how data is served to them).

However, if that future does arrive, it will still depend heavily on data engineers.

The general lifecycle of data has been around since the dawn of man. Data is emitted, shaped, used, and then archived.

While the infrastructure may change and automation will shift where time and attention go, human data engineers will continue to play a critical role in extracting value from data for the foreseeable future.

That is not because future technologies and innovations cannot simplify today's complex data infrastructure, but because our needs and uses for data will keep growing in scale and complexity.

Big data will always be a pendulum swinging back and forth. We make a big leap in capacity, and then we quickly find ways to reach that new boundary until the next leap is needed.

END

References

1. https://www.montecarlodata.com/blog-what-is-data-observability/
2. https://medium.com/towards-data-science/is-the-modern-data-warehouse-broken-1c9cbfddec3e
3. https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-aurora-zero-etl-integration-redshift/
4. https://www.infoq.com/news/2022/08/bigtable-bigquery-zero-etl/
5. https://www.snowflake.com/en/data-cloud/workloads/unistore/
6. https://docs.snowflake.com/en/user-guide/data-sharing-intro
7. https://www.databricks.com/product/delta-sharing
8. https://benn.substack.com/p/the-rapture-and-the-reckoning#footnote-anchor-12-99275606
9. https://benn.substack.com/p/how-fivetran-fails
10. https://benn.substack.com/p/how-dbt-fails
11. https://twitter.com/pdrmnvd/status/1619463942392389632
12. https://www.montecarlodata.com/blog-is-self-service-datas-biggest-lie/
13. https://www.explainxkcd.com/wiki/index.php/1883:_Supervillain_Plan
14. https://www.delphihq.com/
15. https://getdot.ai/
16. https://www.narratordata.com/
17. https://www.delphihq.com/
18. https://help.tableau.com/current/pro/desktop/en-us/ask_data.htm
19. https://en.wikipedia.org/wiki/Blockchain
20. https://www.linkedin.com/posts/shanemurray5_datamesh-dataengineering-dataquality-activity-7023310666983735296-4W3Y?utm_source=share&utm_medium=member_desktop
21. https://www.montecarlodata.com/blog-data-lineage/
22. https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-mesh-it-up/
23. https://www.nextdata.com/
24. https://www.nexla.com/nexsets-modern-data-building-blocks/
25. https://medium.com/towards-data-science/the-chaos-data-engineering-manifesto-5dc09a182e85

This article is translated and published with the original author's authorization and compiled by Baihai IDP. To reprint this translation, please contact us for permission.

Original link:

https://towardsdatascience.com/zero-etl-chatgpt-and-the-future-of-data-engineering-71849642ad9c
