Emerging Architectures for Modern Data Infrastructure

Over the past year, nearly all key industry metrics have hit new all-time highs, and new product categories have emerged faster than most data teams can reasonably track. This article presents a set of data infrastructure reference architectures that showcase the most relevant components across today's analytical and operational systems.

1. Reference Architecture

A unified overview of all data infrastructure use cases:

 

Sources: Generate relevant business and operational data.

Ingestion and transformation: 1) Extract data from operational business systems; 2) deliver it to storage, optionally aligning schemas between source and destination; 3) deliver analyzed data back to business systems as needed.

Storage: Store data in a format accessible to query and processing systems; optimize for data consistency, performance, cost, and scale.

Query and processing: Translate high-level code (SQL, Python, Java, Scala) into low-maintenance data processing jobs; execute queries and data models using distributed compute, covering both historical and predictive analytics.

Transformation: Transform data into structures ready for analysis; plan and orchestrate the processing resources behind the transformation pipeline.

Analysis and output: A set of interfaces that provide insight and collaboration for decision makers and data scientists; present results; embed data models in user-facing applications.
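To make the flow above concrete, here is a minimal sketch of the stages in Python, using SQLite as a stand-in for the storage and query layers; the table names and example data are purely illustrative.

```python
import sqlite3

# Hypothetical end-to-end sketch of the reference architecture:
# extract from a source system, load into storage, transform with SQL,
# and hand results to an output/analysis step. SQLite stands in for a
# cloud data warehouse; all names and values are illustrative.

def extract_from_source():
    # Ingestion: pull records from an operational system (hard-coded here).
    return [
        ("2024-01-01", "widget", 3, 9.99),
        ("2024-01-01", "gadget", 1, 24.50),
        ("2024-01-02", "widget", 5, 9.99),
    ]

def load_to_storage(conn, rows):
    # Storage: land raw data in a queryable format.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_date TEXT, product TEXT, quantity INTEGER, unit_price REAL)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)

def transform(conn):
    # Transformation: reshape raw data into an analysis-ready model.
    conn.execute(
        "CREATE TABLE daily_revenue AS "
        "SELECT order_date, SUM(quantity * unit_price) AS revenue "
        "FROM raw_orders GROUP BY order_date"
    )

def analyze(conn):
    # Query/processing and output: run a query and present the results.
    for order_date, revenue in conn.execute(
        "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
    ):
        print(f"{order_date}: {revenue:.2f}")

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_to_storage(conn, extract_from_source())
    transform(conn)
    analyze(conn)
```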

Extending the architecture for machine learning:

Data transformation: Convert raw data into data usable for model training, including supervision and labeling.

Model training and development: Train models on the processed data, typically building on pre-trained models developed on public data corpora. Track experiments and the model training process, including input data, the hyperparameters used, and final model performance. As part of an iterative loop, analyze, validate, and audit model performance, often resulting in retraining and/or additional data collection and processing.

Model inference: Prepare trained models for deployment by compiling them to relevant hardware targets and storing them for access during the inference phase. Execute trained models in real time (online) or in batches (offline) on input data. Monitor production models for data drift, harmful predictions, performance degradation, etc.

Integration: Integrate model output into user-facing applications in a structured and repeatable manner.
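A minimal sketch of that loop, assuming scikit-learn and NumPy are available; the synthetic data, model choice, file path, and drift threshold are all illustrative.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical sketch of the ML extension: transform data, train and track a
# model, store it for inference, run batch scoring, and monitor inputs.

rng = np.random.default_rng(0)

# Data transformation: turn "raw" data into features and labels.
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model training and development: fit a model and record an experiment.
model = LogisticRegression().fit(X_train, y_train)
experiment = {"params": model.get_params(), "accuracy": model.score(X_test, y_test)}
print("tracked experiment accuracy:", experiment["accuracy"])

# Model inference: store the trained model, then run batch (offline) scoring.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    production_model = pickle.load(f)
batch_predictions = production_model.predict(rng.normal(size=(10, 4)))

# Monitoring: a crude data-drift check comparing feature means.
drift = np.abs(rng.normal(size=(200, 4)).mean(axis=0) - X_train.mean(axis=0))
if (drift > 0.5).any():
    print("possible data drift detected; consider retraining")

# Integration: model output handed to an application in a structured form.
print({"predictions": batch_predictions.tolist()})
```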

Both the analytics ecosystem and the operational ecosystem continue to thrive. Cloud data warehouses such as Snowflake are growing rapidly, focused primarily on SQL users and business intelligence use cases. But the adoption of other technologies has also accelerated — data lakehouses such as Databricks, for example, are adding customers faster than ever. Many of the data teams we spoke to confirmed that heterogeneity is likely to remain in the data stack.

Other core data systems (e.g. ingestion and transformation) have proven equally durable. This is especially evident in modern business intelligence stacks, where the combination of Fivetran and dbt (or similar technology) is almost ubiquitous. But it also applies to operational systems to some degree, where de-facto standards like Databricks/Spark, Confluent/Kafka, and Astronomer/Airflow have emerged.

Blueprint 1: Modern Business Intelligence Applications

Cloud-native business intelligence for companies of all sizes

 

What hasn't changed:

The combination of data replication (such as Fivetran), cloud data warehouses (such as Snowflake), and SQL-based data modeling (using dbt) continues to form the core of this pattern. Adoption of these technologies has grown meaningfully, prompting funding and early growth of new competitors such as Airbyte and Firebolt.

Dashboards remain the most used application in the output tier, including established vendors such as Looker, Tableau, and Power BI, as well as newer entrants such as Superset.

What's new:

There is a lot of interest in the metrics layer (a system that provides a standard set of metric definitions on top of the data warehouse). This has sparked intense debate about what features it should have, who should own it, and what specifications it should follow. So far we've seen several solid pure-play offerings (such as Transform and Supergrain), as well as dbt's expansion into this category.
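As a rough illustration of the idea (not any vendor's actual specification), a metrics layer lets a team define a metric once and compile it into warehouse SQL; the definition format and helper below are invented for the sketch.

```python
# Hypothetical sketch of a metrics layer: a single, shared metric definition
# is compiled into SQL for the warehouse. The definition schema is made up
# for illustration and does not follow any particular vendor's spec.

METRICS = {
    "weekly_active_users": {
        "table": "analytics.events",
        "expression": "COUNT(DISTINCT user_id)",
        "time_column": "event_date",
        "filters": ["event_name = 'login'"],
    }
}

def compile_metric(name, grain="week"):
    m = METRICS[name]
    where = " AND ".join(m["filters"]) or "TRUE"
    return (
        f"SELECT DATE_TRUNC('{grain}', {m['time_column']}) AS period, "
        f"{m['expression']} AS {name} "
        f"FROM {m['table']} WHERE {where} GROUP BY 1 ORDER BY 1"
    )

# Any BI tool, notebook, or application asking for this metric gets the
# same definition and therefore the same numbers.
print(compile_metric("weekly_active_users"))
```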

There has been a significant growth of reverse ETL vendors, notably Hightouch and Census. The purpose of these products is to update operational systems such as CRM or ERP with output and insights from the data warehouse.
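A hedged sketch of the reverse ETL pattern in Python: read modeled results out of the warehouse and write them into an operational tool. The CRM endpoint, auth scheme, payload shape, and score column are placeholders, not any vendor's real API.

```python
import requests

# Hypothetical reverse-ETL sketch: push modeled results from the warehouse
# back into an operational system (here a fictional CRM REST API).

CRM_URL = "https://crm.example.com/api/contacts"   # placeholder endpoint
API_KEY = "replace-me"                             # placeholder credential

def sync_churn_scores(rows):
    """Push (email, churn_score) pairs, e.g. from a warehouse query, into the CRM."""
    for email, score in rows:
        # Write each record where go-to-market teams already work.
        response = requests.patch(
            f"{CRM_URL}/{email}",
            json={"custom_fields": {"churn_score": score}},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=10,
        )
        response.raise_for_status()
```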

Data teams are showing stronger interest in new applications to enhance their standard dashboards, especially data workspaces such as Hex. Broadly speaking, the new applications are likely to be the result of increasing standardization of cloud data warehouses—once data is clearly structured and accessible, data teams will naturally want to do more with it.

Data discovery and observability companies proliferated and raised significant funding (notably Monte Carlo and Bigeye). While the benefits of these products are clear, namely more reliable data pipelines and better collaboration, adoption of these products is relatively early as customers discover relevant use cases and budgets. (Technical Note: While there are several solid new vendors in data discovery such as Select Star, Metaphor, Stemma, Secoda, Castor, we have excluded seed-stage companies from the chart.)
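The kinds of checks these tools automate can be sketched simply, for example freshness and volume tests; the thresholds and table names below are illustrative.

```python
import datetime
import sqlite3

# Hypothetical data-observability sketch: simple freshness and row-count
# checks of the kind observability tools run continuously across pipelines.

def check_freshness(conn, table, ts_column, max_lag_hours=24):
    # Has the table received data recently enough?
    (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if latest is None:
        return False
    lag = datetime.datetime.utcnow() - datetime.datetime.fromisoformat(latest)
    return lag <= datetime.timedelta(hours=max_lag_hours)

def check_row_count(conn, table, expected_min, expected_max):
    # Is the table's volume within the expected range?
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return expected_min <= count <= expected_max

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_ts TEXT)")
    conn.execute("INSERT INTO raw_orders VALUES (?)",
                 (datetime.datetime.utcnow().isoformat(),))
    print("fresh:", check_freshness(conn, "raw_orders", "order_ts"))
    print("volume ok:", check_row_count(conn, "raw_orders", 1, 1_000_000))
```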

Blueprint 2: Multimodal Data Processing

Evolving data lakes supporting analytical and operational use cases – also known as modern infrastructure for Hadoop refugees

Darker boxes are new or meaningfully changed; lighter boxes remain largely unchanged. Gray boxes are considered less relevant to this blueprint.

 

What hasn't changed:

Core systems for data processing (such as Databricks, Starburst, and Dremio), transport (such as Confluent and Airflow), and storage (such as AWS) continue to grow rapidly and form the backbone of this blueprint.

Multimodal data processing maintains variety in design, allowing companies to adopt the system that best suits their specific needs in analytical and operational data applications.

What's new:

There is growing acceptance and clarity about the lakehouse architecture. We're seeing this approach supported by a wide range of vendors (including AWS, Databricks, Google Cloud, Starburst, and Dremio), as well as by data warehouse pioneers. The core value of the lakehouse is to pair a robust storage layer with an array of powerful data processing engines (such as Spark, Presto, Druid/ClickHouse, Python libraries, etc.).
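A minimal sketch of that pattern, assuming PySpark is installed and object-storage credentials are configured; the bucket path and schema are placeholders.

```python
from pyspark.sql import SparkSession

# Hypothetical lakehouse sketch: an open storage layer (Parquet files in
# object storage) queried directly by a processing engine (Spark SQL).
# The S3 path is a placeholder and assumes connectors/credentials are set up.

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Read open-format files straight from the storage layer.
events = spark.read.parquet("s3a://example-bucket/lake/events/")

# Expose them to the SQL engine and run an analytical query.
events.createOrReplaceTempView("events")
daily = spark.sql(
    "SELECT event_date, COUNT(*) AS events "
    "FROM events GROUP BY event_date ORDER BY event_date"
)
daily.show()
```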

The storage layer itself is being upgraded. While technologies like Delta, Iceberg, and Hudi are not new, they are being adopted at an accelerated rate and built into commercial products. Some of these technologies (notably Iceberg) also interoperate with cloud data warehouses such as Snowflake. If heterogeneity continues, this could become a critical part of the multimodal data stack.
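A sketch of what the upgraded storage layer looks like in practice with Delta Lake (similar ideas apply to Iceberg and Hudi), assuming the delta-spark package is installed; the local path and sample rows are illustrative.

```python
from pyspark.sql import SparkSession

# Hypothetical sketch of a table format on top of plain file storage.
# Assumes the delta-spark package and its jars are available; the path
# and data are placeholders.

spark = (
    SparkSession.builder.appName("table-format-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "product"])

# Writes become ACID transactions on top of ordinary object/file storage.
df.write.format("delta").mode("overwrite").save("/tmp/delta/products")

# Time travel: read an earlier version of the same table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/products")
v0.show()
```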

Adoption of stream processing (i.e. real-time analytical data processing) is likely to rise. While first-generation technologies like Flink haven't hit the mainstream yet, new entrants with simpler programming models such as Materialize and Upsolver are seeing early adoption, and, interestingly, usage of the stream processing offerings from incumbents Databricks and Confluent has also started to accelerate.
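The appeal of the simpler programming model is easiest to see conceptually: rather than re-running batch queries, the system keeps results up to date as events arrive. The plain-Python sketch below illustrates the idea only; real systems express it as SQL or dataflow programs over Kafka-like streams.

```python
from collections import defaultdict

# Conceptual sketch of stream processing / incremental view maintenance:
# the "view" is updated as each event arrives instead of being recomputed.
# Event fields are illustrative.

revenue_by_product = defaultdict(float)  # the continuously maintained result

def on_event(event):
    # Apply one incoming event to the materialized result.
    revenue_by_product[event["product"]] += event["quantity"] * event["price"]

stream = [
    {"product": "widget", "quantity": 2, "price": 9.99},
    {"product": "gadget", "quantity": 1, "price": 24.50},
    {"product": "widget", "quantity": 3, "price": 9.99},
]

for event in stream:                      # in production, an unbounded stream
    on_event(event)
    print(dict(revenue_by_product))       # results are always up to date
```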

 

Blueprint 3: Artificial Intelligence and Machine Learning

Architecture for robust development, testing, and operation of machine learning models

What hasn't changed:

Compared to 2020, tools for model development are broadly similar today, including the major cloud platforms (such as Databricks and AWS), ML frameworks (such as XGBoost and PyTorch), and experiment management tools (such as Weights & Biases and Comet).
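A minimal sketch of this workflow, assuming the xgboost, scikit-learn, and wandb packages are installed and a Weights & Biases account is configured; the synthetic dataset, project name, and hyperparameters are illustrative.

```python
import numpy as np
import wandb
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical sketch of model development with experiment tracking:
# train a gradient-boosted model and log the run so it can be compared
# with earlier experiments.

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=2000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

params = {"n_estimators": 200, "max_depth": 4, "learning_rate": 0.1}
run = wandb.init(project="demo-experiments", config=params)  # track the run

model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)

# Log evaluation metrics so this experiment is comparable to previous runs.
wandb.log({"val_accuracy": float(model.score(X_val, y_val))})
run.finish()
```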

Experiment management has effectively subsumed model visualization and tuning as standalone categories.

Building and operating a machine learning stack is complex and requires specialized knowledge. This blueprint is not for the faint of heart, and productizing AI remains a challenge for many data teams.

What's new:

The ML industry is consolidating around a data-centric approach, emphasizing complex data management over incremental modeling improvements. This has several implications:

The rapid growth of data labeling vendors (such as Scale and Labelbox), and growing interest in closed-loop data engines, largely modeled on Tesla's Autopilot data pipeline.

Increased adoption of feature stores such as Tecton, a way to collaboratively develop production-grade ML features for both batch and real-time use cases.
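Conceptually, a feature store lets teams define features once and retrieve point-in-time-correct values for training and serving. The pandas sketch below illustrates that idea only and is not Tecton's (or any vendor's) actual API; all names and data are made up.

```python
import pandas as pd

# Conceptual feature-store sketch: features are defined once, computed
# consistently, and joined to training examples as of the label timestamp
# so batch training matches online serving.

orders = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-03"]),
    "amount": [10.0, 20.0, 5.0],
})

def user_total_spend(as_of):
    # Feature definition: total spend per user up to a point in time.
    past = orders[orders["ts"] <= as_of]
    return past.groupby("user_id")["amount"].sum().rename("total_spend")

labels = pd.DataFrame({
    "user_id": [1, 2],
    "label_ts": pd.to_datetime(["2024-01-04", "2024-01-04"]),
    "churned": [0, 1],
})

# Point-in-time join: each row only sees feature values known at label time.
rows = []
for _, row in labels.iterrows():
    feats = user_total_spend(row["label_ts"])
    rows.append({**row.to_dict(), "total_spend": feats.get(row["user_id"], 0.0)})

print(pd.DataFrame(rows))
```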

There has been renewed interest in low-code ML solutions such as Continual and MindsDB that at least partially automate the ML modeling process. These newer solutions focus on bringing new users (i.e. analysts and software developers) into the ML market.

Using pre-trained models is becoming the default, especially in NLP, and has given companies like OpenAI and Hugging Face a tailwind. There are still interesting questions to be addressed around fine-tuning, cost, and scaling.
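A minimal sketch of the pre-trained-model workflow using the Hugging Face transformers library; it assumes the package is installed, downloads default checkpoints on first use, and the example text is illustrative.

```python
from transformers import pipeline

# Sketch of using pre-trained NLP models rather than training from scratch.

# Sentiment analysis with a default pre-trained checkpoint.
classifier = pipeline("sentiment-analysis")
print(classifier("The new data stack cut our reporting time in half."))

# Zero-shot classification is another common pre-trained starting point.
zero_shot = pipeline("zero-shot-classification")
print(zero_shot(
    "Customers report the dashboard loads slowly on Mondays.",
    candidate_labels=["performance issue", "billing question", "feature request"],
))
```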

Operational tools for ML (sometimes called MLOps) are becoming more mature, built initially around ML monitoring, the most in-demand use case and the one with the most immediate budget. At the same time, a range of newer operational tools (notably for validation and auditing) are emerging, and their ultimate market remains to be determined.
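A simple sketch of the monitoring idea: compare a production feature's distribution against the training distribution and flag drift. The synthetic data and significance threshold are illustrative; real monitoring products track many such signals continuously.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical ML-monitoring sketch: a two-sample Kolmogorov–Smirnov test
# between a training feature and the same feature in production traffic.

rng = np.random.default_rng(7)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
production_feature = rng.normal(loc=0.4, scale=1.0, size=1000)  # shifted

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"data drift suspected (KS statistic={statistic:.3f}, p={p_value:.1e})")
else:
    print("no significant drift detected")
```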

There is growing focus on how developers can seamlessly integrate ML models into applications, including through pre-built APIs such as OpenAI, vector databases such as Pinecone, and more opinionated frameworks.
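Conceptually, the vector-database pattern is: embed documents, embed the query, and return nearest neighbors by similarity. The sketch below uses a toy embedding function as a stand-in for a real embedding model or hosted API, and an in-memory array in place of a managed vector database such as Pinecone.

```python
import zlib
import numpy as np

# Conceptual sketch of the embedding + vector-search integration pattern.
# embed() is a deterministic toy function, NOT a real embedding model;
# with a real model, the top hit would be the semantically closest document.

def embed(text, dim=64):
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)           # unit-length vector

documents = [
    "How to reset your password",
    "Quarterly revenue dashboard is delayed",
    "Connecting the warehouse to the CRM",
]
index = np.stack([embed(d) for d in documents])   # in-memory "vector index"

query = embed("password reset instructions")
scores = index @ query                            # cosine similarity
best = int(np.argmax(scores))
print(documents[best], float(scores[best]))
```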

 
