Large models crash the party: data lake vs. data warehouse, which gets eliminated first in the lakehouse selection?


It's always like this:

When the pressure starts to show, you quietly think about changing.

When the pressure goes through the roof, you change immediately.

Let's start with a well-known American company: Databricks.

Databricks has innovation in its DNA.

Its co-founder and CEO Ali Ghodsi, ranked No. 1,645 on the 2022 Forbes Global Billionaires List and one of Sweden's richest people, has no shortage of money and is willing to spend it on the company.

He has repeatedly stated publicly that he will not consider reducing R&D investment.

Earlier (a few years before large models arrived), Databricks already had a very important capability; call it a "two-in-one" capability:

big data capabilities plus traditional artificial intelligence capabilities.

Collectively referred to as "Data+AI" capabilities.

More precisely: the capabilities of a "Data+AI" platform.

Databricks already had the functions of a traditional AI platform.

After all, it bills itself as a one-stop shop.

In the past, traditional AI could be filed under "advanced" data analysis services, for scenarios such as prediction.

After large models emerged, that classification became obsolete.

Large models are not just for analysis; they are intelligent.

So the baseline requirement for an AI platform has risen with the tide: it must be able to train large models.

Yet Databricks, the straight-A student of a Data+AI platform spanning two worlds, with its "two-in-one" capability in hand early on, never grew a generative-AI large model on its own home turf.

It had everything it was supposed to have, yet unexpectedly watched itself fall behind.

How much data does a large model need? Here is one fact from training:

fine-tuning a large model with hundreds of billions of parameters takes about two months and consumes roughly 20 TB of data.
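To get a feel for that 20 TB figure, here is a rough back-of-envelope sketch in Python; the bytes-per-token ratio is my own assumption, not a number from the article.

```python
# Back-of-envelope: roughly how many tokens could 20 TB of text hold?
# ASSUMPTION: ~4 bytes per token on average for plain text (not from the article).
DATA_BYTES = 20 * 10**12      # 20 terabytes of fine-tuning data
BYTES_PER_TOKEN = 4           # rough average; varies by language and tokenizer

tokens = DATA_BYTES / BYTES_PER_TOKEN
print(f"~{tokens:.0e} tokens")  # ~5e+12, on the order of trillions of tokens
```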

This means that after large models arrived, the "market value" of big data changed; it can finally hold its head high.

Because large models can extract the value in big data far more thoroughly.

It doesn't matter how long the data has been sitting around,

or how thick the dust on it has grown.

What matters is feeding it to the large model quickly,

letting the model "learn" all this long-shelved knowledge.

Once change arrives, all kinds of challenges surface.

Now it is the large model's turn to pose problems for the "two-in-one" platform.

First, there are many more data types.

Different data comes in different modalities; with more data and more modalities, large models are evolving toward multimodality.

A typical multimodal large model trains on three data types: images, text, and audio. But after American large models took the lead, everyone turned aggressive: hold good cards, and you play your bombs one after another.

May 9, 2023

US vendor Meta released ImageBind, a large model that takes vision as its core and binds text, audio, depth, thermal (infrared radiation), and motion (inertial sensor) data, covering six modalities.

Coincidentally.

On the afternoon of May 26, 2023,

the domestically developed "Zidong Taichu" 2.0 full-modality large model was released, featuring modalities including text, images, speech, video, 3D point clouds, and sensor signals.

Writing this, I can't help but marvel: within the same month of May, from the 9th to the 26th, the multimodal race was already playing at a frantic tempo.

Second, there are more computing engines.

According to Jia Yangqing, from a technical standpoint, data computing and AI computing are separate.

Data uses a data platform, and AI uses an AI platform.

Today, neither the data platform nor the AI platform can use its own experience to solve the other side's problems, because the technologies behind the two are completely different.

Earlier big data computing engines mainly supported computation over structured data.

Different computing engines have different optimization goals (data freshness, query performance, cost), as well as different development languages, computing semantics, and storage systems, which makes assembling them extremely difficult.

And AI needs its own engine.

One computing engine is not enough; this problem arose in the era of big data system products.

One kind of computing engine is not enough; this problem also appeared in the era of traditional artificial intelligence.

And now, wonderful: multiple computing engines at once.

Let's see how your Data+AI architecture supports that.

In the era of large models, the problems of the Data+AI architecture have visibly worsened.

Third, large model iterations are too fast.

Sometimes the unit is a week, sometimes a day. The large model has a performer's personality, and its act is "high-speed evolution".

So many new things appear that you tremble just watching them and burn the midnight oil just learning them.

Fourth, the computational load of large models will only increase, not decrease.

People will likely agree that:

in the foreseeable future, the AI load brought by large models will dominate.

Therefore, prepare for "more computation".

In the past, traditional AI load was a small share,

say 5%, so AI could be treated as a separate component.

Now its status is nothing like before:

the share of large-model AI computing load has climbed from 10% to 80%.

The nature of the problem has changed.

This is the story of a newcomer pressuring the old guard into changing.

Databricks' inner monologue:

"Folks, who can relate?"

A big data platform architecture is complex; a Data+AI platform architecture is more complex still.

And when large models arrive, the Data+AI platform architecture gets even more complex.


Most importantly, the quality of this kind of platform architecture determines the ceiling of its capability.

How to deal with it?

There is no mature one-step solution yet,

so let's look back at the history of platform architecture for inspiration.

Behind the large model still stands big data, and that technology has a long history of its own.

2023 marks the 23rd year of big data technology (counting from 2001, when Google began building a big data platform for its search business).

The architecture of a purely big data system is also very complicated.

Either the big internet companies build their own on top of open source;

their line: "Just do it."

Or they adopt a public cloud platform architecture and buy PaaS services;

their line: "If you have money, you should know how to spend it; doing the selection yourself is a lot of trouble."

Or they outsource it;

their line: "Money buys services. We are not sensitive to the technology stack or the selection, but that doesn't stop us from demanding high stability."

Observing from the perspective of platform technology architecture reveals more of the essence.

Because the "two-in-one" platform architecture roughly divides into two parts: compute and storage.

AI is still iterating at high speed, but the Data+AI architecture cannot iterate nearly that fast.

So what we really need is a robust, scalable architecture.

Does that make the compute part unimportant?

No. But compute can be relocated, and adding GPUs and CPUs is not that hard.

Data, however, is not easy to move once stored, given the high cost of long-haul bandwidth between data centers.

So storage matters more.
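A quick sketch of why stored data is this "heavy". The dataset size and link speeds below are illustrative assumptions, not figures from the article:

```python
# Why moving stored data is hard: time to copy a data lake between data centers.
# ASSUMPTIONS: dataset size and sustained link bandwidths are illustrative only.
DATASET_BYTES = 500 * 10**12           # a hypothetical 500 TB data lake

def transfer_days(bandwidth_gbps: float) -> float:
    """Days to move DATASET_BYTES at a sustained bandwidth given in Gbit/s."""
    bytes_per_second = bandwidth_gbps * 10**9 / 8
    return DATASET_BYTES / bytes_per_second / 86_400

for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbit/s -> {transfer_days(gbps):5.1f} days")
# 1 Gbit/s -> ~46 days; even a sustained 100 Gbit/s link takes about half a day,
# and long-haul bandwidth at that rate is expensive. GPUs and CPUs, by contrast,
# can simply be added where the data already lives.
```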

And so the Data+AI platform cannot get around the old trio:

data lake, data warehouse, lakehouse.


Observing them is essentially observing the Data+AI platform from the perspective of storage.

In fact, none of them counts as a pure single product; each embodies a "storage architecture".

Because such "two-in-one" platforms typically include multiple components,

and different combinations of components yield a variety of system architectures, which makes things very hard.

The computer system software architecture is essentially a durable good.

The core of what deserves to be called a "good" architecture is that

it lasts. If a new architecture has to emerge every six months or a year,

then that architecture is probably seriously ill.

So the time scale of its iteration can be very long.

Looking back, two factions have been developing in parallel from the very beginning.

One faction, the data warehouse, has been developing for more than 40 years; its mainstream computing paradigm is the two-dimensional relational expression.

So for more than a decade, data warehouses have been dominated by relational computing architectures,

and the time scale of their architectural iteration can be a decade.

The other faction: the data lake.

Big data originated with the data lake (2006),

and data lake solutions were born at leading technology companies: Google and Yahoo.

The pioneer of the data lake school was the Google File System (GFS), which was born as a data lake architecture.

The same is true of the Hadoop Distributed File System (HDFS), the open-source counterpart of GFS.

What the data lake school has in common is a standard data lake architecture: a computing engine on top, a set of standard storage underneath (a file system, so you can store anything), with unified metadata inside.

The data lake school has many followers; Spark and Presto (the data query engine developed by Facebook) are the computing power atop the data lake.

They all emphasize one thing: the separation of storage and compute.

Many pieces can be combined flexibly,

such as storage systems, resource scheduling systems,

and a variety of different computing engines, as the sketch below illustrates.
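To make "engine on top, shared storage underneath" concrete, here is a minimal PySpark sketch. The bucket paths and the "ts" column are hypothetical, and any lake engine (Presto included) could point at the same files:

```python
# Data-lake pattern sketch: a compute engine (Spark) reads files directly from
# shared storage, so storage and compute scale and fail independently.
# ASSUMPTIONS: the s3a:// paths and the "ts" column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

# The storage layer is just files; nothing here is tied to one engine.
events = spark.read.parquet("s3a://example-lake/raw/events/")

daily_counts = events.groupBy(F.to_date("ts").alias("day")).count()

# Results land back on the same shared storage for any other engine to use.
daily_counts.write.mode("overwrite").parquet("s3a://example-lake/derived/daily_counts/")
```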

Two schools, two lanes, developing in parallel, both doing well.

In terms of cost, free open source leans toward the data lake, while paid enterprise-grade services lean toward the data warehouse.

After a while, a new architecture emerged.

The main reason is that everyone suddenly noticed that data analysis on the data lake was not efficient enough,

involving issues such as the tight coordination of storage and compute.

Therefore, the overall big data architecture has been developing toward the data warehouse's lane.

That is why systems like ClickHouse adopt a newer design that owns its storage: not a separated architecture but a more integrated one that does the work internally.

The lakehouse (lake-warehouse integration) got started only in recent years; observed on a ten-year time axis,

it has moved just a short distance forward, and the lakehouse remains a relatively new architecture.

In essence, the lakehouse combines the openness and flexibility of the data lake with the efficiency and management capabilities of the data warehouse.
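Delta Lake, the open table format underlying Databricks' lakehouse, illustrates that combination concretely: plain files on lake storage gain warehouse-style ACID commits and versioned snapshots. A minimal sketch, assuming a local path standing in for object storage and the delta-spark package:

```python
# Lakehouse sketch: Delta Lake layers ACID transactions and time travel over
# ordinary data-lake files. ASSUMPTION: a local /tmp path stands in for object
# storage; requires the delta-spark package.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/lakehouse-sketch/users"
spark.range(5).write.format("delta").mode("overwrite").save(path)   # version 0
spark.range(10).write.format("delta").mode("overwrite").save(path)  # version 1

# Warehouse-style management on lake files: read an earlier committed snapshot.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 5
```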

In the first quarter of 2022, the "Data50" list from well-known Silicon Valley investment firm a16z showed that Databricks' segment (Query & Processing) had attracted astonishing investment, nearly 50% of all funding across the data-company tracks.

Databricks' own large financing rounds account for much of that, but the underlying reason is that slow data analysis (query processing) hurts the business: a hard requirement tied to customers' survival.

In other words, before large models became popular, the AI load share was small, and many companies treated AI as a relatively independent component.

After large models came out,

client companies began asking how the piles of data in their databases could be consumed by AI.

The core question for a "two-in-one" platform company has become:

can it support AI loads well?

The AI of today is not the AI of yesterday.

AI is nothing like before; it is now a first-class citizen.

At the very least, AI stands on an equal footing with data analysis.

Therefore, in the evolution of the lakehouse-integrated storage architecture, AI has effectively cast its vote for the data lake.

Because the data warehouse handles structured and semi-structured data, while AI demands the ability to handle unstructured as well as semi-structured data.

So you can read it this way: the large model is pressing on the lakehouse architecture and pushing it forward, as the sketch below suggests.
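One way to picture AI's "vote" for the lake: the same storage that serves SQL tables can also hold raw, unstructured files for a training pipeline to read directly. A hedged sketch; the path is hypothetical:

```python
# Unstructured data on the lake, read for AI: Spark's binaryFile source loads
# raw images (or audio, PDFs, ...) as bytes, content that a relational
# warehouse table was never designed for. ASSUMPTION: the path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unstructured-sketch").getOrCreate()

images = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")
    .load("s3a://example-lake/raw/images/")
)

# Each row carries path, modificationTime, length, and content (raw bytes),
# ready to hand off to a preprocessing or model-training step.
images.select("path", "length").show(5, truncate=False)
```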

The story also ends with the same company: Databricks.


Databricks paid $1.3 billion out of its own pocket to acquire artificial intelligence startup MosaicML.

MosaicML's products become part of Databricks' Lakehouse AI suite.

At the recent Data + AI Summit 2023, Databricks could be seen throwing more weight behind its large-model toolchain.

Meanwhile, the large-model companies in the "battle of a hundred models" are going all out too.

Both sides want to win customers as early as possible.

Missing out, after all, is never a good thing.

Some people are always quick to make changes.

(End)

One More Thing

To avoid being accused of clickbait, here is a direct answer to the question in the title:

After the advent of large models, in the selection of future-oriented data platforms, traditional data warehouse products designed solely for structured relational expression will be eliminated first.


Plug time:

Teacher Tan's new book, "I Saw the Storm", is available on JD.com.


Read more

AI large model and ChatGPT series:

1. ChatGPT is on fire; how do you found an AIGC company and make money?

2. ChatGPT: never bully liberal arts students

3. How does ChatGPT learn by analogy?

4. Exclusive | From the departure of star researchers Alex Smola and Li Mu to their AWS-alumni startup's successful financing: the evolution of "underlying weapons" in the ChatGPT large-model era

5. Exclusive | Former Meituan co-founder Wang Huiwen is "acquiring" the domestic AI framework OneFlow, adding a new general to Light Years Beyond

6. Is using ChatGPT-style large models in criminal investigation and case-solving just a fictional story?

7. Game of Thrones of the large-model "economy on the cloud"

8. CloudWalk's large model: what is the relationship between large models and AI platforms? Why build an industry model?

9. In-depth chat with 4Paradigm's Chen Yuqiang | How can AI large models open up the trillion-scale traditional software market?

10. In-depth chat with JD Technology's He Xiaodong | A "departure" nine years ago: laying the groundwork for multimodality and competing for large models

11. Old store, new customers: things no one tells you about vector database selection and betting

AI large model and academic paper series:

1. Does ChatGPT's open-source "imitation" really work? A UC Berkeley paper: be persuaded, or press on?

2. In-depth chat with Wang Jinqiao | Zidong Taichu: how many high-quality papers does a homegrown large model take? (Part 2)

3. In-depth chat with Zhang Jiajun | Which papers behind the "Zidong Taichu" large model are worth reading? (Part 1)

Comic series:

1. Joy or sorrow? AI now finishes our Office work for us

2. If the AI algorithm is a brother, isn't AI operations and maintenance a brother too?

3. How did big data's social clout come about?

4. AI for Science: is it "science or not"?

5. If AI wants to help mathematicians, how far along is it?

6. The one calling Wang Xinling turns out to be the magical smart lakehouse

7. So the knowledge graph is a cash cow for "finding relationships"?

8. Why can graph computing fleece the wool of the black industry?

9. AutoML: saving up money to buy a "Shan Xia robot"?

10. AutoML: your favorite hot pot base is purchased automatically by robots

11. Reinforcement learning: when AI plays chess, how many moves ahead can it see with each step?

12. Time-series databases: a close call, almost missed out on high-end industrial manufacturing

13. Active learning: has artificial intelligence been getting PUA'd?

14. Serverless cloud computing: an arrow pierces the clouds, and armies in the thousands come to meet it

15. Data center networks: data arrives on the battlefield in 5 nanoseconds

16. Data center networks: being late isn't scary; what's scary is when no one else is late

AI framework series:

1. The people building deep learning frameworks: lunatics or liars? (1)

2. The people building AI frameworks | A prairie fire: Jia Yangqing (2)

3. The people building AI frameworks (3): the fanatical AlphaFold and the silent Chinese scientists

4. The people building AI frameworks (4): the AI framework prequel, the past of big data systems

Note: (3) and (4) are only included in "I Saw the Storm".



Origin: blog.csdn.net/weixin_39640818/article/details/131799047