Building a new data platform in the era of AI large language models

This article discusses how large language models are reshaping the future of well-known data platform leaders such as Databricks and Snowflake.

Delta Lake, Hudi, and Iceberg in the storage field; Databricks and Snowflake in the real-time data processing field

1. LLM brings changes to big data companies

Large models bring wide-ranging changes to enterprises. They can replace the work of many roles, such as data development, data tuning, and database administration (DBA). The success of a large model depends on three elements: the model, data, and computing power. Bloomberg has released a large model named "BloombergGPT", focused on the news and finance fields. Because Bloomberg has accumulated rich data in these fields, the resulting model is superior in knowledge depth and logical structure.

2. LLMs unleash the value of data

AI is being integrated into the data platform as a core capability. The AI toolchain is still evolving rapidly and has undergone many changes, so enterprise infrastructure needs to stay flexible. A plug-in system can be built with UDFs, function compute, or a dedicated pipeline-management system. LLM applications have many components, such as LangChain, vector databases, and LLM runtimes; combining them makes it easy to build an end-to-end LLM service link. Many new, easier-to-use LLMOps components are also emerging, such as Lepton.ai and XInference.
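To make the plug-in idea concrete, here is a minimal sketch (not any specific vendor's API) of a pipeline manager in which stages are registered as plain Python functions, analogous to UDFs, and chained into an end-to-end service link. The `retrieve` and `generate` stages are stubs standing in for a real vector-database query and a real LLM runtime call.

```python
# Toy pipeline manager: each stage is a registered plug-in function,
# and a service link is just an ordered list of stage names.
PLUGINS = {}

def register(name):
    def wrap(fn):
        PLUGINS[name] = fn
        return fn
    return wrap

@register("retrieve")
def retrieve(query):
    # A real stage would query a vector database here.
    return {"query": query, "context": "Databricks and Snowflake are data platforms."}

@register("generate")
def generate(payload):
    # A real stage would call an LLM runtime; we format a stub answer.
    return f"Q: {payload['query']} | grounded on: {payload['context']}"

def run_pipeline(stages, query):
    result = query
    for name in stages:
        result = PLUGINS[name](result)
    return result

print(run_pipeline(["retrieve", "generate"], "What is a lakehouse?"))
```

Because stages are looked up by name at run time, swapping in a different retriever or model runtime only requires registering a new function, which is the flexibility the text argues enterprise infrastructure needs.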

3. Comparing open-source products Spark/Flink/ClickHouse with SaaS-based Snowflake

A fourth computing mode is incremental computation. The hope is to unify the three traditional computing modes (batch, streaming, and interactive analysis) through incremental computing, ultimately forming a single integrated engine.

Flink was an early adopter of unified solutions, proposing the slogan of "stream-batch unification". So far, however, there are not many production implementations.
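The core idea of incremental computation can be sketched in a few lines: instead of recomputing an aggregate over the full history on every new batch (the classic batch mode), the engine maintains a small state and folds each delta in. This is a simplified illustration, not how any particular engine implements it.

```python
# Incremental aggregation: state (total, count) is updated per micro-batch
# in O(len(rows)), independent of how much history has accumulated.
class IncrementalAvg:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def apply_delta(self, rows):
        self.total += sum(rows)
        self.count += len(rows)
        return self.value()

    def value(self):
        return self.total / self.count if self.count else 0.0

agg = IncrementalAvg()
agg.apply_delta([10, 20])   # first batch
agg.apply_delta([30])       # new delta arrives

# Full recomputation over all history gives the same answer:
assert agg.value() == sum([10, 20, 30]) / 3 == 20.0
```

The same state machine serves batch (apply one big delta), streaming (apply many small deltas), and interactive queries (read the current state), which is why incremental computing is a candidate for unifying the three modes.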

4. BI and AI/ML are gradually converging

Key concerns include decoupling the systems, balancing openness against high performance, and linking the two computing modes. SQL is the mainstream language in data analysis, while Python is the most popular in AI, so programming both systems conveniently is a key challenge. Options include SQLML, SQL with embedded Python UDFs, Python's SQLAlchemy library, and native Python interfaces.
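The "SQL with embedded Python UDFs" option can be illustrated with `sqlite3` from Python's standard library: a Python function is registered on the connection and then called directly from SQL. The `sentiment_stub` scoring function here is a hypothetical stand-in for a real ML model.

```python
import sqlite3

def sentiment_stub(text):
    # Hypothetical scorer standing in for a real model inference call.
    return 1.0 if "good" in text.lower() else 0.0

conn = sqlite3.connect(":memory:")
# Register the Python function as a 1-argument SQL function named SENTIMENT.
conn.create_function("SENTIMENT", 1, sentiment_stub)

conn.execute("CREATE TABLE reviews (body TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?)",
                 [("good product",), ("bad service",)])

rows = conn.execute("SELECT body, SENTIMENT(body) FROM reviews").fetchall()
print(rows)  # [('good product', 1.0), ('bad service', 0.0)]
```

The analyst keeps writing SQL while the ML logic lives in Python, which is the division of labor the paragraph above describes; warehouse engines expose the same pattern through their own UDF mechanisms.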

5. With BI+AI/ML and even LLMs, data platforms need to gradually support OLAP, OLTP, streaming, graph, and vector workloads

The data field divides into three broad directions: OLTP, OLAP, and AI. Typical scenarios in OLAP data analysis are largely fixed, and there is clear industry consensus on the problems of the Lambda architecture. An integrated architecture that unifies all analytical workloads is the future direction.

OLAP+AI integration is currently a hot topic, and the overlap and interaction between these two kinds of workloads are strong. Databricks has long focused on this direction and has consistently pursued its Data+AI strategy. Snowflake started from the OLAP field and has recently accelerated its AI support, for example with Snowpark.

Data platforms should support data analysis alongside other computing paradigms. Once both the SQL engine and the AI engine are well supported, the data analysis architecture will become unified. The field may eventually converge on incremental computing, gradually breaking the limitations of the Lambda architecture, and the integrated architecture will be the future. Just as we predicted two years ago that the lakehouse would become the future, we hope the integrated architecture will be truly implemented within two years.

Large language models bring significant gains in processing semi-structured and unstructured data. Data that was once almost impossible to work with, such as the content of a PDF file, has become relatively easy to handle. Where platforms could previously process only structured data, two more categories, semi-structured and unstructured data, are now within reach. This dramatic increase in processing capability will inevitably drive a significant increase in storage and computing requirements.

With the arrival of large language models, data exchange and privacy protection will receive more investment. Requirements for data security and privacy have further increased, and the need for data sharing has become more urgent, because data is essentially knowledge.

BI+AI has become a must for data platforms, which need built-in or pluggable support for technologies such as heterogeneous data handling, fine-tuning, and vector retrieval. AI makes every platform intelligent, and intelligent data platforms are inevitable. Data platforms that significantly lower the barrier to use will be used by more people.
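Vector retrieval, the capability named above, reduces at its core to ranking stored embeddings by similarity to a query embedding. A minimal sketch with hand-made toy vectors (a real platform would produce embeddings with a model and use an approximate index rather than a full scan):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: document name -> embedding vector.
index = {
    "doc_sales": [0.9, 0.1, 0.0],
    "doc_ml":    [0.1, 0.9, 0.2],
    "doc_legal": [0.0, 0.2, 0.9],
}

def top_k(query_vec, k=2):
    # Brute-force scan; real systems use ANN indexes (e.g. HNSW) instead.
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

print(top_k([0.2, 0.8, 0.1]))  # doc_ml ranks first
```

Whether this lives built-in to the platform or as a plug-in, the interface is the same: store vectors next to the data they describe and expose a similarity-ranked lookup.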


Origin blog.csdn.net/ejinxian/article/details/132777222