How can an enterprise's large model become an "encyclopedia" of its own data?

Author | Guo Wei

Editor | Debra Chen

In today's business environment, the management and application of big data have become a core component of corporate decision-making and operations. However, with the explosive growth of data volume, how to effectively utilize this data has become a common challenge.

This article discusses how large models fit into a company's big data architecture, and how Apache SeaTunnel and WhaleStudio can be used to turn the company's internal data into an "encyclopedia", so that big data and large models together improve business operating efficiency.

The position of large models in a company's overall big data architecture

Today, companies large and small run into the same problem: a huge amount of data has accumulated inside the company, but how should it be used?

The emergence of large models has opened up a whole new way to use this data. The question is: how do you feed large amounts of company data into a model and make it "your" large model?

And how do you inject the company's internal data into a large model and turn it into an "encyclopedia" of the business?

Overview of Big Data and Big Model Architecture

To answer these questions, we first need to figure out where large models fit into an enterprise's complex data architecture. A widely used big data architecture diagram looks like this:

[Figure: overview of a typical enterprise big data architecture]

When enterprises deal with big data, they usually divide the data into two categories: real-time data and batch data. Real-time data can come from various sources such as Internet of Vehicles, database logs, click streams, etc., while batch data may include files, reports, CSV files, etc. These data can be processed through various tools and technologies, such as Apache Kafka, Amazon Kinesis, etc., and finally integrated into the enterprise's big data analysis system.

Large models play a vital role in the big data architecture. They can process and analyze large amounts of data to provide businesses with deep insights and predictions. Large models can be integrated in two main ways:

  1. Fine-tuning an open-source model: an enterprise can take a large open-source model and optimize it on its own data to improve performance. This approach is more complex and harder for ordinary users to operate, but it produces a highly customized model. For specific training methods, please refer to "Train your own private ChatGPT with the money of a cup of Starbucks".
  2. Data vectorization: alternatively, the data can be vectorized, that is, converted into embeddings that a large model can easily process and query, and then quickly loaded into the enterprise's own vector database, as sketched below.
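To make the second approach concrete, here is a minimal sketch of what "vectorization" looks like in code, assuming the OpenAI embeddings API and the pre-1.0 openai Python SDK (the same combination used in the hands-on case later in this article); the key and the sample text are placeholders.

import openai

openai.api_key = "sk-xxxx"  # placeholder: use your own OpenAI API key

def vectorize(text):
    # Turn a piece of internal company text (a document title, a report
    # paragraph, a product description, ...) into a fixed-length vector.
    return openai.Embedding.create(
        input=text,
        engine="text-embedding-ada-002",  # returns 1536-dimensional embeddings
    )["data"][0]["embedding"]

vector = vectorize("Q3 sales summary for the APAC region")
print(len(vector))  # 1536 floats, ready to be stored in a vector database

Once every record is represented this way, "asking questions about your data" becomes a nearest-neighbor search in vector space rather than a literal keyword match.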

This is the position and role of the large model in the big data architecture. As a core technical component, the large model plays an irreplaceable role in data transformation, predictive analysis, and intelligent applications; it is the key to realizing the value of big data.

Data Highway: Apache SeaTunnel & WhaleStudio

Data synchronization is another key link in the big data architecture. With tools such as Apache NiFi, Apache Spark, and Sqoop, data can be synchronized between different systems and databases in both real-time and batch mode. These tools support cross-cloud and hybrid-cloud environments and can pull data from a variety of sources into a target database or data warehouse. However, the range of data sources they support out of the box is quite limited.

Apache SeaTunnel: a new generation of real-time multi-source data synchronization tool, the highway of big data

There is a vivid metaphor that sums up the role of Apache SeaTunnel: the highway of big data. It synchronizes real-time and batch data from sources such as MySQL, Redshift, and Kafka to target databases. Unlike Apache NiFi and Apache Spark, this new-generation real-time multi-source data synchronization tool currently supports synchronization and integration across hundreds of source and destination databases, works across clouds and hybrid clouds, and makes it easy for different users to take the next step into big data and large model training.

[Figure: Apache SeaTunnel as the "data highway" connecting sources and targets]

A typical Apache SeaTunnel case

Apache SeaTunnel currently has a large number of users around the world; one typical user is JPMorgan Chase.

[Figure: the JPMorgan Chase case]

JPMorgan Chase, a globally recognized financial giant with more than 200,000 employees, including more than 30,000 data professionals (engineers, analysts, scientists, and consultants), has been struggling with complex legacy systems and an evolving data environment. Operating in a maze of more than 10 different data platforms, the institution needed a robust, secure, and efficient approach to data integration.

The most significant challenge for JPMorgan Chase is the ingestion and processing of data through complex privacy and access controls, which, while critical for data protection, often delay the data integration process. Coupled with the company's transition phase to AWS, which is still underway two years later, and experiments with modern database solutions like Snowflake, the need for flexible data integration solutions is acute.

In the pursuit of agility, JPMorgan Chase compared several popular data synchronization products, such as Fivetran and Airbyte, but ultimately chose an alternative that runs on Spark clusters for the best performance: Apache SeaTunnel.

The reason is that SeaTunnel is compatible with its existing Spark infrastructure. A key advantage is Apache SeaTunnel's seamless integration with the Java code base, allowing data migration jobs to be triggered directly from JPMorgan Chase's main coding environment. J.P. Morgan uses SeaTunnel to ingest data from sources such as Oracle, DB2, PostgreSQL, DynamoDB, and SFTP files, processes the data on a Spark cluster, loads it into S3, its centralized data repository, and from there integrates it into Snowflake and Amazon Athena for advanced analytics.

An outstanding feature of Apache SeaTunnel is its ability to explicitly handle data type conversion, ensuring data integrity between different systems, which is an important requirement in JPMorgan Chase's diverse data ecosystem.

Why do we need Apache SeaTunnel?

Since popular data processing tools such as Flink and Spark already exist, why do we need Apache SeaTunnel? Dig a little deeper into the tool, as JPMorgan Chase did, and you'll find it's not a difficult question to answer.

  • The open-source version of Apache SeaTunnel currently supports 130+ connectors, and the commercial version (WhaleTunnel) supports 150+ databases, which is unmatched by other products;

[Figure: supported connectors]

  • Performance advantage: SeaTunnel is 30 times faster than Airbyte and 30% faster than DataX (for details, see "Latest Performance Comparison Report: SeaTunnel is 30 times faster than Airbyte!");

[Figure: performance comparison with Airbyte and DataX]

  • Easy to deploy: Apache SeaTunnel can be deployed in 3 minutes and supports running on Spark/Flink/Zeta.

[Figure: deployment options on Spark/Flink/Zeta]

Simple to use

In terms of usage, Apache SeaTunnel is designed to serve a wide range of big data practitioners; its main design goal is simplicity and ease of use.

  • Synchronization jobs can be created using SQL-like code.
  • Supports Source Connector, Sink Connector and Transform operations.

[Figure: example of a SQL-like synchronization job definition]

Want an easier way? WhaleStudio on AWS Marketplace

If writing code to perform data integration is still a challenge, there is a simpler option. WhaleStudio, a commercial product built by WhaleOps on top of Apache DolphinScheduler and Apache SeaTunnel, is a distributed, cloud-native DataOps system with a powerful visual interface. It adds the enterprise-grade features commercial customers need, and even users without a technical background can get started easily:

  • WYSIWYG data mapping and processing
  • Fully visual operation scheduling and data processing, no code processing required
  • Fully compatible with AWS and multi-cloud and hybrid cloud architectures
  • Multi-team collaboration and development
  • High-performance connections to over 150 data sources, including
    • AWS S3, Aurora, Redshift
    • SAP
    • Oracle, MySQL
    • Dameng (DM), Iceberg

Simply put, the process of using WhaleStudio together with a large model can be summarized as follows (a rough, purely illustrative code sketch of steps 1 to 4 follows the list):

  1. Data source connection: First, you need to configure the data source in WhaleStudio. This includes CSV files, databases, cloud storage services, and more. Users can add data source components to the workflow by dragging and dropping and set connection parameters.
  2. Data transformation: Data may need to be cleaned and transformed during transfer to fit the target system. WhaleStudio provides a variety of data transformation tools, including data filtering, field mapping, data merging, etc.
  3. Data loading: The transformed data needs to be loaded into the target database or data warehouse. WhaleStudio supports a variety of target systems, including relational databases, NoSQL databases and cloud data services.
  4. API integration: for the data to be understood by the large model, it needs to be converted into a suitable format (for example, embeddings) through an API. WhaleStudio can call external APIs and feed the transformed data into the large model.
  5. Process monitoring: Users can monitor the status of the data flow in real time, view the progress of data synchronization and any errors that may occur.
  6. Data synchronization and update
    1. Scheduled tasks: WhaleStudio supports scheduled tasks, allowing users to set up data flows to run automatically at specific times to ensure real-time updates of data.
    2. Data version control: Through version control, users can track the change history of data flows and roll back to previous versions when necessary.
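WhaleStudio carries out these steps visually, without code. Purely as a mental model (this is not WhaleStudio's API; the file path, column names, and connection details are assumptions), steps 1 to 4 amount to an extract-transform-embed-load pipeline roughly like this:

import pandas as pd
import openai
from pymilvus import connections, Collection

openai.api_key = "sk-xxxx"  # placeholder API key

# 1. Data source connection: read the raw records (here, a CSV export).
df = pd.read_csv("/tmp/books.csv")  # assumed path and layout

# 2. Data transformation: basic cleaning before loading.
df = df.dropna(subset=["title"])
df["title"] = df["title"].str.strip()

# Steps 3 and 4 (loading + API integration): embed each title via the OpenAI
# API, then write ids, titles, and vectors into the vector database
# (a Milvus collection with matching fields is assumed to already exist).
def embed(text):
    return openai.Embedding.create(
        input=text, engine="text-embedding-ada-002"
    )["data"][0]["embedding"]

vectors = [embed(t) for t in df["title"]]

connections.connect(host="localhost", port="19530")
collection = Collection(name="title_db")
collection.insert([df["bookID"].tolist(), df["title"].tolist(), vectors])
collection.flush()

SeaTunnel and WhaleStudio replace this hand-written glue with configurable connectors, scheduling, and monitoring.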

How to inject company internal data into a large model and turn it into an "encyclopedia"

[Figure: turning internal data into an "encyclopedia" via vectorization and a large model]

As described above, the data "highway" is in place. So how do we move data over this highway into the large model and put it to use?

The figure above shows an example of how a large model can turn a company's internal data into an "encyclopedia": all the book-related records in a MySQL database are converted into vectors and fed into the large model, which understands them and can ultimately answer questions about the data in natural language. The process is explained in detail with a practical case below.
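Before the case itself, here is a minimal sketch of the question-and-answer half of that picture: embed the question, pull the most similar records out of the vector database, and let the chat model answer from them. It assumes the title_db Milvus collection built in the case below, the pre-1.0 openai SDK, and a placeholder API key; it is an illustration, not a required step of the case.

import openai
from pymilvus import connections, Collection

openai.api_key = "sk-xxxx"  # placeholder API key

connections.connect(host="localhost", port="19530")
collection = Collection(name="title_db")
collection.load()

def embed(text):
    return openai.Embedding.create(
        input=text, engine="text-embedding-ada-002"
    )["data"][0]["embedding"]

def answer(question):
    # 1. Retrieve the book records whose vectors are closest to the question.
    hits = collection.search(
        data=[embed(question)], anns_field="title_2",
        param={"metric_type": "L2"}, limit=5, output_fields=["title_1"]
    )[0]
    context = "\n".join(hit.entity.get("title_1") for hit in hits)
    # 2. Ask the chat model to answer using only the retrieved records.
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the book titles provided."},
            {"role": "user", "content": f"Books:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply["choices"][0]["message"]["content"]

print(answer("Which of these books could help me improve my relationships?"))

The practical case below builds exactly the vector-database side of this setup.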

Practical case: using WhaleStudio + a large model on AWS to move library retrieval from title keyword search to semantic search

Existing book search solutions, such as those used by public libraries, rely heavily on keyword matching rather than a semantic understanding of what a book title actually means. As a result, search results often fail to meet our needs, or differ considerably from what we expected. Keyword matching alone cannot capture semantics, and therefore cannot understand the searcher's true intent.

There are better ways to conduct book searches more accurately and efficiently. By using specific APIs, book data can be converted into a format that can be understood by large models, thereby enabling semantic-level search and question-and-answer functions. This approach not only improves the accuracy of searches, but also provides businesses with a new way to leverage their data.

WhaleStudio is a powerful data integration and processing platform that lets users design and implement data flows through a graphical interface. Here, we use WhaleStudio to load library book data into a large model for deeper semantic search and question answering.

Next, we will demonstrate how to use WhaleStudio, Milvus and OpenAI to perform similarity search to achieve semantic understanding of the entire book title, thereby making the search results more accurate.

Preparation

  1. Before the experiment, we need to go to the official website to obtain an OpenAI token.

  2. Deploy WhaleStudio from AWS Marketplace

  3. Then deploy a Milvus experimental environment ( https://milvus.io/docs/install_standalone-docker.md).

  4. We also need to prepare the data that will be used for this example. You can download it from here and put it under /tmp/milvus_test/book ( https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks)

  5. Configure WhaleStudio tasks

Create a project → create a new workflow definition → create a SeaTunnel task → copy the script into the task

[Screenshot: creating the SeaTunnel task in WhaleStudio]

  6. Script code:
env {
  # You can set engine configuration here
  execution.parallelism = 1
  job.mode = "BATCH"
  checkpoint.interval = 5000
  #execution.checkpoint.data-uri = "hdfs://localhost:9000/checkpoint"
}

source {
  # Read the prepared book CSV files from the local file system
  LocalFile {
    schema {
      fields {
        bookID = string
        title_1 = string   # title text returned in search results
        title_2 = string   # title text that the sink will turn into embeddings
      }
    }
    path = "/tmp/milvus_test/book"
    file_format_type = "csv"
  }
}

transform {
}

sink {
  # Write the records into Milvus, vectorizing the configured fields
  # with the OpenAI embeddings API on the way in
  Milvus {
    milvus_host = localhost
    milvus_port = 19530
    username = root
    password = Milvus
    collection_name = title_db                 # target Milvus collection
    openai_engine = text-embedding-ada-002     # OpenAI model used to generate the embeddings
    openai_api_key = sk-xxxx                   # replace with your own key
    embeddings_fields = title_2                # field(s) to embed as vectors
  }
}
  7. Click Run

[Screenshot: running the task]

  8. Simple data preprocessing can also be done through the visual interface

[Screenshots: visual data preprocessing in WhaleStudio]

  9. Query the database to confirm that the data has been loaded

[Screenshot: querying Milvus to confirm the data has been loaded]
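If you would rather verify from code than from a console, a quick check (a sketch using the same connection details as the search script below) looks like this:

from pymilvus import connections, utility, Collection

connections.connect(host="localhost", port="19530")
print(utility.has_collection("title_db"))  # True once the sink job has run
collection = Collection(name="title_db")
collection.flush()
print(collection.num_entities)  # number of book records loaded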

  10. Use the following code to search book titles semantically:
import json
import random
import openai
import time
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

COLLECTION_NAME = 'title_db'  # Collection name
DIMENSION = 1536  # Embeddings size
COUNT = 100  # How many titles to embed and insert.
MILVUS_HOST = 'localhost'  # Milvus server URI
MILVUS_PORT = '19530'
OPENAI_ENGINE = 'text-embedding-ada-002'  # Which engine to use
openai.api_key = 'sk-******'  # Use your own Open AI API Key here

connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)

collection = Collection(name=COLLECTION_NAME)

collection.load()


def embed(text):
    return openai.Embedding.create(
        input=text, 
        engine=OPENAI_ENGINE)["data"][0]["embedding"]
def search(text):
    # Search parameters for the index
    search_params={
        "metric_type": "L2"
    }

    results=collection.search(
        data=[embed(text)],  # Embedded search vector
        anns_field="title_2",  # Search across embeddings
        param=search_params,
        limit=5,  # Limit to five results per search
        output_fields=['title_1']  # Include title field in result
    )

    ret=[]
    for hit in results[0]:
        row=[]
        row.extend([hit.id, hit.score, hit.entity.get('title_1')])  # Get the id, distance, and title for the results
        ret.append(row)
    return ret

search_terms=['self-improvement', 'landscape']

for x in search_terms:
    print('Search term:', x)
    for result in search(x):
        print(result)
    print()
  11. Run result

[Screenshot: semantic search results]

Result: with the old keyword-based search, a book title would have to contain keywords such as "self-improvement" or "improvement" to be found; by letting a large model provide semantic-level understanding, we can retrieve titles that better match our intent. In the example above, the search term is "self-improvement", and the returned titles include "The Dance of Relationship: The Art of Getting Along with Both Intimacy and Independence" and "Nicomachean Ethics", neither of which contains the keyword itself, yet both clearly better match what we were actually looking for.

Conclusion

Big data and big models provide enterprises with unprecedented data processing capabilities and insights. Through effective data architecture design, large model integration, real-time and batch data processing, and data synchronization, enterprises can better utilize their data resources, improve operational efficiency, and stay ahead in a highly competitive market.

Apache SeaTunnel and WhaleStudio serve as the enterprise data highway, helping to quickly connect internal data, vectorize it, and turn it into an "encyclopedia" of the business. WhaleStudio in particular, as a data integration tool, gives enterprises a simple, efficient, and powerful way to synchronize data into large models for deeper analysis and applications, improving the enterprise's data processing capabilities and business insight.

This article is published by Beluga Open Source Technology.


Origin my.oschina.net/dailidong/blog/11126562