[ODPS New Product Release Issue 2] Real-time data warehouse Hologres: compute group instances, JSON data support, vector computing + large models, and other new capabilities

Alibaba Cloud's ODPS product family, with MaxCompute, DataWorks, and Hologres at its core, is committed to meeting users' diversified data computing needs through an integrated architecture spanning storage, scheduling, and metadata management, and supports efficient data processing in transportation, finance, scientific research, and many other fields. It is China's earliest self-developed and most widely adopted integrated big data platform.

This issue focuses on:

  • Hologres launches compute group instances
  • Hologres supports JSON data
  • Hologres vector computing + large model capabilities
  • Hologres new data synchronization capabilities
  • Hologres data tiered storage

New feature—Hologres launches compute group instances

Compute group instances allow computing resources to be split into separate compute groups, providing resource isolation and better support for high-availability deployments.

Application scenarios:

  • Resource isolation: different workloads cause query jitter by interfering with each other, for example write-write, read-write, and large-versus-small queries, as well as interference among online serving, multi-dimensional analysis, and ad hoc analysis; some big data engines lack a storage-compute separation architecture and can only achieve isolation through multiple data copies, which is a high-cost approach.
  • High availability: lacking service-level high availability, disaster recovery, and multi-active capabilities, enterprises build dual or multiple links themselves, which is costly in both manpower and computing resources.
  • Flexible scaling: businesses demand elasticity: capacity must be expanded promptly when traffic surges and reduced promptly during off-peak periods, cutting waste and reducing costs.

Features:

  • Natural physical resource isolation: each compute group is physically isolated from the others, so workloads do not affect one another and business jitter is reduced.
  • Flexible on-demand scaling: compute and storage are both highly elastic; enterprises can add or remove compute groups on schedule or on demand (scale out), and hot-resize a compute group's resources on demand (scale up).
  • Lower costs: built on physical replication, physical files are fully reused across compute groups; resources can be used flexibly on demand, keeping costs to a minimum.

Product Demo: Compute Group Instances

Go to the Hologres console and create a new compute group through SQL, granting it the corresponding table group (data) permissions. Then switch compute groups: change the init warehouse to the newly created read warehouse and execute the query, and the entire load is transferred to the read warehouse. Compute groups can also be started and stopped as needed; start and stop operations can be performed through SQL or visually in the console. Likewise, a compute group's resources can be adjusted on demand, either visually on the page or through SQL, and a compute group can be released as soon as it is no longer needed, so it occupies no resources.
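To make the flow concrete, here is a minimal SQL sketch of the demo steps. The procedure and setting names used below (hg_create_warehouse, hg_suspend_warehouse, hg_resume_warehouse, hg_computing_resource) are illustrative assumptions rather than the exact Hologres API; consult the official compute group documentation for the real statements.

```sql
-- Illustrative only: procedure and setting names are assumptions, not the official API.

-- 1. Create a read-only compute group sized at 32 compute units
CALL hg_create_warehouse('read_warehouse', 32);           -- assumed procedure

-- 2. Allow the new compute group to serve queries over existing data
GRANT USAGE ON WAREHOUSE read_warehouse TO ROLE analyst_role;  -- assumed syntax

-- 3. Route the current connection's queries to the read warehouse
SET hg_computing_resource = 'read_warehouse';             -- assumed session setting

-- 4. Suspend the compute group when idle (it then occupies no resources),
--    and resume it when traffic returns
CALL hg_suspend_warehouse('read_warehouse');              -- assumed procedure
CALL hg_resume_warehouse('read_warehouse');               -- assumed procedure
```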

View product demo video

New feature—Hologres supports JSON data

Supports columnar JSONB storage to improve query efficiency

Application scenarios:

  • Query efficiency: for semi-structured data the schema cannot be fixed in advance, so row storage is mainly used; large-scale computations then have to scan large amounts of data, making it hard to meet business requirements for query efficiency.
  • Storage efficiency: the compression advantages of column storage cannot be exploited, resulting in a low compression ratio and a large storage footprint, making it hard to meet business requirements for storage efficiency.
  • Data processing: semi-structured data is relatively complex to process, requiring cleaning, extraction, and transformation, so businesses need more comprehensive functional support.

Features:

JSON data processing methods: as a common semi-structured data type, JSON is typically handled in one of two ways:

  • Import-time parsing: parse the data structure on ingestion and store the data under a strong schema. The advantage is that the data is already strongly typed when it lands in the database, giving better query and storage performance. The disadvantage is that converting the data to a strong schema during parsing sacrifices the flexibility of JSON: whenever a JSON key is added or removed, the parsing program must be modified.
  • Query-time parsing: write the raw JSON directly into the database and parse it with JSON functions at query time. The advantage is that the flexibility of JSON is preserved to the greatest extent. The disadvantage is poor query performance, and development is complicated because appropriate processing functions and methods must be chosen for each query.

To address the drawbacks of both methods, Hologres optimizes its JSON storage: the system infers the storable data types from the written keys and values and stores JSONB data in columnar form. This brings three benefits (a short sketch follows the list):

  • Flexible and easy to use: unlike option 1, the data does not need to be strongly schematized in advance, preserving the flexibility of JSON to the greatest extent.
  • High compression ratio: columnar storage effectively improves the compression ratio and saves storage space.
  • Strong query performance: columnar storage reduces the amount of data scanned, improving IO efficiency and query speed.
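As a rough sketch of how this looks in practice (the column-level property name enable_columnar_type is an assumption here; check the official JSONB documentation for the exact property):

```sql
-- Store semi-structured event payloads as JSONB
CREATE TABLE github_events (
    id      BIGINT,
    event   TEXT,
    payload JSONB      -- flexible, schema-free event payload
);

-- Switch the JSONB column to columnar storage (property name assumed);
-- Hologres then infers the storable types from the written keys and values
ALTER TABLE github_events
    ALTER COLUMN payload SET (enable_columnar_type = ON);
```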

Product Demo: Columnar JSON

The demo uses public sample data stored in JSON form, where each row carries keys and values representing different business meanings. Run the SQL that queries the number of issues closed in each year and month, and the system starts executing. Under the traditional execution method, rows are scanned one by one and keys and values are extracted individually, taking 55 seconds in total. With columnar JSON storage enabled, the same query takes 1.47 seconds in total, a large improvement in query efficiency.
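The demo query can be sketched as follows, assuming a hypothetical github_events table like the one above, whose JSONB payload carries created_at and action keys (all names are illustrative):

```sql
-- Count closed issues per year and month by extracting JSONB keys;
-- with columnar JSONB enabled, only the referenced keys are scanned
SELECT
    date_part('year',  (payload ->> 'created_at')::timestamptz) AS yr,
    date_part('month', (payload ->> 'created_at')::timestamptz) AS mon,
    count(*) AS closed_issues
FROM github_events
WHERE event = 'IssuesEvent'
  AND payload ->> 'action' = 'closed'
GROUP BY yr, mon
ORDER BY yr, mon;
```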

View product demo video

New capabilities—Hologres vector computing + large model capabilities

High-performance vector computing, combined with large models to build an exclusive knowledge base

Application scenarios:

Problems in deploying an enterprise-level large model knowledge base:

When deploying models, enterprises face high compute and storage costs, limited resource elasticity, and the difficulty of deploying large models.

When processing corpora, the original corpus-processing pipeline is complicated; large corpus volumes place high demands on the vector database's write throughput and real-time performance, and high knowledge-base Q&A QPS places high demands on its query capability.

When building a large model knowledge base, enterprises run into long pipelines, many products to integrate, and the high cost and difficulty of wiring the overall architecture together.

Features:

Overall advantages of Hologres + Proxima:

Proxima is a vector engine self-developed by DAMO Academy, with stability and performance superior to open-source products such as Faiss. Hologres is deeply integrated with Proxima to provide high-QPS, low-latency vector computing services. The specific advantages fall into three areas:

  • High performance: the integrated data warehouse provides low-latency, high-throughput online vector query services; vector data can be written and updated in real time and queried immediately after writing.
  • Ease of use: a unified SQL query interface for vector data, compatible with the PostgreSQL ecosystem; supports vector retrieval with complex filter conditions (see the sketch after this list).
  • Enterprise-grade capabilities: flexible horizontal scaling of vector compute and storage resources; supports primary/secondary instance architecture and compute group instance architecture, with physical isolation of computing resources, delivering enterprise-grade high availability.
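A minimal sketch of how a vector table might be declared, following the general shape of Hologres' Proxima integration; treat the exact property names and JSON structure as assumptions to verify against the official documentation. A 4-dimensional vector is used for brevity; real embeddings are much wider (for example, 1536 dimensions).

```sql
-- Enable the Proxima vector engine in the current database
CREATE EXTENSION IF NOT EXISTS proxima;

-- Corpus chunks with their embedding vectors (4 dimensions for brevity)
CREATE TABLE corpus_vectors (
    id      TEXT NOT NULL,
    content TEXT,
    vector  float4[] CHECK (array_ndims(vector) = 1 AND array_length(vector, 1) = 4)
);

-- Declare a vector index on the column (property shape assumed)
CALL set_table_property('corpus_vectors', 'proxima_vectors',
    '{"vector":{"algorithm":"Graph","distance_method":"SquaredEuclidean"}}');
```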

Architecture and advantages of deploying a large model knowledge base with Hologres + PAI:

The architecture is mainly divided into three layers:

  • Data preprocessing layer: the original corpus is loaded and parsed into text chunks, which are vectorized by an Embedding model to produce corpus vectors, and finally written into the real-time data warehouse Hologres.
  • Text generation layer: the user's question is first embedded into a question vector, and then the Top K closest vectors are retrieved from Hologres (see the sketch after this list).
  • Final generation layer: the Top K corpus chunks serve as input to the large model and are combined with its other inputs, including the chat history and the final prompt, to produce the final answer. The large models here can be deployed uniformly through the machine learning platform PAI.
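The retrieval step of the text generation layer reduces to a Top-K nearest-neighbor query. A hedged sketch against the corpus_vectors table above, where the literal array stands in for the embedded question vector and the distance function name follows Hologres' Proxima documentation (verify before use):

```sql
-- Retrieve the Top 5 corpus chunks closest to the question embedding
SELECT id,
       content,
       pm_approx_squared_euclidean_distance(vector, '{0.12, 0.03, 0.88, 0.41}') AS distance
FROM corpus_vectors
ORDER BY distance ASC
LIMIT 5;
```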

Architectural advantages:

  • Simplified model deployment: one-click deployment of LLM inference services through the online model service PAI-EAS.
  • Simplified corpus processing and querying: one-click corpus loading, chunking, vectorization, and import into Hologres; Hologres' low-latency, high-throughput vector retrieval then gives users faster, better vector search services.
  • One-stop knowledge base construction: no manual wiring needed; large model deployment, WebUI deployment, corpus data processing, and large model fine-tuning can all be completed on one platform.

Product Demo: Hologres + PAI Large Model Knowledge Base

Open a Hologres instance and note the instance's domain in the network information on the instance details page. Click the login button to enter HoloWeb, create a database on the metadata management page, and record the database name. Click Security Center to enter the user management page, create a custom user, grant it permissions, and record the user name and password. Next, deploy the large model: use PAI-EAS to deploy an LLM and record the model's invocation information. In the demo, PAI-EAS is also used to deploy the LangChain WebUI service; click "view web application" to enter the WebUI page. On the settings page, configure the Embedding model, the LLM just deployed, and the Hologres vector storage; all of these can be configured in one click through a JSON file. Click Parse to fill in the configuration, then click Connect Hologres to test connectivity. Enter the upload page to process the corpus: upload the corpus data, set the text-chunking parameters, and click upload to import the data into the Hologres vector table. Back in the HoloWeb editor, refresh to confirm the corpus has been imported into Hologres as vectors. Return to the WebUI Chat page: first try the native ChatGLM model with the question "What is Hologres", and the result is not ideal; then ask the same question with the Hologres knowledge base attached, and the answer is correct. Finally, return to the LangChain chatbot page and complete an API call of the whole solution using the recorded invocation information.

View product demo video

New capabilities—Hologres data synchronization

Added support for synchronizing data sources such as ClickHouse, Kafka, and PostgreSQL to Hologres

Application scenarios:

  • Synchronization performance: enterprises have many data sources, leading to varied synchronization requirements such as whole-database synchronization, full plus incremental synchronization, merging of sharded databases and tables, and real-time synchronization.
  • When enterprises build their own data platforms, each data source needs its own adaptation; achieving high-performance writes therefore requires developers to have synchronization-tuning skills.
  • Synchronization cost: with many data sources, developing a client per source drives up start-up costs; when synchronization performance cannot meet business needs, resources keep getting added over short periods and costs rise; metadata management during synchronization is difficult.
  • Operations and maintenance: on a self-built data platform, the entire lifecycle of development, debugging, deployment, and O&M is managed by developers; the process is cumbersome, data inconsistencies require checking the whole link piece by piece at high cost, and any bad data point triggers backfills from heterogeneous sources, making O&M very difficult.

Features:

Overview of Hologres data synchronization capabilities

Hologres has a very open ecosystem, supporting Flink, DataWorks data integration, Holo Client, JDBC, and other ways of synchronizing data into Hologres, meeting the data synchronization and migration needs of various businesses and enabling more real-time, more efficient data analysis and data serving.

  • Fully compatible with Flink: supports real-time data writes, dimension table joins, reads, and more.
  • Highly adapted to DataWorks data integration: essentially all data sources supported by DataWorks can be synchronized to Hologres.
  • Holo Client and Holo Shipper out of the box: Holo Client delivers high-performance point queries and point-write updates, while Holo Shipper can migrate entire database instances.
  • Standard JDBC/ODBC interfaces: provided as standard, ready to use out of the box.

Continuous evolution: new Hologres data synchronization capabilities

To meet different business needs, Hologres continuously iterates on its data synchronization capabilities. The new capabilities are as follows:

  • ClickHouse whole-database offline migration: built on DataWorks data integration, the migration has two parts: automatic identification and mapping of metadata, and one-shot synchronization of all data in the database, with no need to write one task per table as before, greatly reducing development and operations friction and enabling rapid migration of ClickHouse data to Hologres.
  • Kafka real-time subscription: achievable in two ways. First, Flink subscribes to Kafka and writes to Hologres in real time, implementing streaming ETL at the data warehouse layer (see the sketch after this list); second, DataWorks data integration consumes Kafka in real time, automatically synchronizes message changes, and writes them directly to Hologres, so Kafka data can be brought in quickly.
  • PostgreSQL real-time synchronization: PostgreSQL data is synchronized to Hologres in real time through DataWorks data integration, supporting not only single-table real-time synchronization but also DDL capability configuration, whole-database real-time synchronization, automatic mapping of database and table structures, and combined full plus incremental real-time synchronization, greatly reducing synchronization development effort.
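As a minimal sketch of the first Kafka approach, here is a Flink SQL job that subscribes to a Kafka topic and writes into Hologres in real time, using the open-source Kafka connector and the Hologres connector for Flink. Topic, endpoint, and credential values are placeholders, and connector option names should be verified against the documentation of the connector versions you deploy.

```sql
-- Source: subscribe to a Kafka topic (standard Flink SQL Kafka connector)
CREATE TEMPORARY TABLE orders_src (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(10, 2),
    order_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'kafka-host:9092',   -- placeholder
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

-- Sink: write into Hologres in real time (Hologres connector for Flink)
CREATE TEMPORARY TABLE orders_holo (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(10, 2),
    order_time TIMESTAMP(3)
) WITH (
    'connector' = 'hologres',
    'endpoint'  = 'holo-host:80',                          -- placeholder
    'dbname'    = 'mydb',
    'tablename' = 'public.orders',
    'username'  = '<access-id>',
    'password'  = '<access-key>'
);

-- Streaming ETL: continuous insert from Kafka into Hologres
INSERT INTO orders_holo
SELECT order_id, user_id, amount, order_time FROM orders_src;
```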

Product Demo: ClickHouse Whole-Database Synchronization

In the DataWorks data integration interface, configure the ClickHouse and Hologres data sources and test their connectivity; once the test passes, proceed to the next step. Select the ClickHouse tables to synchronize and set advanced options such as per-side task rate limits, concurrency, and runtime configuration, then check the tables and synchronize them to Hologres in one shot. For the target table mapping, click the batch refresh button to map the table structures. Start the synchronization task and wait about two minutes. After synchronization completes and the page refreshes, verify against the upstream row counts that the data arrived, then run a simple query on the table in Hologres for a final data check.

View product demo video

New capability—Hologres data tiered storage

Application scenarios:

  • E-commerce orders: orders from recent months are accessed frequently and are highly RT-sensitive; historical data is accessed rarely and is not latency-sensitive.
  • Behavior analysis: recent traffic data is queried at high frequency and requires high timeliness; historical data is queried less often but must remain queryable at any time.
  • Log analysis: recent data is queried frequently; historical data must be retained long-term for auditing and backtracking.

Features:

  • Standard storage: all-SSD hot storage, the default in Hologres, suited to scenarios where the full table is accessed frequently and access-performance requirements are high.
  • Low-frequency access storage: access frequency declines over time and data gradually turns cold; for example, some log data is no longer accessed after the current year and should be migrated from standard storage to low-frequency storage to cut costs. Rule-based automatic hot/cold conversion greatly reduces maintenance cost; this tier suits large data volumes with low access frequency where storage costs need to be reduced.
  • Dynamic hot/cold partition tiering: hot/cold partition movement rules are set through the dynamic partitioning capability, tiering partitions automatically. On cost, taking Beijing subscription pricing as an example, standard storage is 1 yuan per GB per month and low-frequency storage is 0.144 yuan per GB per month, roughly a 7x difference. On performance, results measured on a standard TPC benchmark data set show a gap of about 3 to 4x.

Product Demo: Creating Cold Storage Tables and Setting Up Partition Tables

In the demo's CREATE TABLE statement, setting a storage-mode table property at creation time creates a cold storage table; click Run. By querying the hg_table_storage_status system table, you can check whether the table's storage strategy matches expectations: the table's status shows cold, confirming it is a cold storage table. For an existing hot table on standard storage, execute the property-setting command separately, specify the table, and click Run; the cold storage setting succeeds and all of the table's existing data is moved to the low-frequency cold storage medium. Partition tables involve two cases. First, to create a cold storage partition table, set the storage mode in the CREATE TABLE statement of the parent partition table; its partition child tables inherit the parent's storage strategy by default and need no separate settings. Second, to modify the attributes of a particular partition, specify the partition child table's name in the table property and set its storage policy, switching that child table to the desired hot or cold attribute. For dynamic partition tables, some additional properties need to be set.
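A hedged sketch of the statements described above; the property name storage_mode, its values, and the hg_table_storage_status system view are reconstructed from the demo narration and are assumptions to verify against the official tiered-storage documentation (all table names are illustrative):

```sql
-- 1. Create a table directly on cold (low-frequency) storage; property name assumed
BEGIN;
CREATE TABLE order_history (order_id BIGINT, amount DECIMAL(10, 2), created_at DATE);
CALL set_table_property('order_history', 'storage_mode', 'cold');
COMMIT;

-- 2. Check that the table's storage strategy matches expectations; view name assumed
SELECT * FROM hg_table_storage_status WHERE table_name = 'order_history';

-- 3. Move an existing hot table from standard storage to cold storage
CALL set_table_property('orders_2021', 'storage_mode', 'cold');

-- 4. Partition child tables inherit the parent's storage mode by default;
--    to override one partition, target the child table directly
CALL set_table_property('orders_p20230101', 'storage_mode', 'hot');
```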

View product demo video

Claim a Hologres 5000 CU free trial: https://free.aliyun.com/?pipCode=hologram

Get a free trial of DataWorks: https://free.aliyun.com/?pipCode=dide

Claim a MaxCompute 5000 CU free trial: https://free.aliyun.com/?pipCode=odps
