Databend v1.2 release! Data + AI

Dear community partners, Databend officially released v1.2.0 on June 29, 2023! Compared with v1.1.0, developers contributed a total of 600 commits, covering 3,083 file changes and about 170,000 modified lines of code. Thanks to every community partner who took part, and to everyone who makes Databend better!

Version v1.2.0 adds features such as the BITMAP data type, querying CSV/TSV/NDJSON files directly by column position, and AI Functions, and it introduces a newly designed hash table that greatly improves Join performance. This release brings Databend closer to realizing the LakeHouse vision: it can directly read and analyze CSV/TSV/NDJSON/Parquet files stored on object storage, and you can also run ETL over these files inside Databend for higher-performance OLAP analysis.

At the same time, Databend has designed and implemented enterprise features such as computed columns, VACUUM TABLE, and Serverless Background Service. If you are interested, contact the Databend team for upgrade information, or visit Databend Cloud to try them right away.

Databend x Kernel

Here is a quick look at Databend's important new features; meet a Databend that is closer to what you want.

New data type: BITMAP

Databend has added support for the BITMAP data type and implemented a series of related functions.

BITMAP is a compressed data structure that efficiently stores and manipulates collections of Boolean values. It provides fast set operations and aggregation capabilities and is widely used in data analysis and queries. Common scenarios include deduplicated counting, filtering, and compressed storage. Databend's BITMAP implementation uses RoaringTreemap, which improves performance and reduces memory usage compared with other bitmap implementations.

SELECT user_id, bitmap_count(page_visits) AS total_visits
FROM user_visits;

+--------+------------+
|user_id |total_visits|
+--------+------------+
|       1|           4|
|       2|           3|
|       3|           4|
+--------+------------+
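For reference, a table like the one queried above could be set up as follows. This is a hypothetical sketch: the table layout and the visited page IDs are made up, and it assumes the TO_BITMAP function for building a bitmap from a comma-separated string of values.

```sql
-- Hypothetical setup for the query above: page_visits stores visited page IDs as a BITMAP.
CREATE TABLE user_visits (
    user_id INT,
    page_visits BITMAP
);

INSERT INTO user_visits VALUES
    (1, to_bitmap('10,20,30,40')),  -- 4 distinct pages
    (2, to_bitmap('10,20,30')),     -- 3 distinct pages
    (3, to_bitmap('10,20,30,50'));  -- 4 distinct pages
```

With this data, bitmap_count(page_visits) counts the distinct values in each bitmap, matching the output above.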


Query CSV/TSV/NDJSON files directly using column numbers

To query schema-less files such as CSV/TSV/NDJSON, you previously had to load them into a table first. But sometimes users don't know the file's details in advance (for example, how many columns a CSV file has), or they just want to run ad-hoc queries.

To this end, Databend introduces column positions, using the syntax $N to refer to the Nth column. All columns of CSV/TSV files are treated as String type; if a row has fewer columns than the referenced column number, it is padded with empty strings. NDJSON files have only one column, $1, of type Variant.

Combine this capability with the COPY statement to load parts of the columns on demand and use functions to transform the data while loading.

SELECT $1 FROM @my_stage (FILE_FORMAT=>'ndjson');

COPY INTO my_table FROM (SELECT TRIM($2) FROM @my_stage t) FILE_FORMAT = (type = CSV);


Design and implement a new hash table to improve the performance of Hash Join

In the past, Databend's hash table was purpose-built for the needs of aggregation operators. To further improve Hash Join performance, we set out to design and implement a new hash table optimized for Hash Join. Its parallelism-friendly design lets Databend make full use of computing resources, while finer-grained memory control avoids unnecessary memory overhead and significantly improves Hash Join performance.

Business intelligence analyst Mimoune Djouallah commented that Databend's performance is excellent: with 8 cores and 32 GB of memory, it takes only 25 seconds to run TPC-H SF10. He even wrote a blog post titled "Databend and the rise of Data warehouse as a code".



AI Functions

Databend introduced powerful AI functions in v1.2.0, seamlessly integrating Data and AI. Through SQL we can now do:

  1. Natural language to SQL
  2. Embedding vectorization and storage
  3. Similarity calculation
  4. Text generation

Natural language to SQL

For example, given an nginx log database, ask "What are the top 5 IP addresses making the most requests" and Databend's AI_TO_SQL function will directly return the corresponding SQL statement, which is very convenient.
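As a rough sketch (the exact invocation may differ from this; the question is the one from the example above):

```sql
-- Hypothetical call: generate SQL from a natural-language question.
SELECT * FROM ai_to_sql(
    'What are the top 5 IP addresses making the most requests'
);
```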


Embedding vectorization

With Databend's AI_EMBEDDING_VECTOR function, we can vectorize data and save it in Databend's ARRAY type. In this way, Databend effectively becomes a vector database.
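A hedged sketch of the idea (the table and column names are made up, and the exact ARRAY column declaration is an assumption based on the description above):

```sql
-- Hypothetical: store an embedding next to each document body.
CREATE TABLE doc_embeddings (
    body STRING,
    embedding ARRAY(FLOAT32)
);

INSERT INTO doc_embeddings
SELECT body, ai_embedding_vector(body) FROM raw_docs;
```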


Similarity Calculation

With vectorized representations, we can calculate the similarity between two words, sentences, or documents. For example, given the two words "dog" and "puppy" (sentences work too), we first convert them into vectors v1 and v2 respectively, then use cosine similarity to measure how similar they are.

cos_sim = dot(v1, v2) / (norm(v1) * norm(v2))

Databend's COSINE_DISTANCE function is based on this formula (cosine distance is conventionally 1 - cos_sim).
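The formula above is easy to check in plain Python (a minimal sketch, independent of Databend):

```python
# Plain-Python check of the cosine similarity formula above (no external dependencies).
import math

def dot(v1, v2):
    # Sum of element-wise products: dot(v1, v2)
    return sum(a * b for a, b in zip(v1, v2))

def norm(v):
    # Euclidean length: sqrt(dot(v, v))
    return math.sqrt(dot(v, v))

def cosine_similarity(v1, v2):
    # cos_sim = dot(v1, v2) / (norm(v1) * norm(v2))
    return dot(v1, v2) / (norm(v1) * norm(v2))

# "dog" and "puppy" would be embedded first; here we use toy vectors.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]  # parallel to v1, so similarity is 1
print(round(cosine_similarity(v1, v2), 6))
```

Vectors pointing the same way score near 1, orthogonal ones score 0, which is why the distance form (1 - cos_sim) works as a "smaller is closer" metric.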


Text generation

Text generation is useful in many scenarios, and you can now do it directly in SQL with the AI_TEXT_COMPLETION function.


We have already used these Data + AI capabilities to embed all of the documentation at https://databend.rs , store the embeddings in Databend, and build an intelligent Q&A site: https://ask.databend.rs . On this site, you can ask any question about Databend.

Databend Enterprise Features

The new enterprise-level features are online! Learn how Databend drives more valuable data analysis services.

Computed columns

Computed columns are columns whose data is generated from other columns through expressions. They can store expression results to speed up queries and can simplify some complex query expressions. Computed columns come in two types: stored (STORED) and virtual (VIRTUAL).

  • A stored computed column generates and writes its data to disk on every insert or update, so it needs no recalculation at query time and reads faster.
  • A virtual computed column stores no data and takes no extra space; it is calculated on the fly for each query.

Computed columns are especially useful for reading fields inside JSON data. By defining commonly used inner fields as computed columns, the costly JSON extraction on every query can be greatly reduced. For example:

CREATE TABLE student (
    profile variant,
    id int64 null as (profile['id']::int64) stored,
    name string null as (profile['name']::string) stored
);

INSERT INTO student VALUES ('{"id":1, "name":"Jim", "age":20}'),('{"id":2, "name":"David", "age": 21}');

SELECT id, name FROM student;
+------+-------+
| id   | name  |
+------+-------+
|    1 | Jim   |
|    2 | David |
+------+-------+
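The example above uses STORED columns; a VIRTUAL computed column is declared the same way, just with the other keyword (a sketch in the spirit of the example above; the age field comes from the sample JSON):

```sql
-- Hypothetical: age is computed from profile at query time and never stored.
CREATE TABLE student_v (
    profile variant,
    age int64 null as (profile['age']::int64) virtual
);
```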


VACUUM TABLE

The VACUUM TABLE command helps optimize system performance by permanently deleting historical data files from a table to free up storage space. Deleted files include:

  • Snapshots related to the table, along with their associated segments and blocks.
  • Orphan files. In Databend, orphan files are snapshots, segments, and blocks that are no longer associated with the table. Orphan files can be generated by various operations and errors, such as during data backup and restore, and over time they can take up valuable disk space and degrade system performance.
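A hedged usage sketch (the table name is made up and the retention clause is an assumption about the syntax; check the documentation for the exact form):

```sql
-- Hypothetical: permanently remove historical data files outside the retention window.
VACUUM TABLE my_table RETAIN 48 HOURS;
```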


Serverless Background Service

Databend's built-in storage engine, Fuse Table, is a log-structured table similar to Apache Iceberg. As data is continuously written, the table needs periodic compaction, re-clustering, and cleanup to merge small data blocks. Merging small blocks involves steps such as sorting data by cluster key and cleaning up data that is no longer needed.

Automating this process used to require different drivers, adding complexity to the infrastructure, plus extra services deployed and maintained just to trigger driver events. To simplify this, Databend designed and implemented Serverless Background Service, which automatically discovers tables that need compaction, re-clustering, or cleanup after data is written and triggers their maintenance, with no extra services or manual operations by users. It reduces users' maintenance burden, improves table query performance, and lowers the cost of data on object storage.

Databend x Ecosystem

Databend's ecosystem has been further improved. It's time to bring Databend into your data insight workflow!

Python bindings for Databend

Databend now offers Python bindings, providing a new option for executing SQL queries in Python. The bindings embed Databend itself, so they can be used without deploying an instance.

pip install databend

Import SessionContext from databend and create a session context to get started:

from databend import SessionContext
ctx = SessionContext()
df = ctx.sql("select number, number + 1, number::String as number_p_1 from numbers(8)")

The resulting DataFrame can be converted to PyArrow or Pandas format using to_py_arrow() or to_pandas():

df.to_pandas() # Or, df.to_py_arrow()

Act now and integrate Databend into your data science workflow.

BendSQL - Databend native command line tool

BendSQL is a native command-line tool designed for Databend. It has been rewritten in Rust and supports both the REST API and Flight SQL protocols.

Using BendSQL, you can easily and efficiently manage databases, tables and data in Databend, and easily perform various queries and operations.

bendsql> select avg(number) from numbers(10);

SELECT
  avg(number)
FROM
  numbers(10);

┌───────────────────┐
│    avg(number)    │
│ Nullable(Float64) │
├───────────────────┤
│ 4.5               │
└───────────────────┘

1 row in 0.259 sec. Processed 10 rows, 10B (38.59 rows/s, 308B/s)

We look forward to sharing more updates on BendSQL with you! Feel free to try it out and give us feedback.

Data Integration and BI Services

Apache DolphinScheduler

Apache DolphinScheduler is a distributed, scalable open-source workflow orchestration platform with a powerful DAG visual interface. It supports 30+ task types, including Flink SQL, DataX, and HiveCli. It can execute millions of tasks stably with high concurrency, high throughput, and low latency; it can run tasks in batches on a schedule (a specific date range or list of dates); and, without affecting the workflow template, its workflow instances support modification, rollback, and rerun.


DolphinScheduler now supports Databend as a data source. You can use DolphinScheduler to manage DataX tasks and synchronize data from heterogeneous databases such as MySQL into Databend.

Apache Flink CDC

Apache Flink CDC (Change Data Capture) refers to Apache Flink's ability to capture and process real-time data changes from various sources using SQL-based queries. CDC allows monitoring and capturing of data modifications (inserts, updates, and deletes) occurring in a database or streaming system, and reacting to these changes in real time.

Databend now provides a Flink SQL connector that integrates Flink's stream processing capabilities with Databend. The connector can be configured to capture data changes from various databases as streams and load them into Databend for real-time processing and analysis.


Tableau

Tableau is a popular data visualization and business intelligence tool. It provides an intuitive and interactive way to explore, analyze and present data, helping users to better understand the meaning and insight of data.

Following Tableau's Other Databases (JDBC) instructions, place databend-jdbc in the Tableau driver path, and you can then use Tableau to analyze data in Databend.



Download and use

If you are interested in the new features, please visit https://github.com/datafuselabs/databend/releases/tag/v1.2.0-nightly to view the full changelog or download the release to try it.

If you are still using an old version of Databend, we recommend upgrading to the latest version. For the upgrade process, please refer to:

https://databend.rs/doc/operations/upgrade

Feedback

If you encounter any problems, feel free to reach out via GitHub issues or the community user groups.

GitHub: https://github.com/datafuselabs/databend/
