How MyScale builds a high-performance SQL vector database on Amazon Web Services

MyScale is a high-performance vector database that is fully hosted on Amazon Web Services (AWS) and supports SQL. Its main advantage is that it supports full SQL syntax while delivering vector search performance that matches or exceeds that of dedicated vector databases. In this article, we explain how MyScale uses AWS infrastructure to build a stable and efficient cloud database.

What is a vector database

You may not have noticed, but vector embeddings are everywhere. They underpin many machine learning and deep learning algorithms, in applications ranging from search engines to smart assistants. These systems typically convert unstructured data such as text, images, audio, and video into vector embeddings for storage, and then use vector similarity search to find semantically related items. Vector similarity search is widely used in AI-driven applications, including image retrieval, video analysis, natural language understanding, recommendation systems, targeted advertising, personalized search, intelligent customer service, and fraud detection. Managing vector data is therefore particularly important: we need to be able to store, index, and search vectorized data quickly.
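To make the idea of similarity search concrete, here is a minimal sketch of a brute-force search using cosine similarity. The toy vectors and labels are invented for illustration, and a real vector database would replace the linear scan with an approximate index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (illustrative values only);
# real embeddings usually have hundreds of dimensions.
embeddings = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.3, 0.1],
    "invoice scan": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.2, 0.0], embeddings, k=2)
print(results)
```

The two animal photos rank ahead of the invoice scan because their vectors point in nearly the same direction as the query.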

Existing vector databases fall roughly into two categories. The first consists of dedicated products designed specifically for vectors, such as Pinecone, Weaviate, Qdrant, Chroma, and Milvus. The second consists of general-purpose SQL or NoSQL databases extended with vector capabilities. Postgres, one of the best-known SQL databases, supports vector indexing and search through the pgvector extension, and many open source databases, including ClickHouse, Redis, Elasticsearch, and Cassandra, have recently added native support for vector indexes.

It is generally believed that dedicated vector databases, being designed specifically for vector retrieval, provide better search performance, while general-purpose databases with added vector search offer more complete data management and structured query capabilities at the cost of vector retrieval performance. MyScale is built on ClickHouse, the open source online analytical processing (OLAP) database, and integrates our self-developed multi-scale tree graph (MSTG) vector indexing algorithm. It inherits ClickHouse's excellent structured data analysis and query capabilities while also providing several times the price-performance ratio of dedicated vector databases. MyScale thus combines the advantages of both categories and gives enterprises a unified solution for managing structured and unstructured data.

Overall architecture

MyScale is a database service built entirely on the AWS cloud platform. Its architecture makes extensive use of the AWS product line, including Amazon EC2 virtual servers, Amazon EKS for Kubernetes cluster management, Amazon S3 object storage, and Network Load Balancer (NLB) for load balancing. Relying on this infrastructure, we were able to build MyScale's cloud service quickly.

As shown in the figure below, the MyScale cloud service architecture has three levels: the global control plane, the regional control plane, and the regional data plane. Each level corresponds to a Kubernetes cluster. The cloud service's business system is deployed in the global control plane, which is responsible for organization and user management and for overall usage statistics. Each region corresponds to a region of a cloud service provider, such as AWS us-east-1, and independently deploys one control plane and multiple data planes. The regional control plane provides the cluster management API (create, stop, destroy) and the billing system for that region. Each data plane runs the MyScale database clusters that users launch, which run across multiple availability zones within that data plane.

[Figure: MyScale cloud service architecture, showing the global control plane, regional control planes, and regional data planes]
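The three-level hierarchy can be pictured as a small data model. The sketch below is only an illustration: the class names, the data plane names, and the least-loaded placement rule are assumptions made for this example, not a description of MyScale's actual scheduler.

```python
from dataclasses import dataclass, field

@dataclass
class DataPlane:
    name: str
    clusters: list = field(default_factory=list)  # names of database clusters

@dataclass
class Region:
    name: str  # a cloud provider region, e.g. "us-east-1"
    data_planes: list = field(default_factory=list)

    def place_cluster(self, cluster_name):
        """Hypothetical placement rule: pick the least-loaded data plane."""
        target = min(self.data_planes, key=lambda dp: len(dp.clusters))
        target.clusters.append(cluster_name)
        return target.name

@dataclass
class GlobalControlPlane:
    regions: dict = field(default_factory=dict)

    def create_cluster(self, region_name, cluster_name):
        """Forward the request to the regional level and report the chosen data plane."""
        return self.regions[region_name].place_cluster(cluster_name)

service = GlobalControlPlane(regions={
    "us-east-1": Region("us-east-1", [DataPlane("dp-1"), DataPlane("dp-2")]),
})
first = service.create_cluster("us-east-1", "cluster-a")
second = service.create_cluster("us-east-1", "cluster-b")
print(first, second)
```

Keeping user management at the global level and cluster placement at the regional level means a request only needs to know its target region, not which data plane will eventually host it.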

All MyScale services are deployed on Amazon EKS, the managed Kubernetes service from AWS. EKS provides a highly available, secure, and scalable Kubernetes environment, which lets MyScale take full advantage of Kubernetes features such as service discovery, load balancing, automatic scaling, and security isolation. With Cluster Autoscaler on Amazon EKS, MyScale can start, stop, and scale instances quickly according to user workloads, growing or shrinking the EKS node pool as needed.

To isolate user clusters from one another, MyScale uses Kubernetes namespaces. In the data plane, each MyScale database cluster that a user creates corresponds to one Kubernetes namespace, which minimizes interference between clusters. Each cluster's namespace contains its database nodes, load balancing service, and metadata storage service.
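As a rough illustration of the one-namespace-per-cluster idea, the sketch below builds a standard Kubernetes Namespace manifest for a given cluster. The naming scheme (`myscale-<id>`) and the `example.com/...` label keys are hypothetical and chosen only for this example.

```python
import json

def namespace_manifest(cluster_id, organization):
    """Build a Kubernetes Namespace manifest for one database cluster."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": f"myscale-{cluster_id}",  # hypothetical naming scheme
            "labels": {
                # Hypothetical labels for finding a tenant's resources later.
                "example.com/organization": organization,
                "example.com/cluster-id": cluster_id,
            },
        },
    }

manifest = namespace_manifest("c1a2b3", "acme")
print(json.dumps(manifest, indent=2))
```

Everything belonging to a cluster (database nodes, load balancer, metadata store) would then be created inside that namespace, so quotas and network policies can be scoped per cluster.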

When using the MyScale cloud service, users create and manage MyScale clusters through the Web UI running on the global control plane. After a user creates a cluster in the Web UI, the cloud service backend calls the API of the control plane in the corresponding region, which converts the parameters and configuration of the MyScale database cluster into a Kubernetes custom resource (defined by a CRD) and saves it in that regional control plane. The Cluster Manager running in the corresponding regional data plane watches for changes to these database cluster resources in the regional control plane and creates or modifies the actual MyScale database cluster in the data plane accordingly. Once the cluster has started, users can access the MyScale database through the Web UI, the Python, Java, and Node.js clients, the HTTP interface, and LLM application frameworks such as LangChain and LlamaIndex.
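The watch-and-react behavior described for the Cluster Manager follows the usual Kubernetes controller pattern. The sketch below simulates it in plain Python under stated assumptions: the event shape mirrors Kubernetes watch events (`ADDED`, `MODIFIED`, `DELETED`), while the `reconcile` function, the spec fields, and the in-memory `running` dictionary are hypothetical stand-ins for real cluster operations.

```python
def reconcile(event, running):
    """Apply one watch event to the dict of running clusters; return the action taken."""
    kind, spec = event["type"], event["object"]
    name = spec["name"]
    if kind == "DELETED":
        running.pop(name, None)
        return f"destroy {name}"
    if name not in running:
        running[name] = spec
        return f"create {name}"
    if running[name] != spec:
        running[name] = spec
        return f"update {name}"
    return f"no-op {name}"  # desired state already matches

# Simulated stream of changes to database cluster resources (hypothetical specs).
events = [
    {"type": "ADDED", "object": {"name": "db-1", "replicas": 1}},
    {"type": "MODIFIED", "object": {"name": "db-1", "replicas": 2}},
    {"type": "MODIFIED", "object": {"name": "db-1", "replicas": 2}},
    {"type": "DELETED", "object": {"name": "db-1", "replicas": 2}},
]

running = {}
actions = [reconcile(event, running) for event in events]
print(actions)
```

Because the handler compares the desired spec with what is already running, replaying the same event is harmless, which is the property that makes this pattern robust to missed or duplicated notifications.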

We deploy the MyScale database on EC2 instances with local NVMe SSDs. Most vector databases rely on HNSW, a purely in-memory vector index algorithm. MyScale's self-developed MSTG algorithm, by contrast, allows vector data to be cached on local NVMe SSDs, so MyScale delivers high-performance vector search while using far less memory. In our public tests, MyScale outperformed dedicated vector databases such as Pinecone, Weaviate, Qdrant, and Zilliz and provided the best price-performance ratio (QPS per dollar).
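A quick back-of-the-envelope calculation shows why keeping all vectors in memory becomes expensive, and how a QPS-per-dollar figure can be computed. The workload size, QPS, and cost below are hypothetical numbers chosen only to demonstrate the arithmetic, and dividing by monthly cost is one assumed way to define the metric.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Size of the raw vectors alone (float32 by default), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value

def qps_per_dollar(qps, monthly_cost_usd):
    """Price-performance metric: queries per second per dollar of monthly cost."""
    return qps / monthly_cost_usd

# Hypothetical workload: 5 million 768-dimensional float32 vectors.
total_bytes = raw_vector_bytes(5_000_000, 768)
print(f"raw vectors: {total_bytes / 2**30:.1f} GiB")

# Hypothetical benchmark result, only to show how the metric is computed.
print(qps_per_dollar(qps=300, monthly_cost_usd=150))
```

An in-memory index has to hold at least this much data in RAM, plus its own graph structure, whereas an index that can serve vectors from local NVMe SSD only needs memory for the hot portion.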

To deploy the MyScale cloud service, we use Crossplane to provision and manage EC2 and EKS resources on AWS. First, we configured our AWS account credentials in Crossplane's AWS provider, which allows Crossplane to access and operate our AWS resources. Then we wrote YAML configuration files for EC2 and EKS that define the parameters of the servers and Kubernetes clusters we need, such as instance type and cluster size. When we apply these files, Crossplane's AWS provider calls the AWS API to create and configure the resources.

Crossplane also periodically synchronizes the status of these resources, so we can monitor and manage them through the Kubernetes API. To modify or delete a resource, we only need to modify the corresponding YAML file and reapply it, and Crossplane carries out the corresponding operation automatically. With Crossplane we manage our cloud resources in a declarative, unified, and automated way, which greatly improves our efficiency and accuracy.
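The declarative model can be summarized as "compare what is declared with what exists, then act on the difference". The sketch below illustrates that comparison in plain Python. It does not use Crossplane's API: the `plan` function, the resource names, and the spec fields are all hypothetical.

```python
def plan(desired, observed):
    """Compare desired and observed resources; list the actions needed to converge."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

# Hypothetical declared state (what the YAML files say).
desired = {
    "eks-data-plane": {"kind": "Cluster", "version": "1.27"},
    "node-pool": {"kind": "NodeGroup", "instanceType": "i3en.xlarge", "size": 3},
}
# Hypothetical observed state (what currently exists in the cloud account).
observed = {
    "node-pool": {"kind": "NodeGroup", "instanceType": "i3en.xlarge", "size": 2},
    "old-bastion": {"kind": "Instance", "instanceType": "t3.micro"},
}

steps = plan(desired, observed)
print(steps)
```

Running such a comparison periodically is what lets a declarative tool both apply edits to the configuration and correct drift introduced outside of it.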

For secure access, MyScale uses Teleport, a remote access management system. Teleport lets developers and operators access our Kubernetes clusters securely over encrypted connections, which improves both security and operational convenience. Teleport also provides comprehensive auditing that records all sessions and events in detail, which is very helpful for security analysis and compliance. This gives us complete visibility into every operation, so we can better control and protect the MyScale cloud service and provide users with safe and reliable services.

Summary

This post introduced MyScale, a vector database hosted on AWS. MyScale is built on ClickHouse, the open source online analytical processing (OLAP) database, and integrates the self-developed multi-scale tree graph (MSTG) vector indexing algorithm. It provides excellent data management and structured query capabilities, cost-effective vector search, and joint analysis and processing of structured and unstructured data. It can be widely used in AI-driven scenarios such as image retrieval, video analysis, and natural language understanding.

About the authors

Qin Liu

Dr. Qin Liu is the head of the MyScale infrastructure team. He previously worked at Moqi Technology and Huawei Noah's Ark Lab. He received his bachelor's degree in computer science from Shanghai Jiao Tong University in 2012, where he was a member of the ACM Class, and his PhD in Computer Science from the Chinese University of Hong Kong (CUHK) in 2016 under the supervision of Prof. John C.S. Lui. His doctoral research focused on graph computing and stream processing systems. He has published papers in international conferences and journals including KDD, VLDB, ICDE, DSN, CIKM, and IEEE TKDE, and won the 2012 KDD Cup together with his teammates.

Kelvin Guo

Senior Solutions Architect at Amazon Web Services. His main technical areas are MLOps, DevOps, containers, and data analytics. He has more than 20 years of experience in software development, project management, agile adoption, and engineering-efficiency consulting and implementation.

Origin blog.csdn.net/u012365585/article/details/132440097