[Tencent Cloud Lab] Practical application of vector database in financial information database analysis

1. Introduction

This article surveys the main families of databases and how they have evolved, with particular attention to vector databases and how they are used in real projects.

It then takes a closer look at Tencent Cloud Vector Database and walks through a practical application to financial credit data analysis, so that readers can understand and apply the key points of this technology.

2. Types of databases

The development of modern databases shows a diversified trend, from traditional relational and NoSQL databases to cloud databases, cloud native databases and vector databases, each of which provides customized solutions for specific needs. As technology continues to advance, the database field continues to innovate to meet changing needs.

2.1 Evolving databases: diverse solutions to cope with changing needs

When it comes to databases, we can see constant evolution and innovation. Traditional self-built databases are often built based on relational databases (such as MySQL, PostgreSQL) or NoSQL databases (such as MongoDB, Cassandra). These databases were mainly used to store structured data in the early days and were widely used in enterprises and applications.

With the rise of cloud computing, cloud databases have emerged, providing users with more flexible, scalable and easy-to-manage solutions. Cloud databases include various services, such as Amazon RDS, Google Cloud SQL, and Azure Database, which can automatically manage and adjust the capacity and performance of the database, and provide high availability and disaster recovery functions.

Cloud native databases focus more on database solutions built and deployed in cloud native environments. These databases are typically containerized and leverage cloud-native technologies such as Kubernetes to achieve greater elasticity, scalability, and reliability.

Another important trend is the rise of vector databases. These databases focus on high-dimensional, complex data such as images, text, and audio. Vector search systems (such as the Milvus database and the Faiss similarity-search library) use vector indexing to process and query large-scale vector data efficiently, and are widely used in artificial intelligence, machine learning, and big data analysis.

2.2 What is a vector database

A vector database is a type of database designed specifically for high-dimensional vector data: storing, indexing, and efficiently querying data sets that contain vector information. These vectors may represent unstructured or semi-structured data such as text, images, and audio, or they may be features extracted by machine learning and deep learning models.

Vector databases often employ specific vector index structures and algorithms to store and retrieve vector data efficiently. Their design goal is to make operations such as similarity search or clustering in high-dimensional space more efficient and to be able to handle large-scale vector data sets.

These databases are widely used in artificial intelligence, recommendation systems, image recognition, natural language processing, and other fields. They speed up the search for similar vectors, supporting applications such as recommendation algorithms, similar-image search, and text similarity matching. Milvus is a well-known vector database, and Faiss is a well-known library for vector similarity search.

The importance of vector databases stems from their ability to handle large-scale high-dimensional data sets and perform fast similarity searches. Traditional relational databases are not suitable for this type of data due to their inflexible structure and lack of specialized indexing technology tailored for similarity searches.

In contrast, vector databases employ index structures and algorithms built specifically to handle high-dimensional data and enable fast nearest neighbor searches.

2.3 Why is the vector database so important?

First, developers can index the embedding vectors they generate into a vector database. This makes it possible to find related assets by querying for similar vectors.

Additionally, vector databases provide a way to put embedding models into production. They combine a query language with database features such as resource management, security controls, scalability, fault tolerance, and efficient information retrieval, which makes application development more efficient.

More importantly, vector databases are critical for developers to create unique app experiences. For example, users can search for similar images by taking a photo on their smartphone, thanks to the support of a vector database.

In addition, developers will be able to leverage other types of machine learning models to automatically extract metadata from content such as images and scanned documents. They can index this metadata along with vectors to enable hybrid searches of keywords and vectors. Search results can also be improved by incorporating semantic understanding into relevance rankings.
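As a rough illustration of hybrid search, the sketch below ranks documents by a weighted mix of vector similarity and keyword overlap. The document records, the `keyword_score` and `hybrid_search` helpers, and the weight `alpha` are all hypothetical and chosen for illustration; real systems typically use more refined keyword scoring.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_score(query_terms, keywords):
    """Fraction of query terms found in the document's extracted keywords."""
    return sum(term in keywords for term in query_terms) / len(query_terms)

def hybrid_search(query_terms, query_vector, docs, alpha=0.6):
    """Rank by alpha * vector similarity + (1 - alpha) * keyword overlap."""
    def score(doc):
        vector_part = cosine(query_vector, doc["vector"])
        keyword_part = keyword_score(query_terms, doc["keywords"])
        return alpha * vector_part + (1 - alpha) * keyword_part
    return sorted(docs, key=score, reverse=True)

# Hypothetical documents: keywords extracted by some model, plus an embedding.
docs = [
    {"id": "scan-1", "keywords": {"invoice", "2023"}, "vector": [0.9, 0.1]},
    {"id": "scan-2", "keywords": {"contract"}, "vector": [1.0, 0.0]},
    {"id": "photo-1", "keywords": {"invoice"}, "vector": [0.0, 1.0]},
]

hybrid_ranking = [d["id"] for d in hybrid_search(["invoice"], [1.0, 0.0], docs)]
vector_only_ranking = [d["id"] for d in hybrid_search(["invoice"], [1.0, 0.0], docs, alpha=1.0)]

print(hybrid_ranking)
print(vector_only_ranking)
```

With `alpha=1.0` the ranking depends on the vectors alone; lowering `alpha` lets a keyword match lift a document that is slightly further away in vector space.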

For example, new models like ChatGPT belong to the category of generative artificial intelligence. These models can not only generate text but also carry on complex conversations with people. Some models are multimodal: for example, they can generate an image that matches a scene described by the user.

However, generative models are prone to hallucinations, which can lead chatbots to give users incorrect information. A vector database can compensate for this weakness: it serves as an external knowledge base for generative AI chatbots, so that their answers can be grounded in trusted content.

2.4 How do vector databases work?

We all have a rough idea of how traditional databases work: they store strings, numbers, and other types of scalar data in rows and columns. Vector databases, however, are based on vector operations, so their optimization and query methods are very different.

In a traditional database, we usually query for rows whose values exactly match our query conditions. In a vector database, we apply a similarity measure to find the vectors that are most similar to our query vector.
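To make the contrast concrete, here is a minimal sketch in plain Python. The toy records, the three-dimensional vectors, and the `cosine_similarity` helper are illustrative assumptions and not part of any database API; cosine similarity is only one of several possible similarity measures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy records: one scalar field plus an embedding vector.
records = [
    {"id": "a", "grade": "B", "vector": [1.0, 0.0, 0.0]},
    {"id": "b", "grade": "C", "vector": [0.9, 0.1, 0.0]},
    {"id": "c", "grade": "B", "vector": [0.0, 1.0, 0.0]},
]

# Traditional query: keep the rows whose value exactly matches the condition.
exact_matches = [r["id"] for r in records if r["grade"] == "B"]

# Vector query: rank every row by similarity to the query vector, keep the top 2.
query = [1.0, 0.05, 0.0]
ranked = sorted(records, key=lambda r: cosine_similarity(query, r["vector"]), reverse=True)
most_similar = [r["id"] for r in ranked[:2]]

print(exact_matches)
print(most_similar)
```

The exact-match query returns a set of rows with no notion of ordering, while the vector query always returns a ranking, even when no stored vector equals the query.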

Vector databases employ a series of different algorithms that all participate in approximate nearest neighbor (ANN) search. These algorithms optimize the search process through hashing, quantization, or graph-based search.

These algorithms are assembled into a pipeline that enables fast and accurate retrieval of the neighbors of a query vector. Since vector databases provide approximate results, the main trade-offs we need to make are accuracy and speed. The more accurate the results, the slower the query will be. However, a good system can provide super-fast searches with almost perfect accuracy.

The following is a common process for vector databases:

Figure: Vector database process

  1. Index: Vector databases index vectors using algorithms such as PQ, LSH, or HNSW. This step maps vectors to data structures to speed up the search process.
  2. Query: The vector database compares the indexed query vector with the indexed vectors in the data set to find the nearest neighbors, using the similarity measure of the specific index.
  3. Post-processing: In some cases, the vector database retrieves the final nearest neighbors from the dataset and post-processes them to return the final result. This step may include reordering nearest neighbors using different similarity measures.
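The three steps above can be sketched with a deliberately simple form of LSH (random hyperplane hashing) in plain Python. This is a toy under stated assumptions, not how any particular product implements its index: the random data, the single hash table, the four hyperplanes, and all function names are made up for illustration.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)

def build_index(vectors, planes):
    """Step 1 (index): group vector ids into buckets keyed by LSH signature."""
    buckets = {}
    for vec_id, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(vec_id)
    return buckets

def search(query, vectors, buckets, planes, k):
    """Step 2 (query): only inspect the bucket the query hashes to.
    Step 3 (post-processing): re-rank those candidates by exact cosine similarity."""
    candidates = buckets.get(signature(query, planes), [])
    reranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return reranked[:k]

rng = random.Random(0)
dim = 8
vectors = {i: [rng.gauss(0, 1) for _ in range(dim)] for i in range(200)}
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

buckets = build_index(vectors, planes)
query = vectors[17]
approx = search(query, vectors, buckets, planes, k=5)
exact = sorted(vectors, key=lambda i: cosine(query, vectors[i]), reverse=True)[:5]

print("approximate:", approx)
print("exact:      ", exact)
print("recall@5:   ", len(set(approx) & set(exact)) / 5)
```

The approximate search scans one bucket instead of all 200 vectors, which is where the speed comes from; the recall figure shows how much accuracy that shortcut costs on this data. Using more hyperplanes makes buckets smaller (faster, less accurate), which is the trade-off described earlier.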

3. Tencent Cloud Vector Database

Tencent Cloud Vector Database Free Trial: https://curl.qcloud.com/deqmCnLM

3.1 What is Tencent Cloud Vector Database

Tencent Cloud VectorDB is a fully managed, self-developed, enterprise-grade distributed database service dedicated to storing, retrieving, and analyzing multi-dimensional vector data. It supports multiple index types and similarity calculation methods. A single index supports vectors at the billion scale, with millions of QPS and millisecond-level query latency.

Tencent Cloud Vector Database can not only provide an external knowledge base for large models and improve the accuracy of large model answers, but can also be widely used in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service.

3.2 Advantages of Tencent Cloud Vector Database

Tencent Cloud VectorDB is offered as a service dedicated to storing and retrieving vector data. Its main advantages are high performance, high availability, large scale, low cost, ease of use, and stability and reliability.

For brevity, the mind map I made summarizes the advantages of Tencent Cloud Vector Database:

  • High performance
    A single index supports vector data at the billion scale, with millions of QPS and millisecond-level query latency.
  • High availability
    The database keeps multiple replicas. Its multi-availability-zone, three-node architecture offers availability of up to 99.99%, so the database keeps running through node failures and load changes.
  • Large scale
    The architecture scales horizontally, and a single instance can support millions of QPS, which covers the vector storage and retrieval needs of AI scenarios.
  • Low cost
    A few steps in the management console are enough to create a vector database instance. The platform is fully managed, with no installation, deployment, or operation and maintenance work, which reduces machine, operations, and labor costs.
  • Easy to use
    The database supports rich vector retrieval capabilities. Users can operate it through the HTTP API or the SDK, which keeps development efficient. The console also provides complete data management and monitoring, and is simple to operate.
  • Stable and reliable
    The database is built on OLAMA, Tencent Group's self-developed vector search engine. It runs stably across nearly 40 business lines and handles up to 100 billion search requests per day on average.
  • Embedding function
    The Embedding function automatically converts raw text into vector data, either to insert it into the database or to run a similarity search. Handling the text-to-vector conversion inside the database removes a step for the user and lowers the barrier to entry.

3.3 Current adoption of Tencent Cloud Vector Database

Tencent Cloud Vector Database is widely adopted both inside and outside Tencent. Internal products rely on it for data management and retrieval, and a growing number of external products have chosen it as well.

More than 40 businesses within Tencent Group currently use Tencent Cloud Vector Database, generating 160 billion requests per day, and it has more than 1,000 external users.

4. Tencent Cloud Vector Database in Practice (Financial Credit Data Analysis)

Financial Analysis Cases (Important)

4.1 Preparation

4.1.1 Purchasing Tencent Cloud Vector Database

On the Tencent Cloud product page, search for the vector database, or click it directly in the list of new products.

On the Tencent Cloud Vector Database home page, click the button to try it now:

On the instance creation page, select the region, specifications, and other settings.

For details, refer to the figure below. If some of the required configurations do not exist yet, create them first by following the prompts in the figure.

Note: A free trial instance can be used for at most 1 month and is reclaimed once that month expires.

4.1.2 Log in to Tencent Cloud Vector Database

Enable external network access according to your own needs. It is not recommended for production environments, where intranet access is sufficient. In this article, external network access is enabled for testing and demonstration purposes.

After enabling external network access, click the instance ID to open the details page, as shown below, and click the login button.

On the vector database login page, shown in the figure, enter the account and password. The default account is root, and the password is the instance's API key.

4.1.3 Tencent Cloud Vector Database SDK Preparation

We take the Python environment as an example and execute the following command to directly install the latest version.

pip install tcvectordb

The execution is shown in the figure below:

4.2 Case database development process

4.2.1 Create database

Use the following code to create a database:

import tcvectordb
from tcvectordb.model.enum import FieldType, IndexType, MetricType, ReadConsistency

# Create the database connection object
client = tcvectordb.VectorDBClient(url='http://lb-*******.ap-guangzhou.tencentclb.com:50000', username='root', key='G283v2GaQRJG3vk******', read_consistency=ReadConsistency.EVENTUAL_CONSISTENCY, timeout=30)

# Create the database
db = client.create_database(database_name='t_vectordb_demo_01')

print(db.database_name)

4.2.2 Creating a collection

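from tcvectordb.model.collection import Embedding
from tcvectordb.model.enum import EmbeddingModel, FieldType, IndexType, MetricType
from tcvectordb.model.index import Index, VectorIndex, FilterIndex, HNSWParams

# Step 1: design the indexes
index = Index(
    FilterIndex(name='id', field_type=FieldType.String, index_type=IndexType.PRIMARY_KEY),
    VectorIndex(name='vector', dimension=768, index_type=IndexType.HNSW,
                metric_type=MetricType.COSINE, params=HNSWParams(m=16, efconstruction=200)),
)

# Embedding settings: the 'text' field is converted into the 'vector' field by the chosen model
ebd = Embedding(vector_field='vector', field='text', model=EmbeddingModel.BGE_BASE_ZH)

# Step 2: create the collection in the database created above (db)
coll = db.create_collection(
    name='loan_data_analysis',
    shard=1,
    replicas=0,
    description='this is a collection of test embedding',
    embedding=ebd,
    index=index
)
print(vars(coll))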

4.2.3 Import data

import tcvectordb
from tcvectordb.model.enum import ReadConsistency
from tcvectordb.model.document import Document, Filter, SearchParams

# Create the database connection object
client = tcvectordb.VectorDBClient(url='http://lb-******.clb.ap-guangzhou.tencentclb.com:50000', username='root', key='G283v2******', read_consistency=ReadConsistency.EVENTUAL_CONSISTENCY, timeout=30)


# Select the database and collection that will receive the raw text
db = client.database('t_vectordb_demo_01')
coll = db.collection('loan_data_analysis')


# Write the data; newly written data may take a moment to become visible.
# 1. The schema is dynamic: id and text must be written, and any other fields may be added.
#    text is the text field name configured when the collection was created.
# 2. upsert overwrites: if a document id already exists, the new data replaces the old
#    (the old document is deleted, then the new one is inserted).
# 3. build_index=True rebuilds the index while the data is written.
res = coll.upsert(
    documents=[
        Document(
            id='1077501',
            text="1077501:10+ years",
            author='RENT',
            bookName='5000',
            page=36,
            funded_amnt=5000,
            funded_amnt_inv=4975,
            int_rate=10.65,
            installment=162.87,
            grade='B',
            sub_grade='B2',
            emp_title='',
            emp_length='10+ years',
            home_ownership='RENT'
        ),
        Document(
            id='1077430',
            text="1314167:< 1 year",
            author='RENT',
            bookName='2500',
            page=60,
            funded_amnt=2500,
            funded_amnt_inv=2500,
            int_rate=15.27,
            installment=59.83,
            grade='C',
            sub_grade='C4',
            emp_title='Ryder',
            emp_length='< 1 year',
            home_ownership='RENT'
        )
    ],
    build_index=True
)

After inserting the test data, we return to the Tencent Cloud Vector Database console and view the data, as shown below:

Similar records can then be batch-imported into the collection in the same way.

4.2.4 Reading data

To read data, we use the query() method.

query() performs exact-match lookups: it returns the documents that exactly match the query conditions. Specifically, it supports the following:
Retrieval by primary key id (document ID) and by Filter expressions on custom scalar fields.
An offset for the query starting position and a limit on the number of results, which together provide a SCAN capability.

# Set filter
filter_param=Filter(Filter.In("text",["year", "years"]))


# query
doc_list = coll.query(document_ids=['1077501','1077430'], retrieve_vector=True, filter=filter_param, limit=2, offset=0, output_fields=['text','author'])


for doc in doc_list:
          print(doc)

The vector data taken out is as follows:
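To go beyond a two-document lookup, the offset and limit parameters can be combined into a paging loop. The sketch below is an assumption-laden illustration: `scan_collection` is a hypothetical helper, and `FakeCollection` is an in-memory stand-in so the code runs without a database. It relies only on the `limit` and `offset` arguments used in the call above; whether the real SDK accepts a query with no `document_ids` or `filter` should be checked against the SDK documentation.

```python
def scan_collection(coll, page_size=100, **query_kwargs):
    """Yield every document by calling coll.query() one page at a time."""
    offset = 0
    while True:
        page = coll.query(limit=page_size, offset=offset, **query_kwargs)
        if not page:
            break
        yield from page
        if len(page) < page_size:
            # A short page means the end of the data was reached.
            break
        offset += page_size


class FakeCollection:
    """Hypothetical in-memory stand-in for a collection (not part of the SDK)."""

    def __init__(self, docs):
        self._docs = docs

    def query(self, limit, offset=0, **_ignored):
        return self._docs[offset:offset + limit]


fake = FakeCollection([{"id": str(i)} for i in range(7)])
ids = [doc["id"] for doc in scan_collection(fake, page_size=3)]
print(ids)
```

Writing the loop as a generator keeps memory use bounded by the page size, which matters once the collection holds more than a handful of loan records.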

4.2.5 Data analysis

Convert a variable to its appropriate data type

Some variables are not of their appropriate data type and require preprocessing to be converted to the correct format. We have defined some functions to help automate this process. The function used to convert a variable to its appropriate data type is shown below.

import numpy as np
import pandas as pd

# `data` is the loan DataFrame; the article does not show how it is loaded.

# Convert the term column (e.g. '36 months') to a numeric data type
def term_numeric(df, column):
    df[column] = pd.to_numeric(df[column].str.replace(' months', '', regex=False))

term_numeric(data, 'term')

# Convert emp_length (e.g. '10+ years', '< 1 year') to a numeric data type
def emp_length_convert(df, column):
    # regex=False so that '+' is matched literally on every pandas version
    df[column] = df[column].str.replace('+ years', '', regex=False)
    df[column] = df[column].str.replace('< 1 year', '0', regex=False)
    df[column] = df[column].str.replace(' years', '', regex=False)
    df[column] = df[column].str.replace(' year', '', regex=False)
    # missing employment lengths are treated as 0
    df[column] = pd.to_numeric(df[column]).fillna(0)


# Preprocess date columns
def date_columns(df, column):
    # reference date fixed by the author
    today_date = pd.to_datetime('2020-08-01')
    # convert to datetime format
    df[column] = pd.to_datetime(df[column], format="%b-%y")
    # whole calendar months between the reference date and each date
    # (dividing by np.timedelta64(1, 'M') is rejected by recent pandas versions)
    months = (today_date.year - df[column].dt.year) * 12 + (today_date.month - df[column].dt.month)
    # two-digit years can parse into the future; replace negative values with the column maximum
    df['mths_since_' + column] = months.mask(months < 0, months.max())
    # drop the original date column
    df.drop(columns=[column], inplace=True)

Preprocessing of target columns

The target column in our dataset is loan_status, which contains several distinct values. These values need to be converted to a binary label: 0 for bad borrowers and 1 for good borrowers. In our case, a bad borrower is one whose loan status is Charged Off, Default, Late (31-120 days), or Does not meet the credit policy. Status:Charged Off. All other borrowers are classified as good.

# Create a new column from the loan_status column; this will be our target variable
data['good_bad'] = np.where(data.loc[:, 'loan_status'].isin(['Charged Off', 'Default', 'Late (31-120 days)',
                                                                       'Does not meet the credit policy. Status:Charged Off']), 0, 1)
# Drop the original 'loan_status' column
data.drop(columns = ['loan_status'], inplace = True)

Analysis to obtain weight of evidence (WOE) and information value

Credit risk models generally need to be interpretable and easy to understand. To achieve this, all independent variables must be categorical. Since some variables are continuous, we use the concept of weight of evidence (WOE).

WOE lets us convert continuous variables into categorical features: each continuous variable is divided into intervals, and new variables are created based on the weight of evidence of each interval. Information value (IV), in turn, helps us determine which features are useful for prediction. The information value of each independent variable is listed below. Variables with an information value below 0.02 are not included in the model, as they have little predictive power.

Information value of term is 0.035478
Information value of int_rate is 0.347724
Information value of grade is 0.281145
Information value of emp_length is 0.007174
Information value of home_ownership is 0.017952
Information value of annual_inc is 0.037998
Information value of verification_status is 0.033377
Information value of pymnt_plan is 0.000309
Information value of purpose is 0.028333
Information value of addr_state is 0.010291
Information value of dti is 0.041026
Information value of delinq_2yrs is 0.001039
Information value of inq_last_6mths is 0.040454
Information value of mths_since_last_delinq is 0.002487
Information value of open_acc is 0.004499
Information value of pub_rec is 0.000504
Information value of revol_util is 0.008858
Information value of initial_list_status is 0.011513
Information value of out_prncp is 0.703375
Information value of total_pymnt is 0.515794
Information value of total_rec_int is 0.011108
Information value of last_pymnt_amnt is 1.491828
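For readers who want to see how such numbers arise, here is a minimal sketch of the usual definitions: for each bin i, WoE_i = ln((good_i / total_good) / (bad_i / total_bad)), and IV is the sum over bins of (good_i / total_good - bad_i / total_bad) * WoE_i. The bin boundaries and borrower counts below are hypothetical and are not taken from the article's dataset; the `woe_iv` function is an illustrative helper, not the author's code.

```python
import math

def woe_iv(bins):
    """bins maps a bin label to (good_count, bad_count); returns (woe_by_bin, iv)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good
        pct_bad = bad / total_bad
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv

# Hypothetical counts of good and bad borrowers per interest-rate interval.
int_rate_bins = {
    "<10%": (400, 20),
    "10-15%": (400, 40),
    ">15%": (200, 40),
}

woe, iv = woe_iv(int_rate_bins)
for label, value in woe.items():
    print(f"{label}: WoE = {value:.4f}")
print(f"Information value = {iv:.4f}")
print("keep feature" if iv >= 0.02 else "drop feature")
```

In this toy example the information value works out to 0.4 * ln 2, about 0.277, so the feature would pass the 0.02 threshold. A bin with zero good or zero bad borrowers would make the logarithm undefined, so real implementations usually merge such bins or apply smoothing first.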

The class labels of the target column in our training set are imbalanced, as shown in the bar chart below. Training on such imbalanced data biases the model towards predicting the majority class. To prevent this, I used random oversampling to increase the number of minority-class observations. Note that this process is applied only to the training data.

From the above figure, we can clearly see that the final result is 0 for bad borrowers and 1 for good borrowers.
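The random oversampling step mentioned above can be sketched in plain Python as follows. The article does not show which library was used, so this is only an illustration of the technique: the `random_oversample` helper and the toy training rows are hypothetical.

```python
import random

def random_oversample(rows, label_key, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    is as large as the biggest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # sample with replacement to fill the gap to the majority size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical training rows: 8 good borrowers (1) and 2 bad borrowers (0).
train = [{"int_rate": 10.0 + i, "good_bad": 1} for i in range(8)]
train += [{"int_rate": 20.0 + i, "good_bad": 0} for i in range(2)]

balanced = random_oversample(train, "good_bad")
counts = {label: sum(row["good_bad"] == label for row in balanced) for label in (0, 1)}
print(counts)
```

Applying this only to the training split, as the article notes, keeps the test set representative of the real class distribution.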

5. Summary

This article reviewed the main types of databases and introduced Tencent Cloud Vector Database, explaining why vector databases matter and how they work. It then covered the advantages and current adoption of Tencent Cloud Vector Database and demonstrated its use in financial credit data analysis.

In use, Tencent Cloud Vector Database offers high performance, high availability, large scale, low cost, ease of use, stability and reliability, and intelligent operation and maintenance. It fits a range of application scenarios, including recommendation systems, natural language processing, and computer vision. For getting started, Tencent Cloud also offers a free trial, which makes the first experience smooth.

I believe that as artificial intelligence technology continues to develop, databases will be applied ever more widely in the field. Vector databases, built specifically to store and retrieve vector data, will play an increasingly important role.


Origin blog.csdn.net/fly1574/article/details/134637101