Hands-on: Building a Distributed Search Engine with MaxCompute and OpenSearch

Abstract: Customers have recently been asking how to build a high-performance search engine over massive data at low cost, for example to search public accounts or videos. Because their data is already on Alibaba Cloud, I looked for a solution on the cloud. Many people recommended OpenSearch, so I spent some time evaluating it. It worked well in my tests, and it comes with built-in word segmentation and cloud database synchronization. I also ran into a few problems along the way, which I share here.

Background


To try this out, I used Alibaba Cloud MaxCompute (formerly ODPS) and OpenSearch to build a demo video search engine. With about 10GB of data, setting up the services took only 15 minutes, and data synchronization and indexing took about 1 hour. Because I chose pay-as-you-go billing, the whole experiment cost roughly tens of yuan.

Here is the search result first. The search supports common word segmentation syntax, and OpenSearch comes with rich SDKs and APIs, so it can be easily integrated into an online business.

Experimental Architecture Diagram

 

The search engine is built on OpenSearch, a typical distributed architecture for online, real-time interactive queries. It has no single point of failure, is highly scalable and highly available, requires no operations and maintenance work, and is low cost. It indexes and searches large volumes of information in near real time, enabling fast searches over billions of files and petabytes of data.

The distributed database is built on MaxCompute, a fast, fully managed data warehouse solution for TB/PB-scale data. MaxCompute provides a complete data import solution and a variety of classic distributed computing models, which help users solve massive data computing problems more quickly, reduce enterprise costs, and keep data secure.

Experiment preparation

1. Register as an Alibaba Cloud user, complete real-name verification, and bind an Alipay account;

2. Activate the Data Plus service;

3. Enable the MaxCompute and OpenSearch postpaid services.

 

Experiment tasks

1. Use MaxCompute to import a public dataset;

2. Use OpenSearch to create an application and configure the data/index structure and word segmentation;

3. Run a full data import and build the indexes;

4. Test the search results.

Step 1: Purchase and activate the OpenSearch, MaxCompute, and Big Data Development Kit services

1.1 Activate the OpenSearch service

Visit https://www.aliyun.com/product/opensearch, click Activate Now, and choose Post-Pay (Pay-As-You-Go).

 

 

1.2 Activating MaxCompute & Big Data Development Kit Services

1.2.1 Enabling MaxCompute

Visit https://www.aliyun.com/product/odps with your real-name verified Alibaba Cloud account, activate MaxCompute, and choose Pay-As-You-Go.

 

 

 

1.2.2 Creating a MaxCompute project

Go to the Data Plus management console: either click the management console link on the MaxCompute activation success page from the previous step, or navigate to Product -> Big Data (Data Plus) -> MaxCompute and click the management console.

Create project

After entering the console page, navigate to "Big Data Development Kit -> Project List", click "Create Project", as shown in the figure:

In the pop-up box, select I/O post-payment as the billing method and enter the project name:

Create a MaxCompute table

Open the data development page of the Big Data Development Kit: as a developer, go to Alibaba Cloud Data Plus Platform > Big Data Development Kit > Management Console, then use the operation column of the corresponding project in the project list to enter the workspace.

Note: If you are using the Data Plus platform for the first time, you need to create and activate an AccessKey (AK) first.

Step 2: Import the dataset to MaxCompute through the Big Data Development Kit

After entering the Big Data Development Kit workspace, we first import a test dataset.

Data description: I use a MaxCompute public dataset (in public beta), described at https://yq.aliyun.com/articles/89763. The open data categories currently include stock price data, real estate information, and film and television data with box office figures. All of the data is stored in the public_data project in MaxCompute.

Here we use the film and television box office data.

It is very simple to use. The only prerequisite is that MaxCompute and the Big Data Development Kit are activated.

In the big data development kit, create a new script, name it opensearch_demo, and execute the following statement in the window.

add user ALIYUN$everyone;

After the execution is complete, all members in the user's project space can read each public data set.

Verify it:

select * from public_data.dwd_product_movie_basic_info where movie_name like '%生化危机%' limit 10;

Copy the data into your own project. Note: OpenSearch tables have a primary key, so we need to create one in MaxCompute. Here it is generated with the UUID function.

Execute the following statement in the window:

create table alian.demo_opensearch_case2 as select uuid() as id,* from public_data.dwd_product_movie_basic_info ;

 

After it executes successfully, verify the data:

select count(1) from alian.demo_opensearch_case2;

A non-zero count confirms that the table has been created and populated.
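The primary-key step can be illustrated outside MaxCompute. This is a minimal sketch, assuming Python's uuid4 as a stand-in for MaxCompute's uuid() and using hypothetical sample rows; it mirrors "select uuid() as id, *" by putting a generated id column in front of each row, then checks that the ids are unique, which is what a primary key needs.

```python
import uuid


def add_primary_key(rows):
    """Return copies of the rows with a generated 'id' column placed first.

    uuid4 is used here only as a stand-in for MaxCompute's uuid().
    """
    return [{"id": str(uuid.uuid4()), **row} for row in rows]


def ids_are_unique(rows):
    """True if no two rows share the same 'id' value."""
    ids = [row["id"] for row in rows]
    return len(ids) == len(set(ids))


# Hypothetical sample rows; two rows with identical content still get distinct ids.
sample_rows = [
    {"movie_name": "Example Film", "director": "A. Example"},
    {"movie_name": "Example Film", "director": "A. Example"},
]
keyed_rows = add_primary_key(sample_rows)
print(ids_are_unique(keyed_rows))
```

Generating the key per row, rather than deriving it from the movie name, keeps rows distinct even when the source data contains repeated titles.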

Step 3: Create an OpenSearch Application

3.1 Enter the OpenSearch console and click "Create Application"

3.2 Select the product version. I chose the standard version. If you need multi-table joined search, choose the advanced version; for single-table queries, the standard version is fine.

3.3 Enter the application name MaxCompute_OpenSearch_Demo and select East China 1 (Hangzhou) as the region. MaxCompute is currently available only in East China, so the data link will not work with any other region. Click Next.

3.4 Select "Create application structure through data source". The initial application structure can be quickly created from the source table structure, saving the workload of manual construction and reducing the probability of errors.

3.5 Select ODPS as the data source, then select the project just created and the table demo_opensearch_case2.

[Note] A STRING column in the ODPS table must be converted to LITERAL before it can be used as the primary key.

 

3.6 Configure indexing, word segmentation and search display content

Select movie_name, director, scriptwriter, area, actors, type, movie_date, and movie_language as index fields, and use the default Chinese word segmentation method.

Add display fields to define the content returned in search results.
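To make the role of index fields and word segmentation concrete, here is a toy inverted index. It is only a sketch of the general idea, not how OpenSearch is implemented: the whitespace tokenizer stands in for a real Chinese word segmenter, only three of the fields above are indexed, and the sample documents are hypothetical.

```python
from collections import defaultdict

# A subset of the index fields selected above, kept short for the example.
INDEX_FIELDS = ["movie_name", "director", "actors"]


def tokenize(text):
    """Stand-in for a word segmenter: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs, fields=INDEX_FIELDS):
    """Map each token found in the indexed fields to the ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for field in fields:
            for token in tokenize(doc.get(field, "")):
                index[token].add(doc["id"])
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    token_sets = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()


# Hypothetical documents; 'summary' is not an index field, so it is not searchable.
docs = [
    {"id": "1", "movie_name": "Space Harbor", "director": "A. Example",
     "actors": "B. Sample", "summary": "lights"},
    {"id": "2", "movie_name": "Harbor Lights", "director": "C. Placeholder",
     "actors": "B. Sample", "summary": ""},
]
index = build_inverted_index(docs)
print(sorted(search(index, "harbor")))
print(sorted(search(index, "harbor lights")))
```

Fields left out of the index, like summary here, can still be returned as display fields but cannot be matched by a query, which is why the two lists are configured separately.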

3.7 The application is created

Step 4: Synchronize data and create indexes

4.1 Activate the app

Select the storage quota and QPS. The dataset we use is about 8GB, so I chose a 10GB quota and left QPS at the default.

Note: MaxCompute (formerly ODPS) stores data compressed. The size reported for our data is 2GB after compression, but the data is actually 8GB. I had previously purchased a 3GB OpenSearch quota, and the import failed.
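The sizing pitfall in the note comes down to simple arithmetic. The sketch below uses only the figures from this experiment (2GB compressed, 8GB actual, so a ratio of about 4); the function names are hypothetical, the ratio will differ for other tables, and any extra space the index itself may need is not modeled.

```python
def estimate_raw_gb(compressed_gb, compression_ratio):
    """Estimate the uncompressed size from the compressed size MaxCompute reports."""
    return compressed_gb * compression_ratio


def quota_is_sufficient(quota_gb, compressed_gb, compression_ratio):
    """True if the quota covers the estimated uncompressed size."""
    return quota_gb >= estimate_raw_gb(compressed_gb, compression_ratio)


# Ratio observed in this experiment: 8GB actual / 2GB compressed.
OBSERVED_RATIO = 8 / 2

print(estimate_raw_gb(2, OBSERVED_RATIO))          # estimated uncompressed size in GB
print(quota_is_sufficient(3, 2, OBSERVED_RATIO))   # the 3GB quota that failed
print(quota_is_sufficient(10, 2, OBSERVED_RATIO))  # the 10GB quota used here
```

Sizing the quota from the compressed figure alone (2GB) is what made the 3GB purchase look adequate when it was not.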

 4.2 Start building the index

The main thing here is to wait. I waited about an hour.

You can view the index build progress while you wait.

Step 5: Search Test

Open Application Management -> Search Test and enter the name of any video, such as the recently released "Wrestling Dad" (Dangal). The matching video information is returned automatically, which completes the experiment.
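For searching from code instead of the console, a query string has to be assembled first. The sketch below is an assumption based on my recollection of OpenSearch's clause-style query syntax (a query clause and a config clause joined by "&&", with "default" as the index name); the exact syntax, the index name in your application, and the escaping rules should be checked against the OpenSearch documentation. The helper name is hypothetical, and because escaping is not covered here, keywords containing quotes are rejected.

```python
def build_search_query(keyword, index_name="default", start=0, hit=10):
    """Assemble a clause-style query string for a keyword search.

    The clause layout is an assumption; verify it against the OpenSearch docs.
    """
    if not keyword.strip():
        raise ValueError("keyword must not be empty")
    if "'" in keyword or '"' in keyword:
        # Escaping rules are not covered in this sketch, so refuse quotes.
        raise ValueError("keyword must not contain quote characters")
    if start < 0 or hit <= 0:
        raise ValueError("start must be >= 0 and hit must be > 0")
    query_clause = f"query={index_name}:'{keyword}'"
    config_clause = f"config=start:{start},hit:{hit},format:json"
    return f"{query_clause}&&{config_clause}"


print(build_search_query("生化危机"))
```

The resulting string would then be passed to whichever SDK or HTTP client the application uses; that call is left out because it depends on the SDK version.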

The public datasets provided by MaxCompute are very good: they are large and up to date.

Summary: This completes the entire experiment. I find OpenSearch+MaxCompute very convenient, and well suited to enterprises that have more than 100GB of data and want to avoid high operations, maintenance, and IT costs.
