A Multi-label Classification Method for Trademark Retrieval

 
  


Author: Mirror | Editor: Jishi Platform

Jishi Platform Guide:

 

This paper proposes a multi-label classification and similarity retrieval system dedicated to trademarks, which comprehensively considers the shape, color, business field, semantics, and general characteristics of trademarks, and allows users to combine and assign weights to the above characteristics.


Paper address: https://arxiv.org/pdf/2205.05419v1.pdf

Trademark retrieval was the first commercial project I worked on. It differs from general image retrieval and has many unique properties, and at the time I spent a long while researching and exploring it. The trademark retrieval system I later developed was adopted by thousands of companies and intellectual property agencies around the world. Several years have passed since then, and the whole field of deep learning has advanced considerably. Recently I happened upon this latest trademark retrieval paper on arXiv, so I studied it carefully.

0. Introduction

A trademark is a picture, so trademark retrieval is naturally closely related to image retrieval. What makes a trademark special is that its content may be pure text, pure graphics, or a combination of the two, and the graphics may be real pictures (photographs) or abstract shapes (circles, triangles, rectangles, lines).

Text is a special kind of symbol that only humans can interpret; in trademarks it is often confused with graphics and interferes with retrieval results. Abstract graphics, meanwhile, are often ambiguous: different people will recognize them as different things depending on what they focus on. Traditional retrieval methods based purely on image information therefore may not achieve the best results on trademarks.


This paper proposes a multi-label classification and similarity retrieval system dedicated to trademarks, which comprehensively considers the shape, color, business field, semantics, and general characteristics of trademarks, and allows users to combine and assign weights to the above characteristics.

Since the text in a trademark is usually the company name or other content unrelated to classification, its presence is treated as interference with the graphic features. This paper therefore proposes using a text detection network to locate the text, and then an erasure network to remove it from the trademark.

The Vienna Classification, currently the most widely used and recognized trademark classification scheme, still has certain problems in multi-label classification tasks. This paper combines several other trademark classification schemes to reorganize and integrate the Vienna Classification.

The contributions of this paper are as follows:

  1. Adds text detection in the preprocessing stage, combined with a text removal method, to improve retrieval quality.

  2. Extracts different features with multiple models and fuses them, and proposes a weight-assignment scheme for the similarity calculation.

  3. Proposes a set of multi-label classification annotations by reorganizing the Vienna Classification codes.

  4. Relates and contrasts the Vienna features and retrieval from a design perspective, drawing on existing classification taxonomies.

  5. Evaluates the quality of the proposed method against past SOTA schemes.

1. Background

1.1 Trademark search

Traditional trademark retrieval methods generally extract a series of hand-crafted features and then compare similarity with the KNN algorithm. Commonly used features include color histograms, shapes, and SIFT local feature descriptors. In most cases a bag-of-words model is needed to reduce the feature dimension, and similarity is generally based on the distance between feature vectors.
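As a toy illustration of that traditional pipeline (a hand-crafted feature plus nearest-neighbor comparison), here is a minimal sketch using an intensity histogram as the feature. The tiny synthetic "trademarks" and all names are invented for illustration:

```python
import numpy as np

def histogram_feature(img, bins=16):
    """Hand-crafted feature: normalized intensity histogram (pixels in [0, 1])."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def nearest_neighbors(query_feat, db_feats, k=2):
    """Plain KNN on Euclidean distance between feature vectors."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

# Tiny synthetic "trademarks": a dark, a bright, and a mid-gray image.
db = [np.full((8, 8), v) for v in (0.1, 0.9, 0.5)]
db_feats = np.stack([histogram_feature(im) for im in db])

query = np.full((8, 8), 0.12)  # intensity-wise closest to the dark image
print(nearest_neighbors(histogram_feature(query), db_feats, k=1))  # -> [0]
```

A real system would of course use richer features (SIFT plus bag-of-words, as above), but the compare-by-distance skeleton is the same.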

With the advent of deep learning technology, feature extraction is naturally completed by convolutional neural networks. Better network structures and different supervised learning methods can optimize feature quality and improve retrieval results.

1.2 Multi-label classification

Since a trademark usually carries multiple different annotation labels, trademark retrieval requires a multi-label classification model. Deep-learning-based multi-label classification usually relies on a Sigmoid activation at the output layer, supervised with BCE loss; since ViT became popular I have also been following Transformer-based multi-label classification models. Improving multi-label classification accuracy is also crucial to trademark retrieval quality.
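The standard Sigmoid-plus-BCE setup can be sketched in a few lines of NumPy. This is a minimal illustration of the mechanism, not the paper's actual network; the logits and targets below are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, targets, eps=1e-7):
    """Binary cross-entropy, averaged over all label positions."""
    p = np.clip(sigmoid(logits), eps, 1 - eps)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

# One sample with 4 independent labels (multi-hot target).
logits  = np.array([3.0, -3.0, 2.5, -2.0])   # raw network outputs
targets = np.array([1.0,  0.0, 1.0,  0.0])   # e.g. "circle" and "blue" present

probs = sigmoid(logits)               # each label is scored independently
preds = (probs > 0.5).astype(int)     # threshold per label, not argmax
print(preds)                          # -> [1 0 1 0]
```

The key difference from single-label classification is that Sigmoid scores each label independently (no Softmax competition), so several labels can be "on" at once.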

However, the multi-label classification method used in this paper is still relatively primitive: it follows the process I described above and adopts no improvements to the loss function or model structure. If I were developing this system, I would add those improvements.

1.3 Dataset and classification topology

Most trademark data is actually hard for ordinary people to obtain. Although trademark office websites allow free queries, it is very difficult to obtain a large amount of trademark data at once. Where there is demand, however, there is a market: some trademark data vendors in the industry collect and sell this data, at quite a high price.

Later, several free trademark datasets were released for research purposes, including the Large Logo Dataset (LLD), METU, and Logos in the Wild, but most of them are small (under one million images) with relatively simple labels; some only label the brand. In a national trademark office's database, each trademark is classified and labeled in great detail, which is why trademark databases are expensive. In fact, for an application with such well-labeled and massive data, deploying deep learning technology is simply a perfect fit.

The most common and widely recognized trademark classification scheme at present is the Vienna Classification, proposed by the World Intellectual Property Organization (WIPO). In the industry, Vienna Classification labels are generally called Vienna codes, or figurative element codes.

Put simply, the Vienna coding is a three-level tree classification structure: the deeper the level, the finer the granularity. It describes a trademark's graphic features, semantic concepts, colors, shapes, and so on. Here is a part of the Vienna Classification; the first and second levels are easy to understand:

[Images: excerpts of the Vienna Classification, levels 1 and 2]

In addition to the Vienna code, there are also different trademark classification methods, such as Wheeler and Chaves.

Wheeler proposes to divide trademarks into three categories:

  • Wordmarks (stand-alone initials, company or product names)

  • emblems (trademarks in which the company name is inseparable from the image element)

  • only symbols (divided into fonts, graphics, abstract symbols)

However, the boundary of this classification method is not clear enough, and a large number of trademarks cannot be classified into a single class. Afterwards, Chaves proposed an alternative:

  • logotypes (same as Wordmarks, but with three subclasses: plain logotype, logotype with background, logotype with decoration)

  • logo-symbol (same as emblems)

  • logotypes with symbols (combination of logotype and symbols)

  • only symbols

This paper further reorganizes and summarizes the Vienna Classification into four categories:

  • Figurative (image-related)

  • Colors (color related)

  • Shapes (shape related)

  • Text (text related)

Next, let me introduce each category in detail:

Figurative

This category contains Vienna Classes 1 to 25, which describe the image or semantic features of trademarks. Since the third level of the Vienna Classification is too fine-grained, and many classes contain too few trademarks to be conducive to model classification, only the first- and second-level classification labels are used.

Colors

Vienna Class 29 describes colors, but some of its labels are not suitable for multi-label classification (e.g., 29.01.12 denotes two predominant colors), so this paper cleans the class and finally retains 13 labels:

[Image: the 13 retained color labels]
Shapes

Vienna Class 26 describes shapes (such as triangles, quadrilaterals, etc.). Its third-level labels are likewise too fine-grained and ambiguous, so only the second-level classification labels are used, and 26.07 and 26.13 are merged into 26.5 (other polygons).

Text

Vienna Class 27 defines words, but it is not applicable to multi-label classification, so this paper uses Class 27 only as a flag for whether the trademark contains text.
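The reorganization described above (levels 1–2 only for classes 1–25, the polygon merge in class 26, class 27 reduced to a presence flag, a cleaned label set for class 29) can be sketched as a simple mapping over code strings. This is my own illustrative reading of the paper's scheme, not its code; the function name is hypothetical, and the merged polygon label is written "26.05" here:

```python
def reorganize_vienna(code: str) -> tuple:
    """Map a raw Vienna code like '26.07.03' onto the paper's four groups."""
    parts = code.split(".")
    main = int(parts[0])
    two_level = ".".join(parts[:2])           # drop the 3rd level everywhere
    if 1 <= main <= 25:                       # Figurative: image/semantic features
        return ("figurative", two_level)
    if main == 26:                            # Shapes: merge 26.07/26.13 polygons
        if two_level in ("26.07", "26.13"):
            two_level = "26.05"               # "other polygons"
        return ("shapes", two_level)
    if main == 27:                            # Text: presence flag only
        return ("text", "contains_text")
    if main == 29:                            # Colors: cleaned label set
        return ("colors", two_level)
    return ("other", two_level)

print(reorganize_vienna("26.07.03"))  # -> ('shapes', '26.05')
print(reorganize_vienna("27.05.01"))  # -> ('text', 'contains_text')
```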

The paper also summarizes and compares the similarities and differences of these classification schemes:

[Image: comparison of the classification schemes]

In addition, this paper uses the Nice Classification, generally known in the industry as the classification of goods and services. It actually has nothing to do with a trademark's content, and many large companies register their trademarks in all categories at once to avoid misuse by others, so in my opinion using this information is not particularly reasonable.

2. Method

[Image: overview of the retrieval system]

As the figure shows, the retrieval system in this paper uses seven models: shape, color, main category, subcategory, business field, text, and reconstruction features. The first six are multi-label classification models; the reconstruction-feature model is an autoencoder trained with an encoder-decoder structure to learn general graphic features. Note also that the input to the shape model is the trademark with the text removed, so the features it extracts are text-free.

The features extracted by the seven models are concatenated into one feature vector and stored in the database for nearest-neighbor matching and recall, then ranked using a weighted distance formula proposed in this paper.

2.1 Data preprocessing

The first step of data preprocessing is to crop and scale the image so that the trademark fills the entire frame, avoiding the influence of trademark position on feature extraction. The second step uses the CRAFT text detector to locate text, and then a network to erase the words from the trademark:
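A minimal sketch of the first step — cropping away uniform background so the mark fills the frame — assuming a grayscale image in [0, 1] and a known background intensity. This is my own illustration, not the paper's pipeline:

```python
import numpy as np

def crop_to_content(img, bg=1.0, tol=0.05):
    """Crop away near-uniform background around the trademark.

    img: HxW grayscale array in [0, 1]; bg: assumed background intensity.
    """
    mask = np.abs(img - bg) > tol             # pixels that differ from background
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():                        # blank image: nothing to crop
        return img
    r0, r1 = np.where(rows)[0][[0, -1]]       # first/last non-background row
    c0, c1 = np.where(cols)[0][[0, -1]]       # first/last non-background column
    return img[r0:r1 + 1, c0:c1 + 1]

# A 6x6 white canvas with a 2x3 dark mark in the middle.
canvas = np.ones((6, 6))
canvas[2:4, 1:4] = 0.2
print(crop_to_content(canvas).shape)  # -> (2, 3)
```

After cropping, the image would be rescaled to the fixed input size (256×256 in this paper's setup).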

[Image: text detection and removal example]

2.2 Multi-label classification

[Image: the multi-label classification model]

The multi-label classification model used in this paper is shown in the figure. It is actually very simple, perhaps to keep inference time low, so no particularly complex network is used. If I were designing it, I would use RepVGG (for faster inference) or a Transformer-based network (for better feature quality).

2.3 Similarity retrieval

[Image: the reconstruction-feature (autoencoder) network]

The reconstruction feature network is an autoencoder structure, which is also a common form of Hourglass network.

The final feature vector is obtained by concatenating the seven networks' features: the first six are 128-dimensional and the seventh is 256-dimensional. L2 normalization is applied to this vector to improve retrieval performance.
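Concretely, the concatenation and L2 normalization look like this (random stand-ins for the real network outputs; the total dimension works out to 6×128 + 256 = 1024):

```python
import numpy as np

rng = np.random.default_rng(0)

# Seven sub-features: six 128-d classification features plus one 256-d
# reconstruction feature, as described above (random stand-in values).
sub_feats = [rng.normal(size=128) for _ in range(6)] + [rng.normal(size=256)]

vec = np.concatenate(sub_feats)      # 6*128 + 256 = 1024 dimensions
vec = vec / np.linalg.norm(vec)      # L2 normalization: unit-length vector

print(vec.shape)  # -> (1024,)
```

With unit-length vectors, Euclidean distance and cosine similarity become monotonically equivalent, which is one common reason to normalize before retrieval.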

For similarity comparison, the paper uses KNN and LabelPowerset, but to be honest, LabelPowerset is not very practical for large-scale retrieval; nowadays everyone uses vector retrieval schemes based on approximate nearest neighbors.

In the similarity calculation part, this paper proposes a weighting scheme in which users can customize the weights:

c is the weight of different sub-features. It can be seen that when calculating the similarity, each sub-feature is taken out separately for weighted comparison calculation.
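The paper's exact formula is not reproduced here, but the idea — compute a distance per sub-feature and combine them with the user-assigned weights c — can be sketched as follows. The block layout matches the 6×128 + 256 concatenation above; the example weights are illustrative:

```python
import numpy as np

# Sub-feature boundaries inside the 1024-d vector (six 128-d blocks + one 256-d).
SPLITS = [128, 256, 384, 512, 640, 768]

def weighted_distance(a, b, c):
    """Weighted sum of per-sub-feature Euclidean distances."""
    parts_a, parts_b = np.split(a, SPLITS), np.split(b, SPLITS)
    return sum(w * np.linalg.norm(pa - pb)
               for w, pa, pb in zip(c, parts_a, parts_b))

a, b = np.zeros(1024), np.zeros(1024)
b[:128] = 1.0                          # differ only in the first (shape) block
weights = [0.7, 0.3, 0, 0, 0, 0, 0]    # e.g. 70% shape, 30% color, rest ignored
print(weighted_distance(a, b, weights))
```

Setting a sub-feature's weight to zero removes that aspect from the ranking entirely, which is exactly the user-controlled trade-off the paper describes.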

3. Implementation details

3.1 Data

Although the labeling information of trademark data is very rich, there are still many problems in it.

The most obvious is incomplete labeling: annotators often label only the most conspicuous and representative parts, which is highly subjective. For example, the following trademark is labeled only as red, while the black and blue are not labeled:

[Image: a trademark labeled only with red, though it also contains black and blue]

Incomplete labeling also occurs in the text annotations. Of the EU trademark data used in this paper, only about 30% contains text-presence labels, and according to the authors' observation, in most of those the words are the main body of the trademark, while the vast remainder of trademarks contain words that are simply unlabeled. This paper therefore uses the CRAFT text detector to machine-annotate the data.

All images are scaled to 256×256 when used.

3.2 Evaluation indicators

This paper uses two evaluation indicators:

Label Ranking Average Precision(LRAP):

[Image: LRAP formula]

Label Ranking Loss(LRL):

[Image: LRL formula]
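Both metrics are standard ranking measures and easy to state in code. Here is a minimal NumPy implementation of LRAP (higher is better) and LRL (lower is better), checked against a small hand-worked example:

```python
import numpy as np

def lrap(Y, S):
    """Label Ranking Average Precision: for each true label, the fraction of
    labels ranked at or above it that are also true, averaged over everything."""
    vals = []
    for y, s in zip(Y, S):
        true_idx = np.flatnonzero(y)
        precs = []
        for j in true_idx:
            rank = np.sum(s >= s[j])              # position among all labels
            hits = np.sum(s[true_idx] >= s[j])    # true labels at/above it
            precs.append(hits / rank)
        vals.append(np.mean(precs))
    return float(np.mean(vals))

def lrl(Y, S):
    """Label Ranking Loss: fraction of (true, false) label pairs that are
    ordered incorrectly, i.e. the true label scored below the false one."""
    vals = []
    for y, s in zip(Y, S):
        pos, neg = np.flatnonzero(y), np.flatnonzero(y == 0)
        wrong = sum(s[p] < s[n] for p in pos for n in neg)
        vals.append(wrong / (len(pos) * len(neg)))
    return float(np.mean(vals))

Y = np.array([[1, 0, 0], [0, 0, 1]])
S = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])
print(round(lrap(Y, S), 4), round(lrl(Y, S), 4))  # -> 0.4167 0.75
```

A perfect ranking (every true label scored above every false one) gives LRAP = 1.0 and LRL = 0.0.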

4. Experimental results

4.1 Multi-label classification

[Image: multi-label classification results]

Judging from the results, text removal clearly improves multi-label classification accuracy: the LRAP metric rises from 0.6899 to 0.7699.

4.2 Similarity retrieval

[Image: similarity retrieval results]

Since evaluating retrieval quality is highly subjective, I think this metric based on label recall and ranking can only serve as a reference, and the size of the data also limits the choice of similarity matching methods.

[Image: retrieval comparison of autoencoder features and task-specific features]

The paper compares the retrieval quality of the autoencoder features against the task-specific networks' features. The results show that the autoencoder features focus more on shape and are less sensitive to color, and that they fall short of the dedicated network on each individual task. Their advantage is that a single network's features perform well on almost all tasks (except color), so they can be used when the user has no special requirements, or when retrieval speed matters more.

4.3 Result display

[Image: recall results of the different sub-networks]

From the recall results of the different sub-networks, they basically accomplish their intended goals. But from my personal point of view, because the text network only cares about "whether the trademark contains words", its recall results have no relevance in terms of the actual text content, so the value it produces may be very limited.

The same problem exists for recall based on the business-field features. As I mentioned, the goods-and-services categories of a trademark registration are actually unrelated to the trademark's content: many companies register their trademarks in as many categories as possible to keep them from being registered and misused by others. From a recall perspective, the logic "I found you because you are in the same product category as me" may hold, but I suspect the improvement to retrieval quality is relatively limited.

The autoencoder network is better suited to precise matching: it finds highly similar trademarks very well, but the quality of the lower-ranked results is relatively poor.

70724778ca2f3e6b6fa4bfb5e21d598f.png

By adjusting the weights of the different sub-networks, the ranking can be tuned. The experiments show this is feasible, but reaching the desired effect will, I think, take a long period of real-world testing to find an optimal weight ratio. In actual product design it is unrealistic to leave the weights entirely to users: most users just want to click search and get the results they want, and the fewer adjustments they must make, the better.

One reference this paper does provide: a weight of 30% on color and 70% on shape achieves the best result on the test dataset.



Origin blog.csdn.net/qq_42722197/article/details/131255771