Graph Practice | Shopee's Multilingual Commodity Knowledge Graph: Construction Methods and Applications

Reposted from the official account | DataFunTalk


Introduction: Shopee is an e-commerce platform serving multiple markets around the world, committed to giving consumers a convenient, safe, and fast shopping experience. Shopee operates across many languages and markets, so this international platform has to handle complex corpora in multiple and mixed languages. My own work focuses on the graphs built around the platform's products and on the construction of graph algorithms, and I hope this sharing brings you some useful takeaways. It covers: experience in constructing commodity knowledge graphs across multiple markets, the latest progress and new applications of the commodity knowledge graph, and how to build technical models and frameworks that meet the demands of complex e-commerce applications.

Table of contents:

1. Knowledge Modeling

2. Knowledge Acquisition

3. Knowledge Fusion

4. Knowledge Application

5. Prospects for the Knowledge Graph

Speaker | Zhang Yichi, Listing Team Leader, Shopee

Editor | Zhang Zhenfei, Shenzhou Xinqiao

Produced by | DataFun community


01

Knowledge Modeling

First, let us look at knowledge modeling.

1. Knowledge Ontology


As can be seen from the figure above, consumers can use the Shopee app to find, browse, and purchase products under specific categories through the category options. The category system is a very important part of the ontology layer used to manage product information in the commodity graph. The ontology layer mainly consists of the product categories and the specific attributes of each category; the combination of these categories and attributes represents the specific information of every commodity entity in the whole graph.

The e-commerce category system is a tree, going from the coarsest granularity to the finest, and different branches have different depths. Take mobile electronics as an example: it can be subdivided into wearable electronics, which in turn includes smart watches and so on. For each fine-grained category, we sort out the attribute types and attribute values that users care about. Taking a T-shirt as an example, consumers and the platform may pay more attention to information such as its brand and material; brand and material are attribute types (Attribute Type), and we then sort out the specific attribute values (Attribute Value) under each type, for example the material values cotton, silk, and so on.

The ontology layer of the product knowledge graph is thus built from the combination of category, attribute type (Attribute Type), and attribute value (Attribute Value), and this ontology is used to express the information of all specific commodity entities.

2. Knowledge Ontology - Ontology and Entity


In this figure, the upper part is the ontology and the lower part shows the individual commodity entities. Commodity entities also come in different granularities: the page we see when shopping is an item, i.e. the product dimension, and when we pick a specific model to buy we choose a SKU (model), which is the finest-grained product information. Combining such an ontology with the commodity entities enables structured management and representation of large-scale product information.
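
As a concrete illustration, below is a minimal Python sketch of this modeling (not Shopee's actual schema; all class and field names are illustrative): a category tree whose leaves carry an attribute schema, plus product entities that hold attribute values and a list of SKU models.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    children: list["Category"] = field(default_factory=list)
    # Attribute types relevant to this category, each with its allowed values,
    # e.g. {"Material": ["Cotton", "Silk"], "Brand": [...]}
    attribute_schema: dict[str, list[str]] = field(default_factory=dict)

@dataclass
class SKU:
    sku_id: str
    attributes: dict[str, str]          # e.g. {"Color": "Blue", "Size": "M"}

@dataclass
class Product:
    item_id: str
    title: str
    category: Category                  # a leaf node of the category tree
    attributes: dict[str, str]          # product-level attribute values
    skus: list[SKU] = field(default_factory=list)

# Ontology layer: a tiny slice of the category tree
tshirt = Category("T-Shirt", attribute_schema={"Brand": ["UNIQLO"], "Material": ["Cotton", "Silk"]})
apparel = Category("Men's Apparel", children=[tshirt])

# Entity layer: one listing (item) with two purchasable SKU models
item = Product("I1", "Basic cotton tee", tshirt,
               {"Brand": "UNIQLO", "Material": "Cotton"},
               [SKU("S1", {"Color": "Blue", "Size": "M"}),
                SKU("S2", {"Color": "Blue", "Size": "L"})])
```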

3. Knowledge Ontology - Uplift All in One


With the development of the economy, e-commerce keeps evolving to meet rapidly changing market demand, so the ontology layer of an e-commerce platform cannot stay static either.

In Shopee's early days, each language market had its own category ontology and design. We later found that a single unified ontology is more conducive to exchanging products across multilingual corpora and markets and to converting product information efficiently between languages, so we merged the ontologies of the different languages into a globally unified system, the Global Category Tree. Under the same category system and the same attribute system, different language versions can be used to manage commodity entity information in all markets.

4. Knowledge Ontology - Uplift Continuously


For the graph ontology, the core pain point we encountered was how to iterate the ontology so that it keeps pace with the times. As the market develops, new categories, new attribute types, and new attribute values keep emerging, but they are relatively rare in the existing corpus, so how can we capture them in time? Our technical approach starts from New Phrase Mining. Ordinary NER models do not handle the OOV problem well enough for our application, so our core idea is to introduce the MINER model to alleviate it. The main idea is to take SpanNER as the base model, introduce an information bottleneck layer, and reformulate the objective function in terms of mutual information, which helps the model make better use of context and improves its generalization ability. With this technique for continuously mining new category words, attribute types, and attribute values, span-level precision improved by 4.5%+ and value-level recall by 7.4%+, which is quite impressive. Based on this continuous-mining pipeline, we can intelligently recommend adjustments to the ontology layer, combine them with online effect evaluation, and keep iterating the mining cycle on new corpora.
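
The snippet below is only a rough sketch of the information-bottleneck idea rather than the MINER implementation: span representations are passed through a stochastic bottleneck, and a KL term is added to the usual classification loss, which upper-bounds the mutual information between the input and the span encoding (the dimensions and the beta weight are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanIBClassifier(nn.Module):
    """Span classifier with a variational information bottleneck (sketch)."""
    def __init__(self, hidden: int, n_labels: int, z_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.cls = nn.Linear(z_dim, n_labels)

    def forward(self, span_repr):                 # span_repr: [num_spans, hidden]
        mu, logvar = self.to_mu(span_repr), self.to_logvar(span_repr)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.cls(z), mu, logvar

def ib_loss(logits, labels, mu, logvar, beta=1e-3):
    ce = F.cross_entropy(logits, labels)
    # KL(q(z|x) || N(0, I)) acts as an upper bound on I(X; Z)
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return ce + beta * kl

# toy usage: 8 candidate spans, 768-dim encoder features, 5 span labels
model = SpanIBClassifier(hidden=768, n_labels=5)
spans = torch.randn(8, 768)
labels = torch.randint(0, 5, (8,))
logits, mu, logvar = model(spans)
loss = ib_loss(logits, labels, mu, logvar)
loss.backward()
```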

02

Knowledge Acquisition

1. Challenges


In day-to-day knowledge acquisition work we face many challenges. When processing product corpora we encounter many languages, and even mixtures of several languages. We also have to handle fine-grained classification: the category system reaches thousands of categories, each with its own corpus characteristics; the combinations of category and attribute dimensions reach 10K+; and combined with the attribute values under each attribute type, the overall scale reaches 260K+. At this scale, the accuracy of the overall service must stay above 90%.

Facing these challenges, we need better technical approaches: with limited developers and R&D time, we must respond quickly to the iteration demands of online services and guarantee their quality. We therefore need a scalable technical architecture that meets our application requirements.

2. Item Category Classification


First, the tasks and solutions around product category classification. The core goal is to understand the category information of products and to improve and guarantee its accuracy, while also providing classification services to the merchant listing system to ensure efficiency and stability. The problem can be split into several tasks:

① How to recommend categories accurately for new products;

② How to migrate existing products into a new category system;

③ How to capture and correct errors in the category information of existing products in a timely manner.


As e-commerce platforms develop, the way product information is expressed keeps changing in order to attract users' attention. This is a challenge for the model: it is not enough to build an accurate model once; it must be iterated continuously to maintain its effectiveness.


To handle category classification, a model architecture has to be designed, and there is more than one option. The first is a cascade: each product is first classified into one of a few dozen coarse-grained categories, and under each of these there are finer-grained classifiers. In this way each sub-model only has to distinguish a relatively small number of categories, so the classification can be more refined. The second is a more end-to-end framework: product information is fed in directly and the finest-grained category is predicted in one step.

Both architectures have pros and cons. The downside of the first is the number of models to manage: in a single language market there are already dozens of models, and across more than ten language markets the total reaches hundreds. The second is more end-to-end, but its effect on some fine-grained categories can vary, and optimizing one fine-grained category may affect others at the same time. We choose between the two systems based on their actual performance.
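
A minimal sketch of the cascade idea (illustrative only; the lambda "models" stand in for trained neural classifiers):

```python
def cascade_classify(title, coarse_model, fine_models):
    """Route a product through a coarse classifier, then a per-branch fine classifier."""
    coarse = coarse_model(title)              # e.g. "Mobile & Gadgets"
    fine_model = fine_models[coarse]          # one fine-grained model per coarse branch
    return coarse, fine_model(title)

# illustrative stand-ins for trained models
coarse_model = lambda t: "Mobile & Gadgets" if "watch" in t.lower() else "Fashion"
fine_models = {
    "Mobile & Gadgets": lambda t: "Wearables > Smart Watches",
    "Fashion": lambda t: "Men's Apparel > T-Shirts",
}
print(cascade_classify("GPS sport smart watch", coarse_model, fine_models))
```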

Whichever architecture is used, the bottom layer relies on text classification methods and on multimodal methods that combine image and text. Common text models include fastText and BERT. For the multimodal part, after comparing various models we chose an Align-before-Fuse approach for joint understanding of product images and text to find a suitable category. The core idea of Align-before-Fuse is to pre-train with Image-Text Contrastive learning, Image-Text Matching, and Masked Language Modeling, and then reduce the impact of noisy data through Momentum Distillation, so as to achieve better classification results.


With the development and application of these models, the accuracy on our main categories in each market stays at 85%~90%+, while also supporting high-frequency calls from the different listing systems.


The second task is how to respond quickly to changes in the category system and migrate products into new categories. The business background: as the market develops, many new products appear and categories grow; keeping a relatively coarse classification is bad for downstream e-commerce systems and for the shopping experience, so a finer split is required. Technically this is challenging because a new category has no natural training corpus, so the focus of the work is how to construct training corpora intelligently and upgrade to the requirements of the new category system.


The figure above shows the data-mining process and its underlying ideas. The core idea is to mine keywords of changing or emerging categories using Keywords Mining and OOD Detection, and to build samples automatically from those keywords. For example, after mining the keywords of an emerging category, existing listings that are hit by such keywords with high confidence can be added to the training corpus as samples of the new category. For corpora with low confidence or several possible categories, a small amount of manual verification quickly builds training samples and helps the model iterate efficiently.
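
A simplified sketch of the auto-labeling step (keyword sets and thresholds are illustrative): listings that are hit by the mined keywords of a new category with high confidence go straight into the training set, while ambiguous ones are routed to a quick manual check.

```python
def build_samples(listings, category_keywords, min_hits=2):
    """Assign listings to new categories by mined keywords; route unclear cases to review."""
    auto_labeled, to_review = [], []
    for title in listings:
        words = set(title.lower().split())
        # count keyword hits per candidate new category
        hits = {cat: len(words & kws) for cat, kws in category_keywords.items()}
        best_cat, best_hits = max(hits.items(), key=lambda kv: kv[1])
        runner_up = sorted(hits.values())[-2] if len(hits) > 1 else 0
        if best_hits >= min_hits and best_hits > runner_up:
            auto_labeled.append((title, best_cat))       # confident: becomes a training sample
        else:
            to_review.append(title)                      # low confidence / ambiguous
    return auto_labeled, to_review

category_keywords = {
    "Smart Watches": {"smart", "watch", "gps"},
    "Fitness Bands": {"band", "fitness", "tracker"},
}
print(build_samples(["gps smart watch 44mm", "silicone strap"], category_keywords))
```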


Take the case above as an example. The original Global Category Tree had two categories; after expanding to 20+ fine-grained categories, both the text model and the multimodal model reach 90%+ accuracy across many markets, so category adjustments can be handled efficiently.


The third task is how to capture and correct misclassified products. The business background: misplaced product information hurts both consumers and the platform, for example by adding extra logistics costs, hurting merchants' sales, and making product governance harder. The technical difficulty is that misplaced products are relatively hard cases for the model, and the classification model struggles to capture them accurately.


To solve this, we built a Detection model to identify misplaced products and a Correction step that finds a more suitable category for them. In the Detection model, the core idea is to pre-train on the Shopee corpus with a CrossEncoder and multi-task learning and then classify: product information and category information are concatenated, and at each level of the category tree the model judges whether the product is misclassified. For misplaced products, we find the closest or best-scoring category through recall and ranking: the core idea is to use Sentence-BERT with a Siamese network structure and triplet contrastive learning to select one or several most reliable categories and make the correction.
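
The sketch below only illustrates the detect-then-correct flow; the public sentence-transformers checkpoints and the 0.1 threshold are stand-ins for Shopee's in-house CrossEncoder and Sentence-BERT models, so the scores are not comparable to the production system.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

# Public checkpoints stand in for the in-house multi-task models.
detector = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

title = "stainless steel insulated water bottle 750ml"
current_category = "Mobile Accessories > Phone Cases"
candidate_categories = [
    "Home & Living > Drinkware > Water Bottles",
    "Sports > Outdoor Recreation > Hydration",
    "Mobile Accessories > Phone Cases",
]

# Detection (sketch): a low relevance score between the listing text and its
# current category is treated as a misplacement signal.
score = detector.predict([(title, current_category)])[0]
if score < 0.1:                                   # illustrative threshold
    # Correction (sketch): bi-encoder recall + ranking over candidate categories.
    title_emb = encoder.encode(title, convert_to_tensor=True)
    cat_embs = encoder.encode(candidate_categories, convert_to_tensor=True)
    sims = util.cos_sim(title_emb, cat_embs)[0]
    best = candidate_categories[int(sims.argmax())]
    print("suggested category:", best)
```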


The questionable corpus that needs to be processed or labeled is very large, so how do we improve the model while labeling only a small amount of data? For this we optimized the data corpus with what can be understood as active learning over corpus confidence: after training three to four models, voting among them identifies which samples are likely outliers. When sampling for labeling, we sample centroid data, outlier data, and random data, which reduces the amount of labeling needed to improve the model.
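
A rough sketch of the sampling mix, assuming precomputed embeddings and the predictions of a small committee of models (the sample counts and clustering choice are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_for_labeling(embeddings, committee_preds, n_random=50, n_outlier=50, n_centroid=50, seed=0):
    """Pick samples for manual labeling: committee disagreements ("outliers"),
    cluster centroids, and a random slice. embeddings: np.ndarray [n, d]."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)

    # Outliers: samples where the 3-4 committee models disagree the most.
    disagreement = np.array([len(set(p)) for p in zip(*committee_preds)])
    outlier_idx = np.argsort(-disagreement)[:n_outlier]

    # Centroids: the sample closest to each K-Means cluster center.
    km = KMeans(n_clusters=min(n_centroid, n), n_init=10, random_state=seed).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    centroid_idx = np.array([np.where(km.labels_ == c)[0][np.argmin(dists[km.labels_ == c])]
                             for c in range(km.n_clusters)])

    random_idx = rng.choice(n, size=min(n_random, n), replace=False)
    return np.unique(np.concatenate([outlier_idx, centroid_idx, random_idx]))

# toy usage with random data
emb = np.random.default_rng(1).normal(size=(200, 16))
preds = [np.random.default_rng(s).integers(0, 5, size=200) for s in range(4)]
print(len(select_for_labeling(emb, preds, n_random=10, n_outlier=10, n_centroid=5)))
```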


Combining the above work, the service that identifies whether a product's category is misplaced reaches an accuracy above 98%, and search-query-related bad cases are reduced by about 50% in key categories.

3. Item Attribute Recognition


Next, product attribute recognition. As the figure above shows, after product information is input, attributes are recognized with four different approaches: the first is a string-match model; the second is a rule-based model, for patterns such as "Warranty Duration: 1 year" that follow clear corpus regularities; the third recognizes attributes with an NER model; the fourth is an image-based model, i.e. vision and multimodal approaches.

These four approaches yield a variety of candidate attribute types and values from the product information. On top of them we run a layer of attribute-value integration that combines various signals, such as the confidence of each source, to select the types and values with the highest confidence. After obtaining high-confidence attribute values, we further combine the relations between attribute values to supplement the product knowledge inferred from the product information.
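
A minimal sketch of the integration step (the source weights are illustrative, not production values): candidates from the four extractors are pooled per attribute type and the source-weighted winner is kept.

```python
from collections import defaultdict

# Illustrative per-source prior confidences; real weights would be learned or evaluated.
SOURCE_CONFIDENCE = {"string_match": 0.9, "rule": 0.8, "ner": 0.7, "image": 0.6}

def integrate(candidates):
    """candidates: list of (attribute_type, value, source, score) from the four extractors.
    Returns one value per attribute type, picked by source-weighted confidence."""
    pooled = defaultdict(float)
    for attr, value, source, score in candidates:
        pooled[(attr, value)] += SOURCE_CONFIDENCE.get(source, 0.5) * score
    best = {}
    for (attr, value), conf in pooled.items():
        if conf > best.get(attr, (None, 0.0))[1]:
            best[attr] = (value, conf)
    return {attr: v for attr, (v, _) in best.items()}

print(integrate([
    ("Color", "Blue", "ner", 0.85),
    ("Color", "Blue", "image", 0.70),
    ("Color", "Red", "string_match", 0.40),
    ("Warranty Duration", "1 year", "rule", 0.95),
]))
```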


Open-set attribute values usually have many different surface forms, and an NER model is better suited to capturing values that already appear in the product text. So we moved attribute recognition from an NER model to an MRC (machine reading comprehension) model. In the MRC solution, we use a WordPiece tokenizer to alleviate the OOV problem, use the LaBSE pre-trained language model to handle multilingual issues, and use MRC+CRF to extract textual product attributes.
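
The reformulation can be illustrated with a generic extractive question-answering pipeline; the public multilingual checkpoint below is only a stand-in for the in-house LaBSE + MRC + CRF model and merely shows how attribute extraction is recast as reading comprehension.

```python
from transformers import pipeline

# A generic multilingual extractive-QA checkpoint stands in for the in-house model.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

title = "Kaos Pria Lengan Pendek Katun Biru - blue cotton men's t-shirt"
for attr in ("color", "material"):
    # One "question" per attribute type; the answer span is the extracted value.
    ans = qa(question=f"What is the {attr} of the product?", context=title)
    print(attr, "->", ans["answer"], round(ans["score"], 3))
```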


After extracting a large number of attribute values, you will find their expressions vary widely and may contain spelling mistakes or synonyms. In the Samsung case shown here, the values are all blue, but expressed as "blue" and "biru". We need to normalize these words so that downstream applications are served better, converting all product information into a standard layer that downstream systems can understand more efficiently.
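
A minimal normalization sketch using an exact synonym/translation lookup plus fuzzy matching for typos (the dictionary and cutoff are illustrative):

```python
from difflib import get_close_matches

# Canonical values plus known synonyms/translations (illustrative dictionary).
CANONICAL = {"Blue": ["blue", "biru", "navy blue"], "Red": ["red", "merah"]}
LOOKUP = {syn: canon for canon, syns in CANONICAL.items() for syn in syns}

def normalize(value: str) -> str:
    v = value.strip().lower()
    if v in LOOKUP:                       # exact synonym/translation hit
        return LOOKUP[v]
    # fuzzy match catches small typos such as "bleu"
    match = get_close_matches(v, LOOKUP.keys(), n=1, cutoff=0.7)
    return LOOKUP[match[0]] if match else value

print([normalize(x) for x in ("biru", "BLUE", "bleu", "silver")])
# ['Blue', 'Blue', 'Blue', 'silver']
```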


Next, we need to resolve ambiguity, because information extracted from a product can conflict. For example, the color in the title may be "red" while the color in the details is "yellow"; "silver" can indicate either a color or a material; and "red" may be a color or part of the brand name Redmi. Inspired by the prompt approach, we turned this into a generation task based on the T5 model; the figure above is the overall flow chart. The key point is to convert the data into a template format, run the encoder and decoder, and output the value of the attribute type to be identified. In our comparisons T5 performs well and brings a relatively large improvement over other models.
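
The flow can be sketched with a public T5 checkpoint as below; the template format is illustrative, and an untuned t5-small will not give a meaningful answer, whereas the production model is fine-tuned on Shopee data.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("t5-small")        # public checkpoint; the real model is fine-tuned
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Template format (illustrative): pack title, description, and the target attribute
# into one input sequence and let the decoder generate the disambiguated value.
template = ("title: Redmi Note 12 silver 128GB | "
            "description: metal frame, ocean blue back cover | "
            "question: what is the Color?")
ids = tok(template, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```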


Once a product's attributes are identified, we can also reason over them. For example, if the warranty type is "no warranty", then the warranty duration is naturally None. This kind of reasoning can be realized by mining associated attributes in the knowledge graph.
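
A minimal sketch of such rule-based completion (the rules themselves are illustrative; in practice they would be mined from associated attributes in the graph):

```python
# Attribute-dependency rules (illustrative):
# (condition attribute, condition value) -> (target attribute, implied value)
RULES = [
    (("Warranty Type", "No Warranty"), ("Warranty Duration", None)),
    (("Material", "Leather"), ("Machine Washable", "No")),
]

def infer(attributes: dict) -> dict:
    completed = dict(attributes)
    for (cond_attr, cond_val), (target_attr, implied) in RULES:
        if completed.get(cond_attr) == cond_val and target_attr not in completed:
            completed[target_attr] = implied      # fill in the implied attribute value
    return completed

print(infer({"Warranty Type": "No Warranty"}))
# {'Warranty Type': 'No Warranty', 'Warranty Duration': None}
```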


By analogy, product information can be completed not only through associated attributes: the commodity graph also contains relations between products and relations between products and attributes, and information can be completed along these relations as well. On this basis we built the graph system.

03

Knowledge Fusion

Next, we will introduce the part of knowledge fusion, which is divided into ontology fusion, entity fusion and information fusion.

1. Ontology Fusion


Ontology-layer fusion can be understood as mapping and associating Shopee's commodity category system with other category systems on the market, including category mapping, attribute type mapping, and attribute value mapping. It is supported by many atomic technical modules. For category mapping, the mapping relation can be summarized from how products are classified under each system. Attribute types can be mapped with the help of similar and synonymous words, after which the association between types and values under each category is constructed. Such associations are also constrained by accuracy requirements and practical conditions.

2. Entity Fusion


Next, the fusion of the entity layer, which can be understood as identifying and understanding the relations between products at the e-commerce level, such as identical products, similar products, or related products.


There are some classic approaches to the underlying algorithms for these relations. A common one is matching based on image-text similarity. A further step is finer-grained matching of attribute types based on the commodity graph, which lets us decompose the requirements of a product-matching relation in a more business-interpretable way. For example, we may want to know whether two products share the same brand, material, and color, or define finer- or coarser-grained relations, which makes business customization easier.


For matching based on image-text similarity, we mainly built a recall-and-ranking framework: product information is turned into embeddings, and the image-text embeddings are used for retrieval recall and fine ranking, realizing the construction of identical-product relations based on similarity.
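
A toy sketch of the recall-then-verify flow, assuming embeddings have already been produced by an image-text encoder (the attribute keys used for verification are illustrative):

```python
import numpy as np

def recall_top_k(query_emb, catalog_embs, k=5):
    """Vector recall: top-k candidates by cosine similarity over precomputed embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    c = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

def same_product(attrs_a, attrs_b, keys=("Brand", "Model", "Color")):
    """Fine ranking / verification: candidates must agree on the key attributes."""
    return all(attrs_a.get(k) == attrs_b.get(k) for k in keys)

# toy data: 4 catalog items with 8-dim embeddings and extracted attributes
rng = np.random.default_rng(0)
catalog_embs = rng.normal(size=(4, 8))
catalog_attrs = [{"Brand": "Apple", "Model": "iPhone 13 Pro", "Color": "Blue"}] * 4
query_emb = catalog_embs[2]
query_attrs = {"Brand": "Apple", "Model": "iPhone 13 Pro", "Color": "Blue"}

idx, scores = recall_top_k(query_emb, catalog_embs, k=3)
matches = [int(i) for i in idx if same_product(query_attrs, catalog_attrs[i])]
print("recalled:", idx.tolist(), "verified same product:", matches)
```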


On this basis, we also wanted to build a more accurate identical-product relation based on the attribute dimensions of the graph, which gave birth to the Standard Product Unit (SPU), the standard product node. As can be seen from the figure above, under each fine-grained category we can define the attribute types and values that matter most for the product relation. For example, the Apple iPhone 13 Pro in the figure represents a product node: an Apple iPhone 13 Pro sold by any merchant in any location is the same product. Of course, product nodes can also be defined at a finer granularity. Once such a product node is established, all listings that satisfy its definition can be linked to it, realizing product aggregation at the product granularity.
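
A minimal sketch of SPU aggregation (the per-category key attributes are illustrative): listings are grouped by the key attributes that their category's SPU definition prescribes.

```python
from collections import defaultdict

# Per-category SPU definition: which attribute types identify "the same product".
SPU_KEYS = {"Mobile Phones": ("Brand", "Series", "Model"),
            "T-Shirt": ("Brand", "Material", "Fit")}

def aggregate_spus(listings):
    """Group listings into SPU nodes by their category's key attributes (sketch)."""
    spus = defaultdict(list)
    for item in listings:
        keys = SPU_KEYS[item["category"]]
        spu_id = (item["category"],) + tuple(item["attributes"].get(k) for k in keys)
        spus[spu_id].append(item["item_id"])
    return dict(spus)

listings = [
    {"item_id": "A", "category": "Mobile Phones",
     "attributes": {"Brand": "Apple", "Series": "iPhone", "Model": "13 Pro"}},
    {"item_id": "B", "category": "Mobile Phones",
     "attributes": {"Brand": "Apple", "Series": "iPhone", "Model": "13 Pro"}},
]
print(aggregate_spus(listings))
# {('Mobile Phones', 'Apple', 'iPhone', '13 Pro'): ['A', 'B']}
```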


The advantage of this approach is its interpretability: it is convenient for users and for the platform's internal operations, and aggregations of different granularities can be customized.


The overall framework is shown in the figure above: refining the definitions, classifying products against the definitions, extracting attributes, and aggregating to the product dimension according to the definitions combined with the extracted attributes. By connecting all the modules we can produce SPU data assets. In the end, we produce not only the product nodes but also the links to all listings, and listing information can be transferred to the product dimension, realizing knowledge fusion at the information layer.


So we built a knowledge graph as shown in the figure above, with product nodes, their corresponding category and attribute information, and connections to the various commodity entities.

04

Knowledge Application

Next, we briefly introduce our series of knowledge applications.


Knowledge applications provide a wide range of services: helping operations staff understand the market, screen products, and verify product quality; helping merchants intelligently identify categories, recommend prices, and complete logistics information when listing; and helping consumers with recommendations of cost-effective products and event landing pages, as well as providing various kinds of intelligent support for search and recommendation.

05

Prospects for the Knowledge Graph

Finally, an outlook on future knowledge graph work.


As can be seen from the earlier graph diagram, our commodity graph connects not only products and their attribute and category information; it can be further extended to relations with users, merchants, and higher-dimensional information across the various market platforms, achieving accurate interchange and reasoning between these pieces of information, and supporting a wider range of business applications on top of this completion.


In the current AIGC era, a large number of new technologies have emerged and reshaped people's thinking, and various large language models have been born. With the breakthrough of large models such as ChatGPT, the development of AI has reached a new stage. The success of ChatGPT shows that with enough data and a large enough model, better knowledge reasoning can be achieved. Against this background, what development opportunities and challenges do knowledge graph practitioners and our work face?

For now, the help large models can provide to the graph is not particularly strong and does not meet end-to-end requirements, especially in vertical domains where each company has its own operating model and business standards. As shown in the figure above, in a fine-grained product recognition example the accuracy is only about 50%, which does not yet meet end-to-end commercial requirements, so we still need to build fine-grained sub-models. Moreover, given current compute costs, running large models is not a cost-effective choice, and vertical-domain models still have advantages. However, large models can assist in optimizing vertical-domain models, for example through training-data augmentation and sample generation, which can help vertical-domain models improve quickly.

Under the trend of large models, we also need to think about what role knowledge graphs can play. Current large models still have problems: they may provide non-real-time but plausible-sounding predictions, and their reasoning on more complex logical and mathematical problems still has room for improvement. The knowledge graph has some advantages in reasoning ability, so in the future we can explore whether the structure of the knowledge graph and its existing methodology can be combined with the training methods of large models.

Judging from current applications, New Bing already uses search engines to supplement and enhance GPT-4, which also reduces knowledge errors to a certain extent. For example, for proprietary business knowledge, we could use zero-fine-tuning (prompt-only) techniques and feed knowledge expressed in the knowledge graph as prompts to a GPT-style large model, so that it generates answers better suited to the business scenario. Of course, these are only some preliminary ideas; as our understanding of these models deepens, I believe better ways of combining them will emerge.

That is all for this sharing. Thank you all.


Speaker

INTRODUCTION


Zhang Yichi


Shopee


Listing Team Leader


Zhang Yichi is currently in charge of product algorithms at Shopee Marketplace Intelligence Listing, serving the intelligent identification of products and related business applications in more than ten markets around the world. He graduated from the University of London; his papers have been published at conferences and in journals at home and abroad such as BMVC, EMNLP, WSDM, and CVPR, and he has published a monograph.


OpenKG

OpenKG (Chinese Open Knowledge Graph) aims to promote the openness, interconnection, and crowdsourcing of knowledge graph data with Chinese as the core, and to promote the open-sourcing of knowledge graph algorithms, tools, and platforms.


