How does Xianyu achieve 95%+ accuracy in second-hand attribute extraction?

A first look at the results

Figure 1 - Demo of the second-hand attribute extraction algorithm (personal care & beauty)

Background

Xianyu is a C2X app. From the perspective of product listing, Xianyu listings differ from Taobao listings in the following ways:

  • Lightweight listing leads to sparse product information

    Xianyu uses a lightweight listing mode based on free-form images and text. This makes publishing fast for the user, but it also means listings carry little structured information. For the platform to better understand what a product actually is, algorithms are needed to interpret the pictures and text the user provides.

  • Items carry unique second-hand attributes

    • Unlike the first-hand attributes of new products on Taobao (brand, model, specification parameters, etc.), second-hand attributes describe how much an item has deteriorated or been preserved in the time since it was purchased, for example the item's [number of uses], [purchase channel], and [whether packaging/accessories are complete].

    • Each category also has its own category-specific second-hand attributes. For example, beauty products have a [shelf life], phones have [screen condition] and [repair history], clothing has [whether it has been washed], and so on.

Problems and difficulties

Second-hand attribute extraction is an Information Extraction problem in the NLP field. The usual approach is to decompose it into named entity recognition (NER) tasks and text classification tasks.

The difficulties of the second-hand attribute extraction task are:

  • Different categories and different second-hand attributes/attribute clusters require separate models to be built.

  • With supervised learning (the Bert family), the labeling workload is very heavy and the development cycle becomes very long.

Solution

Methodology

In today's NLP landscape, the Bert family (and the many algorithms derived from the Transformer) remains dominant, topping major NLP leaderboards such as GLUE and CLUE, and information extraction is no exception. The author therefore also uses the Bert family in some scenarios of this solution. However, the author believes that no algorithm is omnipotent across all scenarios; there is only the most suitable algorithm for a given domain and a given scenario. The author's methodology for attribute extraction is summarized as follows:

  • If the sentence pattern is relatively fixed, or constrained by a template (for example, a description that follows the typical time + place + person + event pattern, i.e. "someone did something somewhere"), use NER. Suggested methods: CRF, BiLSTM+CRF, Bert family, Bert family + CRF, etc.

  • If the sentence pattern is not fixed but the domain/scenario keywords are relatively fixed, or there are keyword templates, common aliases, jargon, etc., use text classification:

    • If there are not too many synonyms and synonymous expressions (from a few dozen up to a couple of hundred), and the keywords follow a log-normal/exponential distribution (i.e. a small set of high-frequency keywords covers most cases), the suggested method is regular expressions + rules.

    • If there are many synonyms and synonymous expressions (hundreds to thousands), such as place-name recognition, the suggested method is the Bert family.

  • If neither sentence patterns nor wording are fixed, as in typical sentiment analysis on social comments/chat, the suggested method is the Bert family.

Solution architecture

Figure 2 - Architecture of the second-hand attribute extraction solution

NLP tasks

As mentioned earlier, the various second-hand attribute recognition needs are decomposed into text multi-class classification, multi-label classification, and NER tasks.

  • Text multi-class classification : an "n choose 1" problem, such as judging from the text whether the item ships free (binary classification).

  • Multi-label classification : several "n choose 1" problems answered at the same time, such as judging a phone's screen condition (good/medium/poor) and body condition (good/medium/poor) simultaneously. The usual approach is to share the network layers across labels and sum the per-label loss functions with certain weights. Because the labels are correlated, this sometimes performs better than several separate "n choose 1" models, and since multiple attributes (an attribute cluster) are modeled together, training and inference are also simpler (see the sketch after this list).

  • NER : Named entity recognition.
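As a concrete illustration, here is a minimal PyTorch sketch (not Xianyu's production code) of multi-label classification with a shared encoder and a weighted sum of per-attribute losses. The encoder is assumed to be a Hugging Face-style model that exposes a pooler_output, and the attribute names and class counts are illustrative.

```python
# Minimal sketch: multi-label (multi-task) classification with a shared encoder
# and a weighted sum of per-attribute cross-entropy losses. Attribute names and
# class counts are illustrative assumptions, not the production configuration.
import torch
import torch.nn as nn

class MultiAttributeClassifier(nn.Module):
    def __init__(self, encoder, hidden_size, head_sizes, loss_weights):
        super().__init__()
        self.encoder = encoder                       # shared text encoder (Hugging Face-style)
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_size, n_classes)  # one "n choose 1" head per attribute
            for name, n_classes in head_sizes.items()
        })
        self.loss_weights = loss_weights
        self.ce = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # use the pooled [CLS]-style representation as the shared feature
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        logits = {name: head(pooled) for name, head in self.heads.items()}
        if labels is None:
            return logits
        # weighted sum of the per-attribute losses
        loss = sum(self.loss_weights[name] * self.ce(logits[name], labels[name])
                   for name in self.heads)
        return logits, loss

# e.g. screen condition and body condition, each with good / medium / poor:
# head_sizes   = {"screen_condition": 3, "body_condition": 3}
# loss_weights = {"screen_condition": 1.0, "body_condition": 1.0}
```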

Modeling method

1. Manual labeling stage

Since manual labeling is expensive, the Group's AliNLP is used to assist. The input text is first parsed with AliNLP's e-commerce NER model. Second-hand attributes that belong to the NER task, such as shelf life / warranty period / capacity / number of uses / clothing style, can then be located directly via the relevant part-of-speech or entity keywords and labeled in BIO format; the other second-hand attributes that belong to classification tasks can be labeled on top of the e-commerce NER segmentation results, which improves manual labeling efficiency (a small sketch follows).
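A minimal sketch of the pre-annotation idea: converting entity spans returned by an upstream NER service into character-level BIO tags that annotators only need to correct. AliNLP's real interface is internal, so the span format used here is an assumption.

```python
# Minimal sketch: turn entity spans from an upstream e-commerce NER service into
# character-level BIO pre-annotations for human correction. The span format
# {"start", "end", "type"} is an assumed, simplified interface.
def spans_to_bio(text, spans):
    tags = ["O"] * len(text)
    for span in spans:
        start, end, ent_type = span["start"], span["end"], span["type"]
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return list(zip(list(text), tags))

# Example: "保质期3年" with a shelf-life span over "3年"
print(spans_to_bio("保质期3年", [{"start": 3, "end": 5, "type": "SHELF_LIFE"}]))
# [('保', 'O'), ('质', 'O'), ('期', 'O'), ('3', 'B-SHELF_LIFE'), ('年', 'I-SHELF_LIFE')]
```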

2. Algorithm training stage

This is the core of the solution. Three approaches are used for training:

(1) Albert-Tiny : modeling follows the mainstream pre-train + finetune scheme. Because this model has fast inference, it is used in real-time online scenarios with strict QPS and latency requirements. For NER tasks, a CRF layer can optionally be added at the end of the network.

Albert : Albert stands for "A Lite BERT"; as the name suggests, its advantage is fast training speed.
Albert's source code is largely the same as Bert's, but there are several important differences in the network structure:

  • The word embedding layer is factorized, which greatly reduces the number of embedding parameters. Suppose the vocabulary size is V and the hidden size is H. For Bert, the embedding matrix has V × H parameters; for Albert, the embedding is first projected to a smaller dimension E and then expanded to H, giving V × E + E × H parameters. Since E is much smaller than H, and H is much smaller than V, the number of trainable parameters drops drastically (see the small calculation after this list).

  • Cross-layer parameter sharing: taking albert-base as an example, Albert shares the attention parameters and/or the feed-forward (FFN) parameters across its 12 layers; by default both are shared. In the source code this is easily implemented via the reuse parameter of tensorflow.variable_scope. Parameter sharing further reduces the number of parameters that need to be trained.
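A quick back-of-the-envelope check of the embedding factorization, using illustrative sizes (a Chinese vocabulary of roughly 21,128 tokens, H = 768, E = 128; exact figures vary by model):

```python
# Illustrative parameter count for the factorized embedding (assumed sizes).
V, H, E = 21128, 768, 128
bert_embedding_params = V * H             # ≈ 16.2M parameters
albert_embedding_params = V * E + E * H   # ≈ 2.8M parameters
print(bert_embedding_params, albert_embedding_params)
```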

In addition, Albert has some optimizations on training tasks and training details, which are not listed here.

Albert is divided into different network depths:

  • Albert-Large/xLarge/xxLarge: 24 layers
  • Albert-Base: 12 layers
  • Albert-Small: 6 layers
  • Albert-Tiny: 4 layers

Generally speaking, more layers mean longer training and inference. Since real-time online deployment needs fast inference, this solution selects the smallest model, Albert-Tiny: its text inference speed is roughly 10× that of bert-base, with accuracy largely preserved (data quoted from github/albert_zh [1]). A minimal fine-tuning sketch follows.
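For reference, here is a minimal sketch of the pre-train + finetune pattern for token-level (NER-style) prediction with a small Albert, written against the Hugging Face transformers API. This is not the team's internal code: the checkpoint name is a placeholder, the label set is illustrative, and the optional CRF layer is omitted.

```python
# Minimal sketch: token classification (NER-style) with a small ALBERT checkpoint.
# The checkpoint path is a placeholder; the label count is illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "path/to/albert-tiny-chinese"   # placeholder, not the internal model
num_labels = 5                               # e.g. O, B-SHELF_LIFE, I-SHELF_LIFE, B-USES, I-USES

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=num_labels)

text = "全新未拆封，保质期到2023年"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: [1, seq_len, num_labels]
pred_ids = logits.argmax(dim=-1)[0].tolist() # per-token label ids to decode into BIO tags
```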

(2) StructBert-Base : modeling also follows the mainstream pre-train + finetune scheme. By our measurements it is roughly 1% to 1.5% more accurate than Albert-Tiny on second-hand attribute recognition, and it is used in offline T+1 scenarios. For NER tasks, a CRF layer can optionally be added at the end of the network.

StructBert : Alibaba's self-developed algorithm. Its advantage is high accuracy; it has ranked as high as third on the GLUE leaderboard [2].
Compared with Bert, the StructBert paper's main optimizations lie in two pre-training objectives, as shown in Figure 3:

Figure 3 - StructBert's pre-training objectives (from the StructBert paper)
  • Word Structural Objective: on top of Bert's MLM task, StructBert adds a task that shuffles word order and forces the model to reconstruct the correct order. In the paper, a trigram is randomly selected and scrambled, and the following objective is added as a constraint alongside the MLM loss. StructBert's inspiration may come from a well-known internet meme: "Research shows that the order of Chinese characters does not necessarily affect reading; it turns out the characters were all scrambled, which you only notice after finishing the sentence."

Figure 4 - The Word Structural objective function (from the StructBert paper)
  • Sentence Structural Objective: unlike Bert's NSP task, for a sentence pair (A, B) the model does not predict whether B is the next sentence of A (binary classification), but whether B is the next sentence, the previous sentence, or a randomly drawn sentence (three-way classification). Each of the three cases occurs 1/3 of the time in the training set (a small illustration follows).
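An illustrative sketch (not the paper's code) of how training instances for the two objectives could be constructed: scrambling one random trigram for the Word Structural Objective, and sampling a next / previous / random sentence for the three-way Sentence Structural Objective.

```python
# Illustrative construction of StructBert-style pre-training instances (assumed,
# simplified logic; not the official implementation).
import random

def shuffle_trigram(tokens):
    """Word Structural Objective: scramble one random trigram; the model must
    reconstruct the original order of those three tokens."""
    if len(tokens) < 3:
        return tokens
    i = random.randrange(len(tokens) - 2)
    tri = tokens[i:i + 3]
    random.shuffle(tri)
    return tokens[:i] + tri + tokens[i + 3:]

def sample_sentence_pair(doc_sentences, all_sentences, idx):
    """Sentence Structural Objective: for sentence A, sample B as the next,
    the previous, or a random sentence, each roughly 1/3 of the time."""
    a = doc_sentences[idx]
    r = random.random()
    if r < 1 / 3 and idx + 1 < len(doc_sentences):
        return a, doc_sentences[idx + 1], "next"
    elif r < 2 / 3 and idx > 0:
        return a, doc_sentences[idx - 1], "previous"
    else:
        return a, random.choice(all_sentences), "random"
```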

StructBert was chosen for this solution because the Group has an exclusive pre-trained model (and interface) for this algorithm in the e-commerce domain. It comes in several variants by network depth:

  • StructBert-Base: 12 layers
  • StructBert-Lite: 6 layers
  • StructBert-Tiny: 4 layers

The offline T+1 scenario pursues higher accuracy and has no real-time requirement, so this solution chooses StructBert-Base.

(3) Regular expressions : advantages: the fastest of the three, 10 to 100+ times faster than Albert-Tiny; for many second-hand attributes with relatively fixed sentence patterns and keywords, accuracy is even higher than the two algorithms above; and patterns are easy to maintain. Disadvantages: heavily dependent on business knowledge, domain experience, and data analysis to compile a large number of patterns (an illustrative example follows).
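As an illustration of the regex + rules approach, here is a simplified extractor for two attributes with fairly fixed phrasing ([number of uses] and free shipping). The patterns are toy examples, not the production pattern library.

```python
# Illustrative regex + rules extractor for two second-hand attributes.
# The patterns below are simplified examples only.
import re

USED_TIMES = re.compile(r"(?:只|仅)?(?:使用|用过?)了?\s*(\d+)\s*次")
FREE_SHIPPING = re.compile(r"包邮|免运费|免邮")

def extract(text):
    result = {}
    m = USED_TIMES.search(text)
    if m:
        result["used_times"] = int(m.group(1))
    result["free_shipping"] = bool(FREE_SHIPPING.search(text))
    return result

print(extract("九五新，只用了3次，全国包邮"))
# {'used_times': 3, 'free_shipping': True}
```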

3. Rule revision stage

  • Normalization of recognition results: for NER tasks, many recognized values cannot be used directly and need to be normalized. For example, if a men's clothing size is recognized as "175/88A", it should be automatically mapped to size "L".

  • Some second-hand attributes may conflict with or depend on each other, so after algorithmic recognition the results are revised according to business rules. For example, if a seller claims an item is "brand new" but also states it was "only used 3 times", then "brand new" is automatically downgraded to "not brand new" (99-new or 95-new; different categories grade this slightly differently). A sketch of this stage follows the list.
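A minimal sketch of the rule-revision stage covering both points above: normalizing a recognized clothing size and reconciling a conflicting "brand new" claim. The size mapping table and the downgrade rule are simplified illustrations, not the production business rules.

```python
# Illustrative rule-revision step: normalization plus a conflict rule.
# The mapping table and downgrade logic are simplified assumptions.
SIZE_MAP = {"165/84A": "S", "170/88A": "M", "175/88A": "L", "180/92A": "XL"}

def revise(attrs):
    # normalization: map a raw NER size such as "175/88A" to a standard size code
    raw_size = attrs.get("size")
    if raw_size in SIZE_MAP:
        attrs["size"] = SIZE_MAP[raw_size]
    # conflict rule: "brand new" is downgraded if the item has been used
    if attrs.get("condition") == "brand_new" and attrs.get("used_times", 0) > 0:
        attrs["condition"] = "not_brand_new"   # e.g. 99-new / 95-new, per category
    return attrs

print(revise({"size": "175/88A", "condition": "brand_new", "used_times": 3}))
# {'size': 'L', 'condition': 'not_brand_new', 'used_times': 3}
```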

Algorithm deployment

  • Offline T+1 scenario: deployed via ODPS (now called MaxCompute) + UDF, i.e. the algorithm is written as a Python UDF script and the model files are uploaded to ODPS as resources (see the sketch after this list).

  • Online real-time scenario: the model is deployed in a distributed manner via PAI-EAS, and data interaction is handled through iGraph (a real-time graph database) and TPP.
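For the offline path, here is a minimal sketch of what a MaxCompute (ODPS) Python UDF wrapper could look like. The odps.udf module is only available inside the MaxCompute runtime; the inlined regex is an illustrative placeholder, and a real UDF would load model files uploaded as ODPS resources.

```python
# Minimal sketch of an ODPS (MaxCompute) Python UDF wrapping the extractor.
# The inlined regex stands in for the real model; resource loading is omitted.
import json
import re

from odps.udf import annotate  # available only inside the MaxCompute runtime

@annotate("string->string")
class ExtractSecondHandAttrs(object):
    def __init__(self):
        # a real UDF would load model files uploaded as ODPS resources here,
        # e.g. via odps.distcache.get_cache_file(...)
        self.used_times = re.compile(r"(?:只|仅)?(?:使用|用过?)了?\s*(\d+)\s*次")

    def evaluate(self, description):
        if description is None:
            return None
        m = self.used_times.search(description)
        attrs = {"used_times": int(m.group(1)) if m else None}
        return json.dumps(attrs, ensure_ascii=False)
```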

Algorithm evaluation

For each second-hand attribute of each category, evaluation criteria are established; a sample of data of a certain size is then handed to an outsourced team for manual evaluation. The evaluation compares whether the human judgment and the algorithm's output agree, and reports accuracy, precision, and recall (a small sketch follows).
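A small sketch of the comparison step, computing accuracy, precision, and recall from sampled human judgments and algorithm outputs with scikit-learn (the labels shown are made up for illustration):

```python
# Illustrative evaluation: compare human labels with algorithm outputs on a sample.
from sklearn.metrics import accuracy_score, precision_score, recall_score

human = ["good", "medium", "good", "poor", "good"]   # sampled human judgments (made up)
model = ["good", "medium", "poor", "poor", "good"]   # algorithm outputs (made up)

print("accuracy :", accuracy_score(human, model))
print("precision:", precision_score(human, model, average="macro", zero_division=0))
print("recall   :", recall_score(human, model, average="macro", zero_division=0))
```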

Final results

Accuracy

The recognition results of this solution have been evaluated manually. Every category reaches a very high level (98%+) in both accuracy and precision, with error well below the threshold required for launch, and the solution has gone live and been applied to Xianyu's main product categories.

Demo

Figure 5 - Demo of the second-hand attribute extraction algorithm (mobile phone)

Application scenarios & future prospects

The results of second-hand attribute extraction have been applied in scenarios including:

  • Pricing scenarios

  • Chat scenarios

  • High-quality commodity pool mining

  • Search shopping guide

  • Personalized product recommendation

Future prospects:

  • At present, second-hand attribute extraction covers Xianyu's mainstream categories; as development progresses, the plan is to extend coverage to all categories.

  • At present, second-hand attribute extraction relies mainly on text. Since Xianyu listings are described with both pictures and text, future work can invest in image algorithms to further enrich the structured information of products.

  • Use and analyze items' second-hand attributes to define standards for high-quality items and expand the high-quality item pool.

References

[1] Albert_zh source code: https://github.com/brightmart/albert_zh
[2] GLUE leaderboard: https://gluebenchmark.com/leaderboard
[3] Albert paper: https://arxiv.org/abs/1909.11942
[4] StructBert paper: https://arxiv.org/abs/1908.04577

