[Reprint] Analysis of Search Intent Recognition

[Reprint] Source address: https://weibo.com/ttarticle/p/show?id=2309404000076505691403

  With a search engine, users usually just type what they are looking for into the search box and leave the rest to the engine. Ideally, the engine returns the results the user wants first. In practice the ideal often falls short, and users fail to find the results they want most. That is understandable when the application simply has no matching content. When the content exists and still cannot be found, it is a serious embarrassment. This article mainly discusses how to address that embarrassing case.

Why users cannot find what they want

  1. Different users often express the same need in different ways, and the query a user types frequently fails to express that need clearly and accurately.

  2. The search system is weak at understanding the user's query and cannot uncover the user's real need.

  3. The recalled result set is ranked poorly, so the content the user needs may be ranked too low to be shown.

  These are probably the main reasons users cannot find the content they need. This article focuses on the first two points, that is, how to better understand user needs and recall accurately. Doing so also benefits the ranking involved in the third point.

  Users differ in knowledge and in their ability to express themselves. When different users want to search for the same product, they type different queries, as shown below:

  For the same product there are often many different queries. The relatively precise ones are "Cranberry Capsules Europe" and "blackmore Cranberry"; a brand-first query is "blackMores"; efficacy-first queries are "Ladies Dysmenorrhea" and "Urinary system infection"; an erroneous query is a mistyped "Cranberry"; an alias query is "Holy Berry"; vaguer queries are "Gynecology" and "Inflammation". User input therefore varies in phrasing, vocabulary, and how clearly it expresses the need.

  Solving these problems requires recovering the user's real need from the query. This article calls the understanding of user input QueryParser. It includes query segmentation (word segmentation), query intent recognition, and query rewriting (query expansion, query error correction, query deletion, and so on). The rest of this article focuses on how query intent recognition and query rewriting are applied in Kaola Haitao (考拉海淘) search.

1. Query intent recognition

  This article is mainly about vertical search. Queries in each vertical engine have their own characteristics. For example, the logs of Qunar.com surely contain many queries of the form "air tickets from city A to city B"; on e-commerce sites most queries are combinations of data types such as "product/brand/model/style/price"; in music applications most queries relate to artists and song names. Compared with general search, vertical search can mine user intent in a more targeted way.

1.1 Difficulties in Intent Recognition

  1. Non-standard input. As described above, different users express the same need differently.

  2. Multiple intents. For the query "water", does the user want mineral water or a toner (make-up water)?

  3. Data cold start. When there is little user behavior data, it is hard to infer intent accurately.

  4. No fixed evaluation standard. Quantifiable metrics such as PV, IPV, CTR, and CVR evaluate the search system as a whole; there is no standard quantitative metric for predicting user intent.

1.2  Methods of Intent Recognition

1.2.1  Vocabulary enumeration method

  This is the simplest, most brute-force method. The query intent is obtained by matching the query directly against word lists. Simple query patterns can also be added for categories whose queries are relatively concentrated.

  ·Query word: Germany [addr] Aptamil [brand] milk powder [product] stage 3 [attr]

  ·Query pattern: [brand]+[product]; [product]+[attr]; [brand]+[product]+[attr]

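  Query patterns can of course be made order-independent. This approach is simple to implement and handles high-frequency terms fairly accurately. Queries generally follow the 80/20 rule: 20% of queries account for 80% of search traffic. The 80% of queries in the long tail, however, cannot be handled this way; in other words, the recall of this approach to intent recognition may be only about 20%. It also needs a lot of manual work and is hard to automate.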
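  The vocabulary matching and order-independent patterns described above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the dictionary entries are English glosses, the names `ENTITY_DICT`, `tag_query`, and `matches_pattern` are hypothetical, and ignoring the [addr] tag when comparing against patterns is an assumption.

```python
# Hypothetical entity dictionary; a real system would load curated word lists.
ENTITY_DICT = {
    "germany": "addr",
    "aptamil": "brand",
    "milk powder": "product",
    "stage 3": "attr",
}

# Patterns from the text, stored as sorted tuples so word order does not matter.
PATTERNS = {
    tuple(sorted(p))
    for p in (
        ("brand", "product"),
        ("product", "attr"),
        ("brand", "product", "attr"),
    )
}


def tag_query(query):
    """Greedy longest-match lookup of dictionary entries in the query."""
    tokens = query.lower().split()
    tagged = []
    i = 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try the longest span first
            phrase = " ".join(tokens[i:j])
            if phrase in ENTITY_DICT:
                tagged.append((phrase, ENTITY_DICT[phrase]))
                i = j
                break
        else:
            # No dictionary entry starts here: mark the token as unknown.
            tagged.append((tokens[i], "unknown"))
            i += 1
    return tagged


def matches_pattern(tagged):
    """True if the entity types, ignoring addr (an assumption), form a known pattern."""
    types = tuple(sorted(t for _, t in tagged if t != "addr"))
    return types in PATTERNS
```

  A query containing any word outside the dictionary fails to match, which is exactly the long-tail weakness noted above.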

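1.2.2 Rule parsing method

  This method suits categories whose queries follow rules closely; the query intent is obtained by parsing the query against rules. For example:

  ·"Beijing to Shanghai today flight ticket price" can be converted to [place] to [place] [date] [bus ticket/flight ticket/train ticket] price.

  ·"1 ton equals how many kilograms" can be converted to [number][unit] equals [number][unit].

  Rule-based intent recognition achieves good precision on highly regular queries and extracts accurate information well. However, discovering and writing the rules also takes considerable manual effort.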
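  As a rough illustration of rule parsing, the two example rules can be written as regular expressions with named slots. The slot vocabularies (`PLACE`, `DATE`, `TICKET`), the English phrasing of the queries, and the function name `parse_intent` are assumptions made for this sketch; real rules would draw on gazetteers and a proper date parser.

```python
import re

# Hypothetical slot vocabularies, kept tiny for illustration.
PLACE = r"(?:Beijing|Shanghai|Hangzhou)"
DATE = r"(?:today|tomorrow)"
TICKET = r"(?:bus|flight|train) ticket"

# [place] to [place] [date] [ticket type] price
TICKET_RULE = re.compile(
    rf"^(?P<origin>{PLACE}) to (?P<dest>{PLACE}) (?P<date>{DATE}) "
    rf"(?P<ticket>{TICKET}) price$",
    re.IGNORECASE,
)

# [number][unit] equals how many [unit]
UNIT_RULE = re.compile(
    r"^(?P<amount>\d+(?:\.\d+)?) (?P<src>ton|kilogram|gram)s? "
    r"equals how many (?P<dst>ton|kilogram|gram)s?$",
    re.IGNORECASE,
)


def parse_intent(query):
    """Return the intent name and extracted slots, or None if no rule matches."""
    text = query.strip()
    match = TICKET_RULE.match(text)
    if match:
        return {"intent": "ticket_price", **match.groupdict()}
    match = UNIT_RULE.match(text)
    if match:
        return {"intent": "unit_conversion", **match.groupdict()}
    return None
```

  Each new query shape needs its own hand-written rule, which is where the manual effort mentioned above comes from.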

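1.2.3 Machine learning method

  Intent recognition can be treated as a classification problem: different query intent categories are defined according to the characteristics of the vertical product. The common words under each intent category can be counted; for Kaola Haitao these include category words, product words, brand words, model words, season and time words, promotion words, and so on. For a query entered by the user, a statistical classification model computes the probability of each intent and finally outputs the query's intent. However, the machine learning approach is more complex to implement, mainly because data is hard to obtain and keep up to date, and the labels must be fairly accurate to train a good model.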
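  The article does not say which statistical model is used. As one possible sketch, a multinomial naive Bayes classifier with add-one smoothing turns per-intent word counts into a probability for each intent. The training queries, the intent labels, and all function names below are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled, pre-segmented queries; real data would come from search logs.
TRAINING = [
    (["aptamil", "milk", "powder"], "product_purchase"),
    (["blackmores", "cranberry", "capsules"], "product_purchase"),
    (["fruit", "cereal"], "product_purchase"),
    (["double", "eleven", "coupon"], "promotion"),
    (["milk", "powder", "discount"], "promotion"),
    (["summer", "sale", "coupon"], "promotion"),
]


def train(samples):
    """Count queries per intent and word occurrences per intent."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in samples for word in tokens}
    return class_counts, word_counts, vocab


def predict_proba(tokens, model):
    """Return a probability for each intent using add-one (Laplace) smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(count / total)  # class prior
        for word in tokens:
            score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Normalize log scores into probabilities in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}


def predict(tokens, model):
    """Return the most probable intent."""
    probs = predict_proba(tokens, model)
    return max(probs, key=probs.get)
```

  The quality of such a model depends directly on the labeled data, which is the difficulty the paragraph above points out.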

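2. Application of query intent recognition in Kaola Haitao

  Kaola Haitao is an e-commerce product, and its search intent is currently fairly uniform: buying products. This article mainly discusses the query rewriting, category relevance, named entity recognition, and Term Weight used in Kaola Haitao. Kaola's search system serves a large number of users, and we hope to improve the search experience by analyzing the intent behind user queries. At present the Kaola system architecture contains the parts shown in the figure below: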

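2.1 Entity word recognition

  Based on log analysis, the terms users commonly search for fall into four types: address (澳洲, Australia), brand word (爱他美, Aptamil), product word (奶粉, milk powder), and attribute word (三段, stage 3). When a user enters a query, if each entity word can be recognized accurately, it can be matched exactly against the corresponding field in the index, which improves recall precision; entity words can also be used to optimize ranking. An example: a product has the title "AYAM BRAND 雄鸡标辣椒金枪鱼" (Ayam Brand chili tuna), and its category is "冷面/熟食/方便菜其他熟食" (cold noodles / cooked food / other ready-to-eat dishes). When a user searches for "辣鸡面" (spicy chicken noodles), this product is recalled through single-character matching logic. Entity recognition shows that the product word of this item is "金枪鱼" (tuna), while the product word in the query is "面" (noodles). It can therefore be judged a false recall, and the item can be filtered out or placed lower in the ranking.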
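  The false-recall check in this example can be sketched as a comparison of product words. The sketch assumes entity recognition has already run on both the query and the item titles; the item records, the English glosses, and the name `rank_with_entity_check` are hypothetical.

```python
# Hypothetical recalled items with their recognized entities (English glosses).
RECALLED = [
    {"title": "Ayam Brand chili tuna", "entities": {"brand": "ayam brand", "product": "tuna"}},
    {"title": "Spicy chicken instant noodles", "entities": {"product": "noodles"}},
]


def rank_with_entity_check(query_entities, items):
    """Move items whose product word conflicts with the query's product word to the end."""
    wanted = query_entities.get("product")

    def is_mismatch(item):
        have = item["entities"].get("product")
        return wanted is not None and have is not None and wanted != have

    # sorted() is stable, so the original order is kept within each group.
    return sorted(items, key=is_mismatch)
```

  Demotion is shown here; filtering the mismatched items out entirely, the other option the text mentions, would be a one-line change.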

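  Our entity word recognition model is trained with a CRF. The corpus consists of real user search queries, annotated with a relatively accurate dictionary (brand words, product words, attribute words, address words). The annotated corpus looks like this: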

  ·爱B-brand 他I-brand 美I-brand 奶B-product 粉I-product 三B-attr 段I-attr

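  The trained model recognizes addresses, brand words, and product words with an average accuracy of about 95%; accuracy on English attribute words still needs improvement. Another advantage of the CRF model is that it generalizes to some extent. In addition, the model is trained on product data from the Kaola platform, so its accuracy on products and brands outside the Kaola platform is not ideal. What matters most, though, is recognizing the entities already on the platform and showing users the most accurate product search results possible.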
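  The dictionary-based annotation step that produces character-level BIO labels for CRF training might look like the sketch below. The dictionary contents and the name `bio_annotate` are hypothetical, and labeling unmatched characters as "O" follows the usual BIO convention rather than anything stated in the article.

```python
# Hypothetical annotation dictionary mapping entity words to entity types.
DICTIONARY = {"爱他美": "brand", "奶粉": "product", "三段": "attr", "澳洲": "addr"}
MAX_LEN = max(len(word) for word in DICTIONARY)


def bio_annotate(query):
    """Label each character with B-/I-/O tags using forward longest match."""
    tags = []
    i = 0
    while i < len(query):
        for n in range(min(MAX_LEN, len(query) - i), 0, -1):
            word = query[i:i + n]
            if word in DICTIONARY:
                etype = DICTIONARY[word]
                tags.append((word[0], f"B-{etype}"))  # first character begins the entity
                tags.extend((ch, f"I-{etype}") for ch in word[1:])
                i += n
                break
        else:
            # Character is not covered by any dictionary word.
            tags.append((query[i], "O"))
            i += 1
    return tags
```

  The resulting (character, tag) pairs are the kind of sequence a CRF toolkit takes as training input; feature templates and training itself are outside this sketch.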

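2.2 Query rewriting

  Query rewriting includes query error correction, query expansion, query deletion, and query transformation. This article mainly discusses the three commonly used in Kaola: query expansion, query deletion, and query transformation.

2.2.1 Query expansion

  Search recall depends on index data, and product data depends on entries made by editors and operations staff. Completeness of the data is hard to guarantee; in other words, a product is rarely described from every angle.

  Again an example: a product has the title "Fisher-Price 费雪碎花儿童学步鞋" (Fisher-Price floral children's toddler shoes). Because user input varies, some users search for "婴儿鞋" (infant shoes) or "宝宝鞋" (baby shoes). These toddler shoes are clearly just what the user needs, yet they cannot be recalled because the data is incomplete. This is the situation mentioned earlier, where the product exists but cannot be shown to the user, and it is the one we least want to encounter. This is where query expansion comes in: we maintain a synonym expansion table, and when a user enters a query we expand it with synonyms so as to recall as many products relevant to the user as possible.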
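  A minimal sketch of synonym-table expansion follows. The table contents are English glosses of the example, and `SYNONYMS`, `expand_query`, and `recall` are hypothetical names; substring matching stands in for a real index lookup.

```python
# Hypothetical synonym expansion table (English glosses of the example).
SYNONYMS = {
    "baby shoes": ["toddler shoes", "infant shoes"],
    "infant shoes": ["toddler shoes", "baby shoes"],
}


def expand_query(query):
    """Return the original query followed by its synonyms, without duplicates."""
    normalized = query.strip().lower()
    variants = [normalized]
    for synonym in SYNONYMS.get(normalized, []):
        if synonym not in variants:
            variants.append(synonym)
    return variants


def recall(query, titles):
    """Recall titles that contain the query or any of its synonyms."""
    variants = expand_query(query)
    return [t for t in titles if any(v in t.lower() for v in variants)]
```

  With expansion, a search for "baby shoes" can reach a title that only says "toddler shoes", which a literal match would miss.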

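2.2.2 Query deletion

  Query deletion typically applies when the user's query contains so many terms that nothing can be recalled normally. Dropping terms trims the query so that the products most relevant to it are recalled.

  Another example: when the user's query is "卡乐比水果麦片" (Calbee fruit cereal), the product may have been taken off the shelves, or there may be few such products. Query deletion can rewrite the original query to "水果麦片" (fruit cereal), which then recalls fruit cereal from other brands. Query deletion relies on entity recognition, because it must decide which parts of the query can be deleted with the least impact on the user's original intent. For "卡乐比水果麦片", intent recognition shows that "卡乐比" (Calbee) is the brand and "水果麦片" (fruit cereal) is the product; clearly the user wants fruit cereal more than other kinds of Calbee cereal.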
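  Entity-aware term dropping could be sketched as follows. The article only establishes that the brand is dropped before the product; the fuller order in `DROP_ORDER`, the `search` callback, and the name `relax_query` are assumptions made for illustration.

```python
# Assumed drop order, least important entity type first; product words are never dropped.
DROP_ORDER = ["addr", "attr", "brand"]


def relax_query(tagged_terms, search, min_results=1):
    """Drop terms one at a time until `search` returns enough results.

    `tagged_terms` is a list of (term, entity_type) pairs and `search` is any
    callable that maps a query string to a list of results.
    """
    terms = list(tagged_terms)
    while True:
        query = " ".join(term for term, _ in terms)
        results = search(query)
        if len(results) >= min_results:
            return query, results
        for etype in DROP_ORDER:
            idx = next((i for i, (_, t) in enumerate(terms) if t == etype), None)
            if idx is not None:
                del terms[idx]  # remove the least important remaining term
                break
        else:
            # Nothing droppable is left; give up with the current results.
            return query, results
```

  Keeping the product word fixed reflects the point above that the user wants fruit cereal more than anything else from the same brand.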

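2.2.3 Query transformation

  Sometimes there really is no product that satisfies the user's explicit need. For example, a user searches for "祖马龙" (Jo Malone), and Kaola Haitao does not carry it. Neither synonym expansion nor query deletion can help with the original query. Session data shows, however, that searches for "祖马龙" tend to be accompanied by the query "香水" (perfume), so user behavior data can be mined to find that the two queries "祖马龙" and "香水" are related. When a user searches for "祖马龙" and nothing can be recalled, the query can be transformed to "香水" to satisfy the user's need as far as possible.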
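  One simple way to mine related queries from sessions is to count which query most often directly follows another. The session data, the `min_count` threshold, and the function names are hypothetical; the article does not describe its actual mining method.

```python
from collections import Counter, defaultdict


def mine_follow_ups(sessions):
    """Count, for each query, the queries that directly follow it in a session."""
    follow = defaultdict(Counter)
    for queries in sessions:
        for prev, nxt in zip(queries, queries[1:]):
            if prev != nxt:
                follow[prev][nxt] += 1
    return follow


def transform_query(query, follow, min_count=2):
    """Return the most frequent follow-up query, or None if support is too low."""
    if query not in follow:
        return None
    candidate, count = follow[query].most_common(1)[0]
    return candidate if count >= min_count else None
```

  The transformed query is intended only as a fallback when the original query recalls nothing.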

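2.3 Category relevance

  When a user searches for "Adidas", do they want "sneakers", "clothing", or "body wash"? Different users may of course have different needs, but that is the territory of personalized search and is outside the scope of this article for now. If there is enough user behavior data, statistical analysis alone can determine how relevant each category is to a query. (Statistical algorithms are, of course, also a kind of machine learning.) Still, some problems need machine learning algorithms to solve.

  Mining user behavior data shows that the category relevance order for "Adidas" is: sneakers > clothing > body wash. When a user searches for "Adidas", results follow this order, with sneakers ranked first. For the sake of diversity, ranking also applies category shuffling so that clothing and body wash are mixed in among the sneakers in moderation.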
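  The statistics and the category shuffling can be sketched together. Relevance here is simply the share of clicks per category for a query, and diversity is a round-robin in which the top category contributes more items per round. The click log format, the `lead` parameter, and the function names are assumptions.

```python
from collections import Counter


def category_relevance(click_log, query):
    """Share of clicks per category for a query; click_log holds (query, category) pairs."""
    counts = Counter(cat for q, cat in click_log if q == query)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def rank_with_diversity(items, relevance, lead=2):
    """Interleave categories by relevance; items are (title, category) pairs.

    Each round takes `lead` items from the most relevant category and one item
    from every other category, so lower-ranked categories still surface early.
    """
    cats = sorted({c for _, c in items}, key=lambda c: -relevance.get(c, 0.0))
    queues = {c: [t for t, cat in items if cat == c] for c in cats}
    ranked = []
    while any(queues.values()):
        for pos, cat in enumerate(cats):
            take = lead if pos == 0 else 1
            ranked.extend(queues[cat][:take])
            queues[cat] = queues[cat][take:]
    return ranked
```

  How many items the leading category takes per round is a tuning choice; the article only says the other categories are mixed in "in moderation".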

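  Query-category relevance is mined from user behavior data, so some long-tail categories that are relevant to the query never get mined because of the Matthew effect. For example, the relevant categories mined for the query "面膜" (facial mask) are "男士面膜" (men's masks), "女士面膜" (women's masks), "面膜粉" (mask powder), and so on, while the category "孕妇面膜" (masks for pregnant women) stays marked as irrelevant. In fact all four are relevant along the "facial mask" dimension. We address this long-tail problem with virtual categories: offline, these four categories are merged into one virtual category, and when a user's query hits most of the categories in a virtual category, the query is considered relevant to the remaining categories in it as well.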
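  The virtual category rule can be sketched as a set operation. Reading "most of the categories" as "more than half" is an assumption expressed by the `threshold` parameter, and the category names are English glosses.

```python
# Hypothetical offline mapping from a virtual category to its member categories.
VIRTUAL_CATEGORIES = {
    "facial masks": {
        "men's masks",
        "women's masks",
        "mask powder",
        "masks for pregnant women",
    },
}


def expand_with_virtual_categories(mined, virtual=VIRTUAL_CATEGORIES, threshold=0.5):
    """Add the rest of a virtual category when the mined categories cover most of it."""
    mined = set(mined)
    expanded = set(mined)
    for members in virtual.values():
        coverage = len(members & mined) / len(members)
        if coverage > threshold:
            expanded |= members
    return expanded
```

  A query whose mined categories cover three of the four members would gain the fourth, long-tail member as well.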

2.4 Term Weight

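  The first step of Chinese natural language processing is word segmentation, and the words in the segmentation result clearly differ in importance. Term Weight assigns these words different scores; the scores identify the core word, which can then be used in various scenarios. For example, a product has the title "碗装保温饭盒套装" (bowl-style insulated lunch box set), and Term Weight yields "饭盒" (lunch box) as the core word. When a user searches for "碗" (bowl) and this product is recalled, it can be demoted in the ranking according to term weight.

  As the points above show, query intent recognition is indispensable in a search system; it is fair to say that the precision of query intent recognition determines the quality of a search.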
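  The article does not say how term weights are computed, so the sketch below takes them as given and only shows how they could be used to find the core word and demote weak matches. The weights, the English glosses, and the function names are all hypothetical.

```python
# Hypothetical term weights for the title "bowl-style insulated lunch box set".
TITLE_WEIGHTS = {"bowl": 0.1, "insulated": 0.2, "lunch box": 0.6, "set": 0.1}


def core_term(weights):
    """The term with the highest weight is treated as the core word."""
    return max(weights, key=weights.get)


def match_score(query_term, weights):
    """Normalized weight of the matched term; low scores can be demoted in ranking."""
    total = sum(weights.values())
    return weights.get(query_term, 0.0) / total if total else 0.0
```

  A search for "bowl" matches only a low-weight term of this title, so the item would score well below one whose core word is "bowl".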

 

