Relevance ranking and search engine design for hundreds of millions of papers (translation)

Sergey Feldman is a senior applied research scientist at AI2 in Seattle, focusing on natural language processing and machine learning.

Original address: Building a Better Search Engine for Semantic Scholar | by Sergey Feldman | AI2 Blog (allenai.org) — https://blog.allenai.org/building-a-better-search-engine-for-semantic-scholar-ea23a0b661e7

2020 has been the year of search at Semantic Scholar, a free, AI-powered research tool for scientific literature from the Allen Institute for AI (AI2). One of our biggest efforts this year has been improving search relevance, and at the start of the year I was tasked with figuring out how to use about three years of search log data to build a better search ranker.

Ultimately, we shipped a search engine that provides noticeably more relevant results to our users, but at first I underestimated the complexity of getting machine learning to work well for search. "No problem," I thought to myself, "I can just do the following and be done in three weeks":

  1. Get all search logs.
  2. Do some feature engineering.
  3. Train, validate, and test great machine learning models.
  4. Deploy.

Although this is established practice in the search engine literature, much of the experience and insight that comes from actually making search engines work well is not published, for competitive reasons. Because AI2 is focused on AI for the common good, we make much of our technology and research open and free to use. In that spirit, in this article I'll explain why the above process is not as simple as I had hoped, and detail the following problems and their solutions:

  1. Data is definitely dirty and needs to be carefully understood and filtered.
  2. Many features improve performance during model development but lead to strange and unwanted behavior when used in practice.
  3. Training a model is fine, but choosing the right hyperparameters isn't as simple as optimizing nDCG on a held-out test set.
  4. Well-trained models can still make weird mistakes that require post-hoc correction to fix them.
  5. Elasticsearch is complex and difficult to get right.

In addition to this blog post, and in the spirit of openness, we are releasing the complete Semantic Scholar search reranking model currently running at www.semanticscholar.org , along with all the artifacts you need to do the reranking yourself. Check it out here:  GitHub - allenai/s2search: The Semantic Scholar Search Reranker

Search Ranker Overview

First, a brief overview of Semantic Scholar's search architecture at a high level. When you search on Semantic Scholar, the following steps are performed:

  1. Your search query goes to Elasticsearch (we index ~190M papers).
  2. The top results (we currently use 1000) are re-ranked by a machine learning ranker.

We have recently made improvements to both (1) and (2), but this blog post is mainly about the work on (2). The model we use is a LightGBM ranker with a LambdaRank objective. It is very fast to train, fast to evaluate, and easy to deploy at scale. It's true that deep learning has the potential to deliver better performance, but model finickiness, slow training (compared to LightGBM), and slower inference are all disadvantages.

The data has to be structured as follows. Given a query q, an ordered list of results R = [r_1, r_2, ..., r_M], and the number of clicks per result C = [c_1, c_2, ..., c_M], we feed the following input/output pairs into LightGBM as training data:

f(q, r_1), c_1

f(q, r_2), c_2

...

f(q, r_M), c_M

where f is the featurization function. With M rows per query, LightGBM optimizes a model such that if c_i > c_j, then model(f(q, r_i)) > model(f(q, r_j)) for as much of the training data as possible.

One technical point here is that you need to correct for position bias by weighting each training sample by the inverse propensity score of its position. We computed the propensity scores by running a random position-swapping experiment on our search engine results pages.
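To make this concrete, here is a minimal sketch of training such a ranker with LightGBM's scikit-learn API. Everything below (the random feature matrix, click labels, group sizes, and propensity weights) is made-up placeholder data for illustration; the real featurization and weighting live in the s2search repository.

```python
import numpy as np
import lightgbm as lgb

# Toy stand-ins for f(q, r_i), c_i, and the inverse-propensity weights.
# In reality X comes from the featurization function f described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 22))                    # 22 features per (query, result) pair
clicks = rng.integers(0, 2, size=3000)             # click counts used as relevance labels
groups = [10] * 300                                # 300 queries with 10 results each
weights = 1.0 / rng.uniform(0.2, 1.0, size=3000)   # inverse propensity of each position

ranker = lgb.LGBMRanker(
    objective="lambdarank",   # LambdaRank objective, as used for the S2 reranker
    n_estimators=500,
    learning_rate=0.05,
)
ranker.fit(X, clicks, group=groups, sample_weight=weights)

# At query time: featurize the candidates for one query and sort by model score.
scores = ranker.predict(X[:10])
reranked = np.argsort(-scores)
```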

Feature engineering and hyperparameter optimization are key components in making this possible. We'll come back to these later, but first I'll discuss training data and its difficulties.

More data, more problems

Machine Learning Wisdom 101 says "more data is better," but this is an oversimplification. The data has to be relevant, and removing irrelevant data helps. We ended up needing to remove about a third of our data because it didn't satisfy a heuristic "does it make sense" filter.

What does that mean? Say the query is aerosol and surface stability of SARS-CoV-2 compared to SARS-CoV-1, and the search engine results page (SERP) returns these papers:

  1. Aerosol and surface stability of SARS-CoV-2 compared to SARS-CoV-1
  2. Proximal origin of SARS-CoV-2
  3. SARS-CoV-2 viral load in upper respiratory tract specimens from infected patients
  4. ...

We would expect the click to be on position (1), but in this hypothetical data it is actually on position (2): the user clicked on a paper that doesn't exactly match their query. There are legitimate reasons for this behavior (e.g. the user has already read the paper and/or wants to find related work), but to a machine learning model this behavior looks like noise, unless we have features that allow it to correctly infer the underlying cause of the behavior (e.g. features based on what the user clicked in previous searches). Our current architecture does not personalize search results based on a user's history, so this kind of training data just makes learning harder. Of course, there's a trade-off between data size and noise: you can have more, noisier data or less, cleaner data, and it was the latter that proved more useful for this problem.

Another example: let's say a user searches for deep learning, and the search engine results page returns papers with these years and citation counts:

  1. Year = 1990, Citations = 15000
  2. Year = 2000, Citations = 10000
  3. Year = 2015, Citations = 5000

Now the click is on position (2). For the sake of argument, assume that all three papers are equally "about" deep learning; that is, they have the phrase deep learning in the title/abstract/venue the same number of times. Topicality aside, we believe that the importance of an academic paper is determined by a combination of recency and citation count, and here the user clicked neither on the most recent paper nor on the most cited one. This is a bit of a straw-man example; for instance, if paper (3) had zero citations, many readers might reasonably prefer paper (2) to come first. Nonetheless, using the above two examples as a guide, the filter used to remove "meaningless" data checks the following conditions for a given triple (q, R, C):

  1. Do all clicked papers get more citations than non-clicked papers?
  2. Are all clicked papers newer than unclicked papers?
  3. Are all clicked papers a better textual match to the query in the title?
  4. Do all clicked papers more closely match the query in the author field textually?
  5. Are all clicked papers a better textual match to the query in the Venue field?

I required that an acceptable training example satisfy at least one of these five conditions. A condition is satisfied when the values (citation count, recency, match score) of all clicked papers are higher than the maximum value among the unclicked papers. You may notice that the abstract is not in the list above; including or excluding it made no real difference.
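Sketching what such a filter might look like (this is my own simplified reconstruction for illustration, not the production filter; the field names and match scores are hypothetical):

```python
def makes_sense(clicked, unclicked):
    """Keep a (query, results, clicks) example only if the clicked papers
    beat ALL unclicked papers on at least one of the five signals."""
    signals = [
        "n_citations",    # (1) more cited
        "year",           # (2) more recent
        "title_match",    # (3) better title match
        "author_match",   # (4) better author match
        "venue_match",    # (5) better venue match
    ]
    for s in signals:
        worst_clicked = min(p[s] for p in clicked)
        best_unclicked = max(p[s] for p in unclicked)
        if worst_clicked > best_unclicked:
            return True
    return False

# Example: the clicked paper is newer and a better title match than everything unclicked.
clicked = [{"n_citations": 50, "year": 2019, "title_match": 0.9,
            "author_match": 0.0, "venue_match": 0.0}]
unclicked = [{"n_citations": 800, "year": 2015, "title_match": 0.4,
              "author_match": 0.0, "venue_match": 0.0}]
assert makes_sense(clicked, unclicked)  # passes via the recency and title conditions
```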

As mentioned above, this filter removes approximately one third of all (query, result set) pairs and provides approximately a 10% to 15% improvement in our final evaluation metrics, which I'll describe in more detail in a later section. Note that this filtering is applied after suspected bot traffic has already been removed.

Feature engineering challenges

We generated a feature vector for each (query, result) pair, with 22 features in total. The first version of the featurizer produced 90 features, but most of them were useless or harmful, re-teaching the hard-won lesson that machine learning algorithms often work better when you do some of the work for them.

The most important features involve finding the longest subsets of the query text within the title, abstract, venue, and year fields of a paper. To do this, we generate all possible ngrams of length up to 7 from the query and perform a regular-expression search in each field of the paper. Once we have the matches, we can compute a variety of features. Below is the final list of features, grouped by paper field.

  • title_fraction_of_query_matched_in_text
  • title_mean_of_log_probs
  • title_sum_of_log_probs*match_lens
  • abstract_fraction_of_query_matched_in_text
  • abstract_mean_of_log_probs
  • abstract_sum_of_log_probs*match_lens
  • abstract_is_available
  • venue_fraction_of_query_matched_in_text
  • venue_mean_of_log_probs
  • venue_sum_of_log_probs*match_lens
  • sum_matched_authors_len_divided_by_query_len
  • max_matched_authors_len_divided_by_query_len
  • author_match_distance_from_ends
  • paper_year_is_in_query
  • paper_oldness
  • paper_n_citations
  • paper_n_key_citations
  • paper_n_citations_divided_by_oldness
  • fraction_of_unquoted_query_matched_across_all_fields
  • sum_log_prob_of_unquoted_unmatched_unigrams
  • fraction_of_quoted_query_matched_across_all_fields
  • sum_log_prob_of_quoted_unmatched_unigrams

Some of these features require further explanation; see the appendix at the end of this article for more details. If you want the gory details, all of the featurization happens here.
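To give a flavor of how the matching-based features can be computed, here is a simplified sketch of the ngram matching idea for a single field. The helper below is my own illustration of a feature like title_fraction_of_query_matched_in_text; the real featurizer in the s2search repository handles many more edge cases.

```python
import re

def query_ngrams(query, max_len=7):
    """All word ngrams of the query up to length 7, longest first."""
    words = query.lower().split()
    ngrams = []
    for n in range(min(max_len, len(words)), 0, -1):
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams

def fraction_of_query_matched_in_text(query, field_text):
    """Greedy sketch of *_fraction_of_query_matched_in_text for one field."""
    text = field_text.lower()
    matched_words = 0
    remaining = query.lower()
    for ngram in query_ngrams(query):
        if ngram in remaining and re.search(re.escape(ngram), text):
            matched_words += len(ngram.split())
            remaining = remaining.replace(ngram, " ", 1)  # don't count words twice
    return matched_words / max(len(query.split()), 1)

print(fraction_of_query_matched_in_text(
    "deep learning for sentiment analysis",
    "A survey of deep learning methods for sentiment analysis of tweets"))
# -> 1.0 ("for sentiment analysis" and "deep learning" together cover all five query words)
```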

To understand the importance of all these features, below is a plot of SHAP values for the model currently running in production.

If you haven't seen a SHAP plot before, it's a little tricky to read. The SHAP value for sample i and feature j is a number that roughly tells you "for this sample i, how much does this feature contribute to the final model score." For our ranking model, a higher score means the paper should be ranked closer to the top. Each point on the SHAP plot is a particular (query, result) click-pair sample. The colors correspond to the feature's value in the original feature space. For example, we see that the title_fraction_of_query_matched_in_text feature is at the top, meaning it has the largest sum of absolute SHAP values. It goes from blue on the left (low feature values close to 0) to red on the right (high feature values close to 1), which means the model has learned a roughly linear relationship between how much of the query is matched in the title and the ranking of the paper. As one would expect, more is better.
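If you want to reproduce this kind of plot for your own LightGBM model, the shap package makes it a few lines. The variables `ranker`, `X`, and `feature_names` below are assumed to be a trained LGBMRanker, its feature matrix, and the feature names; they are placeholders, not part of the released code.

```python
import shap

# TreeExplainer works directly on LightGBM models (including the sklearn wrappers).
explainer = shap.TreeExplainer(ranker)
shap_values = explainer.shap_values(X)

# Beeswarm summary plot: one dot per (query, result) sample per feature,
# positioned by its SHAP value and colored by the feature's original value.
shap.summary_plot(shap_values, X, feature_names=feature_names)
```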

Some other observations:

  • Many of the relationships look monotonic, and that's because they roughly are: LightGBM lets you specify univariate monotonicity for each feature, meaning that, with all other features held constant, the output score must rise (or fall, whichever you specify) monotonically as that feature increases; see the sketch after this list.
  • Knowing both how much of the query was matched and the log probabilities of the matches is important; the two are not redundant.
  • The model learns that more recent papers are better than older papers, even though this feature does not have a monotonicity constraint (the only feature that does not have such a constraint). As one would expect, academic search users like recent papers!
  • When the color is gray, it means the feature is missing - LightGBM can handle missing features natively, which is a big benefit.
  • The venue features look unimportant, but that's only because a small fraction of searches are venue-oriented. They should not be removed.
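For reference, here is a small sketch of how per-feature monotonicity can be specified in LightGBM. The four features and constraint directions below are chosen for illustration only and are not the production configuration.

```python
import lightgbm as lgb

# Hypothetical 4-feature example: a better title match and more citations should only
# help, a larger author_match_distance_from_ends should only hurt, and paper_oldness
# is left unconstrained so the model can learn the recency preference on its own.
feature_names = [
    "title_fraction_of_query_matched_in_text",
    "paper_n_citations",
    "author_match_distance_from_ends",
    "paper_oldness",
]
monotone_constraints = [1, 1, -1, 0]   # 1 = increasing, -1 = decreasing, 0 = unconstrained

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    monotone_constraints=monotone_constraints,
)
# ranker.fit(X, clicks, group=groups)  # as in the earlier training sketch
```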

As you might expect, there are many little details about these features that are important to get right. Going into them here is beyond the scope of this blog post, but if you've ever done feature engineering you'll know the drill:

  1. Design/tweak features.
  2. Train the model.
  3. Perform error analysis.
  4. Notice strange behaviors you don't like.
  5. Go back to (1) and adjust.
  6. Repeat.

These days it's more common to do this same loop, except replacing (1) with "design/tweak the neural network architecture" and adding "see if the model trains at all" as an extra step between (1) and (2).

Evaluation problems

Another unassailable dogma of machine learning is the train/validation (development)/test split. It's very important, it's easy to get wrong, and it has complex variants (one of my favorite topics). The basic statement of the idea is:

  1. Use training data for training.
  2. Select model variants (including hyperparameters) using validation/development data.
  3. Estimate generalization performance on the test set.
  4. Never use the test set for anything else.

This is important, but often impractical outside of academic publications because the test data available to you does not reflect "real" production test data very well. This is especially true when you want to train a search model.

To understand why, let's compare/contrast the training data with "real" test data. Training data is collected as follows:

  1. User issues query.
  2. Some existing systems (Elasticsearch + existing reranker) return the first page of results.
  3. The user views the results from top to bottom (possibly). They may click on some results. They may or may not see every result on this page. Some users will go to the second page of results, but most won't.

Therefore, the training data for each query has 10, 20, or 30 results. During production, on the other hand, the model must rerank the top 1000 results retrieved by Elasticsearch. To reiterate: the training data consists of only the first handful of documents selected by the existing reranker, while the test data is 1000 documents selected by Elasticsearch. The naive approach here is to take the search log data, split it into training, validation, and test sets, and then go about the business of engineering a good set of features and hyperparameters. However, there is no good reason to think that doing well on held-out training-like data means you will do well on the "real" task, because the two are substantially different. More specifically, if we build a model that is good at reranking the top 10 results from the previous reranker, that does not mean the model will be good at reranking 1000 results from Elasticsearch. The bottom 900 candidates were never part of the training data and may not look anything like the top 100, so reranking all 1000 candidates is simply not the same task as reranking the top 10 or 20.

And indeed this was a problem in practice. The first model pipeline I put together used held-out nDCG for model selection, and the "best" model it produced made bizarre errors and was unusable. Qualitatively, there didn't seem to be much difference between a "good" nDCG model and a "bad" nDCG model: both were bad. We needed another evaluation set that was closer to the production setting, and a big thank-you to AI2 CEO Oren Etzioni for suggesting the essence of the idea I'll describe next.

Counterintuitively, the evaluation set we ended up using was not based on user clicks at all. Instead, we randomly sampled 250 queries from real user queries and decomposed each query into its component parts. For example, if the query is  soderland etzioni emnlp open ie information extraction 2011 , its components are:

  • Authors: etzioni, soderland
  • Venue: emnlp
  • Year: 2011
  • Text: open ie information extraction

This decomposition was done manually. We then issued the query to the previous Semantic Scholar search (S2), Google Scholar (GS), Microsoft Academic Graph (MAG), etc., and checked how many of the top results satisfy all components of the search (author, venue, year, and text matches). Say, in this example, that S2 has 2 results, GS has 2 results, and MAG has 3 results that satisfy all the components. We take 3 (the largest of these) and require that the top 3 results for this query must satisfy all of its component conditions listed above. Here is an example of a paper that satisfies all the components for this query: it was published at EMNLP in 2011 by Etzioni and Soderland and contains the exact ngrams "open IE" and "information extraction".

In addition to the author/venue/year/text components above, we also checked citation ordering (most to least) and recency ordering (most recent to oldest). To get a "pass" for a particular query, the top results from the reranker model must match all components (as in the example above) and respect either citation ordering or recency ordering; otherwise the model fails that query. A more fine-grained evaluation is possible here, but the all-or-nothing approach worked.
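A stripped-down version of this pass/fail check might look like the following. This is my own reconstruction for illustration (the `component_matches` helper and field names are hypothetical); the real evaluation code ships with the s2search repository.

```python
def component_matches(paper, component):
    """Hypothetical helper: does this paper satisfy one annotated query component?"""
    field, value = component          # e.g. ("venue", "emnlp") or ("year", 2011)
    return str(value).lower() in str(paper.get(field, "")).lower()

def query_passes(top_results, components, n_required):
    """All of the first n_required results must satisfy every annotated component,
    and they must be in citation order or recency order."""
    top = top_results[:n_required]

    for paper in top:
        if not all(component_matches(paper, c) for c in components):
            return False  # some component (author, venue, year, text) not satisfied

    citations = [p["n_citations"] for p in top]
    years = [p["year"] for p in top]
    citation_order = citations == sorted(citations, reverse=True)
    recency_order = years == sorted(years, reverse=True)
    return citation_order or recency_order

# The overall score is the fraction of the 250 annotated queries whose top results pass.
```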

The process was not fast (2–3 days of work for two people), but at the end we had the 250 queries broken down into their components, a target number of results for each query, and code to evaluate what fraction of the 250 queries any proposed model satisfies.

Hill climbing on this metric has proven to be more productive for two reasons:

  1. It correlated better with user-perceived search engine quality.
  2. Each "failure" comes with an explanation of which component was not satisfied, e.g. the authors were not matched, or citation/recency ordering was not respected.

Once we had this evaluation metric, hyperparameter optimization made sense and feature engineering got significantly faster. When I started model development the metric was around 0.7, and the final model scored 0.93 on this particular set of 250 queries. The particular selection of 250 queries did not seem to make much difference to the metric, but my hunch is that if we continue model development, a fresh set of 250 queries could improve the model further.

Post hoc corrections

Even the best models sometimes make seemingly stupid ranking choices, because that's the nature of machine learning models. Many of these errors can be fixed with simple rule-based post hoc corrections. Here is a partial list of the post hoc corrections applied to the model scores:

  1. Quoted matches are ranked above unquoted matches, and more quoted matches above fewer quoted matches.
  2. Exact year matches will be moved to the top.
  3. For queries with an author's full name (such as  Isabel Cachola ), results for that author will be moved to the top.
  4. Results that match all of the unigrams in the query are moved to the top.

You can see the post hoc corrections in the code here.
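As an illustration of what a rule-based correction can look like, here is a simplified sketch. The bonus values and helpers are invented for this example and are not the actual rules in the linked code.

```python
import re

def extract_year(query):
    """Hypothetical helper: pull a 4-digit year out of the query, if present."""
    m = re.search(r"\b(19|20)\d{2}\b", query)
    return int(m.group()) if m else None

def apply_posthoc_corrections(query, results, scores, author_exact_match):
    """Nudge raw reranker scores with simple rules. `results` are candidate papers,
    `scores` the model outputs, and `author_exact_match` a parallel list of booleans
    saying whether the paper's author list matches a full name in the query."""
    corrected = list(scores)
    query_year = extract_year(query)
    for i, paper in enumerate(results):
        if query_year is not None and paper.get("year") == query_year:
            corrected[i] += 100.0   # exact year matches move toward the top
        if author_exact_match[i]:
            corrected[i] += 100.0   # full author-name matches move toward the top
    return corrected
```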

Bayesian A/B test results

We conducted A/B testing for several weeks to evaluate the performance of the new reranker. Here are the results when looking at the (average) total number of clicks per query issued:

This tells us that people click somewhere on the search results page about 8% more often. But are they clicking on higher-positioned results? We can check by looking at the maximum reciprocal rank of the clicks for each query; if a query has no clicks, it is assigned a value of 0.
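In other words, the metric is computed roughly like this (a minimal sketch; `clicked_positions` is the list of 1-indexed click positions for one query):

```python
def max_reciprocal_rank(clicked_positions):
    """1/position of the highest click for a query, or 0 if there were no clicks."""
    if not clicked_positions:
        return 0.0
    return 1.0 / min(clicked_positions)

assert max_reciprocal_rank([3, 7]) == 1.0 / 3
assert max_reciprocal_rank([]) == 0.0
```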

The answer is yes: the maximum reciprocal rank of clicks increased by about 9%! To understand the change in click positions in more detail, here is a histogram of the highest/maximum click position for control and test:

This histogram excludes non-clicks and shows that most of the improvement occurs in position 2, followed by position 3 and position 1.

Why not do all this in Elasticsearch?

The following two sections were written by Tyler Murray, Director of Search Engineering at Semantic Scholar.

Elasticsearch provides a powerful set of tools and integrations that let even the most advanced users build a wide variety of matching, scoring, and reranking clauses. While powerful, these features tend to be confusing at best and unmanageable at worst when combined across many fields/clauses/filters. For more complex search use cases, the effort involved in debugging and adjusting boosts, filters, and rescoring can quickly become untenable.

Learning-to-rank (LTR) is often the preferred approach for search teams looking to move from manually tuned weights and rescoring to a system trained on real-world user behavior. Implementing LTR inside Elasticsearch (as a plugin), versus reranking outside of it, comes with the following advantages and disadvantages:

Advantages:

  • Re-scoring occurs during the query lifecycle of Elasticsearch.
  • Avoid or minimize the network costs associated with "passing" reranked candidates.
  • Keep technology more closely aligned with the main storage engine.
  • Uptime of the search experience is isolated to the cluster rather than spread across services.

Disadvantages:

  • The plug-in architecture requires Java binaries to run in the Elasticsearch JVM.
  • When modifying, deploying, and testing, iterations can be very slow due to the need for a full rolling restart of the cluster. This is especially true for larger clusters. (>5TB)
  • Although Java maintains an active and mature ecosystem, most cutting-edge machine learning and AI technologies currently exist in the Python world.
  • In a space where gathering judgments is difficult and/or simply impossible at scale, LTR methods become difficult to train effectively.
  • There is limited flexibility in testing ranking algorithms side by side in A/B testing without running multiple ranking plugins in the same cluster.

As we look to the future of how ranking changes will be tested and deployed, the disadvantages of the Elasticsearch plugin approach greatly outweigh the advantages along two main axes. First, iteration and testing speed, because this is critical to our user-centered approach to improvement. Offline measurements are critical for sanity-checking the various models as we iterate, but the final measure will always be how the model performs in the wild. With the plugin architecture Elasticsearch provides, iteration and testing become quite tedious and time-consuming. Second, the robust toolchain available in the Python ecosystem outweighs any short-term latency regressions. The flexibility of integrating various language models and existing machine learning techniques has proven fruitful in solving a wide range of relevance problems, and porting those solutions back to the Java ecosystem would be a huge undertaking. All in all, Elasticsearch provides a solid foundation for building powerful search experiences, but as we tackle more complex relevance problems and need to iterate faster, we increasingly need to look beyond the Elasticsearch ecosystem.

Tuning candidate queries in Elasticsearch

Getting the right combination of filtering, scoring, and rescoring clauses turned out to be much more difficult than expected. This is partly due to working from an existing baseline used to support plugin-based ranking models, but also due to some issues with index mapping. A few notes to help guide others through their journey:

Don’t:

  • Allow the documents you’re searching against to become bloated with anything but the necessary fields and analyzers for search ranking. If you’re using the same index to search and hydrate records for display you may want to consider whether multiple indices/clusters are necessary. Smaller documents = faster searches, and becomes increasingly more important as your data footprint grows.
  • Use many multi_match queries as these are slow and prove to generate scores for documents that are difficult to reason about.
  • Perform function_score type queries on very large result sets without fairly aggressive filters or considering whether this function can be performed in a rescore clause.
  • Use script_score clauses, they’re slow and can easily introduce memory leaks in JVM. Just don’t do it.
  • Ignore the handling of stopwords in your indices/fields. They make a huge difference in scoring, especially so with natural language queries where a high number of terms and stopword usage is common. Always consider the common terms (<= v7.3) query type mentioned below or a stopword filter in your mapping.
  • Use field_name.* syntax in filters or matching clauses as this incurs some non-trivial overhead and is almost never what you want. Be explicit about which fields/analyzers you are matching against.

Do:

  • Consider using common terms queries with a cutoff frequency if you don’t want to filter stopwords from your search fields. This was what pushed us over the edge in getting a candidate selection query that performed well enough to launch.
  • Consider using copy_to during indexing to build a single concatenated field in places where you want to boost documents that match multiple terms in multiple fields. We recommend this approach anywhere you are considering a multi_match query.
  • Use query_string type queries if your use case allows for it. IMO these are the most powerful queries in the ES toolbox and allow for a huge amount of flexibility and tuning.
  • Consider using a rescore clause as it improves performance of potentially costly operations and allows the use of weighting matches with constant scores. This proved helpful in generating scores that we could reason about.
  • field_value_factor scoring, in either your primary search clause or in a rescore clause, can prove incredibly useful. We consider highly cited documents to be of higher relevance and thus use this tool to boost those documents accordingly (see the sketch after this list).
  • Read the documentation on minimum_should_match carefully, and then read it a few more times. The behavior is circumstantial and acts differently depending on the context of the use.
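To make a couple of these points concrete, here is a hedged sketch of a candidate-selection query that combines a rescore clause with field_value_factor-style citation boosting, expressed as a query body for the official Python client. The index name, field names, weights, and example query string are made up for illustration; this is not the actual Semantic Scholar query.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

body = {
    "query": {
        "query_string": {
            "query": 'aerosol AND "SARS-CoV-2"',
            "fields": ["title^3", "abstract"],   # explicit fields, no field_name.*
            "minimum_should_match": "2<75%",
        }
    },
    "rescore": {
        "window_size": 1000,                     # only rescore the top candidates
        "query": {
            "rescore_query": {
                "function_score": {
                    "field_value_factor": {      # boost highly cited documents
                        "field": "n_citations",
                        "modifier": "log1p",
                    }
                }
            },
            "query_weight": 1.0,
            "rescore_query_weight": 0.5,
        },
    },
}

response = es.search(index="papers", body=body, size=1000)
```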

Conclusion and acknowledgments

A new search is live on  semanticscholar.org  and we think it's a big improvement! Give it a try and give us some feedback by emailing  [email protected] .

The code is also available for you to scrutinize and use. Feedback welcome.

The entire process took about 5 months and would not have been possible without the help of a large portion of the Semantic Scholar team. I especially want to thank Doug Downey and Daniel King for tirelessly brainstorming with me, looking at countless prototype model results, and telling me how they were broken in new and interesting ways. I would also like to thank Madeleine van Zuylen for all her wonderful annotation work on this project, and Hamed Zamani for helpful discussions. Thanks as well to the engineers who took my code and magically made it work in production.

Appendix: Details about features

  • *_fraction_of_query_matched_in_text — What part of the query was matched in this specific field?
  • log_prob refers to the log probability of the actual match under a language model. For example, if the query is deep learning for sentiment analysis, and the phrase sentiment analysis is matched, we can compute its log probability with a fast, low-overhead language model to get a sense of how surprising the match is. The intuition is that we not only want to know how much of the query was matched in a particular field, but also whether what was matched is interesting. The lower the probability of the match, the more interesting it should be; for example, "viral load" is a much more surprising match than "they go to the store." *_mean_of_log_probs is the mean log probability of the matches in that field. We use KenLM as our language model rather than something like BERT: it's lightning fast, which means we can call it dozens of times for each featurization and still featurize quickly enough for production Python code. (Many thanks to Doug Downey for suggesting this feature type and KenLM; a short scoring sketch follows this list.)
  • *_sum_of_log_probs*match_lens — Taking the average log probability does not provide any information about whether the match occurred multiple times. Summing favors papers where the query text matches multiple times. This is mostly related to abstracts.
  • sum_matched_authors_len_divided_by_query_len — This is similar to the matching for title, abstract, and venue, but author matching is done one author at a time. This feature has a few extra tricks: we care more about last-name matches than first- and middle-name matches, but not to the exclusion of the latter. You may come across search results where a paper with a matching middle name is ranked above a paper with a matching last name; this is a TODO for feature improvement.
  • max_matched_authors_len_divided_by_query_len — The sum gives you a sense of how much of the author field matched overall, and the max tells you what the largest single-author match is. Intuitively, if you search for Sergey Feldman, one paper might be by (Sergey Patel, Roberta Feldman) and another by (Sergey Feldman, Maya Gupta); the second paper is a much better match. The max feature lets the model learn this.
  • author_match_distance_from_ends — Some papers have 300 authors, and on those you're more likely to get an author match purely by chance. Here we tell the model where the matched author appears in the author list. If you matched the first or last author, this feature is 0 (and the model learns that a smaller value matters more). If the matched author is number 150 out of 300, the feature value is 150 (large values are considered bad). An early version of this feature was just len(paper_authors), but the model learned to penalize papers with many authors too harshly.
  • fraction_of_*quoted_query_matched_across_all_fields — Even though we have per-field match fractions, it helps to also know how much of the query was matched when all fields are combined, so the model doesn't have to learn how to add.
  • sum_log_prob_of_unquoted_unmatched_unigrams — The log probability of the query unigrams that were not matched in this paper. Here the model can figure out how to penalize incomplete matches. For example, if you search for deep learning to identify earthworms, the model might only find papers that are missing the word deep or missing the word earthworm. It can learn to push down matches that are missing a very surprising term such as earthworm, assuming citations and recency are comparable.
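For the curious, scoring a matched span with KenLM looks roughly like this. The model path is a placeholder and the real featurizer wraps this with more bookkeeping; the helper is a sketch of the mean-of-log-probs idea, not the exact production function.

```python
import kenlm

# A small n-gram language model trained on scientific text; the path is a placeholder.
lm = kenlm.Model("titles_abstracts.arpa")

def mean_log_prob(matched_span):
    """Average per-token log10 probability of a matched ngram.
    Lower (more negative) means more surprising, hence more interesting."""
    words = matched_span.split()
    total = lm.score(matched_span, bos=False, eos=False)
    return total / max(len(words), 1)

print(mean_log_prob("sentiment analysis"))     # surprising: strongly negative
print(mean_log_prob("they go to the store"))   # unsurprising: closer to zero
```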
