Machine Learning Best Practices: Google's 43 Rules of Machine Learning

This article is a translation and interpretation of "Rules of Machine Learning: Best Practices for ML Engineering". Readers of my previous translations know that I am generally not strictly "faithful to the original", and this article is no exception. My interpretation of the original text takes roughly three forms:

  • Straight translation. Where the author explains a point well and I have nothing to add, I basically just translate the original text.
  • Half translation, half interpretation. For some items I have experience and thoughts of my own worth sharing, so I add some of my own interpretation.
  • Omission. In other places I feel the author is being overly meticulous and long-winded, or the item is basic enough that it needs little explanation, so I omit parts of the original to varying degrees.

This format may lose some of the original information for some readers, so anyone who wants the full text can find the original here: http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf, or look for other translations that stay closer to the original.

About the author

What kind of heavyweight dares to write something titled "Rules of Machine Learning" without blushing? First, a brief introduction to the author, Martin Zinkevich.

Martin Zinkevich is now a senior scientist at Google Brain. He is responsible for and participates in machine learning projects in products such as YouTube, Google Play, and Google Plus, and this article is based on the lessons he learned from machine learning projects on those three products. Before joining Google he was a senior scientist at Yahoo, where he twice won Yahoo's highest honor, the Yahoo Team Superstar Award (in 2010 and 2011), and made many outstanding contributions to Yahoo's advertising system.

With a background like that, we have good reason to believe that what he writes is worth taking seriously.

Overview

This article divides the process of applying machine learning in a product into three major stages, from shallow to deep, and further subdivides aspects within those stages to organize the 43 rules logically. Simply put, if you are building a machine learning system from scratch, you can consult the corresponding items at each stage to make sure you stay on the right path.

Text begins

To make great products:
do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

To a certain extent, this sentence is a high-level summary of the entire article (it might be more appropriate to call it a manual). Practical ML work really is more of an engineering problem than an algorithm problem: prioritize the gains that come from engineering, and only consider algorithm upgrades once those gains have been squeezed dry.

Before Machine Learning

Rule #1: Don’t be afraid to launch a product without machine learning.

Rule 1: Don’t be afraid to launch a product without machine learning.

The central idea can be summarized in one sentence: If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.

Rule #2: First, design and implement metrics.

Rule 2: Design and implement metrics before you start.

Before building the machine learning system itself, record as much detailed history of the current system as possible and retain feature data. This not only preserves feature data for later training, but also helps us understand the state of the system at any time and how it changes when we modify it.

Rule #3: Choose machine learning over a complex heuristic.

Rule 3: Don’t use overly complex rule systems, use machine learning systems.

To put it simply, complex rule systems are difficult to maintain and do not scale, whereas the equivalent ML system is easier to keep maintainable and extensible.

ML Phase I: Your First Pipeline

When building your first ML system, pay extra attention to the system architecture. Machine learning algorithms are exciting, but when the underlying infrastructure is not solid enough, locating problems becomes frustrating.

Rule #4: Keep the first model simple and get the infrastructure right.

Rule 4: The first model should be simple, but the architecture should be correct.

The core idea of the first version of the model is to capture the main features, stay close to how the model will actually be used, and get it online quickly.

Rule #5: Test the infrastructure independently from the machine learning.

Rule 5: Test architectural flows independently of machine learning.

Ensure that the architecture is individually testable and encapsulate the training part of the system to ensure that other parts are testable. In particular:

  1. Test that data enters the training algorithm correctly, and check that specific feature values are as expected.
  2. Test that the predictions produced in the experimental environment match the online predictions (a minimal check is sketched below).
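
As a minimal sketch of the second check, one could compare offline and online scores on a sample of requests. Here `offline_model`, `fetch_online_score`, and the example objects are hypothetical stand-ins for your own components:

```python
# Minimal sketch: compare offline predictions with the scores the online
# service produced for the same sampled inputs.
import math

def compare_offline_online(offline_model, fetch_online_score, sampled_examples,
                           tolerance=1e-6):
    """Return the examples whose offline and online scores disagree."""
    mismatches = []
    for example in sampled_examples:
        offline_score = offline_model.predict(example.features)
        online_score = fetch_online_score(example.request_id)
        if not math.isclose(offline_score, online_score, abs_tol=tolerance):
            mismatches.append((example.request_id, offline_score, online_score))
    return mismatches
```
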
Rule #6: Be careful about dropped data when copying pipelines.

Rule 6: Be careful about discarded data when copying a pipeline.

When copying a pipeline from one scenario to another, check whether the data requirements on both sides are the same and whether any data is being dropped.

Rule #7: Turn heuristics into features, or handle them externally.

Rule 7: Convert heuristic rules into features, or process them externally.

The problems a machine learning system solves are usually not brand new; more often they are further optimizations of existing problems. That means there are many existing rules or heuristics that can be exploited (for example, the rules used by a rule-based recommendation ranker), and this information should be used to the fullest. Here are several ways heuristics can be used:

  1. Preprocess with the heuristic. If the heuristic is very reliable, you can use it this way. For example, in spam detection, if a sender has already been blacklisted, there is no need to learn what "blacklisted" means; just block the sender.
  2. Create a feature. Consider turning the heuristic directly into a feature. For example, if a heuristic computes a relevance score for a query, you can use that score as a feature. Later you can also feed in the raw data used to compute the score in order to capture more information.
  3. Mine the raw inputs of the heuristic. If an app-ranking heuristic combines signals such as download counts and title text length, consider using those raw signals as separate features.
  4. Modify the label. This is appropriate when you feel the heuristic captures information not contained in the label. For example, if you want to maximize downloads but also care about content quality, one option is to multiply the label by the app's average star rating. Similar tricks are common in e-commerce; for example, in click-through-rate prediction you can up-weight the samples corresponding to ordered or high-quality products.

Existing heuristic rules can help the machine learning system make a smoother transition, but you should also consider whether there is a simpler implementation with the same effect.
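
As an illustration of options 2 and 4 above (a sketch only; the function and field names are hypothetical):

```python
# Illustrative sketch of two of the options above.
def add_heuristic_feature(features, heuristic_relevance_score):
    """Option 2: feed the heuristic's output in as an ordinary feature."""
    features = dict(features)
    features["heuristic_relevance"] = heuristic_relevance_score
    return features

def reweight_label_by_quality(download_label, avg_star_rating):
    """Option 4: fold a quality signal into the label, e.g. downloads x stars."""
    return download_label * avg_star_rating
```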

Monitoring

In summary, maintain good monitoring habits, such as making alerts actionable and building a dashboard page.

Rule #8: Know the freshness requirements of your system.

Rule 8: Know your system’s freshness requirements.

If a model update is delayed by one day, how badly is your system affected? What about a week's delay, or longer? This information lets us prioritize what to monitor. If going a day without an update costs the model 10% of its revenue, consider having an engineer watch it around the clock. Understanding the system's freshness requirements is the first step in deciding on a concrete monitoring plan.

Rule #9: Detect problems before exporting models.

Rule 9: Detect problems before the model goes live.

Before the model goes online, run completeness and correctness checks, such as verifying metrics like AUC, calibration, and normalized entropy (NE). A problem caught before launch can be handled with an email alert; a problem in a model users are already being served may warrant a phone call.
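
A minimal sketch of such a pre-export gate, assuming scikit-learn is available; the thresholds and the alerting behavior are only illustrative:

```python
# Illustrative pre-export gate: refuse to export a model whose held-out AUC
# or calibration looks wrong.
from sklearn.metrics import roc_auc_score

def check_before_export(y_true, y_pred, min_auc=0.70, max_calibration_gap=0.05):
    auc = roc_auc_score(y_true, y_pred)
    # Calibration check: mean predicted probability vs. observed positive rate.
    calibration_gap = abs(sum(y_pred) / len(y_pred) - sum(y_true) / len(y_true))
    ok = auc >= min_auc and calibration_gap <= max_calibration_gap
    if not ok:
        # In a real system this would send an email alert rather than print.
        print(f"Export blocked: AUC={auc:.3f}, calibration gap={calibration_gap:.3f}")
    return ok
```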

Rule #10: Watch for silent failures.

Rule 10: Pay attention to silent failure.

This is a very important issue that is often overlooked. The so-called silent failure refers to the problem that all processes are completed normally, but there is a problem with the data relied on behind it, causing the model effect to gradually decline. This problem does not often occur in other systems, but it is more likely to occur in machine learning systems. For example, a data table that training relies on has not been updated for a long time, or the meaning of the data in the table has changed, or the coverage of the data suddenly decreases, which will have a great impact on the effect. The solution is to monitor the statistical information of key data and perform manual inspection of key data periodically.
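
One simple way to monitor the statistics of key data, sketched here under the assumption that each training row is a feature dict and that a baseline coverage table is maintained somewhere:

```python
# Minimal sketch: flag features whose non-missing rate drops sharply versus a
# stored baseline, which is a common symptom of a silent failure upstream.
def check_feature_coverage(todays_rows, baseline_coverage, max_relative_drop=0.2):
    """`todays_rows`: iterable of feature dicts; `baseline_coverage`: feature -> expected coverage."""
    rows = list(todays_rows)
    alerts = []
    for feature, expected in baseline_coverage.items():
        observed = sum(1 for r in rows if r.get(feature) is not None) / len(rows)
        if expected > 0 and (expected - observed) / expected > max_relative_drop:
            alerts.append((feature, expected, observed))
    return alerts  # a non-empty result should trigger a manual inspection
```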

Rule #11: Give feature columns owners and documentation.

Rule 11: Assign an owner to each feature column and document it.

The feature column here refers to a group of related features; for example, the set of possible countries a user might belong to forms one feature column.

If the system is large and has a lot of data, it becomes very important to know who generated each set of data. Even if a feature name gives a brief description, the specific computation logic, data sources, and so on should be documented in more detail.

Your First Objective

The objective is the value the model directly tries to optimize, while a metric is any value used to measure the system.

Rule #12: Don’t overthink which objective you choose to directly optimize.

Rule 12: Don’t get too hung up on which goal to optimize for.

In the early days of machine learning, even if you only optimize one goal, many indicators will generally rise together. So don’t worry too much about which one should be optimized.

Although the author says so, in my own practical experience, optimizing a single objective does not always improve the overall system. For example, the CTR model of a typical recommendation system will indeed lift CTR after launch, but the corresponding CVR is likely to drop; at that point a CVR model is needed, and only by using both together does the system genuinely improve. The reason is that each objective focuses on one sub-process of the whole pipeline, and greedily optimizing that sub-process does not necessarily lead to the global optimum; usually several key sub-processes must be optimized together before the overall effect improves.

Rule #13: Choose a simple, observable and attributable metric for your first objective.

Rule 13: Choose a simple, observable, and attributable metric for your first objective.

The objective should be simple and measurable, and be an effective proxy for the metric. Behaviors best suited to be modeled are behaviors that can be directly observed and attributed, such as:

  1. Was the link clicked?
  2. Has the software been downloaded?
  3. Is the message forwarded?
    ……

At first, try not to model behaviors that have only indirect effects, such as:

  1. Will the user visit the next day?
  2. How long do users stay on the site?
  3. How many daily active users are there?

Indirect indicators make good metrics and can be observed in A/B tests, but they are not suitable as the optimization objective. Additionally, never attempt to learn the following goals directly:

  1. Are users satisfied with the product?
  2. Are users satisfied with the experience?
    ……

These metrics are very important but very hard to learn directly. Use proxy metrics instead, and improve the indirect indicators by optimizing the proxies. For the sake of the company's development, it is best to have humans connect the machine learning objective to the product and the business.

Rule #14: Starting with an interpretable model makes debugging easier.

Rule 14: Using a highly interpretable model can reduce the difficulty of debugging.

Give priority to models whose predictions have a probabilistic meaning and whose prediction process can be explained; this makes it easier to confirm the effect and debug problems. For example, if LR is used for classification, the prediction is nothing more than a few multiplications and additions; if the features are discretized, it is only additions. That makes it easy to trace how the prediction score of any given sample was computed, and therefore easy to debug when something goes wrong.
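
A small sketch of why this helps, assuming a weight dict, a bias term, and a sparse feature dict: the logit is a plain sum, so any single prediction can be decomposed into per-feature contributions and inspected.

```python
# Sketch: break a logistic regression score into per-feature contributions.
import math

def explain_lr_prediction(weights, bias, features):
    """`weights` and `features` are dicts keyed by feature name."""
    contributions = {name: weights.get(name, 0.0) * value
                     for name, value in features.items()}
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    # Most influential features first, for quick inspection.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked
```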

Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer.

Rule 15: Separate spam filtering from quality ranking and handle them in a policy layer.

The data distribution in which a quality-ranking system operates is relatively static; to rank well, participants mostly play by the rules the system sets. Spam filtering, by contrast, is an adversarial task, and the data distribution changes constantly. So rather than letting the ranking system handle spam filtering, there should be a separate layer for spam. This idea generalizes: the ranking layer should only do ranking, its responsibilities should be as focused as possible, and other tasks should be handled by better-suited modules in the architecture. In addition, to improve model quality, spam should be removed from the training data.

ML Phase II: Feature Engineering

The focus of the first stage is to feed data into the learning system, to have basic monitoring indicators and a basic architecture. After this system is established, the second phase begins.

Overall, the core work of the second phase is to add as many effective features as possible to the first version of the system, which can generally lead to improvements.

Rule #16: Plan to launch and iterate.

Rule 16: Be prepared for continuous iteration and launch.

To put it simply, you must deeply internalize that system optimization never has an end point, so the system design must be very iteration-friendly: Is adding and removing features simple enough? Is correctness verification simple enough? Can model iterations run in parallel? And so on.

Although this is not a concretely actionable rule, this mindset helps the development of the whole system. Only when we truly recognize that the system will be iterated on continuously will we design the online and offline architecture, and the accompanying tools, for continuous iteration, instead of building a one-off system.

Rule #17: Start with directly observed and reported features as opposed to learned features.

Rule 17: Prioritize the use of directly observed or collected features rather than learned features.

The so-called learned features are features produced by another algorithm, rather than simple features that can be directly observed or collected. Learned features may not suit your current model because of external dependencies or complex computation logic, so they pose risks to stability and effectiveness. Directly observable features are comparatively objective and have fewer dependencies, which makes them more stable.

Rule #18: Explore with features of content that generalize across contexts.

Rule 18: Explore the use of content features that can span scenarios.

The central idea is to make more use of features that work in multiple scenarios, such as global click-through rate and page views. These are especially useful for cold start or other scenarios that lack effective features.

Rule #19: Use very specific features when you can.

Rule 19: Try to use very specific features.

If the amount of data is large enough, using a large number of simple features is a simpler and more effective choice than a small number of complex features.

"Very specific" means features that each cover a relatively small number of samples, such as document IDs or query IDs. Although each such feature covers only a small fraction of the samples, as long as the overall coverage of the feature group is relatively high, say 90%, that is fine. Regularization can then be used to eliminate features with low coverage or weak correlation. This is one reason large-scale ID features are so popular; many large companies now use them in their ranking models.
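
The article does not prescribe an implementation, but one widely used way to handle such large ID vocabularies is the hashing trick combined with a sparse regularized linear learner; the sketch below assumes scikit-learn and hypothetical example dicts with doc_id and query_id fields:

```python
# Sketch (not from the article): hash large-scale ID features into a fixed-size
# sparse space and let L1 regularization prune low-value ones.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2**20, input_type="string")

def to_hashed_features(raw_examples):
    # Each example contributes ID-level tokens such as "doc_id=123".
    return hasher.transform(
        [f"doc_id={ex['doc_id']}", f"query_id={ex['query_id']}"]
        for ex in raw_examples
    )

model = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-6)
# model.fit(to_hashed_features(train_examples), train_labels)  # hypothetical data
```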

Rule #20: Combine and modify existing features to create new features in human-understandable ways.

Rule 20: Combine and modify existing features in a human-understandable way to obtain new features.

Discretization and feature crosses are the two most common techniques. Their essence is to use feature engineering to add nonlinearity without changing the model itself. There is not much more to say about the techniques themselves. What is worth noting is that when large-scale ID features are crossed, for example query keywords on one side and document keywords on the other, the number of combined features explodes. There are two ways to handle such crosses:

  1. Dot product. In the simplest form, this counts how many keywords the query and the document have in common.
  2. Intersection. Each feature dimension indicates that a particular word appears in both the query and the document: 1 if it appears in both, 0 otherwise.

The so-called "human understandable way", my understanding is that discretization and intersection must be based on the understanding of business logic, and cannot be crossed randomly.

Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.

Rule 21: The number of feature weights that can be learned in a linear model is roughly proportional to the amount of training data.

There are complex statistical principles behind this, but you just need to know the conclusion. The inspiration this principle gives us is to choose the feature generation method based on the amount of data, for example:

  1. If your system is a search system, there are millions of words in the query and documents, but you only have thousands of labeled samples. Then don’t use ID-level keyword features. Instead, consider dot product features and control the number of features to a level of dozens.
  2. If you have millions of samples, you can cross document and query keyword features and then use regularization for feature selection. This gives you millions of features, but regularization will prune them down. Roughly: tens of millions of samples, hundreds of thousands of features.
  3. If you have billions of samples or more, you can use query- and document-ID-level features, together with feature selection and regularization. Roughly: billions of samples, tens of millions of features.

To sum up, decide how to use features based on how many samples you have. If samples are scarce, abstract the features to a higher level so that the number of features matches the magnitude of the data.

Rule #22: Clean up features you are no longer using.

Rule 22: Clean up features that are no longer used.

If a feature is no longer useful and its intersection with other features is no longer useful, it should be cleaned up to keep the architecture clean.

When deciding which features to add or keep, look at each feature's sample coverage. For example, a personalized feature column with very low overall coverage, triggered by only a handful of users, is unlikely to be very useful. On the other hand, if a feature has very low coverage, say 1%, but is highly discriminative, for example 90% of the samples where it equals 1 are positive, then it is worth adding or keeping.
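
A sketch of the two statistics this paragraph relies on, coverage and discrimination, assuming rows are feature dicts and labels are 0/1:

```python
# Sketch: how many samples does a feature cover, and how discriminative is it
# when present (compare against the overall positive rate)?
def feature_coverage_and_lift(rows, labels, feature_name):
    covered = [(r, y) for r, y in zip(rows, labels) if r.get(feature_name)]
    coverage = len(covered) / len(rows)
    positive_rate_when_present = (
        sum(y for _, y in covered) / len(covered) if covered else 0.0
    )
    overall_positive_rate = sum(labels) / len(labels)
    return coverage, positive_rate_when_present, overall_positive_rate
```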

Human Analysis of the System

Before going further, we need to cover something you will not be taught in a machine learning course: how to examine an existing model and improve it. In the author's words this is more of an art, but there are still some rules to follow.

Rule #23: You are not a typical end user.

Rule 23: You are not a typical end user.

The central idea of this rule: while it is necessary to eat your own dog food, do not always judge a model's quality from an engineer's perspective. It is both a poor use of time and prone to missing problems. A poor use of time because engineers' time is expensive, as everyone knows; prone to missing problems because engineers looking at models they built themselves are too close to the work to see it objectively.

So the author argues that the reasonable approach is to let real end users judge the quality of a model or product, either through online A/B tests or through crowdsourcing.

Rule #24: Measure the delta between models.

Rule 24: Measure differences between models offline.

The original article does not say "offline", but from the context I understand it to mean offline. This rule says that before a new model goes online, it should be compared against the old model: for the same input, are the two models' results sufficiently different? For example, for the same query, is the difference between the two rankers' outputs large enough? If the offline comparison shows very little difference, there is no need to test online, because the online difference will certainly be small too. If the difference is large, check whether it is a good difference. Observing the differences tells you what effect the new model has on the data, for better or worse.

Of course, the premise of all this is that you need to have a stable comparison system. At least the difference between a model and itself should be very small, preferably zero.
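
As a sketch of such an offline delta measurement, assuming each ranker exposes a hypothetical rank(query) method returning an ordered list of item ids:

```python
# Sketch: average top-k overlap between the old and new rankers over a set of
# queries. A ranker compared against itself should score exactly 1.0.
def average_topk_overlap(old_ranker, new_ranker, queries, k=10):
    overlaps = []
    for query in queries:
        old_topk = set(old_ranker.rank(query)[:k])
        new_topk = set(new_ranker.rank(query)[:k])
        overlaps.append(len(old_topk & new_topk) / k)
    return sum(overlaps) / len(overlaps)

# An overlap close to 1.0 means the new model barely differs and is probably
# not worth an online test; a much lower overlap calls for a closer look.
```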

Rule #25: When choosing models, utilitarian performance trumps predictive power.

Rule 25: When choosing a model, utility metrics are more important than predictive power.

This is a very useful lesson. When we train a model, the objective is usually logloss, which means we are pursuing predictive accuracy. But the upper-layer application may use the model in several ways. If it is used for ranking, the exact predicted value matters less than the ordering; if it is used to set a threshold and then flag spam based on that threshold, the accuracy of the predicted value itself matters more. Of course, in most cases these properties move together.

Beyond what the author says, there is another situation that needs special attention: we may downsample during training, which pushes the overall predicted values higher or lower. If the predictions are only used for ranking, this shift matters little; but if they are used in other ways, for example multiplied with other values, typically CTR × bid in advertising or CTR × CVR in e-commerce recommendation ranking, then the accuracy of the predicted value itself matters, and it must be calibrated so that it lines up with the true click-through rate before sampling.
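
For the common case of downsampled negatives, the standard correction is q = p / (p + (1 - p) / w), where w is the rate at which negatives were kept; a sketch:

```python
# Sketch of the calibration fix: map the model's output back to the true scale
# when negatives were kept with probability w during training.
def recalibrate_after_negative_downsampling(p_model, negative_keep_rate):
    w = negative_keep_rate
    return p_model / (p_model + (1.0 - p_model) / w)

# Example: with 10% of negatives kept, a raw prediction of 0.5 corresponds to a
# calibrated probability of about 0.09 -- the value to multiply with bid or CVR.
print(recalibrate_after_negative_downsampling(0.5, 0.1))  # ~0.0909
```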

Rule #26: Look for patterns in the measured errors, and create new features.

Rule 26: Find patterns in errors and create new features.

This is a general technique for improving model performance. Specifically, look at the training samples the model gets wrong and ask whether adding an extra feature would let the model predict them correctly. Training data is used because it is the data the model has already tried to optimize: an error there means the model tried and failed to learn that sample with the current features, so if you give it a good enough new feature, it may be able to get that sample right.

Once you find faulty patterns, you can look for new characteristics outside of your current system. For example, if you find that the current system tends to mistakenly rank long articles later, you can add the article length feature and let the system learn the relevance and importance of article length.

Rule #27: Try to quantify observed undesirable behavior.

Rule 27: Try to quantify observed negative behavior.

If you observe a problem in the system that the model is not being optimized for, for example dissatisfaction with what the recommender surfaces, make the effort to turn that dissatisfaction into concrete numbers. Specifically, you can use manual labeling or similar methods to mark the unsatisfactory items and compute statistics on them. Once the problem is quantified, it can later be used as a feature, an objective, or a metric. The overall principle is "quantify first, then optimize".

One more thing: the optimization here does not necessarily mean optimizing with the model; it refers to optimizing the system as a whole. It may be hard to optimize the model directly against such a problem, but as long as it can be quantified, it can be reduced by other means, for example learning separate features for the offending items, or adjusting the recall stage.

Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior.

Rule 28: Be aware that similar behavior observed in the short term may not necessarily persist in the long term.

Suppose you build a system that learns, for each specific document ID and query ID, the click-through rate of that document under that query. Offline comparison and A/B tests show that its behavior is identical to the current system, and it is simpler, so you launch it. Later you discover that it never surfaces any new document under any query. Strange? Not at all: you only let it memorize historical data, and it has no information about anything new.

So the only way to measure whether a system works over the long term is to have it trained on real data collected while the model was online. Of course this is difficult.

More broadly, this rule is really about the generalization ability of a system or model. If it cannot make good predictions on new data, then no matter how good its offline performance looks, that performance does not reflect its real ability. Put another way, a system must be able to keep learning and adapting to new data to qualify as a machine learning system; otherwise it is just a parrot.

Training-Serving Skew

The skew between training and serving is a big topic. The main causes are differences in how data is obtained at training versus serving time, differences in data distribution between training and serving, and the feedback loop between the model and the algorithm. The author says this skew has shown up in many Google product lines and has caused real damage. And I would add that it is certainly not just a Google problem; Google surely handles it better than most, and such problems are far more common at small and medium-sized companies, which makes this part of the experience especially valuable. The core of the solution is to monitor changes in the system and the data, so that every difference stays within the scope of monitoring and nothing sneaks into the system.

Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.

Rule 29: The best way to ensure consistency between service and training is to save the features during service, and then feed the features into the training process through logs.

This sentence basically expresses the core routine to ensure that differences are minimized. This method based on feature logs can greatly improve the effect while reducing code complexity. Many teams at Google are also migrating to this approach.
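
A minimal sketch of that routine; `extract_features`, the request object, and the label lookup are hypothetical placeholders for your own serving stack:

```python
# Sketch: log the exact features used at serving time, then build training
# examples from that log so training sees what serving saw.
import json

def serve_and_log(model, request, feature_log):
    features = extract_features(request)          # hypothetical shared extractor
    score = model.predict(features)
    feature_log.write(json.dumps({"request_id": request.id,
                                  "features": features}) + "\n")
    return score

def training_examples_from_log(feature_log_path, labels_by_request_id):
    with open(feature_log_path) as f:
        for line in f:
            record = json.loads(line)
            label = labels_by_request_id.get(record["request_id"])
            if label is not None:
                yield record["features"], label
```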

Rule #30: Importance weight sampled data, don’t arbitrarily drop it!

Rule 30: Importance-weight sampled data; don't discard it arbitrarily!

This is the only rule where the author uses an exclamation mark; you can imagine the pain behind it. When we have too much training data we tend to keep only a portion of it, but dropping the rest outright is wrong. The correct approach is: if you keep a sample with probability 30%, give it a weight of 10/3 (that is, 1/0.3) during training. With such importance weights, the calibration of the training results is preserved.

One more point: this calibration matters a great deal, especially for advertising systems, or any system where several predicted values are multiplied or added to produce the final result. If the individual values are not calibrated, i.e. systematically too high or too low, their product or sum becomes meaningless. Calibration matters less if the predictions are used directly for ranking, since it does not change the ordering, only the absolute values.
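
A sketch of the weighting described above; most learners accept the resulting weights directly through a sample_weight argument:

```python
# Sketch: an example kept with sampling probability p gets weight 1/p, so a
# 30% sample is weighted 10/3 and the overall calibration is preserved.
import random

def sample_with_importance_weight(examples, keep_probability):
    for example in examples:
        if random.random() < keep_probability:
            yield example, 1.0 / keep_probability

# e.g. model.fit(X_sampled, y_sampled, sample_weight=weights)
```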

Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.

Rule 31: If you are joining a table during training and service, be aware that the data in this table may change.

For example, some document features are stored in a table, and before offline training you fetch those features from the table. The risk is that the data in the table at the time you fetch it offline may differ from what it was when the online service used it, because the table has changed in between. The best solution is to log the features on the server side at serving time, which guarantees consistency. Alternatively, if the table changes infrequently, you can take hourly or daily snapshots to reduce the difference, but remember that this does not completely solve the problem.

Rule #32: Reuse code between your training pipeline and your serving pipeline whenever possible.

Rule 32: Try to reuse code between the training pipeline and the serving pipeline.

Training is generally done offline in batches, while serving is online and streaming. Although the two process data very differently, a lot of code can still be shared, and sharing it helps eliminate training/serving differences at the code level. In other words, feature logging eliminates differences from the data side, and code reuse eliminates them from the code side; doing both works best.

Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.

Rule 33: If the training data is before January 5th, then the test data should start from January 6th.

The main purpose of this rule is to make offline test results closer to online results: in production we train on data up to the serving day and then predict on that day's traffic. Test results obtained this way may look lower, but they are more realistic.

Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data.

Rule 34: In binary classification problems that serve filtering (such as spam filtering), some short-term performance can be sacrificed for clean data.

In filtering tasks, samples classified as negative are not shown to the user; for example, a filter might block 75% of the negative samples from being displayed. If you then draw your next round of training data only from what was shown to users, your training samples are clearly biased.

A better approach is to reserve a small fraction of traffic (for example 1%) specifically for collecting training data; users in that slice see all items. This obviously hurts the actual filtering quality a little, but it collects much better data and serves the system's long-term health; otherwise the system grows more biased with each round of training until it becomes unusable. Meanwhile you still filter roughly 75% × 99% ≈ 74% of the negative samples overall, so the impact on the system is small.

But if your system filters out 95% or more of the negative samples, this approach becomes less feasible. Even then, in order to measure the model accurately, you can construct a smaller held-out slice (0.1% of traffic or less); around a hundred thousand samples are enough to give accurate evaluation metrics.

Rule #35: Beware of the inherent skew in ranking problems.

Rule 35: Be aware of the data bias inherent in ranking problems.

When a new ranking algorithm significantly changes the online results, you are effectively changing the data the algorithm will see in the future, and this bias appears. There are several ways to mitigate it; the common thread is to favor the data the model has already seen:

  1. Give stronger regularization to features that cover more queries (or similar roles, depending on the business). In this way, the model will focus more on features that only cover a part of the samples, rather than general features. This will prevent hot items from appearing in the results of irrelevant queries.
  2. Only allow features to have positive weights. That way any good feature beats an "unknown" feature.
  3. Don't use document-only features. This is the extreme case of the first item; otherwise you get something like the Harry Potter effect, where a document that is popular gets pushed up for every query. Removing document-only features prevents this.
Rule #36: Avoid feedback loops with positional features.

Rule 36: Use positional features to avoid feedback loops.

Everyone knows that the position of an item affects whether the user interacts with it, for example clicks it. So if the model has no position features, the effect that is really due to position gets attributed to other features, making the model inaccurate. Adding position features avoids this. Specifically, include position features during training, and at serving time either remove them or set them to the same value for every candidate. This lets the model weight the other features correctly.

Note that position features should stay relatively independent and should not be combined with other features. The model can be the sum of one function of the position features and another function of the remaining features. In practice, this can be achieved by making sure position features are never crossed with any other feature.
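
A sketch of this treatment, with hypothetical feature dicts: position is a real feature at training time, is pinned to one neutral value for every candidate at serving time, and is never crossed with anything else.

```python
# Sketch: position as a training-time feature, neutralized at serving time.
def training_features(raw_features, shown_position):
    feats = dict(raw_features)
    feats["position"] = shown_position      # the position actually shown: 1, 2, 3, ...
    return feats

def serving_features(raw_features, neutral_position=1):
    feats = dict(raw_features)
    feats["position"] = neutral_position    # identical for every candidate
    return feats
# Keep "position" out of all feature crosses so the model stays the sum of a
# position term and a content term.
```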

Rule #37: Measure Training/Serving Skew.

Rule 37: Measure the difference between training and service.

Overall, there are many reasons for this difference, which we can subdivide into the following parts:

  1. Difference between training set and test set. This difference will often exist and is not necessarily a bad thing.
  2. The difference between the test set and "next-day" data. This difference will always exist, and performance on the next-day data is what we should actually optimize, for example through regularization. If the gap is too large, it may be because time-sensitive features are being used, causing the model's behavior to change significantly over time.
  3. The difference between "next-day" data and live data. If, for the same example, the score computed at training time differs from the score computed at serving time, there is a bug in the engineering implementation (a minimal check is sketched below).
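
A minimal sketch of the third check, assuming each logged example carries the serving-time features and the score the server actually returned:

```python
# Sketch: recompute scores offline for logged examples and compare them with
# the scores recorded at serving time; any mismatch points to an
# implementation bug rather than a modeling issue.
def measure_training_serving_skew(logged_examples, offline_model, tolerance=1e-6):
    skewed = []
    for ex in logged_examples:
        offline_score = offline_model.predict(ex["features"])
        if abs(offline_score - ex["serving_score"]) > tolerance:
            skewed.append((ex["request_id"], ex["serving_score"], offline_score))
    return skewed
```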

ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models

There are usually some clear signals that mark the end of the second phase. First, the monthly boost will gradually decrease. You start to make trade-offs between different indicators, some going up and some going down. Well, the game gets interesting. Now that the gains are no longer easy to come by, machine learning has to become more complex.

In the first two stages, most teams can have a good time, but at this stage, each team needs to find their own path.

Rule #38: Don’t waste time on new features if unaligned objectives have become the issue.

Rule 38: Don’t waste time on new features if the objective is not agreed upon.

When the system as a whole reaches a stable period, everyone will start to pay attention to issues other than the optimization goals of the machine learning system. At this time, the goal is not as clear as before, so if the goal has not been determined, don't waste time on features first.

Rule #39: Launch decisions are a proxy for long-term product goals.

Rule 39: Go-live decisions are a proxy for long-term product goals.

This sentence reads a bit awkwardly; the author gives several examples to illustrate it. I think the core point is this: the long-term health of a system, a product, or even a company must be measured across multiple metrics, and the decision to launch a new model should weigh all of them. "Proxy" means that optimizing this combination of metrics stands in for optimizing the long-term goals of the product and the company.

Decision making is easy only when every metric improves. But things are rarely that simple, especially when metrics cannot be converted into one another. For example, system A has one million daily actives and four million in daily revenue, while system B has two million daily actives and two million in daily revenue. Would you switch from A to B, or the other way around? Probably neither, because you do not know whether the gain in one metric outweighs the loss in the other.

The point is, no single metric can answer: “Where will my product be in five years?”

And each individual, especially engineers, obviously prefers goals that can be directly optimized, and this is also a common scenario for machine learning systems. There are also some multi-objective learning systems trying to solve this problem. But there are still many goals that cannot be modeled as machine learning problems, such as why users come to visit your website and so on. The author says that this is an AI-complete problem, also often called a strong AI problem. Simply put, it is a problem that cannot be solved by a single algorithm.

Rule #40: Keep ensembles simple.

Rule 40: Keep your ensemble strategy simple.

What is a simple ensemble? The author believes that an ensemble that only accepts the output of other models as input and does not come with other features is called a simple ensemble. In other words, your model is either a pure ensemble model or an ordinary base model that receives a large number of features.

Beyond keeping it simple, an ensemble should have some good properties. For example, improving a base model should never hurt the ensemble. It is also best for the base models to be interpretable (for example, calibrated), so that changes in a base model can be explained to the upper ensemble. And the ensemble should be monotonic: an increase in a base model's predicted probability should not decrease the ensemble's predicted probability.

Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.

Rule 41: When effects plateau, look for essentially new sources of information rather than optimizing existing signals.

You add some demographic characteristics of users, you add some textual characteristics of documents, etc., but the improvement in key indicators is less than 1%. What to do now?

At this time, you should consider adding some fundamentally different features, such as the history of documents the user has viewed in the past day or week, or data from another data source. In short, it is necessary to add features of completely different dimensions. You can also try using deep learning, but also adjust your expectations for ROI and evaluate whether the added complexity is worth the benefits.

Rule #42: Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.

Rule 42: Variety, personalization, or relevance may have a weaker correlation with popularity than you think.

Diversity means diversity of content or sources; personalization means each user gets different results; relevance means that the results returned for a query fit that query better than they would any other. What these three have in common is that they measure something other than plain popularity.

But the problem is that the common stuff is hard to beat.

If your metrics are clicks, dwell time, views, shares, and so on, you are essentially measuring popularity. Teams sometimes want to learn a personalized model with diversity, so they add personalization and diversity features, only to find that these features receive far less weight than expected.

That doesn't mean diversity, personalization, and relevance are unimportant. As noted above, diversity or relevance can be increased through post-processing. If you then see your long-term goals improve, you can conclude that diversity/relevance is genuinely valuable. At that point you can either keep using post-processing or modify the objective directly to account for diversity and relevance.

Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.

Rule 43: Your friends on different products will generally be the same, but your interests will usually be different.

Google often uses the same friend relationship prediction model on different products and achieves very good results. This proves that friend relationships on different products can be transferred. After all, they are the same group of people. But when they try to apply personalized features from one product to another, they often don't get good results. A feasible approach is to use raw data from one data source to predict behavior from another data source, rather than using processed features. Additionally, the user's history of behavior on another data source can be useful.

Summary

It is not hard to see from these 43 rules that, in the author's view, most of the problems we need to solve in most machine learning applications are engineering problems, and solving them requires not complex theory but careful attention to detail, architecture, and process. These are things we ordinary practitioners, not masters, can do. If a master builds a system that scores 95 or above, then as long as we optimize the engineering architecture, process, and details well enough, we can still build a system that scores at least 80.

Translator's introduction: Zhang Xiangyu ([email protected]) currently leads the recommendation system at Zhuanzhuan. He previously served as recommendation system development manager at Dangdang.com. Over the years he has mainly worked on recommendation systems and machine learning, has also done anti-spam and anti-cheating work, and is keen to explore applying big data and machine learning technology to other fields. He is recruiting recommendation algorithm and recommendation architecture engineers; if you are interested, please send your resume by email.


----------------------------------------------------------------------------------------------------------------------------------

Source of this article: "Rules of Machine Learning: Best Practices for ML Engineering"

By Martin Zinkevich, Google research scientist.

These are the forty-three rules of Google machine learning practice shared by Martin Zinkevich at the NIPS 2016 Workshop.
Terms

Instance: The thing to be predicted

Label: Prediction task results

Feature: An attribute of the entity used in the prediction task

Feature Column: A collection of related features

Example: An instance (with its features) together with a label

Model: A statistical representation of a prediction task. Train a model on examples and then use the model to make predictions

Metric: A number that you care about. It may or may not be directly optimized.

Objective: A metric that your algorithm tries to optimize

Pipeline (workflow): The infrastructure surrounding a machine learning algorithm. This includes collecting data from the front end, writing it into training data files, training one or more models, and exporting the models to production.
Overview

To create great products:

You need to use machine learning as a good engineer, not as a great machine learning expert (which you are not).

In fact, most of the problems you will face are engineering problems. Even with theoretical knowledge comparable to a machine learning expert's, most breakthroughs come from good features rather than good machine learning algorithms. So the basic approach is:

1. Make sure your workflow is reliable end to end

2. Start with a reasonable objective

3. Add common-sense features in a simple way

4. Make sure your workflow stays reliable

This approach can bring substantial gains and keep many people satisfied for a long time. Only consider more complex methods once the simple techniques stop working; the more complex the method, the slower the final product output.

When all the simple techniques are exhausted, it may be time to consider cutting-edge machine learning techniques.

This document mainly consists of four parts:

Part 1: Helping you understand whether it is time to build a machine learning system

Part 2: Deploying your first workflow

Part 3: Releasing and iterating while adding new features to the workflow, and how to evaluate models and training-serving skew

Part 4: What to do when you reach a plateau.

Before machine learning

Rule 1: Don’t be afraid to release a product that doesn’t use machine learning

Machine learning is cool, but it requires data. If you don't absolutely need machine learning, don't use it until you have data.

Rule 2: Prioritize the design and implementation of metrics

Before defining what your machine learning system will do, document your current system's "footprint" as much as possible. Reasons:

1. In the early days, it was relatively easy to obtain permission from system users.

2. If you think something will be important in the future, it’s best to start collecting historical data now

3. If you already have metrics in mind when designing the system, everything will be smoother in the future. In particular you don't want to need to grep in the logs in order to measure your metrics.

4. You can notice what has changed and what has not. For example, say you want to directly optimize for daily active users. However, early in your management of the system, you may notice drastic changes to the user experience that may not significantly change this metric.

The Google Plus team measures expands per read, reshares per read, plus-ones per read, comments per read, comments per user, reshares per user, and so on, which they use to measure the quality of a post at serving time. Likewise, it is important to have an experimentation framework that can group users into buckets and aggregate statistics per experiment. See Rule 12.

Rule 3: Prefer machine learning over heuristics.

A machine learning model is easier to update and manage than a complex heuristic.

Machine Learning Phase 1: First Workflow

Take the infrastructure construction of the first workflow seriously. While it's fun to use your imagination to come up with models, you first need to make sure your workflow is reliable so that problems are easy to spot.

Rule 4: The first model should be simple and the infrastructure should be correct.

The first model will improve your product the most, so it doesn’t need to be magical. Instead, you'll run into more infrastructure problems than you think. Before others use your magical new machine learning system, you need to decide:

1. How to obtain samples for learning algorithms

2. What are the definitions of “good” and “bad” for your system?

3. How to integrate your model into your application. You can apply your model online or precompute the model offline and save the results to a table. For example, you may want to pre-categorize web pages and store the results in a table, or you may want to categorize chat messages directly online.

Choose simple features to make it easier to ensure:

1. These features are correctly applied to the learning algorithm

2. The model can learn reasonable weights

3. These features are correctly applied to the server model.

If your system can reliably adhere to these three points, you have accomplished most of the work. Your simple model can provide benchmark metrics and baseline behavior that you can use to measure more complex models.

Rule 5: Test infrastructure in isolation.

Make sure the infrastructure is testable. The learning part of the system is encapsulated independently so that everything surrounding it can be tested.

Rule 6: Be aware of missing data when replicating workflows

We sometimes create a new workflow by copying an existing workflow. The data needed in the new workflow is likely to be discarded in the old data flow. For example, if we only record data about posts that users have seen, then this data is useless if we want to model "why a specific post was not read by users."

Rule 7: Either turn heuristics into features or handle them externally

The problems machine learning attempts to solve are often not entirely new. Many existing rules and heuristics can be leveraged. These same heuristics can be very helpful when you are tuning machine learning.
Monitoring

In general, implement good alert monitoring such as making alerts actionable and having reporting pages.

Rule 8: Understand the freshness requirements of the system

How much does performance degrade if the system runs on data that is a day old? What about a week old, or a quarter old? Knowing this helps you prioritize monitoring. If going a day without a model update costs 10% of revenue, it is best to have an engineer watching it continuously. Most ad-serving systems have new ads to handle every day and must be updated daily. Some systems need frequent updates and others don't, depending on the application and scenario. Also, freshness requirements can change over time, especially as features are added to or removed from the model.

Rule 9: Before exporting (publishing) your model, be sure to check for various issues

Export the model and deploy it to online services. If there is a problem with your model at this time, this is a problem that users will see. But if the problem occurs before, it is a training problem and the user will not notice it.

Always check the integrity of the model before exporting it. In particular, make sure the model performs reasonably on held-out data. If you suspect a problem with the data, don't export the model! Many teams that continuously deploy models check the AUC before exporting. A model problem caught before export earns you a warning email; a model problem that reaches users may earn you a dismissal letter. So before anything affects users, it is best to wait until you are confident before exporting.

Rule 10: Watch out for hidden failures

Machine learning systems are more prone to this than other kinds of systems. For example, a joined table stops being updated; the model still adjusts and its behavior still looks reasonable, but its quality gradually decays. Sometimes you find a table that has not been updated for months, and simply refreshing it improves performance more than any other change. A feature's coverage can also change due to an implementation change: it used to cover 90% of samples and suddenly covers only 60%. Google Play once had a table that went unchanged for six months; refreshing that one table increased the install rate by 2%. By tracking statistics and manually inspecting the data when necessary, you can reduce this kind of error.

Rule 11: Assign authors and documentation to features

If the system is large and has many features, make sure you know the creator or owner of each feature. If the person who understands a feature is leaving, make sure someone else picks up that knowledge. Although many feature names give a basic description of what the feature is, it is better to have a more detailed description, such as where it comes from and how it helps.
Your first goal

For your system, there are many metrics you care about. But your machine learning algorithm usually needs a single objective: the number it "tries" to optimize. The difference between a metric and an objective: a metric is any number your system reports, which may or may not matter.

Rule 12: Don’t overthink the goals you choose to directly optimize for

There are thousands of metrics that you care about and are worth testing. But early in the machine learning process, you'll find that they all go up even if you're not optimizing directly. For example, you care about the number of clicks, dwell time, and daily active users. If you only optimize for clicks, you'll typically see an increase in dwell time as well.

Therefore, while it is still easy to improve all the metrics at once, don't worry about how to trade them off. But don't take this too far: do not confuse your objective with the ultimate health of the system.

Rule 13: Choose a simple, observable, and attributable metric for your first goal

Sometimes you think you know your true goals, but as you observe the data and analyze the old system and the new machine learning system, you will find that you want to adjust again. Moreover, different team members cannot agree on the real goals. Machine learning goals must be easily measurable and must be representative of “real” goals. Therefore, train on a simple machine learning objective and create a "decision layer" that allows you to add additional logic on top (the simpler the better) to form the final ranking.

The easiest to model are user behaviors that can be directly observed and attributed to an action of the system:

1. Was the ranked link clicked?

2. Was the ranked item downloaded?

3. Was the ranked item forwarded/replied to/emailed?

4. Was the ranked item rated?

5. Was the displayed item marked as spam/pornographic/offensive?

Avoid modeling indirect effects initially:

1. Did the user visit again the next day?

2. How long did the user stay on the site?

3. How many daily active users were there?

Indirect effects make great metrics and can be used during A/B testing and launch decisions.

Finally, don’t try to ask machine learning to answer questions like:

1. Are users happy using the product?

2. Are users having a satisfying experience?

3. Does the product improve the user's overall well-being?

4. Will this affect the company's overall health?

These all matter, but they are extremely hard to measure. Instead, use proxies: if the user is happy, they will stay longer; if the user is satisfied, they will come back tomorrow.
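To make the "decision layer" mentioned earlier in this rule concrete, here is a minimal sketch, assuming the model outputs a click probability and the extra logic is a couple of hypothetical hand-written rules; all names, rules and weights below are illustrative:

```python
# The model is trained on one simple, observable objective (click probability);
# a thin layer of inspectable logic sits on top to produce the final ranking.

def final_score(item, p_click):
    """Combine the learned score with simple, inspectable business logic."""
    if item.get("is_spam"):          # hard filter, handled outside the model
        return float("-inf")
    score = p_click
    if item.get("is_fresh"):         # a simple, tunable boost
        score += 0.05
    return score

def rank(items, model_scores):
    scored = [(final_score(item, p), item) for item, p in zip(items, model_scores)]
    return [item for _, item in sorted(scored, key=lambda t: t[0], reverse=True)]
```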

Rule 14: Start with an interpretable model to make debugging easier.

Linear regression, logistic regression and Poisson regression are directly motivated by a probabilistic model, so each prediction can be interpreted as a probability or an expected value. This makes them easier to debug than models whose objectives directly optimize classification accuracy or ranking performance. For example, if the probabilities seen in training deviate from the probabilities predicted on held-out data or observed on the production system, the deviation signals a problem.

For example, in linear, logistic or Poisson regression there are subsets of the data on which the average prediction equals the average label (this is called 1-moment calibrated, or simply calibrated). If a feature takes the value 1 or 0 on each example, then the set of examples where it is 1 is calibrated; and if a feature is 1 on every example, then the set of all examples is calibrated.

We often use these probabilistic predictions to make decisions, for example ranking posts by expected value (probability of click, download, and so on). But remember that when it comes time to choose which model to use, the final decision matters more than the likelihood of the data given the model.
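A small sketch of the calibration check described above; `calibration_gap` is a hypothetical helper, and the sample numbers are made up:

```python
# On any slice of the data, the mean predicted probability should match the
# mean observed label; a large gap suggests a problem.
def calibration_gap(predictions, labels, mask=None):
    """mask: optional boolean list selecting a slice (e.g. feature == 1)."""
    if mask is None:
        pairs = list(zip(predictions, labels))
    else:
        pairs = [(p, y) for p, y, m in zip(predictions, labels, mask) if m]
    mean_pred = sum(p for p, _ in pairs) / len(pairs)
    mean_label = sum(y for _, y in pairs) / len(pairs)
    return mean_pred - mean_label  # far from 0 suggests miscalibration

# Example: check calibration on the slice where a binary feature is 1.
preds, labels = [0.2, 0.7, 0.4, 0.9], [0, 1, 0, 1]
feature_is_one = [True, True, False, False]
print(calibration_gap(preds, labels, feature_is_one))
```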

Rule 15: Distinguish between spam filtering and quality ranking at the decision-making level

Quality ranking is an art; spam filtering is a war. People who use your system will figure out exactly what you use to judge the quality of a post, and they will do whatever they can to give their posts those attributes. So quality ranking should focus on ranking content that is posted in good faith; if spam can climb to the top, the quality ranker's training is badly compromised. Likewise, vulgar content should be pulled out of the quality ranking and handled separately. Spam filtering is a different story: you have to assume that the features you need will change constantly, you will put many obvious rules into the system, and you should make sure your model is updated at least daily. The reputation of the content's creator also matters a great deal.

Machine learning stage two: feature engineering

Getting training data into the learning system, instrumenting and evaluating the metrics you care about, and building the serving infrastructure are the big tasks of the first stage of a machine learning system's life cycle. Once you have a working end-to-end system with unit tests and system tests in place, you enter stage two.

In stage two there is plenty of low-hanging fruit: there are many obvious features that can be pulled into the system. So the second stage of machine learning is about importing as many features as possible and combining them in intuitive ways. During this stage all of the metrics should still be rising and releases will ship regularly. It is a great time to bring in lots of engineers to join up all the data you can and build a genuinely impressive learning system.

Rule 16: Plan releases and iterations

Don't expect the model you are releasing now to be your last. So consider whether the complexity you are adding to the current model will slow down future releases. Many teams release a new model every quarter or even more often. There are three basic reasons to release new models:

1. New features will continue to appear

2. You are tuning regularization and combining old features in new ways, or

3. You are adjusting your goals.

Either way, it is worth investing a bit more in the model: looking at the data flowing into your examples helps you find new signals as well as old, broken ones. So as you build your model, think about how easy it will be to add, remove or recombine features; how easy it will be to create a fresh copy of the pipeline and verify its correctness; and whether two or three copies could run in parallel. Finally, don't worry about whether feature 16 of 35 makes it into this release of the pipeline: you'll get it next quarter.

Rule 17: Prioritize directly observable and recordable features over learned features.

First, what are learned features? Learned features are features generated by an external system (such as an unsupervised clustering system) or by the learner itself (for example through a factorization model or deep learning). Both can be useful, but they bring a lot of issues with them, so they do not belong in a first model.

If you use an external system to create a feature, remember that that system has its own objective, which may be only weakly related to yours. The external system may also go stale, and if you refresh its output, the meaning of the feature may change. So be careful when relying on features provided by an external system.

The main problem with factorization and deep models is that they are non-convex: there is no guarantee of finding the optimal solution, and the local minimum found in each run can differ. This variation makes it hard to tell whether a change to the system is meaningful or just random. A model without deep features already gives you a very good baseline; only after that baseline is reached should the more esoteric approaches be considered.

Rule 18: Extract features from different contexts

Machine learning is usually only a small part of a much larger system, so you should try to look at a user behavior from several angles. Take popular recommendations: posts in a forum's "hot" list generally accumulate many comments, shares and reads; if you feed those statistics to the model as features, it can then promote a brand-new post it is asked to score. YouTube's autoplay of the next video, similarly, could draw on co-watch counts (what most users watch after this video) or on user ratings. In short, if a user behavior serves as the label for your model, then observing that behavior under different contextual conditions can give you much richer features for training. Note that this is not the same as personalization: personalization is about determining whether this user likes something in this context, and how much, after you have figured out which users like it at all.

Rule 19: Try to choose more specific features

With massive amounts of data, it is simpler to learn millions of simple features than a few complex ones. Identifiers of retrieved documents and canonicalized queries do not generalize much, but they align your ranking with your labels on head queries. So don't be afraid of feature groups where each individual feature applies to only a tiny fraction of the data, as long as overall coverage stays above roughly 90%. You can also use regularization to eliminate features that apply to too few examples.

Rule 20: Combine and modify existing features in a reasonable way

There are many ways to combine and modify features, and machine learning systems such as TensorFlow let you preprocess data with "transformations". The two most basic ones are "discretization" and "crosses".

Discretization: split a continuous feature into many separate features. Take age: 1 to 18 becomes one feature, 18 to 35 another, and so on. Don't overthink the boundaries; basic quantiles usually get you most of the value.

Crosses: combine two or more features. In TensorFlow terminology, a feature column is a set of related features, such as {male, female} or {United States, Canada, Mexico}. A cross merges two or more feature columns: for example {male, female} × {United States, Canada, Mexico} is a cross that forms a new feature column containing features such as {male, Canada}, which would then fire on examples of Canadian men. Note that the more feature columns you merge in a cross, the more training data you need.
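A plain-Python sketch of these two transformations (TensorFlow's feature-column utilities offer the same ideas, but the logic is easier to see without a framework); the bucket counts and example values are illustrative:

```python
# Discretization via quantile boundaries, and a simple string-based cross.
def quantile_boundaries(values, num_buckets):
    ordered = sorted(values)
    return [ordered[int(len(ordered) * i / num_buckets)]
            for i in range(1, num_buckets)]

def discretize(value, boundaries):
    """Map a continuous value to the index of its bucket (a sparse feature)."""
    return sum(value >= b for b in boundaries)

def cross(*feature_values):
    """Combine several categorical values into one crossed feature."""
    return "_x_".join(str(v) for v in feature_values)

ages = [18, 22, 25, 31, 34, 40, 52, 67]
bounds = quantile_boundaries(ages, num_buckets=4)
print(discretize(33, bounds))     # bucket index for age 33
print(cross("male", "Canada"))    # 'male_x_Canada'
```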

If the feature column produced by a cross is very large, it can cause overfitting. For example, suppose you are building search ranking and you have one feature column with the words in the query and another with the words in the document. Crossing these two columns yields an enormous feature column with a huge number of features inside it. In a text setting there are two alternatives. The most common is the dot product: in its simplest form, count the words that appear in both the query and the document, then discretize that count into a feature. The other is intersection: a feature fires if and only if a particular word appears in both the document and the query.
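A small sketch of the dot-product and intersection features described above, under the simplifying assumption that queries and documents are plain whitespace-separated text; the helper names are hypothetical:

```python
# Two cheap text-matching features: a discretized count of shared terms
# ("dot product" style) and an intersection feature for a specific term.
def shared_term_count(query, document):
    return len(set(query.lower().split()) & set(document.lower().split()))

def discretized_overlap(query, document, boundaries=(1, 3, 5)):
    """Turn the raw overlap count into a small bucketed feature."""
    n = shared_term_count(query, document)
    return sum(n >= b for b in boundaries)

def intersection_feature(term, query, document):
    """1 only if `term` appears in both the query and the document."""
    return int(term in query.lower().split() and term in document.lower().split())

q, d = "free puzzle games", "Top free puzzle games for your phone"
print(shared_term_count(q, d), discretized_overlap(q, d),
      intersection_feature("puzzle", q, d))
```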

Rule 21: The number of feature weights learned through a linear model is roughly proportional to the amount of data

People have all kinds of intuitions here: that nothing reliable can be learned from a thousand examples, or that a particular choice of model always requires a million examples before training makes sense. The key point is that the amount of data you have determines roughly how many feature weights you can learn:

1) If you are working on a search ranking problem where the documents and queries contain millions of different keywords and you have about a thousand labeled examples, you should stick with the dot-product-style features described above. A thousand examples, roughly a dozen features.

2) If you have about a million examples, you can cross the document and query feature columns, using regularization and possibly feature selection. The cross can produce millions of features, but regularization prunes away the redundant ones. Ten million examples, maybe a hundred thousand features.

3) If you have billions or tens of billions of examples, you can cross the document and query feature columns, again with feature selection and regularization. A billion examples, perhaps ten million features.

Rule 22: Clean up features you no longer need

A feature that is no longer used is technical debt. If a feature isn't being used and combining it with other features isn't working either, drop it. Keep your infrastructure clean so that the most promising features can be tried as quickly as possible; if a dropped feature is needed someday, it can always be added back.

When deciding which features to keep or add, coverage is an important consideration. For example, if a feature covers only 8% of users, keeping it or dropping it will make little difference.

On the other hand, the amount of signal a feature carries also matters: if a feature covers only 1% of the data but 90% of the examples that have it are positive, it is a great feature and well worth adding.

Manual analysis of the system

Before entering the third stage of machine learning, there are things not taught in any machine learning course that are still worth attention: how to examine an existing model and improve it. This is more art than science. Here are a few anti-patterns to avoid:

Rule 23: You are not a typical end user

This is probably the easiest way for a team to get into trouble. Fishfooding (using a prototype within your own team) and dogfooding (using a prototype within your company) have many benefits, but engineers should still check whether the behavior is actually correct. A change that is obviously bad should not ship, but anything that merely looks reasonable should be tested further, whether by having non-experts answer questions or by running a live experiment on real users. There are two main reasons for this:

First, you are too close to the code you implemented. You may see only one particular aspect of the posts, or you may be emotionally involved (e.g., cognitive bias).

Second, your time as an engineer is too valuable to spend generating this kind of feedback by hand, and sometimes it simply doesn't work.

If you really want user feedback, use user-experience methodology: create user personas early in the process (see Bill Buxton's Sketching User Experiences) and run usability tests later (see Steve Krug's Don't Make Me Think). Creating a persona means inventing a hypothetical user: for example, if your team is all male, designing a persona of a 35-year-old woman will do far more good than designing several personas of 25-to-40-year-old men. Of course, letting real users try the product and observing their reactions is also valuable.

Rule 24: Measure differences between models

Before releasing your model online, one of the simplest and sometimes most useful checks is to measure how much its results differ from the model currently in production. If the difference is very small, you know without running an experiment that the new model will make little difference. If the difference is large, make sure the change is a good one: examining the queries where the two models disagree most helps you understand the nature of the change, for better or worse. The prerequisite is a stable system: make sure that a model compared against itself shows a small difference (ideally none at all).
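A minimal sketch of such a model-diff check; `model_a` and `model_b` stand for any functions mapping a query to a ranked list of document ids, and the top-k cutoff is arbitrary:

```python
# Before launching model B, measure how often its top-k results differ from
# production model A on the same queries; a self-comparison should be ~0.
def topk_diff(ranking_a, ranking_b, k=10):
    """Normalized symmetric-difference size between two top-k result lists."""
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(a ^ b) / (2 * k)  # 0.0 = identical, 1.0 = disjoint

def mean_diff(queries, model_a, model_b, k=10):
    diffs = [topk_diff(model_a(q), model_b(q), k) for q in queries]
    return sum(diffs) / len(diffs)
```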

Rule 25: When choosing a model, practical performance is more important than predictive ability

You might use your model to predict click-through rate (CTR), but in the end the key question is what you do with that prediction. If you use it to rank documents, the quality of the final ranking matters more than the prediction itself. If you use it to filter spam, the precision of what gets blocked matters more. Most of the time these two views agree; when they disagree, the gain at stake is probably small. So if a change improves log loss but degrades the system's real performance, do not adopt it. When this starts happening often, it is time to revisit your model's objective.

Rule 26: Find new patterns and create new features from errors

Suppose the model gets a certain example wrong. In a classification task this is a false positive or a false negative; in a ranking task it is a pair where a positive was ranked below a negative. Most importantly, this is an example the machine learning system knows it got wrong and would fix if it could. If you give the model a feature that lets it fix the error, it will try to use it.

On the other hand, if you create a feature based on examples the system already gets right, the feature will most likely be ignored. For example, suppose someone searches for "free games" in Play Store search and one of the top results is a less relevant app, so you create a feature for that kind of app. But if you are maximizing installs, and people install that kind of app anyway when searching for free games, the new feature will not have the effect you want.

So the right approach is: once the model gets examples wrong, look for a trend outside your current feature set. For example, if the system seems to be demoting longer posts, add post length as a feature. And don't be too specific about what you add: if you are going to add post length, don't try to guess exactly what "long" means; just add a dozen related features and let the model figure out what to do with them. That is the simplest and most effective way.

Rule 27: Try to quantify observed anomalous behavior

Some team members get frustrated by system behaviors they dislike that the current loss function does not capture. At that point, complaining is useless; they should do whatever it takes to turn the complaint into a concrete number. For example, if app search seems to show too many bad apps, consider having human raters identify them. If the problem can be measured, it can then become a feature, an objective or a metric. In short: quantify first, then optimize.

Rule 28: Pay attention to the difference between short-term behavior and long-term behavior

Suppose you have a new system that looks at every doc_id and exact_query pair and computes the click-through rate of each document for each query. You find that its behavior is almost identical to the current system in side-by-side and A/B tests, and since it is simple, you launch it. But then no new apps ever show up. Why? Since the system only shows a document based on its own history with that query, it has no way to learn that a new document should be shown.
The only way to understand how such a system behaves long-term is to train it only on data acquired while the model itself was live, and that is very difficult.
Training-serving skew: deviation between offline training and online serving

The reasons for this deviation are:

1) Training workflow and service workflow process data differently;

2) The data used for training and service is different;

3) A feedback loop between the algorithm and the model.

Rule 29: The best way to ensure that training is close to actual service is to save the features used during service time, and then use these features in subsequent training

Even if you can't do this for every example, doing it for a small fraction is better than nothing, because it lets you verify consistency between serving and training (see Rule 37). Teams at Google that set this up are sometimes surprised by how much it helps. The YouTube homepage switched to logging features at serving time, which both improved quality significantly and reduced code complexity, and many teams are now moving their infrastructure in this direction.
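A minimal sketch of logging features at serving time, assuming features are plain dictionaries and `model` is any object with a `predict` method; the record layout is illustrative:

```python
# At serving time, write out the exact feature values the model saw, keyed by
# request id, so training can later join labels onto them instead of
# recomputing features through a different code path.
import json
import time

def serve_and_log(request_id, features, model, log_file):
    score = model.predict(features)          # any scoring function
    record = {"request_id": request_id,
              "timestamp": time.time(),
              "features": features,          # exactly what the model consumed
              "score": score}
    log_file.write(json.dumps(record) + "\n")
    return score

def build_training_example(logged_record, label):
    """Later, join the observed label (click/install/...) onto the logged features."""
    return {"features": logged_record["features"], "label": label}
```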

Rule 30: Give sampled data weights according to their importance and don’t discard them haphazardly

When there is too much data, it is tempting to throw some of it away to lighten the load. This is a mistake, and several teams have caused themselves real problems this way (see Rule 6). Data that was never shown to the user can indeed be dropped, but for everything else importance weighting is better: if you decide to sample example X with a 30% probability, give it a weight of 10/3. Importance weighting preserves the calibration properties discussed in Rule 14.
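A tiny sketch of importance weighting under this convention (an example kept with probability p gets weight 1/p); the helper name is hypothetical:

```python
# Downsample instead of silently dropping, and carry the inverse probability
# as a weight so expected statistics (and calibration) stay unbiased.
import random

def sample_with_weight(example, keep_probability):
    """Return (example, weight) if kept, else None."""
    if random.random() < keep_probability:
        return example, 1.0 / keep_probability   # e.g. p = 0.3 -> weight 10/3
    return None

kept = [r for r in (sample_with_weight(e, 0.3) for e in range(100000)) if r]
# Each kept example carries weight 10/3, so weighted counts remain unbiased.
```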

Rule 31: Note that data in tables used during training and serving may change

Features in a joined table can change and take different values at training time and serving time, so even for the same item your model's predictions during training and during serving can differ. The easiest way to avoid this kind of problem is to log the features at serving time (see Rule 32). If the table changes slowly, you can also snapshot it hourly or daily so the data stays reasonably close, but this still does not solve the problem completely.

Rule 32: Try to reuse code between training and serving workflows

First, be clear that batch processing and online processing are different. Online, you must handle each request as it arrives (for example, a separate lookup per query), whereas in batch you can combine work, such as performing one big join. At serving time you do online processing, while training is a batch job. Even so, a lot of code can be shared. For example, you can build a system-specific object in which the results of all joins and lookups are stored in a human-readable way and errors can be tested easily. Then, once all the information has been gathered, whether during serving or training, a common method converts between this object and whatever format the machine learning system needs, and one source of training-serving skew is eliminated. As a corollary, try not to use two different programming languages for training and serving; that makes code reuse nearly impossible.

Rule 33: The data used for training is different from the data used for testing (for example, in terms of time, if you use all the data before January 5 for training, then the test data should use January 6 and after)

In general, evaluating a model on data generated after the data it was trained on better reflects how it will behave in production. Because of daily effects you may not be able to predict the exact click-through or conversion rate, but the AUC should be close.
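A minimal sketch of such a time-based split, with made-up dates:

```python
# Train on everything up to a cutoff date and evaluate on what comes after,
# mimicking how the model will actually be used.
from datetime import date

def time_split(examples, cutoff):
    """examples: iterable of (event_date, features, label) tuples."""
    train = [e for e in examples if e[0] <= cutoff]
    test = [e for e in examples if e[0] > cutoff]
    return train, test

data = [(date(2018, 1, 3), {"f": 1}, 0),
        (date(2018, 1, 5), {"f": 2}, 1),
        (date(2018, 1, 6), {"f": 3}, 1)]
train, test = time_split(data, cutoff=date(2018, 1, 5))
```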

Rule 34: In binary classification for filtering (such as spam detection), accept a small short-term performance sacrifice in exchange for clean data.

In filtering applications, negative examples are generally not shown to the user. Suppose your filter blocks 75% of the negative examples at serving time. You might be tempted to draw additional training data from the instances that were shown to users; for example, if a user marks an email that your filter let through as spam, you might want to learn from that.

But this approach introduces sampling bias. You will collect cleaner data if, during serving, you instead hold out 1% of all traffic and send every held-out example to the user. Your filter still blocks at least 74% of the negative examples, and the held-out examples can become your training data.

Note that if your filter blocks 95% or more of the negative examples, this approach becomes less practical. Even so, if you want to measure serving performance, you can use an even smaller sample (say 0.1% or 0.001%); ten thousand examples are enough to estimate performance quite accurately.

Rule 35: Be aware of the bias inherent in ranking problems

When you radically change a ranking algorithm, not only do the results change, but the data the algorithm will see in the future changes too. That introduces an inherent bias you must be fully aware of up front. The following approaches help you optimize the training data:

1. Use higher regularization on features that cover many queries than on features that fire for only one query. That way the model prefers features specific to one or a few queries over features that generalize to all queries, which helps prevent very popular results from leaking into irrelevant queries. Note that this is the opposite of the more conventional advice to regularize more unique feature columns more heavily.

2. Only allow features to have positive weights, which ensures that any good feature is preferred over an "unknown" feature.

3. Don't use document-only features. This is an extreme version of item 1: even if a given app is currently a hot download, you don't want to show it everywhere regardless of the query. Leaving out document-only features makes that much easier.

Rule 36: Avoid positional feedback loops

The position of content on the page dramatically affects how likely users are to interact with it; an app pinned to the top will obviously be clicked more often. An effective way to handle this is to add positional features, i.e., features describing where on the page the content was shown. If you train with positional features, the model learns to attribute much of the response to, say, a "1st-position" feature, and accordingly gives less weight to the other features of examples where "1st-position" is true. At serving time you don't give any candidate a positional feature, or you give them all the same default value, because you are scoring the candidate set before deciding in what order to display it.

It is important to keep positional features somewhat separate from the rest of the model, because they differ between training and serving. The ideal model is the sum of a function of the positional features and a function of the other features; for example, don't cross positional features with document features.
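A small sketch of this recipe, with hypothetical feature names; the point is that position is a standalone feature during training and a constant default at serving time:

```python
# Include display position as its own feature during training, but score all
# candidates with the same default position at serving time, and never cross
# position with other features.
def make_features(doc_features, position=None):
    features = dict(doc_features)            # content features stay as-is
    # Keep position as a standalone feature; do not cross it with anything else.
    features["position"] = position if position is not None else "default"
    return features

# Training: the real position at which the item was shown is known.
train_example = make_features({"has_query_match": 1}, position=1)
# Serving: the order is not decided yet, so every candidate gets the default.
serving_example = make_features({"has_query_match": 1})
```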

Rule 37: Measure training/serving skew

Many situations can cause skew. They fall roughly into a few categories:

1. The difference between performance on the training data and on held-out data. Generally speaking, this always exists, and it is not always a bad thing.

2. The difference between performance on held-out data and on newly generated "next-day" data. Again, this always exists. You should tune regularization to maximize performance on the newer data. However, if this gap is large, it may mean the model relies on time-sensitive features and its performance is degrading.

3. The difference between performance on new data and on live data. If you apply the model to an example in the training pipeline and to the same example at serving time, they should give exactly the same result (see Rule 5); a discrepancy here points to an engineering error.
The third stage of machine learning

There are some signs that the second phase is coming to an end. First, monthly gains start to shrink. You begin facing trade-offs between metrics: in some experiments, some metrics rise while others fall. This is where it gets interesting: gains become harder and harder to achieve, and the machine learning has to become more sophisticated.

A caveat: compared with the previous two stages, this part has many more open-ended rules. Phases one and two are a happy time; once you reach phase three, each team has to find its own path.

Rule 38: Don't waste time on new features if the real problem has become unaligned objectives

As your metrics plateau, your team will start looking at problems that fall outside the scope of the current ML system's objective. As noted before, if the product goals are not covered by the algorithm's objective, one of them has to change. For example, you may be optimizing clicks, likes or downloads while launch decisions still hinge on human evaluators.

Rule 39: Model release decisions are a proxy for long-term product goals

Alice has an idea for reducing the logistic loss of the install prediction. She adds a feature and the logistic loss goes down. In a live experiment, the install rate rises. But in the launch review meeting, someone points out that daily active users are down 5%, and the team decides not to release the model. Alice is disappointed, but she now realizes that launch decisions depend on multiple criteria, only some of which machine learning can optimize directly.

The real world is not an online game: there are no "attack values" and "health bars" that measure the health of your product. The team has to rely on the statistics it collects to predict how the system will do in the future, and it has to care about user stickiness, 1-day and 30-day active users, revenue and advertisers' interests. Even these A/B-test metrics are only proxies for longer-term goals: satisfying users, growing the user base, satisfying partners and earning profit; and even those can be seen as proxies for having a useful, high-quality product and a thriving company five years from now.

The only easy launch decisions come when every metric improves (or at least none gets worse). If the team can choose between a sophisticated ML algorithm and a simple heuristic, and the simple heuristic does a better job on these metrics, the heuristic should win. Beyond that, there is no explicit ranking over all possible metric values. Consider the following two more specific scenarios:

If the existing system is A, the team is unlikely to switch to B; if the existing system is B, the team is unlikely to switch to A. This looks like a contradiction of rational decision-making, but expected metric changes may or may not materialize, so either switch carries substantial risk. Each metric covers some of the risk the team worries about, yet no metric covers the team's ultimate concern: "where will my product be in five years?"

Individuals, on the other hand, tend to favor a single objective they can directly optimize, and so do most machine learning tools. In such an environment, an engineer who keeps creating new features can keep shipping releases. A branch of machine learning called multi-objective learning is starting to address this: for instance, set a lower bound on each objective and optimize a linear combination of the metrics. But even then, not every metric is easily cast as an ML objective: if an article was clicked or an app installed, it is because the content was shown, but it is much harder to work out why a user visited your site at all. Fully predicting the future success of a site is an AI-complete problem, as hard as computer vision or natural language processing.

Rule 40: Keep the ensemble simple

A unified model that takes raw features and ranks content directly is the easiest model to understand and debug. However, an ensemble (a "model" that combines the scores of other models) often performs better. To keep things simple, every model should either be an ensemble that takes only other models' outputs as input, or a base model that takes many features, but never both. Combining models that were trained separately on top of one another can lead to bad behavior.

Use only simple models for the ensemble, taking nothing but the outputs of your base models as inputs, and enforce properties on the ensemble: for example, an increase in a base model's score should never decrease the ensemble's score. It is also best if every incoming model is semantically interpretable (for example, calibrated), so that changes in the underlying models do not confuse the ensemble; in particular, pushing an underlying classifier's predicted probability upward should never push the ensemble's predicted probability down.
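A minimal sketch of such a constrained ensemble, assuming the base models emit calibrated scores; the model names and weights are purely illustrative:

```python
# Combine calibrated base-model scores with a monotone function
# (non-negative weights), so raising any base score can only raise
# the ensemble score.
def ensemble_score(base_scores, weights):
    """base_scores, weights: dicts keyed by base-model name; weights >= 0."""
    assert all(w >= 0 for w in weights.values()), "keep the combination monotone"
    return sum(weights[name] * base_scores[name] for name in weights)

print(ensemble_score({"ctr_model": 0.12, "quality_model": 0.80},
                     {"ctr_model": 0.7, "quality_model": 0.3}))
```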

Rule 41: When encountering a performance bottleneck, rather than refining existing information, it is better to find new sources of quality information

You've added demographic information about the user, added information about the words in the text, explored template combinations and tuned the regularization. And then, for several quarters in a row, your key metrics barely move by 1%. Now what?

Now it is time to build infrastructure for radically different features: documents the user accessed yesterday, last week or last year, or data from a different property. Use wikidata entities or something internal to your company (such as Google's knowledge graph). You may need deep learning. Start adjusting your expectations about return on investment and plan the work accordingly; as with any engineering effort, weigh the value of new features against the complexity they add.

Rule 42: Don’t expect a strong connection between variety, personalization, relevance, and popularity

Diversity in a collection of content can mean many things, with diversity of source being the most common. Personalization means each user gets results that interest them personally. Relevance means the results for a particular query fit that query better than they would fit any other. So these three properties are defined, and judged, differently from plain popularity.

The problem is that popular content tends to be hard to beat.

Note: if your system is counting clicks, time spent, views, likes, shares and so on, you are measuring the popularity of the content. Teams sometimes try to learn a personalized model with diversity: they add features that would let the system personalize (features capturing the user's interests) or diversify (features indicating how much a document shares, say, author or content with other returned documents), only to find that these features get much lower weight than expected, and sometimes even the opposite sign.

This doesn't mean diversity, personalization and relevance aren't valuable. As the previous rule points out, you can add diversity or relevance through post-processing. If you then see longer-term objectives improve, you can at least declare that diversity or relevance is worth something beyond popularity, and either keep using post-processing or modify your objective directly to account for it.

Rule 43: Your friends tend to be the same across different products; your interests tend not to be.

Google's ML teams often take a model that predicts the closeness of a connection in one product, apply it to another product, and find that it works well. On the other hand, several teams have struggled with personalization features across product boundaries: yes, it looks like it should work, but so far it mostly doesn't. What sometimes does work is using raw data from one property to predict behavior on another. Also remember that merely knowing a user has a history on another property can help; for example, the presence of user activity on two products can be telling in itself.
