Machine Learning Practice of Dangdang's Recommendation Team

http://www.csdn.net/article/2015-10-16/2825925

Let me start with my motivation for this talk. There is no need to belabor how popular and powerful machine learning systems have become. But because of their peculiarities, building one that is reliable and easy to use is not easy. Whenever I see a colleague's excellent sharing, I wonder: how were these complex, delicate systems constructed? What did the build process look like? What pits lie behind them, and what lessons were learned? Is it possible to "steal" some of that experience?

So I hope to give a more "process-oriented" talk, sharing some of the practices we followed when building our system, the pits we fell into, and how we climbed out of them.

In addition, this talk focuses on the "small team". First, because Dangdang's current ML team is indeed relatively small; second, because as far as I know, not every company fields a large, well-staffed lineup like BAT. The experience and practice of a small team may therefore have its own reference value, and I hope this talk offers you a different perspective.



The practical experience shared today comes from the ML team of Dangdang's recommendation group.



Our team is responsible, from scratch, for building, tuning, maintaining, and improving the machine learning system behind Dangdang's recommendations and advertising. Except for the computing platform, which is maintained by other teams, we own every link in the ML pipeline. The models we produce are used for ranking in some recommendation modules and some advertising modules.



Before the sharing begins, it is necessary to clarify its scope. As shown in the figure above, this talk does not cover those topics; students who need them can refer to other excellent shares on CSDN.



The topics above are all covered in this talk, with "process orientation" as the focus. The intended audience is as follows:



No matter what stage of building a machine learning system you are at, if you gain something or find some inspiration in this talk, then the slides the author made while suffering from "post-holiday syndrome" will not have been in vain...



Here is the outline: first, a brief discussion of my understanding of "small teams"; then the main part, sharing Dangdang's small-team machine learning practice; next, a summary of the pits we stepped into in practice and what we learned from them; then some views on future work and possible directions, illustrated with a few references; and finally, a Q&A session.

A Brief Introduction to Small Teams

First, let me talk about my understanding of small teams.



Why do small teams appear? The question may sound like nonsense at first, since every team grows from small to large. That is true, but machine learning teams have some peculiarities of their own.

Compared with purely functional systems, one characteristic of machine learning systems is uncertainty: the system's effect cannot be quantified from the start. This makes decision-makers more cautious about investment, so they do not commit too many people at the very beginning.
Talent in this area is genuinely scarce and hard to recruit. There are plenty of beautiful resumes, but very few candidates with real ability or experience. On the principle of "better fewer, but better", a small and capable team is the wiser choice.


Where are the challenges when a small team builds a system? This is our primary concern. The essence of the small-team challenge comes down to two words: few people. Several specific challenges arise from this fundamental limitation.

The first is the high demand on individual capability. This is easy to understand: fewer people means everyone has to play a bigger role, so the bar for each person is high. There are not many good solutions here beyond external recruiting and internal training.
Second, during system development everyone generally has to own several tasks at once, which challenges not only individual ability but also collaboration. On the other hand, this is also the best possible training, letting everyone grow at the fastest speed.
Third is the choice of direction and requirements. With few people, you must be very careful when deciding your next move and minimize unproductive work. This is a real constraint at times, but from another angle it "forces" us to focus on the most important parts: good steel goes on the blade's edge.
The last point is higher single-point risk. Since each person owns more of the system, a resignation or vacation has a larger impact. This too is addressed mainly through internal training and external recruiting, though retaining people with challenging work also helps; which method works best depends on the specific circumstances.
Seen this way, the challenges facing a small team are not small, but small teams also have some unique advantages.



The first is cohesion, a natural advantage of any small team.
The second is ease of collaboration: many things need no meeting at all; turn around, exchange a few words, and it is settled.
The third is iteration speed. Since everything in the process is owned by a few people, there is little need to coordinate resources, so as long as those few work hard, iteration is fast.
Last but not least is team growth. Because everyone is responsible for many things, people naturally grow quickly and gain a strong sense of achievement. Managed well, the whole team stays in a dynamic, positive state.
Dangdang Recommendation's Machine Learning Practice

Now let's take some time to share how Dangdang's machine learning team pried up the "big rock" that is a machine learning system.



The figure above shows the overall architecture of our recommendation backend. As the diagram shows, the machine learning system exists as a subsystem and interacts directly with the recommendation job platform (the offline job platform that generates the recommendation results).

These architecture diagrams are only meant to show the position and role of the machine learning system within the whole recommendation system; they are not the focus of this talk, and there is no need to study them in detail.



The architecture diagram on this page zooms in on the red box from the previous page. As the figure shows, the machine learning system plays its role in ranking the results. I will not expand on the details of this architecture here; interested students can refer to the share from the Meituan students a while ago, which describes a fairly similar architecture.



The diagram above further expands the red box from the previous page: it is the architecture of the machine learning system itself. Experienced students will recognize that it includes the main process components of a machine learning system.

Next we will discuss how this system was built and what the process looked like. The initial stage is exploration. The point of this stage is to find out whether your problem is actually one suited to being solved with ML techniques.

Machine learning is powerful but not omnipotent; in some domains that demand strong human priors, it may not be the most suitable solution, and it is especially unsuited as a system's bootstrap solution. At this stage, the tools we used were R and Python.



In the figure on the right of the previous page, the parts framed in red can be handled with R, the parts framed in blue are better suited to Python, and the parts framed in green suit both.

Why choose R and Python?

Let's start with R.

R is versatile; it can fairly be called the Swiss Army knife of the data science world.
R has been popular for many years and is a mature tool, so when you hit a problem, solutions are easy to find.
At the time (2013), sklearn and its kind were not yet polished or easy to use, and solutions to problems were harder to find.

Now Python.

Python's development efficiency is high, making it well suited to rapid development and iteration, though engineering quality needs attention.
Python's text-processing capabilities are strong, suiting it to text-related features.
Python integrates well with computing platforms such as Hadoop and Spark, so it scales as data volumes grow.
That said, the R part could by now be replaced entirely by Python, since toolkits such as sklearn, Pandas, and Theano have matured considerably.
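To make the exploration stage concrete, here is a minimal sketch of the kind of quick feasibility check described above, using Pandas and sklearn. It is illustrative only, not Dangdang's actual code; the file name and the "clicked" label column are hypothetical.

```python
# Exploration-stage sketch (hypothetical data): can a simple model learn
# anything useful from a small, memory-sized sample of impressions?
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample_of_impressions.csv")  # small sample, fits in memory
X = df.drop(columns=["clicked"])               # assumed numeric features
y = df["clicked"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"held-out AUC = {auc:.3f}")  # clearly above 0.5 suggests ML is viable
```

If even a plain logistic regression separates clicks from non-clicks on a sample, the problem is probably worth a real ML investment; if not, stronger human priors or rules may be the better starting point.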

After the initial exploration stage, however, once the system reaches large data volumes, R is no longer suitable, for two main reasons: the limited amount of data it can handle and its slow processing speed.

First, plain R is single-machine only and must load all data into memory, an obvious obstacle to big-data processing. Some newer technologies may alleviate this, but we have not tried them.
Second, its computation is relatively slow, again referring to speed at large data volumes.
Therefore, as the left side of the architecture diagram shows, once the large-data stage is reached, tools represented by Hadoop and Spark take the stage and become the main tools.
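For illustration, here is a minimal PySpark sketch of the same kind of training job at scale, under assumed conditions (a hypothetical HDFS path, all-numeric feature columns, a "clicked" label). It is not the team's actual job, only a sketch of why Python pairs well with these platforms: the data no longer has to fit in one machine's memory.

```python
# Minimal PySpark MLlib sketch (hypothetical paths and columns).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ctr-training").getOrCreate()

df = spark.read.parquet("hdfs:///warehouse/impressions/")  # hypothetical path

# Assemble all non-label columns (assumed numeric) into a feature vector.
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "clicked"],
    outputCol="features")

train = assembler.transform(df).withColumnRenamed("clicked", "label")
model = LogisticRegression(maxIter=20).fit(train)
print(model.summary.areaUnderROC)  # training-set AUC from the fit summary
```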

After the initial exploration and verification stage, it is time to enter engineering iteration.



Shown above is the typical development process we followed.

After verification passes comes the next important link, which I call "full-process construction": building the ML system together with its downstream consumers into one complete, working pipeline.

The emphasis here is on "complete": not only must the model-related links such as samples, features, and training be built, but also the links that use the model afterwards, such as ranking and display. This point will come up again later.

If this is the first time the system is being built, "full-process construction" takes a long time. But this step is the cornerstone of all subsequent work, and the time and effort invested are well worth it.

Once this step is done, a system has in fact been constructed. Of course, it is a system with form but no spirit: each part may be completely unoptimized, and some parts may be mere skeletons without content.

After that, we entered the "infernal cycle" of optimization iteration: constantly finding points that can be optimized, trying various solutions, validating offline, and running an online A/B test when a change seems to meet the bar for going live. Once the system pipeline is built, we basically keep reincarnating through this cycle. (The original "Infernal Affairs" refers to the deepest of the eighteen layers of hell, implying endless cycles of suffering.)
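As a minimal illustration of the offline-validation gate in this loop, the sketch below uses a single AUC-gain threshold. The function name and threshold are invented for this example; a real "meets the bar for going live" judgment would weigh more metrics and human review.

```python
# Hypothetical offline gate before an online A/B test.
def should_go_to_ab_test(candidate_auc: float,
                         baseline_auc: float,
                         min_gain: float = 0.005) -> bool:
    """Promote a candidate model to online A/B only if it clearly beats
    the current baseline offline; otherwise keep iterating."""
    return candidate_auc - baseline_auc >= min_gain

if should_go_to_ab_test(candidate_auc=0.731, baseline_auc=0.724):
    print("schedule online A/B test")
else:
    print("back to feature/model iteration")
```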



This development process is very much like building a house: first lay the foundation, then put up the bare structure, then keep decorating and inspecting until you can move in. After living there a while, you may grow dissatisfied with something, or a newer and prettier style of decoration appears, so you redecorate, again and again. Until one day you strike it rich and want a new house altogether: that is the time for an overall rebuild and upgrade of the system.



The tools on this page are all common mainstream tools, except for dmilch.

dmilch (milch is German for milk; the name stands for Dangdang MachIne Learning toolCHain) is a set of feature-engineering tools we distilled through continuous iteration. It contains common utilities for feature processing, such as feature normalization, standardization, and calculation of common indices. It is similar in spirit to FeatureFu, which LinkedIn open-sourced a while ago: both aim to make feature processing convenient, just from different angles.
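dmilch itself is internal, so the following is only a hypothetical sketch of the kinds of small utilities such a feature-processing toolchain typically collects: min-max normalization, z-score standardization, and a smoothed common index such as CTR.

```python
# Hypothetical feature utilities, illustrating what a toolchain like
# dmilch might contain (not its actual code).
import math

def min_max_normalize(values):
    """Scale a feature to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Center a feature to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n) or 1.0
    return [(v - mean) / std for v in values]

def smoothed_ctr(clicks, impressions, alpha=1.0, beta=100.0):
    """Smoothed click-through rate, a typical 'common index' feature;
    alpha/beta are hypothetical prior pseudo-counts."""
    return (clicks + alpha) / (impressions + alpha + beta)
```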



This page introduces a few key points in our workflow. Small teams have a natural advantage here, so our central idea is "small steps, run fast".

The first key point is serializing changes. This may be unique to algorithmic systems like machine learning: if multiple improvements go out together, you sometimes cannot tell which factor actually made the difference, like a dose of traditional Chinese medicine where nobody knows which ingredient works, whereas we want to extract the real "artemisinin".

The second point is the project-advancement mechanism. We meet once or twice a week, mainly to verify the effect of improvements, discuss plans, and confirm the next steps on the spot.

Engineers do not actually like meetings, so why hold them every week? I think the most important purpose is to get everyone participating in the discussion, jointly responsible for the project, and growing together. The work is divided, but the discussion is not: everyone must have ideas and suggestions for the system. This also ensures everyone absorbs the areas they are less familiar with, which is better for growth.

Another unavoidable topic is trying new technology. Continuing the house-building metaphor, new technology is like fancy designer furniture: if you don't have a piece or two at home, you're embarrassed to greet your guests.

Our experience here: first understand and use existing technology thoroughly; it will not be too late to talk about new technology afterwards. For example, collaborative filtering in recommendation is usually computed over different dimensions of data such as purchases, browsing, comments, and favorites, to see which works better. Only when the value of the familiar technology has been "squeezed" dry is it time to try something new.

Another important point: other people's technology may not suit you. Companies differ in business scenarios, data scale, and data characteristics, so new technologies proposed elsewhere should be adopted with caution.

We once confidently tried a technique from a big international player, but repeated attempts brought no real gains, only added complexity. Later, after talking with peers, I found that nobody else had gotten good results either. So the foreign moon may only be rounder abroad; which technology to use depends on what seedlings suit the soil your system grows in.

Before ending this part, a brief note on the results after our models went live for recommendations and advertising: the click-through rate of the recommendation first screen rose 15%~20%; the ad click-through rate rose about 30%, and RPM rose about 20%. The effect is clearly significant.

The Pits We Stepped Into in Those Years

The next important part of today's sharing is the various pits we stepped into.

As the saying goes, past lessons, if not forgotten, are the teachers of the future, and the pits may be the most valuable part of any sharing. We stepped into plenty while building our system; here I will share a few of the bigger ones, hoping they help you. I will introduce the pits first, then talk about what we felt and gained while climbing out of them.

Seeing only the model, not the system

If we ranked the pits we stepped into, this one would come first, because if you fall into it, the very basis guiding your system's direction may be completely wrong.

Specifically: when first building the system, we paid attention almost exclusively to the quality of the machine learning model itself, to AUC and NE, but not to the model's final online effect. The consequence was a model that looked excellent on metrics yet had no effect at all once live. Because we ignored how the model would be used and kept "optimizing" the same model behind closed doors, the final effect naturally suffered.

What is the correct posture? In our experience, at the early stage of system construction you must clearly understand: what you are building is not a model, but a model-centric system. Knowing what happens after the model is produced, and how it will be used, is critical.

The model is the center of the system, but it is not the whole system. At every stage of design, development, and tuning, problems must be viewed from the system's perspective, never the model's alone. Otherwise, by the time you tune out a model with AUC = 0.99, you may look up and find you have drifted further and further from the system.

So when building a machine learning system, attend to both the model and the system. See only the model and you are likely to produce a "vase system": beautiful metrics, no actual effect.

Neglecting visual analysis tools

This problem is easy to ignore at first but makes you suffer later (this refers to non-deep-learning systems).

Because a machine learning system is to some extent a black box, our energy habitually concentrates on parameters and models, and we instinctively feel the model's inner workings need no attention. But our experience is: if you watch only the outside of the black box and ignore the inside entirely, then when the model underperforms it is hard to locate the problem, and when it performs well the success feels inexplicable, like the bathroom light or the TV suddenly switching itself on. It always leaves you uneasy.

We feel strongly about this. When we first built the system and found the effect poor, there was little methodology to help locate the problem. All you could do was swap features back and forth and change the sample processing; if the effect improved, fine, and if not, keep tossing.

Later we built a set of web pages displaying everything: each sample, the features and parameters of each case, how many times the sample appeared, its rank within the candidate set, and so on. It is like an anatomy of the whole system plus the model, exposing as many internal details as possible, which helps enormously in analyzing problems.

This system has helped us greatly. It may not count as a "systematic" method, but once so many things are laid out in front of you, you discover that some things differ from what you imagined, and you also discover things you would never have thought of. That is especially valuable for a somewhat-black-box system like machine learning. To this day it is something we rely on heavily every time we verify an effect; it can be called our second pair of eyes.
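As a rough illustration of that "anatomy" view, here is a hedged sketch of a per-sample report for a linear model. The field names are invented and the real tool was a set of web pages; the point is that for a linear model, showing each feature's weight-times-value contribution makes it obvious which features drive a score.

```python
# Hypothetical per-sample diagnostic report for a linear model.
def explain_sample(sample_features: dict,
                   model_weights: dict,
                   candidate_rank: int,
                   seen_count: int) -> dict:
    # Per-feature contribution to the final score: weight * value.
    contributions = {
        name: value * model_weights.get(name, 0.0)
        for name, value in sample_features.items()
    }
    return {
        "rank_in_candidate_set": candidate_rank,
        "times_seen_in_samples": seen_count,
        "score": sum(contributions.values()),
        # The ten features that move the score the most, either direction.
        "top_contributions": sorted(
            contributions.items(), key=lambda kv: -abs(kv[1]))[:10],
    }
```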

Too much reliance on algorithms

Many students have probably hit this pit. An example: we once faced a text-processing problem, filtering out a large number of irrelevant and useless words. At first we threw all sorts of algorithms and tuning at it, but for a long time could not get satisfactory results.

Finally we played our trump card: manual filtering. Concretely, three people spent three days going through the text by hand (thousands to tens of thousands of words), and the effect was immediate. There may well have been a better algorithm for the problem, but from a system and engineering perspective, the overall ROI of the manual approach was highest.
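Here is a sketch of what such a human-in-the-loop workflow might look like (file formats assumed, not the team's actual scripts): dump candidate words sorted by frequency for reviewers, then apply the hand-curated stoplist. The filtering judgment itself stays human; the code merely makes the manual pass efficient.

```python
# Hypothetical human-in-the-loop word filtering workflow.
from collections import Counter

def dump_candidates(tokenized_docs, path="candidates.txt"):
    """Write all words, most frequent first, for manual review."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    with open(path, "w", encoding="utf-8") as f:
        for word, n in counts.most_common():
            f.write(f"{word}\t{n}\n")  # reviewers mark the useless words

def apply_stoplist(tokenized_docs, stoplist_path="useless_words.txt"):
    """Remove the hand-curated useless words from every document."""
    with open(stoplist_path, encoding="utf-8") as f:
        stop = {line.split("\t")[0].strip() for line in f if line.strip()}
    return [[w for w in doc if w not in stop] for doc in tokenized_docs]
```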

So although machine learning is an algorithm-centered system, thinking must not become rigid, with everything solved by algorithms. In some places, "millet plus rifles" — simple, even crude means — is the more appropriate choice.

Key processes and data not controlled by your own team

This pit is not easy to spot, and in the early stage of a system it is especially well hidden. We too discovered it only after taking some losses.

In many companies, front-end display, log collection, and similar work are handled by dedicated teams, and teams like recommendations and ads simply consume the output. The benefit is obvious: the machine learning team can focus on its own job. The downside is that the collected data is not always what we expect.

For example: the exposure data we used at first was produced for us by a sibling team, but after taking it we found it inconsistent with other data, and it took a long time to find the problem. This directly affects whether our samples are correct, so the impact on us was large.

What caused it? Not carelessness on the sibling team's part; rather, they did not fully understand our requirements for the data, and they were not the ones using it, so its quality was inherently at risk. After taking this loss, we now do this part of the work ourselves, so we can monitor end to end whether the data is correct, and fix problems internally without coordinating across teams.

The pit of a team that is not "full stack"

This is a relatively complicated pit. In the previous pit I mentioned that we found a data quality problem and then took over the exposure-collection work ourselves. But locating the cause and taking over did not happen as soon as the data problem appeared. The reason was simple and brutal: our group had no front-end talent at the time.

The exposure problem involves a chain of actions from the browser to the backend system, and the front end is the first link in that chain. But when we assembled the machine learning team, we never imagined needing a front-end engineer; we thought back-end plus modeling people would be enough. So we were fairly powerless before this problem. Only when a colleague with rich front-end experience joined our group did we locate the problem and decide to take the work over.

The lesson: be more careful when building a team, and look at it from a more systemic perspective. You cannot just say "we only recruit algorithm engineers to do machine learning"; that creates team-level blind spots and buries future problems.

That said, some problems are hard to foresee before you hit them, so this pit really is complicated.

The giant system

The last pit, of course, has to be a big one. I call it the "giant system".

What is a giant system? Simply put, it is building the whole thing as one single system rather than splitting it into multiple subsystems. Building it as one system means the internal modules are highly coupled and strongly correlated: samples, features, training, prediction, and so on are all glued together and cannot be pulled apart. What are the consequences?

A direct example: our first system version took a week to go live, its maintenance was quite difficult, and changing anything was very hard. Why? My reflection: when learning the theory, it is natural to treat samples, features, and training as one pipelined set of things, and that mindset maps directly onto the system as a monolith. With a dozen features and a few hundred samples, there may be no problem; when features grow to millions and samples to tens of millions, you need to think hard about whether your system is getting out of control.

What is the better way? Our later solution: "big system, small works". The phrase is not mine; the WeChat team used it when describing the architecture of their red-envelope-grabbing system after the Spring Festival this year (or last year). I think it is well distilled and agree with it completely. It means that although your system is large and complex, you must still separate it into modules, which benefits development, extension, and maintenance.

The peculiarity of machine learning systems is that at the beginning you may use very few features, so one system seems fine; it then grows huge without your noticing, and if you watch only the model, it is easy to end up with an unmaintainable giant system.
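As a sketch of what "big system, small works" can mean in code (interface names are illustrative, not Dangdang's actual design): each stage sits behind a narrow interface, so samples, features, training, and ranking can evolve, be replaced, and be scaled independently instead of being glued together.

```python
# Hypothetical stage interfaces for a decomposed ML pipeline.
from abc import ABC, abstractmethod

class SampleBuilder(ABC):
    @abstractmethod
    def build(self, raw_logs): ...            # raw logs -> labeled samples

class FeatureExtractor(ABC):
    @abstractmethod
    def extract(self, samples): ...           # samples -> (features, labels)

class Trainer(ABC):
    @abstractmethod
    def train(self, features, labels): ...    # features -> model artifact

class Ranker(ABC):
    @abstractmethod
    def rank(self, model, candidates): ...    # model + candidates -> order

# The pipeline touches only the interfaces, never the internals, so any
# stage can be swapped or scaled out without rewriting the others.
def run_pipeline(builder, extractor, trainer, ranker, raw_logs, candidates):
    samples = builder.build(raw_logs)
    features, labels = extractor.extract(samples)
    model = trainer.train(features, labels)
    return ranker.rank(model, candidates)
```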

The Long March Has Only Just Begun

Having climbed through all those pits, our team can be said to have built a system, but that is only the first step of the Long March. A machine learning system has many complexities that differ from traditional software systems, and many challenges remain. Here I briefly describe these complexities and challenges through two references; students interested in a deeper understanding can read the originals in detail.



The first is a paper from Google about technical debt in machine learning, with an amusing title: "Machine Learning: The High Interest Credit Card of Technical Debt".

Its main point is that machine learning systems are very complicated to build, and if you are inexperienced or careless, you easily "take on debt" in many dimensions. The debt may not hurt much at first, but because the "interest" is high, repaying it later is painful.

The picture above shows several specific dimensions of this technical debt, which I organized after reading the paper. These dimensions match our own practice closely; by the time I finished reading, my copy was covered in arrows and notes.

For example, the "blurred subsystem boundaries" at the upper right of the figure is similar to the "giant system" I described earlier: no internal division within the system.

Another example is the "system-level spaghetti" at the lower right. Spaghetti code refers to a tangled mess of code. Since machine learning systems are generally built while exploring, rather than fully designed up front like other systems, spaghetti code arises easily.

If these dimensions are considered before building the system, its development, upgrades, and maintenance become much easier. I believe these lessons were also distilled by a giant like Google from many pits of its own. If even the giants are like this, it is naturally not easy for the rest of us.



The next reference is a tutorial given at ICML 2015 by Léon Bottou of Facebook, famed for his work on SGD. The title is "Two big challenges in machine learning"; it is a fairly systematic and practical piece about two new challenges facing machine learning.

The first point sounds alarming: machine learning breaks software engineering. But on reflection it is true. The development of machine learning systems is largely exploratory and incremental, very different from traditional software engineering, and this challenges system developers. I think it is quite likely that a dedicated "machine learning system architect" role will appear in the future.

The second point is that current experimental methodology is also reaching its limits. At first glance this sounds like a pure-science concern, but it is not. Since the development of machine learning systems is exploratory, experiments are constantly run during development to verify effects, and this overall methodological framework itself needs careful design. Clearly, in Bottou's view, none of the current methods are adequate.
