The Big Data Secret Behind Youku

In this article, Men Deliang, a technical expert on Youku's data middle platform, shares the business and platform value Youku gained after migrating from Hadoop to Alibaba Cloud MaxCompute.

This article is compiled from the video of his presentation and the accompanying slides.

Hello everyone, I am Men Deliang, and I work on the data middle platform at Youku. I am honored to have personally witnessed Youku's journey from Hadoop to MaxCompute: I joined Youku almost five years ago, and those five years happen to span exactly this upgrade. This chart shows Youku's development from May 2016 to May 2019, with compute resources on top and storage resources below. The number of users and the number of data tables both grew essentially exponentially, yet after Youku completed the migration from Hadoop to MaxCompute in May 2017, compute consumption and storage consumption actually declined. The migration as a whole delivered very large gains.

Let me first describe the characteristics of Youku's business.

The first characteristic is the diversity of the platform's users. It is not only data engineers and other technical staff who use the big data platform, but also BI analysts, testers, and even product and operations staff.

The second characteristic is business complexity. Youku is a video website with very complex business scenarios. On the data side, besides ordinary page-browsing data there are player-related and performance-related data. On the business side there are live streaming, membership, advertising, large-screen displays, and other very different scenarios.

The third characteristic is the sheer volume of data: logs reach the hundred-billion level per day, which is a very large amount of data, and on top of it we run very complex computations.

The fourth characteristic is more interesting: whether at small companies or large ones, cost awareness is very high. Youku operates under a very strict budget, within Alibaba Group's very strict budget system, yet we also frequently run major campaigns, such as the Double Eleven battle, our World Cup campaign in the summer, and various Spring Festival campaigns. In such cases, the demand for elastic computing resources is very high.

Based on these business characteristics, I have summarized several features of MaxCompute that support our business particularly well.

First, it is easy to use.
Second, it has a complete ecosystem.
Third, its performance is very powerful.
Fourth, resource usage is very elastic.

The first feature: ease of use. MaxCompute provides a very complete toolchain, covering data development, operations and maintenance, data integration, data quality monitoring, the data map, and data security. After Youku moved from Hadoop to MaxCompute, our biggest gain is that we no longer have to maintain the cluster in the middle of the night, and we no longer have to schedule tasks far in advance: before, when a requirement came in, I might have to queue it weeks out; now I can tell the requester that I will simply run it and the result will come out. BI analysts also used to log in to a client, write scripts, and manage their own scheduling, and I often had to explain why today's numbers were not out yet; executives might not see their numbers until noon. Now essentially all important data is produced by 7 a.m., and basic data requirements can be fulfilled by analysts or product staff themselves, without routing every data request to us.

The second feature: a complete ecosystem. In 2017 Youku was based entirely on the Hadoop ecosystem; after moving to MaxCompute, everything is based on the serverless big data services provided by Alibaba Cloud. Every open source component has a counterpart in the MaxCompute ecosystem, and the counterparts are better and easier to use. In the middle of the chart is MaxCompute. On the left, it depends on MySQL, HBase, ES, and Redis, which are synchronized bidirectionally through the sync center. On the right are resource management, resource monitoring, and data monitoring, including data assets and data standards. At the bottom is data input, including the Group's collection tools; at the top, DataWorks (including a number of command-line tools) is provided for developers, and Quick BI and data services are provided for BI staff.

The third feature: powerful performance. MaxCompute supports Youku's EB-scale data storage, analysis over hundreds of billions of data samples, reports over hundreds of billions of rows, and concurrency on the order of 100,000 task instances. Back when we maintained Hadoop ourselves, these numbers were unthinkable.

The fourth feature: elastic resource usage. When we migrated in 2016, Youku's Hadoop cluster had grown past a thousand machines, which was already a fairly large scale. We ran into many problems at the time, such as NameNode memory pressure and a machine room that could not be expanded, plus various painful operations and management issues. We kept asking operations for more resources, and operations kept telling us how much we had already spent. The real problem we faced was how to use computing resources on demand: at night there were many jobs, but in the afternoon the whole cluster sat idle, which was a waste. MaxCompute solves this problem perfectly.

First, it uses pay-as-you-go billing. Instead of charging you for a fixed number of machines, it charges only for the resources you actually use. On cost alone, compared with maintaining our own cluster, this can cut the bill roughly in half (down about 50%).
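
The effect of pay-as-you-go billing can be sketched with some back-of-the-envelope arithmetic (the numbers below are hypothetical, chosen only to illustrate the idea; they are not Youku's actual costs):

```python
# Illustrative only: hypothetical numbers, not Youku's actual costs.
# Compare a reserved cluster (billed for every hour, used or not)
# with pay-as-you-go billing (billed only for hours actually used).

HOURS_PER_DAY = 24
CLUSTER_HOURLY_COST = 100.0     # cost of keeping the cluster up for one hour

# Hypothetical utilization: busy overnight batch window, idle afternoons.
busy_hours = 12                 # hours/day the cluster does real work

reserved_daily = HOURS_PER_DAY * CLUSTER_HOURLY_COST
pay_per_use_daily = busy_hours * CLUSTER_HOURLY_COST

savings = 1 - pay_per_use_daily / reserved_daily
print(f"reserved: {reserved_daily:.0f}, pay-per-use: {pay_per_use_daily:.0f}, "
      f"savings: {savings:.0%}")   # 50% when the cluster is idle half the day
```

The "half-cut" figure falls out directly whenever the cluster is idle about half the time, which matches the busy-at-night, empty-in-the-afternoon pattern described above.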

Second, MaxCompute computing resources can be time-shared. For example, the production queue is given more resources in the early morning to make sure reports come out as soon as possible, while during the day more resources go to the development queues, so that analysts and developers running ad hoc jobs have a smoother experience.
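
As a minimal sketch of the idea (this is not MaxCompute's actual quota API, just an illustration of time-shared quotas):

```python
# A minimal sketch of time-shared resource quotas (illustrative, not
# MaxCompute's actual quota mechanism): production gets most capacity in
# the early-morning batch window, development queues get more during
# working hours.

def quota_split(hour: int) -> dict:
    """Return the share of total compute given to each queue at `hour`."""
    if 0 <= hour < 7:                 # overnight: prioritize report jobs
        return {"production": 0.8, "development": 0.2}
    if 9 <= hour < 19:                # working hours: ad hoc analysis
        return {"production": 0.4, "development": 0.6}
    return {"production": 0.6, "development": 0.4}

for h in (2, 14):
    print(h, quota_split(h))
```

The exact splits and time windows here are assumptions; the point is that capacity follows the daily workload pattern instead of sitting fixed.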

Third, MaxCompute can scale up rapidly. For example, when a sudden, urgent business demand arrives and we find that jobs cannot finish, computing resources are insufficient, and all queues are blocked, we can simply ask operations for help: a one-command scale-up that takes a couple of seconds to type. After that, the whole backlog is quickly digested.


Having explained why Youku adopted MaxCompute, let me describe some typical Youku business scenarios and applications. This diagram shows a technical architecture that is typical of Youku, and probably of much of Alibaba Group today. MaxCompute sits at the core in the middle, with inputs mainly on the left and outputs on the right. The green line is the real-time link: data sources, whether DB changes or server-side logs, are fed through TT and DataHub into MaxCompute for analysis. The currently very popular Flink handles real-time computation on this link.

Besides the real-time link, there is DB synchronization: data is synced from the DB to MaxCompute daily or hourly, and result data can be synced back out to HBase and MySQL. A unified service layer then provides services to applications. At the bottom, the PAI machine learning platform runs training, and the training results are delivered to algorithm applications through OSS.

This next diagram is the layered data warehouse architecture that is fairly standard in the industry. In our data middle platform, all data flows upward from the ods layer through the cdm layer to the ads layer, being refined level by level; at the top, we provide diversified services through interface services, file services, and SQL services. On top of those, we build internal data products for executives and operations staff, as well as external ones, such as the play-count and popularity data shown in products like the Youku index.
This slide shows two classic cases since we moved from Hadoop to the MaxCompute platform. The first: by connecting data across different scenarios, we let users flow between those scenarios, enhancing business value.

The second is internal: traffic exchange between Youku and other BUs within the Group. Through unified labels we amplify samples, directing Youku traffic to other BUs and other BUs' traffic to Youku, achieving a win-win.

This next picture shows an issue most Internet companies cannot avoid: anti-cheating. This is an anti-cheating architecture we built on MaxCompute. We extract features from the raw data, then use algorithmic models, including machine learning, deep learning, and graph models, to support traffic anti-cheating, channel anti-cheating, and so on. Monitoring tools watch the business scenarios for cheating; the flagged cases are labeled as black or white samples, and those features together with the labeled samples are used to iteratively optimize the algorithm models. At the same time, we evaluate the models, continuously improving the anti-cheating system.
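
The feedback loop described above can be illustrated with a toy sketch (the feature, the threshold rule, and all the numbers are made up for illustration; Youku's real system uses machine learning, deep learning, and graph models):

```python
# Toy sketch of the anti-cheating feedback loop (illustrative only):
# extract a feature from raw traffic, score it against a decision
# threshold, and refine the threshold from newly labeled black/white
# samples produced by monitoring.

def extract_feature(events_per_minute: float) -> float:
    # Hypothetical single feature: request rate per minute.
    return events_per_minute

def fit_threshold(white: list, black: list) -> float:
    # Midpoint between the means of normal (white) and cheating (black)
    # samples -- a stand-in for real model training.
    return (sum(white) / len(white) + sum(black) / len(black)) / 2

white_samples = [5.0, 8.0, 6.0]     # labeled normal traffic rates
black_samples = [90.0, 120.0]       # labeled cheating traffic rates

threshold = fit_threshold(white_samples, black_samples)

def is_cheating(rate: float) -> bool:
    return extract_feature(rate) > threshold

# Monitoring flags a new case; after review it is labeled black and fed
# back into the sample pool, and the model is refit -- the iteration step.
black_samples.append(60.0)
threshold = fit_threshold(white_samples, black_samples)
print(threshold, is_cheating(70.0))
```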

Finally, back to cost. In daily use there will always be novice users, new users who use the platform incorrectly, or users who simply do not care about resource consumption. For example, a non-technical student or an intern, or an analyst, may write a single SQL query with very high consumption. That is a huge waste of resources, and one such task can leave everyone else's tasks waiting in the queue. So we carry out governance of the overall resources.

We govern big data with big data itself. From table and node metadata we can compute which tables have not been read for how many days and which have only a short access span, and take them offline or otherwise govern them. For business scenarios that are not very important or not time-sensitive, such as some algorithm training, we schedule them off-peak to keep the overall water level from getting too high. From the perspective of MaxCompute tasks, we can compute which tasks have data skew, which data involves near-duplicate computation, which tasks should use MapJoin, and which tasks should prune columns to save IO. We can also find which tasks do brute-force scans, sweeping a month or even a year of data, and which data suffers expansion, for example from complex computations like CUBE or from iterative algorithm models. Using these computed signals, we push back to the users, asking them to improve the quality of their data and tasks, and thereby reduce overall computing resource usage.
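
The table-governance part of this can be sketched as a metadata scan (the metadata fields and the 90-day threshold below are assumptions for illustration, not the actual MaxCompute meta schema or Youku's real policy):

```python
# Sketch of "governing big data with big data": scan table metadata and
# flag tables that have not been read recently as offline candidates.
# Field names, dates, and the 90-day policy are all hypothetical.

from datetime import date

TODAY = date(2019, 5, 1)
STALE_DAYS = 90                      # assumed governance threshold

tables = [
    {"name": "ods_play_log",  "last_read": date(2019, 4, 28), "size_gb": 80_000},
    {"name": "tmp_abc_bak",   "last_read": date(2018, 11, 3), "size_gb": 500},
    {"name": "ads_kpi_daily", "last_read": date(2019, 4, 30), "size_gb": 120},
]

def offline_candidates(tables):
    """Return (name, idle_days, size_gb) for tables unread past the threshold."""
    out = []
    for t in tables:
        idle = (TODAY - t["last_read"]).days
        if idle > STALE_DAYS:
            out.append((t["name"], idle, t["size_gb"]))
    return out

for name, idle, size in offline_candidates(tables):
    print(f"{name}: unread for {idle} days, {size} GB reclaimable")
```

In practice the same scan would feed the push-back reports sent to table owners rather than deleting anything automatically.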

From the computing platform's perspective, we also keep adopting the very sophisticated optimizations that MaxCompute introduces, such as HBO, Hash Clustering, and AliORC. HBO is optimization based on execution history. It avoids the situation where a user who does not know how to tune parameters sets one parameter especially large just to make their own task a little faster, which is very wasteful for the overall resource pool. With HBO, users no longer need to tune parameters; the cluster adjusts them automatically, and users only have to write their business logic.
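
The spirit of HBO, choosing parameters from a task's execution history instead of letting users hand-tune them, can be sketched like this (the stats record and the selection rule are made up for illustration; this is not how MaxCompute implements HBO internally):

```python
# Sketch of history-based optimization: pick a task's parallelism from
# its past runs instead of user-set parameters. The history record and
# the 10%-gain rule below are hypothetical.

history = {
    # task_name: list of (parallelism, runtime_seconds) from past runs
    "dws_play_agg": [(100, 900), (200, 480), (400, 470)],
}

DEFAULT_PARALLELISM = 100

def tuned_parallelism(task: str, min_gain: float = 0.1) -> int:
    """Pick the smallest parallelism whose runtime is within `min_gain`
    of the best observed runtime, so slots aren't wasted for tiny gains."""
    runs = sorted(history.get(task, []))
    if not runs:
        return DEFAULT_PARALLELISM
    best = min(rt for _, rt in runs)
    for p, rt in runs:
        if rt <= best * (1 + min_gain):
            return p
    return runs[-1][0]

print(tuned_parallelism("dws_play_agg"))   # 200: 480s is within 10% of 470s
```

A user who manually set parallelism to 400 would get almost no speedup over 200 while consuming twice the slots; picking from history avoids exactly that waste.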

The second item, Hash Clustering, was introduced in roughly the last two years. With Hadoop it often happened that a join between two large tables simply could not be computed; Hash Clustering is an optimization tool for exactly that case. A join between a large table and a small table can be optimized by distributing the small table. A join between two large tables, however, involves sorting. With Hash Clustering, the data is arranged in advance, which saves many intermediate computing steps and improves efficiency.
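
The idea behind hash clustering can be illustrated with a toy example (this shows the principle only, not MaxCompute's implementation): both tables are bucketed by a hash of the join key when they are written, so the join only has to match rows bucket by bucket instead of shuffling and sorting both large tables at query time.

```python
# Toy illustration of the hash-cluster idea: pre-bucket both tables by
# a hash of the join key at write time, then join matching buckets only.

N_BUCKETS = 4

def bucket(rows, key):
    """Distribute rows into N_BUCKETS by hash of the join key."""
    buckets = [[] for _ in range(N_BUCKETS)]
    for row in rows:
        buckets[hash(row[key]) % N_BUCKETS].append(row)
    return buckets

# "Pre-arranged" once, as if at write time (hypothetical sample data).
users = bucket([{"uid": u, "name": f"u{u}"} for u in range(8)], "uid")
plays = bucket([{"uid": v % 8, "vid": v} for v in range(16)], "uid")

def bucket_join(a_buckets, b_buckets, key):
    """Join two pre-bucketed tables; only matching buckets are compared."""
    joined = []
    for a, b in zip(a_buckets, b_buckets):
        index = {}
        for row in a:
            index.setdefault(row[key], []).append(row)
        for row in b:
            for match in index.get(row[key], []):
                joined.append({**match, **row})
    return joined

result = bucket_join(users, plays, "uid")
print(len(result))
```

Because rows with the same key are guaranteed to land in the same bucket in both tables, the expensive global shuffle-and-sort step disappears from the join itself.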

Third, AliORC: in certain fixed scenarios, it delivers a stable 20% improvement in computational efficiency.

Fourth, Session. Relatively small data can be put directly into memory or SSD cache, which is very friendly to scenarios with, say, 100 downstream leaf nodes, because results come back with second-level low latency. Youku also uses Lightning for compute acceleration; this is an optimization at the computing architecture level, based on an MPP architecture.

The last topic is storage optimization. Some key raw data, and data that must be retained for audits, cannot be deleted, ever permanently. As a result, while data storage keeps trending upward, computation reaches a balance at some point: with a given amount of compute resources, as old business logic is retired and new business logic is added, compute stays fluctuating around a relatively steady level. Storage, however, keeps growing because some historical data is never deleted, possibly even exponentially. So we keep a continuous focus on storage, for which we have four main instruments.

The first is, again, governing big data with big data: examining which tables are unused or have only a short access span, and optimizing their life cycles to control growth. This also includes AliORC, mentioned above, which compresses the data; in addition, we split large fields to increase the compression ratio.

OK, these are some of Youku's scenarios on MaxCompute. Thank you for listening.

Source: yq.aliyun.com/articles/705113