Big Data: New Opportunities for Distributed Machine Learning


A New Era

Origin

Distributed machine learning rose together with the concept of "big data." Before the big-data era, there was already a great deal of work on making machine learning algorithms faster by exploiting multiple processors. Such work is usually called "parallel computing" or "parallel machine learning"; its core goal is to decompose a computing task into many smaller tasks and assign them to multiple processors for computation.
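The core idea of parallel computing described above, breaking one computation into smaller tasks for multiple processors, can be sketched as follows. This is a minimal illustration using Python's standard library; the function names and the sum-of-squares workload are made-up examples, not part of any particular framework:

```python
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    # One "smaller task": a worker processes only its own slice of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # Step 1: decompose the job into n_workers smaller tasks.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    # Step 2: assign the tasks to a pool of worker processes.
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)
    # Step 3: combine the partial results into the final answer.
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

The same decompose-compute-combine pattern underlies much of parallel machine learning, for example computing gradients over data partitions and then summing them.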

Beyond distributing the computation across multiple processors, distributed computing and distributed machine learning must, more importantly, distribute the data (including training data and intermediate results). In the big-data era, a single machine often cannot hold all the data, and even when it can, the machine is limited by the bandwidth of its I/O channels, making access slow. For greater storage capacity, throughput, and fault tolerance, we want the data spread across multiple machines.
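A common way to spread data across machines is hash partitioning: each record's key is hashed, and the hash determines which machine stores it. A minimal sketch, where the machine count and the (key, value) record format are illustrative assumptions:

```python
import hashlib

def shard_id(record_key, n_machines):
    # Hash the key so records spread roughly evenly across machines
    # and the same key always lands on the same machine.
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_machines

def partition(records, n_machines):
    # Group (key, value) records by the machine that should store them.
    shards = {i: [] for i in range(n_machines)}
    for key, value in records:
        shards[shard_id(key, n_machines)].append((key, value))
    return shards

logs = [("user42", "click"), ("user7", "view"), ("user42", "buy")]
shards = partition(logs, n_machines=4)
```

Real systems add replication on top of this so that data survives machine failures, which is the fault tolerance mentioned above.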

So what kind of data cannot fit on the hard disks of a single machine, or even hundreds of machines? Bear in mind that many servers carry terabytes of disk space. In fact, there is plenty of such big data. For example, a search engine crawls a vast number of web pages, whose content it must analyze and index. How many pages? The number is hard to estimate, because it changes over time.

Before Web 2.0 appeared, the number of web pages worldwide grew relatively steadily, because pages were edited by professionals. Then all kinds of Web 2.0 tools began helping users create their own pages, such as blogs and even microblogs, and the number of pages started increasing at an exponential rate.

Another typical kind of big data is user behavior data on e-commerce websites. On Amazon or Taobao, for example, many users view many recommended products every day and click on some of them. The users' clicks on recommended products are logged by Amazon's and Taobao's servers and serve as input to a distributed machine learning system. The output is a mathematical model that predicts which products a user likes to see, so that the next round of recommendations can show more of what each user prefers.
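A model like this is typically trained on labeled examples derived from such logs: each (user, shown product) impression becomes one example, labeled 1 if the user clicked it and 0 otherwise. A minimal sketch of that log-to-training-data step, with made-up field names:

```python
def build_examples(impressions, clicks):
    # impressions: (user_id, product_id) pairs that were shown.
    # clicks: set of (user_id, product_id) pairs that were clicked.
    examples = []
    for user_id, product_id in impressions:
        label = 1 if (user_id, product_id) in clicks else 0
        examples.append({"user": user_id, "product": product_id, "label": label})
    return examples

shown = [("u1", "p9"), ("u1", "p3"), ("u2", "p9")]
clicked = {("u1", "p3")}
data = build_examples(shown, clicked)
```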

Similarly, in Internet advertising systems, the ads shown to users and the ads users click are also logged as machine learning data for training click-through-rate (CTR) prediction models. The next time recommended products are displayed, these models estimate, for each item, the probability that it will be clicked if shown. Items with higher estimated click-through rates are placed ahead of items with lower ones, and in practice they do achieve relatively high click-through rates.
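Scoring and ranking by estimated click-through rate can be sketched with a logistic model: the estimated probability of a click is the sigmoid of a weighted feature sum, and items are shown in descending order of that probability. The weights and feature names below are toy values standing in for a trained model, not real parameters:

```python
import math

def predict_ctr(weights, features):
    # Logistic model: estimated click probability = sigmoid(w . x).
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_by_ctr(weights, candidates):
    # Display items with the highest estimated click-through rate first.
    return sorted(candidates,
                  key=lambda item: predict_ctr(weights, item["features"]),
                  reverse=True)

w = {"is_electronics": 1.2, "price_high": -0.8}  # toy "trained" weights
items = [
    {"id": "p1", "features": {"is_electronics": 1.0}},
    {"id": "p2", "features": {"price_high": 1.0}},
]
ranked = rank_by_ctr(w, items)  # p1's estimated CTR exceeds p2's
```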

From the examples above, we can see why this data is so big: it records the behavior of billions of Internet users. People generate such behavior every day, so companies like Baidu, Alibaba, Tencent, Qihoo, and Sogou collect far more data each day than a hard disk can hold, and the volume keeps growing over time without end. Although people differ on the precise definition of "big data," Internet user behavior data is without doubt universally recognized as big data.

Value

Machine learning has been applied for a long time. You may remember that over a decade ago, IBM introduced the speech recognition and input system ViaVoice. The system's acoustic model and language model were trained on manually collected and labeled data. IBM spared no expense in collecting and curating large amounts of data, so ViaVoice's recognition accuracy was far ahead of similar products. Even so, ViaVoice could not reliably recognize everyone's accent, so IBM's engineers designed an automatic adaptation feature: by letting users label the correct text for utterances that were misrecognized, ViaVoice could be optimized specifically for its owner's accent.

Today, you can use Google's speech recognition system over the Internet. You will find that no matter what accent a user has, Google's system recognizes it almost perfectly, with almost no need to "adapt to the owner's accent." Google's system also supports more languages. One of the secrets behind this lies in "big data."

Before releasing its speech recognition engine, Google first launched a voice search service, and before that, a telephone-based directory query service. The telephone service collected a large amount of users' voice input. Manually annotated, this data became the first batch of training data for the acoustic and language models. The subsequently released voice search service collected even more voice input from Internet users around the world and, combined with a semi-automatic annotation system, greatly enriched the training data. The more training data there is, the more languages and accents it can cover, and the higher the recognition accuracy of the resulting machine learning model.

As a result, from the moment Google released its speech recognition engine, its recognition rate was far higher than that of IBM ViaVoice, which relied on manually annotated training data. And because the speech recognition service is used by many mobile and desktop applications, it keeps collecting more users' voice input, so the accuracy of the model continues to improve.

The example above shows that the data collected by Internet services reflects the behavior of thousands upon thousands of users, and human behavior is the product of human intelligence.

So if we can design a distributed machine learning system that induces regularities from big data, we are in effect distilling the knowledge of all humanity. This may sound fantastic, but as the example above shows, Google has already done it. In the final article of this series, we will introduce a semantic learning system we developed, which induced millions of Chinese "semantic topics" from hundreds of billions of data records. Afterwards, whenever a user enters a piece of text, the system can use the trained model to understand, within milliseconds, the "semantics" the text expresses. This understanding helps eliminate ambiguity in the text, so that applications such as search engines, advertising systems, and recommender systems can better understand user needs.

In short, the Internet has given mankind its first opportunity to collect behavior data from all of humanity. This in turn offers machine learning, after decades of research, a new opportunity: distributed machine learning, which induces human knowledge from Internet data and thereby makes machines "smarter" than any individual.



Origin: blog.csdn.net/chengxvsyu/article/details/92205999