A new work on Internet big data mining and large-scale distributed data processing, three years in the writing by leading experts

As we all know, the mobile internet, social media, e-commerce, and sensors of every kind generate enormous data sets, from which data mining can extract useful information.

This book covers data mining and machine learning in the big-data setting, focusing on a comprehensive, practice-oriented introduction to algorithms for processing data at scale; it is essential reading for students and practitioners in related fields. The main contents fall into ten parts:

◆ distributed file systems and the MapReduce tool;

◆ similarity search;

◆ data stream processing, including specialized algorithms for data that is lost if not processed promptly;

◆ search engine technology, such as Google's PageRank;

◆ frequent item set mining;

◆ algorithms for clustering large-scale, high-dimensional data sets;

◆ two key Web applications: advertising management and recommendation systems;

◆ mining of social-network graphs;

◆ dimensionality-reduction techniques, such as SVD and CUR decomposition;

◆ large-scale machine learning.


Basic concepts of data mining

This chapter is the book's introduction. It first describes the nature of data mining and discusses how it differs from several related disciplines.

It then introduces Bonferroni's principle, which is in effect a warning against over-using data mining.

This chapter also outlines some very useful ideas that do not necessarily belong to data mining proper, but that help in understanding important concepts within it. These include the TF.IDF measure of word importance, the properties of hash functions and index structures, and identities involving e, the base of natural logarithms. Finally, it briefly previews the topics covered in later chapters.
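As a small illustration of the TF.IDF weighting mentioned above, here is a minimal sketch in Python. The normalization chosen here (term count divided by the count of the document's most frequent term, times log2 of inverse document frequency) is a common convention, though the book's exact formulation may differ.

```python
import math

def tf_idf(term, doc, corpus):
    # TF: count of the term in this document, normalized by the count
    # of the document's most frequent term.
    counts = {}
    for word in doc:
        counts[word] = counts.get(word, 0) + 1
    tf = counts.get(term, 0) / max(counts.values())
    # IDF: log2(number of documents / number of documents containing the term).
    df = sum(1 for d in corpus if term in d)
    idf = math.log2(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["big", "data", "mining"],
          ["data", "stream", "mining"],
          ["big", "big", "news"]]
# "big" dominates the third document but appears in only 2 of the 3 documents.
print(tf_idf("big", corpus[2], corpus))  # 1.0 * log2(3/2) ≈ 0.585
```

A term that appears in every document gets IDF 0, so ubiquitous words like "the" are automatically discounted.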



Finding similar items

A fundamental data-mining problem is to find "similar" items in data. We introduce applications of this problem in Section 3.1 and give a concrete example: finding near-duplicate Web pages. Such near-duplicates may be plagiarized pages, or mirror pages that are nearly identical apart from host-specific information.

We first express the similarity-search problem as that of finding sets with a relatively large intersection, and then show how to convert the problem of textual similarity into this set problem via the well-known technique of "shingling". Next we introduce a technique called minhashing, which compresses large sets in a way that still allows the similarity of the original sets to be deduced from the compressed versions. Other techniques that work when a very high degree of similarity is required are described in Section 3.9.
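To make shingling and minhashing concrete, here is a small sketch. The character-level shingles and md5-based hash family are illustrative choices of ours, not the book's:

```python
import hashlib

def shingles(text, k=3):
    # The set of all length-k substrings (character-level k-shingles).
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def h(i, x):
    # Member i of a deterministic family of hash functions.
    return int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)

def minhash_signature(s, n=100):
    # Under one hash function, the minhash of a set is the minimum hash
    # value over its members; the signature is the list of n minhashes.
    return [min(h(i, x) for x in s) for i in range(n)]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of signature positions that agree estimates the
    # Jaccard similarity of the underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the cat sat on the mat")
b = shingles("the cat sat on a mat")
true_sim = len(a & b) / len(a | b)
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(round(true_sim, 2), round(est, 2))  # the two values should be close
```

The point of the compression is that the 100-value signature is the same size for every set, no matter how many shingles the original document had.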

Another important issue that arises in any search for similar items is that even when computing the similarity of a single pair is easy, there are far too many pairs of items to test them all. A technique called Locality-Sensitive Hashing (LSH) addresses this by focusing the search on those pairs that are likely to be similar.
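The usual way LSH is applied to minhash signatures is the "banding" technique: split each signature into bands, hash each band, and treat only items that collide in at least one band as candidate pairs. A sketch with toy signatures of our own:

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands, rows):
    # Split each signature into `bands` bands of `rows` positions each;
    # documents whose values collide on some band become candidate pairs.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs

sigs = {"A": [1, 2, 3, 4, 5, 6],
        "B": [1, 2, 3, 9, 9, 9],   # agrees with A on the first band only
        "C": [7, 7, 7, 7, 7, 7]}
print(lsh_candidate_pairs(sigs, bands=2, rows=3))  # {('A', 'B')}
```

Only candidate pairs need their exact similarity computed, which is what makes searching millions of items feasible.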

Finally, we no longer restrict the notion of similarity to the intersection of sets, but consider distance in an arbitrary metric space. This in turn motivates a general framework for LSH that can be applied to other definitions of similarity.



Data stream mining

Most of the algorithms presented in this book assume that we are mining a database; that is, all the data is available whenever we need it. In this chapter we make another assumption: the data arrives in one or more streams, and if it is not processed or stored promptly it is lost forever. Moreover, we assume the data arrives so fast that it is infeasible to store it all in active storage (i.e., a conventional database) and interact with it at a time of our choosing.

Every stream-processing algorithm involves, to some extent, summarization of the stream. We first consider how to take a useful sample of a stream and how to filter out most of the "unwanted" elements. Then we show how to estimate the number of distinct elements in a stream, using far less storage than would be required to list all the elements seen.
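The distinct-element estimate mentioned above is classically done with the Flajolet-Martin idea: hash each element and remember R, the maximum number of trailing zero bits seen; 2^R estimates the distinct count. A single-hash sketch (real implementations combine many hash functions to reduce variance):

```python
import hashlib

def trailing_zeros(n):
    # Number of trailing zero bits in n's binary representation.
    count = 0
    while n > 0 and n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream):
    # Track R, the maximum trailing-zero count over hashed elements;
    # duplicates hash identically, so they cannot raise R.
    r = 0
    for x in stream:
        value = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        r = max(r, trailing_zeros(value))
    return 2 ** r

stream = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(fm_estimate(stream))  # same value as for [1, 2, 3]: repeats change nothing
```

The memory cost is a single small integer per hash function, regardless of how long the stream is.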

Another approach to summarizing a stream is to look only at a fixed-length "window" consisting of the last n elements, for some given and typically large n. We then show how to answer database-style queries over such a window.

If there are many streams and/or n is large, we may not be able to store the entire window for every stream, so even these "windows" must themselves be summarized. For a window over a bit stream, a fundamental problem is approximating the number of 1s it contains.

We present a method that consumes much less storage than the entire window. The method can also be extended to approximate various kinds of sums.
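The method alluded to here is the DGIM algorithm: represent the window's 1s by O(log² n) buckets whose sizes are powers of two, keep at most two buckets of each size, and estimate by counting only half of the oldest bucket. A condensed sketch of our own:

```python
def dgim_update(buckets, timestamp, bit, window_n):
    # `buckets` is a list of (size, timestamp_of_most_recent_1), newest first.
    # Drop buckets whose most recent 1 has left the window.
    buckets[:] = [(s, t) for (s, t) in buckets if t > timestamp - window_n]
    if bit == 1:
        buckets.insert(0, (1, timestamp))
        size = 1
        # Whenever three buckets share a size, merge the two oldest of them.
        while sum(1 for s, _ in buckets if s == size) == 3:
            idxs = [i for i, (s, _) in enumerate(buckets) if s == size]
            i, j = idxs[-2], idxs[-1]            # the two oldest of this size
            merged = (size * 2, buckets[i][1])   # keep the newer timestamp
            del buckets[j]
            buckets[i] = merged
            size *= 2

def dgim_count(buckets):
    # Estimate of 1s in the window: all bucket sizes, counting only half
    # of the oldest bucket (its 1s may lie partly outside the window).
    if not buckets:
        return 0
    return sum(s for s, _ in buckets) - buckets[-1][0] // 2

buckets = []
for t, bit in enumerate([1, 1, 1, 1]):
    dgim_update(buckets, t, bit, window_n=8)
print(dgim_count(buckets))  # 3 (true count is 4; error is at most half the oldest bucket)
```

The space used is O(log² n) bits rather than n, and the estimate is wrong by at most 50% of the oldest bucket's size.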



Frequent Item Sets

This chapter focuses on characterizing data by finding frequent itemsets. The problem is often viewed as the discovery of "association rules", although the latter is a more complex way of characterizing data, whose discovery rests fundamentally on finding frequent itemsets.

First, we introduce the "market-basket" model of data, which is essentially a many-to-many relationship between two kinds of elements, "items" and "baskets", together with some assumptions about the shape of the data. The frequent-itemsets problem is that of finding sets of items that appear together in many of the same baskets.

The frequent-itemsets problem differs from the similarity search discussed in Chapter 3: here we care chiefly about the absolute number of baskets containing a particular itemset, whereas there the goal was to find itemsets with a high degree of overlap between baskets, even when the absolute number of baskets involved is low.

This difference gives rise to a new class of algorithms for finding frequent itemsets. We first introduce the A-Priori algorithm, whose basic idea is that if any subset of a set is not frequent, the set itself cannot be frequent; on this basis the algorithm can eliminate most of the ineligible large sets by examining only small ones. We then cover various improvements to the basic A-Priori algorithm, focused on extremely large data sets that put heavy pressure on available main memory.
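The pruning idea described above can be sketched as follows. This is a minimal illustrative version (real A-Priori implementations are far more careful about memory):

```python
from itertools import combinations

def apriori(baskets, support):
    # Pass 1: count individual items; keep those meeting the support threshold.
    counts = {}
    for basket in baskets:
        for item in basket:
            counts[item] = counts.get(item, 0) + 1
    frequent = {frozenset([i]) for i, c in counts.items() if c >= support}
    result = set(frequent)
    k = 2
    while frequent:
        # A size-k candidate is kept only if every (k-1)-subset is already
        # frequent: the monotonicity that prunes most large sets.
        candidates = set()
        for basket in baskets:
            items = sorted(i for i in set(basket) if frozenset([i]) in result)
            for combo in combinations(items, k):
                if all(frozenset(sub) in result
                       for sub in combinations(combo, k - 1)):
                    candidates.add(frozenset(combo))
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            bset = set(basket)
            for c in candidates:
                if c <= bset:
                    counts[c] += 1
        frequent = {c for c, n in counts.items() if n >= support}
        result |= frequent
        k += 1
    return result

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"},
           {"bread"}, {"milk", "beer"}]
print(apriori(baskets, support=2))
```

With support threshold 2, {bread, milk} and {beer, milk} qualify, while {beer, bread} appears in only one basket and is never counted at size 3.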

Next, we consider some faster, approximate algorithms that are not guaranteed to find all frequent itemsets. Some of these algorithms also employ parallelism, including parallel implementations on the MapReduce framework.

Finally, we briefly discuss finding frequent itemsets in a data stream.



Recommendation systems

A broad class of Web applications involves predicting user preferences among a set of options; such systems are called recommendation systems. This chapter begins with some of the most important application examples of such systems.

To focus the discussion, here are two good examples of recommendation systems:

(1) offering news articles to readers of an online newspaper, based on predictions of their interests;

(2) suggesting to the customers of an online retailer items they might want to buy, based on their past purchases and/or product searches.

Recommendation systems use a range of different technologies, and can be divided into two broad classes:

  1. Content-based systems examine the properties of the items recommended. For example, if a Netflix user has watched many Western movies, the system will recommend to that user other movies classified in the database as "Westerns".

  2. Collaborative filtering systems recommend items by computing similarity between users and/or items: items liked by users similar to a given user are recommended to that user. Such systems can draw on the fundamentals of similarity search from Chapter 3 and clustering from Chapter 7. However, those techniques alone are not sufficient, and some newer algorithms have proved highly effective for recommendation systems.
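A tiny sketch of user-user collaborative filtering, assuming ratings stored as sparse dictionaries (the data and the choice of cosine similarity are our own illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse rating vectors (item -> rating).
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 5, "m2": 3, "m4": 2},
    "carol": {"m2": 1, "m4": 5},
}

def recommend(user, ratings):
    # Recommend the items rated by the most similar other user
    # that `user` has not rated yet.
    best = max((u for u in ratings if u != user),
               key=lambda u: cosine(ratings[user], ratings[u]))
    return sorted(set(ratings[best]) - set(ratings[user]))

print(recommend("alice", ratings))  # → ['m4'] (bob is the most similar user)
```

At real scale, the expensive step is finding similar users among millions, which is exactly where the LSH machinery of Chapter 3 comes in.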



Large-scale machine learning

Many algorithms today fall under the heading of "machine learning". Like the other algorithms in this book, their goal is to extract information from data. All data-analysis algorithms produce summaries of the data, on the basis of which decisions can be made.

In many instances, the frequent-itemset analysis of Chapter 6 produces information in the form of association rules, which can then be used to plan a sales strategy or to serve other goals.

However, algorithms called "machine learning" do not merely summarize data; they can be viewed as learning a model or classifier from the data, and thus learn something about data that will be seen in the future. For example, the clustering algorithms of Chapter 7 produce a collection of clusters that not only tell us about the data being analyzed (the training set), but also allow future data to be assigned to one of the clusters the algorithm produced. Machine-learning enthusiasts therefore often speak of clustering using the term "unsupervised learning", where "unsupervised" means the input data does not tell the clustering algorithm what its output clusters should be. In supervised machine learning, the subject of this chapter, the given data includes information about the correct classification of at least part of the data. Data that has already been classified is called the training set.

This chapter does not attempt a comprehensive survey of machine-learning methods; we focus only on those suited to extremely large data sets and those amenable to parallel implementation. We introduce the classic "perceptron" approach to learning a classifier, which finds a hyperplane separating two classes of data. We then examine some more modern techniques, including support-vector machines. Like the perceptron, these methods seek the best separating hyperplane, one with as few (if any) training-set elements close to it. Finally, we discuss nearest-neighbor techniques, in which data is classified according to the classes of its nearest neighbors in some space.
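The perceptron update described above can be sketched in a few lines: whenever a point is misclassified, nudge the weight vector toward (or away from) that point. The toy data set below is our own; on linearly separable data the loop is guaranteed to terminate.

```python
def perceptron(points, labels, epochs=500, eta=0.1):
    # Learn weights w and bias b so that sign(w . x + b) matches the
    # +1 / -1 labels. Converges when the data is linearly separable.
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # misclassified: nudge toward x
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                updated = True
        if not updated:                 # a full mistake-free pass: done
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# A tiny linearly separable set: label is +1 exactly when x0 + x1 > 1.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]
lbl = [-1, -1, -1, 1, 1]
w, b = perceptron(pts, lbl)
print([predict(w, b, p) for p in pts])  # → [-1, -1, -1, 1, 1]
```

Because each update touches only one point at a time, the method scales to streams of training data and parallelizes naturally, which is why the book singles it out for the large-scale setting.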



Given the book's length, this editor will not introduce it in any more detail here. No doubt readers already have some understanding and insight of their own into data mining and distributed processing, but most of us still have gaps when it comes to concepts of truly large-scale computation, so I hope you will read the book carefully and grasp its real substance!









Origin blog.51cto.com/14620574/2455644