Wu Yuxiong - Hadoop hands-on lab study notes: the PageRank algorithm

Purpose

Learn the PageRank algorithm

Learn to use MapReduce to solve complex, real-world computing problems

Principle

1. About the PageRank algorithm
  PageRank, i.e. web page ranking, is also known as page level, Google left-side ranking, or Page rank.
  PageRank is part of Google's ranking algorithm (ranking formula). It is a method Google uses to grade web pages and identify their importance, and a standard Google uses to judge the quality of a website.
  Google uses it to reflect the relevance and importance of web pages, and in search engine optimization it is often used as one factor for evaluating how well a page has been optimized. PageRank determines a page's rank from the vast network of hyperlink relationships. Google interprets a link from page A to page B as a vote by page A for page B, and determines the new rank from the source of the vote (and even the source of the source, i.e. the pages that link to page A) and the rank of the voting target. Simply put, a high-ranking page can raise the rank of the lower-ranking pages it links to.
  As a simple example, the pages on the Internet can be viewed as a directed graph in which each node is a page: if page A has a link to page B, there is a directed edge A -> B.
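One common way to hold such a directed graph in code is an adjacency list. The Java sketch below is only an illustration: the class name LinkGraph and the four-page link structure are assumptions made for the example, not something taken from the lab materials.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LinkGraph {
    // Hypothetical four-page web: each key is a page, each value lists the pages it links to.
    static Map<String, List<String>> example() {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("A", List.of("B", "C", "D")); // directed edges A->B, A->C, A->D
        out.put("B", List.of("A", "D"));
        out.put("C", List.of("A"));
        out.put("D", List.of("B", "C"));
        return out;
    }

    // Number of pages that link to the given page (its in-degree).
    static int inDegree(Map<String, List<String>> out, String page) {
        int count = 0;
        for (List<String> targets : out.values()) {
            if (targets.contains(page)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = example();
        System.out.println("A links to " + graph.get("A"));
        System.out.println("Pages linking to A: " + inDegree(graph, "A"));
    }
}
```

In this assumed graph, A has three outbound edges and is linked to by B and C.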

2. Introduction to the principle
  There is a great deal of material online about the principles of PageRank. Here we leave out most of the deeper mathematical proofs and give a relatively simple introduction:
  (1) The core idea of PR comes down to two points:

If a page is linked to by many other pages, the page is more important, so its PR value will be relatively high

If a page with a very high PR value links to another page, the PR value of the linked page rises accordingly

  (2) Wikipedia describes a simple PR algorithm. It does not consider transition probabilities but works iteratively, updating the PR values of all pages in each round. In the update, each page's PR value is split evenly among all the pages it points to, and each page sums the shares it receives from all the pages pointing to it to get its PR value for that round. This repeats until the PR values of all pages converge or a threshold condition is met.

  (3) Suppose the initial PR values of A, B, C, and D are all 1. A votes for B, C, and D, so each of B, C, and D gains PR(A)/3. Likewise, B votes for A and D, so A and D each gain PR(B)/2. C and D are handled the same way. After this round, A, B, C, and D all have new PR values, and the steps above are repeated until the error between two adjacent rounds reaches the required precision.

  (4) The algorithm above has a problem: if each new PR value is obtained by directly adding the original PR value and the PR value received from votes, the result is easily skewed by the initial values. One remedy is to multiply the original PR value by one coefficient, multiply the PR received from votes by another, and add the two, which gives more reasonable results. Based on the experience of many Internet companies, the two coefficients generally used are 0.85 and 0.15.
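The update rules in (2) and (4) can be sketched in plain Java without Hadoop. This is a minimal illustration, not the lab's implementation: the class PageRankRound, its method names, and the links assumed for C and D (the text only specifies A -> B, C, D and B -> A, D) are hypothetical. The 0.85 weight is applied to the votes and 0.15 to the previous value, matching the reducer code later in these notes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageRankRound {
    // Assumed example graph; the out-links of C and D are made up for illustration.
    static Map<String, List<String>> exampleLinks() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("A", List.of("B", "C", "D"));
        links.put("B", List.of("A", "D"));
        links.put("C", List.of("A"));
        links.put("D", List.of("B", "C"));
        return links;
    }

    // Every page starts with PR = 1, as in step (3).
    static Map<String, Double> initial(Map<String, List<String>> links) {
        Map<String, Double> pr = new LinkedHashMap<>();
        for (String page : links.keySet()) {
            pr.put(page, 1.0);
        }
        return pr;
    }

    // Rule (2): each page splits its PR evenly among the pages it links to;
    // the result maps each page to the sum of the shares it receives.
    static Map<String, Double> votes(Map<String, List<String>> links, Map<String, Double> pr) {
        Map<String, Double> received = new LinkedHashMap<>();
        for (String page : links.keySet()) {
            received.put(page, 0.0);
        }
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = pr.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue()) {
                received.merge(target, share, Double::sum);
            }
        }
        return received;
    }

    // Rule (4): new PR = 0.85 * votes received + 0.15 * previous PR.
    static Map<String, Double> weightedRound(Map<String, List<String>> links, Map<String, Double> pr) {
        Map<String, Double> received = votes(links, pr);
        Map<String, Double> next = new LinkedHashMap<>();
        for (String page : pr.keySet()) {
            next.put(page, 0.85 * received.getOrDefault(page, 0.0) + 0.15 * pr.get(page));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = exampleLinks();
        Map<String, Double> pr = initial(links);
        System.out.println("votes received: " + votes(links, pr));
        System.out.println("weighted round: " + weightedRound(links, pr));
    }
}
```

With every page starting at 1, A receives 1/2 from B and 1 from C, so the votes for A total 1.5, and the weighted round gives 0.85 * 1.5 + 0.15 * 1 = 1.425.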

3. Simple PageRank calculation
  Assume a small set consisting of only four pages: A, B, C, and D. If B, C, and D all link to A, then the PR (PageRank) value of A is the sum of the PR values of B, C, and D:

  PR(A) = PR(B) + PR(C) + PR(D)

Now suppose B also links to C, and D links to three pages including A. A page cannot vote twice, so B gives half a vote to each page it links to. By the same logic, only one third of D's vote counts toward A's PageRank:

  PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

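In other words, each page's PR value is divided evenly by its total number of outbound links. Writing L(X) for the number of outbound links of page X:

  PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)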
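The calculation for A in this four-page scenario can be checked with a few lines of Java. This is only a sketch: the class and method names are hypothetical, and the starting value of 0.25 per page is an assumption chosen for the example. The out-link counts follow the text: B links to 2 pages, C to 1, and D to 3.

```java
class SimplePageRank {
    // Contribution of one inbound page: its PR divided evenly over its outbound links.
    static double share(double pr, int outLinks) {
        return pr / outLinks;
    }

    // PR(A) when B has 2 out-links, C has 1, and D has 3, as in the scenario above.
    static double prOfA(double prB, double prC, double prD) {
        return share(prB, 2) + share(prC, 1) + share(prD, 3);
    }

    public static void main(String[] args) {
        // Assumed starting values: every page holds 0.25.
        System.out.println("PR(A) = " + prOfA(0.25, 0.25, 0.25));
    }
}
```

With 0.25 for each of B, C, and D, the three shares are 0.125, 0.25, and about 0.0833, so PR(A) comes to roughly 0.458.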

[Figure: a worked example of the PageRank calculation process]

Lab environment

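1. Operating system
  Operator machine: Windows_7
  Default username on the operator machine: hongya, password: 123456
2. Lab tools
  IntelliJ IDEA

IDEA, in full IntelliJ IDEA, is an integrated development environment for the Java language. IntelliJ is widely recognized in the industry as one of the best Java development tools, and its features in areas such as intelligent code assistance, automatic code completion, refactoring, J2EE support, Ant, JUnit, CVS integration, code inspection, and innovative GUI design are exceptional. IDEA is a product of JetBrains, a company headquartered in Prague, the capital of the Czech Republic, whose developers are mainly Eastern European programmers known for their rigor.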

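  Advantages:
  1) The most outstanding feature is naturally debugging: it can debug Java code, JavaScript, jQuery, Ajax, and other technologies. Leaving the other editing features aside, this alone far surpasses Eclipse.
  2) First, when inspecting a Map-type object whose implementation class is a hash map, empty Entry instances are filtered out automatically. In Eclipse, by contrast, you can only look for the key you want in the default toString() output.
  3) Second, when you need to evaluate an expression dynamically, for example when you have an instance of a class but do not know its API, Code Completion lists the methods it supports, which Eclipse cannot match.
  4) Finally, when debugging multiple threads, the Log on console feature helps you check how the threads execute.

  Disadvantages:
  1) Plugin development is lacking. Compared with Eclipse, IDEA is a dwarf when it comes to plugins: fewer than 400 plugins are officially published at present, and many of them offer little of substance, perhaps because IDEA itself is already so powerful.
  2) Only a single project is supported per window, which is inconvenient for development and is unpopular with programmers who like to create a separate test project for testing individual methods.
  3) Technical articles are scarce: there is basically no technical support to be found online at present, and very few technical articles.
  4) Resource consumption is relatively high. A medium-to-large J2EE project needs more than 200 MB of memory after startup, and about 500 MB of disk space including the installed software. (Because many of the smart features run in real time, all classes, including system classes, are stored in IDEA's working path.)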

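  Notable features:
  Smart selection
  Rich navigation modes
  Local history
  Excellent JUnit support
  Superior refactoring support
  Coding assistance
  Flexible code formatting
  Excellent XML support
  On-the-fly syntax checking
  Code inspection, and more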

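Step 2: Analysis of the code implementation

  NodeUtils is a utility class that parses the value of each record. Its code is fairly simple, with three main methods: parse the value, check whether the node has votes to cast (outbound links), and convert the node to a string.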
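The source of NodeUtils is not reproduced in these notes. Judging only from how the mapper and reducer below call it, a class with that shape might look like the following sketch. Everything here is an assumption: the name NodeUtilsSketch, the whitespace-separated record format (PR value first, then the out-link nodes), and the method bodies.

```java
import java.util.Arrays;

class NodeUtilsSketch {
    private double pr;
    private String[] outLinkNodes = new String[0];

    // Parses a record such as "1.0<TAB>B<TAB>D" (a node) or "0.5" (a bare vote).
    // Returns null for null or blank input.
    static NodeUtilsSketch parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        String[] parts = value.trim().split("\\s+");
        NodeUtilsSketch node = new NodeUtilsSketch();
        node.pr = Double.parseDouble(parts[0]);
        node.outLinkNodes = Arrays.copyOfRange(parts, 1, parts.length);
        return node;
    }

    // True when the record lists at least one out-link, i.e. the node has votes to cast.
    boolean isHaveOutLink() {
        return outLinkNodes.length > 0;
    }

    double getPr() {
        return pr;
    }

    void setPr(double pr) {
        this.pr = pr;
    }

    String[] getOutLinkNodes() {
        return outLinkNodes;
    }

    // Serializes back to the assumed record format: PR, then tab-separated out-links.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(String.valueOf(pr));
        for (String node : outLinkNodes) {
            sb.append('\t').append(node);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        NodeUtilsSketch node = parse("1.0\tB\tD");
        System.out.println(node.isHaveOutLink() + " " + node.getOutLinkNodes().length);
    }
}
```

Under this assumed format, a record with out-links is a node's own record and a bare number is a vote, which is the distinction the reducer relies on. A page with no outbound links would be indistinguishable from a vote under this scheme, so the sketch does not handle that case.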
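  2.1 PageRankMapper reads in one record (line) at a time, parses it to get the source node and the nodes it votes for, and writes out the PR value for each node. For reference, see the PageRankMapper code under hellohadoop|com.hongya|day027|RunJob.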

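// Assumes imports of java.io.IOException, org.apache.hadoop.io.Text,
// and org.apache.hadoop.mapreduce.Mapper in the enclosing class file.
static class PageRankMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the value
        NodeUtils n = NodeUtils.parse(value.toString());
        assert n != null;
        // Write out the node's existing record
        context.write(key, new Text(n.toString())); // key:A  value:1.0    B    D
        if (n.isHaveOutLink()) {
            // Each out-link receives an equal share of this node's PR
            double outValue = n.getPr() / n.getOutLinkNodes().length;
            for (String outNode : n.getOutLinkNodes()) {
                // Write out the vote value
                context.write(new Text(outNode), new Text(outValue + "")); // key:B value:0.5
            }
        }
    }
}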

  2.2 PageRankReducer takes the map output directly, sums the PR values of the votes received, and then writes out the result.

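// Assumes imports of java.io.IOException, org.apache.hadoop.io.Text,
// and org.apache.hadoop.mapreduce.Reducer in the enclosing class file.
static class PageRankReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // sourceNode and sourcePr hold the source node and its current PR;
        // sum accumulates the values of all incoming votes
        double sum = 0;
        NodeUtils sourceNode = null;
        double sourcePr = 0;
        for (Text i : values) {
            NodeUtils n = NodeUtils.parse(i.toString());
            assert n != null;
            if (n.isHaveOutLink()) {
                // A record with out-links is the node's own record
                sourceNode = n;
                sourcePr = n.getPr();
            } else {
                // Anything else is a vote
                sum = sum + n.getPr();
            }
        }
        double newPr = sum * 0.85 + 0.15 * sourcePr;
        System.out.println("New PageRank value: ---------" + key + "*****" + newPr);
        // Compute the error
        double delta = newPr - sourcePr;
        System.out.println(delta + "    " + (int) (Math.abs(delta) * 1000));
        context.getCounter(My.COUNTER).increment((int) (Math.abs(delta) * 1000));
        // Write the record back out with the updated PR. Asserts are disabled by
        // default, so fail explicitly if no node record arrived for this key.
        if (sourceNode == null) {
            throw new IllegalStateException("No node record for key " + key);
        }
        sourceNode.setPr(newPr);
        context.write(key, new Text(sourceNode.toString()));
    }
}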

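  Because the implementation has to iterate until the error meets the precision requirement, the job needs to record the error and keep running MapReduce in a loop. For details, see hellohadoop|com.hongya|day027|RunJob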
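The RunJob driver itself is not shown in these notes. The loop it describes can be imitated in memory with plain Java, as in the sketch below. This is not the Hadoop driver: the class PageRankLoop, the example graph, the round limit, and the stopping rule (stop once the accumulated error counter is at or below a threshold) are all assumptions. Only the per-page update and the error counter, the sum of (int)(|newPr - oldPr| * 1000), mirror the mapper and reducer above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageRankLoop {
    // Assumed example graph; every page has at least one out-link.
    static Map<String, List<String>> exampleLinks() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("A", List.of("B", "C", "D"));
        links.put("B", List.of("A", "D"));
        links.put("C", List.of("A"));
        links.put("D", List.of("B", "C"));
        return links;
    }

    static Map<String, Double> initialPr(Map<String, List<String>> links) {
        Map<String, Double> pr = new LinkedHashMap<>();
        for (String page : links.keySet()) {
            pr.put(page, 1.0);
        }
        return pr;
    }

    // One simulated job run: the "map" step spreads each PR over the out-links,
    // the "reduce" step mixes votes and old PR (0.85 / 0.15) and updates pr in place.
    // Returns the error counter accumulated over all pages.
    static long runOnce(Map<String, List<String>> links, Map<String, Double> pr) {
        Map<String, Double> votes = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = pr.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue()) {
                votes.merge(target, share, Double::sum);
            }
        }
        long counter = 0;
        for (String page : links.keySet()) {
            double oldPr = pr.get(page);
            double newPr = 0.85 * votes.getOrDefault(page, 0.0) + 0.15 * oldPr;
            counter += (int) (Math.abs(newPr - oldPr) * 1000);
            pr.put(page, newPr);
        }
        return counter;
    }

    // Repeats runOnce until the counter is at or below the threshold or maxRounds is hit.
    // Returns the number of rounds actually run.
    static int iterate(Map<String, List<String>> links, Map<String, Double> pr,
                       long threshold, int maxRounds) {
        int rounds = 0;
        while (rounds < maxRounds) {
            rounds++;
            if (runOnce(links, pr) <= threshold) {
                break;
            }
        }
        return rounds;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = exampleLinks();
        Map<String, Double> pr = initialPr(links);
        int rounds = iterate(links, pr, 0, 50);
        System.out.println("rounds: " + rounds + ", PR: " + pr);
    }
}
```

In a real Hadoop driver, the role of the returned counter would be played by the job's counter value, read after each run, with each round's output directory serving as the next round's input.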


Origin www.cnblogs.com/tszr/p/12169362.html