PHP Development Practices and Big Data

Big data refers to the tools and techniques used to handle very large and complex data sets. One widely used technology capable of processing large amounts of data is MapReduce.

 

When to use MapReduce

MapReduce is particularly well suited to problems involving large amounts of data. It works by dividing the work into smaller chunks, which can then be handled by multiple systems. Because MapReduce processes these chunks in parallel, solutions arrive faster than with conventional single-machine approaches.

MapReduce is likely to apply in scenarios such as:

1. Counting and statistics

2. Collating

3. Filtering

4. Sorting
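As a rough sketch of the pattern in plain PHP (not Hadoop itself; the function names are invented for illustration): the map phase emits key-value pairs, the framework groups them by key, and the reduce phase folds each group into a result.

```php
<?php
// map phase: emit a (word, 1) pair for every word in every input line
function map_lines(array $lines): array {
    $pairs = [];
    foreach ($lines as $line) {
        $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $pairs[] = [$word, 1];
        }
    }
    return $pairs;
}

// shuffle phase: group values by key (Hadoop does this between map and reduce)
function group_pairs(array $pairs): array {
    $groups = [];
    foreach ($pairs as [$key, $value]) {
        $groups[$key][] = $value;
    }
    return $groups;
}

// reduce phase: fold each group of values into a single count
function reduce_groups(array $groups): array {
    return array_map('array_sum', $groups);
}

$counts = reduce_groups(group_pairs(map_lines(["the water on", "on the water"])));
// $counts is ["the" => 2, "water" => 2, "on" => 2]
```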

Apache Hadoop

In this article, we will use Apache Hadoop.

For developing MapReduce solutions, Hadoop is the recommended choice: it is the de facto standard, and it is free, open-source software.

Hadoop clusters can also be rented from, or built on, cloud providers such as Amazon, Google, and Microsoft.

There are a number of other advantages:

Scalable: new processing nodes can be added easily, without changing a line of code.

Cost-effective: no special or exotic hardware is required, because the software runs on ordinary commodity hardware.

Flexible: it is schema-less. Any data structure can be processed, even combinations of multiple data sources, without many problems.

Fault-tolerant: if a node has a problem, the other nodes can take over its work and the whole cluster continues processing.

In addition, Hadoop supports so-called "streaming" applications, which give users the freedom to choose the scripting language in which the mapper and reducer are developed.

In this article we will use PHP as the main development language.

Hadoop installation 

Installing and configuring Apache Hadoop is beyond the scope of this article; plenty of guides for your particular platform are easy to find online. To keep things simple, we will only discuss the parts relevant to big data processing.

The mapper

The mapper's task is to turn its input into a series of key-value pairs. In the word counter case, the input is a series of lines; we split them into words and turn each word into a key-value pair (key: word, value: 1), which looks like this:

the      1
water    1
on       1
on       1
water    1
on       1
...

These pairs are then sent to the reducer for the next step.

The reducer

The reducer's task is to take the (sorted) pairs, iterate over them, and convert them into the desired output. In the word counter example, it takes each word's counts (the values), adds them up, and emits the word (the key) with its final count, as follows:

water 2

the   1

on    3

The whole map-and-reduce process flows from the input, through the mapping, sorting/shuffling, and reducing steps, to the final output.

A word counter in PHP

We will begin with the "Hello World" of the MapReduce world: a simple implementation of a word counter. We will need some data to process; for this experiment we will use the public-domain book Moby Dick.

Download the book by executing the following command:

wget http://www.gutenberg.org/cache ... 1.txt

Create a working directory in HDFS (Hadoop Distributed File System) and copy the downloaded book into it:

hadoop dfs -mkdir hello
hadoop dfs -copyFromLocal ./pg2701.txt hello/mobydick.txt

We begin with the PHP code for the mapper (mapper.php):

#!/usr/bin/php
<?php
    // iterate through input lines
    while ($line = fgets(STDIN)) {
        // remove leading and trailing whitespace
        $line = trim($line);

        // split the line into words
        $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);

        // iterate through the words
        foreach ($words as $key) {
            // print word (key) to standard output;
            // the output will be used in the
            // reduce (reducer.php) step:
            // word (key), tab, count (always 1)
            printf("%s\t%d\n", $key, 1);
        }
    }
?>
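The core of the mapper is the preg_split call and the tab-delimited printf. A quick illustration of what they produce (the sample input is invented for the example):

```php
<?php
// the same split the mapper uses: any whitespace, empty pieces dropped
$words = preg_split('/\s/', "  the  water on ", -1, PREG_SPLIT_NO_EMPTY);
// $words is ["the", "water", "on"]

// each word then becomes a tab-delimited key-value line
$out = '';
foreach ($words as $key) {
    $out .= sprintf("%s\t%d\n", $key, 1);
}
// $out is "the\t1\nwater\t1\non\t1\n"
```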

The reducer code (reducer.php) is as follows.

#!/usr/bin/php
<?php
    $last_key = NULL;
    $running_total = 0;

    // iterate through input lines
    while ($line = fgets(STDIN)) {
        // remove leading and trailing whitespace
        $line = trim($line);
        // split line into key and count
        list($key, $count) = explode("\t", $line);
        // this if/else structure works because
        // hadoop sorts the mapper output by its keys
        // before sending it to the reducer, so
        // identical keys always arrive consecutively
        if ($last_key === $key) {
            // same key as last time: increase its running total
            $running_total += $count;
        } else {
            if ($last_key !== NULL)
                // output previous key and its running total
                printf("%s\t%d\n", $last_key, $running_total);
            // reset last key and running total
            // to the new key and its value
            $last_key = $key;
            $running_total = $count;
        }
    }

    if ($last_key !== NULL)
        // flush the final key and its running total
        printf("%s\t%d\n", $last_key, $running_total);
?>
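Because Hadoop delivers the mapper output to the reducer sorted by key, identical keys always arrive consecutively, which is what the last-key comparison exploits. The same logic can be sketched as a self-contained function over already-sorted pairs (illustrative only, not part of any Hadoop API):

```php
<?php
// reducer logic over already-sorted "key\tcount" lines, mirroring reducer.php
function reduce_sorted(array $lines): array {
    $result = [];
    $last_key = NULL;
    $running_total = 0;
    foreach ($lines as $line) {
        list($key, $count) = explode("\t", $line);
        if ($last_key === $key) {
            // same key as before: keep accumulating
            $running_total += (int)$count;
        } else {
            if ($last_key !== NULL) {
                // a new key means the previous one is complete
                $result[$last_key] = $running_total;
            }
            $last_key = $key;
            $running_total = (int)$count;
        }
    }
    if ($last_key !== NULL) {
        // flush the final key after the input ends
        $result[$last_key] = $running_total;
    }
    return $result;
}

$totals = reduce_sorted(["on\t1", "on\t1", "the\t1", "water\t1", "water\t1"]);
// $totals is ["on" => 2, "the" => 1, "water" => 2]
```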

Make both scripts executable (chmod +x mapper.php reducer.php); you can then test them locally with a simple command pipeline:

head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php
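The sort in the middle of that pipeline stands in for Hadoop's shuffle phase: it brings identical keys together, which is exactly what the reducer's last-key comparison depends on. For example:

```shell
# unsorted mapper output: the two 'on' records are separated...
printf 'on\t1\nthe\t1\non\t1\n' | sort
# ...after sort, both 'on' lines are adjacent, so the reducer
# can total a key the moment a different key appears
```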

Then we run it on the Apache Hadoop cluster:

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
 -mapper "./mapper.php" \
 -reducer "./reducer.php" \
 -input "hello/mobydick.txt" \
 -output "hello/result"

The output will be stored in the hello/result folder. You can view it by executing the following command:

hdfs dfs -cat hello/result/part-00000

Calculating the average annual gold price

The next example is more practical: although this data set is relatively small, the same logic applies unchanged to far larger ones. We will try to calculate the average annual gold price over the last 50 years.

First, download the data set:

wget https://raw.githubusercontent. ... a.csv

Create a working directory in HDFS (Hadoop Distributed File System)

hadoop dfs -mkdir goldprice

Copy the downloaded data set to HDFS

hadoop dfs -copyFromLocal ./data.csv goldprice/data.csv

The mapper looks like this (mapper.php):

#!/usr/bin/php
<?php
    // iterate through input lines
    while ($line = fgets(STDIN)) {
        // remove leading and trailing whitespace
        $line = trim($line);

        // regular expression to capture year and gold value
        preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);

        if ($matches) {
            // key: year, value: gold price
            printf("%s\t%.3f\n", $matches[1], $matches[2]);
        }
    }
?>
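The regular expression assumes rows shaped like a dash-separated date followed by a comma and a price (for example "1950-01,34.72"; the exact layout depends on the downloaded CSV). A quick check of what it captures on such a row:

```php
<?php
// capture everything before the first '-' (the year)
// and everything after the last ',' (the price)
preg_match("/^(.*?)\-(?:.*),(.*)$/", "1950-01,34.72", $matches);
// $matches[1] is "1950", $matches[2] is "34.72"
```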

The reducer is also slightly modified, because we now need to keep track of the number of items as well as a running average.

#!/usr/bin/php
<?php
    $last_key = NULL;
    $running_total = 0;
    $running_average = 0;
    $number_of_items = 0;

    // iterate through input lines
    while ($line = fgets(STDIN)) {
        // remove leading and trailing whitespace
        $line = trim($line);

        // split line into key (year) and value (price)
        list($key, $count) = explode("\t", $line);

        // if the key retrieved last time is the same
        // as the key just received
        if ($last_key === $key) {
            // increase the number of items
            $number_of_items++;
            // increase the running total for the key
            $running_total += $count;
            // (re)calculate the average for that key
            $running_average = $running_total / $number_of_items;
        } else {
            if ($last_key !== NULL)
                // output previous key and its running average
                printf("%s\t%.4f\n", $last_key, $running_average);
            // reset key, running total, running average
            // and number of items
            $last_key = $key;
            $number_of_items = 1;
            $running_total   = $count;
            $running_average = $count;
        }
    }

    if ($last_key !== NULL)
        // flush the final key and its running average
        printf("%s\t%.4f\n", $last_key, $running_average);
?>
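Stripped of the streaming plumbing, the per-key bookkeeping the reducer performs is a simple running average (the prices below are made up for illustration):

```php
<?php
// running average: after each new value, recompute total / item count
$running_total = 0;
$number_of_items = 0;
$running_average = 0;
foreach ([35.2, 36.0, 36.4] as $price) {
    $number_of_items++;
    $running_total += $price;
    $running_average = $running_total / $number_of_items;
}
// $running_average now holds (35.2 + 36.0 + 36.4) / 3
```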

As with the word count example, we can test locally:

head -n1000 data.csv | ./mapper.php | sort | ./reducer.php

Finally, run it on the Hadoop cluster:

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
 -mapper "./mapper.php" \
 -reducer "./reducer.php" \
 -input "goldprice/data.csv" \
 -output "goldprice/result"

View the average values:

hdfs dfs -cat goldprice/result/part-00000

Bonus: generating a chart

We will often want to turn the result into a chart. For this demonstration I will use gnuplot, but you can use any tool you like.

First, fetch the results back to the local file system:

hdfs dfs -get goldprice/result/part-00000 gold.dat

Create a gnuplot configuration file (gold.plot) with the following contents:

# Gnuplot script file for generating gold prices
set terminal png
set output "chart.jpg"
set style data lines
set nokey
set grid
set title "Gold prices"
set xlabel "Year"
set ylabel "Price"
plot "gold.dat"

Generate charts:

gnuplot gold.plot

This generates a file named chart.jpg (it contains PNG data, despite the extension, since the terminal is set to png).


Origin blog.csdn.net/mnbvxiaoxin/article/details/104334295