PHP development practices and big data

 

Big data is the term for the tools and techniques used to handle large and complex data sets. One technology capable of processing large amounts of data is called MapReduce.

When to use MapReduce

MapReduce is particularly well suited to problems involving large amounts of data. It works by splitting the processing into small pieces (or blocks) that can conveniently be handled by multiple processing systems. Because MapReduce distributes these pieces of work in parallel, the solution is faster than a traditional, single-system approach.

The following scenarios are a likely fit for MapReduce; a minimal single-machine sketch of the idea follows the list:

1. Counting and statistics
2. Collating
3. Filtering
4. Sorting
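Before touching Hadoop, the two-phase idea itself can be sketched on a single machine with PHP's built-in array_map() and array_reduce(). This is only an illustration of the concept, not how Hadoop is invoked:

#!/usr/bin/php
<?php
    // map phase: turn each word into a (word, 1) pair
    $words = array('on', 'water', 'on');
    $pairs = array_map(function ($word) {
        return array($word, 1);
    }, $words);

    // reduce phase: sum the counts per word
    $counts = array_reduce($pairs, function ($totals, $pair) {
        list($word, $count) = $pair;
        $totals[$word] = isset($totals[$word]) ? $totals[$word] + $count : $count;
        return $totals;
    }, array());

    print_r($counts); // Array ( [on] => 2 [water] => 1 )
?>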

Apache Hadoop

In this article, we will use Apache Hadoop.

For developing MapReduce solutions, Hadoop is recommended: it is already the de facto standard, and it is free, open-source software. You can also rent or build Hadoop clusters from cloud providers such as Amazon, Google, and Microsoft.

There are a number of other advantages:

Scalable: new processing nodes can be added easily, without changing a line of code.
Cost-effective: no special or unique hardware is required, because the software runs on ordinary commodity hardware.
Flexible: it is schema-free, so any data structure can be processed, even a combination of multiple data sources, without many problems.
Fault-tolerant: if a node has a problem, other nodes can take over its work and the whole cluster continues processing.

In addition, Hadoop supports an application mode called "streaming", which gives users the freedom to choose the scripting language used to develop the mapper and the reducer.

In this article we will use PHP as the main development language.

Hadoop installation 

Installing and configuring Apache Hadoop is beyond the scope of this article; you can easily find plenty of articles online for your own platform. To keep things simple, we will only discuss the parts related to big data.

The mapper

The mapper's task is to transform the input into a series of key-value pairs. In the word counter case, for example, the input is a series of lines. We split them by word and turn each word into a key-value pair (key: the word, value: 1), which looks like this:

The    1
Water  1
On     1
On     1
Water  1
On     1
...

These pairs are then sent to the reducer for the next step.

The reducer

The reducer's task is to retrieve the (sorted) pairs, iterate over them, and transform them into the desired output. In the word counter example, it takes the counts (values) of each word and adds them up, producing the word (key) and its final count, as follows:

On     3
The    1
Water  2

The overall mapping and reducing process looks something like this:

[chart: overview of the map and reduce flow]

A word counter in PHP

example "Hello World" we will start MapReduce world, and that is to achieve a simple word counter. We will need some data to deal with. We do the experiment has been disclosed in the book Moby Dick.

To download the book, execute the following command:

wget http://www.gutenberg.org/cache ... 1.txt


Create a working directory in HDFS (Hadoop Distributed File System)

hadoop dfs -mkdir wordcount
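The book also has to be copied into HDFS before the job can read it. This step is implied by the rest of the walkthrough; the local file name below assumes the download above saved as pg2701.txt, the same name used in the local test later:

hadoop dfs -copyFromLocal ./pg2701.txt wordcount/mobydick.txt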


Let's begin with the PHP code for our mapper:

#!/usr/bin/php
<?php
    // iterate through lines
    while($line = fgets(STDIN)){
        // remove leading and trailing whitespace
        $line = ltrim($line);
        $line = rtrim($line);

        // split the line in words
        $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);
        // iterate through words
        foreach( $words as $key ) {
            // print word (key) to standard output
            // the output will be used in the
            // reduce (reducer.php) step
            // word (key) tab-delimited wordcount (1)
            printf("%s\t%d\n", $key, 1);
        }
    }
?>
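A quick way to sanity-check the mapper on its own is to feed it a line on standard input (making both scripts executable first):

chmod +x mapper.php reducer.php
echo "the water on" | ./mapper.php

This should print:

the     1
water   1
on      1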


The following is the reducer code.

#!/usr/bin/php
<?php
    $last_key = NULL;
    $running_total = 0;

    // iterate through lines
    while($line = fgets(STDIN)) {
        // remove leading and trailing whitespace
        $line = ltrim($line);
        $line = rtrim($line);
        // split line into key and count
        list($key,$count) = explode("\t", $line);
        // this if-else structure works because
        // hadoop sorts the mapper output by its keys
        // before sending it to the reducer
        // if the last key retrieved is the same
        // as the current key that has been received
        if ($last_key === $key) {
            // increase running total of the key
            $running_total += $count;
        } else {
            if ($last_key != NULL)
                // output previous key and its running total
                printf("%s\t%d\n", $last_key, $running_total);
            // reset last key and running total
            // by assigning the new key and its value
            $last_key = $key;
            $running_total = $count;
        }
    }

    if ($last_key != NULL)
        // output the last key and its running total,
        // which the loop above never reaches
        printf("%s\t%d\n", $last_key, $running_total);
?>
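The reducer can be checked in isolation the same way. Just remember that it expects its input already sorted by key, which is exactly what Hadoop guarantees between the two phases:

printf "on\t1\non\t1\nthe\t1\n" | ./reducer.php

This should print:

on      2
the     1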


You can also test the whole pipeline locally by chaining the scripts with a few commands and pipes:

head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php

Then we run it on the Apache Hadoop cluster:

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
 -mapper "./mapper.php" \
 -reducer "./reducer.php" \
 -input "wordcount/mobydick.txt" \
 -output "wordcount/result"


The output will be stored in the folder wordcount/result; you can view it by executing the following command:

hdfs dfs -cat wordcount/result/part-00000


Calculating the average gold price

The next example is more practical: although the data set is relatively small, the same logic can easily be applied to much larger data sets. We will try to calculate the average annual gold price over the last 50 years.


We download the data set:

wget https://raw.githubusercontent. ... a.csv

Create a working directory in HDFS (Hadoop Distributed File System)

hadoop dfs -mkdir goldprice

Copy the downloaded data set to HDFS

hadoop dfs -copyFromLocal ./data.csv goldprice/data.csv
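To make sense of the regular expression in the mapper below, it helps to know the row shape it assumes: a year-month date followed by a price. This format is an assumption about data.csv; adjust the pattern if your file differs. Example rows in that assumed shape:

1968-04,38.7
1968-05,40.6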

Our mapper looks like this:

#!/usr/bin/php
<?php
    // iterate through lines
    while($line = fgets(STDIN)){
        // remove leading and trailing
        $line = ltrim($line);
        $line = rtrim($line);

        // regular expression to capture year and gold value
        preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);

        if ($matches) {
            // key: year, value: gold price
            printf("%s\t%.3f\n", $matches[1], $matches[2]);
        }
    }
?>
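Assuming rows in the shape sketched above, the mapper can be spot-checked on a single line:

echo "1968-04,38.7" | ./mapper.php

This should print:

1968    38.700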

Our reducer is slightly modified, because it now needs to keep track of the number of items and calculate a running average:

#!/usr/bin/php
<?php
    $last_key = NULL;
    $running_total = 0;
    $running_average = 0;
    $number_of_items = 0;

    // iterate through lines
    while($line = fgets(STDIN)) {
        // remove leading and trailing whitespace
        $line = ltrim($line);
        $line = rtrim($line);

        // split line into key and count
        list($key,$count) = explode("\t", $line);

        // if the last key retrieved is the same
        // as the current key that has been received
        if ($last_key === $key) {
            // increase number of items
            $number_of_items++;
            // increase running total of the key
            $running_total += $count;
            // (re)calculate average for that key
            $running_average = $running_total / $number_of_items;
        } else {
            if ($last_key != NULL)
                // output previous key and its running average
                printf("%s\t%.4f\n", $last_key, $running_average);
            // reset key, running total, running average
            // and number of items
            $last_key = $key;
            $number_of_items = 1;
            $running_total   = $count;
            $running_average = $count;
        }
    }

    if ($last_key != NULL)
        // output previous key and its running average
        printf("%s\t%.3f\n", $last_key, $running_average);
?>

As with the word counting sample, we can test it locally:

head -n1000 data.csv | ./mapper.php | sort | ./reducer.php

Finally, we run it on the Hadoop cluster:

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
 -mapper "./mapper.php" \
 -reducer "./reducer.php" \
 -input "goldprice/data.csv" \
 -output "goldprice/result"

View the average values:

hdfs dfs -cat goldprice/result/part-00000


A small bonus: generating charts

We will often want to turn results into a chart. For this demonstration I will use gnuplot, but you can use any other tool you find interesting.

First, fetch the results back locally:

hdfs dfs -get goldprice/result/part-00000 gold.dat

Create a gnuplot configuration file (gold.plot) and copy in the following:

# Gnuplot script file for generating gold prices
set terminal png
set output "chart.png"
set style data lines
set nokey
set grid
set title "Gold prices"
set xlabel "Year"
set ylabel "Price"
plot "gold.dat"

Generate charts:

gnuplot gold.plot

This generates a file named chart.png. It looks like this:

[chart: average gold price per year]


Source: blog.csdn.net/lele989/article/details/92378953