Handling large data sets requires tools and techniques built for scale and complexity. One such technology, capable of processing very large amounts of data, is called MapReduce.
When to use MapReduce
MapReduce is particularly well suited to problems involving large amounts of data. It works by dividing the work into smaller chunks, which can then be processed by multiple systems. Because MapReduce works on these chunks in parallel, solutions arrive faster than with conventional systems.
If a problem can be split up this way, it is probably a good candidate for MapReduce.
Apache Hadoop
In this article, we will use Apache Hadoop. Hadoop is the recommended choice for developing MapReduce solutions: it is the de facto standard, and it is free, open-source software.
You can also rent or build a Hadoop cluster from cloud providers such as Amazon, Google, and Microsoft.
There are a number of other advantages:
Scalable: new processing nodes can be added easily, without changing a line of code.
Cost-effective: no special or unique hardware is required, because the software runs on ordinary commodity hardware.
Flexible: it is schema-less, so any data structure can be processed, even a combination of multiple data sources, without many problems.
Fault-tolerant: if a node has a problem, other nodes can take over its work, and the whole cluster continues processing.
In addition, Hadoop supports so-called "streaming" applications, which give users the freedom to choose the scripting language in which the mapper and reducer are developed.
In this article we will use PHP as the main development language.
Installing and configuring Apache Hadoop is beyond the scope of this article; you can easily find plenty of articles online for your own platform. To keep things simple, we will only discuss the parts related to big data.
The mapper
The mapper's task is to transform the input into a series of key-value pairs. In the word-counter case, for example, the input is a series of lines. We split them up into words and turn each word into a key-value pair (key: word, value: 1), which looks like this:
the 1
water 1
on 1
on 1
water 1
on 1
... 1
These pairs are then sent to the reducer for the next step.
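To make the map step concrete, here is a minimal stand-in written as a shell one-liner with awk. This is only an illustration of the idea, not the PHP mapper we will build below:

```shell
# Emit a tab-delimited "word<TAB>1" pair for every word on stdin,
# the same key-value shape the mapper produces.
printf 'the water on\non water on\n' \
  | awk '{ for (i = 1; i <= NF; i++) printf "%s\t1\n", $i }'
```

Run against these two sample lines, it prints exactly the six pairs shown above.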
The reducer
The reducer's task is to retrieve the (sorted) pairs, iterate over them, and transform them into the desired output. In the word-counter example, it takes the counts (values) of each word, adds them up, and produces the word (key) with its final count, as follows:
water 2
the 1
on 3
The whole mapping-and-reducing process looks something like this: the input lines go into the mapper, the mapper's key-value pairs are sorted, and the sorted pairs go into the reducer, which produces the output.
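The reduce step relies on the pairs arriving sorted by key. A quick shell sketch, with awk standing in for the PHP scripts (illustrative only), shows the whole map-sort-reduce flow end to end:

```shell
# map: words become "word<TAB>1" pairs; sort: groups equal keys together;
# reduce: sums each key's values using the same last-key technique
# the PHP reducer below will use.
printf 'the water on\non water on\n' \
  | awk '{ for (i = 1; i <= NF; i++) printf "%s\t1\n", $i }' \
  | sort \
  | awk -F'\t' '
      $1 != last { if (last != "") printf "%s\t%d\n", last, total
                   last = $1; total = 0 }
      { total += $2 }
      END { if (last != "") printf "%s\t%d\n", last, total }'
```

This prints the same totals as above (on 3, the 1, water 2), in sorted key order.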
We will begin with the "Hello World" of the MapReduce world: a simple implementation of a word counter. We will need some data to process, and for this experiment we will use the public-domain book Moby Dick.
Download the book by executing the following command:
wget http://www.gutenberg.org/cache ... 1.txt
Create a working directory in HDFS (Hadoop Distributed File System)
hadoop dfs -mkdir hello
hadoop dfs -copyFromLocal ./pg2701.txt hello/mobydick.txt
We begin with our mapper's PHP code:
#!/usr/bin/php
<?php
// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // split the line into words
    $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);
    // iterate through the words
    foreach ($words as $key) {
        // print word (key) to standard output;
        // the output will be used in the reduce (reducer.php) step:
        // word (key) tab-delimited wordcount (1)
        printf("%s\t%d\n", $key, 1);
    }
}
?>
The following is the reducer code.
#!/usr/bin/php
<?php
$last_key = NULL;
$running_total = 0;

// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // split line into key and count
    list($key, $count) = explode("\t", $line);
    // this if/else structure works because Hadoop sorts the
    // mapper output by key before sending it to the reducer:
    // if the last key retrieved is the same as the current key
    if ($last_key === $key) {
        // increase the running total of the key
        $running_total += $count;
    } else {
        if ($last_key !== NULL) {
            // output the previous key and its running total
            printf("%s\t%d\n", $last_key, $running_total);
        }
        // reset the last key and running total
        // by assigning the new key and its value
        $last_key = $key;
        $running_total = $count;
    }
}
// output the final key and its running total
if ($last_key !== NULL) {
    printf("%s\t%d\n", $last_key, $running_total);
}
?>
You can easily test the scripts locally by combining a few commands with pipes:
head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php
Then we run it on the Apache Hadoop cluster:
hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
    -mapper "./mapper.php" \
    -reducer "./reducer.php" \
    -input "hello/mobydick.txt" \
    -output "hello/result"
The output will be stored in the hello/result folder. You can view it by executing the following command:
hdfs dfs -cat hello/result/part-00000
Calculating the average annual gold price
The next example is more practical. Although the data set is relatively small, the same logic can easily be applied to a data set with hundreds of millions of data points. We will try to calculate the average annual gold price over the last 50 years.
First we download the data set:
wget https://raw.githubusercontent. ... a.csv
Create a working directory in HDFS (Hadoop Distributed File System)
hadoop dfs -mkdir goldprice
Copy the downloaded data set to HDFS:
hadoop dfs -copyFromLocal ./data.csv goldprice/data.csv
The mapper looks like this:
#!/usr/bin/php
<?php
// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // regular expression to capture the year and the gold value
    preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);
    if ($matches) {
        // key: year, value: gold price
        printf("%s\t%.3f\n", $matches[1], $matches[2]);
    }
}
?>
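To see what this mapper emits, here is a small shell check with awk standing in for the PHP regular expression. It assumes the CSV rows look like 1950-01,34.730 (year-month, then price; an assumption about the data set's layout, not taken from the article):

```shell
# Split assumed "YYYY-MM,price" rows into "YYYY<TAB>price" pairs,
# the same key-value shape the PHP mapper emits.
printf '1950-01,34.730\n1950-02,35.270\n' \
  | awk -F',' '{ split($1, d, "-"); printf "%s\t%.3f\n", d[1], $2 }'
```

Both sample rows get the key 1950, so the reducer can average them together.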
The reducer is also slightly modified, because we need to keep track of the number of items as well as the running average.
#!/usr/bin/php
<?php
$last_key = NULL;
$running_total = 0;
$running_average = 0;
$number_of_items = 0;

// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // split line into key and value
    list($key, $count) = explode("\t", $line);
    // if the last key retrieved is the same as the current key
    if ($last_key === $key) {
        // increase the number of items
        $number_of_items++;
        // increase the running total of the key
        $running_total += $count;
        // (re)calculate the average for that key
        $running_average = $running_total / $number_of_items;
    } else {
        if ($last_key !== NULL) {
            // output the previous key and its running average
            printf("%s\t%.3f\n", $last_key, $running_average);
        }
        // reset the key, running total, running average
        // and number of items
        $last_key = $key;
        $number_of_items = 1;
        $running_total = $count;
        $running_average = $count;
    }
}
// output the final key and its running average
if ($last_key !== NULL) {
    printf("%s\t%.3f\n", $last_key, $running_average);
}
?>
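The running-average logic can be checked the same way. This awk sketch (again illustrative, not the PHP reducer itself) averages the values per year, on input that is already sorted by key:

```shell
# Average the values per key with the streaming last-key technique:
# the two 1950 prices average to 35.000; the lone 1951 price stays 40.000.
printf '1950\t34.730\n1950\t35.270\n1951\t40.000\n' \
  | awk -F'\t' '
      $1 != last { if (last != "") printf "%s\t%.3f\n", last, total / n
                   last = $1; total = 0; n = 0 }
      { total += $2; n++ }
      END { if (last != "") printf "%s\t%.3f\n", last, total / n }'
```

Note the END block: without it, the final key's average would never be printed.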
As with the word-count sample, we can test it locally:
head -n1000 data.csv | ./mapper.php | sort | ./reducer.php
Finally, we run it on the Hadoop cluster:
hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
    -mapper "./mapper.php" \
    -reducer "./reducer.php" \
    -input "goldprice/data.csv" \
    -output "goldprice/result"
View the average values:
hdfs dfs -cat goldprice/result/part-00000
A small bonus: generating a chart
We will often want to turn the results into a chart. For this demonstration I will use gnuplot, but you can use any other tool you find interesting.
First, fetch the results back locally:
hdfs dfs -get goldprice/result/part-00000 gold.dat
Create a gnuplot configuration file (gold.plot) and copy the following into it:
# Gnuplot script file for generating gold prices
set terminal png
set output "chart.jpg"
set style data lines
set nokey
set grid
set title "Gold prices"
set xlabel "Year"
set ylabel "Price"
plot "gold.dat"
Generate the chart:
gnuplot gold.plot
This generates a file named chart.jpg.