I have known about Hadoop for a long time, but I never learned it. For one thing, I had no use for it in my work; for another, I could never manage to write a first demo, and my confidence was crushed by the first edition of "Hadoop: The Definitive Guide". It's different now: work has forced my hand, and I have finally made up my mind to learn it.
What can Hadoop do?
So far, what I know it can do is data statistics, such as log analysis and data analysis. In the past we used databases for this, but as data volumes grow, analysis gets slower and slower. Hadoop solves this problem: it performs distributed computation and statistics over data that is written once and read many times. This is not only faster, but also more powerful, because the statistical process can use arbitrary code logic, while SQL is much more limited.
Big Data and Hadoop
Big data sounds mysterious, but from the concept above it is really just huge amounts of write-once, read-many data. The data is classified (modeled) and stored according to statistical requirements. Because the volume is so large, Hadoop provides a complete technical ecosystem to support it; that is, from data ingestion through computation to output, Hadoop offers an end-to-end solution.
The first application
Well, now that we know what Hadoop does, let's start writing our first program.
Requirements
Let's make up a requirement first. Rogue Rabbit Universe is a large supermarket chain. Each branch supermarket reports the sales data of each product to the parent company in the following format:
1843,44
1943,52
28443,35
223,35
Each line of the data file is one sales record for a product, separated by a comma: on the left of the comma is the product number, on the right is the sales volume for the current year. Because of the huge number of products sold across the whole universe, there are tens of billions of these data files. Rogue Rabbit decided to use Hadoop to analyze the data and obtain the maximum sales volume of each product.
Creating the project
1. Create a standard Maven project
2. Add the Hadoop dependency:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>
3. Set the packaging options:
<build>
    <finalName>demo</finalName>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <mainClass>com.sanlea.hadoop.demo.Entrance</mainClass>
                        <classpathPrefix>libs/</classpathPrefix>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>2.8</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>
                            ${project.build.directory}/libs/
                        </outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Note the mainClass configuration here, which points to the startup class.
Analysis and Implementation
1. Data processing
Because products are sold across the entire universe, each data file may contain records for the same product; in other words, the sales volumes of one product are scattered across different data files.
If we can read all the data files and convert the sales records into the format product -> [sales, sales, ...], we can easily compute the maximum sales volume of each product.
In Hadoop, a Mapper does this kind of work. Hadoop reads every line of every data file and passes each line to the Mapper. The Mapper parses the line and converts it into a key -> value pair; Hadoop then merges these pairs by key, finally producing data in the format key -> [value, value, ...].
Ok, let's write a Mapper here:
package com.sanlea.hadoop.demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class DataMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        // Each value is one line of a data file: "product,quantity"
        String line = value.toString();
        String[] parts = line.split(",");
        String product = parts[0];
        int quantity = Integer.parseInt(parts[1]);
        // Emit the product number as the key and the quantity as the value
        context.write(new Text(product), new IntWritable(quantity));
    }
}
We wrote a DataMapper class that extends Mapper. The generic parameters are:
- the type of the byte offset of each line (it can be ignored here)
- the type of each line's text
- the type of the output key
- the type of the output value
In DataMapper's map method, value represents one line of data. We split the line apart to obtain the product number and the sales quantity, then write the result to context, with the product number as the key and the quantity as the value.
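The parsing logic inside map can be tried out on its own, without a Hadoop cluster. Here is a minimal plain-Java sketch of that logic; the class and method names are made up for illustration and are not part of the Hadoop API:

```java
// A minimal sketch of the parsing done in DataMapper.map,
// runnable without any Hadoop dependencies.
public class ParseSketch
{
    // Extract the product number from a "product,quantity" line.
    static String product(String line)
    {
        return line.split(",")[0];
    }

    // Extract the sales quantity from a "product,quantity" line.
    static int quantity(String line)
    {
        return Integer.parseInt(line.split(",")[1]);
    }

    public static void main(String[] args)
    {
        String line = "1843,44";
        // prints "1843 -> 44"
        System.out.println(product(line) + " -> " + quantity(line));
    }
}
```

This is exactly the key -> value conversion described above, minus the Hadoop Writable wrappers.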
2. Statistics
From the Mapper we get the product sales in the format key -> [value, value, ...], and we can easily compute the maximum sales volume of each product: iterate over the values, take the maximum, and write it out as a key -> value result.
In Hadoop, the Reducer does the statistical processing of the data. Let's write this Reducer:
package com.sanlea.hadoop.demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DataReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        // values holds all sales quantities for one product;
        // keep the largest one.
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values)
        {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}
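The whole map -> group -> reduce flow can be simulated with plain Java collections. The sketch below is just the idea, not the Hadoop API; the class name MaxSketch and the sample lines are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxSketch
{
    // The "reduce" step: the maximum of the grouped values,
    // exactly what DataReducer computes.
    static int maxOf(List<Integer> values)
    {
        int max = Integer.MIN_VALUE;
        for (int v : values)
        {
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args)
    {
        // The "map" + shuffle steps: parse each line and
        // group quantities by product number.
        String[] lines = { "1843,44", "1943,52", "1843,35" };
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines)
        {
            String[] parts = line.split(",");
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(parts[1]));
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
        {
            System.out.println(e.getKey() + "\t" + maxOf(e.getValue()));
        }
    }
}
```

Hadoop does the same thing, except that the grouping (the shuffle) happens across many machines and the data never has to fit in one process's memory.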
3. Startup class
Well, we have written the Mapper for data processing and the Reducer for data statistics; how do we make them work together? First we create a job, then we run it:
package com.sanlea.hadoop.demo;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Entrance
{
    public static void main(String[] args) throws Exception
    {
        if (args.length < 2)
        {
            System.err.println("Usage: demo <input path> <output path>");
            System.exit(-1);
        }

        // Create the job and set its entry class and name
        Job job = new Job();
        job.setJarByClass(Entrance.class);
        job.setJobName("demo");

        // Input and output directories
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wire up the Mapper and Reducer
        job.setMapperClass(DataMapper.class);
        job.setReducerClass(DataReducer.class);

        // Types of the output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Run the job and exit when it completes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
- First of all, this is a standard Java program. It takes two arguments:
- the data input directory
- the data output directory
- Create a job, and set its entry class and name.
- Set the job's input and output directories.
- Set the job's Mapper and Reducer classes.
- Set the types of the job's output key and value.
- Run the job, and exit when it completes.
Running it
1. Packaging
mvn package
2. Run
hadoop jar target/demo.jar in out
3. View the results
The results of the run are in the out directory:
1981 82
1982 53
1983 64
The results show the maximum sales volume of each product.
Afterword
The code targets Hadoop 1.2.1, which is a bit old, mainly because the book I'm reading is the third edition of "Hadoop: The Definitive Guide". Hadoop's versioning is confusing, and the same class name may exist under completely different package names, so look carefully at the import statements of the example code above.
This code runs in standalone mode. Although it is not a real big-data computation, it has all the essential parts, enough to understand the underlying idea.
I will continue to update when I have time, heh.