Hadoop study notes (1) Hello, World

I've known about Hadoop for a long time, but I never learned it. For one thing, I had no use for it at work; for another, I could never figure out how to write the first demo, and "Hadoop: The Definitive Guide (First Edition)" wrecked my confidence. That's all behind me now: work has pushed me into it, and I've finally made up my mind to learn it properly.

What can Hadoop do?

So far, all I know it for is data statistics: log analysis, data analysis, and so on. We used to do this kind of statistics with databases, but as the data volume grows, the analysis gets slower and slower. Hadoop solves this problem: it runs distributed computation and statistics over data that is written once and read many times, which is not only faster but also more powerful (the statistical logic can be expressed in code, whereas SQL is much more limited).

Big Data and Hadoop

Big data sounds very mysterious, but going by the idea above it is really just data at a huge scale that is written once and read many times. The data is classified (modeled) and stored according to the statistical requirements. Because the volume is so large, Hadoop provides a whole technical ecosystem to support it; that is, from data ingestion through computation to output, Hadoop offers a complete set of solutions.

First application

Well, now that we know what Hadoop does, let's start writing our first program.

Requirements

Let's invent a requirement first. Rogue Rabbit Universe is a large supermarket chain, and each branch supermarket reports the sales data of every product to the parent company in the following format:

1843,44
1943,52
28443,35
223,35

Each line of the data file is one sales record for a product, with the fields separated by a comma: the product number on the left, and that year's sales volume on the right. Because an enormous number of products are sold across the whole universe, there are tens of billions of these data files. Rogue Rabbit decided to use Hadoop to analyze the data and obtain the maximum sales volume of each product.
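
To make the record format concrete, here is a tiny standalone Java sketch that parses a single record of this form (an illustration only: the ParseDemo class name and the sample line are made up, and this class is not part of the Hadoop job):

public class ParseDemo
{
    public static void main(String[] args)
    {
        String record = "1843,44";                   // product number, sales volume
        String[] parts = record.split(",");
        String product = parts[0];                   // "1843"
        int quantity = Integer.parseInt(parts[1]);   // 44
        System.out.println(product + " sold " + quantity + " units this year");
    }
}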

Create project

1. Create a standard Maven project

2. Add the Hadoop dependency

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>

3. Set packaging options

<build>
    <finalName>demo</finalName>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <mainClass>com.sanlea.hadoop.demo.Entrance</mainClass>
                        <classpathPrefix>libs/</classpathPrefix>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>2.8</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>
                            ${project.build.directory}/libs/
                        </outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Note the mainClass configuration here, which points to the startup class.

Analysis and Implementation

1. Data processing

Because products are sold throughout the entire universe, different data files may contain records for the same product; in other words, one product's sales volumes are spread across many data files.

If we can read all the data files and convert the sales data into the form product -> [sales volume, sales volume, ...], we can easily work out the maximum sales volume of each product.

In Hadoop, the Mapper does this kind of work. Hadoop reads every line from all the data files and passes each line to the Mapper. The Mapper parses the line and converts it into a key -> value pair, and Hadoop then merges these pairs by key, finally producing data in the form key -> [value, value, ...].
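
To make the key -> [value, value, ...] idea concrete, here is a minimal plain-Java sketch of what conceptually happens to the Mapper's output (an illustration only, with made-up sample lines; Hadoop performs this grouping itself in a distributed shuffle, not with an in-memory HashMap):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleSketch
{
    public static void main(String[] args)
    {
        // Made-up sample records: the same product can appear in several lines.
        String[] lines = { "1843,44", "1843,52", "223,35" };

        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines)
        {
            String[] parts = line.split(",");                  // the "map" step: one line -> (key, value)
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(parts[1]));           // the merge step: collect values by key
        }

        System.out.println(grouped);   // e.g. {1843=[44, 52], 223=[35]}
    }
}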

Ok, let's write a Mapper here:

package com.sanlea.hadoop.demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class DataMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        // key is the byte offset of this line in the input file; value is the line itself.
        String line = value.toString();
        String[] parts = line.split(",");
        String product = parts[0];
        int quantity = Integer.parseInt(parts[1]);
        // Emit product number -> sales volume for this record.
        context.write(new Text(product), new IntWritable(quantity));
    }
}

We wrote a DataMapper class that inherits from Mapper; its generic parameters are:

  • the type of each line's position (its byte offset) in the input, which can be ignored here
  • the type of each line's text
  • the type of the output key
  • the type of the output value

In DataMapper's map method, value represents one line of data. We split the line to obtain the product number and the sales quantity, and then write the result to context, with the product number as the key and the quantity as the value.

2. Statistics

From the Mapper we get each product's sales volumes in the form key -> [value, value, ...], and from that it is easy to compute the product's maximum sales: iterate over the values, take the largest one, and write it to the result as key -> value.

In Hadoop, the Reducer does this statistical processing. Let's write the Reducer:

package com.sanlea.hadoop.demo;


import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DataReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        // values holds every sales volume collected for this product number.
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values)
        {
            max = Math.max(max, value.get());
        }
        // Emit product number -> maximum sales volume.
        context.write(key, new IntWritable(max));
    }
}

3. Startup class

Well, we have written the Mapper for data processing and the Reducer for the statistics, so how do we make them work together? We need to create a job and then run it:

package com.sanlea.hadoop.demo;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Entrance
{
    public static void main(String[] args) throws Exception
    {
        if (args.length < 2)
        {
            System.err.println(" Usage: demo <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Entrance.class);
        job.setJobName("demo");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DataMapper.class);
        job.setReducerClass(DataReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

  • First of all, this program is a standard Java program. It takes two arguments:
    • the data input directory
    • the data output directory
  • Create a job and set its entry class and name.
  • Set the job's input and output directories.
  • Set the job's Mapper and Reducer classes.
  • Set the types of the key and value the job outputs.
  • Run the job, wait for it to complete, and then exit.

Run

1. Packaging

mvn package

2. Run

hadoop jar target/demo.jar in out

3. View the results

The results of the run are in the out directory:

1981    82
1982    53
1983    64

From the results we can see the maximum sales volume of each product.

Afterword

The code targets Hadoop 1.2.1, which is a bit old, mainly because the book I'm reading is the third edition of "Hadoop: The Definitive Guide". Hadoop's versions are confusing: the same class name may exist under completely different package names, so look carefully at the import statements in the example code above.

This code runs in standalone (local) mode. It is not a real big-data computation, but small as it is, it has all the essential parts, which is enough to understand the core idea.

I will continue to update when I have time, hehe.
