A Simple Introduction to KNN
The Meaning of KNN
KNN, short for K-Nearest Neighbors, is famous as one of the simplest classification methods in data mining.
So what does K mean? K is the number of nearest neighbors considered around the new sample. The new sample is then assigned the type that makes up the largest share of those K neighbors.
That is where the algorithm gets its name.
Example
The following picture is the classic illustration from Wikipedia.
As we can see, the green circle is the new sample we want to classify.
There are two kinds of shapes in the sample set: blue squares and red triangles (which, I must say, sound like the names of sports teams or organizations…).
Which group we put the green circle into depends on the value of K. As introduced above, we select the K nearest neighbors around the green circle.
- #1 K = 3
In this case, red triangles make up 2/3 of the neighbors, while blue squares make up 1/3. Thus, according to our algorithm, the green circle is classified as a red triangle.
- #2 K = 5
In this case, red triangles make up 2/5 of the neighbors, while blue squares make up 3/5. Thus the green circle is classified as a blue square.
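The voting rule above can be sketched as a small standalone Java program. This is a minimal illustration, separate from the MapReduce code later in this article; the method and label names are my own:

```java
import java.util.*;

public class MajorityVote {
    // Return the label that occurs most often among the k nearest neighbors.
    static String classify(List<String> nearestLabels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String label : nearestLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            // keep the label with the highest count seen so far
            if (best == null || c > counts.get(best)) {
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // K = 3: two red triangles, one blue square
        System.out.println(classify(Arrays.asList(
                "red triangle", "red triangle", "blue square"))); // red triangle
        // K = 5: two red triangles, three blue squares
        System.out.println(classify(Arrays.asList(
                "red triangle", "red triangle",
                "blue square", "blue square", "blue square"))); // blue square
    }
}
```

Note that ties are broken by whichever label reaches the top count first; a production implementation would need an explicit tie-breaking rule.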
Java Implementation
Basic Theory
Euclidean Distance
Generally we use the Euclidean distance to measure how far apart two samples are, with the expression:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

Other metrics can also be used, such as the Manhattan distance, which sums the absolute values of the coordinate differences instead of squaring them; as a consequence, there is more than one shortest route from one vector to another.
Manhattan Distance in Dimension 2
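Both metrics are easy to sketch in Java. The helper class below is illustrative only, not part of the MapReduce code that follows:

```java
public class Distances {
    // Euclidean distance: square root of the sum of squared coordinate differences.
    static double euclidean(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan distance: sum of the absolute coordinate differences.
    static double manhattan(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] p = {0, 0}, q = {3, 4};
        System.out.println(euclidean(p, q)); // 5.0
        System.out.println(manhattan(p, q)); // 7.0
    }
}
```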
Algorithm Process
The Mapper's job
step.0 Construct the data structure of a sample as KNNVector.
step.1 Read the existing samples into ArrayList<KNNVector> samples.
step.2 Use the method public void computeDistance(KNNVector knnv) of KNNVector to compute the Euclidean distance from each existing sample to the input sample.
step.3 Sort by distance and select the first k samples into ArrayList<KNNVector> kSamples.
The Reducer's job
step.4 Count the number of vectors of each type among the k nearest vectors.
step.5 Output the predicted type of the input sample according to the result of step.4.
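Steps 2 and 3, computing the distances and keeping only the k nearest, can also be sketched with a bounded max-heap instead of a sorted list. This is a standalone illustration, independent of the Hadoop code below:

```java
import java.util.*;

public class TopK {
    // Keep the k smallest distances using a max-heap capped at size k.
    static List<Double> kNearest(double[] distances, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (double d : distances) {
            heap.add(d);
            if (heap.size() > k) {
                heap.poll(); // drop the current farthest distance
            }
        }
        List<Double> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(kNearest(new double[]{5.0, 1.0, 3.0, 2.0, 4.0}, 3)); // [1.0, 2.0, 3.0]
    }
}
```

The heap approach costs O(n log k) instead of the O(n·k) of repeated sorted insertion, which matters once the sample set grows large.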
Code
Java Mapper
public static class KNNSelector extends Mapper<LongWritable, Text, Text, Text> {

    /**
     * Defines the data structure of a vector.
     */
    public class KNNVector {
        // An R^4 vector denoting the four skill counts of the given Pokemon
        public int[] skills = new int[4];
        // name of the Pokemon
        public String name = "";
        // type of the Pokemon
        public String type = "";
        // the Euclidean distance to the input sample
        public double euclideanDis = 0;

        // initialize from a line of data
        // data format: Pokemon's name,number of type1 skills,of type2,of type3,of type4[,type]
        public KNNVector(String data) {
            String[] fields = data.split(",");
            // field 0 is the Pokemon's name
            name = fields[0];
            // fields 1..4 are the four skill counts
            for (int i = 0; i < 4; i++) {
                skills[i] = Integer.parseInt(fields[i + 1].trim());
            }
            // the type field is optional: it is absent for the sample we want to classify
            if (fields.length > 5) {
                type = fields[5].trim();
            }
        }

        // compute the Euclidean distance to the input sample
        public void computeDistance(KNNVector sample) {
            double sumOfSquares = 0;
            for (int i = 0; i < 4; i++) {
                double diff = this.skills[i] - sample.skills[i];
                sumOfSquares += diff * diff;
            }
            this.euclideanDis = Math.sqrt(sumOfSquares);
        }
    }

    //**************************************** parameters ****************************************
    // existing data samples
    private ArrayList<KNNVector> samples = new ArrayList<KNNVector>();
    // the very k in KNN
    private int k = 5;

    //***************************************** set up *******************************************
    // Called once at the beginning of the task: read the existing samples
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSystem fs = null;
        try {
            fs = FileSystem.get(new URI("hdfs://master:9000/"), new Configuration());
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        FSDataInputStream in = fs.open(new Path("hdfs://master:9000/samples/samples.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String sample = reader.readLine();
        while (sample != null) {
            System.out.println(sample);
            samples.add(new KNNVector(sample));
            sample = reader.readLine();
        }
        reader.close();
    }

    //******************************************* map ********************************************
    // Called once for each key/value pair in the input split
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // the sample whose type we want to predict
        KNNVector newInput = new KNNVector(value.toString());
        // list kept sorted by ascending distance, used to select the k nearest neighbors
        ArrayList<KNNVector> kSamples = new ArrayList<KNNVector>();
        for (int i = 0; i < samples.size(); i++) {
            KNNVector knnv = samples.get(i);
            knnv.computeDistance(newInput);
            // insert at the position that keeps the list sorted
            int j = 0;
            while (j < kSamples.size() && kSamples.get(j).euclideanDis < knnv.euclideanDis) {
                j++;
            }
            kSamples.add(j, knnv);
        }
        // emit the first k entries of the sorted list
        for (int i = 0; i < k && i < kSamples.size(); i++) {
            KNNVector knnv = kSamples.get(i);
            System.out.println(knnv.name + ": " + knnv.euclideanDis);
            // output pairs are collected with calls to context.write(WritableComparable, Writable)
            context.write(new Text(newInput.name), new Text(knnv.type));
        }
    }
}
Java Reducer
/**
 * This is the reducer.
 *
 * There are 3 phases in the reducer:
 *
 * 1. Shuffle
 *
 * The framework copies the sorted output from each Mapper.
 *
 * 2. Sort
 *
 * The framework merge-sorts the Reducer inputs by key.
 *
 * 3. Reduce
 *
 * This is the phase we implement, by overriding:
 *
 * reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context)
 *
 * This method is called once for each <key, (collection of values)> in the sorted input.
 */
public static class KNNPredictor extends Reducer<Text, Text, Text, NullWritable> {

    //****************************************** reduce ******************************************
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // the predicted type; local so no state leaks between keys
        String type = "";
        int fireCounter = 0;
        int waterCounter = 0;
        int grassCounter = 0;
        int normalCounter = 0;
        int max = 0;
        for (Text t : values) {
            System.out.println(t.toString());
            switch (t.toString()) {
            case "Fire":
                fireCounter += 2;
                if (fireCounter > max) {
                    max = fireCounter;
                    type = "Fire";
                }
                break;
            case "Water":
                waterCounter += 2;
                if (waterCounter > max) {
                    max = waterCounter;
                    type = "Water";
                }
                break;
            case "Grass":
                grassCounter += 2;
                if (grassCounter > max) {
                    max = grassCounter;
                    type = "Grass";
                }
                break;
            case "Normal":
                // since almost every Pokemon can learn Normal type skills,
                // Normal votes only get half weight when counting
                normalCounter += 1;
                if (normalCounter > max) {
                    max = normalCounter;
                    type = "Normal";
                }
                break;
            }
        }
        context.write(new Text(key.toString() + "\t\t" + type), NullWritable.get());
    }
}
Java Job Configuration
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "KNN");
    // jar
    job.setJarByClass(KNNRunner.class);
    // set the map class
    job.setMapperClass(KNNSelector.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    // Note: no combiner is set. A combiner can cut down the data transferred from the
    // Mapper to the Reducer, but KNNPredictor cannot serve as one: its output types
    // (Text, NullWritable) do not match the map output types (Text, Text), and since
    // the framework may call a combiner zero or more times, a per-split majority vote
    // would change the final result.
    // set the reduce class
    job.setReducerClass(KNNPredictor.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}