A Simple Introduction to KNN
The Meaning of KNN
KNN, short for K-Nearest Neighbors, is famous as one of the simplest classification methods in data mining.
So what does K mean? K is the number of nearest neighbors considered around the new sample. The new sample is then assigned the type that makes up the largest share of those K neighbors.
That is where the algorithm gets its name.
Example
The following picture is the classic illustration from Wikipedia.
As we can see, the green circle is the new sample we want to classify.
There are two kinds of shapes in the sample set: blue squares and red triangles (which, I must say, sound like the names of sports teams or organizations…).
Which group we put the green circle into depends on the value of K. As introduced above, we select the K nearest neighbors around the green circle.
- #1 K = 3
In this case, red triangles make up 2/3 of the neighbors, while blue squares make up 1/3. Thus, according to our algorithm, the green circle is classified as a red triangle.
- #2 K = 5
In this case, red triangles make up 2/5 of the neighbors, while blue squares make up 3/5. Thus the green circle is classified as a blue square.
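The voting rule above can be sketched as a small standalone Java program. This is a minimal illustration, separate from the MapReduce code later in this article; the method and label names are my own:

```java
import java.util.*;

public class MajorityVote {
    // Return the label that occurs most often among the k nearest neighbors.
    static String classify(List<String> nearestLabels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String label : nearestLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            // keep the label with the highest count seen so far
            if (best == null || c > counts.get(best)) {
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // K = 3: two red triangles, one blue square
        System.out.println(classify(Arrays.asList(
                "red triangle", "red triangle", "blue square"))); // red triangle
        // K = 5: two red triangles, three blue squares
        System.out.println(classify(Arrays.asList(
                "red triangle", "red triangle",
                "blue square", "blue square", "blue square"))); // blue square
    }
}
```

Note that ties are broken by whichever label reaches the top count first; a production implementation would need an explicit tie-breaking rule.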
Java Implementation
Basic Theory
Euclidean Distance
Generally we use the Euclidean distance to measure how far apart two samples are, with the expression:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

Other metrics can also be used, such as the Manhattan distance, which sums the absolute values of the coordinate differences instead of squaring them; as a consequence, there is more than one shortest route from one vector to another.
Manhattan Distance in Dimension 2
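Both metrics are easy to sketch in Java. The helper class below is illustrative only, not part of the MapReduce code that follows:

```java
public class Distances {
    // Euclidean distance: square root of the sum of squared coordinate differences.
    static double euclidean(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan distance: sum of the absolute coordinate differences.
    static double manhattan(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] p = {0, 0}, q = {3, 4};
        System.out.println(euclidean(p, q)); // 5.0
        System.out.println(manhattan(p, q)); // 7.0
    }
}
```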
Algorithm Process
The Mapper's job
step.0 Construct the data structure of a sample as KNNVector.
step.1 Read the existing samples into ArrayList<KNNVector> samples.
step.2 Use the method public void computeDistance(KNNVector knnv) of KNNVector to compute the Euclidean distance from each existing sample to the input sample.
step.3 Sort by distance and select the first k samples into ArrayList<KNNVector> kSamples.
The Reducer's job
step.4 Count the number of vectors of each type among the k nearest vectors.
step.5 Output the predicted type of the input sample according to the result of step.4.
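Steps 2 and 3, computing the distances and keeping only the k nearest, can also be sketched with a bounded max-heap instead of a sorted list. This is a standalone illustration, independent of the Hadoop code below:

```java
import java.util.*;

public class TopK {
    // Keep the k smallest distances using a max-heap capped at size k.
    static List<Double> kNearest(double[] distances, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (double d : distances) {
            heap.add(d);
            if (heap.size() > k) {
                heap.poll(); // drop the current farthest distance
            }
        }
        List<Double> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(kNearest(new double[]{5.0, 1.0, 3.0, 2.0, 4.0}, 3)); // [1.0, 2.0, 3.0]
    }
}
```

The heap approach costs O(n log k) instead of the O(n·k) of repeated sorted insertion, which matters once the sample set grows large.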
Code
Java Mapper
public static class KNNSelector extends Mapper<LongWritable, Text, Text, Text> {

    /**
     * Defines the data structure of a vector.
     */
    public class KNNVector {
        // An R^4 vector denoting the four skill counts of the given Pokemon
        public int[] skills = new int[4];
        // name of the Pokemon
        public String name = "";
        // type of the Pokemon
        public String type = "";
        // the Euclidean distance to the input sample
        public double euclideanDis = 0;

        // initialize from a line of data
        // data format: Pokemon's name,number of type1 skills,of type2,of type3,of type4[,type]
        public KNNVector(String data) {
            String[] fields = data.split(",");
            // field 0 is the Pokemon's name
            name = fields[0];
            // fields 1..4 are the four skill counts
            for (int i = 0; i < 4; i++) {
                skills[i] = Integer.parseInt(fields[i + 1].trim());
            }
            // the type field is optional: it is absent for the sample we want to classify
            if (fields.length > 5) {
                type = fields[5].trim();
            }
        }

        // compute the Euclidean distance to the input sample
        public void computeDistance(KNNVector sample) {
            double sumOfSquares = 0;
            for (int i = 0; i < 4; i++) {
                double diff = this.skills[i] - sample.skills[i];
                sumOfSquares += diff * diff;
            }
            this.euclideanDis = Math.sqrt(sumOfSquares);
        }
    }

    //**************************************** parameters ****************************************
    // existing data samples
    private ArrayList<KNNVector> samples = new ArrayList<KNNVector>();
    // the very k in KNN
    private int k = 5;

    //***************************************** set up *******************************************
    // Called once at the beginning of the task: read the existing samples
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSystem fs = null;
        try {
            fs = FileSystem.get(new URI("hdfs://master:9000/"), new Configuration());
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        FSDataInputStream in = fs.open(new Path("hdfs://master:9000/samples/samples.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String sample = reader.readLine();
        while (sample != null) {
            System.out.println(sample);
            samples.add(new KNNVector(sample));
            sample = reader.readLine();
        }
        reader.close();
    }

    //******************************************* map ********************************************
    // Called once for each key/value pair in the input split
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // the sample whose type we want to predict
        KNNVector newInput = new KNNVector(value.toString());
        // list kept sorted by ascending distance, used to select the k nearest neighbors
        ArrayList<KNNVector> kSamples = new ArrayList<KNNVector>();
        for (int i = 0; i < samples.size(); i++) {
            KNNVector knnv = samples.get(i);
            knnv.computeDistance(newInput);
            // insert at the position that keeps the list sorted
            int j = 0;
            while (j < kSamples.size() && kSamples.get(j).euclideanDis < knnv.euclideanDis) {
                j++;
            }
            kSamples.add(j, knnv);
        }
        // emit the first k entries of the sorted list
        for (int i = 0; i < k && i < kSamples.size(); i++) {
            KNNVector knnv = kSamples.get(i);
            System.out.println(knnv.name + ": " + knnv.euclideanDis);
            // output pairs are collected with calls to context.write(WritableComparable, Writable)
            context.write(new Text(newInput.name), new Text(knnv.type));
        }
    }
}
Java Reducer
/**
 * This is the reducer.
 *
 * There are 3 phases in the reducer:
 *
 * 1. Shuffle
 *
 * The framework copies the sorted output from each Mapper.
 *
 * 2. Sort
 *
 * The framework merge-sorts the Reducer inputs by key.
 *
 * 3. Reduce
 *
 * This is the phase we implement, by overriding:
 *
 * reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context)
 *
 * This method is called once for each <key, (collection of values)> in the sorted input.
 */
public static class KNNPredictor extends Reducer<Text, Text, Text, NullWritable> {

    //****************************************** reduce ******************************************
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // the predicted type; local so no state leaks between keys
        String type = "";
        int fireCounter = 0;
        int waterCounter = 0;
        int grassCounter = 0;
        int normalCounter = 0;
        int max = 0;
        for (Text t : values) {
            System.out.println(t.toString());
            switch (t.toString()) {
            case "Fire":
                fireCounter += 2;
                if (fireCounter > max) {
                    max = fireCounter;
                    type = "Fire";
                }
                break;
            case "Water":
                waterCounter += 2;
                if (waterCounter > max) {
                    max = waterCounter;
                    type = "Water";
                }
                break;
            case "Grass":
                grassCounter += 2;
                if (grassCounter > max) {
                    max = grassCounter;
                    type = "Grass";
                }
                break;
            case "Normal":
                // since almost every Pokemon can learn Normal type skills,
                // Normal votes only get half weight when counting
                normalCounter += 1;
                if (normalCounter > max) {
                    max = normalCounter;
                    type = "Normal";
                }
                break;
            }
        }
        context.write(new Text(key.toString() + "\t\t" + type), NullWritable.get());
    }
}
Java Job Configuration
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "KNN");
    // jar
    job.setJarByClass(KNNRunner.class);
    // set the map class
    job.setMapperClass(KNNSelector.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    // Note: no combiner is set. A combiner can cut down the data transferred from the
    // Mapper to the Reducer, but KNNPredictor cannot serve as one: its output types
    // (Text, NullWritable) do not match the map output types (Text, Text), and since
    // the framework may call a combiner zero or more times, a per-split majority vote
    // would change the final result.
    // set the reduce class
    job.setReducerClass(KNNPredictor.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}