Spark ML Pipeline - Feature Transformation - StringIndexer

In the feature extraction and transformation stage of the Spark ML pipeline, there is a transformer called StringIndexer that encodes a string column of labels (such as the category names common in machine learning training data) into a column of numeric indices that algorithms can work with. The supported index range is [0, numLabels) (unseen labels, when kept, are encoded as numLabels), and four orderings are supported: frequencyDesc (the most frequent label is assigned index 0), frequencyAsc, alphabetDesc, and alphabetAsc.
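As a minimal sketch of choosing the ordering (assuming Spark 2.3 or later, where setStringOrderType is available; the column names simply mirror the example below):

import org.apache.spark.ml.feature.StringIndexer

// Sketch: order labels alphabetically so the alphabetically first label
// gets index 0; the default order is "frequencyDesc".
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabetAsc")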

 

Suppose we have the following DataFrame:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | a
 4  | a
 5  | c

 

Applying the indexer with category as the input column and categoryIndex as the output column:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0

“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.

 

When StringIndexer encounters a string that was not seen during fitting, there are three strategies for handling the new data, selected via setHandleInvalid as sketched after this list:

  • Throw an exception (the default)
  • Skip the row containing the unseen label
  • Place unseen labels in a special additional bucket
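
In code, the strategy is chosen with the handleInvalid parameter, whose values "error" (the default), "skip", and "keep" correspond to the three bullets above; a minimal sketch:

import org.apache.spark.ml.feature.StringIndexer

// Choose the strategy before fitting: "error" (default), "skip", or "keep"
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")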

 

If we apply the previously fitted StringIndexer to the following data:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | d
 4  | e

 

If no policy is set, or the policy is set to "error", an exception will be thrown; if setHandleInvalid("skip") is called instead, the rows containing d and e are skipped:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0

 

If setHandleInvalid("keep") is called, the following data is generated:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | d        | 3.0
 4  | e        | 3.0

Note: "d" or "e" where the rows are mapped to the index "3.0", keep unknown coding set, rather than continuing to encode

 

Scala code example:

import org.apache.spark.ml.feature.StringIndexer

// Create the DataFrame
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Create the indexer that produces the new index column
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// fit learns the label-to-index mapping from df, and transform applies it;
// since we transform the same table we fit on, nothing is thrown or skipped.
val indexed = indexer.fit(df).transform(df)
indexed.show()
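
As a follow-up sketch, the fitted model can be applied to the second DataFrame from above (the one containing the unseen labels d and e) to reproduce the skip and keep results; df and indexer here are the ones defined in the example, while df2 and model are illustrative names:

// New data containing labels "d" and "e" that were never seen during fit
val df2 = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"))
).toDF("id", "category")

val model = indexer.fit(df)

// "skip": the rows with d and e are dropped from the output
model.setHandleInvalid("skip").transform(df2).show()

// "keep": d and e are both mapped to the extra index 3.0 (numLabels)
model.setHandleInvalid("keep").transform(df2).show()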

 

See the detailed API documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer

 

ref: https://spark.apache.org/docs/latest/ml-features.html#stringindexer

 
