In the feature extraction and transformation stage of the Spark ML pipeline, there is a transformer called StringIndexer that maps a string column of labels (for example, class names in machine-learning training data) to a numeric index column that algorithms can consume. The indices fall in the range [0, numLabels) (unseen labels can optionally be encoded as numLabels), and four ordering options are supported: frequencyDesc (the most frequent label gets index 0, the default), frequencyAsc, alphabetDesc, and alphabetAsc.
Suppose we have the following DataFrame:
id | category
---|---------
0  | a
1  | b
2  | c
3  | a
4  | a
5  | c
Applying the indexer with category as the input column and categoryIndex as the output column yields:
id | category | categoryIndex
---|----------|--------------
0  | a        | 0.0
1  | b        | 2.0
2  | c        | 1.0
3  | a        | 0.0
4  | a        | 0.0
5  | c        | 1.0
"a" gets index 0 because it is the most frequent, followed by "c" with index 1 and "b" with index 2.
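The frequencyDesc ordering above can be sketched in plain Scala. This is an illustrative simulation of how indices get assigned, not Spark's actual StringIndexer implementation; the object and method names are hypothetical:

```scala
// Simulate StringIndexer's default frequencyDesc ordering:
// labels are sorted by descending frequency (ties broken alphabetically),
// and each label receives its position in that order as its index.
object FrequencyDescSketch {
  def buildLabelIndex(labels: Seq[String]): Map[String, Double] = {
    labels
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) } // most frequent first
      .zipWithIndex
      .map { case ((label, _), index) => (label, index.toDouble) }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val categories = Seq("a", "b", "c", "a", "a", "c")
    println(FrequencyDescSketch.buildLabelIndex(categories))
  }
}
```

Running this over the example column produces the same mapping as the table above: "a" (3 occurrences) gets 0.0, "c" (2 occurrences) gets 1.0, "b" (1 occurrence) gets 2.0.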
When StringIndexer encounters labels that were not seen during fitting, it offers three strategies for handling the new data:
- error: throw an exception (the default)
- skip: drop the rows containing unseen labels
- keep: assign unseen labels to a special extra index
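The three strategies can be illustrated with a plain-Scala sketch that applies a fitted label map to new data under each handleInvalid setting. Again, this simulates the behavior only; the object name and helper are hypothetical, not Spark API:

```scala
// Simulate StringIndexer.transform under the three handleInvalid policies:
// "error" throws on an unseen label, "skip" drops its row, and "keep"
// maps every unseen label to one extra index (the number of known labels).
object HandleInvalidSketch {
  def transform(labelToIndex: Map[String, Double],
                data: Seq[String],
                handleInvalid: String): Seq[Double] = handleInvalid match {
    case "error" =>
      data.map(label =>
        labelToIndex.getOrElse(label,
          throw new NoSuchElementException(s"Unseen label: $label")))
    case "skip" =>
      data.flatMap(labelToIndex.get) // None for unseen labels, so dropped
    case "keep" =>
      val unseenIndex = labelToIndex.size.toDouble
      data.map(labelToIndex.getOrElse(_, unseenIndex))
  }
}
```

With the fitted mapping a -> 0.0, c -> 1.0, b -> 2.0 and new data containing "d" and "e", "skip" returns only the three known indices, while "keep" maps both "d" and "e" to 3.0.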
If we apply the previously fitted StringIndexer to the following data:
id | category
---|---------
0  | a
1  | b
2  | c
3  | d
4  | e
If no policy is set, or the policy is set to "error", an exception is thrown. If setHandleInvalid("skip") is called instead, the rows containing "d" and "e" are skipped:
id | category | categoryIndex
---|----------|--------------
0  | a        | 0.0
1  | b        | 2.0
2  | c        | 1.0
If setHandleInvalid("keep") is called, the following data is generated:
id | category | categoryIndex
---|----------|--------------
0  | a        | 0.0
1  | b        | 2.0
2  | c        | 1.0
3  | d        | 3.0
4  | e        | 3.0
Note: "d" or "e" where the rows are mapped to the index "3.0", keep unknown coding set, rather than continuing to encode
Scala code example:
```scala
import org.apache.spark.ml.feature.StringIndexer

// Create the table
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Create the indexer for the new column
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// fit learns the label-to-index mapping from df; transform applies it.
// Here we transform the same table used for fitting, so there is no
// unseen label to throw on or skip.
val indexed = indexer.fit(df).transform(df)
indexed.show()
```
See the detailed API documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer
ref: https://spark.apache.org/docs/latest/ml-features.html#stringindexer