A common PySpark / Spark ML machine learning FAQ: removing nulls from the dataset or using handleInvalid = "keep" or "skip".

When developing Spark ML machine learning programs in Python, you may encounter an exception such as:

Caused by: org.apache.spark.SparkException: Encountered null while assembling a row with handleInvalid = "keep". Consider
removing nulls from dataset or using handleInvalid = "keep" or "skip".
 at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:287)
 at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:255)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
 at org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:255)
 at org.apache.spark.ml.feature.VectorAssembler$$anonfun$4.apply(VectorAssembler.scala:144)
 at org.apache.spark.ml.feature.VectorAssembler$$anonfun$4.apply(VectorAssembler.scala:143)
 ... 16 more

1. Analyzing the exception message:

"removing nulls from dataset or using handleInvalid = \"keep\" or \"skip\"." tells us that one of the feature columns in the dataset contains null values. Following the hint, we should either remove the null values from the dataset, or configure the parameter handleInvalid = "keep" or "skip".

First, a well-formed dataset is very important in the machine learning process. Data you crawl or obtain directly is rarely clean and standardized, so before training we need to pre-process it into a complete, well-specified training set. Typical pre-processing includes handling null values, converting string-typed feature values into numeric values, and deleting columns that are irrelevant to training.

2. Solutions:

1) Remove the null values from the dataset

2) Configure the handleInvalid parameter; the second method is used here to test the effect

  Parameter: handleInvalid = "skip" or "keep". "skip" means rows containing null values are skipped (dropped); "keep" means null values are kept.

The parameter is passed to the function that assembles the feature columns into a feature vector. Here we use skip mode:

vec_assembler = VectorAssembler(inputCols=feature_cols, outputCol='features', handleInvalid="skip")
Looking at the VectorAssembler source code:
@keyword_only
def __init__(self, inputCols=None, outputCol=None, handleInvalid="error"):
    """
    __init__(self, inputCols=None, outputCol=None, handleInvalid="error")
    """
    super(VectorAssembler, self).__init__()
    self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorAssembler", self.uid)
    self._setDefault(handleInvalid="error")
    kwargs = self._input_kwargs
    self.setParams(**kwargs)

As the source shows, this function defaults to handleInvalid = "error", meaning that if there is a problem with the feature column data (such as a null value), an error with the above message is thrown.


3. Re-run the test; the exception is resolved.

Origin www.cnblogs.com/mdlcw/p/11106334.html