Spark SQL: removing columns and rows with missing data

# The income column has too many missing values to be useful, so drop that column
df_miss_no_income = df_miss.select([c for c in df_miss.columns if c != 'income'])
df_miss_no_income.show()
+---+------+------+----+------+
| id|weight|height| age|gender|
+---+------+------+----+------+
|  1| 143.5|   5.6|  28|     M|
|  2| 167.2|   5.4|  45|     M|
|  3|  null|   5.2|null|  null|
|  4| 144.5|   5.9|  33|     M|
|  5| 133.2|   5.7|  54|     F|
|  6| 124.1|   5.2|null|     F|
|  7| 129.2|   5.3|  42|     M|
+---+------+------+----+------+
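
The list comprehension passed to select() only builds a list of column names. As a minimal plain-Python sketch of what it evaluates to (the column list here is an assumption: the five columns shown above plus 'income', whose real position in df_miss is not known from the output):

```python
# Hypothetical column list: the five columns in the table plus 'income'.
# Where 'income' sits in the real DataFrame is an assumption.
columns = ['id', 'weight', 'height', 'age', 'gender', 'income']

# Same comprehension as in the select() call: keep every name except 'income'.
kept = [c for c in columns if c != 'income']
print(kept)
```

PySpark's DataFrame also has a drop() method, so df_miss.drop('income') should give the same result without building the list by hand.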

To drop the observations (rows) instead, you can use the .dropna(...) method.

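# Some rows are also missing a lot of data, so drop those rows as well.
# thresh=3 drops any row that has fewer than 3 non-null values; its output is the table below.
df_miss_no_income.dropna(thresh=3).show()
# With no arguments, dropna() drops every row that contains at least one null,
# so it would also remove the row with id 6 (null age).
df_miss_no_income.dropna().show()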
+---+------+------+----+------+
| id|weight|height| age|gender|
+---+------+------+----+------+
|  1| 143.5|   5.6|  28|     M|
|  2| 167.2|   5.4|  45|     M|
|  4| 144.5|   5.9|  33|     M|
|  5| 133.2|   5.7|  54|     F|
|  6| 124.1|   5.2|null|     F|
|  7| 129.2|   5.3|  42|     M|
+---+------+------+----+------+
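
To make the difference between the two dropna calls concrete, here is a plain-Python sketch of the two rules applied to the rows of df_miss_no_income shown earlier. It only models the documented behaviour (thresh keeps rows with at least that many non-null values; the default drops rows containing any null). It is not how Spark implements it, and the helper names are made up for illustration.

```python
# Rows copied from the df_miss_no_income table shown earlier; None stands for null.
rows = [
    (1, 143.5, 5.6, 28, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, None, 5.2, None, None),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (6, 124.1, 5.2, None, 'F'),
    (7, 129.2, 5.3, 42, 'M'),
]

def non_null_count(row):
    """Count the values in a row that are not None."""
    return sum(value is not None for value in row)

def keep_with_thresh(rows, thresh):
    """Hypothetical helper: model of dropna(thresh=...), keeping rows with >= thresh non-null values."""
    return [row for row in rows if non_null_count(row) >= thresh]

def keep_complete(rows):
    """Hypothetical helper: model of dropna() with defaults, keeping rows with no None at all."""
    return [row for row in rows if all(value is not None for value in row)]

ids_thresh = [row[0] for row in keep_with_thresh(rows, 3)]
ids_complete = [row[0] for row in keep_complete(rows)]
print(ids_thresh)    # ids kept under the thresh=3 rule
print(ids_complete)  # ids kept under the any-null rule
```

Row 3 has only two non-null values (id and height), so both rules drop it. Row 6 has four, so the thresh=3 rule keeps it while the any-null rule drops it.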

Origin blog.csdn.net/wj1298250240/article/details/103945550