Spark Structured Streaming: Splitting a Single Column into Multiple Columns

Foreword:
If you clicked on this article, you probably need to split one column into multiple columns. As mentioned in the earlier introduction to Spark Structured Streaming, structured stream programming comes with quite a few restrictions: many static DataFrame methods cannot be used on a streaming DataFrame, which causes a lot of confusion during development. This article explains how to turn one column into multiple columns.

1. Split function

Since one column has to become several, the split function is indispensable. Spark SQL provides split(str, pattern), which the source code introduces like this:

@since(1.5)
@ignore_unicode_prefix
def split(str, pattern):
    """
    Splits str around pattern (pattern is a regular expression).

    .. note:: pattern is a string represent the regular expression.

    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    [Row(s=[u'ab', u'cd'])]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))

As the source code shows, the function is quite powerful: it splits by regular expression, and after splitting, the string becomes an array of strings.
Let's run a quick test (some irrelevant code is omitted).
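First, a static stand-in for source_df, just so the example is reproducible. This is an assumption for illustration; in the original streaming job source_df would come from spark.readStream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split_demo").getOrCreate()
# A single row holding a comma-separated string, matching the output shown below
source_df = spark.createDataFrame([("a,b,c,d",)], ["value"])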

from pyspark.sql import functions as f

# source_df has the following structure:
source_df.show()
'''
+------------------+
|       value      |
+------------------+
|      a,b,c,d     |
+------------------+
'''
split_df = f.split(source_df.value, ",")
type(split_df)
# Check the type of the result we get after splitting:
# <class 'pyspark.sql.column.Column'>
# The return value is a Column; it has to be attached to a DataFrame before it can be displayed
# That can be done like this:
append_split_column = source_df.withColumn("split_value", split_df)
append_split_column.show()
'''
+------------------+-------------------+
|       value      |    split_value    |
+------------------+-------------------+
|      a,b,c,d     |   [a, b, c, d]    |
+------------------+-------------------+
'''

2. Get the items in the list

Above we only split the string in the column by a separator and stored the result in a single column as an array; what we actually want is each element in its own column. Among PySpark's built-in functions there is one called explode, which blows an array apart into multiple rows rather than multiple columns, so it is clearly not what we need here (a short sketch of it follows for contrast). Spark has no built-in method for splitting an array into multiple columns, so we have to do it ourselves.
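For contrast, here is a minimal sketch of what explode does with the same split result; every element ends up in its own row rather than its own column:

exploded_df = source_df.select(f.explode(f.split(source_df.value, ",")).alias("item"))
exploded_df.show()
'''
+----+
|item|
+----+
|   a|
|   b|
|   c|
|   d|
+----+
'''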
Now back to getItem. First, a single element:

add_one_column = split_df.getItem(0)
source_df.withColumn("one",add_one_column).show()

'''
+------------------+-------------+
|       value      |     one     |
+------------------+-------------+
|      a,b,c,d     |      a      |
+------------------+-------------+
'''

From the code above we can see that a single element has been pulled out of the array and appended to the original DataFrame with withColumn. Below we follow the same pattern and extract the remaining elements, chaining the calls together:

merge_df = source_df.withColumn("1",split_df.getItem(0)) \
					.withColumn("2",split_df.getItem(1)) \
					.withColumn("3",split_df.getItem(2)) \
					.withColumn("4",split_df.getItem(3)) \
					.drop("value")
merge_df.show()
'''
+-------+-------+-------+-------+
|   1   |   2   |   3   |   4   |
+-------+-------+-------+-------+
|   a   |   b   |   c   |   d   |
+-------+-------+-------+-------+
'''

Finally, we drop the original column, and the conversion of one column into multiple columns is complete. A more compact way to write the same thing is sketched below.
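When the number of fields is known in advance, the same result can be produced with select and a list comprehension; this is just an alternative sketch over the same source_df:

n = 4  # number of fields expected in each comma-separated string
split_col = f.split(source_df.value, ",")
merge_df = source_df.select([split_col.getItem(i).alias(str(i + 1)) for i in range(n)])
merge_df.show()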

3. Split the JSON string into multiple columns

So far we have covered the simplest case: a string with a fixed separator, split into multiple columns. Next, let's look at how to parse a JSON column in a DataFrame and split it into multiple columns.
Since we are at it, we may as well use a more complex JSON string so the approach covers different scenarios.
The data in the original column looks like this:

{
    "name": "zs",
    "age": 19,
    "interests": [
        "basketball",
        "football",
        "tennis"
    ],
    "edu": {
        "primary": {
            "name": "ttt",
            "graduation": 1587880706
        }
    }
}

Let's see how to parse such a format:

# As before, we need the built-in functions from Spark SQL
from pyspark.sql import functions as f

source_df.show()
# Truncated automatically here because of the display width
'''
+--------------------------------------------------------------------+
|                               value                                |
+--------------------------------------------------------------------+
| {"name": "zs", "age": 19, "interests": ["basketball", "footba....  | 
+--------------------------------------------------------------------+

'''

We need to use the from_json(col, schema, options={}) function in functions. Its source code:

@ignore_unicode_prefix
@since(2.1)
def from_json(col, schema, options={}):
    """
    Parses a column containing a JSON string into a :class:`MapType` with :class:`StringType`
    as keys type, :class:`StructType` or :class:`ArrayType` with
    the specified schema. Returns `null`, in the case of an unparseable string.

    :param col: string column in json format
    :param schema: a StructType or ArrayType of StructType to use when parsing the json column.
    :param options: options to control parsing. accepts the same options as the json datasource

    .. note:: Since Spark 2.3, the DDL-formatted string or a JSON format string is also
              supported for ``schema``.

    >>> from pyspark.sql.types import *
    >>> data = [(1, '''{"a": 1}''')]
    >>> schema = StructType([StructField("a", IntegerType())])
    >>> df = spark.createDataFrame(data, ("key", "value"))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=Row(a=1))]
    >>> df.select(from_json(df.value, "a INT").alias("json")).collect()
    [Row(json=Row(a=1))]
    >>> df.select(from_json(df.value, "MAP<STRING,INT>").alias("json")).collect()
    [Row(json={u'a': 1})]
    >>> data = [(1, '''[{"a": 1}]''')]
    >>> schema = ArrayType(StructType([StructField("a", IntegerType())]))
    >>> df = spark.createDataFrame(data, ("key", "value"))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=[Row(a=1)])]
    >>> schema = schema_of_json(lit('''{"a": 0}'''))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=Row(a=1))]
    >>> data = [(1, '''[1, 2, 3]''')]
    >>> schema = ArrayType(IntegerType())
    >>> df = spark.createDataFrame(data, ("key", "value"))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=[1, 2, 3])]
    """

    sc = SparkContext._active_spark_context
    if isinstance(schema, DataType):
        schema = schema.json()
    elif isinstance(schema, Column):
        schema = _to_java_column(schema)
    jc = sc._jvm.functions.from_json(_to_java_column(col), schema, options)
    return Column(jc)

The docstring makes it clear how to use the function.
First we need a schema (Schema) that matches the JSON string:

# Import all the types from Spark SQL
from pyspark.sql.types import *
# Create a schema matching the JSON string above
my_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("interests", ArrayType(StringType())),
    StructField("edu", StructType([
        StructField("primary", StructType([
            StructField("name", StringType()),
            StructField("graduation", TimestampType())
        ]))
    ]))
])
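As the docstring above notes, since Spark 2.3 the schema can also be given as a DDL-formatted string instead of a StructType. A rough equivalent of the schema above would look like this (a sketch; the types must still match the JSON):

my_schema_ddl = (
    "name STRING, age INT, interests ARRAY<STRING>, "
    "edu STRUCT<primary: STRUCT<name: STRING, graduation: TIMESTAMP>>"
)
# f.from_json(source_df.value, my_schema_ddl) would accept this string directly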

Once the schema is created, you only need to pass the JSON string column and the schema to the from_json() function.

json_column = f.from_json(source_df.value, my_schema)
# As with the string split earlier, what we get here is just a single column holding a JSON struct
# We still have to pull the fields out manually
# Since this is a demo, only a few important fields are extracted; the point is the method
# This time we add the columns a different way, extracting them with select
merge_df = source_df.select(
	json_column.getItem("name").alias("student_name"),
	json_column.getItem("age").alias("age"),
	json_column.getItem("edu").getItem("primary").getItem("name").alias("primary_name"),
	json_column.getItem("edu").getItem("primary").getItem("graduation").alias("primary_graduation"),
	json_column.getItem("interests").getItem(0).alias("interests1")
)
merge_df.show()
'''
+--------------+-----+--------------+--------------------+------------+
| student_name | age | primary_name | primary_graduation | interests1 |
+--------------+-----+--------------+--------------------+------------+
|      zs      | 19  |     ttt      |     1587880706     | basketball |
+--------------+-----+--------------+--------------------+------------+
'''
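If you want every top-level field of the parsed struct at once, a common shortcut is to expand the struct with a star select; a minimal sketch over the same column:

expanded_df = source_df.select(
    f.from_json(source_df.value, my_schema).alias("data")
).select("data.*")
# expanded_df now has the columns name, age, interests and edu (edu is still a nested struct)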

This completes turning a single column into multiple columns in Spark Structured Streaming. The core idea is to keep all of these operations on source_df, the input table of the stream, so that many of the static DataFrame methods remain usable.
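To make the streaming side explicit, here is a minimal end-to-end sketch; the socket source on localhost:9999 and the console sink are assumptions purely for illustration:

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("split_stream_demo").getOrCreate()

# Streaming input table with a single string column named "value"
stream_df = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

split_col = f.split(stream_df.value, ",")
result_df = stream_df.select([split_col.getItem(i).alias(str(i + 1)) for i in range(4)])

query = result_df.writeStream.outputMode("append").format("console").start()
query.awaitTermination()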


Origin blog.csdn.net/qq_42359956/article/details/105766771