PySpark solves the connected graph problem

Previous review:

Installation and use of PySpark and GraphFrames: https://xxmdmst.blog.csdn.net/article/details/123009617

networkx quickly solves the connected graph problem: https://xxmdmst.blog.csdn.net/article/details/123012333

Earlier I covered the installation and use of the PySpark graph computing library, as well as two pure-Python solutions to the connected graph problem. In this article we continue with PySpark and work through the remaining connected graph problems.

Requirement 1: Find a community

Liu Bei and Guan Yu are related, so they belong to the same community; Liu Bei and Zhang Fei are also related, so Liu Bei, Guan Yu and Zhang Fei all belong to one community, and so on.

How do we solve this connected graph problem with PySpark?

First, we create the SparkSession object and set a checkpoint directory (GraphFrames' connectedComponents requires one):

from pyspark.sql import SparkSession, Row
from graphframes import GraphFrame

spark = SparkSession \
    .builder \
    .appName("PySpark") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext
# Set the checkpoint directory
sc.setCheckpointDir("checkpoint")

Then build the data:

data = [
    ['刘备', '关羽'],
    ['刘备', '张飞'],
    ['张飞', '诸葛亮'],
    ['曹操', '司马懿'],
    ['司马懿', '张辽'],
    ['曹操', '曹丕']
]
data = spark.createDataFrame(data, ["人员", "相关人员"])
data.show()
+------+--------+
|  人员|相关人员|
+------+--------+
|  刘备|    关羽|
|  刘备|    张飞|
|  张飞|  诸葛亮|
|  曹操|  司马懿|
|司马懿|    张辽|
|  曹操|    曹丕|
+------+--------+

The original data is already the edge data required for graph computation; we only need to rename the columns to src and dst:

edges = data.toDF("src", "dst")
edges.printSchema()
root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)

Let's start building the vertex data:

vertices = (
    edges.rdd.flatMap(lambda x: x)
            .distinct()
            .map(lambda x: Row(x))
            .toDF(["id"])
)
vertices.show()
+------+
|    id|
+------+
|诸葛亮|
|  刘备|
|  曹操|
|司马懿|
|  曹丕|
|  关羽|
|  张飞|
|  张辽|
+------+
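
As an aside (my own sketch, not the original approach), the same vertex set can also be built purely with the DataFrame API:

vertices = edges.select("src").union(edges.select("dst")).distinct().toDF("id")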

Now use Spark's graph computation to find the connected components:

g = GraphFrame(vertices, edges)
result = g.connectedComponents().orderBy("component")
result.show()
+------+------------+
|    id|   component|
+------+------------+
|司马懿|           0|
|  张辽|           0|
|  曹丕|           0|
|  曹操|           0|
|  关羽|635655159808|
|  刘备|635655159808|
|  张飞|635655159808|
|诸葛亮|635655159808|
+------+------------+

The members of each community are identified by their shared component id, so requirement 1 is solved.
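
As a possible follow-up (my own sketch, not part of the original), each community's member list can be collected into a single row:

from pyspark.sql import functions as F

result.groupBy("component").agg(F.collect_list("id").alias("members")).show(truncate=False)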

Requirement 2: Unified user identification

The five fields a, b, c, d, e represent unique identifiers such as MAC address, IP address, device_id, IMEI and so on, while tags holds the user's tags. For various reasons, every record of the same user is missing some of these identifier fields. The requirement is to recognize which records belong to the same user and merge each user's tags. The raw data is constructed below, and the desired result appears in the final output.

First, we construct the data:

df = spark.createDataFrame([
    ['a1', None, 'c1', None, None, 'tag1'],
    [None, None, 'c1', 'd1', None, 'tag2'],
    [None, 'b1', None, 'd1', None, 'tag3'],
    [None, 'b1', 'c1', 'd1', 'e1', 'tag4'],
    ['a2', 'b2', None, None, None, 'tag1'],
    [None, 'b4', 'c4', None, 'e4', 'tag1'],
    ['a2', None, None, 'd2', None, 'tag2'],
    [None, None, 'c2', 'd2', None, 'tag3'],
    [None, 'b3', None, None, 'e3', 'tag1'],
    [None, None, 'c3', None, 'e3', 'tag2'],
], list("abcde")+["tags"])
df.show()

result:

+----+----+----+----+----+----+
|   a|   b|   c|   d|   e|tags|
+----+----+----+----+----+----+
|  a1|null|  c1|null|null|tag1|
|null|null|  c1|  d1|null|tag2|
|null|  b1|null|  d1|null|tag3|
|null|  b1|  c1|  d1|  e1|tag4|
|  a2|  b2|null|null|null|tag1|
|null|  b4|  c4|null|  e4|tag1|
|  a2|null|null|  d2|null|tag2|
|null|null|  c2|  d2|null|tag3|
|null|  b3|null|null|  e3|tag1|
|null|null|  c3|null|  e3|tag2|
+----+----+----+----+----+----+

The idea is the same as last time: first assign a unique id to each row, then, for each identifier column, connect the rows that share the same value. The connections produced by all the identifier columns together serve as the edges for the graph computation.

The following uses the RDD's zipWithUniqueId method to generate a unique id for each row and moves that id to the front. Because this data will be reused several times later, we cache it:

tmp = df.rdd.zipWithUniqueId().map(lambda x: (x[1], x[0]))
tmp.cache()
tmp.first()
(0, Row(a='a1', b=None, c='c1', d=None, e=None, tags='tag1'))
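
A side note (my own, not from the original): zipWithUniqueId assigns ids that are unique but not consecutive, and it does so without launching an extra Spark job. If consecutive ids 0..n-1 are preferred, zipWithIndex is an alternative at the cost of one extra job:

tmp = df.rdd.zipWithIndex().map(lambda x: (x[1], x[0]))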

Build vertex data based on unique id:

vertices = tmp.map(lambda x: Row(x[0])).toDF(["id"])
vertices.show()
+---+
| id|
+---+
|  0|
|  1|
|  7|
|  2|
|  8|
|  3|
|  4|
| 10|
|  5|
| 11|
+---+

Next, build the edge data:

def func(p):
    # p iterates over (identifier value, row ids containing that value) pairs.
    # For every group with at least two rows, yield each pair of row ids as an edge.
    for k, ids in p:
        ids = list(ids)
        n = len(ids)
        if n <= 1:
            continue
        for i in range(n-1):
            for j in range(i+1, n):
                yield (ids[i], ids[j])


edges = []
keylist = list("abcde")
for key in keylist:
    # (identifier value, row id) pairs for the rows where this column is not empty;
    # key=key binds the current column name so the lambda does not rely on late binding.
    data = tmp.mapPartitions(
        lambda area, key=key: [(row[key], i) for i, row in area if row[key]])
    edgeRDD = data.groupByKey().mapPartitions(func)
    edges.append(edgeRDD)
edgesDF = sc.union(edges).toDF(["src", "dst"])
edgesDF.cache()
edgesDF.show()
+---+---+
|src|dst|
+---+---+
|  8|  4|
|  7|  2|
|  0|  1|
|  0|  2|
|  1|  2|
|  4| 10|
|  1|  7|
|  1|  2|
|  7|  2|
|  5| 11|
+---+---+

You can see that all the row-to-row relationships have been captured.
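
Note (my own observation): the edge list can contain duplicate pairs when two rows share more than one identifier, as with (1, 2) above. Duplicates do not change the connected components, but they can be dropped to reduce shuffle size:

edgesDF = edgesDF.distinct()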

Now use graph computation to work out which rows belong to the same user:

gdf = GraphFrame(vertices, edgesDF)
components = gdf.connectedComponents()
components.show()
+---+---------+
| id|component|
+---+---------+
|  0|        0|
|  1|        0|
|  7|        0|
|  2|        0|
|  8|        4|
|  3|        3|
|  4|        4|
| 10|        4|
|  5|        5|
| 11|        5|
+---+---------+

With each row id mapped to the identifier of the group it belongs to, we can join back to the original data and attach a component to every row:

# cogroup matches each row id with its component, then concatenates the original
# Row with a one-field Row holding the component value.
result = tmp.cogroup(components.rdd) \
    .map(lambda pair: pair[1][0].data[0] + Row(pair[1][1].data[0])) \
    .toDF(df.schema.names+["component"])
result.cache()
result.show()
+----+----+----+----+----+----+---------+
|   a|   b|   c|   d|   e|tags|component|
+----+----+----+----+----+----+---------+
|  a1|null|  c1|null|null|tag1|        0|
|null|null|  c1|  d1|null|tag2|        0|
|null|  b1|  c1|  d1|  e1|tag4|        0|
|null|  b4|  c4|null|  e4|tag1|        3|
|  a2|null|null|  d2|null|tag2|        4|
|null|  b3|null|null|  e3|tag1|        5|
|null|  b1|null|  d1|null|tag3|        0|
|  a2|  b2|null|null|null|tag1|        4|
|null|null|  c2|  d2|null|tag3|        4|
|null|null|  c3|null|  e3|tag2|        5|
+----+----+----+----+----+----+---------+
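
As an aside (my own sketch, not the original approach), the same association can be expressed with DataFrame joins instead of cogroup:

rows_with_id = tmp.map(lambda x: (x[0],) + tuple(x[1])).toDF(["id"] + df.columns)
result_df = rows_with_id.join(components, "id").drop("id")
result_df.show()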

Either way, the rows belonging to the same user have been identified. What remains is to group by component and merge the data, using pandas logic:

def func(pdf):
    # Backward-fill so the first row carries, for each identifier column, the first
    # non-null value found in the group, then concatenate all tags into one string.
    row = pdf[keylist].bfill().head(1)
    row["tags"] = pdf.tags.str.cat(sep=",")
    return row


result.groupBy("component").applyInPandas(
    func, schema="a string, b string, c string, d string, e string, tags string"
).show()
+----+---+---+----+----+-------------------+
|   a|  b|  c|   d|   e|               tags|
+----+---+---+----+----+-------------------+
|  a1| b1| c1|  d1|  e1|tag1,tag2,tag4,tag3|
|null| b4| c4|null|  e4|               tag1|
|  a2| b2| c2|  d2|null|     tag2,tag1,tag3|
|null| b3| c3|null|  e3|          tag1,tag2|
+----+---+---+----+----+-------------------+

It can be seen that the desired result has been successfully obtained.

Note: applyInPandas requires that the function return a pandas DataFrame, which is why the logic above uses .head(1) instead of .iloc[0].
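
To illustrate the difference (my own example, not from the original): head(1) preserves the DataFrame type that applyInPandas expects, while iloc[0] would return a Series.

import pandas as pd

pdf = pd.DataFrame({"a": ["a1", None], "tags": ["tag1", "tag2"]})
print(type(pdf.head(1)))  # <class 'pandas.core.frame.DataFrame'>
print(type(pdf.iloc[0]))  # <class 'pandas.core.series.Series'>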

If your Spark is not version 3.x and therefore has no applyInPandas method, falling back to the native RDD API takes a bit more work:

def func(pair):
    # pair is (component, iterable of rows belonging to that component).
    component, rows = pair
    keylist = list("abcde")
    ids = {}
    for row in rows:
        # Record a non-empty value for each identifier column and collect the tags.
        for key in keylist:
            v = getattr(row, key)
            if v:
                ids[key] = v
        ids.setdefault("tags", []).append(row.tags)
    result = []
    for key in keylist:
        result.append(ids.get(key))
    result.append(",".join(ids["tags"]))
    return result


result2 = result.rdd.groupBy(
    lambda row: row.component).map(func).toDF(df.schema)
result2.cache()
result2.show()

Same result:

+----+---+---+----+----+-------------------+
|   a|  b|  c|   d|   e|               tags|
+----+---+---+----+----+-------------------+
|  a1| b1| c1|  d1|  e1|tag1,tag2,tag4,tag3|
|null| b4| c4|null|  e4|               tag1|
|  a2| b2| c2|  d2|null|     tag2,tag1,tag3|
|null| b3| c3|null|  e3|          tag1,tag2|
+----+---+---+----+----+-------------------+

Origin: https://blog.csdn.net/as604049322/article/details/123036398