PySpark daily notes

1 Table joins

  df1.join(df2, join condition, join type)

  For example: df1.join(df2, [df1.a == df2.a], "inner").show()

  Join type: a string such as "left"; commonly used values are inner, cross, outer, full, full_outer, left, left_outer, right, right_outer

  Join condition: df1["a"] == df2["a"], or "a", or df1.a == df2.a; for multiple conditions, use a list such as [df1["a"] == df2["a"], df1["b"] == df2["b"]], or combine with &, e.g. (df1.a == df2.a) & (df1.b == df2.b)

  It should be noted:

  If "a" is connected, the same fields are automatically combined, only one input. The df1.join (df2, "a", "left") of a field output only df1, df2 of a field is removed.

2 Using a UDF

  You need to add the following imports:

  from pyspark.sql.functions import udf
  from pyspark.sql import functions as F

  There are two ways:

  The first way:

  def get_tablename(a):

    return "name"

  get_tablename_udf = F.udf(get_tablename)

  The second way:

  @udf

  def get_tablename_udf(a):

    return "name"

  

  Both ways are called in the same way:

  df.withColumn("tablename", get_tablename_udf (df[a"]))

3 Grouping

  Use the groupBy() method

  Single field: df.groupBy("a") or df.groupBy(df.a)

  Multiple fields: df.groupBy([df.a, df.b]) or df.groupBy(["a", "b"])

  It should be noted:

  groupBy() must be followed by a method that produces the output columns, such as agg(), count(), etc., as in the sketch below.
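
  A minimal sketch (data and column names are illustrative; spark is an assumed existing SparkSession) of groupBy() followed by agg():

  from pyspark.sql import functions as F

  df = spark.createDataFrame([("x", 1), ("x", 3), ("y", 2)], ["a", "b"])
  df.groupBy("a").agg(F.count("b").alias("cnt"), F.sum("b").alias("b_sum")).show()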

4 Filtering

  Use filter() or where(); the two are equivalent.

  Single condition: df.filter(df.a > 1) or df.filter("a > 1")

  Multiple conditions: df.filter("a > 1 and b > 0") or df.filter((df.a > 1) & (df.b > 0))
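
  A minimal sketch (data are illustrative; spark is an assumed existing SparkSession); note that with Column expressions each comparison must be wrapped in parentheses, because & binds tighter than > and == in Python:

  df = spark.createDataFrame([(2, 0), (0, 0)], ["a", "b"])
  df.filter((df.a > 1) & (df.b == 0)).show()
  df.where("a > 1 and b = 0").show()   # same result with a SQL string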

5 Replacing null values

  Use the fillna() or na.fill() method

  df.fillna({"a":0, "b":""})

  df.na.fill({"a":0, "b":""})
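
  A minimal sketch (data are illustrative; spark is an assumed existing SparkSession); the dict gives a replacement value per column, and fillna() and na.fill() are equivalent:

  df = spark.createDataFrame([(None, "x"), (3, None)], "a int, b string")
  df.fillna({"a": 0, "b": ""}).show()   # same as df.na.fill({"a": 0, "b": ""})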

6 Sorting

  Use the orderBy() or sort() method

  df.orderBy(df.a.desc())

  df.orderBy(desc("age"), asc("name"))

       df.orderBy(["age", "name"], ascending=[0, 1])

  df.orderBy(["age", "name"], ascending=False)

  It should be noted:

  The default is ascending=True (ascending order); ascending=False sorts in descending order
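
  Note that desc() and asc() used above come from pyspark.sql.functions. A minimal sketch (data are illustrative; spark is an assumed existing SparkSession):

  from pyspark.sql.functions import desc, asc

  df = spark.createDataFrame([(30, "b"), (30, "a"), (25, "c")], ["age", "name"])
  df.orderBy(desc("age"), asc("name")).show()
  df.orderBy(["age", "name"], ascending=[0, 1]).show()   # 0 = descending, 1 = ascending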

7 Adding a new column

  Use the withColumn() or alias() method

  df.withColumn("b",F.lit(999))

  df.withColumn("b",df.a)

  df.withColumn("b",df.a).withColumn("m","m1")

  df.agg(F.lit(ggg).alias("b"))

  df.select(F.lit(ggg).alias("b"))

  It should be noted:

  The withColumn() method overwrites an existing column of the same name in the df
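
  A minimal sketch (data are illustrative; spark is an assumed existing SparkSession); the second argument of withColumn() must be a Column, so constants go through F.lit():

  from pyspark.sql import functions as F

  df = spark.createDataFrame([(1,), (2,)], ["a"])
  df.withColumn("b", F.lit(999)).withColumn("c", df.a * 2).show()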

8 Renaming a column

  Use the withColumnRenamed() method

  df.withColumnRenamed("a","a1").withColumnRenamed("m","m1") 

  It should be noted:

  Make sure the column to be renamed actually exists in the df; if it does not, withColumnRenamed() silently returns the df unchanged

9 Create a new DataFrame

  Use the createDataFrame() method

  spark.createDataFrame(data, column names), for example: spark.createDataFrame([(5, "hello")], ['a', 'b'])

  It should be noted:

   The number of column names must match the number of fields in each row of data

   spark is a SparkSession object, for example: spark = SparkSession.builder.master("local").appName("Word Count").config("spark.some.config.option", "some-value").getOrCreate()
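
  A minimal sketch putting the two steps together (the app name is illustrative):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
  df = spark.createDataFrame([(5, "hello"), (6, "world")], ["a", "b"])
  df.show()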

10 Merging result sets

  Use the union() or unionAll() method

  df.union(df1)

  It should be noted:

  Neither method removes duplicates on its own; if needed, chain distinct(), e.g. df.union(df1).distinct()

  Both methods combine the data by column position, not by column name

  Make sure the two result sets have the same number of columns
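
  A minimal sketch (data are illustrative; spark is an assumed existing SparkSession) showing that union() appends rows and that distinct() removes duplicates:

  df = spark.createDataFrame([(1, "x")], ["a", "b"])
  df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["a", "b"])
  df.union(df1).show()              # the duplicate (1, "x") row appears twice
  df.union(df1).distinct().show()   # duplicates removed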

11 Checking for NULL values

  Use the isNull() method or a SQL expression

  df.where(df["a"].isNull())

  df.where("a is null")

12 Conditional logic in calculations

  Use the when() method

  df.select(when(df.age == 2, 1).alias("age")) 

  The value of the age column: when the condition is met, it outputs 1; otherwise it outputs NULL

  Multiple conditions: when((df.age == 2) & (df.name == "name"), 1)
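
  when() comes from pyspark.sql.functions. A minimal sketch (data are illustrative; spark is an assumed existing SparkSession) showing the default NULL and the otherwise() branch:

  from pyspark.sql.functions import when

  df = spark.createDataFrame([(2, "name"), (5, "other")], ["age", "name"])
  df.select(when(df.age == 2, 1).alias("age")).show()   # non-matching rows become NULL
  df.select(when((df.age == 2) & (df.name == "name"), 1).otherwise(0).alias("flag")).show()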

  
