1 Table join
df1.join(df2, join conditions, join type)
e.g.: df1.join(df2, [df1.a == df2.a], "inner").show()
Join type: a string such as "left"; commonly used values are inner, cross, outer, full, full_outer, left, left_outer, right, right_outer
Join conditions: df1["a"] == df2["a"], or "a", or df1.a == df2.a; multiple conditions can be given as a list, e.g. [df1["a"] == df2["a"], df1["b"] == df2["b"]], or combined with operators, e.g. (df1.a > 1) & (df2.b > 1)
It should be noted:
If "a" is connected, the same fields are automatically combined, only one input. The df1.join (df2, "a", "left") of a field output only df1, df2 of a field is removed.
2 UDF usage
You need to import:
from pyspark.sql.functions import udf
from pyspark.sql import functions as F
There are two ways:
The first way:
def get_tablename(a):
    return "name"
get_tablename_udf = F.udf(get_tablename)
The second way:
@udf
def get_tablename_udf(a):
    return "name"
Both ways are called in the same way:
df.withColumn("tablename", get_tablename_udf(df["a"]))
3 Grouping
Use the groupBy() method
Single field: df.groupBy("a") or df.groupBy(df.a)
Multiple fields: df.groupBy([df.a, df.b]) or df.groupBy(["a", "b"])
It should be noted:
groupBy must be followed by a method that produces output columns, such as agg(), count(), etc.
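A small sketch with made-up data (the sum aggregation is just an illustration):
from pyspark.sql import functions as F
df = spark.createDataFrame([("x", 1), ("x", 2), ("y", 3)], ["a", "b"])   # toy data
df.groupBy("a").agg(F.sum("b").alias("b_sum")).show()    # single field, aggregated with agg()
df.groupBy(["a", "b"]).count().show()                    # multiple fields, aggregated with count()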
4 Filtering
Use filter() or where(); the two are equivalent.
Single condition: df.filter(df.a > 1) or df.filter("a > 1")
Multiple conditions: df.filter("a > 1 and b > 0") or df.filter((df.a > 1) & (df.b == 0))
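A quick sketch with made-up data:
df = spark.createDataFrame([(1, 0), (2, 5)], ["a", "b"])   # toy data
df.filter(df.a > 1).show()                  # single Column condition
df.where("a > 1 and b > 0").show()          # SQL-style string condition
df.filter((df.a > 1) & (df.b == 0)).show()  # combined Column conditions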
5 Replacing null values
Use the fillna() or na.fill() method
df.fillna({"a":0, "b":""})
df.na.fill({"a":0, "b":""})
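For illustration, a small sketch with assumed null-containing data:
df = spark.createDataFrame([(None, "x"), (3, None)], ["a", "b"])   # toy data with nulls
df.fillna({"a": 0, "b": ""}).show()       # nulls in "a" become 0, nulls in "b" become ""
df.na.fill({"a": 0, "b": ""}).show()      # equivalent form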
6 Sorting
Use the orderBy() or sort() method
df.orderBy(df.a.desc())
df.orderBy(desc("age"), asc("name"))
df.orderBy(["age", "name"], ascending=[0, 1])
df.orderBy(["age", "name"], ascending=False)
It should be noted:
The default is ascending; ascending=True sorts ascending and ascending=False sorts descending
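A runnable sketch (desc() and asc() come from pyspark.sql.functions; the sample data is made up):
from pyspark.sql.functions import asc, desc
df = spark.createDataFrame([(30, "Bob"), (25, "Ann")], ["age", "name"])   # toy data
df.orderBy(df.age.desc()).show()
df.orderBy(desc("age"), asc("name")).show()
df.orderBy(["age", "name"], ascending=[0, 1]).show()   # 0 = descending, 1 = ascending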
7 Adding a new column
Use the withColumn() or alias() method
df.withColumn("b",F.lit(999))
df.withColumn("b",df.a)
df.withColumn("b",df.a).withColumn("m","m1")
df.agg(F.lit(ggg).alias("b"))
df.select(F.lit(ggg).alias("b"))
It should be noted:
withColumn overwrites any existing column in df with the same name
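A small sketch with made-up data (the literal values are arbitrary):
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,), (2,)], ["a"])                 # toy data
df.withColumn("b", F.lit(999)).show()                           # constant column
df.withColumn("b", df.a).withColumn("m", F.lit("m1")).show()    # copy a column, then add another
df.select(F.lit("ggg").alias("b")).show()                       # literal with an alias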
8 Renaming a column
Use the withColumnRenamed() method
df.withColumnRenamed("a","a1").withColumnRenamed("m","m1")
It should be noted:
Make sure the column being renamed actually exists in df
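A minimal sketch with made-up columns:
df = spark.createDataFrame([(1, 2)], ["a", "m"])    # toy data
df.withColumnRenamed("a", "a1").withColumnRenamed("m", "m1").printSchema()   # columns now a1, m1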
9 Creating a new DataFrame
Use the createDataFrame() method
spark.createDataFrame(data, column names), e.g.: spark.createDataFrame([(5, "hello")], ['a', 'b'])
It should be noted:
The number of column names must match the number of fields in each row of data
spark is a SparkSession object, e.g.: spark = SparkSession.builder.master("local").appName("Word Count").config("spark.some.config.option", "some-value").getOrCreate()
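Putting the lines above together into a runnable sketch:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count").config("spark.some.config.option", "some-value").getOrCreate()
df = spark.createDataFrame([(5, "hello")], ["a", "b"])
df.show()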
10 Merging result sets
Use the union() or unionAll() method
df.union(df1)
It should be noted:
Neither method removes duplicates on its own; if needed, chain distinct(), e.g.: df.union(df1).distinct()
Both methods combine the data sets by column position, not by column name
Make sure the two result sets have the same number of columns
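A short sketch with made-up data:
df = spark.createDataFrame([(1, "x")], ["a", "b"])               # toy data
df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["a", "b"])    # same column count and order
df.union(df1).distinct().show()    # rows combined by position; distinct() drops the duplicate row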
11 Checking for null values
Use the isNull() method or a SQL expression
df.where(df["a"].isNull())
df.where("a is null")
12 Adding conditional logic with when()
Use the when() method (imported from pyspark.sql.functions)
df.select(when(df.age == 2, 1).alias("age"))
The value of the age column: when the condition is satisfied the output is 1, otherwise the output is NULL
Multiple conditions: when((df.age == 2) & (df.name == "name"), 1)
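A small sketch with made-up data (when() is imported from pyspark.sql.functions):
from pyspark.sql.functions import when
df = spark.createDataFrame([(2, "name"), (5, "other")], ["age", "name"])   # toy data
df.select(when(df.age == 2, 1).alias("age")).show()                        # 1 when matched, NULL otherwise
df.select(when((df.age == 2) & (df.name == "name"), 1).alias("flag")).show()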