Basic operation functions of DataFrame

Action operations

1. collect() returns an Array[Row] containing all rows of the DataFrame.
2. collectAsList() returns a Java List containing all rows of the DataFrame.
3. count() returns the number of rows in the DataFrame as a Long.
4. describe(cols: String*) returns a DataFrame of summary statistics (count, mean, stddev, min, max). It accepts multiple column names separated by commas; null values do not participate in the calculation, and only numeric columns are summarized. For example, df.describe("age", "height").show().
5. first() returns the first row as a Row.
6. head() returns the first row as a Row.
7. head(n: Int) returns the first n rows as an Array[Row].
8. show() prints the first 20 rows of the DataFrame by default; the return type is Unit.
9. show(n: Int) prints the first n rows; the return type is Unit.
10. take(n: Int) returns the first n rows as an Array[Row].
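A minimal sketch exercising these actions on a toy DataFrame (the sample data, column names, and the local SparkSession setup are assumptions for illustration, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    // Local session just for the demo; an existing session works the same way.
    val spark = SparkSession.builder().appName("actions").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(("Alice", 30, 165.0), ("Bob", 25, 180.0)).toDF("name", "age", "height")

    val rows = df.collect()             // Array[Row] with every row
    assert(rows.length == 2)
    assert(df.count() == 2L)            // count() returns a Long
    df.describe("age", "height").show() // count/mean/stddev/min/max, numeric columns only
    println(df.first())                 // first Row, same as head()
    df.show(1)                          // prints 1 row; returns Unit
    assert(df.take(1).length == 1)      // take(n) returns Array[Row]
    spark.stop()
  }
}
```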

Basic operations of DataFrame

1. cache() caches the DataFrame's data in memory.
2. columns returns an Array[String] holding the names of all columns.
3. dtypes returns an Array[(String, String)] pairing each column name with its type.
4. explain() prints the physical execution plan.
5. explain(extended: Boolean) takes true or false (default false) and returns Unit; passing true prints both the logical and the physical plan.
6. isLocal returns a Boolean: true if the run mode is local, otherwise false.
7. persist(newLevel: StorageLevel) returns DataFrame.this.type; its argument is the storage level to use.
8. printSchema() prints the column names and types as a tree.
9. registerTempTable(tableName: String) returns Unit; it registers the DataFrame as a temporary table, which is deleted along with the DataFrame object.
10. schema returns a StructType describing the column names and types.
11. toDF() returns a new DataFrame.
12. toDF(colNames: String*) returns a new DataFrame with the columns renamed to the given names.
13. unpersist() returns DataFrame.this.type and removes the cached data.
14. unpersist(blocking: Boolean) returns DataFrame.this.type; it does the same as unpersist(), with the flag controlling whether the call blocks until all cached blocks are removed.
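A short sketch exercising a few of the operations above (the sample data and session setup are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object BasicOpsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("basic-ops").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

    assert(df.columns.sameElements(Array("name", "age")))   // names of all columns
    df.dtypes.foreach { case (n, t) => println(s"$n: $t") } // (name, type) pairs
    df.printSchema()                                        // schema as a tree
    df.persist(StorageLevel.MEMORY_ONLY)                    // explicit storage level
    df.explain(true)                                        // logical + physical plans
    df.unpersist(blocking = true)                           // block until blocks are freed
    val renamed = df.toDF("person", "years")                // rename all columns at once
    assert(renamed.columns.contains("person"))
    spark.stop()
  }
}
```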

Integrated query:

1. agg(exprs: Column*) returns a DataFrame; used for aggregate calculations, e.g.
df.agg(max("age"), avg("salary"))
df.groupBy().agg(max("age"), avg("salary"))
2. agg(exprs: Map[String, String]) returns a DataFrame; same idea, e.g.
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
3. agg(aggExpr: (String, String), aggExprs: (String, String)*) returns a DataFrame; same idea, e.g.
df.agg("age" -> "max", "salary" -> "avg")
df.groupBy().agg("age" -> "max", "salary" -> "avg")
4. apply(colName: String) returns the Column for the given column name.
5. as(alias: String) returns a new DataFrame with the given alias.
6. col(colName: String) returns the Column for the given column name.
7. cube(col1: String, cols: String*) returns a GroupedData for aggregation over the given columns.
8. distinct returns a DataFrame with duplicate rows removed.
9. drop(col: Column) removes a column and returns a DataFrame.
10. dropDuplicates(colNames: Array[String]) removes rows that are duplicates on the given columns and returns a DataFrame.
11. except(other: DataFrame) returns a DataFrame with the rows of the current DataFrame that do not exist in the other one.
12. explode[A, B](inputColumn: String, outputColumn: String)(f: (A) => TraversableOnce[B]) returns a DataFrame; it splits one field into multiple rows, e.g.
df.explode("name", "names") { name: String => name.split(" ") }.show()
splits the name field on spaces and puts the pieces in names.
13. filter(conditionExpr: String) selects a subset of the rows and returns a DataFrame. df.filter("age > 10").show(), df.filter(df("age") > 10).show(), and df.where(df("age") > 10).show() all work.
14. groupBy(col1: String, cols: String*) groups by the given columns and returns a GroupedData, e.g. df.groupBy("age").agg(Map("age" -> "count")).show() or df.groupBy("age").avg().show().
15. intersect(other: DataFrame) returns a DataFrame with the rows that exist in both DataFrames.
16. join(right: DataFrame, joinExprs: Column, joinType: String)
The first argument is the DataFrame to join, the second is the join condition, and the third is the join type: inner, outer, left_outer, right_outer, leftsemi.
df.join(ds, df("name") === ds("name") && df("age") === ds("age"), "outer").show()
17. limit(n: Int) returns a DataFrame containing the first n rows.
18. na: DataFrameNaFunctions; its functions handle null values, e.g. df.na.drop().show() deletes rows containing nulls.
19. orderBy(sortExprs: Column*) sorts by the given expressions.
20. select(cols: Column*) selects columns, e.g. df.select($"colA", $"colB" + 1).
21. selectExpr(exprs: String*) selects columns using SQL expressions, e.g. df.selectExpr("name", "name as names", "upper(name)", "age + 1").show().
22. sort(sortExprs: Column*) sorts, e.g. df.sort(df("age").desc).show(); the default order is ascending.
23. unionAll(other: DataFrame) concatenates two DataFrames, e.g. df.unionAll(ds).show().
24. withColumnRenamed(existingName: String, newName: String) renames a column, e.g. df.withColumnRenamed("name", "names").show().
25. withColumn(colName: String, col: Column) adds a column, e.g. df.withColumn("aa", df("name")).show().
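A sketch combining several of the query operations above on toy data (the employee/department data and the local session are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object QuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("queries").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical employee data; note the deliberate duplicate row.
    val df = Seq(("Alice", 30, 5000), ("Bob", 25, 4000), ("Alice", 30, 5000))
      .toDF("name", "age", "salary")
    val ds = Seq(("Alice", 30, "HR")).toDF("name", "age", "dept")

    df.agg(max("age"), avg("salary")).show()               // whole-table aggregation
    df.groupBy("age").agg(Map("salary" -> "avg")).show()   // per-group aggregation
    assert(df.distinct().count() == 2L)                    // duplicate row removed
    df.filter("age > 26").show()                           // SQL-style predicate
    df.join(ds, df("name") === ds("name") && df("age") === ds("age"), "left_outer").show()
    df.select($"name", $"salary" + 1).show()               // column arithmetic
    df.sort(df("age").desc).show()                         // descending sort
    df.withColumnRenamed("name", "names").show()           // rename one column
    spark.stop()
  }
}
```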

Origin blog.csdn.net/weixin_44445958/article/details/88654931