1. Query
1.1 Row and element queries
Print the first 20 rows by default, like a SQL result set; show() accepts an int to specify how many rows to print:
df.show()
df.show(30)
Print the schema as a tree:
df.printSchema()
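For a hypothetical DataFrame with an integer column a and a string column b, the printed tree looks roughly like:
df.printSchema()
# root
#  |-- a: long (nullable = true)
#  |-- b: string (nullable = true)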
Fetch the first few rows to the driver (local):
rows = df.head(3)  # e.g. [Row(a=1, b=1), Row(a=2, b=2), ...]
rows = df.take(5)  # e.g. [Row(a=1, b=1), Row(a=2, b=2), ...]
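PySpark also offers first(), which returns a single Row rather than a list:
row = df.first()  # e.g. Row(a=1, b=1)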
Count the number of rows:
df.count()
Find rows where a column is null:
from pyspark.sql.functions import isnull
df = df.filter(isnull("col_a"))
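A runnable sketch of the null checks, assuming only a local SparkSession (the column names here are made up); isNotNull and dropna cover the complementary cases:
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnull

spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame([(1, "x"), (None, "y")], ["col_a", "col_b"])
df.filter(isnull("col_a")).show()           # rows where col_a is null
df.filter(df["col_a"].isNotNull()).show()   # rows where col_a is not null
df.dropna(subset=["col_a"]).show()          # drop rows whose col_a is null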
Collect everything into a local list, where each element is a Row:
rows = df.collect()  # Note: this pulls all data to the driver and returns a list of Rows
Summary statistics:
df.describe().show()
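describe() also accepts specific column names when only some columns are of interest (col_a here is a hypothetical numeric column):
df.describe("col_a").show()  # count, mean, stddev, min, max for col_a only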
De-duplication works like a Python set: distinct() removes duplicate rows, and .count() then gives the number of rows left:
data.select('columns').distinct().show()
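When duplicates should be judged on a subset of columns rather than whole rows, dropDuplicates accepts column names; a short sketch with hypothetical columns:
data.select('columns').distinct().count()  # number of distinct values
data.dropDuplicates(["col_a", "col_b"])    # keep one row per (col_a, col_b) combination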
Random sampling can be done in two ways: with a random number inside HIVE SQL, or with pyspark's sample():
# random number inside HIVE
sql = "select * from data order by rand() limit 2000"
# in pyspark
sample = result.sample(False, 0.5, 0)  # randomly select about 50% of the rows
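To run the HIVE-style query from pyspark itself, the DataFrame can first be registered as a temporary view; a sketch assuming a SparkSession named spark:
result.createOrReplaceTempView("data")
sampled = spark.sql("select * from data order by rand() limit 2000")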
1.2 Element operations
Get all the field names of a Row:
r = Row(age=11, name='Alice')
print(r.__fields__)  # ['age', 'name'] (a Row has no .columns attribute)
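A Row also converts to a plain Python dict, and its fields can be read by key or by attribute; a quick sketch:
r = Row(age=11, name='Alice')
print(r.asDict())       # {'age': 11, 'name': 'Alice'}
print(r['age'], r.age)  # 11 11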
Select one or more columns with select:
df["age"] df.age df.select(“name”) df.select(df[‘name’], df[‘age’]+1) df.select(df.a, df.b, df.c) # 选择a、b、c三列 df.select(df["a"], df["b"], df["c"]) # 选择a、b、c三列
select is overloaded, and where adds SQL-style conditions:
# show the id column together with an id + 1 column
jdbcDF.select(jdbcDF["id"], jdbcDF["id"] + 1).show(truncate=False)
# where can be used for conditional selection
jdbcDF.where("id = 1 or c1 = 'b'").show()
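The same condition can be expressed with Column objects instead of a SQL string (filter is an alias of where); a sketch reusing the jdbcDF names from above:
jdbcDF.filter((jdbcDF["id"] == 1) | (jdbcDF["c1"] == "b")).show()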
1.3 Sorting
orderBy and sort: sort by the specified fields, ascending by default:
train.orderBy(train.Purchase.desc()).show(5)
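Sorting on several fields with mixed directions works the same way; a sketch assuming a second, hypothetical Age column:
train.orderBy(train.Purchase.desc(), train.Age.asc()).show(5)
# equivalently, passing column names plus an ascending flag per column:
train.orderBy(["Purchase", "Age"], ascending=[0, 1]).show(5)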
1.4 Sampling
sample is the sampling function:
t1 = train.sample(False, 0.2, 42)
t2 = train.sample(False, 0.2, 43)
t1.count(), t2.count()
# Output: (109812, 109745)
withReplacement: True or False, whether rows are drawn with replacement. fraction: the fraction of rows to draw, e.g. 0.5 samples about 50% of the rows. The third argument is a random seed.
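Since the third argument is a seed, reusing the same seed reproduces the same sample; a quick check:
t3 = train.sample(False, 0.2, 42)
t1.count() == t3.count()  # True: same seed as t1, so the same rows are drawn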
2. Add and modify
2.1 Creating new data
There are two common ways to create new data: createDataFrame and .toDF():
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("MyFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# a pandas DataFrame converts to a Spark DataFrame (and back); pd.DataFrame() stands in for an existing pandas DataFrame
sqlContext.createDataFrame(pd.DataFrame())

a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(["ind", "state"])
a.show()
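On Spark 2.x and later the usual entry point is a SparkSession rather than SQLContext; a minimal equivalent sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyFirstApp").master("local").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["ind", "state"])
df.show()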
Reference: https://blog.csdn.net/sinat_26917383/article/details/80500349