PySpark basics

1. Queries

1.1 Row element query operations

Like SQL output, show prints the first 20 rows by default; an integer argument to show specifies the number of rows to print:

df.show()
df.show(30)

Print the schema in tree form:

df.printSchema()

Fetch the first few rows to the local driver:

list = df.head(3) # Example: [Row(a=1, b=1), Row(a=2, b=2), ... ...]
list = df.take(5) # Example: [Row(a=1, b=1), Row(a=2, b=2), ... ...]

Query the number of rows:

df.count()

Find rows where a given column is null:

from pyspark.sql.functions import isnull
df = df.filter(isnull("col_a"))
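
The complementary query, keeping only the rows where the column is not null, can be written with isNotNull (a minimal sketch, reusing the same hypothetical column name col_a):

df_not_null = df.filter(df["col_a"].isNotNull())  # keep rows where col_a is not null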

Output a Python list, where each element is a Row object:

list = df.collect()  # Note: this pulls all the data to the local driver and returns a list of Rows

Query summary statistics of the DataFrame:

df.describe().show()

Deduplication works like a Python set: distinct() removes duplicate values, and .count() can then compute the number of remaining rows.

data.select('columns').distinct().show()
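
To get the number of distinct values directly, count can be chained after distinct (a minimal sketch, reusing the same data DataFrame and column name as above):

n_unique = data.select('columns').distinct().count()  # number of distinct values
print(n_unique)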

There are two ways to do random sampling: one is to sample with a random number inside HIVE; the other is in PySpark itself.

# Random sampling inside HIVE
sql = "select * from data order by rand() limit 2000"

# In PySpark
sample = result.sample(False, 0.5, 0)  # randomly select 50% of the rows
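
To actually run the HIVE-style query from PySpark, the SQL string can be passed to sqlContext.sql (a minimal sketch; it assumes a SQLContext named sqlContext exists and that a table or temp view named data is registered):

result = sqlContext.sql(sql)  # execute the random-sampling SQL against the registered table
result.show(5)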

1.2 Element operations

Get all the column names of a Row:

from pyspark.sql import Row
r = Row(age=11, name='Alice')
print(list(r.asDict().keys()))  # ['age', 'name'] (a Row has no .columns attribute; use asDict() or __fields__)
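
For a whole DataFrame, rather than a single Row, the list of column names is available directly as df.columns (a minimal sketch, assuming df is the DataFrame from the earlier examples):

print(df.columns)  # e.g. ['age', 'name']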

 

Select one or more columns with select:

df["age"]
df.age
df.select(“name”)
df.select(df[‘name’], df[‘age’]+1)
df.select(df.a, df.b, df.c) # 选择a、b、c三列
df.select(df["a"], df["b"], df["c"]) # 选择a、b、c三列

Overloaded uses of select:

# Show the id column together with a derived id + 1 column
jdbcDF.select(jdbcDF["id"], jdbcDF["id"] + 1).show(truncate=False)

# select can also be combined with a where condition
jdbcDF.where("id = 1 or c1 = 'b'").show()

1.3 Sorting  

orderBy and sort: sort by the specified field(s); the default order is ascending.

train.orderBy(train.Purchase.desc()).show(5)
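
orderBy can also sort on several columns, with the direction given per column through the ascending argument (a minimal sketch; the second column name Age is only illustrative):

train.orderBy(["Purchase", "Age"], ascending=[False, True]).show(5)  # Purchase descending, Age ascending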

1.4 Sampling

sample is a sampling function

t1 = train.sample(False, 0.2, 42)
t2 = train.sample(False, 0.2, 43)
t1.count(), t2.count()
Output:
(109812, 109745)

withReplacement = True or False indicates whether sampling is done with or without replacement. fraction = x, e.g. x = 0.5, is the fraction of rows to sample.
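
As a further sketch of these parameters, sampling with replacement and a fixed seed (fraction 0.3 and seed 7 are arbitrary illustrative values):

t3 = train.sample(True, 0.3, 7)  # withReplacement=True, fraction=0.3, seed=7
t3.count()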

2. Adding and modifying data

2.1 Creating data

There are generally two ways to create new data: createDataFrame and .toDF().

sqlContext.createDataFrame(pd.DataFrame())  # converts a pandas DataFrame into a Spark DataFrame, so it can be used to convert between the two formats


from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', 'state'])

a.show()
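
Building on the setup above, a pandas DataFrame can be converted to a Spark DataFrame and back again (a minimal sketch; it assumes pandas is installed and reuses the sqlContext created above):

import pandas as pd

pdf = pd.DataFrame({'ind': [1, 2, 3], 'state': ['a', 'b', 'c']})
sdf = sqlContext.createDataFrame(pdf)  # pandas -> Spark DataFrame
sdf.show()
pdf2 = sdf.toPandas()                  # Spark DataFrame -> pandas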

  

Reference: https://blog.csdn.net/sinat_26917383/article/details/80500349