PySpark Operations

Basic operation:

 

Get the Spark version number at runtime (using Spark 2.0.0 as an example):

from pyspark.sql import SparkSession

sparksn = SparkSession.builder.appName("PythonSQL").getOrCreate()

print(sparksn.version)

 

Get the Spark configuration (crossJoin settings, etc.):

df = spark.sql("SET -v")

df.show()

 

Show the full content of each column, without truncation:

df.show(truncate=False)

 

Creation and format conversion:

 

Convert between Pandas and Spark DataFrames:

pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)   # or spark.createDataFrame(pandas_df) in Spark 2.x

 

Conversion to and from a Spark RDD:

rdd_df = df.rdd
df = rdd_df.toDF()

Note: converting an RDD to a DataFrame with toDF() requires that every element of the RDD be of type Row.
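A minimal sketch, assuming an active SparkSession named spark (the column names and values are made up for illustration):

from pyspark.sql import Row

rdd = spark.sparkContext.parallelize([Row(name="Alice", age=11), Row(name="Bob", age=12)])
df = rdd.toDF()   # the schema is inferred from the Row fields
df.show()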

 

 

Add:

 

Add a new column:

df.withColumn("xx", 0).show() will raise an error, because the second argument must be a Column object; wrap literals with functions.lit():
 

from pyspark.sql import functions

df = df.withColumn("xx", functions.lit(0))
df.show()

 

fillna function:

df = df.na.fill(0)   # fill nulls with 0; a dict such as {'col_name': 0} also works

 

Add columns based on existing columns:

df = df.withColumn('count20', df["count"] - 20)   # the new column is the original "count" column minus 20

 

Add a sequential index (a sketch of converting the result back to a DataFrame follows below):

df.rdd.zipWithIndex()   # pairs each Row with its index: (Row, index)

Reference: stackoverflow
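A possible sketch of turning the resulting (Row, index) pairs back into a DataFrame (the column name "index" is an arbitrary choice for illustration):

from pyspark.sql import Row

indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: Row(index=pair[1], **pair[0].asDict()))
df_indexed = indexed_rdd.toDF()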

 

Delete:

 

Delete a column:

df.drop('age').collect()

df.drop(df.age).collect()

 

dropna function:

df = df.na.drop() # Drop any row that contains na in any column
df = df.dropna(subset=['col_name1', 'col_name2']) # Drop rows that contain null in col_name1 or col_name2

 

 

Update:

 

Overwrite all values of the existing df["xx"] column (again, literals must be wrapped with lit()):

df = df.withColumn("xx", functions.lit(1))

 

Modify the type of the column (type casting):

df = df.withColumn("year2", df["year1"].cast("Int"))

 

The join method to merge 2 tables:

 df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")

The join type can be: `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.

 

Aggregation with groupBy:

GroupedData = df.groupBy("age")

Apply a single function (group by column A, then aggregate column B within each group by its mean):

df.groupBy("A").avg("B").show()

 

Apply multiple functions:

from pyspark.sql import functions

df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()

Methods available on the resulting GroupedData object (each returns a DataFrame):

avg(*cols) – Calculate the average of one or more columns in each group

count() – Count the number of rows in each group; the returned DataFrame has two columns, the grouping key and the row count

max(*cols) – Calculate the maximum of one or more columns in each group

mean(*cols) – Calculate the mean of one or more columns in each group

min(*cols) – Calculate the minimum of one or more columns in each group

sum(*cols) – Calculate the sum of one or more columns in each group

 

[Function application] Apply a function f to every row of df:

df.foreach(f) or df.rdd.foreach(f)
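A minimal sketch, assuming the DataFrame has a name column; f is just an illustrative function, and its output appears on the executors rather than the driver:

def f(row):
    print(row.name)

df.foreach(f)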

 

[Function application] Apply a function f to every partition of df:

df.foreachPartition(f) or df.rdd.foreachPartition(f)

 

[Map and Reduce] In PySpark 2.x the DataFrame itself has no map/reduce methods, so go through the underlying RDD (map returns a new RDD, reduce returns a single value):

df.rdd.map(func)
df.rdd.reduce(func)

 

To resolve the toDF() exception that the types cannot be determined from the first 100 rows (which happens when Rows contain None values), convert every element inside each Row to a uniform format, or branch on the value's type; this fixes the error when converting data containing None into a DataFrame:

from pyspark.sql import Row

# In the original this was a @staticmethod on a helper class; shown here as a plain function.
def map_convert_none_to_str(row):
    dict_row = row.asDict()
    # Convert every value (except the excluded column) to a string,
    # mapping None to "" so that toDF() can infer a consistent type.
    for key in dict_row:
        if key != 'some_column_name':
            value = dict_row[key]
            if value is None:
                value_in = ""
            else:
                value_in = str(value)
            dict_row[key] = value_in

    columns = dict_row.keys()
    v = dict_row.values()
    row = Row(*columns)
    return row(*v)
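A possible way to apply it when building the DataFrame (raw_df is a hypothetical source DataFrame used for illustration):

clean_rdd = raw_df.rdd.map(map_convert_none_to_str)
clean_df = clean_rdd.toDF()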

 

 

Query:

Fix garbled Chinese output (Python 2.7 approach):

import sys

reload(sys)

sys.setdefaultencoding('utf-8')

 

Row-level query operations:

Print the first 20 rows, as SQL would (pass an int to show() to specify how many rows to print):

df.show()
df.show(30)

 

Print the schema as a tree:

df.printSchema()

 

Fetch the first few rows to the driver:

list = df.head(3)   # Example: [Row(a=1, b=1), Row(a=2, b=2), ... ...]
list = df.take(5)   # Example: [Row(a=1, b=1), Row(a=2, b=2), ... ...]

 

Return a Python list in which each element is a Row:

list = df.collect()

Note: this method pulls all of the data to the driver (local machine).

 

Get the total number of rows:

 int_num = df.count()

 

Select rows where a given column is null:

from pyspark.sql.functions import isnull
df = df.filter(isnull("col_a"))

 

 

Column-level operations:

Get all field names of a Row:

from pyspark.sql import Row

r = Row(age=11, name='Alice')
print(r.__fields__)    # ['age', 'name']

 

Select one or more columns:

df.select("name")

df.select(df['name'], df['age'] + 1)

df.select(df.a, df.b, df.c)    # select columns a, b, and c

df.select(df["a"], df["b"], df["c"])    # select columns a, b, and c

 

Sort:

df = df.sort("age", ascending=False)

 

Filter data (the filter and where methods are equivalent):

df = df.filter(df['age']>21)
df = df.where(df['age']>21)

# Filter on null or NaN values:
from pyspark.sql.functions import isnan, isnull
df = df.filter(isnull("a"))  # keep rows where column a is null (Python's None)
df = df.filter(isnan("a"))   # keep rows where column a is NaN (Not a Number)

 

 

SQL operations:

Register a DataFrame as a SQL temporary view:

df.createOrReplaceTempView("TBL1")

 

Run a SQL query (returns a DataFrame):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
ss = SparkSession.builder.appName("APP_NAME").config(conf=conf).getOrCreate()

df = ss.sql("SELECT name, age FROM TBL1 WHERE age >= 13 AND age <= 19")

 

Operations on a Hive database:

First, enable Hive support:

spark = SparkSession.builder.appName("app1").enableHiveSupport().getOrCreate() 

Then use an INSERT statement instead of the write.parquet command to write into the Hive table:

df = spark.read.parquet(some_path)

df.createOrReplaceTempView("some_df_tmp_table")

sql_content = "INSERT INTO `some_schema`.`some_hive_table` " \
              "SELECT a, b, c FROM some_df_tmp_table"

spark.sql(sql_content)

 

 

 

Time-series operations:

 

Group by several columns first, then by a time window:

from pyspark.sql.functions import window
 

win_monday = window("col1", "1 week", startTime="4 day")   # col1 must be a timestamp/date column

GroupedData = df.groupBy([df.col2, df.col3, df.col4, win_monday])
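A possible follow-up aggregation on the grouped data (the column names are the placeholders used above):

df_weekly = df.groupBy(df.col2, df.col3, df.col4, win_monday).count()
df_weekly.show(truncate=False)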

 

 

 

 

References:

Comparison of DataFrames in Spark and Pandas (detailed)

Making MySQL queries more than 10x faster with Apache Spark

Traditional MySQL query (execution time: 19 min 16.58 sec):

mysql> 

 

SELECT
    MIN(yearD),
    MAX(yearD) AS max_year,
    Carrier,
    COUNT(*) AS cnt,
    SUM(IF(ArrDelayMinutes > 30, 1, 0)) AS flights_delayed,
    ROUND(SUM(IF(ArrDelayMinutes > 30, 1, 0)) / COUNT(*), 2) AS rate
FROM
    ontime_part
WHERE
    DayOfWeek NOT IN (6, 7)
    AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
    AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
GROUP BY carrier
HAVING cnt > 1000 AND max_year > '1990'
ORDER BY rate DESC, cnt DESC
LIMIT 10;

 

The same query in Spark, written in Scala (execution time: 2 min 19.628 sec):

scala>

val jdbcDF = spark.read.format("jdbc").options(
    Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
        "dbtable" -> "ontime.ontime_sm",
        "fetchSize" -> "10000",
        "partitionColumn" -> "yeard",
        "lowerBound" -> "1988",
        "upperBound" -> "2015",
        "numPartitions" -> "48")).load()

jdbcDF.createOrReplaceTempView("ontime")

// A multi-line SQL string needs triple quotes in Scala
val sqlDF = sql("""
    SELECT
        MIN(yearD),
        MAX(yearD) AS max_year,
        Carrier,
        COUNT(*) AS cnt,
        SUM(IF(ArrDelayMinutes > 30, 1, 0)) AS flights_delayed,
        ROUND(SUM(IF(ArrDelayMinutes > 30, 1, 0)) / COUNT(*), 2) AS rate
    FROM
        ontime_part
    WHERE
        DayOfWeek NOT IN (6, 7)
        AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
        AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
    GROUP BY carrier
    HAVING cnt > 1000 AND max_year > '1990'
    ORDER BY rate DESC, cnt DESC
    LIMIT 10
""")

sqlDF.show()

 

 

Detailed explanation of the concepts behind map, reduce, and related operations on Spark RDDs:

 

map passes each element of the RDD through the given function and returns the results as a new RDD; each element is processed independently, without affecting the other elements or the total count. It is a one-to-one relationship (one output element per input element).
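A small PySpark illustration of the one-to-one relationship (assuming an active SparkSession named spark):

rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.map(lambda x: x * 10).collect())   # [10, 20, 30] (still 3 elements)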

 

flatMap processes each element of the RDD into a list of one or more results and then flattens those lists, so the total number of elements stays the same or grows. It is a one-to-many relationship (one input element can produce several output elements).
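A small PySpark illustration of how flatMap flattens the per-element lists:

rdd = spark.sparkContext.parallelize(["a b", "c d e"])
print(rdd.flatMap(lambda s: s.split(" ")).collect())   # ['a', 'b', 'c', 'd', 'e']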

 

reduce passes the first two elements of the RDD to the input function, producing a new value; that value and the next element of the RDD (the third one) are then passed to the function again, and this repeats until only a single value remains. It is a many-to-one relationship.

val c = sc.parallelize(1 to 10)
c.reduce((x, y) => x + y)  // result: 55

 

reduceByKey(binary_function) 

reduceByKey applies binary_function to the values of all elements of a key-value RDD that share the same key; the multiple values for each key are reduced to a single value, which is then paired with the original key to form a new key-value pair. It is a many-to-fewer relationship.

val a = sc.parallelize(List((1,2),(1,3),(3,4),(3,6)))

a.reduceByKey((x,y) => x + y).collect
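For reference, a rough PySpark equivalent of the Scala reduceByKey example above:

a = spark.sparkContext.parallelize([(1, 2), (1, 3), (3, 4), (3, 6)])
print(a.reduceByKey(lambda x, y: x + y).collect())   # [(1, 5), (3, 10)] (order may vary)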
