Getting Started with Spark Basics

  • Environment setup
    1. Local
    2. Standalone
    3. HA
  • Spark code
    1. Spark Core
    2. Spark SQL
    3. Spark Streaming

Environment setup

Preparation

Create installation directory

mkdir /opt/soft
cd /opt/soft

Download scala

wget https://downloads.lightbend.com/scala/2.13.12/scala-2.13.12.tgz -P /opt/soft

Unzip scala

tar -zxvf scala-2.13.12.tgz

Rename the Scala directory

mv scala-2.13.12 scala-2

Download spark

wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3-scala2.13.tgz -P /opt/soft

Unzip spark

tar -zxvf spark-3.5.0-bin-hadoop3-scala2.13.tgz 

Rename the Spark directory

mv spark-3.5.0-bin-hadoop3-scala2.13 spark-3

Configure environment variables

vim /etc/profile.d/my_env.sh
export JAVA_HOME=/opt/soft/jdk-8

export ZOOKEEPER_HOME=/opt/soft/zookeeper-3

export HDFS_NAMENODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_ZKFC_USER=root
export HDFS_JOURNALNODE_USER=root
export HADOOP_SHELL_EXECNAME=root

export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

export HADOOP_HOME=/opt/soft/hadoop-3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export HBASE_HOME=/opt/soft/hbase-2

export PHOENIX_HOME=/opt/soft/phoenix

export HIVE_HOME=/opt/soft/hive-3
export HCATALOG_HOME=/opt/soft/hive-3/hcatalog

export HCAT_HOME=/opt/soft/hive-3/hcatalog
export SQOOP_HOME=/opt/soft/sqoop-1

export FLUME_HOME=/opt/soft/flume

export SCALA_HOME=/opt/soft/scala-2

export SPARK_HOME=/opt/soft/spark-3
export SPARKPYTHON=/opt/soft/spark-3/python

export PATH=$PATH:$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PHOENIX_HOME/bin:$HIVE_HOME/bin:$HCATALOG_HOME/bin:$HCATALOG_HOME/sbin:$HCAT_HOME/bin:$SQOOP_HOME/bin:$FLUME_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SPARKPYTHON
source /etc/profile
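
To confirm that the environment variables have taken effect, the installed versions can be checked (the exact output depends on the versions installed):

scala -version
spark-submit --version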

Local mode

spark-shell (Scala)

Start up:

spark-shell

Web UI: http://spark01:4040

Quit:

:quit

pyspark (Python)

Start up:

pyspark

Web UI: http://spark01:4040

Quit:

quit() or Ctrl-D

Submit application in local mode

Execute in the Spark installation directory

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[4] \
./examples/jars/spark-examples_2.13-3.5.0.jar \
10
  1. --class specifies the main class of the application to run; replace it with your own application class.
  2. --master local[4] sets the deployment mode; the default is local mode, and the number in brackets is the number of virtual CPU cores to allocate.
  3. spark-examples_2.13-3.5.0.jar is the jar that contains the application class; in practice, use your own jar (see the example below).
  4. The trailing 10 is the program argument, which here sets the number of tasks for the application.
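
For reference, submitting a user-built jar in local mode might look like the following sketch (the class and jar names assume the spark-code project built later in this article):

spark-submit \
--class cn.lihaozhe.chap03.ScalaWordCount02 \
--master local[4] \
spark-code.jar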

Standalone mode

Write core configuration file

Under the conf directory

cd /opt/soft/spark-3/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
export JAVA_HOME=/opt/soft/jdk-8
export HADOOP_HOME=/opt/soft/hadoop-3
export HADOOP_CONF_DIR=/opt/soft/hadoop-3/etc/hadoop
export JAVA_LIBRAY_PATH=/opt/soft/hadoop-3/lib/native
export SPARK_DIST_CLASSPATH=$(/opt/soft/hadoop-3/bin/hadoop classpath)

export SPARK_MASTER_HOST=spark01
export SPARK_MASTER_PORT=7077

export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_CORES=4
export SPARK_MASTER_WEBUI_PORT=6633

Edit the workers file

cp workers.template workers
vim workers
spark01
spark02
spark03

Configure history log

cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://lihaozhe/spark-log
hdfs dfs -mkdir /spark-log
vim spark-env.sh
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 
-Dspark.history.retainedApplications=30 
-Dspark.history.fs.logDirectory=hdfs://lihaozhe/spark-log"

Rename the startup scripts (to avoid clashing with Hadoop's start-all.sh and stop-all.sh)

mv sbin/start-all.sh sbin/start-spark.sh
mv sbin/stop-all.sh sbin/stop-spark.sh

Distribute to the other nodes

scp -r /opt/soft/spark-3 root@spark02:/opt/soft
scp -r /opt/soft/spark-3 root@spark03:/opt/soft
scp /etc/profile.d/my_env.sh root@spark02:/etc/profile.d
scp /etc/profile.d/my_env.sh  root@spark03:/etc/profile.d

Refresh environment variables on the other nodes

source /etc/profile

start up

start-spark.sh
start-history-server.sh

Web UI

Spark master web UI: http://spark01:6633

Spark history server: http://spark01:18080

Submit jobs to the cluster

spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://spark01:7077 \
./examples/jars/spark-examples_2.13-3.5.0.jar \
10

Submit jobs to Yarn

spark-submit \
--master yarn \
--class  org.apache.spark.examples.SparkPi \
./examples/jars/spark-examples_2.13-3.5.0.jar \
10

HA mode

Write core configuration file

Under the conf directory

cd /opt/soft/spark-3/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
export JAVA_HOME=/opt/soft/jdk-8
export HADOOP_HOME=/opt/soft/hadoop-3
export HADOOP_CONF_DIR=/opt/soft/hadoop-3/etc/hadoop
export JAVA_LIBRAY_PATH=/opt/soft/hadoop-3/lib/native
export SPARK_DIST_CLASSPATH=$(/opt/soft/hadoop-3/bin/hadoop classpath)

SPARK_DAEMON_JAVA_OPTS="
-Dspark.deploy.recoveryMode=ZOOKEEPER 
-Dspark.deploy.zookeeper.url=spark01:2181,spark02:2181,spark03:2181 
-Dspark.deploy.zookeeper.dir=/spark"

export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=8
export SPARK_MASTER_WEBUI_PORT=6633

Edit the workers file

cp workers.template workers
vim workers
spark01
spark02
spark03

Configure history log

cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://lihaozhe/spark-log
hdfs dfs -mkdir /spark-log
vim spark-env.sh
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 
-Dspark.history.retainedApplications=30 
-Dspark.history.fs.logDirectory=hdfs://lihaozhe/spark-log"

Rename the startup scripts (to avoid clashing with Hadoop's start-all.sh and stop-all.sh)

mv sbin/start-all.sh sbin/start-spark.sh
mv sbin/stop-all.sh sbin/stop-spark.sh

Distribute to the other nodes

scp -r /opt/soft/spark-3 root@spark02:/opt/soft
scp -r /opt/soft/spark-3 root@spark03:/opt/soft
scp /etc/profile.d/my_env.sh root@spark02:/etc/profile.d
scp /etc/profile.d/my_env.sh  root@spark03:/etc/profile.d

Refresh environment variables on the other nodes

source /etc/profile

start up

start-spark.sh
start-history-server.sh

Web UI

Spark master web UI: http://spark01:6633

Spark history server: http://spark01:18080

Submit jobs to the cluster

spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://spark01:7077 \
./examples/jars/spark-examples_2.13-3.5.0.jar \
10

Submit jobs to Yarn

spark-submit --master yarn \
--class  org.apache.spark.examples.SparkPi \
./examples/jars/spark-examples_2.13-3.5.0.jar 10

spark-code

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.lihaozhe</groupId>
    <artifactId>spark-code</artifactId>
    <version>1.0.0</version>

    <properties>
        <jdk.version>1.8</jdk.version>
        <scala.version>2.13.12</scala.version>
        <spark.version>3.5.0</spark.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <commons-io.version>2.14.0</commons-io.version>
        <commons-lang3.version>3.12.0</commons-lang3.version>
        <commons-pool2.version>2.11.1</commons-pool2.version>
        <hadoop.version>3.3.6</hadoop.version>
        <hive.version>3.1.3</hive.version>
        <java-testdata-generator.version>1.1.2</java-testdata-generator.version>
        <junit.version>5.10.1</junit.version>
        <lombok.version>1.18.30</lombok.version>
        <mysql.version>8.2.0</mysql.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>com.github.binarywang</groupId>
            <artifactId>java-testdata-generator</artifactId>
            <version>${java-testdata-generator.version}</version>
        </dependency>
        <!-- spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.13</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.13</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.13</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- junit-jupiter-api -->
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-api</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>
        <!-- junit-jupiter-engine -->
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-engine</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>${lombok.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j2-impl</artifactId>
            <version>2.21.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.21.1</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!-- commons-pool2 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-pool2</artifactId>
            <version>${commons-pool2.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <!-- commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons-lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>${commons-io.version}</version>
        </dependency>
        <dependency>
            <groupId>com.mysql</groupId>
            <artifactId>mysql-connector-j</artifactId>
            <version>${mysql.version}</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>${project.artifactId}</finalName>
        <!--<outputDirectory>../package</outputDirectory>-->
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
                <configuration>
                    <!-- compiler source encoding -->
                    <encoding>UTF-8</encoding>
                    <!-- compiler JDK version -->
                    <source>${jdk.version}</source>
                    <target>${jdk.version}</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-clean-plugin</artifactId>
                <version>3.3.2</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>3.3.1</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-war-plugin</artifactId>
                <version>3.3.2</version>
            </plugin>
            <!-- compilation level -->
            <!-- skip JUnit tests when packaging -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
            <!-- this plugin compiles Scala code into class files -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>4.8.1</version>
                <configuration>
                    <scalaCompatVersion>${scala.version}</scalaCompatVersion>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

hdfs-conf

Place the HDFS configuration files core-site.xml and hdfs-site.xml in the resources directory.

The imported HDFS configuration files are those of the test cluster.

Since the production environment differs from the test environment, the HDFS configuration files are excluded when the project is packaged.
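
One possible way to exclude them at package time (a sketch, not taken from the original pom.xml) is a resource filter in the <build> section:

<resources>
    <resource>
        <directory>src/main/resources</directory>
        <excludes>
            <exclude>core-site.xml</exclude>
            <exclude>hdfs-site.xml</exclude>
        </excludes>
    </resource>
</resources>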

RDD

RDD, DataFrame, and DataSet compared.

Similarities:
  • All three are distributed data sets.
  • A DataFrame is implemented on top of RDDs; a DataSet is not, but all three are ultimately converted to RDDs for execution.
  • DataSet and DataFrame are both distributed data sets that carry a schema describing the data and its types.

Differences:
  • Schema information:
    The data in an RDD has no data types.
    A DataFrame is weakly typed: the schema declares column types, but no type checking is done at compile time, so type errors only show up at runtime.
    A DataSet is strongly typed and uses custom encoders for serialization and deserialization.
  • Serialization mechanism:
    RDD and DataFrame use Java serialization by default, which can be switched to Kryo (see the sketch below).
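
A minimal sketch of switching to Kryo serialization (not part of the original project; the registered class is a placeholder):

import org.apache.spark.SparkConf

object KryoConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-demo")
      .setMaster("local")
      // replace the default Java serialization with Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // optionally pre-register frequently serialized classes
    conf.registerKryoClasses(Array(classOf[Array[String]]))
  }
}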

Build an RDD from a collection

package cn.lihaozhe.chap01;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

/**
 * Build an RDD from a collection.
 * RDD stands for Resilient Distributed Dataset: a read-only, partitioned collection of records.
 * It is Spark's basic data structure and lets programmers run fault-tolerant, in-memory computations on large clusters.
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo01 {
  public static void main(String[] args) {
    // SparkConf conf = new SparkConf().setAppName("RDD").setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName("RDD");
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // the data set
      List<Integer> list = Arrays.asList(1, 2, 3, 4, 5);
      // create an RDD from the collection
      JavaRDD<Integer> javaRDD = sparkContext.parallelize(list);
      // collect the Spark RDD back into a Java list
      List<Integer> collect = javaRDD.collect();
      // lambda / method reference
      collect.forEach(System.out::println);
    }
  }
}

package cn.lihaozhe.chap01

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Build an RDD from a collection.
 * RDD stands for Resilient Distributed Dataset: a read-only, partitioned collection of records.
 * It is Spark's basic data structure and lets programmers run fault-tolerant, in-memory computations on large clusters.
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo01 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // basic Spark configuration
    val conf = new SparkConf().setAppName("RDD")
    // run locally
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // the data set
    val data = Array(1, 2, 3, 4, 5)
    // create an RDD from the collection
    // ParallelCollectionRDD
    val rdd = sparkContext.parallelize(data)
    rdd.foreach(println(_))
  }
}

Build an RDD from a local file

words.txt

linux shell
java mysql jdbc
hadoop hdfs mapreduce
hive presto
flume kafka
hbase phoenix
scala spark
sqoop flink
linux shell
java mysql jdbc
hadoop hdfs mapreduce
hive presto
flume kafka
hbase phoenix
scala spark
sqoop flink
base phoenix
scala spark
sqoop flink
linux shell
java mysql jdbc
hadoop hdfs mapreduce
java mysql jdbc
hadoop hdfs mapreduce
hive presto
flume kafka
hbase phoenix
scala spark
java mysql jdbc
hadoop hdfs mapreduce
java mysql jdbc
hadoop hdfs mapreduce
hive presto

package cn.lihaozhe.chap01;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.List;

/**
 * Build an RDD from an external file (External Datasets).
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo02 {
  public static void main(String[] args) {
    // SparkConf conf = new SparkConf().setAppName("RDD").setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName("RDD");
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // build the data set from the local file system
      JavaRDD<String> javaRDD = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/word.txt");
      // collect the Spark RDD back into a Java list
      List<String> collect = javaRDD.collect();
      // lambda / method reference
      collect.forEach(System.out::println);
    }
  }
}
package cn.lihaozhe.chap01

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Build an RDD from an external file (External Datasets).
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo02 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // basic Spark configuration
    val conf = new SparkConf().setAppName("RDD")
    // run locally
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // build the data set from the local file system
    val data = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/word.txt")
    data.foreach(println(_))
  }
}

Build an RDD from an HDFS file

package cn.lihaozhe.chap01;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.List;

/**
 * Build an RDD from an external file (External Datasets) on HDFS.
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo03 {
  public static void main(String[] args) {
    System.setProperty("HADOOP_USER_NAME", "root");
    // SparkConf conf = new SparkConf().setAppName("RDD").setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName("RDD");
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // build the data set from HDFS
      // JavaRDD<String> javaRDD = sparkContext.textFile("hdfs://spark01:8020/data/word.txt");
      JavaRDD<String> javaRDD = sparkContext.textFile("/data/word.txt");
      // collect the Spark RDD back into a Java list
      List<String> collect = javaRDD.collect();
      // lambda / method reference
      collect.forEach(System.out::println);
    }
  }
}
package cn.lihaozhe.chap01

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Build an RDD from an external file (External Datasets) on HDFS.
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo03 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // basic Spark configuration
    val conf = new SparkConf().setAppName("RDD")
    // run locally
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // build the data set from HDFS
    val data = sparkContext.textFile("/data/word.txt")
    data.foreach(println(_))
  }
}

Operators

count

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

/**
 * count operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo01 {
  public static void main(String[] args) {
    String appName = "count";
    // SparkConf conf = new SparkConf().setAppName(appName).setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName(appName);
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 数据集
      List<Integer> data = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
      // 从集合中创建RDD
      JavaRDD<Integer> javaRDD = sparkContext.parallelize(data);
      long count = javaRDD.count();
      System.out.println("count = " + count);
    }
  }
}

scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * count operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo01 {
  def main(args: Array[String]): Unit = {
    val appName = "count"
    // spark基础配置
    // val conf = new SparkConf().setAppName(appName).setMaster("local")
    val conf = new SparkConf().setAppName(appName)
    // 本地运行
    conf.setMaster("local")
    // 构建 SparkContext spark 上下文
    val sparkContext = new SparkContext(conf)
    // 数据集
    val data = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
    val rdd = sparkContext.parallelize(data)
    val count = rdd.count()
    println(s"count = $count")
  }
}

operation result:

count = 10

take

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

/**
 * take operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo02 {
  public static void main(String[] args) {
    String appName = "take";
    // SparkConf conf = new SparkConf().setAppName(appName).setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName(appName);
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 数据集
      List<Integer> data = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
      // 从集合中创建RDD
      JavaRDD<Integer> javaRDD = sparkContext.parallelize(data);
      List<Integer> topList = javaRDD.take(3);
      topList.forEach(System.out::println);
    }
  }
}

scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * take operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo02 {
  def main(args: Array[String]): Unit = {
    val appName = "take"
    // spark基础配置
    // val conf = new SparkConf().setAppName(appName).setMaster("local")
    val conf = new SparkConf().setAppName(appName)
    // 本地运行
    conf.setMaster("local")
    // 构建 SparkContext spark 上下文
    val sparkContext = new SparkContext(conf)
    // 数据集
    val data = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
    val rdd = sparkContext.parallelize(data)
    val top = rdd.take(3)
    top.foreach(println(_))
  }
}

operation result:

0
1
2

distinct

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

/**
 * distinct operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo03 {
  public static void main(String[] args) {
    String appName = "distinct";
    // SparkConf conf = new SparkConf().setAppName(appName).setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName(appName);
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 数据集
      List<Integer> data = Arrays.asList(0, 1, 5, 6, 7, 8, 9, 3, 4, 2, 4, 3);
      // 从集合中创建RDD
      JavaRDD<Integer> javaRDD = sparkContext.parallelize(data);
      JavaRDD<Integer> uniqueRDD = javaRDD.distinct();
      List<Integer> uniqueList = uniqueRDD.collect();
      uniqueList.forEach(System.out::println);
    }
  }
}

scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * distinct operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo03 {
  def main(args: Array[String]): Unit = {
    val appName = "distinct"
    // spark基础配置
    // val conf = new SparkConf().setAppName(appName).setMaster("local")
    val conf = new SparkConf().setAppName(appName)
    // 本地运行
    conf.setMaster("local")
    // 构建 SparkContext spark 上下文
    val sparkContext = new SparkContext(conf)
    // 数据集
    val data = Array(0, 1, 5, 6, 7, 8, 9, 3, 4, 2, 4, 3)
    val rdd = sparkContext.parallelize(data)
    val uniqueRdd = rdd.distinct()
    uniqueRdd.foreach(println(_))
  }
}

operation result:

4
0
1
6
3
7
9
8
5
2

map

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

/**
 * map operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo04 {
  public static void main(String[] args) {
    String appName = "map";
    // SparkConf conf = new SparkConf().setAppName(appName).setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName(appName);
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 数据集
      List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
      // 从集合中创建RDD
      JavaRDD<Integer> javaRDD = sparkContext.parallelize(data);
      JavaRDD<Integer> rs = javaRDD.map(num -> num * 2);
      List<Integer> list = rs.collect();
      list.forEach(System.out::println);
    }
  }
}

scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * map operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo04 {
  def main(args: Array[String]): Unit = {
    val appName = "map"
    // spark基础配置
    // val conf = new SparkConf().setAppName(appName).setMaster("local")
    val conf = new SparkConf().setAppName(appName)
    // 本地运行
    conf.setMaster("local")
    // 构建 SparkContext spark 上下文
    val sparkContext = new SparkContext(conf)
    // 数据集
    val data = Array(1, 2, 3, 4, 5)
    val rdd = sparkContext.parallelize(data)
    val rs = rdd.map(_ * 2)
    rs.foreach(println(_))
  }
}

operation result:

2
4
6
8
10

flatMap

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * flatMap operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo05 {
  public static void main(String[] args) {
    String appName = "flatMap";
    // SparkConf conf = new SparkConf().setAppName(appName).setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName(appName);
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 数据集
      List<String> data = Arrays.asList("hadoop hive presto", "hbase phoenix", "spark flink");
      // 从集合中创建RDD
      JavaRDD<String> javaRDD = sparkContext.parallelize(data);
      // ["hadoop hive presto hbase phoenix spark flink"]
      // JavaRDD<String> wordsRdd =  javaRDD.flatMap(new FlatMapFunction<String, String>() {
    
    
      //   @Override
      //  public Iterator<String> call(String s) throws Exception {
    
    
      //    String[] words = s.split(" ");
      //    return Arrays.asList(words).iterator();
      //  }
      // });
      JavaRDD<String> wordsRdd = javaRDD.flatMap((FlatMapFunction<String, String>) line -> Arrays.asList(line.split(" ")).listIterator());
      List<String> words = wordsRdd.collect();
      words.forEach(System.out::println);
    }
  }
}

scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * flatMap operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo05 {
  def main(args: Array[String]): Unit = {
    val appName = "flatMap"
    // spark基础配置
    // val conf = new SparkConf().setAppName(appName).setMaster("local")
    val conf = new SparkConf().setAppName(appName)
    // 本地运行
    conf.setMaster("local")
    // 构建 SparkContext spark 上下文
    val sparkContext = new SparkContext(conf)
    // 数据集
    val data = Array("hadoop hive presto", "hbase phoenix", "spark flink")
    // ("hadoop","hive","presto","hbase","phoenix","spark","flink")
    val rs = data.flatMap(_.split(" "))
    rs.foreach(println(_))
  }
}

operation result:

hadoop
hive
presto
hbase
phoenix
spark
flink

filter

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

/**
 * filter operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo06 {
  public static void main(String[] args) {
    String appName = "filter";
    // SparkConf conf = new SparkConf().setAppName(appName).setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName(appName);
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 数据集
      List<Integer> data = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
      // 从集合中创建RDD
      JavaRDD<Integer> javaRDD = sparkContext.parallelize(data);
      JavaRDD<Integer> evenRDD = javaRDD.filter(num -> num % 2 == 0);
      List<Integer> evenList = evenRDD.collect();
      evenList.forEach(System.out::println);
    }
  }
}

scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * filter operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo06 {
  def main(args: Array[String]): Unit = {
    val appName = "filter"
    // spark基础配置
    // val conf = new SparkConf().setAppName(appName).setMaster("local")
    val conf = new SparkConf().setAppName(appName)
    // 本地运行
    conf.setMaster("local")
    // 构建 SparkContext spark 上下文
    val sparkContext = new SparkContext(conf)
    // 数据集
    val data = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
    val rdd = sparkContext.parallelize(data)
    val evenRdd = rdd.filter(_ % 2 == 0)
    evenRdd.foreach(println(_))
  }
}

operation result:

0
2
4
6
8

groupByKey

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.List;

/**
 * groupByKey operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo07 {
  public static void main(String[] args) {
    // SparkConf conf = new SparkConf().setAppName("RDD").setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName("groupByKey");
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 使用本地文件系统构建数据集
      JavaRDD<String> javaRDD = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv");
      // javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
    
    
      //   @Override
      //  public Tuple2<String, Integer> call(String s) throws Exception {
    
    
      //    String[] words = s.split(",");
      //    return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      //  }
      //});
      JavaPairRDD<String, Integer> javaPairRDD = javaRDD.mapToPair((PairFunction<String, String, Integer>) word -> {
        // [person3,137]
        String[] words = word.split(",");
        return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      });
      JavaPairRDD<String, Iterable<Integer>> groupRDD = javaPairRDD.groupByKey();
      List<Tuple2<String, Iterable<Integer>>> collect = groupRDD.collect();
      collect.forEach(tup -> {
        // get the key
        System.out.print(tup._1 + " >>> (");
        // get the values
        tup._2.forEach(num -> System.out.print(num + ","));
        System.out.println("\b)");
      });
    }
  }
}
scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * groupByKey operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo07 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // spark基础配置
    val conf = new SparkConf().setAppName("groupByKey")
    // 本地运行
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // 使用本地文件系统构建数据集
    val data = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val tupleData = data.map(line => (line.split(",")(0), line.split(",")(1)))
    // (person1,Seq(197, 38, 12, 114, 91, 182, 29, 2, 100, 99, 137, 56))
    val groupData = tupleData.groupByKey()
    groupData.foreach(println(_))
  }
}

reduceByKey

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Function;
import scala.Tuple2;

import java.util.List;

/**
 * reduceByKey operator
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo08 {
  public static void main(String[] args) {
    // SparkConf conf = new SparkConf().setAppName("RDD").setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName("reduceByKey");
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 使用本地文件系统构建数据集
      JavaRDD<String> javaRDD = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv");
      // javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
    
    
      //  @Override
      //  public Tuple2<String, Integer> call(String s) throws Exception {
    
    
      //    String[] words = s.split(",");
      //    return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      //  }
      //});
      JavaPairRDD<String, Integer> javaPairRDD = javaRDD.mapToPair((PairFunction<String, String, Integer>) word -> {
        // [person3,137]
        String[] words = word.split(",");
        return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      });
      JavaPairRDD<String, Integer> reduceRDD = javaPairRDD.reduceByKey((Function2<Integer, Integer, Integer>) Integer::sum);
      List<Tuple2<String, Integer>> collect = reduceRDD.collect();
      collect.forEach(tup -> System.out.println(tup._1 + " >>> " + tup._2));
    }
  }
}
scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * reduceByKey operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo08 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // spark基础配置
    val conf = new SparkConf().setAppName("reduceByKey")
    // 本地运行
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // 使用本地文件系统构建数据集
    val data = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val tupleData = data.map(line => (line.split(",")(0), line.split(",")(1).toInt))
    // (person1,Seq(197, 38, 12, 114, 91, 182, 29, 2, 100, 99, 137, 56))
    val groupData = tupleData.reduceByKey(_ + _)
    groupData.foreach(println(_))
  }
}

mapValues

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

/**
 * mapValues operator
 * Uses the data file data.csv: column 1 is the name, column 2 is the order amount per purchase; computes the average order value per person.
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo09 {
  public static void main(String[] args) {
    // SparkConf conf = new SparkConf().setAppName("RDD").setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName("mapValues");
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 使用本地文件系统构建数据集
      JavaRDD<String> javaRDD = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv");
      // javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
    
    
      //  @Override
      //  public Tuple2<String, Integer> call(String s) throws Exception {
    
    
      //    String[] words = s.split(",");
      //    return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      //  }
      //});
      JavaPairRDD<String, Integer> javaPairRDD = javaRDD.mapToPair((PairFunction<String, String, Integer>) word -> {
        // [person3,137]
        String[] words = word.split(",");
        return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      });
      JavaPairRDD<String, Iterable<Integer>> groupRDD = javaPairRDD.groupByKey();
      JavaPairRDD<String, Double> avgRDD = groupRDD.mapValues(v -> {
        int sum = 0;
        Iterator<Integer> it = v.iterator();
        AtomicInteger atomicInteger = new AtomicInteger();
        while (it.hasNext()) {
    
    
          Integer amount = it.next();
          sum += amount;
          atomicInteger.incrementAndGet();
        }
        return (double) sum / atomicInteger.get();
      });
      List<Tuple2<String, Double>> collect = avgRDD.collect();
      collect.forEach(tup -> System.out.println(tup._1 + " >>> " + (double) Math.round(tup._2 * 100) / 100));
//      Map<String, List<Tuple2<String, Integer>>> listMap = javaPairRDD.collect().stream().collect(Collectors.groupingBy(tup -> tup._1));
//      Set<Map.Entry<String, List<Tuple2<String, Integer>>>> entries = listMap.entrySet();
//      Iterator<Map.Entry<String, List<Tuple2<String, Integer>>>> it = entries.iterator();
//      Map<String, Double> map = new HashMap<>();
//      while (it.hasNext()) {
    
    
//        Map.Entry<String, List<Tuple2<String, Integer>>> entry = it.next();
//        Integer sum = entry.getValue().stream().map(tup -> tup._2).reduce(Integer::sum).orElse(0);
//        long count = entry.getValue().stream().map(tup -> tup._2).count();
//
//        map.put(entry.getKey(), Double.valueOf(sum) / count);
//      }
//      map.forEach((name, amount) -> System.out.println(name + " >>> " + amount));
    }
  }
}
scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * mapValues operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo09 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // spark基础配置
    val conf = new SparkConf().setAppName("mapValues")
    // 本地运行
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // 使用本地文件系统构建数据集
    val data = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val tupleData = data.map(line => (line.split(",")(0), line.split(",")(1).toInt))
    // (person1,Seq(197, 38, 12, 114, 91, 182, 29, 2, 100, 99, 137, 56))
    val groupData = tupleData.groupByKey()
    // groupData.foreach(println(_))
    val avgData = groupData.mapValues(v => (v.sum.toDouble / v.size).formatted("%.2f"))
    avgData.foreach(println(_))
  }
}

sortByKey

javacode
package cn.lihaozhe.chap02;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.List;

/**
 * sortByKey and reduceByKey operators
 * Uses the data file data.csv: column 1 is the name, column 2 is the order amount per purchase; sums the total amount spent per person.
 *
 * @author 李昊哲
 * @version 1.0
 */
public class JavaDemo10 {
  public static void main(String[] args) {
    // SparkConf conf = new SparkConf().setAppName("RDD").setMaster("local");
    // basic Spark configuration
    SparkConf conf = new SparkConf().setAppName("sortByKey");
    // run locally
    conf.setMaster("local");
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      // 使用本地文件系统构建数据集
      JavaRDD<String> javaRDD = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv");
      // javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
    
    
      //  @Override
      //  public Tuple2<String, Integer> call(String s) throws Exception {
    
    
      //    String[] words = s.split(",");
      //    return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      //  }
      //});
      JavaPairRDD<String, Integer> javaPairRDD = javaRDD.mapToPair((PairFunction<String, String, Integer>) word -> {
        // [person3,137]
        String[] words = word.split(",");
        return new Tuple2<String, Integer>(words[0], Integer.parseInt(words[1]));
      });
      JavaPairRDD<String, Integer> reduceRDD = javaPairRDD.reduceByKey((Function2<Integer, Integer, Integer>) Integer::sum);
      // 参数 true为升序 false为降序 默认为升序
      JavaPairRDD<String, Integer> sortedRDD = reduceRDD.sortByKey(false);
      List<Tuple2<String, Integer>> collect = sortedRDD.collect();
      collect.forEach(tup -> System.out.println(tup._1 + " >>> " + tup._2));
    }
  }
}
scalacode
package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * sortByKey and reduceByKey operators
 * Uses the data file data.csv: column 1 is the name, column 2 is the order amount per purchase; sums the total amount spent per person.
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo10 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // spark基础配置
    val conf = new SparkConf().setAppName("sortByKey")
    // 本地运行
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // 使用本地文件系统构建数据集
    val data = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val tupleData = data.map(line => (line.split(",")(0), line.split(",")(1).toInt))
    // (person1,Seq(197, 38, 12, 114, 91, 182, 29, 2, 100, 99, 137, 56))
    val groupData = tupleData.reduceByKey(_ + _)
    val swapData = groupData.map(_.swap)
    // 参数 true为升序 false为降序 默认为升序
    val sortData = swapData.sortByKey(ascending = false)
    val result = sortData.map(_.swap)
    result.foreach(println(_))
  }
}

sortBy

package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * sortBy and reduceByKey operators
 * Uses the data file data.csv: column 1 is the name, column 2 is the order amount per purchase; sums the total amount spent per person.
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo11 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // spark基础配置
    val conf = new SparkConf().setAppName("sortBy")
    // 本地运行
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // 使用本地文件系统构建数据集
    val data = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val tupleData = data.map(line => (line.split(",")(0), line.split(",")(1).toInt))
    // (person1,1057)
    val groupData = tupleData.reduceByKey(_ + _)
    // 参数 true为升序 false为降序 默认为升序
    val sortedData = groupData.sortBy(_._2, ascending = false)
    sortedData.foreach(println(_))
  }
}

join

package cn.lihaozhe.chap02

import org.apache.spark.{SparkConf, SparkContext}

/**
 * join operator
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo12 {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("RDD").setMaster("local")
    // spark基础配置
    val conf = new SparkConf().setAppName("join")
    // 本地运行
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // 使用本地文件系统构建数据集
    val data = sparkContext.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val tupleData = data.map(line => (line.split(",")(0), line.split(",")(1).toInt))
    val groupData = tupleData.groupByKey()
    // 姓名 评价消费金额
    val avgData = groupData.mapValues(v => (v.sum.toDouble / v.size).formatted("%.2f"))
    // 姓名 消费总金额
    val sumData = tupleData.reduceByKey(_ + _)
    // 相当于表连接
    val rsData = sumData.join(avgData)
    rsData.foreach(println(_))
  }
}

operation result:

(person1,(1057,88.08))
(person9,(2722,113.42))
(person6,(2634,105.36))
(person0,(1824,101.33))
(person2,(1296,99.69))
(person3,(2277,91.08))
(person7,(2488,99.52))
(person4,(2271,113.55))
(person5,(2409,114.71))
(person8,(1481,87.12))

WordCount

JavaWordCount

package cn.lihaozhe.chap03;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @author 李昊哲
 * @version 1.0
 * @create 2023-12-12
 */
public class JavaWordCount {
  public static void main(String[] args) {
    System.setProperty("HADOOP_USER_NAME", "root");
    String appName = "JavaWordCount";
    SparkConf conf = new SparkConf().setAppName(appName);
    try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
      JavaRDD<String> javaRDD = sparkContext.textFile("/data/word.txt");
      JavaRDD<String> wordRdd = javaRDD.flatMap((FlatMapFunction<String, String>) line -> Arrays.asList(line.split(" ")).listIterator());
      JavaPairRDD<String, Integer> javaPairRDD = wordRdd.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));
      JavaPairRDD<String, Integer> rs = javaPairRDD.reduceByKey((Function2<Integer, Integer, Integer>) Integer::sum);
      rs.saveAsTextFile("/data/result");
    }
  }
}

ScalaWordCount

package cn.lihaozhe.chap03

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author 李昊哲
 * @version 1.0
 */
object ScalaWordCount01 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val conf = new SparkConf().setAppName("ScalaWordCount01")
    val sparkContext = new SparkContext(conf)
    val content = sparkContext.textFile("/data/word.txt")
    val words = content.flatMap(_.split(" "))
    val wordGroup = words.groupBy(word => word)
    val wordCount = wordGroup.mapValues(_.size)
    wordCount.saveAsTextFile("/data/result")
  }
}

package cn.lihaozhe.chap03

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author 李昊哲
 * @version 1.0
 */
object ScalaWordCount02 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val conf = new SparkConf().setAppName("ScalaWordCount02")
    val sparkContext = new SparkContext(conf)
    val content = sparkContext.textFile("/data/word.txt")
    val words = content.flatMap(_.split(" "))
    val wordMap = words.map((_, 1))
    val wordGroup = wordMap.reduceByKey(_ + _)
    wordGroup.saveAsTextFile("/data/result")
  }
}

Project packaging and release

mvn package

Upload jar files to the cluster

Submit on the cluster

spark-submit --master yarn --class cn.lihaozhe.chap03.JavaWordCount spark-code.jar
spark-submit --master yarn --class cn.lihaozhe.chap03.ScalaWordCount01 spark-code.jar 
spark-submit --master yarn --class cn.lihaozhe.chap03.ScalaWordCount02 spark-code.jar 

SparkSQL

In Spark Core you have to create the context, SparkContext, yourself.
Spark SQL is a wrapper around Spark Core: it wraps not only the functionality but also the context.
	In older versions there were two contexts: SQLContext for Spark's own queries and HiveContext for queries against Hive.
	In newer versions, SparkSession combines SQLContext and HiveContext, so their APIs are interchangeable.
	A SparkSession can also hand back the underlying SparkContext directly (see the sketch below).
DataFrame
A distributed data set based on RDDs. The difference from an RDD is that a DataFrame carries metadata about its data;
it can be thought of as a two-dimensional table in a traditional database, where every column has a name and a type.
DataSet
A distributed data set that extends DataFrame, comparable to a ResultSet in traditional JDBC.
RDD: data
DataFrame: data + schema
DataSet: data + schema + data types
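
A minimal sketch (not from the original project) of creating a SparkSession and pulling the underlying SparkContext out of it:

import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession unifies the old SQLContext and HiveContext
    val spark = SparkSession.builder()
      .appName("session-sketch")
      .master("local")
      .getOrCreate()
    // the SparkContext is available directly from the session
    val sc = spark.sparkContext
    println(sc.appName)
    spark.stop()
  }
}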

DataFrame

Build DataFrame

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Build a DataFrame.
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo01 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    // root
    // |-- _c0: string (nullable = true)
    // |-- _c1: string (nullable = true)
    df.printSchema()
    sparkSession.stop()
  }
}

operation result:

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)

show

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * show
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo02 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    df.printSchema()
    df.show(5, truncate = false)
    sparkSession.stop()
  }
}

operation result:

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)

+-------+---+
|_c0    |_c1|
+-------+---+
|person3|137|
|person7|193|
|person7|78 |
|person0|170|
|person5|145|
+-------+---+

option

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * option: whether to treat the first row as field names; the header option defaults to false
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo03 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read
      .option("header", "true")
      .csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/info.csv")
    // root
    // |-- name: string (nullable = true)
    // |-- amount: string (nullable = true)
    df.printSchema()
    df.show(5)
    sparkSession.stop()
  }
}

operation result:

root
 |-- name: string (nullable = true)
 |-- amount: string (nullable = true)

+-------+------+
|   name|amount|
+-------+------+
|person3|   137|
|person7|   193|
|person7|    78|
|person0|   170|
|person5|   145|
+-------+------+

select

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * select
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo04 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    df.printSchema()
    val rs = df.select("_c0", "_c1")
    rs.show(5, truncate = false)
    sparkSession.stop()
  }
}

operation result:

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)

+-------+---+
|_c0    |_c1|
+-------+---+
|person3|137|
|person7|193|
|person7|78 |
|person0|170|
|person5|145|
+-------+---+

withColumnRenamed

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * withColumnRenamed
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo05 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val table = df.withColumnRenamed("_c0", "name").withColumnRenamed("_c1", "amount")
    table.printSchema()
    val rs = table.select("name", "amount")
    rs.show(5, truncate = false)
    sparkSession.stop()
  }
}

Output:

root
 |-- name: string (nullable = true)
 |-- amount: string (nullable = true)

+-------+------+
|name   |amount|
+-------+------+
|person3|137   |
|person7|193   |
|person7|78    |
|person0|170   |
|person5|145   |
+-------+------+

cast

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

/**
 * cast
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo06 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val table = df.select(
      col("_c0").cast(StringType).as("name"),
      col("_c1").cast(IntegerType).as("amount"),
    )
    table.printSchema()
    val rs = table.select("name", "amount")
    rs.show(5, truncate = false)
    sparkSession.stop()
  }
}

Output:

root
 |-- name: string (nullable = true)
 |-- amount: integer (nullable = true)

+-------+------+
|   name|amount|
+-------+------+
|person3|   137|
|person7|   193|
|person7|    78|
|person0|   170|
|person5|   145|
+-------+------+

show first foreach head take tail

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * show first foreach head take tail
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo07 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read
      .option("header", "true")
      .csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/info.csv")
    df.printSchema()
    // df.show(5, truncate = false)
    // df.foreach(println)
    // [name: string, amount: string]
    // println(df)
    // [person3,137]
    // println(df.first())
    // df.head(3).foreach(println)
    // df.take(3).foreach(println)
    df.tail(3).foreach(println)
    sparkSession.stop()
  }
}

where

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

/**
 * where: filter rows by a condition
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo08 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val table = df.select(
      col("_c0").cast(StringType).as("name"),
      col("_c1").cast(IntegerType).as("amount"),
    ).where("amount > 100")
    table.foreach(println)
    sparkSession.stop()
  }
}

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

/**
 * where: filter rows by a condition (Column-based predicate)
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo09 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val table = df.select(
      col("_c0").cast(StringType).as("name"),
      col("_c1").cast(IntegerType).as("amount"),
    ).where(col("amount") > 100)
    table.foreach(println)
    sparkSession.stop()
  }
}

filter

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

/**
 * filter: filter rows by a condition
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo10 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val table = df.select(
      col("_c0").cast(StringType).as("name"),
      col("_c1").cast(IntegerType).as("amount"),
    ).filter("amount > 100")
    table.foreach(println)
    sparkSession.stop()
  }
}

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

/**
 * filter: filter rows by a condition (Column-based predicate)
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo11 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val table = df.select(
      col("_c0").cast(StringType).as("name"),
      col("_c1").cast(IntegerType).as("amount"),
    ).filter(col("amount") > 100)
    table.foreach(println)
    sparkSession.stop()
  }
}

groupBy

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

/**
 * group by
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo12 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val rs = df.select(
      col("_c0").cast(StringType).as("name"),
      col("_c1").cast(IntegerType).as("amount"),
    ).groupBy("name").count().where("count > 20")
    rs.printSchema()
    rs.foreach(println)
    sparkSession.stop()
  }
}

orderBy

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

/**
 * order by
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo13 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    val rs = df.select(
      col("_c0").cast(StringType).as("name"),
      col("_c1").cast(IntegerType).as("amount"),
    ).groupBy("name").count().where("count > 20")
      .orderBy(col("count"), col("name"))
    rs.printSchema()
    rs.foreach(println)
    sparkSession.stop()
  }
}

SQL

package cn.lihaozhe.chap04

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * SQL
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo14 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read.csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
    // register the DataFrame as a temporary view
    df.createOrReplaceTempView("order_info")
    // alternatively, go through the SQLContext object
    // val sqlContext = sparkSession.sqlContext
    // val rs = sqlContext.sql("select _c0 as name,_c1 as amount from order_info where _c1 > 100")
    // DataFrame holding the SQL query result
    val rs = sparkSession.sql("select _c0 as name, _c1 as amount from order_info where _c1 > 100")
    rs.foreach(println)
    sparkSession.stop()
  }
}

DataSet

dataframe dataset

package cn.lihaozhe.chap05

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * RDD DataFrame DataSet
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo01 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read
      .option("header", "true")
      .csv("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/info.csv")
    // convert the DataFrame to a DataSet
    val ds = df.as[OrderInfo]
    // ds.printSchema()
    // ds.foreach(println)
    // val rdd = df.rdd
    val rdd = ds.map(orderInfo => (orderInfo.name, orderInfo.amount.toInt)).rdd
    rdd.foreach(println)

    sparkSession.stop()
  }
}

case class OrderInfo(name: String, amount: String)

Read files to build DataSet

package cn.lihaozhe.chap05

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * DataFrame DataSet
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo02 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the file into a DataFrame
    val df = sparkSession.read.text("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/info.csv")
    // read the file into a DataSet
    val ds = sparkSession.read.textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/info.csv")
    sparkSession.stop()
  }
}

RDD schema

package cn.lihaozhe.chap05

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Use reflection to infer the RDD schema when there are only a few fields
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo03 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    val ds = sparkSession.sparkContext
      .textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
      .map(_.split(","))
      .map(attributes => OrderSchema(attributes(0),attributes(1).toInt))
      .toDS()
    ds.printSchema()
    ds.foreach(println)
    sparkSession.stop()
  }
}

case class OrderSchema(name: String, amount: Integer)

StructType

package cn.lihaozhe.chap05

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
 * StructField
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo04 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // 1. create an RDD of Rows from the original RDD
    val rowRDD = sparkSession.sparkContext
      .textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).toInt))
    // 2. create the schema, represented by a StructType, matching the structure of the Rows in the RDD from step 1
    val structType = StructType(Array(
      StructField(name = "name", dataType = StringType, nullable = false),
      StructField(name = "amount", dataType = IntegerType, nullable = false)
    ))
    // 3. apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
    val df = sparkSession.createDataFrame(rowRDD, structType)
    df.printSchema()
    df.foreach(println)
    sparkSession.stop()
  }
}

package cn.lihaozhe.chap05

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
 * StructField
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo05 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // 1. create an RDD of Rows from the original RDD
    val rowRDD = sparkSession.sparkContext
      .textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1)))
    // 2. create the schema, represented by a StructType, matching the structure of the Rows in the RDD from step 1
    // val schemaString = "name amount"
    // val fields = schemaString.split(" ").map(fieldName => StructField(name = fieldName, dataType = StringType, nullable = true))
    // val structType = StructType(fields)
    val structType = StructType("name amount".split(" ").map(fieldName => StructField(name = fieldName, dataType = StringType)))
    // 3. apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
    val df = sparkSession.createDataFrame(rowRDD, structType)
    df.printSchema()
    df.foreach(println)
    sparkSession.stop()
  }
}

json

package cn.lihaozhe.chap05

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoders, SparkSession}

/**
 * kryo
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo06 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    val ds = sparkSession.sparkContext
      .textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
      .map(_.split(","))
      .map(attributes => TbOrder(attributes(0), attributes(1).toInt))
      .toDS()
    // create the temporary view order_info
    ds.createOrReplaceTempView("order_info")
    // DataFrame holding the SQL query result
    val df = sparkSession.sql("select name,amount from order_info where amount between 100 and 150")
    // df.foreach(println)
    // access columns by index
    df.map(temp => "{\"name\":" + temp(0) + ",\"amount\": " + temp(1) + "}").show(3, truncate = false)
    // access columns by field name
    df.map(temp => "{\"name\":" + temp.getAs[String]("name") + ",\"amount\": " + temp.getAs[Int]("amount") + "}").show(3, truncate = false)
    // convert the rows to JSON strings
    df.toJSON.show(3, truncate = false)
    // read one row at a time and pack the columns into a Map
    implicit val mapEncoder = Encoders.kryo[Map[String, Any]]
    val array = df.map(teenager => teenager.getValuesMap[Any](List("name", "amount"))).collect()
    array.foreach(println)
    sparkSession.stop()
  }
}

case class TbOrder(name: String, amount: Integer)

format conversion

parquet

package cn.lihaozhe.chap06

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
 * parquet
 *
 * @author 李昊哲
 * @version 1.0
 * @create 2023-12-12 
 */
object ScalaDemo01 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the csv file into a DataFrame
    val df = sparkSession.read
      .option("header", "true")
      .format("csv")
      .load("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/info.csv")
    df.select("name", "amount").write.mode(SaveMode.Overwrite).format("parquet").save("/data/spark/parquet")
    sparkSession.stop()
  }
}

case class OrderInfo(name: String, amount: String)

json

package cn.lihaozhe.chap06

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
 * json
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo02 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the parquet files into a DataFrame
    val df = sparkSession.read.format("parquet").load("/data/spark/parquet")
    println(df.count())
    df.select("name", "amount").write.mode(SaveMode.Overwrite).format("json").save("/data/spark/json")
    sparkSession.stop()
  }
}

JDBC

package cn.lihaozhe.chap07

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
 * jdbc
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo01 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL JDBC example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // read the MySQL table over JDBC into a DataFrame
    val df = sparkSession.read
      .format("jdbc")
      .option("url", "jdbc:mysql://spark03")
      .option("dbtable","knowledge.dujitang")
      .option("user", "root")
      .option("password", "Lihaozhe!!@@1122")
      .load()
    println(df.count())
    sparkSession.stop()
  }
}

package cn.lihaozhe.chap07

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import java.util.Properties

/**
 * jdbc
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo02 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL JDBC example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    val url = "jdbc:mysql://spark03"
    val tableName = "knowledge.dujitang"
    val connectionProperties = new Properties()
    connectionProperties.put("user", "root")
    connectionProperties.put("password", "Lihaozhe!!@@1122")
    connectionProperties.put("customSchema", "id int,text string")
    // read the MySQL table over JDBC into a DataFrame
    val df = sparkSession.read.jdbc(url, tableName, connectionProperties)
    df.printSchema()
    println(df.count())
    sparkSession.stop()
  }
}

package cn.lihaozhe.chap07

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

import java.util.Properties

/**
 * jdbc
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo04 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL JDBC example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    // 1. create an RDD of Rows from the original RDD
    val rowRDD = sparkSession.sparkContext
      .textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).toInt))
    // 2. create the schema, represented by a StructType, matching the structure of the Rows in the RDD from step 1
    val structType = StructType(Array(
      StructField(name = "name", dataType = StringType, nullable = true),
      StructField(name = "amount", dataType = IntegerType, nullable = true)
    ))
    // 3. apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
    val df = sparkSession.createDataFrame(rowRDD, structType)
    val url = "jdbc:mysql://spark03"
    val tableName = "lihaozhe.data"
    val connectionProperties = new Properties()
    connectionProperties.put("user", "root")
    connectionProperties.put("password", "Lihaozhe!!@@1122")
    df.write
      .mode(SaveMode.Overwrite)
      .jdbc(url, tableName, connectionProperties)
    sparkSession.stop()
  }
}

package cn.lihaozhe.chap07

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

import java.util.Properties

/**
 * jdbc
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo05 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL JDBC example")
      .config(sparkConf)
      .getOrCreate()

    // implicit conversions
    // 1. create an RDD of Rows from the original RDD
    val rowRDD = sparkSession.sparkContext
      .textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).toInt))
    // 2. create the schema, represented by a StructType, matching the structure of the Rows in the RDD from step 1
    val structType = StructType(Array(
      StructField(name = "name", dataType = StringType, nullable = true),
      StructField(name = "amount", dataType = IntegerType, nullable = true)
    ))
    // 3. apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
    val df = sparkSession.createDataFrame(rowRDD, structType)
    val url = "jdbc:mysql://spark03"
    val tableName = "lihaozhe.data"
    val connectionProperties = new Properties()
    connectionProperties.put("user", "root")
    connectionProperties.put("password", "Lihaozhe!!@@1122")
    connectionProperties.put("createTableColumnTypes", "name varchar(50)")
    df.write
      .mode(SaveMode.Overwrite)
      .jdbc(url, tableName, connectionProperties)
    sparkSession.stop()
  }
}

spark on hive

"Spark on Hive" and "Hive on Spark" are two different concepts, which respectively describe the integration between Spark and Hive.

  1. Spark on Hive: “Spark on Hive” refers to using Hive’s metadata storage and query engine in Spark applications. In this integration method, Spark can directly access and operate data tables in Hive without copying the data to Spark's memory. This integration is achieved through Spark SQL, which allows users to query and manipulate data in Hive using SQL or the DataFrame API in Spark applications.
  2. Hive on Spark: "Hive on Spark" refers to using Spark as the computing engine in the Hive query engine. In traditional Hive, computing tasks are performed by MapReduce, but in some cases, users want to use Spark instead of MapReduce to execute Hive queries to obtain better performance and resource utilization. By using Spark as Hive's computing engine, users can take advantage of Spark's in-memory computing capabilities when executing Hive queries, thereby improving query performance.

In general, "Spark on Hive" mainly refers to using Hive data in Spark applications, while "Hive on Spark" mainly refers to using Spark as a computing engine in a Hive query engine. Both integration methods allow users to better take advantage of the advantages of Spark and Hive and choose the appropriate integration method based on specific needs.

package cn.lihaozhe.chap08

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

import java.util.Properties

/**
 * hive
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo01 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL JDBC example")
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    // implicit conversions
    import sparkSession.implicits._
    // 1. create an RDD of Rows from the original RDD
    val rowRDD = sparkSession.sparkContext
      .textFile("file:///D:/work/河南师范大学/2023/bigdata2023/spark/code/spark-code/data.csv")
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).toInt))
    // 2. create the schema, represented by a StructType, matching the structure of the Rows in the RDD from step 1
    val structType = StructType(Array(
      StructField(name = "name", dataType = StringType, nullable = true),
      StructField(name = "amount", dataType = IntegerType, nullable = true)
    ))
    // 3. apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
    val df = sparkSession.createDataFrame(rowRDD, structType)
    df.write.mode(SaveMode.Overwrite).saveAsTable("lihaozhe.order_info")
    sparkSession.stop()
  }
}

package cn.lihaozhe.chap08

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

/**
 * hive
 *
 * @author 李昊哲
 * @version 1.0
 */
object ScalaDemo02 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local")
    }

    val sparkSession = SparkSession
      .builder()
      .appName("Spark SQL JDBC example")
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    // import the session's sql method
    // val orderDF = sparkSession.sql("select * from lihaozhe.order_info")
    import sparkSession.sql
    val orderDF = sql("select * from lihaozhe.order_info")
    orderDF.foreach(info => println(info(0) + "\t" + info(1)))
    sparkSession.stop()
  }
}

streaming

spark streaming

structured streaming

Structured Streaming is Apache Spark's approach to processing real-time data streams. It has many advantages and some disadvantages; the main ones are listed below:

Advantages:

  1. Highly integrated: Structured streaming processing is highly integrated with other components of Spark (such as Spark SQL, DataFrame, etc.), making processing real-time data streams easier and more flexible.

  2. Fault tolerance: Structured streaming processing is fault-tolerant and can automatically recover when a failure occurs, ensuring the reliability of data processing.

  3. High performance: Structured streaming processing is based on the Spark engine, which has excellent performance and scalability and can handle large-scale real-time data streams.

  4. Supports multiple data sources: Structured streaming can read data from multiple sources (such as Kafka, HDFS, file systems, etc.) and write processing results to multiple targets (such as Kafka, HDFS, file systems, databases, etc.).

  5. SQL friendly: Structured streaming processing provides a SQL-like API, making processing real-time data streams more intuitive and easy to understand.

Disadvantages:

  1. Learning curve: For beginners, structured streaming may require a certain learning curve, especially in terms of understanding the concepts of stream processing and tuning performance.

  2. Latency limitations: Although structured streaming processes real-time data streams, it is based on micro-batch execution, so its latency may not satisfy scenarios with very strict real-time requirements.

  3. Resource consumption: Since structured streaming processing is based on the Spark engine, it may require a large amount of computing resources and memory resources to process real-time data streams.

Overall, structured streaming has many advantages in processing real-time data streams, but its advantages and disadvantages also need to be weighed based on specific business needs and scenarios.
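
The section above describes Structured Streaming only in prose, and the streaming code that follows uses the older DStream API, so here is a minimal, hedged sketch of a Structured Streaming word count for comparison. The socket host and port mirror the DStream example below; the object itself is illustrative and not part of the original project.

package cn.lihaozhe.chap09

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local[2]")
    }
    val sparkSession = SparkSession
      .builder()
      .appName("StructuredStreamingSketch")
      .config(sparkConf)
      .getOrCreate()

    import sparkSession.implicits._
    // read the socket as an unbounded DataFrame with a single string column named "value"
    val lines = sparkSession.readStream
      .format("socket")
      .option("host", "spark03")
      .option("port", 9999)
      .load()
    // split each line into words and maintain a running count per word
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()
    // print the full result table to the console after every micro-batch
    val query = wordCounts.writeStream
      .outputMode(OutputMode.Complete())
      .format("console")
      .start()
    query.awaitTermination()
  }
}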

spark streaming

package cn.lihaozhe.chap09

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * spark streaming
 *
 * @author 李昊哲
 * @version 1.0
 */
object SparkStreamingExample {
  def main(args: Array[String]): Unit = {
    // 1. create a DStream
    val sparkConf: SparkConf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SparkStreamingExample")
    val streamingContext = new StreamingContext(sparkConf, Seconds(4))
    val dStream: ReceiverInputDStream[String] = streamingContext.socketTextStream("spark03", 9999)
    // 2. compute the word count
    dStream
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    // 3. start the streaming application
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

structured streaming

kafka

log4j.properties

log4j.rootLogger=error, stdout,R
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS}  %5p --- [%50t]  %-80c(line:%5L)  :  %m%n

log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=../log/agent.log
log4j.appender.R.MaxFileSize=1024KB
log4j.appender.R.MaxBackupIndex=1

log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS}  %5p --- [%50t]  %-80c(line:%6L)  :  %m%n

KafkaProducer

package cn.lihaozhe.chap10

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SparkKafkaProducer {

  def main(args: Array[String]): Unit = {

    // 0. configuration
    val properties = new Properties()
    properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092")
    properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])

    // 1. create a producer
    val producer = new KafkaProducer[String, String](properties)

    // 2. send data
    for (i <- 1 to 5) {
      producer.send(new ProducerRecord[String, String]("lihaozhe", "lihaozhe" + i))
    }

    // 3. close the producer
    producer.close()
  }

}

KafkaConsumer

package cn.lihaozhe.chap10

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object SparkKafkaConsumer {

  def main(args: Array[String]): Unit = {

    // 1. initialize the streaming context
    val conf = new SparkConf().setMaster("local[*]").setAppName("spark-kafka")
    val ssc = new StreamingContext(conf, Seconds(3))


    // 2. consume data from Kafka
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "spark01:9092,spark02:9092,spark03:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "test"
    )
    val kafkaDStream = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](Set("lihaozhe"), kafkaParams))

    // take the value of each record (the key is empty, the value looks like "lihaozhe1")
    val valueDStream = kafkaDStream.map(record => record.value())

    valueDStream.print()

    // 3. start the streaming job and block until termination
    ssc.start()
    ssc.awaitTermination()
  }

}
