Spark development example (programming practice)

This section describes how to perform hands-on RDD transformations and actions, as well as how to write, compile, package, and run a Spark application.

Start the Spark Shell

The Spark interactive shell is not only a simple way to learn the API, but also a powerful tool for analyzing data sets interactively. Spark supports multiple operating modes: it can run in single-machine (local) mode or in distributed mode. For simplicity, this section runs Spark in local mode.

In either mode, once startup is complete, a SparkContext object (sc) is initialized, and a SparkSession object is also created for Spark SQL operations. After entering the Scala interactive interface, you can perform RDD transformations and actions.

Enter the SPARK_HOME/bin directory and execute the following command to start the Spark Shell.

$./spark-shell
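
After the shell starts, the SparkContext described above is already available as sc. As a quick check (a minimal sketch; the exact output string depends on your Spark version), you can inspect it directly:

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...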

Use the Spark Shell

Assume that the local file system contains the file /home/hadoop/SparkData/WordCount/text1 with the following contents.

hello world
hello My name is john I love Hadoop programming

Below, we perform Spark Shell operations on this file.

1) Create a new RDD from a text file in the local file system.

scala> var textFile = sc.textFile("file:///home/hadoop/SparkData/WordCount/text1")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:12

2) Perform an action to count how many lines are in the document.

scala> textFile.count()  // how many lines are in the RDD
17/05/17 22:59:07 INFO spark.SparkContext: Job finished: count at <console>:15, took 5.654325469 s
res1: Long = 2

The result shows that the document has 2 lines.

3) Perform an action to obtain the first line of the document.

scala> textFile.first()  // content of the first line of the RDD
17/05/17 23:01:25 INFO spark.SparkContext: Job finished: first at <console>:15, took

The result shows that the first line of the document is "hello world".

4) Use a transformation to convert the RDD into a new RDD. The code to obtain the lines containing "hello" is as follows.

scala> var newRDD = textFile.filter(line => line.contains("hello"))  // lines containing "hello"
scala> newRDD.count()  // how many lines contain "hello"
17/05/17 23:06:33 INFO spark.SparkContext: Job finished: count at <console>:15, took 0.867975549 s
res4: Long = 2

The first line of code uses the filter transformation to form a new RDD containing only the lines with "hello"; the second line then uses the count action to count how many lines it has.
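
The transformation and the action can also be chained into a single expression. A minimal sketch that returns the same count of 2:

scala> textFile.filter(line => line.contains("hello")).count()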

5) Implement WordCount in the Spark Shell.

scala> val file = sc.textFile("file:///home/hadoop/SparkData/WordCount/text1")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> count.collect()
17/05/17 23:11:46 INFO spark.SparkContext: Job finished: collect at <console>:17, took 1.624248037 s
res5: Array[(String, Int)] = Array((hello,2),(world,1),(My,1),(is,1),(love,1),(I,1),(John,1),(hadoop,1),(name,1),(programming,1))

  1. Use the textFile() method of the SparkContext class to read the local file, generating a MappedRDD.
  2. Use the flatMap() method to split the contents of the file into words by spaces, forming a FlatMappedRDD.
  3. Use map(word => (word, 1)) to turn each split word into a <word, 1> pair, generating a MappedRDD.
  4. Use the reduceByKey() method to compute word frequency statistics, generating a ShuffledRDD, and run the job with collect() to obtain the result (a follow-up sketch that sorts and saves these counts appears below).
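
As a follow-up, the word counts produced above can be sorted by descending frequency and written back to the local file system. This is only a minimal sketch; the output directory below is an example and must not already exist.

scala> val sorted = count.sortBy(_._2, ascending = false)   // sort (word, count) pairs by descending count
scala> sorted.saveAsTextFile("file:///home/hadoop/SparkData/WordCount/output")   // example output directory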

Write a Java application

1. Install maven

Install Maven manually: download apache-maven-3.3.9-bin.zip from the official Apache Maven site, and choose /usr/local/maven as the installation directory.

sudo unzip ~/Downloads/apache-maven-3.3.9-bin.zip -d /usr/local
cd /usr/local
sudo mv apache-maven-3.3.9/ ./maven
sudo chown -R hadoop ./maven
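
To confirm that the installation works (an optional check), print the Maven version:

/usr/local/maven/bin/mvn -v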

2. Write Java application code

In the terminal, execute the following commands to create a folder sparkapp2 as the application root directory.

cd ~    # enter the user's home folder
mkdir -p ./sparkapp2/src/main/java

Use vim ./sparkapp2/src/main/java/SimpleApp.java to create a file named SimpleApp.java with the following code.

/*** SimpleApp.java ***/
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "file:///usr/local/spark/README.md"; // should be some file on your system
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
                "file:///usr/local/spark/", new String[]{"target/simple-project-1.0.jar"});

        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("b");
            }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}

This program depends on the Spark Java API, so we need to compile and package it with Maven. Create a new file pom.xml under ./sparkapp2 (vim ./sparkapp2/pom.xml) to declare the Spark dependency information of this standalone application. The code is as follows.

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.0</version>
    </dependency>
  </dependencies>
</project>

3. Package the Java program with Maven

To ensure that Maven works properly, execute the following commands to check the file structure of the entire application.

cd ~/sparkapp2
find

The file structure is shown in Figure 1.

Figure 1: SimpleApp.java file structure
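
Under the directory layout created above, the find output should resemble the following (a sketch; the exact ordering of entries may vary):

.
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java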

Next, the entire application can be packaged into a JAR with the following command.

/usr/local/maven/bin/mvn package

After running the above command, a message similar to the following appears, indicating that the JAR package was generated successfully.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.583 s
[INFO] Finished at: 2017-02-19T15:52:08+08:00
[INFO] Final Memory: 15M/121M
[INFO] ------------------------------------------------------------------------

4. Run the program with spark-submit

Finally, the JAR package can be submitted to Spark and run with spark-submit, using the following command.

/usr/local/spark/bin/spark-submit --class "SimpleApp" ~/sparkapp2/target/simple-project-1.0.jar

The final result is as follows.

Lines with a: 62, lines with b: 30
