Big Data: How does Spark read data from an Amazon S3 bucket?


1. Is my environment the same as yours?

Environment: native Spark (a plain local installation, not a hosted workspace)
System: Ubuntu 16.04
JAR package versions: see the end of this article (the list is long and would clutter the beginning)


2. How do you run the spark-shell script?

First, find your spark-shell script; it lives in the /spark/bin/ directory.
Change to the directory containing the spark-shell script and execute it to enter the interactive command line.

$ cd /spark/bin/
$ ./spark-shell	# execute the command

It then prints a long stream of log output; when you see the Spark banner and the scala> prompt appears, the shell has started successfully.

scala>

Buried in that long log output is an important fact: starting the shell has, in effect, the same result as running the following code:

// Scala
val conf = new SparkConf().setMaster("local").setAppName("SparkSQL")
val sc = new SparkContext(conf)

The first line sets the master to local and names the application SparkSQL.
A distributed cluster is divided into a master and slaves (workers). The master acts as the supervisor and intermediary, while the slaves are the worker nodes. Aside: "master" and "slaves" have long been used together as a pair of terms, but because "slave" literally means servant and carries discriminatory connotations, a number of people have objected and asked for the terminology to be changed.

The second line creates a new SparkContext. The SparkContext plays the leading role in a Spark application: it is responsible for the interaction between your program and the Spark cluster, including requesting cluster resources and creating RDDs, accumulators, and broadcast variables.

Every time spark-shell starts, it automatically creates this sc (SparkContext) for us, and we can use it directly on the command line afterwards. Of course, you can also create one yourself.
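
To see this in action, here is a minimal sketch of my own (not from the original setup) that can be pasted into the spark-shell prompt. It uses the ready-made sc to create the three kinds of objects mentioned above: an RDD, an accumulator, and a broadcast variable (the longAccumulator API assumes Spark 2.x or later).

// illustration only: create an RDD, an accumulator, and a broadcast variable with sc
val numbers = sc.parallelize(1 to 10)           // RDD from a local collection
val matches = sc.longAccumulator("matches")     // accumulator, updated on the executors
val threshold = sc.broadcast(5)                 // read-only value shared with the executors

numbers.foreach { n =>
  if (n % 2 == 0 && n > threshold.value) matches.add(1)
}
println(s"even numbers above ${threshold.value}: ${matches.value}")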


3. How to use spark-shell (Scala)?

The native Spark shell can be driven with Scala or Python. Here we use Scala to connect to Amazon S3. The Python route has a lot of pitfalls: many packages are not available out of the box and have to be imported by hand, so interested readers can try it on their own.

Connecting to S3 from spark-shell with Scala:

// set the configuration
sc.hadoopConfiguration.set("fs.s3a.access.key", "your access key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "your secret key")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "endpoint:port")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")

// read
val myRdd = sc.textFile("s3a://bucketms/notebook/text")
// write back to S3 (saveAsTextFile fails if the target path already exists)
myRdd.saveAsTextFile("s3a://bucketms/notebook/text")
// count the number of lines of text
myRdd.count

The access key and secret key can be thought of as a login account and password; they are assigned by the S3 server.

endpoint:port is the IP address and port of the S3 server you are connecting to.

The fourth configuration line disables SSL.
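
Once the read succeeds, myRdd behaves like any other RDD, so you can do more than just count it. The short sketch below is my own example, assuming the object at the placeholder path above is plain text; it does a classic word count and prints a few results:

// illustration only: a simple word count on the text read from S3
val wordCounts = myRdd
  .flatMap(line => line.split("\\s+"))   // split each line into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey(_ + _)                    // sum the counts per word

wordCounts.take(10).foreach(println)     // show the first 10 (word, count) pairs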


Appendix: required JAR packages

The following JAR packages need to be placed in Spark's jars directory.

1. AWS Java SDK JAR versions (AWS services)

aws-java-sdk-1.11.404.jar
aws-java-sdk-core-1.11.404.jar
aws-java-sdk-s3-1.11.404.jar

2. hadoop-aws JAR version (AWS service)

hadoop-aws-3.0.3.jar

3. Jackson JAR versions (data binding, annotations)

Note: use the Jackson JARs specifically from com.fasterxml.jackson!
jackson-annotations-2.7.8.jar
jackson-core-2.7.8.jar
jackson-databind-2.7.8.jar

jackson-core-asl-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
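
If you are unsure whether the JARs were picked up, a quick optional check from the spark-shell prompt is to load one representative class from each group; if a JAR is missing, the call throws ClassNotFoundException. This is just a sanity check of my own, not something required by Spark:

// illustration only: verify that the required classes are on the classpath
Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")        // from hadoop-aws
Class.forName("com.amazonaws.services.s3.AmazonS3Client")      // from aws-java-sdk-s3
Class.forName("com.fasterxml.jackson.databind.ObjectMapper")   // from jackson-databind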
