A short note on testing spark-shell reading and writing S3 data on my local machine


// Step 1: Download Spark. From the official website I downloaded the pre-compiled Spark 2.4.4 with Hadoop 2.7.3 build; just unpack it and it is ready to use.
// One thing to note: I had the latest OpenJDK 13 installed, and Spark SQL threw errors when executed.
// The fix is to edit spark-env.sh and set JAVA_HOME=/PATH/TO/JDK8, which resolves the problem.
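// If it is unclear which JDK spark-shell actually picked up, it can be checked from inside the shell itself.
// A minimal sketch using only standard JVM system properties (nothing Spark-specific assumed);
// after pointing JAVA_HOME at JDK 8 this should report a 1.8.x version.
println(System.getProperty("java.version"))
println(System.getProperty("java.home"))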
// Step 2: Use Docker to set up a local S3-compatible environment (MinIO)
docker run -p 9000:9000 --name minio1 \
  -e "MINIO_ACCESS_KEY=minio" \
  -e "MINIO_SECRET_KEY=minio123" \
  -v /Users/student2020/data/minio/data/:/data \
  minio/minio server /data

// Step 3:
// Log in to localhost:9000 with minio/minio123, click the plus sign at the bottom right, and create a bucket named "test"
// for later use; the generated test data will be written to this "test" bucket.

// Step 4: Use spark-shell to generate test data and walk through the whole process of writing, compacting, and cleaning the data.
// Because the Spark I downloaded is the pre-compiled Spark 2.4 / Hadoop 2.7.3 build, hadoop-aws:2.7.3 is configured below.
// SPARK_HOME has already been added to /etc/profile on this machine, so the following command can be run directly;
// otherwise, run it from the SPARK_HOME/bin directory.
spark-shell \
  --packages io.delta:delta-core_2.11:0.5.0,org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.hadoop.fs.s3a.access.key=minio \
  --conf spark.hadoop.fs.s3a.secret.key=minio123 \
  --conf spark.hadoop.fs.s3a.endpoint=127.0.0.1:9000 \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false
// Because this MinIO instance was set up without security certificates, SSL is not used, so the last option must be configured; otherwise execution fails.
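// As an alternative to passing every --conf on the command line, the same s3a settings can also be applied
// to the running session's Hadoop configuration; a sketch under that assumption, using the same MinIO
// endpoint and credentials as above. The --packages list and the Delta log store class still need to be
// supplied when spark-shell is launched, and this must run before the first s3a access so the settings take effect.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "minio")
hadoopConf.set("fs.s3a.secret.key", "minio123")
hadoopConf.set("fs.s3a.endpoint", "127.0.0.1:9000")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")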

// Test writing some data to S3.
// Paste the following code directly into the spark-shell window and wait for it to finish.
spark.range(500).write.format("delta").save("s3a://test/df001/")
spark.range(1500).write.format("delta").mode("overwrite").save("s3a://test/df001/")
spark.range(11500).write.format("delta").mode("overwrite").save("s3a://test/df001/")
 
// After several writes, the folder s3a://test/df001/ contains dozens of files.
// This "too many small files" problem hurts both file system and query performance.
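// To see the small-file problem concretely, the data files under the table path can be listed through the
// Hadoop FileSystem API; a minimal sketch, assuming the s3a configuration from the spark-shell launch above
// (the _delta_log directory is not counted here).
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
val tablePath = "s3a://test/df001/"
val fs = FileSystem.get(new URI(tablePath), spark.sparkContext.hadoopConfiguration)
val dataFiles = fs.listStatus(new Path(tablePath)).filter(_.getPath.getName.endsWith(".parquet"))
println(s"parquet files: ${dataFiles.length}")
dataFiles.foreach(f => println(s"${f.getPath.getName}  ${f.getLen} bytes"))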

/* Compact the data following the best practices recommended in the official Delta Lake documentation */
val path = "s3a://test/df001/"
spark.read
  .format("delta")
  .load(path)
  .repartition(5)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .save(path)
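// To confirm the compaction took effect, the table can be read back; with 5 compacted files it will typically
// come back as 5 partitions, though the exact split depends on Spark's file-read settings. A small verification sketch:
val compacted = spark.read.format("delta").load(path)
println(s"partitions after compaction: ${compacted.rdd.getNumPartitions}")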

// There are now too many stale files; clean up / remove the outdated small files.
import io.delta.tables._
val deltaTable = DeltaTable.forPath(spark, path)
// Make sure that no other insert | update | delete | optimize operations run while this executes, otherwise the Delta table data may be corrupted.
// The following configuration must be set to false, otherwise the vacuum below throws an error.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", false)
// The default retention is 168 hours, i.e. only files older than 7 * 24 hours are removed. Since this test ran a number of operations within a short time, only files written within the last six minutes (0.1 hours) are retained.
deltaTable.vacuum(0.1)
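// The operations recorded in the transaction log (the initial write, the overwrites and the dataChange=false
// compaction) can be inspected through the DeltaTable API already imported above; a short sketch:
deltaTable.history().select("version", "timestamp", "operation", "operationParameters").show(false)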

// Check the s3a://test/df001/ directory again: the data files have been reduced to five.
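// Finally, a quick sanity check that compaction plus vacuum changed only the physical layout, not the data:
// the table should still hold the 11500 rows from the last overwrite.
val finalCount = spark.read.format("delta").load(path).count()
println(s"row count after compaction + vacuum: $finalCount")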



Source: www.cnblogs.com/huaxiaoyao/p/12153390.html