Data Cleansing in Spark

1. Pull the sample data file from the internet and unzip the nested archives:

mkdir linkage
cd linkage/
curl -L -o donation.zip https://bit.ly/1Aoywaq
unzip donation.zip
unzip 'block_*.zip'

2. If you have an HDFS cluster handy, you can put the data into HDFS:

hadoop fs -mkdir linkage
hadoop fs -put block_*.csv linkage

3. If your HDFS cluster also runs YARN, you can launch the Spark shell on YARN:

spark-shell --master yarn --deploy-mode client
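
Once the shell is up, you can read the CSV files straight from HDFS. A minimal sketch of a spark-shell session, assuming the files landed in linkage/ under your HDFS home directory as in step 2, and that each file has a header row with missing fields encoded as "?":

// Read all block_*.csv files from the linkage directory on HDFS
val parsed = spark.read.
  option("header", "true").      // first line of each file holds the column names
  option("nullValue", "?").      // treat "?" as a missing value
  option("inferSchema", "true"). // infer numeric/boolean column types
  csv("linkage")

parsed.printSchema()
parsed.count()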

4. To run Spark on your local computer using four cores:

spark-shell --master local[4]

5. When running locally, you can limit the Spark driver process's memory, here to 2 GB (this example also uses eight cores):

spark-shell --master local[8] --driver-memory 2g
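
Inside the shell you can check that these options took effect. A quick sketch using the SparkContext that spark-shell predefines as sc:

sc.master                             // e.g. "local[8]"
sc.getConf.get("spark.driver.memory") // "2g" when set via --driver-memory
sc.defaultParallelism                 // equals the core count in local[N] mode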

6. You can add your own dependency JARs to the Spark process with the --jars flag:

spark-shell --master local[4] --driver-memory 2g --jars myJar.jar
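
With the jar on the classpath, its classes can be imported directly in the shell. A hypothetical sketch; com.example.MyUtil and its normalize method are stand-ins for whatever your own jar actually provides:

// com.example.MyUtil is hypothetical -- replace it with a class from your jar
import com.example.MyUtil

val cleaned = MyUtil.normalize("  some raw value  ")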

Reposted from blog.csdn.net/qq_25527791/article/details/88844956