1. Pull the data file from the Internet and unzip the archives it contains:
mkdir linkage
cd linkage/
curl -L -o donation.zip https://bit.ly/1Aoywaq
unzip donation.zip
unzip 'block_*.zip'
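After the commands above run, it is worth sanity-checking the extracted CSVs, for example by counting the data rows across all blocks. A minimal sketch using throwaway sample files (the file names and contents below are stand-ins for illustration, not the real download):

```shell
# Throwaway sample files standing in for the real block_*.csv
# (hypothetical contents; the real files come from the download above).
tmp=$(mktemp -d)
printf 'id_1,id_2,is_match\n1,2,TRUE\n' > "$tmp/block_1.csv"
printf 'id_1,id_2,is_match\n3,4,FALSE\n' > "$tmp/block_2.csv"
# Count data rows across all blocks, skipping each file's header line:
total=$(tail -q -n +2 "$tmp"/block_*.csv | wc -l | tr -d ' ')
echo "$total"
rm -rf "$tmp"
```

Note the quotes in `unzip 'block_*.zip'`: they pass the pattern to unzip itself rather than letting the shell expand it, so unzip processes each matching archive in turn.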
2. If you have an HDFS cluster handy, you can put the data into HDFS:
hadoop fs -mkdir linkage
hadoop fs -put block_*.csv linkage
3. If you have an HDFS cluster with YARN support available, you can run the Spark shell on the cluster:
spark-shell --master yarn --deploy-mode client
4. If you are running Spark on your local computer with four cores:
spark-shell --master local[4]
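Rather than hard-coding `local[4]`, you can build the `--master` value from the number of cores actually available. A small sketch, assuming `getconf _NPROCESSORS_ONLN` is available (it is on Linux and macOS):

```shell
# Detect the number of online cores and build the matching --master
# value instead of hard-coding local[4].
CORES=$(getconf _NPROCESSORS_ONLN)
MASTER="local[${CORES}]"
echo "$MASTER"
# Then: spark-shell --master "$MASTER"
```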
5. If you are running on your local computer, you can limit the Spark driver's memory use to 2 GB:
spark-shell --master local[8] --driver-memory 2g
6. You can add your own dependency JARs to the Spark shell's classpath by using the --jars flag:
spark-shell --master local[4] --driver-memory 2g --jars myJar.jar
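When you have more than one JAR, `--jars` takes a comma-separated list. A hedged sketch of building that list from a directory of jars (the temp directory and jar names are made up for the demo):

```shell
# Build the comma-separated list that --jars expects from a directory
# of jars (the directory and jar names here are hypothetical).
tmp=$(mktemp -d)
touch "$tmp/a.jar" "$tmp/b.jar"
JARS=$(ls "$tmp"/*.jar | paste -sd, -)
echo "$JARS"
# Then: spark-shell --master local[4] --jars "$JARS"
rm -rf "$tmp"
```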