Hadoop and Spark on Ubuntu 16.04

Create a new user:

$ sudo useradd -m hadoop -s /bin/bash
Set the user's password:
$ sudo passwd hadoop
Add administrator privileges:
$ sudo adduser hadoop sudo

Install SSH and configure SSH passwordless login:

Install SSH Server:

$ sudo apt-get install openssh-server
Use SSH to log in to the machine, then exit the session:
$ ssh localhost
$ exit
Use ssh-keygen to generate keys:
cd ~/.ssh/ # If there is no such directory, execute ssh localhost first
ssh-keygen -t rsa # There will be prompts; just press Enter to accept the defaults
cat ./id_rsa.pub >> ./authorized_keys # Append id_rsa.pub to the end of authorized_keys
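If the key setup worked, logging in again should no longer ask for a password (a quick check, not part of the original steps):
ssh localhost # should log in without prompting for a password
exit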

Install JAVA environment

sudo apt-get install openjdk-8-jre openjdk-8-jdk
Find the installation directory:
dpkg -L openjdk-8-jdk | grep '/bin/javac' # dpkg installs, builds and manages Debian packages; -L lists the files belonging to a package; grep filters the output for the given text

The output should be /usr/lib/jvm/java-8-openjdk-amd64/bin/javac. (On my machine the command printed nothing, but the directory can still be found by browsing the filesystem.)
Edit the user's environment variables:
sudo gedit ~/.bashrc
Set JAVA_HOME to the path found above:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 (be careful not to add spaces here)
Make the environment variable take effect:
source ~/.bashrc
Verify variable value
echo $JAVA_HOME # Verify variable value
java -version
$JAVA_HOME/bin/java -version # Same as executing java -version directly

Hadoop installation:

Download URL: http://mirror.bit.edu.cn/apache/hadoop/common/

$ sudo tar -zxf ~/download/hadoop-3.0.0.tar.gz -C /usr/local # Extract to /usr/local
$ cd /usr/local/
$ sudo mv ./hadoop-3.0.0/ ./hadoop # Change the folder name to hadoop
$ sudo chown -R hadoop ./hadoop # Change the owner of the files (chown changes file ownership, not permissions; -R applies it recursively to the directory and all files under it; see man chown for details)
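To confirm the unpacked Hadoop is usable, a quick check (not part of the original steps) is:
cd /usr/local/hadoop
./bin/hadoop version # prints the Hadoop version if everything is in place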

Hadoop stand-alone configuration:

Hadoop runs in stand-alone (non-distributed) mode by default.

To list all bundled examples, run ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar without arguments.
Here we pick the grep example: it takes all files in the input folder as input, filters for words matching the regular expression dfs[a-z.]+, counts the number of occurrences, and writes the results to the output folder.
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input # use the config files as input files
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/* # (I ran into permission problems here, so no output file was generated when I executed it)

Note that Hadoop does not overwrite the result file by default, so running the above example again will prompt an error, you need to delete ./output first.

Hadoop pseudo-distributed configuration

Hadoop can run in a pseudo-distributed manner on a single node: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.

The configuration file of Hadoop is located in /usr/local/hadoop/etc/hadoop/, and pseudo-distribution needs to modify two configuration files core-site.xml and hdfs-site.xml. Hadoop configuration files are in xml format, and each configuration is implemented by declaring the name and value of the property.

Modify the configuration file core-site.xml (it is more convenient to edit it with gedit: gedit ./etc/hadoop/core-site.xml). The file initially contains only an empty <configuration></configuration> element; change it to declare the properties described below.
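As a reference, a typical pseudo-distributed core-site.xml (the form used by the referenced tutorials; the temporary directory path is an assumption and should match your own layout) looks like this:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>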


Similarly, modify the configuration file hdfs-site.xml:
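Again as a sketch, a typical hdfs-site.xml for pseudo-distributed mode sets the replication factor to 1 and the NameNode/DataNode directories mentioned below (the exact paths are assumptions):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>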



A note on the Hadoop configuration files: the running mode of Hadoop is determined by the configuration files (they are read when Hadoop starts), so to switch from pseudo-distributed mode back to non-distributed mode you need to delete the configuration items added to core-site.xml.

In addition, although pseudo-distributed mode only needs fs.defaultFS and dfs.replication to run (this is what the official tutorial configures), if the hadoop.tmp.dir parameter is not set, the default temporary directory /tmp/hadoop-hadoop is used. That directory may be cleaned up by the system on reboot, which would force you to reformat the NameNode. So we set hadoop.tmp.dir, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir, otherwise the next steps might go wrong.

After the configuration is complete, execute the formatting of the NameNode:

./bin/hdfs namenode -format

Then start the NameNode and DataNode daemons.

./sbin/start-dfs.sh
Pay attention to setting JAVA_HOME in ./etc/hadoop/hadoop-env.sh as well, otherwise an error will be reported.

In addition, if the DataNode is not started, you can try the following methods (note that this will delete all the original data in HDFS, if the original data is very important, please do not do this):

After startup completes, you can use the jps command to check whether it succeeded. If it did, the following processes are listed: "NameNode", "DataNode" and "SecondaryNameNode" (if SecondaryNameNode did not start, run ./sbin/stop-dfs.sh to stop the daemons and try starting again). If NameNode or DataNode is missing, the configuration failed; check the previous steps carefully or look at the startup logs for the cause.

Workaround when the DataNode cannot be started:

./sbin/stop-dfs.sh # close
rm -r ./tmp # delete the tmp file, note that this will delete all the original data in HDFS
./bin/hdfs namenode -format # reformat the NameNode
./sbin/start-dfs.sh # restart
If this still does not work, add the following to hdfs-site.xml (the 0.0.0.0 address corresponds to the local machine; adjust it to your own IP settings):
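One common form of such an address binding in hdfs-site.xml (an assumption; not necessarily the exact property used here) is:

<property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:9870</value> <!-- 9870 is the default NameNode web port in Hadoop 3.x -->
</property>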



Running a pseudo-distributed instance of Hadoop

In the above stand-alone mode, the grep example reads local data, while pseudo-distributed reads data on HDFS. To use HDFS, you first need to create a user directory in HDFS:

./bin/hdfs dfs -mkdir -p /user/hadoop

Then copy the xml file in ./etc/hadoop as an input file to the distributed file system, that is, copy /usr/local/hadoop/etc/hadoop to /user/hadoop/input in the distributed file system. We are using the hadoop user, and the corresponding user directory /user/hadoop has been created, so we can use a relative path such as input in the command, and the corresponding absolute path is /user/hadoop/input:

./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input # This step reported an error when I ran it. Check the logs: the cause was an incompatible clusterID; in that case stop dfs again, remove the tmp directory and reformat the NameNode as described above.

Note the following additions in ~/.bashrc, otherwise an error will be reported. The same settings are best also placed in hadoop-env.sh (I did not put the lines below into hadoop-env.sh here, because doing so reported an error):
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"

After the copy is complete, you can view the file list with the following command:

./bin/hdfs dfs -ls input

The pseudo-distributed way of running MapReduce jobs is the same as the stand-alone mode, the difference is that the pseudo-distribution reads files in HDFS (you can delete the local input folder created in the stand-alone step and the output folder of the output result to verify at this point).

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'

Command to view the running result (viewing the output in HDFS):

./bin/hdfs dfs -cat output/*

We can also get the results back to the local:

rm -r ./output # First delete the local output folder (if it exists)
./bin/hdfs dfs -get output ./output # Copy the output folder on HDFS to the local
cat ./output/*

When Hadoop runs a program, the output directory must not already exist, otherwise it reports the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". So if you want to run the job again, you first need to delete the output folder:

./bin/hdfs dfs -rm -r output # Delete the output folder
The output directory must not exist when running a program. To avoid accidentally overwriting results, Hadoop refuses to run if the output directory specified by the program (such as output) already exists, so you need to delete it before each run. When developing a real application, consider adding the following code to the program, which deletes the output directory automatically on every run and avoids the tedious command-line step:

(Java)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf);

/* Delete the output directory if it already exists */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);

To shut down Hadoop, run

./sbin/stop-dfs.sh

Note
The next time you start Hadoop, you do not need to format the NameNode again; just run ./sbin/start-dfs.sh. Before starting, you can jump to the Hadoop directory with cd $HADOOP_HOME.

Start YARN

(pseudo-distribution can also be done without starting YARN, which generally does not affect program execution)

Some readers may wonder why the JobTracker and TaskTracker mentioned in older books cannot be seen after starting Hadoop. This is because newer versions of Hadoop use the new MapReduce framework (MapReduce V2, also known as YARN, Yet Another Resource Negotiator).

YARN was split out of MapReduce and is responsible for resource management and task scheduling; MapReduce jobs now run on top of YARN, which provides high availability and scalability. A fuller introduction to YARN is beyond the scope of this post; interested readers can consult related materials.

Starting Hadoop with ./sbin/start-dfs.sh as above only starts HDFS, and MapReduce jobs then run locally. We can also start YARN and let it take over resource management and task scheduling.

First modify the configuration file mapred-site.xml, which needs to be renamed first:

mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml (mv renames the file)
Then edit it (again, gedit is convenient: gedit ./etc/hadoop/mapred-site.xml):
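A minimal sketch of the usual content, as used by the referenced tutorials:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>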



Then modify the configuration file yarn-site.xml:
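A typical minimal configuration (again a sketch):

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>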



Then you can start YARN (you need to execute ./sbin/start-dfs.sh first):

./sbin/start-yarn.sh # Start YARN
./sbin/mr-jobhistory-daemon.sh start historyserver # Start the history server to view the task running status in the web

After starting, jps shows two additional background processes: NodeManager and ResourceManager.

After running YARN, the method of running the instance is still the same, only the resource management method and task scheduling are different. By observing the log information, it can be found that when YARN is not enabled, "mapred.LocalJobRunner" is running the task. After YARN is enabled, "mapred.YARNRunner" is running the task. One of the nice things about starting YARN is that you can see how the tasks are running via the web interface: http://localhost:8088/cluster .

YARN mainly provides better resource management and task scheduling for a cluster; on a single machine it adds little value and makes programs run slightly slower. Whether to enable YARN on a single machine therefore depends on the actual situation.

Note: if you do not want to start YARN, be sure to rename the configuration file mapred-site.xml back to mapred-site.xml.template, and rename it again when needed. Otherwise, if the file exists but YARN is not running, programs will fail with the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032"; this is also why the file ships with the name mapred-site.xml.template in the first place.

Similarly, the script to close YARN is as follows:

./sbin/stop-yarn.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver
When running this, it warns that mr-jobhistory-daemon.sh is deprecated and that mapred --daemon stop should be used instead, but the script still exists, so the commands above still work.

Spark installation

http://spark.apache.org/downloads.html
On this machine I installed spark-2.3.0-bin-hadoop2.7; the commands below (taken from the referenced tutorial) use spark-1.6.0-bin-without-hadoop, so adjust the file names to match the package you actually downloaded.

sudo tar -zxf ~/download/spark-1.6.0-bin-without-hadoop.tgz -C /usr/local/
cd /usr/local
sudo mv ./spark-1.6.0-bin-without-hadoop/ ./spark
sudo chown -R hadoop:hadoop ./spark # where hadoop is your username

After installation, you need to modify Spark's classpath in ./conf/spark-env.sh, and execute the following command to copy a configuration file:

cd /usr/local/spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh

Edit ./conf/spark-env.sh (vim ./conf/spark-env.sh) and add the following line at the end:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

Running the Spark example
Note that Hadoop must be installed in order to use Spark in this setup, but if Spark does not use HDFS you do not need to start Hadoop. In addition, unless otherwise noted, the commands and directories in the rest of this tutorial assume the Spark installation directory (/usr/local/spark) as the current path.

There are some Spark sample programs in the ./examples/src/main directory, which are available in Scala, Java, Python, R and other languages. We can first run an example program SparkPi (that is, calculate an approximation of π), and execute the following command:

cd /usr/local/spark
./bin/run-example SparkPi

Execution produces a lot of log output and the result is hard to spot, so filter it with grep (the 2>&1 redirects stderr to stdout so that everything can be piped to grep; otherwise the log messages, which go to stderr, would still be printed to the screen):

./bin/run-example SparkPi 2>&1 | grep "Pi is roughly"
The filtered output gives a 5-digit approximation of π.

The Python version of SparkPi needs to be run via spark-submit:

./bin/spark-submit examples/src/main/python/pi.py

Interactive Analysis with the Spark Shell
The Spark Shell provides an easy way to learn the API and an interactive way to analyze data. It supports Scala and Python; Scala is used in this tutorial.

Scala
Scala is a modern multi-paradigm programming language that aims to express common programming patterns in a concise, elegant, and type-safe manner. It smoothly integrates features of object-oriented and functional languages. Scala runs on the Java platform (JVM, Java Virtual Machine) and is compatible with existing Java programs.

Scala is the main programming language of Spark. If you just write Spark applications, you don't have to use Scala. You can use Java or Python. The advantages of using Scala are higher development efficiency, simpler code, and interactive real-time query through Spark Shell to facilitate troubleshooting.

Run the following command to start Spark Shell:

./bin/spark-shell
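Once the shell is up, a SparkContext is available as sc. As a small illustrative example (not from the original post), you can load Spark's bundled README and inspect it:
val textFile = sc.textFile("file:///usr/local/spark/README.md")
textFile.count() // number of lines in the file
textFile.first() // first line of the file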

Connect Jupyter Notebook and Spark through pyspark by adding the following environment variables (for example, to ~/.bashrc):


export SPARK_HOME=/usr/local/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH
export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Run pyspark:
$SPARK_HOME/bin/pyspark
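In the notebook that opens, the pyspark shell has already created sc (and, in Spark 2.x, spark). A quick sanity check (illustrative, not from the original post):
sc.parallelize(range(100)).sum() # should return 4950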

References:
https://wangchangchung.github.io/2017/09/28/Ubuntu-16-04%E4%B8%8A%E5%AE%89%E8%A3%85Hadoop%E5%B9%B6%E6%88%90%E5%8A%9F%E8%BF%90%E8%A1%8C/
http://www.powerxing.com/install-hadoop/
http://www.powerxing.com/hadoop-build-project-by-shell/
http://dblab.xmu.edu.cn/blog/1689-2/
Python MapReduce programs can also be written for Hadoop via Hadoop Streaming.
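For example, a word count written as a Hadoop Streaming mapper and reducer might look like the following minimal sketch (the file names, the stream_output directory and the use of python3 are illustrative assumptions):

mapper.py:
#!/usr/bin/env python3
# Emit "word<TAB>1" for every word read from stdin.
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

reducer.py:
#!/usr/bin/env python3
# Sum the counts per word; Hadoop feeds the mapper output sorted by key.
import sys
current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current:
        count += int(value)
    else:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, int(value)
if current is not None:
    print(current + "\t" + str(count))

Submit it with the streaming jar shipped under share/hadoop/tools/lib:
./bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -input input -output stream_output \
    -mapper "python3 mapper.py" -reducer "python3 reducer.py"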
