Hadoop (1) Quick Guide to Hadoop on Ubuntu
The Apache Hadoop software library scales from single servers to thousands of machines, each offering local computation and storage.
Subprojects:
Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects:
Avro: A data serialization system
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.
1. Single Node Setup
This quick start covers simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).
On Windows 7 we would need to install Cygwin (http://www.cygwin.com/) first, and Windows is supported as a development platform only. I use my Ubuntu virtual machine instead.
Download the Hadoop release from http://mirror.cc.columbia.edu/pub/software/apache//hadoop/common/stable/. The file names are
hadoop-0.23.0-src.tar.gz and hadoop-0.23.0.tar.gz.
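For example, assuming the stable directory still carries these files (Apache mirror paths change over time):
>wget http://mirror.cc.columbia.edu/pub/software/apache//hadoop/common/stable/hadoop-0.23.0.tar.gz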
I decided to build it from source.
Install Protocol Buffers on Ubuntu.
Download the file from http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz:
>wget http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz
>tar zxvf protobuf-2.4.1.tar.gz
>cd protobuf-2.4.1
>sudo ./configure --prefix=/usr/local
>sudo make
>sudo make install
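To verify the install, protoc should now report its version (if it complains about a missing libprotobuf shared library, refreshing the linker cache with sudo ldconfig usually helps):
>protoc --version
libprotoc 2.4.1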
Install Hadoop Common
>svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.23.0/
>cd release-0.23.0
>mvn package -Pdist,native,docs,src -DskipTests -Dtar
Error message:
org.apache.maven.reactor.MavenExecutionException: Failed to validate POM for project org.apache.hadoop:hadoop-project at /home/carl/download/release-0.23.0/hadoop-project/pom.xml
at org.apache.maven.DefaultMaven.getProjects(DefaultMaven.java:404)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:272)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
Solution:
Install Maven 3 instead:
>sudo apt-get remove maven2
>sudo apt-get autoremove maven2
>sudo apt-get install python-software-properties
>sudo add-apt-repository "deb http://build.discursive.com/apt/ lucid main"
>sudo apt-get update
>sudo apt-get install maven
Add this directory to the PATH setting in /etc/environment:
/usr/local/apache-maven-3.0.3/bin
>. /etc/environment
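To confirm the right Maven is now on the PATH (it should report Apache Maven 3.0.x):
>mvn -version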
It works now.
>mvn package -Pdist,native,docs,src -DskipTests -Dtar
It fails, so I check the BUILDING.txt file, which lists the build requirements (quick version checks follow the list):
* Unix System
* JDK 1.6
* Maven 3.0
* Forrest 0.8 (if generating docs)
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.4.1+ (for MapReduce)
* Autotools (if compiling native code)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)
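A quick way to verify the prerequisites already installed on the machine (a sketch covering only the tools with standard version flags):
>java -version # should report 1.6.x
>mvn -version # should report 3.0.x
>protoc --version # should report libprotoc 2.4.1 or newer
>autoconf --version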
Install Forrest on Ubuntu
http://forrest.apache.org/
>wget http://mirrors.200p-sf.sonic.net/apache//forrest/apache-forrest-0.9.tar.gz
>tar zxvf apache-forrest-0.9.tar.gz
>sudo mv apache-forrest-0.9 /usr/local/
>sudo vi /etc/environment
Add /usr/local/apache-forrest-0.9/bin to the PATH.
>. /etc/environment
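A quick check that the forrest script is picked up from the new PATH:
>which forrest
/usr/local/apache-forrest-0.9/bin/forrest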
Install Autotools in Ubuntu
>sudo apt-get install build-essential g++ automake autoconf gnu-standards autoconf-doc libtool gettext autoconf-archive
Build Hadoop again:
>mvn package -Pdist -DskipTests=true -Dtar
Build success. I can get the file /home/carl/download/release-0.23.0/hadoop-dist/target/hadoop-0.23.0-SNAPSHOT.tar.gz
Make sure ssh and rsync are installed on the system:
>sudo apt-get install ssh
>sudo apt-get install rsync
Unpack the Hadoop distribution:
>tar zxvf hadoop-0.23.0-SNAPSHOT.tar.gz
>sudo mv hadoop-0.23.0-SNAPSHOT /usr/local/
>cd /usr/local/
>sudo mv hadoop-0.23.0-SNAPSHOT hadoop-0.23.0
>cd hadoop-0.23.0/conf/
>vi hadoop-env.sh
Modify the JAVA_HOME line to the following:
JAVA_HOME=/usr/lib/jvm/java-6-sun
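If you are not sure where the JDK lives, list the installed JVMs first; /usr/lib/jvm/java-6-sun assumes the Sun JDK package, and the directory name may differ on your machine:
>ls /usr/lib/jvm/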
Check the hadoop command:
>bin/hadoop version
Hadoop 0.23.0-SNAPSHOT
Subversion http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.23.0/hadoop-common-project/hadoop-common -r 1196973
Compiled by carl on Wed Nov 30 02:32:31 EST 2011
From source with checksum 4e42b2d96c899a98a8ab8c7cc23f27ae
Hadoop can run in three modes:
Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode
Standalone Operation
>mkdir input
>cp conf/*.xml input
>vi input/1.xml
YARNtestforfun
>bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar grep input output 'YARN[a-zA-Z.]+'
>cat output/*
1 YARNtestforfun
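Note that MapReduce will refuse to run if the output directory already exists, so remove it before re-running the example:
>rm -rf output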
Pseudo-Distributed Operation
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
Configuration
conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Set up passphraseless ssh
>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
>cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
>ssh localhost
Now I can ssh to localhost without a password.
Execution
Format a new distributed filesystem:
>bin/hadoop namenode -format
Start the hadoop daemons:
>bin/start-all.sh
The logs go to ${HADOOP_HOME}/logs, in this case /usr/local/hadoop-0.23.0/logs/yarn-carl-nodemanager-ubuntus.out. The error messages are as follows:
No HADOOP_CONF_DIR set.
Please specify it either in yarn-env.sh or in the environment.
Solution:
>sudo vi yarn-env.sh
>sudo vi /etc/environment
>sudo vi hadoop-env.sh
Add the following lines:
HADOOP_CONF_DIR=/usr/local/hadoop-0.23.0/conf
HADOOP_COMMON_HOME=/usr/local/hadoop-0.23.0/share/hadoop/common
HADOOP_HDFS_HOME=/usr/local/hadoop-0.23.0/share/hadoop/hdfs
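In yarn-env.sh and hadoop-env.sh the same values are usually written as exports so the daemons inherit them (a minimal sketch, assuming the install paths above):
export HADOOP_CONF_DIR=/usr/local/hadoop-0.23.0/conf
export HADOOP_COMMON_HOME=/usr/local/hadoop-0.23.0/share/hadoop/common
export HADOOP_HDFS_HOME=/usr/local/hadoop-0.23.0/share/hadoop/hdfs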
>bin/start-all.sh
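To see whether the daemons actually came up, jps from the JDK lists the running Java processes (daemon names vary by version; in 0.23 expect the likes of NameNode, DataNode, ResourceManager, and NodeManager):
>jps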
http://192.168.56.101:9999/node
http://192.168.56.101:8088/cluster
Change the configuration files, and comment out all the other XML files in the conf directory.
>vi conf/yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
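After editing the configuration, restart the daemons so the new settings take effect (assuming stop-all.sh sits next to start-all.sh in this release):
>bin/stop-all.sh
>bin/start-all.sh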
Some things differ in the latest version 0.23.0, so I need to make changes according to another guide.
References:
http://guoyunsky.iteye.com/category/186934
http://hadoop.apache.org/common/docs/r0.19.2/cn/quickstart.html
http://hadoop.apache.org/
http://guoyunsky.iteye.com/blog/1233707
http://hadoop.apache.org/common/
http://hadoop.apache.org/common/docs/stable/single_node_setup.html
http://www.blogjava.net/shenh062326/archive/2011/11/10/yuling_hadoop_0-23_compile.html
http://sillycat.iteye.com/blog/965534
http://www.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/SingleCluster.html
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html