Introduction to Hadoop

Foreword

I recently wanted to learn big data and was a bit anxious about it, so I went online and looked at videos from various training institutions. Most of them were of poor quality: the basic theory was either skipped or mentioned only in passing, and then they had you type in code by rote imitation, so when a problem came up there was no way to analyze it. In the end I found the Xiamen University open course and decided to start from the theory and learn slowly and steadily.

1 The origin of Hadoop

The core ideas behind Hadoop originated in three papers from Google, and they were quickly applied at the world's leading Internet companies, so this is a piece of background you cannot avoid when learning Hadoop and big data. The rise of big data in recent years really comes from the development of computer technology: networking has produced enormous amounts of data, and cloud technology has made storage and computing resources cheap and widely available, so big data technology emerged to solve the problem of storing and computing over massive amounts of data.

2 The Hadoop ecosystem

When learning Hadoop, you need to understand the function and role of each project in the Hadoop ecosystem, and why a new project was developed instead of using an existing one to provide the same functionality.

2.1 HDFS

HDFS is Hadoop's underlying file system. It differs from a conventional file system in that it is distributed, and compared with existing distributed file systems it has advantages that the older systems did not have, such as: high availability, high reliability, high throughput, it can be built from low-cost servers, and it can be scaled out simply by adding machines. The implementation details are recorded in the HDFS article.
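
As a rough sketch of what working with HDFS looks like from code, here is a minimal example using the HDFS Java API. The file path is made up for illustration, and the fs.defaultFS address assumes the hdfs://Master:9000 setting configured in core-site.xml later in this post.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://Master:9000"); // must match core-site.xml

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/hadoop/hello.txt"); // example path only

            // Write: the client asks the NameNode for metadata,
            // then streams the blocks to the DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}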

2.2 HBase

HBase is a distributed database built on top of HDFS. It is a column-oriented NoSQL database, and from another angle it can also be seen as a key-value store. Compared with a traditional relational database, probably its biggest advantage is that it scales by adding machines and can run on inexpensive servers.
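
To make the "column-oriented key-value" description more concrete, here is a minimal sketch using the HBase Java client API. The table name "user" and the column family "info" are made up for illustration, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) { // example table

            // A cell is addressed by (row key, column family, column qualifier).
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Point lookups by row key are the kind of real-time query HBase is good at.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}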

2.3 Hive

Hive is a distributed data warehouse operated with a SQL-like language, into which external data can be imported. It is generally used for querying and analyzing historical data. This is different from HBase, which is commonly used for real-time interactive queries.
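
As a minimal sketch of querying Hive with SQL from Java over JDBC, assuming a HiveServer2 instance listening on the default port 10000 on the Master node; the "logs" table and the login user are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://Master:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             // A typical batch-style analysis over historical data;
             // Hive compiles the SQL into distributed jobs behind the scenes.
             ResultSet rs = stmt.executeQuery(
                 "SELECT dt, COUNT(*) FROM logs GROUP BY dt")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}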

2.4 MapReduce

MapReduce is a distributed computing framework; MapReduce was originally the name of a computational model. Its core idea is "divide and conquer": a computation is broken down into many small computations that are performed simultaneously on multiple machines. It is suited to offline batch processing.
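
The classic WordCount example shows the "divide and conquer" idea with the Hadoop MapReduce Java API: each mapper processes its own split of the input in parallel, and the reducer combines the partial counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: runs in parallel on each split of the input, emitting (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce step: all counts for the same word arrive together and are summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}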

2.5 Storm

Storm is a stream computing framework. MapReduce, as a batch framework, cannot handle streaming data, which is why stream processing frameworks were developed.

2.6 Common big data processing requirements

  • Offline batch processing: works on historical data, processes large volumes, and the main requirement is throughput.
  • Real-time interactive processing: used interactively by people, so the response time needs to be somewhere between a few seconds and a few minutes.
  • Streaming data processing: data arrives as a continuous stream, the required processing latency is on the order of milliseconds, and most of the data is not stored after processing.

2.7 How the Hadoop components fit together

The basic relationship is that HDFS provides the underlying storage and MapReduce is the core computing framework on top of it. Components such as Hive, HBase and Pig generally translate the operations written against them into MapReduce code, and the actual computation is carried out by MapReduce. Storm sits at the same level as MapReduce and handles streaming data. Although HBase processes data through the MapReduce framework, it can still meet the basic requirements of real-time interactive queries. (MapReduce also has its own problems, which is why Spark has been gradually rising; MapReduce has received all kinds of optimizations, but in some areas it still lags behind Spark.)

3 Hadoop installation

  1. Preparation: several Linux servers on the same LAN. I substituted three virtual machines running at the same time on my gaming machine.

  2. Create a hadoop user on each Linux machine to own the Hadoop installation, which makes management and permission separation easier.

  3. Install the JDK. The official download page for Hadoop states the required JDK version; set the JDK environment variables (such as JAVA_HOME).

  4. Install SSH and set up passwordless (key-based) login, because communication between the NameNode that manages HDFS and the other DataNodes runs over the SSH protocol. Also write the hostnames and addresses of the machines into the hosts file so they can be referred to by name.

  5. Download Hadoop from the official website and extract it. Then modify the Hadoop configuration files, located under hadoop/etc/hadoop/:
  • slaves. Lists the DataNode machines. Because the hosts file was edited in the previous step, you can write hostnames directly instead of IP addresses.
  • core-site.xml. The core Hadoop configuration file.
    fs.defaultFS: the host and port of the default file system, which here is HDFS.
    hadoop.tmp.dir: the path for Hadoop's temporary files. If it is not set, a system temporary directory is used, whose contents are lost after a reboot.
<configuration>
       <property>
               <name>fs.defaultFS</name>
               <value>hdfs://Master:9000</value>
       </property>
       <property>
               <name>hadoop.tmp.dir</name>
               <value>file:/usr/local/hadoop/tmp</value>
               <description>Abase for other temporary directories.</description>
       </property>
</configuration>
  • hdfs-site.xml. The HDFS configuration file.
    dfs.namenode.secondary.http-address: the machine and port of the SecondaryNameNode.
    dfs.replication: the number of replicas HDFS keeps for each file.
    dfs.namenode.name.dir, dfs.datanode.data.dir: where the NameNode and the DataNode store their data in the local file system.
<configuration>
       <property>
               <name>dfs.namenode.secondary.http-address</name>
               <value>Master:50090</value>
       </property>
       <property>
               <name>dfs.replication</name>
               <value>1</value>
       </property>
       <property>
               <name>dfs.namenode.name.dir</name>
               <value>file:/usr/local/hadoop/tmp/dfs/name</value>
       </property>
       <property>
               <name>dfs.datanode.data.dir</name>
               <value>file:/usr/local/hadoop/tmp/dfs/data</value>
       </property>
</configuration>
  • mapred-site.xml. The MapReduce configuration file.
    mapreduce.framework.name: the resource management system MapReduce runs on, here yarn. Originally MapReduce managed the resources for distributed computation itself, but that turned out to be inefficient, so resource management was split out and developed into a separate framework.
    mapreduce.jobhistory.address: the machine and port of the MapReduce job history server.
    mapreduce.jobhistory.webapp.address: the machine and port of the job history server's web UI, through which you can view the job logs.
<configuration>
       <property>
               <name>mapreduce.framework.name</name>
               <value>yarn</value>
       </property>
       <property>
               <name>mapreduce.jobhistory.address</name>
               <value>Master:10020</value>
       </property>
       <property>
               <name>mapreduce.jobhistory.webapp.address</name>
               <value>Master:19888</value>
       </property>
</configuration>
  • yarn-site.xml. The YARN configuration file.
    yarn.resourcemanager.hostname: the machine that runs the YARN ResourceManager, which is responsible for overall resource allocation and management.
    yarn.nodemanager.aux-services: auxiliary services that can be plugged in; for example, MapReduce's shuffle relies on this setting. For our purposes, filling in mapreduce_shuffle is enough.
<configuration>
       <property>
               <name>yarn.resourcemanager.hostname</name>
               <value>Master</value>
       </property>
       <property>
               <name>yarn.nodemanager.aux-services</name>
               <value>mapreduce_shuffle</value>
       </property>
</configuration>
  6. After Hadoop is configured, copy the files from the master node to every slave node; the installation of Hadoop's HDFS and MapReduce is then complete. (On CentOS you also need to turn off the corresponding firewall.)

4 Installing other Hadoop ecosystem components

A look at other users' blog posts and the official documentation shows they are much the same, and follow a basic pattern: download and extract, configure the environment variables, then edit the component's configuration files, which are basically xxx-env.sh, xxx-site.xml, slaves/workers and the like. Configure the parameters as needed; what each parameter means and which ones are mandatory becomes clear once you read the official documentation.

Source: www.cnblogs.com/taojinxuan/p/11130321.html