Hadoop Learning Roadmap

Contents:
1. Hadoop Family Products
2. Hadoop Family Learning Roadmap

    1. Hadoop family products
      As of 2013, according to Cloudera's statistics, the Hadoop family had grown to 20 products!
      Below, I divide these 20 products into two categories:
      The first category covers the products I have already mastered.
      The second category covers the products I plan (TODO) to continue learning.

One-sentence product introductions:
Apache Hadoop: an open source distributed computing framework from the Apache open source organization, providing a distributed file system subproject (HDFS) and a software architecture that supports MapReduce distributed computing.
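
To make the HDFS side concrete, here is a minimal Java sketch of a client writing and listing files through the Hadoop FileSystem API; the namenode address and paths are placeholders for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder namenode address; adjust to your cluster.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);
            // Write a small file, then list its parent directory.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
                out.writeUTF("hello hdfs");
            }
            for (FileStatus s : fs.listStatus(new Path("/tmp"))) {
                System.out.println(s.getPath());
            }
            fs.close();
        }
    }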

Apache Hive: a data warehouse tool built on Hadoop. It maps structured data files to database tables and lets you run simple MapReduce statistics through SQL-like statements, without developing dedicated MapReduce applications. It is well suited to statistical analysis of data warehouses.
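
As a sketch of what "SQL-like statements instead of dedicated MapReduce code" looks like in practice, the following assumes a running HiveServer2 and a hypothetical page_views table; the connection URL and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Placeholder HiveServer2 URL and credentials.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();
            // Hive compiles this SQL-like statement into MapReduce work.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM page_views GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }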

Apache Pig: a large-scale data analysis tool built on Hadoop. Its SQL-like language is called Pig Latin, and the language's compiler converts SQL-like data analysis requests into a series of optimized MapReduce operations.
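
A minimal sketch of Pig Latin, here embedded through the PigServer Java API in local mode; the input file and field names are hypothetical.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigGroupDemo {
        public static void main(String[] args) throws Exception {
            // Local mode for illustration; ExecType.MAPREDUCE runs on a cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);
            // Pig Latin: load, group, and count records per user.
            pig.registerQuery("logs = LOAD 'access.log' AS (user:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP logs BY user;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
            pig.store("counts", "counts_out");
        }
    }

The compiler turns these few statements into the optimized MapReduce pipeline described above.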

Apache HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system. HBase can be used to build large-scale structured storage clusters on inexpensive PC servers.
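
A minimal sketch of the column-oriented model using the HBase Java client, assuming a recent client API and an existing "user" table with an "info" column family:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGetDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user"))) {
                // Write one cell: row "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("conan"));
                table.put(put);
                // Read the cell back by row key.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }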

Apache Sqoop: a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into HDFS, and it can also export HDFS data back into a relational database.

Apache Zookeeper: a distributed, open source coordination service designed for distributed applications. It is mainly used to solve the data management problems frequently encountered in distributed applications, simplifying their coordination and management while providing high-performance distributed services.
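
A minimal sketch of the coordination primitive ZooKeeper exposes: clients read and write small znodes in a shared tree. The server address and znode path below are placeholders.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder server address; 30s session timeout, no-op watcher.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
            // Publish a piece of shared configuration as a persistent znode.
            String path = zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // Any client in the cluster can now read (and watch) this znode.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }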

Apache Mahout: a distributed machine learning and data mining framework built on Hadoop. Mahout implements some data mining algorithms with MapReduce, solving the problem of parallel mining.
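
For a flavor of Mahout's recommender API, here is a user-based collaborative filtering (UserCF) sketch using the single-machine Taste classes; the ratings.csv file (userID,itemID,preference rows) and the neighborhood size are assumptions. Mahout's MapReduce versions of these algorithms apply the same concepts at cluster scale.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class UserCfDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder ratings file with "userID,itemID,preference" lines.
            DataModel model = new FileDataModel(new File("ratings.csv"));
            PearsonCorrelationSimilarity similarity =
                    new PearsonCorrelationSimilarity(model);
            // Consider the 10 most similar users when recommending.
            NearestNUserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top-3 item recommendations for user 1.
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " " + item.getValue());
            }
        }
    }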

Apache Cassandra: an open source distributed NoSQL database system. Originally developed by Facebook to store data in a simple format, it combines the data model of Google BigTable with the fully distributed architecture of Amazon Dynamo.

Apache Avro: a data serialization system designed to support data-intensive applications that exchange large volumes of data. Avro is a new data serialization format and transmission tool that will gradually replace Hadoop's original IPC mechanism.
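
A minimal Avro round trip in Java: define a schema, serialize a record into a container file, and read it back. The two-field User schema here is hypothetical.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema with one string field and one int field.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "conan");
            user.put("age", 30);
            File file = new File("users.avro");
            // Serialize the record into an Avro container file.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
                writer.create(schema, file);
                writer.append(user);
            }
            // Deserialize: the schema travels with the file.
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord r : reader) {
                    System.out.println(r);
                }
            }
        }
    }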

Apache Ambari: is a web-based tool that supports provisioning, management, and monitoring of Hadoop clusters.

Apache Chukwa: an open source data collection system for monitoring large-scale distributed systems. It collects various types of data into files suitable for Hadoop processing and saves them in HDFS, where Hadoop can run various MapReduce operations on them.

Apache Hama: It is a BSP (Bulk Synchronous Parallel) parallel computing framework based on HDFS. Hama can be used for large-scale, big data computing including graph, matrix and network algorithms.

Apache Flume: a distributed, reliable, highly available system for aggregating massive logs, used to collect, process, and transmit log data.

Apache Giraph: is a scalable distributed iterative graph processing system based on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.

Apache Oozie: is a workflow engine server for managing and coordinating tasks running on the Hadoop platform (HDFS, Pig and MapReduce).

Apache Crunch: a Java library, based on Google's FlumeJava library, for creating MapReduce programs. Like Hive and Pig, Crunch provides pattern libraries for common tasks such as joining data, performing aggregations, and sorting records.
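
A sketch of the Crunch style for one of those common tasks, a word count expressed as collection transformations; the file paths are placeholders.

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class CrunchWordCount {
        public static void main(String[] args) {
            // MRPipeline compiles the collection operations below into MapReduce jobs.
            Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
            PCollection<String> lines = pipeline.readTextFile("input.txt");
            // Split each line into words.
            PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
                @Override
                public void process(String line, Emitter<String> emitter) {
                    for (String word : line.split("\\s+")) {
                        emitter.emit(word);
                    }
                }
            }, Writables.strings());
            // Aggregate: count occurrences of each distinct word.
            PTable<String, Long> counts = words.count();
            pipeline.writeTextFile(counts, "output");
            pipeline.done();
        }
    }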

Apache Whirr: a set of libraries that run on cloud services (including Hadoop), providing a high degree of complementarity. Whirr supports Amazon EC2 and Rackspace services.

Apache Bigtop: A tool for packaging, distributing and testing Hadoop and its surrounding ecosystem.

Apache HCatalog: Hadoop-based data table and storage management that provides central metadata and schema management spanning Hadoop and RDBMSs, with relational views exposed through Pig and Hive.

Cloudera Hue: a web-based monitoring and management system that enables browser-based operation and management of HDFS, MapReduce/YARN, HBase, Hive, and Pig.

  2. Hadoop family learning roadmap
    Below I introduce the installation and use of each product, and summarize my learning route based on my own experience.
    Hadoop
    Hadoop Learning Roadmap

 Yarn Learning Roadmap

 Build Hadoop projects with Maven

Installing historical versions of Hadoop

Calling HDFS from Hadoop programs

Analyzing massive web logs with Hadoop to extract KPI statistics

 Building a movie recommendation system with Hadoop

Create a Hadoop parent virtual machine

 Clone virtual machines to add Hadoop nodes

R language injects statistical blood into Hadoop

RHadoop practice series (1): building a Hadoop environment

 Matrix Multiplication with MapReduce

 Parallel implementation of PageRank algorithm

PeopleRank discovers individual value from social networks
Hive

Hive Learning Roadmap

Hive installation and usage guide

Testing a 10 GB data import into Hive

The R Sword NoSQL series: Hive


Extracting reverse repurchase information from historical data with RHive
Pig

Pig Learning Roadmap
Zookeeper

 Zookeeper learning roadmap

ZooKeeper pseudo-distributed cluster installation and use

ZooKeeper implements a distributed queue (Queue)

ZooKeeper implements a distributed FIFO queue

A ZooKeeper-based distributed queue system integration case
HBase

HBase learning roadmap

Installing HBase on Ubuntu

RHadoop practice series (4): rhbase installation and use
Mahout

Mahout Learning Roadmap

Analyzing Mahout's user-based collaborative filtering recommendation algorithm (UserCF) with R

RHadoop practice series (3): implementing the MapReduce collaborative filtering algorithm in R

 Build the Mahout project with Maven

A detailed explanation of the Mahout recommendation algorithm API

Dissecting the Mahout recommendation engine from its source code

Developing item-based collaborative filtering (ItemCF) as a distributed Mahout program

Developing K-means clustering as a distributed Mahout program

 Build a job recommendation engine with Mahout

Building a book recommendation system with Mahout
Sqoop

Sqoop Learning Roadmap
Cassandra

 Cassandra Learning Roadmap

A 2-node Cassandra single-cluster experiment

The R Sword NoSQL series: Cassandra
