Contents:
1. Hadoop Family Products
2. Hadoop Family Learning Roadmap
- Hadoop family products
As of 2013, according to Cloudera's statistics, the Hadoop family has grown to 20 products!
Below, I divide these 20 products into two categories:
The first category: products I have already mastered.
The second category: products marked TODO that I plan to continue learning.
One-sentence product introductions:
Apache Hadoop: an open-source distributed computing framework from the Apache Software Foundation, providing a distributed file system subproject (HDFS) and a software architecture that supports MapReduce distributed computing.
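The MapReduce model behind Hadoop can be sketched in a few lines of plain Python. The word-count example below is only a conceptual illustration of the map, shuffle, and reduce phases; it is not the real Hadoop Java API, and the function names are my own.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts emitted for one word
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_count(["hadoop maps then reduces", "hadoop scales"]))
# {'hadoop': 2, 'maps': 1, 'then': 1, 'reduces': 1, 'scales': 1}
```

In real Hadoop, the map and reduce functions run on different machines and the shuffle moves data over the network; the logic, however, is exactly this.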
Apache Hive: a Hadoop-based data warehouse tool. It maps structured data files onto database tables and runs simple MapReduce statistics through SQL-like statements, with no need to develop dedicated MapReduce applications. It is well suited to statistical analysis in data warehouses.
Apache Pig: a Hadoop-based tool for large-scale data analysis. The SQL-like language it provides is called Pig Latin; its compiler converts Pig Latin data analysis requests into a series of optimized MapReduce operations.
Apache HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system. HBase can be used to build large-scale structured storage clusters on inexpensive commodity servers.
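HBase's column-oriented data model maps a (row key, column family:qualifier, timestamp) coordinate to a value, with multiple timestamped versions per cell. The class below is an in-memory stand-in of my own to show that model; it is not the real HBase client API.

```python
import time

class TinyColumnStore:
    """Toy sketch of HBase's (row key, "family:qualifier", timestamp)
    -> value model with versioned cells. Not the real HBase API."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts=None):
        # Each cell keeps a list of (timestamp, value) versions
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts if ts is not None else time.time(), value))

    def get(self, row, column):
        # By default HBase returns the newest version of a cell
        versions = self.rows.get(row, {}).get(column, [])
        return max(versions)[1] if versions else None

store = TinyColumnStore()
store.put("user1", "info:name", "alice", ts=1)
store.put("user1", "info:name", "alicia", ts=2)
print(store.get("user1", "info:name"))  # newest version wins: alicia
```

The real system adds region splitting, write-ahead logging, and storage in HDFS on top of this logical model.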
Apache Sqoop: a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, PostgreSQL, etc.) into Hadoop's HDFS, or export HDFS data back into a relational database.
Apache ZooKeeper: a distributed, open-source coordination service designed for distributed applications. It mainly solves the data management problems that distributed applications frequently encounter, simplifying their coordination and management while providing a high-performance distributed service.
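One classic coordination pattern built on ZooKeeper is a distributed queue: enqueue creates a sequential znode (queue-0000000000, queue-0000000001, ...), and dequeue takes the node with the lowest sequence number. The class below is an in-memory sketch of that recipe, not the real ZooKeeper client; the names are my own.

```python
import itertools

class ToyZkQueue:
    """In-memory sketch of the ZooKeeper queue recipe: sequential
    node names give a total order, and the lowest name is the head."""

    def __init__(self):
        self.seq = itertools.count()
        self.nodes = {}

    def enqueue(self, data):
        # ZooKeeper appends a monotonically increasing, zero-padded
        # sequence number to sequential znode names
        name = "queue-%010d" % next(self.seq)
        self.nodes[name] = data
        return name

    def dequeue(self):
        if not self.nodes:
            return None
        name = min(self.nodes)  # lowest sequence number = oldest entry
        return self.nodes.pop(name)

q = ToyZkQueue()
q.enqueue("job-a")
q.enqueue("job-b")
print(q.dequeue())  # oldest entry first: job-a
```

In a real deployment the znodes live on the ZooKeeper ensemble, so many processes on different machines see the same queue; the ordering trick is identical.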
Apache Mahout: a Hadoop-based distributed framework for machine learning and data mining. Mahout implements a number of data mining algorithms with MapReduce, solving the problem of parallel mining.
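A representative Mahout algorithm is user-based collaborative filtering (UserCF), which also appears in the roadmap below. This is a minimal single-machine sketch of the idea, assuming cosine similarity over co-rated items; it is not Mahout's API, and the data is invented for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two users' rating dicts
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def recommend(user, ratings):
    # Score items the user has not rated, weighted by user similarity
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 3, "C": 5},
    "u3": {"D": 2},
}
print(recommend("u1", ratings))  # "C" ranks first: u2 is most similar to u1
```

Mahout's contribution is running exactly this kind of similarity computation as MapReduce jobs over datasets too large for one machine.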
Apache Cassandra: an open-source distributed NoSQL database system. Originally developed by Facebook to store simply formatted data, it combines Google BigTable's data model with Amazon Dynamo's fully distributed architecture.
Apache Avro: a data serialization system designed to support data-intensive applications with large data volumes. Avro is a new data serialization format and transfer tool that will gradually replace Hadoop's original IPC mechanism.
Apache Ambari: is a web-based tool that supports provisioning, management, and monitoring of Hadoop clusters.
Apache Chukwa: It is an open source data collection system for monitoring large-scale distributed systems. It can collect various types of data into files suitable for Hadoop processing and save them in HDFS for Hadoop to perform various MapReduce operations.
Apache Hama: a BSP (Bulk Synchronous Parallel) computing framework based on HDFS. Hama can be used for large-scale big-data computation, including graph, matrix, and network algorithms.
Apache Flume: a distributed, reliable, and highly available system for aggregating massive logs, usable for log data collection, processing, and transmission.
Apache Giraph: is a scalable distributed iterative graph processing system based on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.
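In the Pregel/BSP model that Giraph follows, computation proceeds in supersteps: each vertex reads the messages sent to it in the previous superstep, updates its value, and sends new messages along its edges. The sketch below runs PageRank that way on a toy graph; it is my own single-machine illustration, not the Giraph API, and it assumes every vertex has at least one outgoing edge.

```python
def pregel_pagerank(graph, supersteps=20, d=0.85):
    """Vertex-centric PageRank in the Pregel/BSP style.
    graph: dict mapping each vertex to its list of out-neighbors."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Message phase: each vertex sends rank/outdegree to neighbors
        inbox = {v: 0.0 for v in graph}
        for v, edges in graph.items():
            for dst in edges:
                inbox[dst] += rank[v] / len(edges)
        # Compute phase: each vertex updates from its received messages
        rank = {v: (1 - d) / n + d * inbox[v] for v in graph}
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pregel_pagerank(graph))  # symmetric ring: every vertex gets 1/3
```

The barrier between the message phase and the compute phase is exactly the "synchronous" part of Bulk Synchronous Parallel: no vertex starts superstep k+1 until all messages from superstep k have been delivered.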
Apache Oozie: is a workflow engine server for managing and coordinating tasks running on the Hadoop platform (HDFS, Pig and MapReduce).
Apache Crunch: a Java library based on Google's FlumeJava library for writing MapReduce programs. Like Hive and Pig, Crunch provides pattern libraries for common tasks such as joining data, performing aggregations, and sorting records.
Apache Whirr: a set of libraries for running cloud services (including Hadoop), offering a high degree of complementarity. Whirr currently supports Amazon EC2 and Rackspace services.
Apache Bigtop: A tool for packaging, distributing and testing Hadoop and its surrounding ecosystem.
Apache HCatalog: Hadoop-based table and storage management that provides centralized metadata and schema management spanning Hadoop and RDBMSs, with relational views exposed through Pig and Hive.
Cloudera Hue: a web-based monitoring and management system that provides web operations and administration for HDFS, MapReduce/YARN, HBase, Hive, and Pig.
- Hadoop family learning roadmap
Below I introduce the installation and use of each product, and summarize a learning route based on my own experience.
Hadoop
Hadoop Learning Roadmap
Yarn Learning Roadmap
Build Hadoop projects with Maven
Hadoop historical version installation
Hadoop programming calls HDFS
Massive web log analysis: extracting KPI statistics with Hadoop
Building a movie recommendation system with Hadoop
Create a Hadoop parent virtual machine
Clone virtual machines to add Hadoop nodes
R language injects statistical blood into Hadoop
RHadoop practice series (1): setting up the Hadoop environment
Matrix Multiplication with MapReduce
Parallel implementation of PageRank algorithm
PeopleRank: discovering individual value from social networks
Hive
Hive Learning Roadmap
Hive installation and usage guide
Testing Hive by importing 10 GB of data
R Sword of NoSQL series: Hive
Extracting reverse-repurchase (repo) information from historical data with RHive
Pig
Pig Learning Roadmap
Zookeeper
Zookeeper learning roadmap
ZooKeeper pseudo-distributed cluster installation and use
ZooKeeper implements distributed queue Queue
ZooKeeper implements distributed FIFO queue
A ZooKeeper-based distributed queue system integration case
HBase
HBase learning roadmap
Install HBase in Ubuntu
RHadoop practice series (4): rhbase installation and use
Mahout
Mahout Learning Roadmap
Analyze Mahout user recommendation collaborative filtering algorithm (UserCF) with R
RHadoop practice series (3): implementing a MapReduce collaborative filtering algorithm in R
Build the Mahout project with Maven
Mahout recommendation algorithm API detailed explanation
Dissecting Mahout recommendation engine from source code
Mahout distributed program development: item-based collaborative filtering (ItemCF)
Mahout distributed program development: KMeans clustering
Build a job recommendation engine with Mahout
Mahout builds book recommendation system
Sqoop
Sqoop Learning Roadmap
Cassandra
Cassandra Learning Roadmap
Cassandra single-cluster experiment with 2 nodes
R Sword of NoSQL series: Cassandra