Java Future Trends: Java Promotes the Development of Big Data

Without Java, big data would not have developed the way it has; Hadoop itself is written in Java. When you need to release new functionality on a cluster of servers running MapReduce, you need dynamic deployment, which is something Java is good at.


Mainstream open-source tools in the big data field that are built on or support Java:

1. HDFS

HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (the master node), which manages the file system metadata, and DataNodes (there can be many), which store the actual data. HDFS is designed for massive data sets: whereas traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing a smaller number of very large files.
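
To make the NameNode/DataNode split concrete, here is a minimal sketch in plain Java of the kind of metadata a NameNode keeps: which DataNodes hold each block of a large file. It does not use the Hadoop API. The class name, the DataNode names, the 128 MB block size, the replication factor of 3, and the round-robin placement are all assumptions chosen for illustration; real HDFS uses a rack-aware placement policy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of NameNode metadata: which DataNodes hold each block of a file.
// Round-robin placement is an assumption for illustration only; real HDFS
// uses a rack-aware placement policy.
class BlockPlacementSketch {

    // Number of fixed-size blocks needed to store a file of the given length.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // For each block, pick `replication` distinct DataNodes in round-robin order.
    static List<List<String>> place(long fileBytes, long blockBytes,
                                    List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for replication");
        }
        List<List<String>> placement = new ArrayList<>();
        long blocks = blockCount(fileBytes, blockBytes);
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((int) ((b + r) % dataNodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 128 MB blocks needs 3 blocks (128 + 128 + 44).
        System.out.println(place(300 * mb, 128 * mb,
                List.of("dn1", "dn2", "dn3", "dn4"), 3));
    }
}
```

The point of the sketch is that one large file turns into only a handful of metadata entries, which is why a few large files are cheaper for the NameNode to track than many small ones.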

2. MapReduce

Hadoop MapReduce is a software framework for easily writing applications that process massive (multi-terabyte) data sets in parallel on large clusters of thousands of nodes built from commodity hardware, in a reliable and fault-tolerant manner.
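
The programming model behind the framework can be sketched in a few lines of plain Java. The example below runs the classic word count through the map, shuffle, and reduce phases in memory on a single machine. It deliberately does not use the Hadoop API, and every class and method name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce phases.
// It does not use the Hadoop API; all names here are hypothetical.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Run all three phases over the input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        // prints {big=1, data=1, hadoop=1, java=2, needs=1, runs=1}
        System.out.println(run(List.of("big data needs Java", "Java runs Hadoop")));
    }
}
```

In a real cluster the map calls run on many nodes at once and the shuffle moves data across the network, but the contract of each phase is the same as in this sketch.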

3. HBase

Apache HBase is the Hadoop database, a distributed and scalable big data store. It provides random, real-time read/write access to large data sets and is optimized for very large tables hosted on clusters of commodity servers: billions of rows by millions of columns. At its core it is an open-source implementation of the distributed, column-oriented storage described in the Google Bigtable paper. Just as Bigtable builds on the distributed data storage provided by GFS (Google File System), HBase provides Bigtable-like capabilities on top of Apache Hadoop and HDFS.
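
The Bigtable-style data model can be pictured as a sorted, nested map: row key, then column, then timestamp, then value. The following sketch models that idea with the JDK's TreeMap. It is not the HBase client API; the class name, method names, and the "cf:name" column naming are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the Bigtable-style data model: a sorted map of
// row key -> column -> timestamp -> value. Not the HBase client API.
class SortedTableSketch {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Write one version of a cell; versions are kept newest-first.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Random read: the newest version of a cell, or null if the cell is absent.
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Range scan: row keys in [startRow, stopRow), in sorted order.
    Iterable<String> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false).keySet();
    }

    public static void main(String[] args) {
        SortedTableSketch table = new SortedTableSketch();
        table.put("row2", "cf:name", 1L, "old");
        table.put("row2", "cf:name", 2L, "new");
        table.put("row1", "cf:name", 1L, "a");
        System.out.println(table.get("row2", "cf:name")); // prints new
        System.out.println(table.scan("row1", "row3"));   // prints [row1, row2]
    }
}
```

Keeping rows sorted by key is what makes both point lookups and range scans cheap in this model.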

4. Cassandra

Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure, which makes it a strong platform for mission-critical data.

Cassandra's support for replication across multiple data centers is among the best available, giving users lower latency and more reliable disaster recovery. Its data model combines the convenience of column indexes (secondary indexes) with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

5. Hive

Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad hoc querying, and analysis of large data sets stored in Hadoop-compatible file systems. Hive provides a SQL-like query language called HiveQL. When expressing some piece of logic in HiveQL would be inefficient or cumbersome, the language also lets traditional Map/Reduce programmers plug in their own custom Mappers and Reducers.

6. Pig

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn lets them handle very large data sets. Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer currently consists of a textual language, Pig Latin, which was designed for ease of programming and extensibility.


7. Chukwa

Apache Chukwa is an open source data collection system for monitoring large distributed systems. Built on top of HDFS and the Map/Reduce framework, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, so that the collected data can be put to the best use.

8. Ambari

Apache Ambari is a web-based tool for configuring, managing, and monitoring Apache Hadoop clusters, supporting Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides cluster health dashboards such as heatmaps and the ability to view MapReduce, Pig, and Hive applications to diagnose their performance characteristics with a user-friendly interface.

9. ZooKeeper

Apache ZooKeeper is a reliable coordination service for large-scale distributed systems. Its functions include configuration maintenance, naming, distributed synchronization, and group services.

The goal of ZooKeeper is to encapsulate these complex, error-prone core services behind an easy-to-use interface, backed by a system that is fast and stable.

10. Sqoop

Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database into HDFS, and export data from HDFS back into a relational database.

11. Oozie

Apache Oozie is a scalable, reliable, and extensible workflow scheduling system for managing Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (e.g., Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and Distcp) as well as system-specific jobs (e.g., Java programs and shell scripts).
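
To show why a workflow is modeled as a DAG, the sketch below derives a valid execution order for a set of actions from their dependencies using Kahn's topological sort, and rejects workflows that contain a cycle. This is plain Java, not Oozie's API or its workflow XML; the class name and the action names ("import", "clean", and so on) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy workflow: actions plus "must run after" dependencies form a DAG.
// Not the Oozie API; all names are hypothetical.
class WorkflowDagSketch {
    // action -> actions that must finish before it may start
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    // Register an action; dependsOn is assumed to contain no duplicates.
    void addAction(String name, String... dependsOn) {
        for (String dep : dependsOn) {
            dependencies.putIfAbsent(dep, new ArrayList<>());
        }
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    // Kahn's algorithm: repeatedly run any action whose dependencies are done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        dependencies.forEach((action, deps) -> {
            remaining.put(action, deps.size());
            for (String dep : deps) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(action);
            }
        });

        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((action, count) -> {
            if (count == 0) {
                ready.add(action);
            }
        });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String action = ready.remove();
            order.add(action);
            for (String next : dependents.getOrDefault(action, List.of())) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != dependencies.size()) {
            throw new IllegalStateException("workflow contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowDagSketch wf = new WorkflowDagSketch();
        wf.addAction("import");
        wf.addAction("clean", "import");
        wf.addAction("aggregate", "clean");
        wf.addAction("report", "clean");
        wf.addAction("export", "aggregate", "report");
        // prints [import, clean, aggregate, report, export]
        System.out.println(wf.executionOrder());
    }
}
```

Because the graph is acyclic, every action eventually becomes ready, so the scheduler can always make progress; a cycle would leave some actions waiting on each other forever.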

12. Mahout

Apache Mahout is a scalable machine learning and data mining library. Mahout currently supports four main use cases:

Recommendation mining: Take users' behavior and from it find items that users might like.

Clustering: Take documents and group them into sets of related documents.

Classification: Learn from existing categorized documents what documents of a given category look like, and assign unlabeled documents to the correct category.

Frequent itemset mining: Take a set of item groups and identify which individual items usually appear together.
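
As an illustration of the last use case, the sketch below counts how often each pair of items appears together across a list of baskets and keeps the pairs that meet a minimum support. It is a brute-force pair count in plain Java, not Mahout's API or its algorithm; the class name, the "+" pair separator, and the sample items are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Brute-force frequent pair counting. Not Mahout's API or algorithm;
// names and the "+" separator are hypothetical.
class FrequentPairsSketch {

    // Returns every item pair that occurs together in at least minSupport baskets.
    static Map<String, Integer> frequentPairs(List<Set<String>> baskets, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> basket : baskets) {
            // Sort the items so each pair has exactly one canonical spelling.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop pairs that do not reach the support threshold.
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "milk", "eggs"),
                Set.of("milk", "eggs"));
        // prints {bread+milk=2, eggs+milk=2}
        System.out.println(frequentPairs(baskets, 2));
    }
}
```

Counting every pair is quadratic in basket size, which is fine for a demonstration but is exactly the cost that scalable libraries are designed to avoid on large data sets.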

13. HCatalog

Apache HCatalog is a table and storage management service for data created using Hadoop. It provides:

A shared schema and data type mechanism.

A table abstraction, so that users need not be concerned with where or how their data is stored.

Interoperability across data processing tools such as Pig, MapReduce, and Hive.