Top 10 Open Source Big Data Technologies

Big data is one of today's hottest technology areas, and it is growing explosively. New projects spring up all over the world every day. Fortunately, because so many of the tools are open source, more and more projects can adopt big data technologies directly. Here are ten of the most popular open source big data technologies.

1. Hadoop - Efficient, reliable, and scalable. It provides YARN, HDFS, and the rest of the infrastructure your data storage projects need, and it runs the major big data services and applications.

2. Spark - Simple to use, with support for all the major big data languages (Scala, Python, Java, R). It has a strong, rapidly growing ecosystem and straightforward support for micro-batching, batch processing, and SQL. Spark is well suited to MapReduce-style algorithms that require iteration, such as those used in data mining and machine learning.

3. NiFi - Apache NiFi is an open source project that the US National Security Agency (NSA) contributed to the Apache Software Foundation. It is designed to automate the flow of data between systems. Built around a workflow-style programming model, NiFi is very easy to use, powerful, reliable, and highly configurable. Its two most important features are its powerful user interface and its good tools for tracing data back through the flow (data provenance). It is the Swiss Army knife of the big data toolbox.

4. Apache Hive 2.1 - Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for extract-transform-load (ETL) and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. With the 2.1 release, performance and functionality have improved across the board, and Hive has become the best solution for SQL on big data.

5. Kafka - Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all of the user activity stream data of a consumer-scale website. It has become the preferred choice for asynchronous, distributed messaging in big data systems. It acts as powerful glue between Spark, NiFi, third-party plugin tools, and Java and Scala applications.

6. Phoenix - Phoenix is the SQL driver for HBase. Many companies are adopting it and scaling up their deployments. HBase is a NoSQL store backed by HDFS, and it integrates well with the other tools. The Phoenix query engine converts a SQL query into one or more HBase scans and orchestrates their execution to produce standard JDBC result sets.

7. Zeppelin - Zeppelin is a web-based notebook for interactive data analysis. It makes it easy to create attractive documents that are data-driven, interactive, and collaborative. It supports multiple languages and back ends, including Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.

8. Sparkling Water - Sparkling Water brings H2O to Spark. H2O fills the gaps in Spark's machine learning and can handle all of your machine learning needs.

9. Apache Beam - Provides a unified way to develop data processing pipelines in Java, with good support for Spark and Flink. It brings many frameworks together behind one model, so developers do not need to learn too many of them.

10. Stanford CoreNLP - Natural language processing has huge room for growth, and Stanford is working hard to improve its framework.
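
To make the publish-subscribe model from the Kafka entry a little more concrete, here is a minimal in-memory sketch in Python. It is only a toy and does not use the real Kafka client API: the `ToyBroker` class, its `publish` and `poll` methods, and the `user-activity` topic name are all hypothetical names made up for illustration. The idea it shows is that producers append messages to a per-topic log, and each consumer group tracks its own read position, so several groups can read the same stream independently.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish-subscribe broker (not the Kafka API)."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only list of messages
        self._offsets = defaultdict(int)  # (group, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for this consumer group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
for event in ("page_view", "click", "search"):
    broker.publish("user-activity", event)

# Two independent consumer groups each read the full stream at their own pace.
analytics = broker.poll("analytics", "user-activity")
audit = broker.poll("audit", "user-activity", max_messages=2)
audit_rest = broker.poll("audit", "user-activity")
```

Because the broker stores the log and each group only stores an offset, a slow consumer never blocks a fast one. A real system adds partitioning, replication, and persistence on top, none of which this sketch attempts.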
