Is Hadoop enough? The nine essential technologies you must master!

Nine big data technologies to know beyond Hadoop:

1.Apache Flink

2.Apache Samza

3.Google Cloud Dataflow

4.StreamSets

5.TensorFlow

6.Apache NiFi

7.Druid

8.LinkedIn WhereHows

9.Microsoft Cognitive Services

Hadoop is the most popular technology in the big data field, but it is not the only one. Many other technologies can be used to solve big data problems. Besides Apache Hadoop, the following nine big data technologies are also worth understanding.

1.Apache Flink

Apache Flink is an efficient, distributed, general-purpose big data analysis engine implemented in Java. It combines the flexibility and scalability of an efficient distributed MapReduce platform with parallel database-style query optimization. It supports both batch and stream-based data analysis and provides Java- and Scala-based APIs.

Like Apache Hadoop and Apache Spark, it is a community-driven open source framework for distributed big data analytics. Its engine improves performance through in-memory, iterative processing of data streams. Apache Flink entered the Apache Incubator in April 2014 and has since become a Top Level Project (TLP), with many contributors worldwide.

Flink was inspired both by MPP database technology (declarative queries, query optimizers, parallel in-memory and out-of-core algorithms) and by Hadoop MapReduce technology (massive scale-out, user-defined functions, schema-on-read), and it adds many unique features of its own (streaming, iterations, dataflow, a general API).
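To make the batch-analysis style concrete, here is a toy word count written in plain Python. It mirrors the flatMap / groupBy-and-sum shape that Flink's batch API expresses; real Flink would distribute these steps across a cluster, and the function names here are just illustrative.

```python
from collections import Counter

def flat_map(lines):
    """Split each line into words (the flatMap step)."""
    for line in lines:
        for word in line.lower().split():
            yield word

def word_count(lines):
    """Group by word and sum the counts (the groupBy + sum steps)."""
    return Counter(flat_map(lines))

counts = word_count(["to be or not to be"])
```

In Flink the same logic would be declared against a DataSet or DataStream, and the engine would parallelize and optimize it.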

2.Apache Samza:

Apache Samza is an open source, distributed stream processing framework. It uses the open source Apache Kafka distributed messaging system for its messaging service, and uses Apache Hadoop YARN for fault-tolerant processing, processor isolation, security, and resource management.

The technology was originally developed at LinkedIn to address scalability problems that existed around Apache Kafka. Its capabilities include a simple API, managed state, fault tolerance, durable messaging, scalability, extensibility, and processor isolation.

Samza code runs as a YARN job. Developers implement the StreamTask interface and define its process() method; StreamTask instances run inside task instances, which themselves live inside YARN containers.
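The StreamTask idea can be sketched in a few lines. Real Samza is Java running on YARN with Kafka; this plain-Python stand-in only illustrates the shape of the interface, and all names besides StreamTask and process() are made up.

```python
class StreamTask:
    """Minimal stand-in for Samza's StreamTask interface."""
    def process(self, message, collector):
        raise NotImplementedError

class UpperCaseTask(StreamTask):
    """Example task: upper-cases each incoming message."""
    def process(self, message, collector):
        collector.append(message.upper())

def run(task, messages):
    """Toy 'container': feeds each message to the task's process() call."""
    out = []
    for msg in messages:
        task.process(msg, out)
    return out

results = run(UpperCaseTask(), ["hello", "samza"])
```

In real Samza, the container and message delivery are handled by the framework; you only write the process() body.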

3.Cloud Dataflow:

Google Cloud Dataflow is a native data processing service for building, managing, and optimizing complex data pipelines, and for building, debugging, tracking, and monitoring production-grade cloud applications. It uses two technologies from inside Google, Flume and MillWheel: Flume provides efficient parallel data processing, while MillWheel provides stream processing with a good level of fault tolerance.

The technology offers a simple programming model that can be used for both batch tasks and streaming data. It provides a managed data-flow service that controls the execution of data processing jobs, and jobs can be created with the Dataflow SDK (Apache Beam).

Google Cloud Dataflow also provides data management, monitoring, and security capabilities. Sources and sinks abstract the read and write operations performed in a pipeline, and the pipeline encapsulates the entire computation sequence: it accepts input data from external sources and produces output data through a series of data transformations.
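The source-transform-sink pattern described above can be sketched as follows. This is not the Apache Beam SDK, just a minimal plain-Python illustration of how a pipeline chains transforms between a source and a sink; the class and method names are invented for the sketch.

```python
class Pipeline:
    """Toy pipeline: chains transforms between a source and a sink."""
    def __init__(self, source):
        self.source = source      # external input data
        self.transforms = []      # ordered data transformations

    def apply(self, fn):
        self.transforms.append(fn)
        return self               # allow chained .apply() calls

    def run(self, sink):
        for element in self.source:
            for fn in self.transforms:
                element = fn(element)
            sink.append(element)  # write result to the sink

sink = []
Pipeline([1, 2, 3]).apply(lambda x: x * 10).apply(lambda x: x + 1).run(sink)
```

In Beam, the equivalent would be PTransforms applied to a PCollection, with the Dataflow service deciding how to parallelize the work.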

4.StreamSets:

StreamSets is a data processing platform specially optimized for moving data. It provides a visual model for creating data flows and is released as open source. The technology can be deployed in the cloud or on-premises, and it provides rich monitoring and management interfaces.

Its Data Collector can be used to process streaming data in real time through pipelines. A pipeline describes how data flows from origins to final destinations and can include origins, destinations, and processors. The Data Collector's life cycle can be controlled through the management console.
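The origin-processors-destination pipeline can be illustrated with a toy record-level sketch. This is plain Python, not the StreamSets SDK; the function names and the convention that a processor returns None to drop a record are assumptions made for the illustration.

```python
def run_pipeline(origin, processors, destination):
    """Toy pipeline: records flow origin -> processors -> destination.
    A processor may return None to drop a record (e.g. a filter stage)."""
    for record in origin:
        for proc in processors:
            record = proc(record)
            if record is None:
                break             # record dropped; skip the destination
        else:
            destination.append(record)

dest = []
run_pipeline(
    origin=[{"id": 1, "ok": True}, {"id": 2, "ok": False}],
    processors=[lambda r: r if r["ok"] else None],
    destination=dest,
)
```

In StreamSets the same flow would be assembled visually, with the Data Collector handling delivery and monitoring.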

5.TensorFlow:

TensorFlow is Google's second-generation machine learning system, succeeding DistBelief. It came out of Google's Google Brain project, with the main goal of applying neural-network machine learning company-wide to Google's many different products and services.

TensorFlow supports distributed computing; its distributed model lets users train on their own machine-learning infrastructure. The system is backed by the high-performance gRPC library and complements Google's recently released Cloud Machine Learning service, which lets users train TensorFlow models on the Google Cloud Platform and serve them in production.

It is an open source software library for numerical computation using data flow graphs, and it has been used in various Google projects including DeepDream, RankBrain, and Smart Reply.

A data flow graph describes numerical computation with a directed graph made of nodes and edges. Nodes represent operations; edges represent the multidimensional data arrays (tensors) communicated between nodes, and they also describe the input/output relationships between those nodes. The name "TensorFlow" literally means tensors flowing through the graph.
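The node-and-edge idea can be shown with a tiny toy graph in plain Python. This is not the TensorFlow API, just a minimal sketch of evaluating a directed graph of operations; the Node class and its eval() method are invented for the example.

```python
class Node:
    """A graph node: an operation plus edges to its input nodes."""
    def __init__(self, op, *inputs):
        self.op = op          # callable computing this node's value
        self.inputs = inputs  # edges: values flow in from these nodes

    def eval(self):
        # Recursively pull values along the incoming edges, then apply the op.
        return self.op(*(n.eval() for n in self.inputs))

# Build the graph for (a + b) * c, then "flow" values through it.
a = Node(lambda: 2.0)
b = Node(lambda: 3.0)
c = Node(lambda: 4.0)
add = Node(lambda x, y: x + y, a, b)
mul = Node(lambda x, y: x * y, add, c)
result = mul.eval()
```

TensorFlow generalizes this: the values on the edges are multidimensional tensors, and the runtime schedules the graph across CPUs, GPUs, and machines.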

6.Druid:

Druid is a highly fault-tolerant, high-performance open source distributed system designed for real-time query and analysis of big data. Born in 2011, it quickly handles large-scale data and delivers fast query and analysis. Its capabilities include interactive data-driven applications, multi-tenancy (large numbers of concurrent users), scalability (trillions of events per day), sub-second queries, and real-time analysis. Druid also includes key special features such as low-latency data ingestion, fast aggregation, arbitrary slicing and dicing, high availability, and both exact and approximate computation.

Druid was originally created to solve a query-latency problem: teams tried to use Hadoop for interactive query and analysis, but found it hard to meet real-time analysis needs. Druid provides the ability to access data interactively, adopting a special storage format that trades off flexibility against query performance.

The technology also provides other useful components, such as real-time nodes, historical nodes, Broker nodes, Coordinator nodes, the Indexing Service, and a JSON-based query language.
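As a sketch of the JSON-based query language, here is what a native timeseries query looks like, built in Python. The dataSource name and interval are made up for the example, and the field names follow the general shape of Druid's native JSON query format (queries are POSTed to a Broker node as JSON).

```python
import json

# Illustrative Druid native timeseries query; "page_events" and the
# interval are invented, but the structure follows Druid's JSON format.
query = {
    "queryType": "timeseries",
    "dataSource": "page_events",
    "granularity": "hour",
    "intervals": ["2019-06-01/2019-06-02"],
    "aggregations": [{"type": "count", "name": "events"}],
}
payload = json.dumps(query)  # this JSON body would be POSTed to a Broker node
```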

7.Apache NiFi:

Apache NiFi is a powerful and reliable data processing and distribution system that creates directed graphs for data transfer and transformation. With this system you can use a graphical interface to create, monitor, and control data flows. Rich configuration options are available: you can modify a data flow at runtime and dynamically create data partitions. It can also track the provenance of data as it flows through the entire system, and it is easily extended by developing custom components.

Working with Apache NiFi revolves around concepts such as FlowFile, Processor, and Connection.
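Those three concepts can be sketched in plain Python. This is not the NiFi API; only the terms FlowFile, Processor, and Connection come from NiFi, and the class and function below are invented to illustrate them.

```python
class FlowFile:
    """A unit of data moving through the flow: content plus attributes."""
    def __init__(self, content, **attributes):
        self.content = content
        self.attributes = attributes

def route_on_size(flowfile, threshold=10):
    """Toy Processor: tags each FlowFile 'large' or 'small'. A Connection
    would then carry the FlowFile onward based on this routing attribute."""
    size = "large" if len(flowfile.content) > threshold else "small"
    flowfile.attributes["size"] = size
    return flowfile

ff = route_on_size(FlowFile("hello"))
```

In NiFi itself, processors and connections are wired together in the graphical interface rather than in code.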

8.LinkedIn WhereHows:

WhereHows provides an enterprise catalog with metadata search, giving you an idea of where data is stored and how it got there. It offers collaboration tools, data lineage analysis, and other functions, and can connect to multiple data sources and extract-transform-load (ETL) tools.

The tool provides a web interface for data discovery, with API support; a back-end server is responsible for controlling metadata crawling and integration with other systems.

9.Microsoft Cognitive Services:

The technology comes from Project Oxford and Bing and offers 22 cognitive computing APIs in major categories including vision, speech, language, knowledge, and search. It has been integrated into the Cortana Intelligence Suite.

It provides 22 different cognitive computing REST APIs, along with open source SDKs for Windows, iOS, Android, and Python for developers.


Origin www.cnblogs.com/dashjunih/p/11002898.html