Building a large enterprise data analysis and machine learning platform from zero: the technology stack (c)

Data acquisition and transmission
 

Sqoop: data transfer tool
In actual project development, a lot of business data is stored in relational databases such as MySQL. This data often needs to be brought into the data warehouse for centralized management, so that it can conveniently be used for statistical computation, modeling, mining and similar operations.

Sqoop is a top-level open-source project of the Apache Software Foundation, a data transfer tool for moving data between Hadoop and relational databases (such as MySQL, Oracle and PostgreSQL). It can import data from a relational database into the Hadoop Distributed File System (HDFS), and it can also export data from HDFS back into a relational database.

Flume: log collection tool
In practical projects, some source data is not stored in a database but is written periodically as gz-compressed files to a target directory on disk. If this kind of source file needs to be stored in the distributed file system (HDFS), Flume, Apache's top-level log collection tool, can be used.

Flume is a distributed, highly available and highly reliable system that can collect massive amounts of data from different sources, transfer it, and store it in a data storage system such as the distributed file system (HDFS) or a publish-subscribe messaging system (Kafka).

Kafka: distributed message queue
Apache Kafka (http://kafka.apache.org) is a high-throughput distributed publish-subscribe messaging system designed and developed at LinkedIn. Its internals are distributed by design and it scales out well. Before building Kafka, its creators had used several message-oriented middleware products and found that strictly following the JMS specification gave a very high delivery success rate, but added a lot of extra cost, such as the heavyweight JMS message headers and the maintenance of various index structures, which ultimately made it hard to push system performance any further and made those systems unsuitable for massive data. They therefore did not design Kafka strictly according to the JMS specification, but simplified some of its original definitions, which dramatically improved processing performance while still guaranteeing delivery. Overall, Kafka has the following characteristics (a minimal producer sketch in Scala follows the list).

High-performance storage: a specially designed on-disk data structure keeps message persistence at O(1) time complexity, so performance remains stable even with terabytes of stored messages. In addition, a message can be consumed multiple times, which makes Kafka suitable for applications such as real-time ETL for business intelligence.

Distributed by design: Kafka is built as a distributed system. It uses ZooKeeper to manage multiple brokers, supports load balancing and replication, and is easy to scale out. ZooKeeper is designed for building reliable distributed data structures, and Kafka uses it to manage and coordinate the brokers. When a broker is added to the system, or a broker fails, the ZooKeeper service notifies producers and consumers so that they can coordinate their work with the remaining brokers.

High throughput: because of the greatly improved storage performance and good horizontal scalability, even on very ordinary hardware Kafka can handle hundreds of thousands of messages per second and still provide impressive publish and subscribe throughput.

Stateless brokers: unlike other messaging systems, Kafka brokers are stateless. A broker does not record which messages have been consumed; that state is maintained by the consumers themselves.

Topics (Topic) and partitions (Partition): Kafka supports partitioning messages across a cluster of message servers and consuming them with a cluster of consumer machines. A topic can be regarded as one category of message, and each topic can be divided into multiple partitions. Partitions can be spread over multiple servers, which avoids a single-machine bottleneck; more partitions also means more consumers can be accommodated, which effectively raises the concurrency of consumption. On top of the replication scheme, partitions can be backed up and scheduled across brokers.

Consumer groups (Consumer Group): every consumer belongs to a consumer group. A message published to a topic is delivered to one consumer in each subscribing group, so consumers within the same group share the load, while different groups each receive the complete stream.
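
To make the publish-subscribe model concrete, here is a minimal Scala sketch of a Kafka producer using the standard kafka-clients API. The broker address, topic name and record contents are placeholder assumptions for illustration; a real deployment would read them from configuration.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducer {
  def main(args: Array[String]): Unit = {
    // Broker address and topic name are placeholders for this sketch
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The record key decides which partition of the topic the record lands in
    (1 to 5).foreach { i =>
      producer.send(new ProducerRecord[String, String]("access-log", s"user-$i", s"page view $i"))
    }
    producer.flush()
    producer.close()
  }
}
```

A consumer subscribing to the same topic under a group id would receive these records; running several consumer instances with the same group id spreads the topic's partitions across them, which is exactly the load-sharing behavior of consumer groups described above.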

Data storage
HBase: distributed NoSQL database
With large-scale data sets, once factors such as high availability, high throughput, semi-structured data and efficient query performance are taken into account, a conventional database can hardly meet the demand. Where there is a demand, a solution is naturally born, and HBase fills this gap nicely.

HBase is a distributed, column-oriented, open-source non-relational database (NoSQL) with capabilities similar to Google's BigTable. Unlike a typical relational database, HBase is suited to storing unstructured data. Tip: BigTable is a distributed data storage system designed by Google, a non-relational database (NoSQL) built to handle massive amounts of data. HBase offers high availability, high performance, column-oriented storage and good scalability; using these features, a large-scale storage cluster can be built on inexpensive servers.

Typical scenarios (a minimal read/write sketch in Scala follows this list):

The data volume is large and random access with fast response is required.
Dynamic scale-out needs to be supported.
Relational database features (such as transactions and cross-table joins) are not required.
High throughput is required when writing data.
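
As a rough illustration of the row-key-oriented access pattern described above, the sketch below writes and reads a single cell with the standard HBase client API from Scala. The table name ("user_profile"), column family ("info") and row key are assumptions made for the example; the table would have to be created beforehand, for instance from the HBase shell.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo {
  def main(args: Array[String]): Unit = {
    // Assumes a table "user_profile" with column family "info" already exists
    val conf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("user_profile"))

    // Write one cell: rowkey -> column family:qualifier -> value
    val put = new Put(Bytes.toBytes("user-1001"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"))
    table.put(put)

    // Random read by rowkey, the access pattern HBase is optimized for
    val result = table.get(new Get(Bytes.toBytes("user-1001")))
    val city = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city")))
    println(s"city = $city")

    table.close(); conn.close()
  }
}
```
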
HDFS: distributed file system
HDFS (Hadoop Distributed File System) is an open-source implementation of the Google File System (GFS) and a core Hadoop subproject. It is the foundation of data storage management in distributed computing, developed around streaming data access and the need to handle very large files, and it can run on inexpensive commodity servers. It offers high fault tolerance, high reliability, scalability, high availability, high throughput and other characteristics, provides fault-tolerant storage for massive data, and brings great convenience to applications that process large data sets (Large Data Set).
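
A minimal sketch of the write-once/read-many access pattern through the Hadoop FileSystem API, in Scala. The NameNode address and file path are placeholders; in a real cluster they would come from core-site.xml rather than being hard-coded.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsDemo {
  def main(args: Array[String]): Unit = {
    // NameNode address and paths are placeholders for this sketch
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())

    // Write a small file, then read it back
    val path = new Path("/tmp/hello.txt")
    val out = fs.create(path, true) // overwrite if it already exists
    out.write("hello hdfs".getBytes("UTF-8"))
    out.close()

    val in = fs.open(path)
    val content = scala.io.Source.fromInputStream(in).mkString
    in.close()
    println(content)

    fs.close()
  }
}
```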

Big data processing
Hadoop
Hadoop is a distributed system infrastructure developed under the Apache Foundation. It lets users develop distributed applications without knowing the details of the underlying distributed system, making full use of the cluster's power for high-speed computation and storage. From this definition we can see that it solves two problems: the storage of big data and the computation over big data. Hadoop has two cores: HDFS and MapReduce.
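
To show what the MapReduce half of that pair looks like in practice, here is a compact word-count job written in Scala against the Hadoop MapReduce Java API. It is a sketch for illustration: input and output paths are taken from the command line, and the job is assumed to be packaged into a jar and submitted with hadoop jar.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: split each line into words and emit (word, 1)
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w); ctx.write(word, one)
    }
  }
}

// Reducer: sum the counts emitted for each word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var total = 0
    val it = values.iterator()
    while (it.hasNext) total += it.next().get()
    ctx.write(key, new IntWritable(total))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```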

Today's Hadoop architecture already lets users conveniently develop and run large-scale data processing applications on a distributed storage platform. Its main advantages are as follows.

Transparency: users can develop distributed applications on Hadoop without knowing the distributed details of the lower layers, and make full use of the cluster's capacity for high-speed computation and storage.

High scalability: scaling divides into scaling up and scaling out. Scaling up adds resources to a single machine and always hits a bottleneck eventually; scaling out adds machines to the cluster and obtains a near-linear performance increase without reaching a bottleneck so easily. Hadoop clusters take the scale-out approach to node resources, so they can be expanded conveniently and deliver significant performance gains.

Efficiency: because multiple resources process data in parallel, Hadoop is no longer limited to single-machine operation (in particular to slow disk I/O), and large-scale tasks can be completed quickly. Combined with its scalability, performance improves further as more hardware resources are added.

High fault tolerance and high reliability: data in Hadoop has multiple backups; if data is lost or damaged, it can be restored automatically from the other replicas (Replication). Likewise, a failed computing task can be reassigned to a new resource node and retried automatically.

Low cost: precisely because Hadoop has good scalability and fault tolerance, there is no need to purchase expensive high-end servers. Inexpensive hardware, even personal computers, can serve as resource nodes. HDFS (Hadoop Distributed File System) is a scalable, fault-tolerant, high-performance distributed file system with asynchronous replication and a write-once-read-many model, and it is responsible for storage.

Spark
Spark is an open-source, general-purpose engine for large-scale distributed data processing from the AMP Lab at the University of California, Berkeley, featuring high throughput, low latency, general-purpose computation, scalability and high fault tolerance. Spark ships with a rich set of built-in libraries, integrating the data analysis engine Spark SQL, the graph computation framework GraphX, the machine learning library MLlib and the stream processing engine Spark Streaming. Spark is implemented in the functional programming language Scala and provides a rich developer API, supporting Scala, Java, Python, R and other languages. It also offers several deployment modes: it can run standalone, or it can rely on resource managers such as Hadoop YARN and Apache Mesos to schedule its tasks. Spark is now widely used in fields such as finance, transportation, healthcare and meteorology.
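
For comparison with the MapReduce example above, the same word count expressed with Spark's RDD API in Scala. The HDFS input path is a placeholder; the program is assumed to be submitted with spark-submit, which supplies the master URL.

```scala
import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Input path is a placeholder; the master URL is supplied by spark-submit
    val spark = SparkSession.builder.appName("word-count").getOrCreate()
    val lines = spark.sparkContext.textFile("hdfs://namenode:8020/logs/access.log")

    // Split lines into words, pair each word with 1, and sum per word
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

The whole pipeline is a handful of transformations on an in-memory RDD, which is part of why Spark jobs typically have much lower latency than the equivalent chain of MapReduce jobs.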

Data analysis tools
Apache Hive
Apache Hive is a data warehouse built on Hadoop. It provides a set of tools for querying and analyzing data, exposing a SQL interface for operating on data stored in the Hadoop Distributed File System (HDFS).

Hive can map structured data files to database tables and provides convenient SQL query capabilities; developers implement business functions with SQL statements that Hive translates into MapReduce tasks to run. The learning cost of Hive is low: simple MapReduce statistics can be produced quickly with SQL-like statements, so developers do not have to write dedicated MapReduce applications, which makes Hive very suitable for statistical work in a data warehouse. Hive defines a SQL-like query language called HQL (Hive SQL). It lets users implement queries, statistics and data migration between tables by writing SQL statements, and it also lets developers who are familiar with MapReduce plug in custom Mappers and Reducers to handle complex requirements. The Hive data warehouse is built on top of the Hadoop distributed file system (HDFS), and underneath, Hive executes the tasks users submit with the MapReduce computing framework. Because of the design of the underlying MapReduce framework, the Hive data warehouse has relatively high latency and incurs significant overhead when submitting and scheduling jobs (Job/Scheduler), so Hive is more suitable for offline data processing, such as online analytical processing (OLAP).
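
One common way to run HiveQL from application code is through the HiveServer2 JDBC interface; the Scala sketch below assumes a HiveServer2 instance at hive-server:10000 and a table named orders_log, both of which are placeholders for illustration. Hive compiles the aggregate query into MapReduce jobs behind the scenes, which is why such queries carry noticeable latency.

```scala
import java.sql.DriverManager

object HiveQuery {
  def main(args: Array[String]): Unit = {
    // HiveServer2 JDBC endpoint; host, database, table and credentials are placeholders
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default", "hive", "")
    val stmt = conn.createStatement()

    // A HiveQL aggregate that Hive turns into one or more MapReduce jobs
    val rs = stmt.executeQuery(
      "SELECT city, COUNT(*) AS orders FROM orders_log GROUP BY city ORDER BY orders DESC LIMIT 10")
    while (rs.next()) println(s"${rs.getString("city")}\t${rs.getLong("orders")}")

    rs.close(); stmt.close(); conn.close()
  }
}
```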

Pig, Impala and Spark SQL
Besides Hive, there are other options that make the data stored in Hadoop easier to work with; here we briefly introduce Pig, Impala and Spark SQL. From the introduction above it is easy to see that using Hive requires a fairly deep understanding of a SQL-like language. However, some developers do not know SQL well but are good at MapReduce programming; is there a tool that raises their productivity? Pig (http://pig.apache.org), an Apache open-source project, came into being in this context. Data processing that would otherwise require multiple MapReduce passes, and data conversions that would otherwise be awkward, can both be expressed with it. Pig provides a higher level of abstraction over large data sets as well as richer data structures. At the abstraction level it offers a scripting language called Pig Latin; its compiler turns a data analysis request into a series of optimized MapReduce operations, so it can be regarded as a simplified, procedural flavor of SQL in which each statement is an operation on a relation, something similar to a table in a database. Pig also has a rich set of data types, supporting not only simple types but also advanced concepts such as bags, maps and tuples, and its comparison operators are fairly complete, including rich pattern matching with regular expressions. With Pig, therefore, users do not necessarily need to know SQL syntax and semantics to control MapReduce jobs, and both the development of MapReduce jobs and the conversion between different data formats are simplified.

Another engine for interactive SQL queries over existing Hadoop infrastructure is Impala, a query system whose development is led by Cloudera. Like Apache Hive, Impala can query PB-scale data stored in HDFS and HBase through a SQL-like language. However, Impala gives more consideration to real-time requirements in its design, which sets it apart from Hive. Hive works by converting SQL queries into MapReduce tasks, so it is still essentially a batch process and has difficulty supporting interactive queries; by contrast, speed is Impala's main selling point. To achieve it, Impala draws on Google's interactive data analysis system Dremel: it uses Parquet for columnar storage and borrows ideas from parallel MPP databases. At the same time it uses HiveQL and a JDBC interface, and stores and reads metadata in a globally unified way. User queries are processed directly and in a distributed fashion on the local HDFS or HBase nodes, which gives good scalability and fault tolerance. Furthermore, because it abandons the MapReduce execution framework, it avoids the overhead of starting MapReduce jobs, shuffling and sorting, and it does not write intermediate results to disk, which saves a great deal of I/O and also reduces the amount of data transferred over the network. Of course, Impala is not intended to replace the existing MapReduce framework, but rather to be a strong complement to it. In general, Impala is better suited to queries with relatively small result sets, while for batch jobs over large amounts of data, MapReduce is still the better choice. There is reason to believe that in the near future, with its speed advantage, Impala may occupy a place in the big data processing field.

Machine learning
Mahout
The main goal of Apache Mahout is to build scalable machine learning algorithms, where scalability is aimed at large-scale data sets. Mahout's algorithms run on the Apache Hadoop platform and are implemented in the MapReduce model. However, Mahout does not strictly require the algorithms to be implemented on Hadoop: they can also be used on a single node or on non-Hadoop platforms, and the non-distributed algorithms in the Mahout core library perform well too.

Apache Mahout is an open-source project of the Apache Software Foundation (ASF) that provides a number of classic machine learning algorithms and is designed to help developers create intelligent applications more quickly and easily. The project has been under development for three years and has had three public releases. Apache Mahout covers clustering, classification, recommendation engines and frequent itemset mining.

Spark MLlib
MLlib is Spark's machine learning (ML) library. It is designed to simplify the engineering practice of machine learning and to make it easy to scale to larger data sets. MLlib consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as lower-level optimization primitives and a higher-level pipeline API.
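
As a small taste of the pipeline-style API, the Scala sketch below clusters a handful of toy points with MLlib's k-means. The column names and data are made up for illustration; a real job would assemble features from data in HDFS or Hive.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()
    import spark.implicits._

    // Toy two-dimensional points; a real job would load features from HDFS or Hive
    val points = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)).toDF("x", "y")
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y")).setOutputCol("features")
      .transform(points)

    // Train a k-means model with two clusters and print the cluster centers
    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    model.clusterCenters.foreach(println)
    spark.stop()
  }
}
```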

 

Other tools     
CDH: a one-stop big data platform package
Cloudera's distribution (Cloudera's Distribution Including Apache Hadoop, or "CDH" for short) provides a Web-based user interface and supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper and Sqoop, which greatly simplifies the installation and use of a big data platform.

Hue: Hadoop visualization
Hue is an open-source Apache Hadoop UI system. It evolved from Cloudera Desktop and was contributed to the open-source community by Cloudera; it is implemented with the Python Web framework Django. Through Hue we can use a Web console in the browser to interact with a Hadoop cluster to analyze and process data, for example operating on data in HDFS or running MapReduce jobs. The features supported by Hue include:

By default a lightweight SQLite database manages session data, user authentication and authorization; MySQL, PostgreSQL or Oracle can be configured instead
A file browser (File Browser) for accessing HDFS
A Hive editor for developing and running Hive queries
Search applications based on Solr, with visualized data views and dashboards
Interactive query applications based on Impala
A Spark editor and dashboard
A Pig editor, with support for submitting script tasks
An Oozie editor; Workflows, Coordinators and Bundles can be submitted and monitored through the dashboard
An HBase browser for visualizing data, querying data and modifying HBase tables
A Metastore browser for accessing Hive metadata and HCatalog
A Job browser for accessing MapReduce jobs (MR1/MR2-YARN)
A Job Designer for creating MapReduce/Streaming/Java jobs
A Sqoop 2 editor and dashboard
A ZooKeeper browser and editor
Query editors for MySQL, PostgreSQL, SQLite and Oracle databases
----------------
Disclaimer: this article is an original post by the CSDN blogger "have the ideal iter", licensed under the CC 4.0 BY-SA copyright agreement. For reprints, please attach the original source link and this statement.
Original link: https://blog.csdn.net/huangmingleiluo/article/details/100523815
