Big Data Core Technology

Original article: http://bigdata.idcquan.com/dsjjs/159544.shtml

Big data is a large and complex body of technology. Its foundations span data collection, data preprocessing, distributed storage, NoSQL databases, data warehouses, machine learning, parallel computing, visualization, and other technical areas. A common framework for big data processing can be divided into the following parts: data acquisition and preprocessing, data storage, data cleaning, data query and analysis, and data visualization.


1. Data acquisition and preprocessing

Data comes from many sources, including mobile Internet applications, social networks, and more. These massive amounts of structured and unstructured data are scattered across so-called data islands and have little meaning on their own. Data acquisition means writing this data into a data warehouse, pulling the scattered pieces together so they can be analyzed as a whole. It includes collecting log files, collecting database logs, and reading from relational databases and application access records. When the data volume is small, a scheduled script that writes logs into the storage system may be enough, but as the volume grows, such methods cannot guarantee data safety and become hard to operate and maintain, so a more robust solution is needed.

Flume NG is a real-time log collection system that supports custom data senders for all kinds of systems. It collects data, does simple processing on it, and writes it to a variety of receivers (such as text files, HDFS, HBase, and so on). Flume NG uses a three-tier architecture: the Agent layer, the Collector layer, and the Store layer, and each layer can be scaled out horizontally. An Agent contains a Source, a Channel, and a Sink: the Source consumes (collects) data from the data source and hands it to the Channel; the Channel is intermediate temporary storage that keeps everything the Source delivers; and the Sink reads data from the Channel and deletes it from the Channel once it has been read successfully.
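
As a minimal sketch of how an application might hand log events to such an agent, the Java snippet below uses Flume's client SDK; the host, port, and log line are placeholder assumptions, and the receiving agent is assumed to expose an Avro source on that port.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws EventDeliveryException {
        // Assumes a Flume agent with an Avro source listening on this host/port (placeholders).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            // Build an event from a raw log line and append it to the agent's source.
            Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```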

NDC, short for NetEase Data Canal, is NetEase's platform solution for live migration, synchronization, and subscription of structured database data. It consolidates a variety of tools and NetEase's past experience in the field of data transmission, stringing single-instance databases, distributed databases, OLAP systems, and downstream applications together into one data link. Besides guaranteeing efficient data transmission, NDC's design follows a unitized and platform-oriented design philosophy.

Logstash is an open-source server-side data processing pipeline that can ingest data from multiple sources at the same time, transform it, and then send it to your favorite "stash"; a commonly used stash is Elasticsearch. Logstash supports a wide range of input plugins that can capture events from many sources simultaneously and in a continuous streaming fashion, easily pulling in data from your logs, metrics, web applications, data stores, and various AWS services.

Sqoop is a tool for transferring data between relational databases and Hadoop. It can import data from a relational database (such as MySQL or Oracle) into Hadoop (HDFS, Hive, HBase), and it can also export data from Hadoop (HDFS, Hive, HBase) into a relational database (such as MySQL or Oracle). Sqoop runs the transfer as a MapReduce job, which is highly fault tolerant and computes in a distributed, parallel fashion. Another advantage of Sqoop is that the process of transferring large amounts of structured or semi-structured data is fully automated.

Stream computing is a hot topic in the industry. It cleans, aggregates, and analyzes multiple high-throughput data sources in real time, so the fast-moving streams of information found on social networking sites, news feeds, and similar sources can be processed and acted on quickly. There are many big data stream analysis tools, such as the open-source Storm and Spark Streaming.

A Storm cluster consists of one master node (Nimbus) and multiple worker nodes (Supervisors); the master can be designated by static configuration or elected dynamically from the background daemons. Nimbus and the Supervisors are both provided by the Storm runtime, and coordination between them, state change notification, and monitoring are handled through ZooKeeper. Nimbus's main responsibility is to manage, coordinate, and monitor the topologies running on the cluster, including publishing topologies, assigning tasks, and reassigning tasks when failure events occur. A Supervisor waits for the tasks Nimbus assigns and spawns and monitors worker (JVM) processes to execute them. Supervisors and workers run in different JVMs; if a worker exits abnormally because of an error (or is killed), its Supervisor will try to start a new worker process in its place.

When the data computed by upstream modules needs to be aggregated, counted, and analyzed, a messaging system can be used, especially a distributed one. Kafka, written in Scala, is a distributed publish/subscribe messaging system. One of Kafka's design goals is to serve offline processing and real-time processing at the same time, and to replicate data to another data center in real time. Kafka lets many producers and consumers share multiple topics, and messages are organized by topic: programs that publish messages to Kafka are called producers, and programs that subscribe to topics and consume the messages are called consumers. Kafka runs as a cluster made up of one or more services, each of which is called a broker; at runtime, producers send messages over the network to the Kafka cluster, and the cluster serves them to consumers. Kafka uses ZooKeeper to manage cluster configuration, elect leaders, and rebalance when a consumer group changes. Producers publish messages to brokers in push mode, while consumers subscribe to and consume messages from brokers in pull mode. Flume and Kafka work well together: if streaming data needs to be transferred from Kafka into Hadoop, a Flume agent can be used with Kafka as its Source, so that data is read from Kafka into Hadoop.
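
As a minimal sketch of this producer/consumer model using the Kafka Java client (the broker address, topic name, and consumer group below are placeholder assumptions):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer: push one message to the broker for the topic "logs".
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker-host:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("logs", "key1", "a sample event"));
        }

        // Consumer: subscribe to the topic and pull messages from the broker.
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker-host:9092");
        c.put("group.id", "demo-group");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("logs"));
            // A real consumer would poll in a loop; one poll is enough for a sketch.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}
```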

ZooKeeper is a distributed, open-source coordination service for distributed applications that provides data synchronization. Its main roles are configuration management, naming services, distributed locks, and cluster management. Configuration management means a configuration can be changed in one place and everyone interested in it receives the change, which eliminates tedious manual copying of configuration and keeps the data reliable and consistent. ZooKeeper also lets services be located by name or address, and it can watch for changes in the cluster's machines to implement a heartbeat-like mechanism.
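
A minimal sketch of the configuration-management and watch pattern with the ZooKeeper Java client is shown below; the connection string, znode path, and payload are placeholder assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble and log every state/watch event.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000,
                (WatchedEvent event) -> System.out.println("event: " + event));

        // Publish a configuration value under a well-known znode.
        byte[] config = "max.connections=100".getBytes(StandardCharsets.UTF_8);
        zk.create("/app-config", config, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the value back and set a watch so later changes trigger the watcher above.
        byte[] current = zk.getData("/app-config", true, null);
        System.out.println(new String(current, StandardCharsets.UTF_8));

        zk.close();
    }
}
```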

2. Data storage

Hadoop is an open-source framework designed for offline and large-scale data analysis, and HDFS, its core storage engine, has become widely used for data storage.

HBase is a distributed, open-source, column-oriented database. It can be seen as a wrapper around HDFS and is essentially a data store, a NoSQL database. HBase is a key/value system deployed on top of HDFS that overcomes HDFS's weakness in random access. Like Hadoop, HBase targets scale primarily by adding inexpensive commodity servers to increase computing and storage capacity.
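
As a minimal sketch of this key/value access pattern using the HBase Java client (the ZooKeeper quorum, table name, column family, and row key are placeholder assumptions, and the table is assumed to already exist):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper, which HBase uses to locate its region servers.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_events"))) {

            // Random write: put one cell under row key "user-42".
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_login"), Bytes.toBytes("2019-09-16"));
            table.put(put);

            // Random read: fetch the row back by key.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_login"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```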

Phoenix is the equivalent of a Java middleware layer: it helps development engineers access the NoSQL database HBase in the same way they would access a relational database through JDBC.
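
A minimal sketch of that JDBC-style access through Phoenix might look as follows; the ZooKeeper quorum, table, and columns are placeholder assumptions, and the Phoenix JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the HBase cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS user_events ("
                        + "id BIGINT PRIMARY KEY, last_login VARCHAR)");
            }
            // Phoenix uses UPSERT instead of INSERT; commit flushes it to HBase.
            try (PreparedStatement ps =
                         conn.prepareStatement("UPSERT INTO user_events VALUES (?, ?)")) {
                ps.setLong(1, 42L);
                ps.setString(2, "2019-09-16");
                ps.executeUpdate();
            }
            conn.commit();

            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, last_login FROM user_events")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " " + rs.getString(2));
                }
            }
        }
    }
}
```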

YARN is Hadoop's resource manager. It provides unified resource management and scheduling for applications, and its introduction brought great benefits to the cluster in terms of resource utilization, unified management, and data sharing. YARN consists of the following major components: a global ResourceManager, a NodeManager agent on each node, an ApplicationMaster for each application, and multiple Containers running on the NodeManagers.

Mesos is open-source cluster management software that supports application frameworks such as Hadoop, Elasticsearch, Spark, Storm, and Kafka.

Redis is a very fast non-relational database that stores mappings between keys and values of five different types. The keys and values are kept in memory and can be persisted to disk; replication can be used to scale read performance, and client-side sharding can be used to scale write performance.
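
As a minimal sketch of working with several of those value types through the Jedis client (the host and key names are placeholder assumptions):

```java
import java.util.List;

import redis.clients.jedis.Jedis;

public class RedisSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            // String value.
            jedis.set("page:home:title", "Welcome");

            // Hash: a map of fields under one key.
            jedis.hset("user:42", "name", "alice");
            jedis.hset("user:42", "logins", "7");

            // List: push events and read them back in order.
            jedis.rpush("events", "login", "click", "logout");
            List<String> events = jedis.lrange("events", 0, -1);

            // Set and sorted set round out the five core types.
            jedis.sadd("tags", "bigdata", "cache");
            jedis.zadd("leaderboard", 99.5, "alice");

            System.out.println(jedis.get("page:home:title") + " " + events);
        }
    }
}
```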

Atlas is a middleware layer that sits between applications and MySQL. From the point of view of the backend DB, Atlas acts as the client connecting to it; from the point of view of the frontend application, Atlas acts as a DB. Atlas communicates with applications as a server that implements the MySQL client/server protocol, and communicates with MySQL as a client. It hides the details of the DB from applications and maintains a connection pool to reduce the load on MySQL. After starting, Atlas creates multiple threads, one of which is the main thread and the rest worker threads. The main thread is responsible for listening for all client connection requests, while the worker threads listen only for command requests handed over by the main thread.

Kudu is a storage engine built around the Hadoop ecosystem and shares its common design philosophy: it runs on ordinary servers, can be deployed distributed at scale, and meets the high-availability requirements of industry use. Its design concept is fast analytics on fast data. As an open-source storage engine it provides low-latency random access and efficient data analysis at the same time. Kudu not only provides row-level insert, update, and delete APIs, it also provides batch scan operations with performance close to Parquet. The same store can serve random reads and writes while also meeting the needs of data analysis. Kudu has a very broad range of application scenarios, such as real-time data analysis and applications whose time series data may later change.

During data storage, when the tables involved have hundreds of columns and the queries are complex, a columnar storage format such as Parquet or ORC is recommended, along with data compression. Parquet supports flexible compression options and significantly reduces the storage used on disk.

3. Data cleaning

MapReduce, Hadoop's query engine for parallel computation over large data sets, takes "Map" and "Reduce" as its main idea. It greatly eases the life of programmers, who can run their own programs on a distributed system without having to do distributed and parallel programming themselves.
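
As a minimal sketch of the Map and Reduce idea, here is the classic word-count job written against the Hadoop Java API; the input and output paths are taken from the command line and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```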

As the volume of business data grows, the training and cleaning of data becomes more complex, and a task scheduling system such as Oozie or Azkaban is then needed to schedule and monitor the critical jobs.

Oozie is a workflow scheduling engine for the Hadoop platform. It provides a RESTful API to accept user requests such as workflow submission; once a workflow is submitted, the workflow engine is responsible for executing it and driving its state transitions. Users deploy their jobs (for example MR jobs) on HDFS and then submit the workflow to Oozie, and Oozie submits the jobs to Hadoop asynchronously. This is why calling Oozie's RESTful interface to submit a job returns a JobId immediately: the user program does not have to wait for the job to finish (some big jobs may run for a long time, hours or even days). In the background, Oozie asynchronously submits the Actions of the workflow to Hadoop for execution.

Azkaban is also a workflow control engine that can be used to resolve the dependencies among multiple offline Hadoop or Spark computing tasks. Azkaban consists of three main parts: a relational database, the Azkaban Web Server, and the Azkaban Executor Server. Most of Azkaban's state is stored in MySQL; the Azkaban Web Server provides the web UI and is Azkaban's main manager, covering project management, authentication, scheduling, and monitoring of workflow execution; the Azkaban Executor Server schedules workflows and tasks and records their logs.

Sloth, NetEase's first self-developed stream computing platform, is designed to meet the growing stream computing needs of products across the company. As a computing service platform it is easy to use, real-time, and reliable; it saves users the investment in development, operation, and maintenance and lets them focus on solving the stream computing needs of the product itself.

4. Data query and analysis

Hive's core job is to translate SQL statements into MapReduce programs; it can map structured data onto database tables and provides HQL (Hive SQL) queries. Hive itself neither stores nor computes data; it depends entirely on HDFS and MapReduce. Hive can be understood as a client tool that converts SQL operations into the corresponding MapReduce jobs and then runs them on Hadoop. Hive supports standard SQL syntax, which frees users from writing MapReduce programs; its appearance lets users who are fluent in SQL but unfamiliar with MapReduce, or weak in Java programming, easily query, aggregate, and analyze large data sets on HDFS with a SQL-like language.
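
As a minimal sketch of querying Hive from a client program through HiveServer2's JDBC interface (the host, credentials, and table are placeholder assumptions, and the Hive JDBC driver is assumed to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 usually listens on port 10000; "default" is the database name.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL that Hive compiles into one or more MapReduce jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT city, COUNT(*) AS visits FROM access_log GROUP BY city")) {
                while (rs.next()) {
                    System.out.println(rs.getString("city") + " " + rs.getLong("visits"));
                }
            }
        }
    }
}
```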

Hive was born for processing massive amounts of raw data; it appeared to remove the bottleneck that traditional relational databases (MySQL, Oracle) hit with big data. Hive turns SQL into an execution plan of the form map -> shuffle -> reduce -> map -> shuffle -> reduce ... If a query is compiled into several rounds of MapReduce, more intermediate results have to be written out, and because of how the MapReduce framework itself executes, these extra intermediate steps increase the overall execution time of the query. When using Hive, a user only needs to create tables, import data, and write SQL analysis statements; the rest of the process is completed automatically by the Hive framework.

Impala complements Hive by enabling efficient SQL queries. Impala implements SQL on Hadoop and is used for real-time analysis of big data: it lets you manipulate big data in the SQL style familiar from traditional relational databases, while the data still lives in HDFS and HBase. Impala no longer uses the slow Hive + MapReduce batch model; instead, it uses a distributed query engine similar to those found in commercial parallel relational databases (made up of a Query Planner, a Query Coordinator, and a Query Exec Engine) and runs SELECT, JOIN, and aggregate queries directly against HDFS or HBase, which greatly reduces latency. Impala turns the entire query into an execution plan tree rather than a series of MapReduce tasks, so compared with Hive there is no MapReduce startup time.

Hive suits long-running batch queries and analysis, whereas Impala suits real-time interactive SQL queries. Impala gives data analysts a big data tool for quickly experimenting with and validating ideas: Hive can be used for data transformation, and Impala can then run fast analysis on the data sets Hive has prepared. In summary: Impala expresses the execution plan as a complete plan tree, which can be distributed naturally to each Impalad for execution, instead of combining it into a pipelined map -> reduce model as Hive does; as a result Impala achieves better concurrency and avoids unnecessary intermediate sort and shuffle steps. However, Impala does not support UDFs, which limits the range of problems it can handle.

Spark shares the strengths of Hadoop MapReduce, but it keeps intermediate job output in memory, so it no longer has to read and write HDFS between stages. Spark provides in-memory distributed data sets; besides interactive queries, it can also optimize iterative workloads. Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which makes manipulating distributed data sets as easy as manipulating local collection objects.
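
As a minimal sketch of that style of programming, here is word count written with Spark's Java API; the HDFS paths are placeholder assumptions, and the master is assumed to be supplied by spark-submit.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCountSketch {
    public static void main(String[] args) {
        // The master URL is expected to come from spark-submit.
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Intermediate RDDs stay in memory; HDFS is only touched at the two ends.
            JavaRDD<String> lines = sc.textFile("hdfs:///input/access.log");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///output/word-counts");
        }
    }
}
```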

Nutch is an open-source search engine implemented in Java. It provides all the tools we need to run our own search engine, including full-text search and a web crawler.

Solr is a standalone enterprise-grade search application written in Java: a full-text search server that runs inside a servlet container (such as Apache Tomcat or Jetty). It offers a web-service-like API: users can submit XML files in a defined format to the search engine server over HTTP to generate the index, and they can also send lookup requests via HTTP GET and receive the results back in XML.
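
A minimal sketch of indexing and querying through the SolrJ client, rather than hand-built HTTP requests, might look like this; the Solr URL, core name, and fields are placeholder assumptions.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrSketch {
    public static void main(String[] args) throws Exception {
        // Points at a core named "articles" on the Solr server (placeholder).
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://solr-host:8983/solr/articles").build()) {

            // Index one document and commit so it becomes searchable.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Big data core technologies");
            solr.add(doc);
            solr.commit();

            // Full-text query against the title field.
            QueryResponse response = solr.query(new SolrQuery("title:data"));
            for (SolrDocument d : response.getResults()) {
                System.out.println(d.getFieldValue("id") + " " + d.getFieldValue("title"));
            }
        }
    }
}
```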

Elasticsearch is an open-source full-text search engine: a Lucene-based search server that can store, search, and analyze huge volumes of data quickly. Designed for the cloud, it achieves real-time search and is stable, reliable, fast, and easy to install and use.
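
As a minimal sketch, assuming the Elasticsearch 7.x Java high-level REST client, indexing a document and running a full-text search might look like this; the host, index name, and fields are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ElasticsearchSketch {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-host", 9200, "http")));
        try {
            // Index one document into the "logs" index.
            IndexRequest index = new IndexRequest("logs")
                    .id("1")
                    .source("level", "ERROR", "message", "disk full on node-3");
            client.index(index, RequestOptions.DEFAULT);

            // Full-text search on the message field.
            SearchRequest search = new SearchRequest("logs");
            search.source(new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("message", "disk")));
            SearchResponse response = client.search(search, RequestOptions.DEFAULT);
            for (SearchHit hit : response.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        } finally {
            client.close();
        }
    }
}
```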

A number of machine learning frameworks are also involved. For example, Mahout's main goal is to create scalable machine learning algorithms that developers can use for free under the Apache license; there are also the deep learning framework Caffe and TensorFlow, an open-source software library for numerical computation using data flow graphs. Commonly used machine learning algorithms include Bayesian methods, logistic regression, decision trees, neural networks, collaborative filtering, and so on.

5. Data visualization

The results of analysis are often fed into a BI platform and visualized to guide decision making. Mainstream BI platforms include foreign agile BI tools such as Tableau, QlikView, and Power BI, as well as domestic products such as SmallBI and several emerging NetEase offerings.

At every one of the stages above, protecting the security of the data cannot be ignored.

Kerberos, a network authentication protocol, is used on insecure networks to authenticate personal identities in a secure manner: it allows an entity communicating in an insecure network environment to prove its identity to another entity in a secure way.

Ranger is a permission framework for controlling access in Hadoop clusters. It provides operation, monitoring, and management of complex data permissions, and offers a centralized mechanism for managing all data access in the YARN-based Hadoop ecosystem. It can apply fine-grained access control to data in Hadoop ecosystem components such as Hive and HBase. Through the Ranger console, administrators can easily configure policies to control users' access to HDFS folders, HDFS files, databases, tables, and fields. These policies can be set for different users and groups, and the permissions integrate seamlessly with Hadoop.

Put simply, there are three core activities: getting the data, computing on the data, and selling the data.

First, without huge amounts of data, "big data" is just empty talk. Now that machine learning algorithms have become something of an all-purpose commodity, the status of the algorithm has declined and the status of the data has risen. To give a popular analogy: as education spread, individual brilliance mattered less and schooling mattered more, because most people who follow the standard curriculum end up knowing more than Newton did. As Google has put it, feeding lots of data to an ordinary algorithm is in many cases better than feeding little data to a brilliant algorithm on fast hardware. And do you know how hard it is to build a brilliant algorithm on fast hardware? Most people cannot even manage the ordinary kind... so knowing how to get the data is very important; you cannot make bricks without straw! That is why so many companies burn money to grab entry points, users, and data sources. But that is what operations and product people worry about; I am a programmer, so I do not care...

The second is computing on the data. If the data were valuable as-is, there would be no need for companies; the government could just monetize it directly. From a falling apple, Newton could work out universal gravitation; I could only pick it up and eat it. That is the gap... So the data is there, and what you can dig out of it depends on your own skill. Computing on data needs a computing platform: how the data is stored (HDFS, S3, HBase, Cassandra) and how it is computed (Hadoop, Spark) is where we programmers come in...

Finally there is selling it to cash in, or doing public-interest work, like the Machine in "Person of Interest" helping its protagonists... Seeing what others cannot see, predicting the future, and steering toward benefit and away from harm is the ultimate goal and the real meaning of "intelligence", isn't it? That is something we all have to figure out together.

Actually, I think this last part is the real "core technology"; Spark, Storm, deep learning and the rest are second tier... though of course, without strong computing power behind it, intelligence cannot get started.


Source: www.cnblogs.com/aabbcc/p/11531594.html