The most systematic inventory of big data technologies: learn this and you are halfway to being a big data expert

When it comes to big data, many people can talk about it for a while, but ask them what the core technologies of big data actually are and most cannot give a clear answer.

From machine learning to data visualization, big data development already has a fairly mature technology stack. Different technologies sit at different layers of the architecture, and new technical terms appear every year. Faced with such a complex technology landscape, newcomers who are just getting into big data are almost always intimidated.

In fact, understanding the core technologies of big data is simpler than it sounds: it comes down to three things - getting data, computing on data, and putting data to use. If that still feels too vague, look at it from the perspective of the big data life cycle, which boils down to four parts: big data collection, big data preprocessing, big data storage, and big data analysis. Together they form the core technologies of the big data life cycle, and we will go through them one by one below.

I. Big Data Collection

Big data collection refers to acquiring massive amounts of structured and unstructured data from a variety of sources.

  1. Database collection: Sqoop and ETL tools are popular here, and traditional relational databases such as MySQL and Oracle still serve as the data store for many enterprises. Open-source tools such as Kettle and Talend also integrate big data capabilities, enabling data synchronization and integration between HDFS, HBase, and mainstream NoSQL databases.
  2. Network data collection: acquiring unstructured or semi-structured data from web pages, either with a web crawler or through a site's public API, and normalizing it into a unified local data structure (a minimal crawler sketch follows this list).
  3. File collection: including real-time file collection and processing with Flume, log collection with the ELK stack, incremental collection, and so on.
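For the network-collection route above, here is a minimal, hedged sketch in Python using requests and BeautifulSoup; the URL and the CSS selectors are placeholders rather than a real target site.

```python
# Minimal network-collection sketch: fetch a page and pull out
# semi-structured data with requests + BeautifulSoup. The URL and
# selectors below are placeholders, not a real target site.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/articles", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

records = []
for item in soup.select("article"):          # hypothetical page structure
    title = item.select_one("h2")
    link = item.select_one("a")
    records.append({
        "title": title.get_text(strip=True) if title else None,
        "url": link.get("href") if link else None,
    })

# Normalize into a unified local structure (here: a list of dicts)
# before handing the data to downstream storage and processing.
print(records)
```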

[Figure: the data collection life cycle]

II. Big Data Preprocessing

Big data preprocessing refers to operations such as cleaning, filling, smoothing, merging, normalization, and consistency checking performed on the collected raw data before analysis, with the goal of improving data quality and laying the foundation for later analysis work. Data preprocessing includes four parts: data cleaning, data integration, data transformation, and data reduction.

[Figure: big data preprocessing]

  1. Data cleaning: using ETL and other cleaning tools to handle missing data (records missing attributes of interest), noisy data (data containing errors or values that deviate from what is expected), and inconsistent data.
  2. Data integration: merging data from different sources into a unified data store; it focuses on three problems: schema matching, data redundancy, and the detection and resolution of conflicting data values.
  3. Data transformation: processing the inconsistencies found in the extracted data. It also includes data cleaning work, i.e. cleaning abnormal data according to business rules, to ensure the accuracy of subsequent analysis results.
  4. Data reduction: streamlining the data volume as much as possible while preserving the integrity of the original data, to obtain a much smaller data set; it includes data cube aggregation, dimensionality reduction, data compression, numerosity reduction, and concept hierarchy generation (a pandas sketch covering these steps follows the list).
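As an illustration of these four steps, the following is a minimal pandas sketch; the tables, column names, and cleaning rules are invented for the example and are not from the original article.

```python
# Minimal preprocessing sketch with pandas covering the four steps in toy
# form: cleaning (missing/noisy values), integration (merging sources),
# transformation (normalization), and reduction (dropping a redundant column).
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [100.0, None, 99999.0, 120.0],   # None = missing, 99999 = noise
    "customer_id": [10, 11, 10, 12],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["north", "south", "north"],
    "region_code": [1, 2, 1],                  # redundant with "region"
})

# 1. Cleaning: fill missing amounts with the median, cap obvious outliers.
orders["amount"] = orders["amount"].fillna(orders["amount"].median())
orders["amount"] = orders["amount"].clip(upper=1000)

# 2. Integration: merge the two sources into one table.
df = orders.merge(customers, on="customer_id", how="left")

# 3. Transformation: min-max normalize the amount for later analysis.
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (df["amount"].max() - df["amount"].min())

# 4. Reduction: drop the redundant column to shrink the data set.
df = df.drop(columns=["region_code"])
print(df)
```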

III. Big Data Storage

Big data storage refers to persisting the collected data in database form. There are three typical routes:

1. New database clusters based on the MPP architecture

These clusters use a shared-nothing architecture combined with the efficient distributed computing model of MPP, along with big data processing techniques such as columnar storage and coarse-grained indexing, and are aimed primarily at industry big data storage. With low cost, high performance, and high scalability, they are widely used in enterprise analytical applications.

Compared with traditional databases, MPP products have clear advantages in PB-scale data analysis. Naturally, MPP databases have become the preferred choice for a new generation of enterprise data warehouses.
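As a hedged illustration of this route, the sketch below runs an analytical query against a PostgreSQL-compatible MPP warehouse (Greenplum is one such product) using psycopg2; the host, credentials, and table are placeholders.

```python
# Minimal sketch: running an analytical query against a PostgreSQL-compatible
# MPP warehouse (e.g. Greenplum). Connection details and table names are
# placeholders, not from the original article.
import psycopg2

conn = psycopg2.connect(
    host="mpp-master.example.com",  # hypothetical master node
    port=5432,
    dbname="analytics",
    user="analyst",
    password="secret",
)

with conn, conn.cursor() as cur:
    # The MPP engine fans this aggregation out across the segment nodes;
    # columnar storage means only the referenced columns are scanned.
    cur.execute("""
        SELECT region, SUM(amount) AS total_sales
        FROM sales_fact
        WHERE sale_date >= %s
        GROUP BY region
        ORDER BY total_sales DESC
    """, ("2020-01-01",))
    for region, total in cur.fetchall():
        print(region, total)

conn.close()
```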

2. Technologies based on extending and wrapping Hadoop

This route derives big data technologies by extending and wrapping Hadoop, targeting data and scenarios that traditional relational databases struggle with (such as the storage and computation of unstructured data), while taking full advantage of Hadoop's open-source ecosystem and its strengths (handling unstructured and semi-structured data, complex ETL processes, complex data mining and computation models, and so on).

As the technology advances, its application scenarios will gradually expand. Currently the most typical scenario is supporting Internet-scale data storage and analysis by extending and wrapping Hadoop, which involves dozens of NoSQL technologies.
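A minimal sketch of this route, assuming a PySpark environment with access to HDFS; the path and field names are hypothetical.

```python
# Minimal sketch of the Hadoop-ecosystem route: use PySpark to read log
# files from HDFS and run a simple aggregation. The HDFS path and field
# names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-log-analysis").getOrCreate()

# Read semi-structured JSON logs straight from HDFS.
logs = spark.read.json("hdfs://namenode:8020/data/logs/2020/03/*.json")

# Count events per level -- the work is distributed across the cluster.
summary = (logs.groupBy("level")
               .agg(F.count("*").alias("events"))
               .orderBy(F.desc("events")))
summary.show()

spark.stop()
```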

3. Big data appliances

These are combined software and hardware products designed specifically for big data analysis. An appliance consists of an integrated set of servers, storage devices, operating systems, and database management systems, together with pre-installed and optimized software for data querying, processing, and analysis, and offers good stability and vertical scalability.

[Figure: big data storage]

IV. Big Data Analysis and Mining

This is the process of extracting, refining, and analyzing otherwise disorderly data by means of visual analysis, data mining algorithms, predictive analysis, semantic engines, and data quality management.

1. Visual analysis

Visual analysis refers to conveying and communicating information clearly and effectively with the help of graphical means and analytical tools. It is mainly used for correlation analysis of massive data: with the help of a visual data analysis platform, dispersed heterogeneous data are analyzed for correlations and the result is presented as a complete chart.

It is simple, clear, intuitive, and easy to understand.
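The screenshots in this article use FineBI; as a stand-in, here is a minimal matplotlib sketch of the same idea, with made-up monthly figures.

```python
# Minimal visualization sketch with matplotlib (the article's screenshots use
# FineBI; this stand-in just illustrates the idea with made-up monthly data).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 160, 155, 180, 210]        # hypothetical values
visits = [1000, 1100, 1500, 1400, 1700, 2100]  # hypothetical values

fig, ax1 = plt.subplots()
ax1.bar(months, sales, color="steelblue", label="sales")
ax1.set_ylabel("sales")

# Overlay a second metric to show a simple correlation view.
ax2 = ax1.twinx()
ax2.plot(months, visits, color="darkorange", marker="o", label="site visits")
ax2.set_ylabel("site visits")

ax1.set_title("Sales vs. site visits (illustrative)")
fig.tight_layout()
plt.show()
```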

[Figure: FineBI visualization]

2. Data mining algorithms

Data mining algorithms are analysis tools that build data mining models and then test and compute against the data. They are the theoretical core of big data analysis.

There are many data mining algorithms, and because different algorithms are built for different data types and formats, they exhibit different data characteristics. In general, though, the process of creating a model is similar: first analyze the data supplied by the user, then look for patterns and trends of a particular type, use the results of that analysis to define the optimal parameters of the mining model, and finally apply those parameters to the entire data set to extract actionable patterns and detailed statistics.
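As a small, hedged example of the modeling workflow just described, the sketch below fits a k-means clustering model with scikit-learn on a toy customer table; the features and parameters are illustrative only.

```python
# Minimal data mining sketch: k-means clustering with scikit-learn on a toy
# customer table. Features and parameters are illustrative, not from the
# article.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: [annual spend, visits per month]
customers = np.array([
    [200, 2], [250, 3], [2200, 15], [2400, 18],
    [300, 4], [2100, 14], [180, 1], [2600, 20],
])

# Standardize features so both columns influence the distance equally.
X = StandardScaler().fit_transform(customers)

# Fit the model, then label each customer with a cluster id.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(X)

for row, label in zip(customers, labels):
    print(f"spend={row[0]:>5} visits={row[1]:>2} -> cluster {label}")
```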

[Figure: FineBI data mining]

3. Predictive analysis

Predictive analysis is one of the most important application areas of big data analysis. It combines a variety of advanced analytical capabilities (statistical analysis, predictive modeling, data mining, text analytics, entity analytics, optimization, real-time scoring, machine learning, and so on) to predict uncertain future events.

It helps users analyze trends, patterns, and relationships in structured and unstructured data, and use these indicators to predict future events and provide a basis for action.
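A minimal predictive-analysis sketch, assuming scikit-learn and a toy monthly sales series; the data and the choice of a simple linear model are illustrative, not the article's method.

```python
# Minimal predictive analysis sketch: fit a linear trend to past monthly
# sales and project the next quarter. Data and model choice are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Past 12 months of sales (hypothetical).
months = np.arange(1, 13).reshape(-1, 1)
sales = np.array([100, 104, 110, 115, 117, 125, 130, 133, 140, 146, 150, 158])

model = LinearRegression().fit(months, sales)

# Forecast months 13-15 from the fitted trend.
future = np.arange(13, 16).reshape(-1, 1)
for m, pred in zip(future.ravel(), model.predict(future)):
    print(f"month {m}: predicted sales {pred:.1f}")
```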

[Figure: FineBI regression forecast]

4. Semantic engines

A semantic engine adds semantics to existing data in order to improve the user's Internet search experience.

5. Data quality management

This refers to a series of management activities for identifying, measuring, monitoring, and providing early warning of the data quality problems that may arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application, retirement, and so on), with the aim of improving data quality.
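As a small example of measuring data quality, here is a hedged pandas sketch that checks completeness, duplicate keys, and value validity on a toy table; the columns and rules are placeholders.

```python
# Minimal data quality check sketch with pandas: measure completeness,
# duplicates, and out-of-range values on a toy table. Column names and
# rules are placeholders.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4, 5],
    "amount":   [99.5, None, 250.0, -10.0, 480.0],
    "country":  ["CN", "US", "US", None, "DE"],
})

report = {
    # Completeness: share of non-null values per column.
    "completeness": (1 - df.isna().mean()).round(2).to_dict(),
    # Uniqueness: number of duplicated primary keys.
    "duplicate_ids": int(df["order_id"].duplicated().sum()),
    # Validity: amounts must be non-negative.
    "negative_amounts": int((df["amount"] < 0).sum()),
}
print(report)
```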


The above is a view from the broad categories. In practice there are many specific big data frameworks and technologies; here are some of them:

File Storage: Hadoop HDFS, Tachyon, KFS

Offline computation: Hadoop MapReduce, Spark

Streaming and real-time computation: Storm, Spark Streaming, S4, Heron

KV and NoSQL databases: HBase, Redis, MongoDB

Resource Management: YARN, Mesos

Log collection: Flume, Scribe, Logstash, Kibana

Message system: Kafka, StormMQ, ZeroMQ, RabbitMQ

Analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid

Distributed Coordination Services: Zookeeper

Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

Data mining and machine learning: Mahout, Spark MLlib

Data synchronization: Sqoop

Task scheduling: Oozie

······

Source: blog.csdn.net/mnbvxiaoxin/article/details/104829154