If you want to get into big data, you have to understand these technologies

When the topic of big data comes up, many people can talk about it at length, but if you ask them what the core technologies of big data actually are, most would struggle to give a clear answer.

From machine learning to data visualization, big data development already has a fairly mature technology tree: different layers have different technical architectures, and new technical terms emerge every year. Faced with such a sprawling technology landscape, newcomers to big data are almost always intimidated.

In fact, understanding the core technologies of big data is quite simple: it comes down to three activities, namely taking in data, computing on data, and putting data to use. Some might say that is still too vague. From the perspective of the big data life cycle, it boils down to four stages: big data collection, big data preprocessing, big data storage, and big data analysis. Together these form the core technologies of the big data life cycle, and they are discussed in turn below:

1. Big data collection
Big data collection is the acquisition of massive structured and unstructured data from a wide variety of sources.

Database collection: Sqoop and ETL tools are the popular choices here, and traditional relational databases such as MySQL and Oracle still serve as data stores for many companies. The open-source tools Kettle and Talend have also integrated big data capabilities, enabling data synchronization and integration between HDFS, HBase, and mainstream NoSQL databases.
Network data collection: obtaining unstructured or semi-structured data from web pages through a web crawler or a site's open API, and unifying it into locally structured data (see the crawler sketch after this list).
File collection: including real-time file capture and processing with Flume, ELK-based log collection and incremental collection, and so on.
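
As a concrete illustration of network data collection, here is a minimal crawler sketch. It assumes the requests and beautifulsoup4 packages are available; the URL and the CSS selectors are placeholders, not the structure of any real site.

```python
# A minimal web-crawler sketch for network data collection.
# Assumes requests and beautifulsoup4 are installed; the URL and the
# CSS selectors below are placeholders for illustration only.
import requests
from bs4 import BeautifulSoup

def collect_page(url):
    """Fetch a page and turn its semi-structured HTML into structured records."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    records = []
    for item in soup.select("div.article"):          # placeholder selector
        title = item.select_one("h2")
        link = item.select_one("a")
        records.append({
            "title": title.get_text(strip=True) if title else None,
            "url": link["href"] if link and link.has_attr("href") else None,
        })
    return records

if __name__ == "__main__":
    for row in collect_page("https://example.com/news"):
        print(row)
```
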
2. Big data preprocessing
Big data preprocessing refers to the series of operations, such as cleaning, filling, smoothing, merging, normalization, and consistency checking, performed on the collected raw data before analysis, with the aim of improving data quality and laying the foundation for later analysis work. Data preprocessing includes four parts:

data cleaning, data integration, data transformation, and data reduction.

Data cleaning: using ETL or other cleaning tools to handle missing data (records lacking attributes of interest), noisy data (data containing errors or values that deviate from what is expected), and inconsistent data.
Data integration: combining data from different sources and storing it in a unified database, focusing on three problems: schema matching, data redundancy, and the detection and handling of conflicting data values.
Data transformation: handling inconsistencies in the extracted data. It also covers cleaning work, i.e., cleaning abnormal data according to business rules to ensure the accuracy of subsequent analysis results.
Data reduction: streamlining the data volume as far as possible while preserving the original data as much as possible, so as to obtain a smaller data set to operate on; it includes data aggregation, dimensionality reduction, data compression, numerosity reduction, and concept hierarchies. A minimal preprocessing sketch follows this list.
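
The four preprocessing steps can be illustrated with a short pandas sketch. The tables, column names, and rules below are invented purely for illustration.

```python
# A minimal preprocessing sketch using pandas: cleaning, integration,
# transformation, and reduction. Column names and rules are illustrative.
import pandas as pd

# Two hypothetical source tables to be integrated on a common key.
orders = pd.DataFrame({"user_id": [1, 2, 2, 3], "amount": [10.0, None, 25.0, -5.0]})
users = pd.DataFrame({"user_id": [1, 2, 3], "region": ["north", "south", "south"]})

# Cleaning: fill missing values and drop out-of-range (noisy) records.
orders["amount"] = orders["amount"].fillna(orders["amount"].median())
orders = orders[orders["amount"] >= 0]

# Integration: combine the two sources into one unified table.
data = orders.merge(users, on="user_id", how="left")

# Transformation: normalize the amount column to the [0, 1] range.
span = data["amount"].max() - data["amount"].min()
data["amount_norm"] = (data["amount"] - data["amount"].min()) / (span or 1.0)

# Reduction: aggregate to a smaller data set per region.
summary = data.groupby("region", as_index=False)["amount"].sum()
print(summary)
```
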
3. Big data storage
Big data storage refers to persisting the collected data in the form of a database, and it covers three typical routes:

1. New database clusters based on the MPP architecture

These adopt a shared-nothing architecture and combine the efficient distributed computing model of MPP with big data processing techniques such as column storage and coarse-grained indexing, forming a storage approach oriented toward industry-scale big data. With low cost, high performance, and high scalability, they are widely used in enterprise analytical applications.

Compared with traditional databases, MPP products have significant advantages in PB-scale data analysis. Naturally, MPP databases have become the preferred choice for a new generation of enterprise data warehouses.
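
Since most MPP engines (Greenplum, Vertica, and the like) speak SQL over a standard client protocol, an analytical query against such a cluster can look like the following sketch. The host, credentials, and the sales table are hypothetical; psycopg2 is used here only because many MPP products are PostgreSQL-compatible.

```python
# A minimal sketch of an analytical query against an MPP database.
# The host, credentials, and the sales table are hypothetical placeholders;
# the cluster parallelizes the scan and aggregation across its segments.
import psycopg2

conn = psycopg2.connect(
    host="mpp-master.example.com", port=5432,
    dbname="warehouse", user="analyst", password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT region, SUM(amount) AS total "
        "FROM sales WHERE sale_date >= %s GROUP BY region",
        ("2019-01-01",),
    )
    for region, total in cur.fetchall():
        print(region, total)
conn.close()
```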

2. Expansion and encapsulation based on Hadoop

Expansion and encapsulation based on Hadoop targets data and scenarios that traditional relational databases struggle to handle (the storage and computation of unstructured and semi-structured data, for example). By exploiting Hadoop's open-source advantages and related characteristics (it is good at handling unstructured and semi-structured data, complex ETL flows, complex data mining and computational models, and so on), a family of related big data technologies has been derived.

As the technology advances, its application scenarios will gradually expand. Currently the most typical scenario is using Hadoop expansion and encapsulation to support the storage and analysis of massive Internet data, which involves dozens of NoSQL technologies.
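
As a rough illustration of the Hadoop route, here is a minimal PySpark sketch that analyzes semi-structured JSON stored on HDFS. The HDFS path and the field names are assumptions.

```python
# A minimal PySpark sketch: analyzing semi-structured JSON logs stored on HDFS.
# The HDFS path and the field name "level" are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-json-demo").getOrCreate()

# Hadoop/HDFS handles the distributed storage; Spark handles the computation.
logs = spark.read.json("hdfs:///data/logs/2019/*.json")

# Count log records per level, the kind of loosely structured workload a
# traditional relational database struggles with at this scale.
logs.groupBy("level").count().orderBy("count", ascending=False).show()

spark.stop()
```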

3. Big data appliances

These are combined hardware and software products designed specifically for big data analysis. They consist of a set of integrated servers, storage devices, operating systems, and database management systems, together with pre-installed and optimized software for data querying, processing, and analysis, and they offer good stability and vertical scalability.

4. Big data analysis and mining
This is the process of extracting, refining, and analyzing otherwise chaotic data by means of visual analysis, data mining algorithms, predictive analysis, semantic engines, data quality management, and so on.

1. Visual analysis

Visual analysis refers to analytical tools that communicate information clearly and effectively with the aid of graphics. It is mainly used for correlation analysis of massive data, that is, using a visual data analysis platform to perform correlation analysis on scattered, heterogeneous data and produce a complete chart of the analysis process.

(Figure: FineBI visualization)
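
FineBI is a commercial BI platform rather than a library, so as a code-level stand-in, the following sketch performs a simple correlation analysis and renders it as a heatmap with pandas and matplotlib. The metrics are synthetic.

```python
# A minimal visual-analysis sketch: a correlation heatmap over a few
# synthetic metrics, using pandas and matplotlib as a stand-in for a
# BI platform such as FineBI.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "visits": rng.integers(100, 1000, size=50),
    "orders": rng.integers(10, 100, size=50),
    "revenue": rng.normal(500, 100, size=50),
})

corr = df.corr()  # pairwise correlation between the metrics

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```
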
2. Data mining algorithms

Data mining algorithms are the means of analyzing data by creating a data mining model and then testing and computing against the data. They are the theoretical core of big data analysis.

There are many kinds of data mining algorithms, and because different algorithms are based on different data types and formats, they exhibit different data characteristics. In general, though, the process of creating a model is similar: first analyze the data supplied by the user, then look for patterns and trends of a particular type, use the results of that analysis to define the optimal parameters for the mining model, and apply these parameters to the entire data set to extract viable patterns and detailed statistics.
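
To make that workflow concrete, here is a minimal scikit-learn sketch that fits a k-means model to synthetic data and then applies it to the whole data set. The features and the choice of k=3 are assumptions for illustration.

```python
# A minimal data-mining sketch: fit a k-means model, then apply it to the
# full data set, mirroring the analyze -> define parameters -> apply flow
# described above. The synthetic features and k=3 are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic "user behavior" features, e.g. purchase frequency and spend.
X = np.vstack([
    rng.normal(loc=[2, 20], scale=1.5, size=(100, 2)),
    rng.normal(loc=[8, 60], scale=1.5, size=(100, 2)),
    rng.normal(loc=[15, 30], scale=1.5, size=(100, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)            # patterns: a cluster id per record
print("cluster centers:\n", model.cluster_centers_)
print("records per cluster:", np.bincount(labels))
```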

3. Predictive analysis

Predictive analysis is one of the most important application areas of big data analysis. It combines a variety of advanced analytic capabilities (in particular statistical analysis, predictive modeling, data mining, text analytics, entity analytics, optimization, real-time scoring, machine learning, and so on) to predict uncertain future events.

It helps users analyze trends, patterns, and relationships in structured and unstructured data, and use these indicators to predict future events and provide a basis for taking action.
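
A minimal predictive-modeling sketch, assuming scikit-learn: learn a trend from hypothetical historical sales and score the next few periods with it.

```python
# A minimal predictive-analysis sketch: learn a trend from historical data
# and use it to score future periods. The data and model choice are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales for two years (months 1..24).
months = np.arange(1, 25).reshape(-1, 1)
sales = 100 + 5 * months.ravel() + np.random.default_rng(1).normal(0, 10, 24)

model = LinearRegression().fit(months, sales)

# Predict the next quarter based on the learned trend.
future = np.arange(25, 28).reshape(-1, 1)
for m, pred in zip(future.ravel(), model.predict(future)):
    print(f"month {m}: predicted sales {pred:.1f}")
```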

4. Semantic engines

A semantic engine refers to adding semantics to existing data so that operations such as Internet search deliver a better user experience.
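
A deliberately simplified sketch of the idea: attach a small synonym vocabulary to the search path so that queries match meaning rather than exact keywords. Real semantic engines rely on knowledge graphs or embedding models; the vocabulary and documents here are toy assumptions.

```python
# A deliberately simplified sketch of "adding semantics" to search:
# expand the user's query with a small synonym vocabulary so documents
# that express the same meaning in different words are still found.
# The vocabulary and documents below are toy assumptions.
SYNONYMS = {
    "laptop": {"notebook", "ultrabook"},
    "cheap": {"affordable", "budget"},
}

DOCUMENTS = [
    "affordable notebook with long battery life",
    "gaming desktop with liquid cooling",
]

def expand(query):
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms

def search(query):
    terms = expand(query)
    return [doc for doc in DOCUMENTS if terms & set(doc.split())]

print(search("cheap laptop"))   # matches the first document via synonyms
```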

5. Data quality management

This refers to a series of management activities, identifying, measuring, monitoring, and giving early warning of the various data quality problems that may arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application, retirement, and so on), in order to improve data quality.
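
A minimal sketch of rule-based quality monitoring with pandas: measure a few common quality problems and flag anything that exceeds a threshold. The columns, rules, and thresholds are illustrative assumptions.

```python
# A minimal data-quality-management sketch: measure a few common quality
# problems and raise a warning when a threshold is exceeded. The column
# names, rules, and thresholds are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 2, 4, None],
    "age": [25, -3, 40, 200, 31],
})

report = {
    "null_rate": df["user_id"].isna().mean(),
    "duplicate_rows": int(df.duplicated(subset=["user_id"]).sum()),
    "age_out_of_range": int(((df["age"] < 0) | (df["age"] > 120)).sum()),
}

for metric, value in report.items():
    status = "WARN" if value > 0 else "OK"
    print(f"{status}: {metric} = {value}")
```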

--------Dividing line--------

The above describes big data from the perspective of broad categories; in terms of concrete frameworks there are many more technologies. Here are some of them:

File Storage: Hadoop HDFS, Tachyon, KFS

Offline computation: Hadoop MapReduce, Spark (a word-count sketch follows this list)

Streaming / real-time computation: Storm, Spark Streaming, S4, Heron

KV / NoSQL databases: HBase, Redis, MongoDB

Resource Management: YARN, Mesos

Log collection: Flume, Scribe, Logstash, Kibana

Message system: Kafka, StormMQ, ZeroMQ, RabbitMQ

Query and analysis: Hive, Impala, Pig, Presto, Phoenix, Spark SQL, Drill, Flink, Kylin, Druid

Distributed Coordination Services: Zookeeper

Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

Data mining, machine learning: Mahout, Spark MLLib

Data synchronization: Sqoop

Task scheduling: Oozie
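
To make the framework list a bit more concrete, here is a minimal PySpark word count as an example of offline computation. The input path is a placeholder and could point to a local file or HDFS.

```python
# A minimal offline-computation sketch with PySpark: the classic word count.
# The input path is a placeholder; it could be a local file or an HDFS path.
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///data/input/*.txt")
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(add)
)

# Print the ten most frequent words.
for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)

spark.stop()
```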


Source: blog.51cto.com/14342636/2429004