Big Data Preprocessing Architecture and Methods

Introduction

Data preprocessing includes data cleaning, data integration, data transformation, and data reduction. Building on these basic concepts, this section explains the architecture and methods used to preprocess big data.

Overall architecture of big data preprocessing

Big data preprocessing divides data into structured data and semi-structured/unstructured data, which are handled with traditional ETL tools and with distributed parallel processing frameworks, respectively. The overall architecture is shown in Figure 1.

Figure 1: The overall architecture of big data preprocessing

Structured data can be stored in a traditional relational database. Relational databases have a natural advantage in transaction processing, timely response, and guaranteeing data consistency.

Unstructured data can be stored in newer distributed storage systems such as Hadoop HDFS, while semi-structured data can be stored in distributed NoSQL databases such as HBase.

Distributed storage has significant advantages in system scalability, storage cost, and file read speed.

Data can migrate between the structured and unstructured stores on demand, as required by data processing. For example, to enable fast parallel processing, structured data stored in a traditional relational database may need to be imported into distributed storage.

Tools such as Sqoop can be used to first import the table structure of the relational database into the distributed database, and then import the table data.
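As an illustration, the sketch below drives such an import from Python by shelling out to the Sqoop command line. The JDBC URL, credentials, and table names are hypothetical placeholders, and the flags should be verified against the installed Sqoop version.

    import subprocess

    # Hypothetical import of a "customer" table from a relational database
    # into HBase via Sqoop; connection details are placeholders.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/sales",   # source relational database
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop.pwd",   # avoid passwords on the command line
        "--table", "customer",                       # relational table to import
        "--hbase-table", "customer",                 # target HBase table
        "--column-family", "info",
        "--hbase-create-table",                      # create the HBase table first
        "--num-mappers", "4",                        # degree of parallelism
    ], check=True)

An import into plain HDFS files can be expressed the same way by replacing the HBase options with a --target-dir path.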

Classification of data quality issues

Data cleaning brings together data of multiple dimensions, from multiple sources, and with a variety of structures, and integrates it through extraction, transformation, and loading (ETL).

In this process, apart from correcting some data errors through repair mechanisms, the work is mostly a matter of merging and sorting the data and saving it to a new storage medium. Here, the quality of the data is critical.

As shown in Figure 2, common data quality problems can be divided into four categories according to the number of data sources and the level at which they occur (the definition layer and the instance layer).

1) Single data source, definition layer

Violations of field constraints (e.g., a date of September 31), conflicts between dependent attributes (e.g., two records describe the same attribute of the same person but with inconsistent values), violations of uniqueness (the same primary key ID appears more than once), and so on.

2) Single data source, instance layer

A single attribute value contains too much information, spelling errors, blank values, noisy data, duplicated data, and outdated data.

3) Multiple data sources, definition layer

The same entity is referred to by different names (e.g., custom_id vs. custom_num), and the same attribute is defined differently (e.g., inconsistent field lengths, inconsistent field types, etc.).

4) Multiple data sources, instance layer

Inconsistent data dimensions and granularity (for example, some records store storage in GB while others use TB; some statistics are aggregated by year while others are aggregated by month), duplicated data, and spelling errors. Checks for several of these problems are sketched after Figure 2.

Figure 2: Classification of data quality issues
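To make these categories concrete, the following is a minimal pandas sketch of automated checks for a few of the instance-layer problems listed above; the table, column names, and unit conversion are hypothetical.

    import pandas as pd

    # Hypothetical customer table with several instance-layer quality problems.
    df = pd.DataFrame({
        "custom_id": [1, 2, 2, 3],
        "birth_date": ["1990-09-31", "1985-01-12", "1985-01-12", None],
        "storage_used": ["500 GB", "2 TB", "2 TB", "120 GB"],
    })

    # Violated uniqueness: the same primary key appears more than once.
    duplicate_keys = df[df["custom_id"].duplicated(keep=False)]

    # Violated field constraints: dates such as September 31 fail to parse.
    parsed = pd.to_datetime(df["birth_date"], errors="coerce")
    invalid_dates = df[parsed.isna() & df["birth_date"].notna()]

    # Inconsistent granularity: convert GB/TB values to a single unit (GB).
    parts = df["storage_used"].str.split(expand=True)
    df["storage_gb"] = parts[0].astype(float) * parts[1].map({"GB": 1, "TB": 1024})

    print(duplicate_keys)
    print(invalid_dates)
    print(df[["custom_id", "storage_gb"]])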

In addition, the "secondary data" generated during processing may itself contain noise, duplicates, or errors.

Data adjustment and cleaning also involve matters related to format, units of measurement, and data normalization and standardization, all of which can have a relatively large impact on experimental results. Problems of this type can usually be attributed to uncertainty.

Uncertainty has two aspects: uncertainty about the existence of the data point itself, and uncertainty about the data point's attribute values. The former can be described with a probability; the latter can be described in several ways, such as describing the attribute value with a probability density function, or with representative statistics such as the variance.
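As a small illustration of the second kind of uncertainty, the sketch below summarizes repeated noisy readings of a single attribute with a representative value and its variance, and evaluates a normal density built from those estimates; the readings are made up for the example.

    import numpy as np

    # Hypothetical repeated measurements of one attribute value.
    readings = np.array([20.1, 19.8, 20.4, 20.0, 19.7])

    # Statistical description: a representative value plus its variance.
    mean = readings.mean()
    variance = readings.var(ddof=1)

    # Probabilistic description: treat the attribute as normally distributed
    # with the estimated parameters and evaluate the density at a candidate value.
    def normal_pdf(x, mu, sigma2):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    print(mean, variance, normal_pdf(20.0, mean, variance))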

Big data preprocessing methods

Noisy data refers to errors or anomalies (deviations from the expected values) present in the data; incomplete data refers to data in which attributes of interest have no values; inconsistent data refers to data whose meanings are inconsistent (for example, the code values used as keys for the same department differ).

Data cleaning means removing the noise present in the data and correcting errors and inconsistencies. Data integration means combining data from multiple data sources into a complete data set.

Data transformation means converting data from one format into another. Data reduction means shrinking the data, for example by removing redundant features or by clustering.

Incomplete, noisy, and inconsistent data are very common in big data. Incomplete data arises for a variety of reasons.

  • Some attributes simply have no content; for example, the customer information involved in sales transaction data may be incomplete.
  • Some data was not considered necessary at the time the transaction occurred and therefore was not recorded.
  • Misunderstandings or failures of the detection equipment caused the relevant data not to be recorded.
  • Data that was inconsistent with other recorded content was deleted.
  • The history or modifications of the data were ignored. Missing data, especially missing values of some key attributes, may need to be inferred.

The causes of noisy data are as follows.

  • The data collection equipment had problems.
  • Human or computer errors occurred during data entry.
  • Errors occurred during data transmission.
  • Inconsistent naming conventions or data codes caused inconsistencies.

The data cleaning process generally includes filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Problematic data would mislead the search process of data mining. For details, refer to the "Data Cleansing" tutorial.
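A minimal sketch of these steps in pandas is shown below, assuming a hypothetical sensor table; the column names, the median fill, and the IQR-based outlier rule are illustrative choices rather than a prescribed method.

    import pandas as pd

    # Hypothetical sensor readings with a missing value, an outlier, and duplicates.
    df = pd.DataFrame({
        "sensor": ["a", "a", "a", "b", "b", "b", "b"],
        "value": [10.2, 10.2, None, 10.0, 250.0, 10.1, 10.4],
    })

    # Fill missing values with the column median.
    df["value"] = df["value"].fillna(df["value"].median())

    # Flag values outside 1.5 * IQR of the quartiles as outliers and drop them.
    q1, q3 = df["value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Remove exact duplicate records.
    df = df.drop_duplicates()
    print(df)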

Although most data mining procedures include some handling of incomplete or noisy data, they are not entirely reliable, and their focus is often on preventing the mined patterns from describing the data too precisely. A certain amount of data cleaning therefore remains a necessary processing step.

Data integration merges data from multiple data sources. The attributes describing the same concept may have different names in different databases, which often causes inconsistent or redundant data during integration.

For example, the customer identification code is "custom_number" in one database and "custom_id" in another. Naming inconsistencies are often accompanied by different representations of the same attribute value.

For example, one database stores a person's name as "John" while another stores it as "J". Large amounts of redundant data not only slow down mining but can also mislead the mining process. Therefore, in addition to cleaning the data, attention must also be paid to eliminating redundancy during data integration.
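The sketch below shows one way this might look in pandas, reusing the custom_number/custom_id naming mismatch from the example; the tables and values are hypothetical.

    import pandas as pd

    # Two hypothetical customer tables whose identifier columns use different names.
    db_a = pd.DataFrame({"custom_number": [1, 2], "name": ["John", "Mary"]})
    db_b = pd.DataFrame({"custom_id": [2, 3], "name": ["Mary", "Alice"]})

    # Map both sources onto a common schema before merging.
    db_a = db_a.rename(columns={"custom_number": "custom_id"})

    # Merge the sources and eliminate redundant records.
    merged = (pd.concat([db_a, db_b], ignore_index=True)
                .drop_duplicates(subset="custom_id"))
    print(merged)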

For details refer to "Data Integration" tutorial.

Data transformation is mainly carried out through normalization. Before formal data mining, and especially when distance-based mining algorithms such as neural networks or nearest-neighbor classification are used, the data must be normalized, i.e., scaled into a specific range such as [0, 1].

For example, consider the age and salary attributes in a customer information database. Salary values are much larger than age values, so without normalization the distances computed on the salary attribute will far exceed those computed on the age attribute, meaning that the influence of the salary attribute is incorrectly magnified in the overall distance calculation between data objects.
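A minimal min-max normalization sketch is shown below; the age and salary values are made up, and both columns are scaled into [0, 1] so that neither dominates the distance calculation.

    import pandas as pd

    # Hypothetical customer attributes on very different scales.
    df = pd.DataFrame({
        "age": [23, 35, 47, 60],
        "salary": [30000, 52000, 80000, 110000],
    })

    # Min-max normalization: scale every column into [0, 1].
    normalized = (df - df.min()) / (df.max() - df.min())
    print(normalized)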

For details refer to "data conversion" tutorial.

The purpose of data reduction is to shrink the scale of the data to be mined without affecting (or without substantially affecting) the final mining results. Common data reduction methods are as follows.

1) Data aggregation (Data Aggregation), such as constructing a data cube.

2) Dimension reduction (Dimension Reduction), such as eliminating redundant, highly correlated attributes (see the sketch after this list).

3) Data compression (Data Compression), such as applying coding methods (e.g., minimum-length encoding or wavelets).

4) Numerosity reduction (Numerosity Reduction), such as replacing the existing data with clusters or a parametric model. In addition, generalization (Generalization) based on concept hierarchies can also reduce the size of the data.
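As a hedged illustration of method 2), the sketch below drops one attribute from every pair whose absolute correlation exceeds a chosen threshold; the feature table and the 0.95 threshold are hypothetical.

    import pandas as pd

    # Hypothetical feature table in which height_in merely restates height_cm.
    df = pd.DataFrame({
        "height_cm": [170, 182, 165, 174],
        "height_in": [66.9, 71.7, 65.0, 68.5],
        "weight_kg": [65, 60, 70, 72],
    })

    # Drop one attribute from every highly correlated pair.
    corr = df.corr().abs()
    to_drop = set()
    for i, col_a in enumerate(corr.columns):
        for col_b in corr.columns[i + 1:]:
            if corr.loc[col_a, col_b] > 0.95 and col_b not in to_drop:
                to_drop.add(col_b)

    reduced = df.drop(columns=sorted(to_drop))
    print(reduced)   # keeps height_cm and weight_kg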

These data preprocessing methods are not mutually independent but interrelated. For example, eliminating data redundancy can be regarded both as a form of data cleaning and as a form of data reduction.
