Big data collection

1. The source of big data
1. Human activities
2. Computer
3. The physical world
2. Big data collection equipment
1. Scientific research data
(1) Large Hadron Collider
(2) Radio telescope
(3) Electron microscope
2. Network data
We can use the data center to collect data from the network.
3. Big data collection methods
1. Scientific research data
2. Network data
Crawler (use with caution)
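A minimal sketch of the crawler approach, assuming the third-party `requests` and `beautifulsoup4` packages and a hypothetical target URL; always check the site's robots.txt and terms of service before crawling:

```python
# Hypothetical example: fetch a page and extract headings (not any specific site's API).
import time

import requests
from bs4 import BeautifulSoup


def fetch_titles(url: str) -> list[str]:
    """Download one page and return the text of its <h2> headings."""
    resp = requests.get(url, headers={"User-Agent": "data-collection-demo/0.1"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


if __name__ == "__main__":
    for page in range(1, 3):                                   # crawl only a couple of pages
        print(fetch_titles(f"https://example.com/news?page={page}"))
        time.sleep(1.0)                                        # rate-limit between requests
```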
3. System log
(1) Scribe is Facebook's open-source log collection system, which has been widely used within Facebook. The Scribe architecture is shown in the following figure:
[Figure: Scribe architecture]
(2) Chukwa
Chukwa provides a complete solution and framework for the collection, storage, analysis, and display of large volumes of log data. The Chukwa architecture is shown in the following figure:
[Figure: Chukwa architecture]
4. Big data preprocessing technology
1. There are currently four mainstream data preprocessing technologies: data cleaning, data integration, data reduction and data transformation.
2. The main tasks of data preprocessing
(1) The main steps of data preprocessing are data cleaning, data integration, data reduction, and data transformation.
(2) Data cleaning routines "clean the data" by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
(3) The data integration process merges data from multiple data sources.
(4) The purpose of data reduction is to obtain a simplified representation of the dataset. Data reduction includes dimensionality reduction and numerosity reduction (see the PCA sketch after this list).
(5) Data transformation uses methods such as normalization, data discretization, and concept hierarchy generation so that data mining can be carried out at multiple levels of abstraction. Data transformation is an additional preprocessing step that contributes to the success of the data mining process.
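As one concrete illustration of the dimensionality reduction mentioned in (4), here is a minimal sketch using scikit-learn's PCA; the synthetic data and the choice of two retained components are assumptions for the example:

```python
# Dimensionality reduction with PCA on synthetic data (2 retained components are an assumption).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 samples, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one feature nearly redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                   # simplified representation of the dataset
print(X_reduced.shape)                             # (100, 2)
print(pca.explained_variance_ratio_)               # variance retained by each component
```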
3. Data cleaning
(1) Missing values
For missing values, the general idea is either to fill them in or simply to discard the tuple. Common treatments include: ignoring the tuple, filling in the missing value manually, filling it with a global constant, filling it with a measure of central tendency for the attribute (such as the mean or median), filling it with the attribute mean or median of all samples belonging to the same class as the given tuple, and filling it with the most probable value.
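A small sketch of several of these treatments using pandas; the column names, class labels, and fill choices are illustrative assumptions:

```python
# Filling missing values with pandas (illustrative columns: age, income, class).
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 31, None, 40],
    "income": [3000, 5200, None, 4100, 3900],
    "class":  ["A", "A", "B", "B", "A"],
})

df["income"] = df["income"].fillna(0)              # fill with a global constant
# Fill with the mean of the samples that belong to the same class as the tuple:
df["age"] = df.groupby("class")["age"].transform(lambda s: s.fillna(s.mean()))
df["age"] = df["age"].fillna(df["age"].median())   # fall back to the attribute's median
df = df.dropna()                                   # or simply ignore (drop) incomplete tuples
print(df)
```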
(2) Noise data
Noise is the random error or variance of a measured variable. Techniques that remove noise and "smooth" the data include binning, regression, and outlier analysis.
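A minimal sketch of smoothing by bin means (binning) with pandas; the sample values and the choice of three equal-frequency bins are assumptions:

```python
# Smoothing noisy values by bin means, using equal-frequency bins (3 bins are an assumption).
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.qcut(prices, q=3)                          # partition into 3 equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")    # replace each value by its bin mean
print(smoothed.tolist())
```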
(3) Data cleaning process
The data cleaning process mainly includes data preprocessing, determining the cleaning method, verifying the cleaning method, executing the cleaning tool, and data archiving.
The principle of data cleaning is to analyze the causes and forms of "dirty data", use existing technical means and methods to clean it up, and transform the "dirty data" into data that meets the quality or application requirements, thereby improving the data quality of the dataset.
There are two main methods of data analysis: data derivation and data mining.
5. Data Integration
1. Entity Recognition
2. Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. Some redundancies can be detected by correlation analysis; for example, for numerical attributes, the correlation coefficient and covariance can be used to assess how one attribute varies with another (a short pandas sketch follows this list).
3. Detection and processing of data conflicts
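For the correlation analysis mentioned in item 2, a minimal sketch with pandas; the attributes and values are synthetic and chosen only to show a redundant pair:

```python
# Detecting redundant numerical attributes with correlation and covariance (synthetic data).
import pandas as pd

df = pd.DataFrame({
    "height_cm": [160, 165, 170, 175, 180],
    "height_in": [63.0, 65.0, 66.9, 68.9, 70.9],   # carries almost the same information
    "weight_kg": [55, 62, 59, 74, 80],
})

print(df.corr())   # Pearson correlation; |r| close to 1 between the two height columns
print(df.cov())    # covariance: how one attribute varies together with another
```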
6. Data transformation and data discretization (key topic)
1. Common methods of data transformation (a combined numerical sketch follows this list)
(1) Centering transformation. The centering transformation translates the coordinate axes, typically by subtracting the mean of each variable so that the data are centered on the origin.
(2) Range (min-max) normalization. Range normalization finds the maximum and minimum values of each variable in the data matrix; the difference between the two is called the range, and each value is rescaled relative to it.
(3) Standardization transformation. Standardization, like range normalization, adjusts both the values and the units (dimensions) of variables; it is typically performed by subtracting the mean and dividing by the standard deviation.
(4) Logarithmic transformation. The logarithmic transformation takes the logarithm of each original value and uses that logarithm as the transformed value. Uses of the logarithmic transformation: normalizing data that follow a lognormal distribution, stabilizing the variance, and straightening curves, which is often used in curve fitting.
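A combined sketch of the four transformations above on a single numeric vector, using NumPy; the sample values are assumptions:

```python
# Centering, range (min-max) normalization, standardization, and log transform (synthetic values).
import numpy as np

x = np.array([12.0, 15.0, 20.0, 35.0, 60.0])

centered   = x - x.mean()                          # centering: shift the axis to the mean
min_max    = (x - x.min()) / (x.max() - x.min())   # range normalization into [0, 1]
z_score    = (x - x.mean()) / x.std()              # standardization: zero mean, unit variance
log_values = np.log(x)                             # logarithmic transform (requires x > 0)

print(centered.round(3), min_max.round(3), z_score.round(3), log_values.round(3), sep="\n")
```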
2. Data discretization
Purpose of data discretization:
(1) Algorithm requirements. For example, algorithms such as decision trees and Naive Bayes cannot use continuous variables directly.
(2) Discretization can effectively overcome hidden defects in the data and make the model results more stable.
(3) It helps to diagnose and describe nonlinear relationships.
Principles of data discretization:
(1) Equal width (equal spacing)
Equal-width binning preserves the original distribution of the data; the more bins used, the better the original shape of the data is retained.
(2) Equal frequency
Equal-frequency binning transforms the data toward a uniform distribution, with the same number of observations in each bin, something equal-width binning cannot guarantee (a short binning sketch comparing the two follows this list).
(3) Optimal discretization
Optimal discretization examines the independent variable together with the target variable: the breakpoints are the cut points at which the target variable changes significantly. Commonly used criteria are information gain, the Gini index, and WOE (which requires the target variable to be binary).
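A minimal sketch of equal-width and equal-frequency binning with pandas; the synthetic lognormal data and the choice of four bins are assumptions:

```python
# Equal-width (pd.cut) versus equal-frequency (pd.qcut) discretization, 4 bins assumed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000))   # skewed synthetic data

equal_width = pd.cut(income, bins=4)   # equal spacing: keeps the shape of the distribution
equal_freq  = pd.qcut(income, q=4)     # equal frequency: roughly the same count per bin

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Here pd.cut keeps the bin widths equal while pd.qcut keeps the bin counts roughly equal, mirroring principles (1) and (2) above.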
Data discretization methods:
clustering
decision tree
correlation analysis (ChiMerge)
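For the clustering method listed above, a minimal sketch that discretizes a single attribute with k-means from scikit-learn; the synthetic age data and the choice of three clusters are assumptions:

```python
# Clustering-based discretization of one attribute with k-means (3 clusters are an assumption).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
ages = np.concatenate([rng.normal(25, 3, 50), rng.normal(45, 4, 50), rng.normal(70, 5, 50)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages.reshape(-1, 1))
labels = km.labels_                              # the discrete bin assigned to each value
print(np.sort(km.cluster_centers_.ravel()))      # approximate centers of the three bins
```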
