Introduction to Big Data

Section I: Data

First, the concept

Data is the result of observation, experiment, or calculation. It can take the form of numbers, pictures, video, and so on.

Second, the classification

1, by structure

Structured data: data in MySQL tables, Excel sheets, and other strictly two-dimensional tables. Every row has the same columns, and each column holds the same type of value in every row.

Unstructured data: data with no fixed structure, such as video, images, and audio. It is stored in binary form.

Semi-structured data: data that has some structure but does not fit a strict two-dimensional table. Examples are HTML, CSS, XML, and JSON, which organize content with tags or keys.
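As a rough illustration of the difference between structured and semi-structured data, the sketch below reads similar records from a CSV table and from JSON. The field names (`name`, `age`, `tags`) and the sample values are made up for this example.

```python
import csv
import io
import json

# Structured: every row has the same columns (a strict two-dimensional table).
csv_text = "name,age\nalice,30\nbob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each record carries its own keys and may nest or omit fields.
json_text = '[{"name": "alice", "age": 30, "tags": ["vip"]}, {"name": "bob"}]'
records = json.loads(json_text)

# Check whether all records share exactly the same set of fields.
same_columns = all(row.keys() == rows[0].keys() for row in rows)
same_keys = all(rec.keys() == records[0].keys() for rec in records)

print(same_columns)  # True: the table enforces one schema
print(same_keys)     # False: the second JSON record omits "age" and "tags"
```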

2, by time of generation

Offline data: data that already exists, static.

Real-time data: data generated in real time, dynamic.

Near real-time data: data that is processed a short delay after it is generated.
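The contrast between offline and real-time data can be sketched with a simple average. The function and class names here (`batch_average`, `RunningAverage`) are hypothetical: the batch version needs the whole data set up front, while the running version updates its result as each new value arrives.

```python
def batch_average(values):
    """Offline style: the full, static data set is available before computing."""
    return sum(values) / len(values)


class RunningAverage:
    """Real-time style: update the result incrementally as values arrive."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: shift the old mean toward the new value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


stream = RunningAverage()
for value in [2, 4, 6]:
    latest = stream.update(value)

print(batch_average([2, 4, 6]))  # 4.0
print(latest)                    # 4.0
```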

Section II: Big Data

First, the concept

Big data refers to data sets so large and complex that traditional tools cannot handle them, whether for storage or for computation.

Second, data units

1 Byte = 8 bits

1 KB = 1,024 Bytes = 8,192 bits

1 MB = 1,024 KB = 1,048,576 Bytes (typical scale of an individual user's data, such as text)

1 GB = 1,024 MB = 1,048,576 KB

1 TB = 1,024 GB = 1,048,576 MB

1 PB = 1,024 TB = 1,048,576 GB (Enterprise-level data)

1 EB = 1,024 PB = 1,048,576 TB

1 ZB = 1,024 EB = 1,048,576 PB (total global data level)

1 YB = 1,024 ZB = 1,048,576 EB

1 BB = 1,024 YB = 1,048,576 ZB

1 NB = 1,024 BB = 1,048,576 YB

1 DB = 1,024 NB = 1,048,576 BB

Note: the amount of data a business holds is generally at the TB or PB level.
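The conversions above all follow the same rule: each unit is 1,024 times the previous one. A minimal sketch of that arithmetic follows; the helper names `to_bytes` and `human_readable` are made up for this example, and the list stops at YB.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes, using steps of 1,024."""
    return value * 1024 ** UNITS.index(unit)


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(to_bytes(1, "MB"))        # 1048576
print(to_bytes(1, "PB") // to_bytes(1, "GB"))  # 1048576
print(human_readable(1536))     # 1.50 KB
```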

Third, the characteristics of Big Data

Volume: the amount of data is large; data is now generated quickly and in complex types such as video.

Variety: structured, semi-structured, and unstructured data

Velocity: data is being generated everywhere, all the time; on the Internet there are no secrets.

Value: the overall value is high, but the value of any single record is low, so the value density is low.

Veracity: the accuracy and trustworthiness of the data.

Fourth, the value of Big Data

User profiling: tagging each user according to all of that user's commercial behavior.

Businesses use a user's network activity, consumption habits, and search interests to build relationship graphs and profile tags. These support precision marketing, where every user sees personalized content ("a thousand faces for a thousand people").
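A toy sketch of tagging a user from behavior might look like the following. The event format, the tag names, and the thresholds (three purchases, two searches) are all assumptions made for illustration, not rules taken from any real system.

```python
def tag_user(events):
    """Derive profile tags from a list of (action, category) events.

    Example event: ("purchase", "books"). Thresholds are arbitrary.
    """
    tags = set()
    purchases = [category for action, category in events if action == "purchase"]
    searches = [category for action, category in events if action == "search"]

    if len(purchases) >= 3:
        tags.add("frequent_buyer")
    for category in set(searches):
        if searches.count(category) >= 2:
            tags.add(f"interested_in_{category}")
    return tags


events = [("search", "phones"), ("search", "phones"), ("purchase", "books")]
print(tag_user(events))  # {'interested_in_phones'}
```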

Fifth, the core concepts of Big Data

1, cluster

When a task (storage or computation) requires multiple servers to complete it together, that group of servers is called a cluster. Each server in the cluster is called a node (the nodes are typically on the same LAN, each with a different IP address).

2, distributed

( 1 ) concept

When a task is completed jointly by more than one node, the implementation of that task is distributed.

( 2 ) distributed storage

Distributed file system: a large file that cannot be stored on one node is cut into many small files, and these small files are stored on multiple nodes.

Distributed database: a large table is cut into multiple small tables, which are stored on multiple nodes.
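The idea of cutting a large file into pieces and spreading them over nodes can be sketched in a few lines. This is only an in-memory illustration: the node names, the tiny block size, and the round-robin placement are assumptions, and real distributed file systems also keep replicas and metadata, which are omitted here.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign blocks to nodes in round-robin order, remembering each index."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        node = nodes[index % len(nodes)]
        placement[node].append((index, block))
    return placement


def read_back(placement):
    """Reassemble the original data by sorting all blocks by their index."""
    indexed = [item for stored in placement.values() for item in stored]
    indexed.sort(key=lambda item: item[0])
    return b"".join(block for _, block in indexed)


data = b"abcdefghij"
blocks = split_into_blocks(data, 3)
placement = place_blocks(blocks, ["node1", "node2"])
print(placement["node1"])    # [(0, b'abc'), (2, b'ghi')]
print(read_back(placement))  # b'abcdefghij'
```

The same pattern applies to a distributed database if each block is replaced by a subset of a table's rows.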

( 3 ) Distributed Computing

When a computing task is too large for a single node, the task is split into subtasks, and each subtask runs on a different node.
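The split-then-combine pattern can be imitated on one machine. In this sketch, threads from Python's standard `concurrent.futures` module stand in for nodes, and the task (a sum of squares) and the function names are chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(numbers, parts):
    """Split the input into at most `parts` chunks of roughly equal size."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division, at least 1
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def partial_sum_of_squares(chunk):
    """The subtask that each worker (stand-in for a node) runs on its chunk."""
    return sum(n * n for n in chunk)


def distributed_sum_of_squares(numbers, workers=4):
    """Run the subtasks in parallel, then combine the partial results."""
    chunks = split_task(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


print(distributed_sum_of_squares(list(range(1, 11))))  # 385
```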

3, load balancing

The proportion of data stored on each node should match that node's capacity within the cluster.
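One simple way to look at load balancing is to compute each node's share of the stored data and send new data to the least-loaded node. The node names and amounts below are invented, and this sketch assumes all nodes have equal capacity.

```python
def storage_share(stored):
    """Return each node's fraction of the total data stored in the cluster."""
    total = sum(stored.values())
    if total == 0:
        return {node: 0.0 for node in stored}
    return {node: amount / total for node, amount in stored.items()}


def pick_least_loaded(stored):
    """Choose the node that currently holds the least data."""
    return min(stored, key=stored.get)


stored = {"node1": 50, "node2": 30, "node3": 20}  # e.g. GB per node
shares = storage_share(stored)
print(shares["node1"])            # 0.5
print(pick_least_loaded(stored))  # node3
```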

Sixth, the data processing workflow

1, introduction

Data acquisition ----> data storage ----> data cleaning (ETL) ----> data computation ----> result storage ----> visualization (Web)
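The stages of this workflow can be mimicked end to end in memory. Everything in this sketch is a stand-in: the sample log lines are made up, "storage" is just a Python list, and "visualization" is plain text rather than a Web page.

```python
def collect():
    """Data acquisition: hard-coded sample log lines (made up)."""
    return ["alice /home", "bob /cart", "", "alice /cart", "broken-line"]


def clean(raw_lines):
    """Data cleaning (ETL): keep only lines of the form 'user page'."""
    records = []
    for line in raw_lines:
        parts = line.split()
        if len(parts) == 2:
            records.append({"user": parts[0], "page": parts[1]})
    return records


def compute(records):
    """Data computation: count views per page."""
    counts = {}
    for record in records:
        counts[record["page"]] = counts.get(record["page"], 0) + 1
    return counts


def display(counts):
    """Visualization stand-in: format the result as sorted text lines."""
    return [f"{page}: {views}" for page, views in sorted(counts.items())]


raw = collect()                   # acquisition
stored = list(raw)                # storage (in memory here)
result = compute(clean(stored))   # cleaning, then computation
report = display(result)          # presentation
print(report)  # ['/cart: 2', '/home: 1']
```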

2, data acquisition (data sources)

( 1 ) business data

Your own business databases and the logs generated by your own site.

( 2 ) web crawlers

For example, to analyze the average salary of Internet industry practitioners, crawl data from recruitment websites. Sites deploy anti-crawler measures, and crawlers respond with counter-countermeasures.

( 3 ) purchased data

3, data processing

( 1 ) data with missing fields

1) If the missing field does not affect the final result: delete the record, for example in web browsing log data.

2) For data associated with money: fill in the missing value by calculation, as dictated by the computing needs and precision requirements.

3) For data that must be precise, such as large-scale industrial data and sensor data: fill in the missing value according to empirical values.
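The three ways of handling missing fields can be sketched as three small helpers. The field names (`url`, `price`, `quantity`, `amount`) are hypothetical, and recomputing an amount as price times quantity is only one assumed example of filling in a value by calculation.

```python
def drop_incomplete(records, required):
    """Case 1: delete records that lack any required field (e.g. browsing logs)."""
    return [r for r in records if all(r.get(f) is not None for f in required)]


def fill_amount(order):
    """Case 2: recompute a missing money field from other fields (assumed rule)."""
    if order.get("amount") is None:
        order = {**order, "amount": order["price"] * order["quantity"]}
    return order


def fill_with_empirical(readings, empirical_value):
    """Case 3: replace missing sensor readings with an empirical value."""
    return [empirical_value if r is None else r for r in readings]


logs = [{"user": "alice", "url": "/home"}, {"user": "bob", "url": None}]
print(drop_incomplete(logs, ["user", "url"]))
# [{'user': 'alice', 'url': '/home'}]
print(fill_amount({"price": 5, "quantity": 3, "amount": None})["amount"])  # 15
print(fill_with_empirical([20.5, None, 21.0], 20.8))  # [20.5, 20.8, 21.0]
```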

( 2 ) sensitive data

Sensitive fields such as phone numbers and ID numbers must be desensitized, for example by hashing them (MD5) or by replacing them with a generated identifier (UUID).
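A minimal sketch of desensitization with Python's standard `hashlib` and `uuid` modules is shown below. The masking rule (keep the first three and last four digits) and the sample phone number are assumptions for illustration.

```python
import hashlib
import uuid


def mask_phone(phone):
    """Hide the middle digits, keeping the first 3 and last 4 visible."""
    if len(phone) <= 7:
        return "*" * len(phone)  # too short to keep any digits safely
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def md5_digest(value):
    """Replace a value with its MD5 hex digest (32 hexadecimal characters)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def new_token():
    """Generate a random UUID to stand in for a sensitive value."""
    return str(uuid.uuid4())


print(mask_phone("13812345678"))  # 138****5678
print(md5_digest("abc"))          # 900150983cd24fb0d6963f7d28e17f72
print(len(new_token()))           # 36
```

Note that MD5 is a hash rather than encryption, and an unsalted hash of a short value such as a phone number can be recovered by trying every possible input, so stronger measures may be needed in practice.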

Origin www.cnblogs.com/zhangxiaofan/p/11110943.html