Section I: Data
First, the concept
Data is the result of observation, experiment, or calculation. It can take the form of numbers, pictures, video, and so on.
Second, the classification
1, by structure
Structured data: data in MySQL tables, Excel tables, and other strictly two-dimensional tables. Every row has the same columns, and a given column has the same type in every row.
Unstructured data: data with no structure, such as video, images, and audio. Stored in binary form.
Semi-structured data: data that has structure but not a strict two-dimensional table structure, such as html, css, xml, and json, where tags or keys mark out the structure.
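As a rough illustration of the difference, the sketch below reads records from a structured source (CSV rows with a fixed set of columns) and from a semi-structured one (JSON, where fields can be nested or absent). The sample records and field names are made up for the example.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical sample data).
csv_text = "id,name,age\n1,Alice,30\n2,Bob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records are labelled, but fields can be nested or missing.
json_text = '[{"id": 1, "name": "Alice", "tags": ["vip"]}, {"id": 2, "name": "Bob"}]'
records = json.loads(json_text)

# Every CSV row exposes exactly the same keys.
same_columns = all(set(row) == {"id", "name", "age"} for row in rows)

# JSON records need a default for fields that some records lack.
tags = [record.get("tags", []) for record in records]
```

The CSV reader can rely on a fixed schema, while the JSON code has to supply a default for the missing "tags" field.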
2, by time of generation
Offline data: data that already exists, static.
Real-time data: data generated in real time, dynamic.
Near real-time data
Section II: Big Data
First, the concept
Big data refers to data sets so large and complex that traditional tools cannot handle them (storage | computing).
Second, the data unit
1 Byte = 8 bit
1 KB = 1,024 Bytes = 8192 bit
1 MB = 1,024 KB = 1,048,576 Bytes (typical level of ordinary user data, e.g. text)
1 GB = 1,024 MB = 1,048,576 KB
1 TB = 1,024 GB = 1,048,576 MB
1 PB = 1,024 TB = 1,048,576 GB (Enterprise-level data)
1 EB = 1,024 PB = 1,048,576 TB
1 ZB = 1,024 EB = 1,048,576 PB (total global data level)
1 YB = 1,024 ZB = 1,048,576 EB
1 BB = 1,024 YB = 1,048,576 ZB
1 NB = 1,024 BB = 1,048,576 YB
1 DB = 1,024 NB = 1,048,576 BB
Note: The data volume of a business is generally at the TB or PB level.
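The unit table can be checked with a few lines of code. The sketch below is only an illustration: the function names are made up for this example, and it covers the units up to YB, with each step being a factor of 1,024.

```python
UNITS = ["Byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (each step is 1,024)."""
    return value * 1024 ** UNITS.index(unit)

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1,024,
        # or at the last unit in the table.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

one_pb_in_gb = to_bytes(1, "PB") // to_bytes(1, "GB")
```

For example, one_pb_in_gb evaluates to 1,048,576, matching the PB line of the table.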
Third, the characteristics of Big Data
Volume: the amount of data is large; data is now generated quickly and data types are complex (e.g. video)
Variety: structured, semi-structured, and unstructured data
Velocity: data is being generated everywhere, and nothing on the Internet stays secret
Value: high overall value, but each single record has low value, so value density is low
Veracity: the truthfulness of the data
Fourth, the value of Big Data
User profiling: tagging each user according to all of that user's commercial behavior.
Businesses use a user's network activity, consumption habits, and search interests to build a tagged profile and relationship graph of that person. This supports precision marketing, where each of a thousand users sees a different face of the product (personalization).
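A very small sketch of tagging is shown below. The event format, the category names, and the rule "tag a user with a category after at least two interactions" are all assumptions made for illustration, not a description of any real profiling system.

```python
from collections import Counter

def build_tags(events, min_count=2):
    """Tag a user with every category seen at least min_count times."""
    counts = Counter(event["category"] for event in events)
    return sorted(cat for cat, n in counts.items() if n >= min_count)

# Hypothetical behavior log for one user.
events = [
    {"user": "u1", "action": "search", "category": "phones"},
    {"user": "u1", "action": "buy", "category": "phones"},
    {"user": "u1", "action": "view", "category": "books"},
]
tags = build_tags(events)
```

Real profiles combine many more signals, but the idea is the same: raw behavior is reduced to a short list of labels that marketing can target.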
Fifth, the core concept of Big Data
1, cluster
When a task (storage | computing) needs multiple servers working together to complete it, that group of servers is called a cluster. Each server in the cluster is called a node (nodes are on the same LAN and have different IP addresses).
2, distributed
( 1 ) concept
When a task needs more than one node to complete it, the task is said to be implemented in a distributed way.
( 2 ) distributed storage
Distributed file system: a large file that does not fit on one node is cut into many small files, and these small files are stored across a plurality of nodes.
Distributed database: a large table is cut into a plurality of small tables, which are stored across a plurality of nodes.
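The cutting and placing described above can be sketched as follows. This is a simplified illustration, not how any particular file system works: the block size, the node names, and the round-robin placement are assumptions, and real distributed file systems also keep replicas and metadata, which the sketch omits.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin: {node: [block_index, ...]}."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement

data = b"x" * 1000                      # stand-in for a large file
blocks = split_into_blocks(data, 256)   # 4 blocks: 256 + 256 + 256 + 232 bytes
placement = place_blocks(blocks, ["node1", "node2", "node3"])
```

Joining the blocks in index order gives back the original file, which is why the file system must remember which node holds which block.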
( 3 ) Distributed Computing
When a computing task is too large for one node, the task is split into smaller tasks, and each of these runs on a different node.
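The split-then-combine idea can be sketched on a single machine. In the example below, worker threads stand in for cluster nodes (an assumption made so that the code runs anywhere), and the task is a simple sum; the function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_task(numbers, parts):
    """Split a list into at most `parts` chunks of nearly equal size."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division, at least 1
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]

def run_distributed_sum(numbers, node_count):
    """Sum each chunk on its own worker, then combine the partial results."""
    chunks = split_task(numbers, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

total = run_distributed_sum(list(range(1, 101)), 4)
```

Each worker only sees its own chunk, and the final answer is assembled from the partial results, which is the same pattern a cluster uses at a much larger scale.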
3, load balancing
The amount of data stored on each node is in proportion to that node's share of the cluster's capacity.
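One possible reading of load balancing is that each node receives data in proportion to its capacity. The sketch below follows that reading; the node names and capacity weights are hypothetical, and real clusters use more elaborate policies.

```python
def allocate(total_blocks, capacities):
    """Split total_blocks across nodes in proportion to each node's capacity."""
    total_capacity = sum(capacities.values())
    shares = {
        node: total_blocks * cap // total_capacity
        for node, cap in capacities.items()
    }
    # Integer division can leave a few blocks unassigned; hand them out
    # one each, starting with the highest-capacity nodes.
    leftover = total_blocks - sum(shares.values())
    for node in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
        shares[node] += 1
    return shares

shares = allocate(100, {"node1": 4, "node2": 2, "node3": 2})
```

Here node1 has twice the capacity of the other nodes, so it receives twice as many blocks.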
Sixth, the data processing workflow
1, overview
Data acquisition ---- data storage ---- data cleaning (ETL) ---- data calculation ---- result data storage ---- web visual display
2, data acquisition (data sources)
( 1 ) business data
The company's own business databases and the logs generated by its own sites
( 2 ) web crawlers
For example: to analyze the average salary of Internet industry practitioners, crawl data from recruitment websites. Sites use anti-crawler techniques, and crawlers use countermeasures against them.
( 3 ) purchased data
3, data processing
( 1 ) data with missing fields
1) The missing data does not affect the final result: delete it, for example web browsing log data.
2) The data is related to money: fill in the missing values by calculation, because precision is required.
3) The data must be precise: for example industrial big data and sensor data, fill in the missing values according to empirical values.
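The three cases above can be sketched as three small functions. The record layouts, the field names, and the rule that an order amount equals price times quantity are assumptions for illustration only.

```python
from decimal import Decimal

def drop_incomplete(logs, field):
    """Case 1: discard records whose field is missing (e.g. browsing logs)."""
    return [record for record in logs if record.get(field) is not None]

def recompute_amount(order):
    """Case 2: rebuild a missing money field from other fields.

    Decimal avoids the rounding errors of binary floats.
    """
    if order.get("amount") is None:
        return {**order, "amount": order["price"] * order["quantity"]}
    return order

def fill_with_empirical(readings, empirical_value):
    """Case 3: replace missing sensor readings with an empirical value."""
    return [empirical_value if value is None else value for value in readings]

logs = drop_incomplete([{"url": "/a"}, {"url": None}], "url")
order = recompute_amount({"price": Decimal("19.99"), "quantity": 3, "amount": None})
readings = fill_with_empirical([20.5, None, 21.0], 20.0)
```

Which strategy applies depends on how much the missing value matters to the final result, as the list above describes.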
( 2 ) sensitive data
Fields such as phone numbers and ID numbers must be desensitized: sensitive fields are masked, hashed, or replaced (MD5, uuid).
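A minimal sketch of the three approaches, using only the Python standard library, is shown below. The masking rule (keep the first 3 and last 4 characters) and the sample phone number are assumptions for illustration. Note that MD5 is a one-way hash rather than encryption, and hashing short values such as phone numbers without a secret salt can be reversed by trying every possible input.

```python
import hashlib
import uuid

def mask_phone(phone):
    """Keep the first 3 and last 4 characters and mask the rest.

    Assumes the input has at least 7 characters.
    """
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]

def md5_digest(value):
    """Replace a value with its MD5 hex digest (one-way, 32 hex characters)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def pseudonym():
    """Generate a random UUID to stand in for a real identifier."""
    return str(uuid.uuid4())

masked = mask_phone("13812345678")   # made-up sample number
digest = md5_digest("13812345678")
alias = pseudonym()
```

Masking keeps part of the value readable, a digest gives the same output for the same input so records can still be joined, and a random UUID breaks the link to the original value unless a mapping table is kept.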