Big Data / Data Center: Data Aggregation and Interconnection

Table of contents

1. Methods and tools for data collection and aggregation

1. Online behavior collection

2. Offline behavior collection

3. Internet data collection

4. Internal data aggregation

2. Data exchange products

1. Data source management

2. Offline data exchange

3. Real-time data exchange

3. Selection of data storage

1. Online and offline

2. OLTP and OLAP

3. Storage technology


        The first step in building an enterprise-level data center is to interconnect the data of the various business systems and physically break down the data silos. This is achieved mainly through data aggregation and exchange capabilities; for different scenarios, different solutions are chosen according to the data types and the data storage requirements.

1. Methods and tools for data collection and aggregation

1. Online behavior collection

① Client-side event tracking (buried points)

Full tracking: records all user actions on the terminal device. Usually, collecting everything only requires some initial configuration of the embedded SDK. Also called traceless tracking or codeless tracking. Advantages: full data can be obtained without frequent app upgrades. Disadvantages: high storage and transmission costs.

Visual tracking: records a subset of user actions on the terminal device, typically selected and saved via server-side configuration. Advantages: no frequent releases, lower cost than full tracking, more flexible. Disadvantages: the desired data may not have been collected and requires reconfiguration.

Code-level tracking: each collection point is custom-built in code according to requirements, and the corresponding client module must be upgraded. Advantages: very flexible, independently designed, with more room to optimize storage and bandwidth. Disadvantages: high cost, hard to maintain, long upgrade cycle.

② Server-side event tracking

The most common form of server-side tracking is the HTTP server's access_log, i.e., the log data of all web services. Advantages: reduces client complexity and improves information security. Disadvantages: cannot collect client-side behavior that never interacts with the server.
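As an illustration, here is a minimal Python sketch that parses one line of an access_log into a behavior event. It assumes the common "combined" log format used by Nginx/Apache; adjust the pattern to your server's actual log_format.

```python
import re

# Common "combined" access_log format (an assumption; adapt to your server).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_log_line(line):
    """Turn one access_log line into an event dict, or None if unparsable."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = '10.0.0.1 - - [01/Jan/2023:12:00:00 +0800] "GET /item/42 HTTP/1.1" 200 512'
print(parse_access_log_line(line))
```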

2. Offline behavior collection

Offline behavior data is generally collected through hardware, such as Wi-Fi probes, cameras, and sensors.

3. Internet data collection

This collection method generally uses a web crawler: a program or script that automatically crawls Internet information according to predefined rules, also often used for automated website testing and behavior simulation. Common web crawler frameworks include Apache Nutch 2, WebMagic, Scrapy, and PhpCrawl. Internet data collection must comply with the relevant security specifications and protocols (e.g., robots.txt).
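For illustration only, a minimal crawler sketch using just the Python standard library (real projects would typically use a framework such as Scrapy). The URL is a placeholder, and any real crawl should respect robots.txt and the site's terms of service.

```python
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_links(url):
    # Fetch one page and return the outgoing links found on it.
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

print(fetch_links("https://example.com"))  # placeholder URL
```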

4. Internal data aggregation

① Classification by data organization form

Structured data: regular, complete data that can be represented in a two-dimensional table, such as data in common relational databases and Excel.

Semi-structured data: data that is regular and complete but cannot be represented in a two-dimensional table, such as complex nested structures like JSON and XML (see the flattening sketch after this list).

Unstructured data: irregular, incomplete data that cannot be represented in a two-dimensional table and requires complex logic to extract, such as pictures, video, and audio.
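To make the distinction concrete, here is a small Python sketch that flattens a semi-structured JSON document (a hypothetical order event; all field names are illustrative) into structured two-dimensional rows.

```python
import json

# A hypothetical order event: regular but nested, so it cannot be stored
# directly as a single two-dimensional table row.
event = json.loads('{"order_id": 1, "user": {"id": 7, "city": "Hangzhou"},'
                   ' "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}')

# Flattening produces structured data: one row per order item.
rows = [
    {"order_id": event["order_id"],
     "user_id": event["user"]["id"],
     "city": event["user"]["city"],
     "sku": item["sku"],
     "qty": item["qty"]}
    for item in event["items"]
]
for row in rows:
    print(row)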

② Data timeliness and application scenarios

Offline: mainly used for periodic migration of large batches of data, with low timeliness requirements. It is generally implemented as distributed batch data synchronization: data is read through connections, either in full or incrementally, and written to the target storage after unified processing.

Real-time: mainly for low-latency data application scenarios, generally implemented through incremental logs or notification messages; in industry this is achieved with tools such as Canal and Flink.

③ ETL and ELT

ETL (Extract-Transform-Load): data is transformed during extraction, before loading. Advantages: saves storage and simplifies subsequent processing. Disadvantages: data may be incomplete or lost, and processing efficiency is low.

ELT (Extract-Load-Transform): data is transformed after extraction and loading are complete. Advantages: the data is complete, and distributed post-processing (e.g., on big data platforms) is more efficient. Disadvantages: large storage footprint, and too much useless data can cause inefficiency.
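A toy Python contrast of the two pipelines over the same data; extract, transform, and load are stand-in functions for illustration, not a real framework's API.

```python
def extract():
    return [{"price": "10"}, {"price": "bad"}, {"price": "30"}]

def transform(records):
    # Keep only rows whose price parses as a number.
    return [{"price": float(r["price"])} for r in records
            if r["price"].replace(".", "", 1).isdigit()]

def load(records, target):
    target.extend(records)

# ETL: transform before loading -- the "bad" row never reaches storage.
etl_store = []
load(transform(extract()), etl_store)

# ELT: load raw data first, transform later -- storage keeps everything,
# including rows that may turn out to be useless.
elt_store = []
load(extract(), elt_store)
transformed_view = transform(elt_store)

print(etl_store, elt_store, transformed_view)
```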

④ Common data aggregation tools

Canal: a data push tool that captures log changes by disguising itself as a slave of a database such as MySQL. It is often used to collect MySQL data changes, but it is not well suited to multi-consumer or data distribution scenarios.

Sqoop: a general-purpose big data tool for migrating data between structured data stores and HDFS, implemented on Hadoop MapReduce. Advantages: high data exchange efficiency in its target scenarios. Disadvantages: highly specialized and not easy to operate; because it relies on MapReduce, its functional extensibility is limited.

DataX: Alibaba's plug-in based offline data exchange tool, built around direct in-process reading and writing.
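A rough Python sketch of the plug-in, in-process reader/writer idea behind tools like DataX: a reader and a writer run in the same process and exchange records through a channel. The classes here are illustrative, not DataX's actual API.

```python
import queue

class CsvReader:
    """Reader plugin: parses source lines and pushes records to the channel."""
    def __init__(self, lines):
        self.lines = lines

    def read_into(self, channel):
        for line in self.lines:
            channel.put(line.split(","))
        channel.put(None)  # end-of-stream marker

class StdoutWriter:
    """Writer plugin: drains the channel and writes to the target (stdout)."""
    def write_from(self, channel):
        while (record := channel.get()) is not None:
            print("wrote:", record)

channel = queue.Queue(maxsize=1024)  # in-process buffer between plugins
reader = CsvReader(["1,alice", "2,bob"])
writer = StdoutWriter()
reader.read_into(channel)   # in DataX these run concurrently; serial here
writer.write_from(channel)
```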

2. Data exchange products

The tools introduced above generally satisfy only a single scenario or process. To meet complex enterprise data exchange scenarios, a complete data exchange product is needed, covering data source management, offline data exchange, real-time data exchange, and so on.

1. Data source management

Data source management mainly manages the storage systems used by the data, so that external storage can be managed conveniently when the platform performs data exchange (a minimal registry sketch follows the classification below).

Classification of data sources:

Relational databases: Oracle, MySQL, SQL Server, Greenplum, etc.

NoSQL storage: HBase, Redis, Elasticsearch, Cassandra, MongoDB, Neo4j, etc.

Network and MQ: Kafka, HTTP, etc.

File systems: HDFS, FTP, OSS, CSV, TXT, Excel, etc.

Big data related: Hive, Impala, Kudu, MaxCompute, etc.
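As referenced above, a minimal sketch of what data source management might look like: the platform keeps connection metadata per source so exchange jobs can look storage up by name. The fields and connection values are purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str
    category: str          # e.g. "relational", "nosql", "mq", "file", "bigdata"
    conn: dict = field(default_factory=dict)

registry = {}

def register(ds):
    registry[ds.name] = ds

register(DataSource("orders_db", "relational",
                    {"url": "jdbc:mysql://db.example.com:3306/orders"}))
register(DataSource("event_bus", "mq",
                    {"bootstrap": "kafka.example.com:9092"}))

# An exchange job looks up the storage it needs by name.
print(registry["orders_db"].conn["url"])
```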

2. Offline data exchange

Offline data exchange addresses the batch migration of large-scale data in scenarios with low timeliness requirements but high throughput.

Key technical points of offline data synchronization:

① Pre-run data audit

② Data conversion

③ Cross-cluster data synchronization

④ Full synchronization

⑤ Incremental synchronization (a sketch of ④ and ⑤ follows the list)
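A toy Python sketch of full (④) versus incremental (⑤) synchronization; the watermark column updated_at and the sample rows are assumptions for illustration.

```python
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 200},
    {"id": 3, "updated_at": 300},
]

def full_sync(table):
    # Re-reads everything; simple but expensive for large tables.
    return list(table)

def incremental_sync(table, watermark):
    # Reads only rows changed since the last run, then advances the watermark.
    changed = [r for r in table if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

print(full_sync(source))
rows, wm = incremental_sync(source, watermark=150)
print(rows, wm)  # only ids 2 and 3; watermark advances to 300
```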

3. Real-time data exchange

Real-time data exchange is mainly responsible for delivering data from databases, log collectors, and similar sources into storage such as Kafka, Hive, and Oracle in real time. Its two core services are the data subscription service (Client Server) and the data consumption service (Consumer Server).
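A minimal Python sketch of the subscription/consumption split: one service captures change events (here, a fake binlog entry) and publishes them; another consumes and writes them to the target store. The class names mirror the two roles above but are illustrative, not a real product's API.

```python
from queue import Queue

class SubscriptionService:
    """Plays the 'Client Server' role: captures changes and publishes them."""
    def __init__(self, topic):
        self.topic = topic

    def on_change(self, event):
        self.topic.put(event)

class ConsumerService:
    """Plays the 'Consumer Server' role: delivers events to a target store."""
    def __init__(self, topic, target):
        self.topic, self.target = topic, target

    def poll_once(self):
        while not self.topic.empty():
            self.target.append(self.topic.get())

topic, hive_table = Queue(), []
sub = SubscriptionService(topic)
consumer = ConsumerService(topic, hive_table)
sub.on_change({"op": "INSERT", "table": "orders", "id": 42})  # fake binlog event
consumer.poll_once()
print(hive_table)
```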

[Figure: example real-time data exchange architecture]

3. Selection of data storage

For data storage, the scale of the data, how the data is produced, and how the data will be used all need to be weighed together.

1. Online and offline

Online storage means the storage device and the stored data remain "online" at all times and can be read at any moment, meeting the computing platform's speed requirements for data access. Online storage generally uses disks, disk arrays, cloud storage, and the like.

Offline storage backs up online data to guard against possible data disasters. Data stored offline is not accessed frequently. Typical media are hard drives, magnetic tape, and optical discs.

2. OLTP and OLAP

OLTP and OLAP are not competing or mutually exclusive; they complement each other.

|              | OLTP                                                    | OLAP                                                          |
| ------------ | ------------------------------------------------------- | ------------------------------------------------------------- |
| User         | Operators; supports daily operations                     | Decision makers; supports management needs                     |
| Function     | Daily transaction processing                             | Analysis oriented                                              |
| DB design    | Application-oriented, transaction-driven                 | Subject-oriented, analysis-driven                              |
| Data         | Current, up-to-date, detailed, two-dimensional, discrete | Historical, aggregated, multidimensional, integrated, unified  |
| Access       | Updatable; reads/writes dozens of records                | Not updatable, but refreshed periodically; reads millions of records |
| Unit of work | Simple transactions                                      | Complex queries                                                |
| DB size      | 100 MB to GB level                                       | 100 GB to TB level                                             |

3. Storage technology

1. Distributed system

Common distributed systems include distributed file systems (a complete storage system requires multiple technologies working together, among which the file system provides the lowest-level storage capability) and distributed key-value systems (used to store semi-structured data with simple relationships).

2. NoSQL database

The advantage of NoSQL is its support for ultra-large-scale data storage; its flexible data model suits Web 2.0 applications well, and it has strong horizontal scalability. Typical categories include key-value databases, column-family databases, document databases, and graph databases, e.g., HBase and MongoDB.

3. Cloud database

A cloud database is a database deployed and virtualized in a cloud computing environment, following the shared-infrastructure model of cloud computing.

 
