How ETL Technology Lands in Practice

ETL Overview

ETL (Extraction-Transformation-Loading) is the process of extracting data from source systems, cleaning and transforming it, and loading it into a data warehouse. Its purpose is to bring together data that is scattered across the enterprise, messy, and inconsistently standardized, so that it can provide an analytical basis for corporate decision-making. ETL is an important part of any BI (Business Intelligence) project.


Data governance processes

 

Data mining generally refers to the process of searching, by algorithm, for information hidden in large amounts of data. It is usually associated with computer science, and it achieves its goals through many methods drawn from statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on accumulated rules of thumb), and pattern recognition. Its analysis methods include classification, estimation, prediction, association rules (affinity grouping), clustering, and the mining of complex data types.

 

 

1) Data Acquisition

Everything starts with data. There are two ways to collect it. The first, known professionally as crawling (or scraping), is what search engines do: they download all the information they can reach on the Web into their data centers, and that is what makes searching possible.
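A toy version of one step of the crawling just described: after a page has been downloaded, extract the links it contains so the crawler knows where to go next. This is only a sketch using the standard library; the sample HTML is made up.

```python
# Extract the href targets from anchor tags in a downloaded page.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect every <a href="..."> target we encounter.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # → ['/a.html', '/b.html']
```

A real crawler would feed each discovered link back into a download queue and repeat.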

 

2) Data Transmission

Transmission is usually done through a queue, because the volume of data is too large to be processed all at once: the data that must be turned into something useful has to wait in line and be handled by the downstream systems bit by bit.
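The queueing idea above can be sketched in a few lines, using an in-process queue.Queue as a stand-in for a real message broker such as Kafka or RabbitMQ:

```python
# Producer/consumer over a bounded queue: the producer pushes events,
# the consumer drains them at its own pace.
import queue
import threading

q = queue.Queue(maxsize=100)   # bounded: producers block when consumers lag
processed = []

def producer():
    for i in range(5):
        q.put({"event_id": i})  # upstream pushes raw events
    q.put(None)                 # sentinel: no more data

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        processed.append(item["event_id"])  # downstream handles gradually

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(processed)  # → [0, 1, 2, 3, 4]
```

The bounded queue is the key design point: it applies back-pressure instead of letting a fast producer overwhelm a slow consumer.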

 

3) Data Storage

Data is now money: whoever holds the data holds the money. How else would a website know what you want to buy? Because it has your transaction history, information that others may not have, it is very valuable, and therefore it must be stored.

 

4) Data Cleaning and Analysis

What is stored above is raw data, and raw data is mostly disorganized, with a lot of garbage mixed in; it therefore needs to be cleaned and filtered to obtain high-quality data. Only high-quality data can be analyzed, in order to classify it or discover relationships between records and thereby obtain knowledge.
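A minimal sketch of the cleaning step just described: drop records with missing fields, normalize casing, and de-duplicate. The field names and sample records are made up for illustration.

```python
# Clean a batch of raw records: filter garbage rows, normalize, de-duplicate.
raw_records = [
    {"user": "Alice", "city": "beijing"},
    {"user": "alice", "city": "Beijing"},   # duplicate after normalization
    {"user": "", "city": "Shanghai"},       # garbage: missing user
    {"user": "Bob", "city": "shanghai"},
]

def clean(records):
    seen = set()
    out = []
    for r in records:
        if not r["user"] or not r["city"]:       # filter garbage rows
            continue
        key = (r["user"].lower(), r["city"].lower())
        if key in seen:                          # drop duplicates
            continue
        seen.add(key)
        out.append({"user": r["user"].title(), "city": r["city"].title()})
    return out

print(clean(raw_records))
```

Real pipelines do the same thing at scale, but the shape (filter, normalize, de-duplicate) is unchanged.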

 

Note: in real business scenarios, the third and fourth steps can be swapped as appropriate, i.e. the data can be cleaned first and the cleaned result stored.

 

5) Data Retrieval and Mining

Retrieval means search: as the saying goes, for foreign matters ask Google, for domestic matters ask Baidu. Mining is needed because search results alone cannot satisfy people's requirements; the relationships hidden between pieces of information must also be dug out.
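The retrieval half of this step can be illustrated with a toy inverted index, the core data structure behind search engines: map each word to the set of documents that contain it, then answer queries by lookup. The documents here are invented examples.

```python
# Build a toy inverted index and answer single-word queries against it.
from collections import defaultdict

docs = {
    1: "etl extracts and loads data",
    2: "data mining finds hidden relationships",
    3: "search engines retrieve documents",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)     # word → documents containing it

def search(word):
    return sorted(index.get(word, set()))

print(search("data"))  # → [1, 2]
```

Mining would go one step further, e.g. by looking at which words co-occur across documents rather than just where each word appears.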

 

6) Data Loading and Application

Presenting the results in a friendly way and delivering them to users is what brings the data mining work to a good close.

 

Data management tools

1) Data collection tools

1. Log collection tools

Logstash

Logstash is an open-source data collection engine with real-time pipelining. It can dynamically unify data from different data sources and normalize the data to destinations of your choice.

Filebeat

Filebeat is a lightweight log shipper that can push logs to a central Logstash.

Fluentd

Fluentd was originally designed to use JSON as its log output, so that neither the transport mechanism nor the downstream consumer has to guess which substring is which field. It provides libraries for almost every language, meaning it can be plugged into your own programs.

Logagent

Logagent is the shipper provided by Sematext, used to send logs to Logsene (a SaaS platform with an Elasticsearch API).

Rsyslog

Rsyslog is the default syslog daemon in the vast majority of Linux distributions; it reads and writes /var/log/messages. It can tail files, parse them, buffer them (on disk and in memory), and ship them to multiple destinations, including Elasticsearch, and it can handle both Apache and system logs.

Logtail

Logtail is the collector of Alibaba Cloud's Log Service. It has run on machines inside Alibaba Group for more than three years, and after that test of time it now provides log collection for Alibaba Cloud's public users.

 

Each of these log collection tools (Logstash, Filebeat, Fluentd, Logagent, rsyslog, Logtail) has its own strengths and weaknesses.

 

2. Crawler tools

 

The crawling pipeline: page download -> page parsing -> data storage

(1) Page downloader

For downloading, Python's requests library meets most testing and crawling needs; for larger projects use scrapy. For dynamic pages, look for an API interface first; if there is only simple encryption, crack it; if rendering is genuinely required, use splash.

(2) Page parsers

①BeautifulSoup (entry level): the classic beginner's parsing module

②pyquery (similar to jQuery): parse pages with jQuery-style syntax

③lxml: parse web content with lxml

④parsel: extract text using CSS or XPath selectors

⑤scrapy's Selector (highly recommended; a more advanced wrapper built on parsel)

⑥Selectors: the XPath and CSS selector syntax of the scrapy framework

--------------------- 

To sum up:

For parsing, just use scrapy's Selector directly; it is simple, direct, and efficient.
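Scrapy's Selector (built on parsel) supports both CSS and XPath. As a dependency-free illustration of the same selector idea, the standard library's xml.etree.ElementTree supports a subset of XPath on well-formed markup; the sample page below is made up.

```python
# Selector-style extraction using only the standard library.
# parsel/scrapy offer the same idea with full CSS/XPath support;
# ElementTree handles an XPath subset and requires well-formed XML/XHTML.
import xml.etree.ElementTree as ET

page = """
<html>
  <body>
    <div class="title">Sample item</div>
    <div class="price">42.0</div>
  </body>
</html>
"""

root = ET.fromstring(page)
title = root.find(".//div[@class='title']").text
price = float(root.find(".//div[@class='price']").text)
print(title, price)  # → Sample item 42.0
```

With parsel the equivalent would be a one-liner per field, and it also tolerates the broken HTML found in the wild.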

(3) Data storage

①txt text: plain-text files via Python's built-in file operations

②csv files: read and write with Python's csv module

③sqlite3 (ships with Python): the built-in sqlite3 database module

④MySQL: write data to MySQL with the pymysql module

⑤MongoDB: basic CRUD operations against MongoDB from Python

--------------------- 

To sum up:

There is nothing deep about data storage; just follow the needs of the business. MongoDB is handy for quick tests, MySQL for production business use.
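A minimal sketch of the "sqlite3 (ships with Python)" option above: store crawled records in an embedded database with no server required. The table and sample rows are invented.

```python
# Persist scraped items into SQLite and query them back, ordered by price.
import sqlite3

conn = sqlite3.connect(":memory:")      # use a file path for persistence
conn.execute("CREATE TABLE items (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("widget", 9.5), ("gadget", 19.0)],
)
conn.commit()

rows = conn.execute("SELECT title, price FROM items ORDER BY price").fetchall()
print(rows)  # → [('widget', 9.5), ('gadget', 19.0)]
```

Swapping in pymysql for production MySQL changes the connection line but very little of the surrounding code, thanks to the DB-API convention.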

(4) Other tools

①execjs: run JavaScript code from Python

②pyv8: execute JavaScript with the V8 engine from Python

③html5lib: parse non-standard HTML that stricter parsers reject (usable from scrapy)

 

2 ) Data cleaning tools

1、DataWrangler

DataWrangler is a web-based visual data cleaning and rearranging service designed by a group at Stanford University, with a very simple text-editor-like interface. For example, when I selected "Alabama" in the sample-data row "Reported crime in Alabama" and then selected "Alaska" in another row, it suggested extracting the name of every state. Hovering over a suggestion shows the affected rows highlighted in red.

2、Google Refine

Google Refine can import and export many data formats, such as tab- or comma-delimited text files, Excel, XML, and JSON. It features built-in clustering algorithms that find text entries which are spelled differently but should in fact be grouped together: after importing your data, choose Edit cells -> Cluster and edit, then pick an algorithm. Its data-profiling options provide a quick and easy distribution overview, which can reveal anomalies caused by input errors, for example a payroll record of $800,000 that should have been $80,000, or inconsistencies, for example salary records where some are hourly wages, some weekly, and some annual. Beyond data stewardship, Google Refine also offers some useful analysis tools, such as sorting and filtering.

3、Logstash

Logstash is a powerful data processing tool that can transfer data, transform its format, and output it in the form you need; with its rich plug-in ecosystem it is commonly used for log processing.
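The transfer/transform/output split just described maps directly onto Logstash's input/filter/output pipeline sections. A minimal sketch might look like the fragment below; the file path, grok pattern, and index name are hypothetical and would need to match your own logs.

```conf
# Hypothetical pipeline: tail a log file, parse each line, ship to Elasticsearch.
input {
  file {
    path => "/var/log/app/app.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs"
  }
}
```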

3) Data storage tools

Stored data divides into structured data and unstructured data.

1. Structured data

(1) Definition

Structured data generally refers to data stored in a database, with a definite logical and physical structure; the most common case is data stored in a relational database. Unstructured data generally refers to everything else: data not stored in a database but kept as text of various kinds. Some of it, such as data on the Web embedded in HTML or XML tags, also has a logical and physical structure, and is referred to as semi-structured data.

(2) Storage systems

The relatively mature structured storage systems are Oracle, MySQL, Hadoop, and so on.

2. Unstructured data

(1) Definition

Unstructured data is data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to represent with the two-dimensional logical tables of a database. It includes office documents in all formats, plain text, images, XML, HTML, reports of every kind, and audio/video information.

(2) Storage

1) Store the files in a file system and keep the access paths in a database. The advantage of this approach is simplicity: it does not require advanced DBMS features. The drawbacks are that transactional access to the files cannot be achieved, data backup and recovery are not easy, and neither is data migration.

2) Use the file storage capabilities of Alibaba Cloud OSS.
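Option 1 above can be sketched in a few lines: write the binary file to disk, and record only its path (plus some metadata) in the database. The table layout and file name are invented; a temporary directory stands in for the real file store.

```python
# Files live on the file system; the database only stores where they are.
import os
import sqlite3
import tempfile

store_dir = tempfile.mkdtemp()          # stand-in for a real file store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blobs (name TEXT, path TEXT, size INTEGER)")

def save_blob(name, data):
    path = os.path.join(store_dir, name)
    with open(path, "wb") as f:         # the file goes to the file system...
        f.write(data)
    conn.execute("INSERT INTO blobs VALUES (?, ?, ?)", (name, path, len(data)))
    conn.commit()
    return path

def load_blob(name):
    path, = conn.execute("SELECT path FROM blobs WHERE name = ?", (name,)).fetchone()
    with open(path, "rb") as f:         # ...the database only knows the path
        return f.read()

save_blob("report.bin", b"\x00\x01\x02")
print(load_blob("report.bin"))
```

The drawback noted above is visible here: deleting the row and deleting the file are two separate operations, so no single transaction covers both.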

4) Data computation tools

Data computation divides into real-time computation, online computation, and offline computation.

1. Real-time computation

Apache Storm

2. Online computation

Elasticsearch

MySQL

3. Offline computation

Hadoop Hive

 

5) Data analysis tools

1. Scientific computing on data matrices: Python's numpy library

2. Conventional slicing and processing of data: the powerful pandas library

3. Data modeling: the sklearn library
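A minimal sketch of the numpy + pandas workflow listed above: a matrix operation, then a slice-and-aggregate step. The sample data is made up.

```python
# numpy for matrix-style computation, pandas for slicing and aggregation.
import numpy as np
import pandas as pd

# numpy: column means of a small matrix
m = np.array([[1.0, 2.0], [3.0, 4.0]])
col_means = m.mean(axis=0)              # → array([2., 3.])

# pandas: group sales records by city and total them
df = pd.DataFrame({
    "city": ["Beijing", "Beijing", "Shanghai"],
    "sales": [10, 20, 30],
})
by_city = df.groupby("city")["sales"].sum()
print(list(col_means), dict(by_city))
```

From here, the aggregated frame would typically feed an sklearn estimator for the modeling step.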

6) Data loading tools

1、Data visualization: Python's matplotlib and seaborn libraries

2、Common BI visualization tools: Tableau and FanRuan

3、ECharts


Origin www.cnblogs.com/Javame/p/11599546.html