ETL Overview
ETL (Extraction-Transformation-Loading) is the process of extracting data from source systems, cleaning and transforming it, and loading it into a data warehouse. Its purpose is to bring together an enterprise's scattered, messy data with inconsistent standards, so that it can serve as an analytical basis for corporate decision-making. ETL is an important part of any BI (Business Intelligence) project.
Data governance processes
Data mining generally refers to the process of searching, by means of algorithms, for information hidden in large amounts of data. It is usually associated with computer science and achieves its goals through many methods, including statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition. Its analysis methods include classification, estimation, prediction, association rules (affinity grouping), clustering, and mining of complex data types.
1) Data Acquisition
Everything starts with data. There are two ways to collect it. The first, known professionally as crawling (or spidering), is what search engines do: they download all the information they can reach on the Web into their data centers, so that it can later be searched.
2) Data transmission
Transmission is usually done through a queue: the volume of data is so large that it must be queued up and processed gradually before it becomes useful to the systems that consume it.
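The queue-based hand-off described above can be sketched in a few lines of Python. In production this role is usually played by a message queue such as Kafka or RabbitMQ; the stdlib `queue` module below only illustrates the idea, and the record names are invented for the example.

```python
import queue
import threading

buffer = queue.Queue(maxsize=100)  # bounded: producers block when consumers lag

def produce(records):
    """Simulate a data source pushing records into the pipeline."""
    for rec in records:
        buffer.put(rec)        # blocks if the queue is full
    buffer.put(None)           # sentinel: no more data

def consume(sink):
    """Drain the queue one record at a time, as fast as the consumer can go."""
    while True:
        rec = buffer.get()
        if rec is None:
            break
        sink.append(rec.upper())   # stand-in for real processing

sink = []
t = threading.Thread(target=consume, args=(sink,))
t.start()
produce(["page_view", "click", "purchase"])
t.join()
print(sink)  # ['PAGE_VIEW', 'CLICK', 'PURCHASE']
```

The bounded queue is the key design point: it lets a slow consumer apply back-pressure to a fast producer instead of letting data pile up unboundedly.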
3) Data storage
Data is money now; whoever holds the data holds the value. How else would a website know what you want to buy? Because it has your transaction history, information that others may not have, it is very valuable and therefore needs to be stored.
4) Data cleaning and analysis
The data stored so far is raw data, which is mostly disorganized and contains a lot of garbage, so it needs to be cleaned and filtered to obtain high-quality data. High-quality data can then be analyzed in order to classify it, or to discover the relationships between data points and turn them into knowledge.
Note: in real business scenarios, steps 3 and 4 can be interchanged as appropriate, i.e. the data may be cleaned first and then loaded.
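As a minimal sketch of the cleaning step (the field names and sanity thresholds here are invented for illustration), raw records with missing or implausible values can be filtered out before any analysis:

```python
raw = [
    {"user": "alice", "age": 31, "spend": 120.0},
    {"user": "",      "age": 27, "spend": 35.5},   # missing user -> garbage
    {"user": "bob",   "age": -4, "spend": 80.0},   # impossible age -> garbage
    {"user": "carol", "age": 45, "spend": 60.0},
]

def is_clean(rec):
    """Keep only records whose fields pass basic sanity checks."""
    return bool(rec["user"]) and 0 < rec["age"] < 120 and rec["spend"] >= 0

clean = [r for r in raw if is_clean(r)]
print(len(clean))  # 2
```

Real pipelines add deduplication, type normalization and outlier handling on top, but they follow the same shape: a predicate per rule, applied record by record.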
5) Data retrieval and mining
Retrieval means search: as the saying goes, for matters abroad ask Google, for matters at home ask Baidu. Mining goes further: search alone cannot satisfy people's needs, so the relationships hidden within the information must also be dug out.
6) Data loading and application
Presenting the results in a friendly way and delivering them to the user is what brings the data mining work to a satisfying close.
Data management tools
1) Data collection tools
1. Log-file tools
| Tool | Description |
| --- | --- |
| Logstash | An open-source data collection engine with real-time pipelining. Logstash dynamically unifies data from different sources and normalizes it to the destinations of your choice. |
| Filebeat | A lightweight log shipper that can push logs to a central Logstash. |
| Fluentd | Originally designed to output logs as JSON, so that downstream transports do not have to guess which substring each field is. It provides libraries for almost every language, meaning it can be embedded into your own programs. |
| Logagent | A log shipper provided by Sematext, used to send logs to Logsene (a SaaS platform with an Elasticsearch API). |
| Rsyslog | The default syslog daemon on the vast majority of Linux distributions; rsyslog reads and writes /var/log/messages. It can tail files, parse them, buffer them (on disk and in memory), and ship them to multiple destinations, including Elasticsearch, and it can handle Apache and system logs. |
| Logtail | The collector of Alibaba Cloud's Log Service. It has run on machines inside Alibaba Group for more than three years, and its log collection now serves Alibaba Cloud's public-cloud users. |
A full description of the strengths and weaknesses of the log collection tools Logstash, Filebeat, Fluentd, Logagent, Rsyslog, and Logtail is worth a separate read.
2. Crawlers
Page download -> page parsing -> data storage
(1) Page downloader
For downloading, Python's requests library meets most testing and crawling needs; for serious engineering use scrapy. For dynamic pages, look for an API endpoint first; if there is simple encryption, crack it; if rendering cannot be avoided, use splash.
(2) page parser
①BeautifulSoup (entry-level): Python Reptile entry module BeautifulSoup
②pyquery (similar to jQuery): Python Reptile: pyquery module parses the page
③lxml: Python Reptile: Use lxml parse web content
④parsel:Extract text using CSS or XPath selectors
⑤scrapy of Selector (highly recommended, more advanced packaging, parsel based)
⑥ Selector (Selectors): python reptiles: scrapy framework xpath and css selector syntax
---------------------
To sum up:
For parsing, just use scrapy's Selector directly: simple, direct, and efficient.
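Scrapy may not be installed everywhere, so the extraction idea is sketched below with the stdlib `html.parser` instead; the sample HTML is invented for the example. With scrapy installed, `Selector(text=SAMPLE).css("li.item::text").getall()` does the same in one line.

```python
from html.parser import HTMLParser

SAMPLE = "<ul><li class='item'>foo</li><li class='item'>bar</li></ul>"

class ItemExtractor(HTMLParser):
    """Collect the text of every <li class='item'> element."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items.append(data)

parser = ItemExtractor()
parser.feed(SAMPLE)
print(parser.items)  # ['foo', 'bar']
```

The verbosity of the stdlib version is exactly why the summary above recommends a selector library: a CSS or XPath expression replaces the whole state machine.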
(3) Data storage
① txt text: common file operations in Python
② csv file: reading and writing csv files in Python
③ sqlite3 (bundled with Python): using the sqlite3 database in Python
④ MySQL: writing data to MySQL with the pymysql module
⑤ MongoDB: basic CRUD operations on MongoDB in Python
---------------------
To sum up:
There is nothing deep about data storage; choose according to business needs. MongoDB is handy for quick tests, MySQL for production business use.
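Since sqlite3 ships with Python, a storage sketch needs no setup at all; the table schema and URLs below are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory DB; pass a filename to persist
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")

rows = [("https://example.com/a", "Page A"),
        ("https://example.com/b", "Page B")]
conn.executemany("INSERT INTO pages VALUES (?, ?)", rows)
conn.commit()

titles = [t for (t,) in conn.execute("SELECT title FROM pages ORDER BY url")]
print(titles)  # ['Page A', 'Page B']
conn.close()
```

Swapping in pymysql or pymongo changes the connection and query calls but not the overall shape: connect, write in batches, commit, query.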
(4) Other tools
① execjs: execute JavaScript
Python crawler: run JavaScript code from Python with execjs
② pyv8: execute JavaScript
Installing the pyv8 module on macOS (JavaScript bridged into Python)
③ html5lib
Python crawler: have scrapy use html5lib to parse non-standard HTML text
2) Data cleaning tools
1. DataWrangler
DataWrangler is a web-based service for cleaning and rearranging data, designed by a visualization group at Stanford University. It looks like a very simple text editor. For example, when I selected "Alabama" in the sample-data row "Reported crime in Alabama" and then selected "Alaska" in another row, it suggested extracting the name of each state. Hovering over the suggestion highlights the affected rows in red.
2. Google Refine
Google Refine can import and export many data formats, such as tab- or comma-delimited text files, Excel, XML, and JSON files. It has built-in clustering algorithms that can find entries which are spelled differently but really belong to the same group. After importing your data, select Edit cells -> Cluster and edit, then choose an algorithm. Its facet options provide a quick and easy profile of the data's distribution. This can reveal anomalies that may stem from input errors, for example a salary record of $800,000 that should have been $80,000, or point out inconsistencies, for example salary records where some are hourly wages, some weekly, and some annual. Besides these data-stewardship functions, Google Refine also offers some useful analysis tools, such as sorting and filtering.
3. Logstash
Logstash is a powerful data processing tool that can transport data, transform it, and format its output; it has a powerful plug-in ecosystem and is commonly used for log processing.
3) Data storage tools
Stored data divides into structured data and unstructured data.
1. Structured data
(1) Definition
Structured data generally refers to data stored in a database that has a well-defined logical and physical structure; the most common case is data stored in a relational database. Unstructured data, by contrast, generally refers to everything else: data not stored in a database, held instead as various kinds of text. Some of it (for example content embedded in HTML or XML tags) still has some logical and physical structure and is therefore called semi-structured data.
(2) Storage systems
The relatively mature storage systems for structured data are Oracle, MySQL, Hadoop, and so on.
2. Unstructured data
(1) Definition
Unstructured data is data whose structure is irregular or incomplete, which has no predefined data model and which cannot conveniently be represented in the two-dimensional tables of a relational database. It includes office documents of all formats, text, images, XML, HTML, all kinds of reports, and image and audio/video information.
(2) Storage
1) Store the files on a file system and keep their access paths in a database. The advantage of this approach is simplicity: it needs no advanced DBMS features. The drawbacks are that file access cannot be made transactional, backup and recovery are harder, and data migration is inconvenient.
2) Use the file storage capabilities of Alibaba Cloud OSS.
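Approach 1 above, files on disk with only their paths tracked in a database, can be sketched as follows; the schema and file names are illustrative, and sqlite3 plus a temporary directory stand in for the real DB and file store.

```python
import os
import sqlite3
import tempfile

store_dir = tempfile.mkdtemp()          # stand-in for a real file store
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, path TEXT)")

def save_blob(name, data):
    """Write the payload to the file system; record only its path in the DB."""
    path = os.path.join(store_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    db.execute("INSERT INTO blobs VALUES (?, ?)", (name, path))
    db.commit()

def load_blob(name):
    """Look the path up in the DB, then read the file from disk."""
    (path,) = db.execute(
        "SELECT path FROM blobs WHERE name = ?", (name,)).fetchone()
    with open(path, "rb") as f:
        return f.read()

save_blob("report.txt", b"quarterly numbers")
print(load_blob("report.txt"))  # b'quarterly numbers'
```

Note that the file write and the DB insert are two separate operations: if the process dies between them they can disagree, which is exactly the transactional weakness the text points out.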
4) Data computation tools
Data computation divides into real-time computation, online computation, and offline computation.
1. Real-time computation
Apache Storm
2. Online computation
Elasticsearch
MySQL
3. Offline computation
Hadoop Hive
5) Data analysis tools
1. Scientific computing on data matrices: Python's numpy library
2. Routine processing and slicing of data: the powerful pandas library
3. Data modeling: the sklearn library
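The group-and-aggregate operation that pandas is typically used for can be shown with the stdlib for portability; the categories and values below are a toy dataset invented for the example. In pandas the same result is `df.groupby("category")["value"].mean()`.

```python
from collections import defaultdict
from statistics import mean

# Toy dataset: (category, value) pairs
records = [("A", 10), ("B", 20), ("A", 30), ("B", 60), ("A", 50)]

# Group values by category...
groups = defaultdict(list)
for cat, val in records:
    groups[cat].append(val)

# ...then aggregate each group
averages = {cat: mean(vals) for cat, vals in sorted(groups.items())}
print(averages)  # {'A': 30, 'B': 40}
```

numpy and pandas earn their place by doing this vectorized and at scale, but the split-apply-combine pattern is the same.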
6) Data loading tools
1. Data visualization: Python's matplotlib and seaborn libraries
2. Common BI visualization tools: Tableau and FanRuan (FineBI)
3. ECharts