Big data ad hoc query and analysis

The "magnitude" of big data:
traditional IT and business systems are mostly dominated by OLTP1, especially the traditional database orcle, mysql and other data volumes are mostly hundreds of thousands or millions, and the data must be divided into databases and tables. Over 100 million will be used another data processing technology OLAP2 online analytical processing.
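To make the idea of splitting data across databases and tables concrete, here is a minimal, hypothetical sketch in Python of hash-based routing; the table names and the user_id key are invented for illustration and are not from any particular system.

```python
# A minimal sharding sketch: route a record to one of several databases/tables
# by hashing its key. All names here (orders_db_*, orders_*, user_id) are
# illustrative assumptions.
def route(user_id: int, num_dbs: int = 4, tables_per_db: int = 16) -> str:
    slot = user_id % (num_dbs * tables_per_db)   # stable slot for this key
    db_index = slot // tables_per_db             # which database the slot lives in
    table_index = slot % tables_per_db           # which table inside that database
    return f"orders_db_{db_index}.orders_{table_index}"

print(route(123456789))  # always maps the same user to the same shard
```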
Google's three papers, on GFS, Bigtable, and MapReduce, can be regarded as the cornerstones of big data development.
Every minute and every second, data from each source is imported into Hadoop through online, offline, and other channels. These data are like carts of gold ore stored in a vast warehouse. Discovering the gold mine, collecting the ore, and hauling it back to the warehouse may not be the most exciting part; what everyone really looks forward to is turning the ore into gold, that is, "alchemy". Big data "alchemy", uncovering the potential value of the data, can be divided into three categories according to how the data is used: batch processing (Batch), ad hoc query and analysis (Ad hoc), and stream computing (Stream computing).
Batch processing is usually offline computing and does not require tight timeliness. It is mainly used to chew on the big, hard bones: a single job can be handed dozens of terabytes or even petabytes of data and still get through it, so it can fairly be called "very robust". Both the first-generation computing engine, MapReduce, and the second-generation engine, Spark, take a brute-force approach: for an analysis over tens of billions of rows, MapReduce may run for several hours, and Spark still takes minutes to tens of minutes. Even relatively lightweight analysis requests usually take Spark minutes to complete.
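As a hedged illustration of such an offline batch job, the sketch below uses PySpark to aggregate a large event dataset once per run; the paths and column names (event_date, event_type, user_id) are assumptions, not taken from the article.

```python
# A minimal PySpark batch job: read a (potentially huge) dataset, aggregate it,
# and write the result back out. Paths and columns are assumed for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-report").getOrCreate()

events = spark.read.parquet("/data/events/")            # could be TBs of input
daily = (events
         .groupBy("event_date", "event_type")
         .agg(F.count("*").alias("pv"),
              F.countDistinct("user_id").alias("uv")))
daily.write.mode("overwrite").parquet("/data/reports/daily/")
```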
Stream computing performs the corresponding computation as the data flows in. Its timeliness is extremely high, which makes it well suited to real-time statistics, alerting on preset rules, and combining with various algorithms to make predictions, and it has been widely adopted in many forms. However, stream computing is essentially pre-computed analysis: the metrics and dimensions to be analyzed must be known in advance. Like other pre-computation engines in the industry, its shortcoming is that flexibility is greatly limited.
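The sketch below, using Spark Structured Streaming, shows the point being made: the metric (a windowed count) has to be fixed in advance, which is exactly the pre-computation limitation described above. The built-in "rate" source is used only so the example runs without any external data source.

```python
# A minimal Structured Streaming sketch: count events per 10-second window
# as they arrive. The aggregation is decided before the data flows in.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-count").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```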
The ad hoc query and analysis computing mode offers both good timeliness and good flexibility, and is a powerful complement to the batch processing and stream computing modes.
The concept of an ad hoc query (Ad Hoc) is that users can flexibly choose query conditions according to their own needs, and the system generates the corresponding statistical report based on those choices. The biggest difference between ad hoc queries and ordinary application queries is that ordinary application queries are custom-developed in advance, whereas in ad hoc queries the query conditions are defined by the user.
There is a related concept in the data warehouse field. The usual approach is to map the dimension tables and fact tables in the data warehouse to a semantic layer; users select tables through the semantic layer, establish associations between them, and the system finally generates the SQL statement. In terms of the SQL itself there is no essential difference between an ad hoc query and a normal query. The difference is that the usual queries are known when the system is designed and implemented, so they can be optimized during implementation by building indexes, partitioning, and other techniques, making them very efficient. Ad hoc queries are produced on the fly by users, so the system cannot optimize them in advance. For this reason, ad hoc query performance is also an important indicator for evaluating a data warehouse. Ad hoc queries usually live in the relational data warehouse, i.e., in the EDW or a ROLAP layer.
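To contrast the two kinds of query, here is a small self-contained sketch using SQLite; the sales table, its columns, and the user-chosen filters are all invented for illustration.

```python
# Known query vs. ad hoc query: the first is designed up front and can be
# supported by an index; the second is assembled from user-chosen conditions
# at run time, so nothing was prepared for it specifically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("north", "a", 10.0), ("south", "b", 25.0), ("north", "b", 7.5)])

# Known query: the designer built an index for it during implementation.
conn.execute("CREATE INDEX idx_sales_region ON sales(region)")
fixed = conn.execute("SELECT SUM(amount) FROM sales WHERE region = ?", ("north",))

# Ad hoc query: filters chosen by the user at run time (columns trusted here).
user_filters = {"product": "b", "region": "south"}
where = " AND ".join(f"{col} = ?" for col in user_filters)
ad_hoc = conn.execute(f"SELECT SUM(amount) FROM sales WHERE {where}",
                      tuple(user_filters.values()))
print(fixed.fetchone(), ad_hoc.fetchone())
```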
Ad hoc analysis:
Analysis that is generated on the fly, under user query conditions that are not known in advance and that the system therefore cannot pre-optimize.

Big data ad hoc query and analysis blog: http://blog.csdn.net/vv8086/article/details/56011624
Database optimization: http://blog.csdn.net/xlgen157387/article/details/44156679
Operation and maintenance: http://www.ywnds.com/?cat=5

Supplementary knowledge:
1. OLTP
On-Line Transaction Processing (OLTP) is also known as transaction-oriented processing. Its basic characteristic is that user data received in the foreground can be transmitted immediately to the computing center for processing, with the result returned in a very short time; it is one of the ways to respond quickly to user operations. The biggest advantage of this approach is that input data is processed in real time and answered promptly, which is why such systems are also called real-time systems.
An important indicator for measuring online transaction processing is system performance, embodied in the real-time request-response time, that is, the time the computer takes to reply after the user submits data at a terminal.
OLTP is carried out jointly by the front end, the application, and the database; processing speed and capacity largely depend on the database engine, the server, and the application engine.
An OLTP database lets the transactional application write only the data it needs, so that a single transaction can be processed as quickly as possible.
Today's data processing can roughly be divided into two categories: online transaction processing (OLTP) and online analytical processing (OLAP).
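A minimal sketch of the transaction-oriented pattern described above is given below; SQLite is used only to keep it self-contained, and the accounts table is an invented example.

```python
# OLTP-style unit of work: a short read-modify-write transaction that either
# commits as a whole or rolls back as a whole.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # the whole transfer is undone if anything fails

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```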
2. OLAP
Online Analytical Processing (OLAP) is a class of software techniques that lets analysts, managers, or executives access data quickly and interactively in order to gain a deeper understanding of it. The goal of OLAP is to meet decision-support needs or specific query and reporting requirements in a multidimensional environment. Its technical core is the concept of the "dimension". Dimensions generally contain hierarchical relationships, which can sometimes be quite complex. By defining the most important attributes of an entity as multiple dimensions, users can compare data across them. OLAP can therefore also be described as a collection of multidimensional data analysis tools.
The basic multidimensional analysis operations of OLAP include drilling (roll up and drill down), slicing and dicing, rotation (pivot), drill across (querying several fact tables and combining the results into one result set), and drill through (using database relationships to drill through the bottom layer of the cube into the underlying relational tables).
Drilling changes the level of a dimension and thus the granularity of the analysis. It includes roll up and drill down: roll up generalizes low-level detail data into higher-level summary data along a dimension, or reduces the number of dimensions, while drill down goes the other way, descending from summary data to detail data or adding new dimensions.
Slicing and dicing fix values on some of the dimensions and examine how the measures are distributed across the remaining dimensions. If two dimensions remain it is a slice; if three remain it is a dice.
Rotation changes the orientation of the view, that is, it rearranges how dimensions are placed in the table (such as swapping rows and columns).
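To ground these operations, here is a hedged PySpark sketch on a tiny invented dataset (year, city, product, amount) showing roll up, drill down, and a slice; the data and column names are assumptions made for illustration.

```python
# Roll up, drill down, and slice on a toy cube of sales figures.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("olap-ops").getOrCreate()
sales = spark.createDataFrame(
    [(2023, "Beijing", "phone", 10.0), (2023, "Beijing", "laptop", 20.0),
     (2023, "Shanghai", "phone", 15.0), (2024, "Beijing", "phone", 12.0)],
    ["year", "city", "product", "amount"])

# Roll up: summarize from (year, city) detail up to per-year and grand totals.
sales.rollup("year", "city").agg(F.sum("amount").alias("total")).show()

# Drill down: return to the finer (year, city, product) granularity.
sales.groupBy("year", "city", "product").agg(F.sum("amount").alias("total")).show()

# Slice: fix one dimension (year = 2023) and look at the remaining ones.
sales.filter(F.col("year") == 2023) \
     .groupBy("city", "product").agg(F.sum("amount").alias("total")).show()
```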
OLAP has multiple implementation methods; according to how the data is stored, it can be divided into ROLAP, MOLAP, and HOLAP.
ROLAP is an OLAP implementation based on a relational database. With the relational database at its core, relational structures are used to represent and store the multidimensional data.
MOLAP is an OLAP implementation based on multidimensional data organization; its core is that MOLAP stores data in multidimensional arrays, so the data forms a "cube" structure in storage. In MOLAP, rotating, dicing, and slicing the cube are the main techniques for generating multidimensional data reports.
HOLAP is an OLAP implementation based on a hybrid data organization, for example a relational lower layer with a multidimensional matrix on top, which offers better flexibility.
According to how aggregated data is organized, the two common forms of OLAP are MOLAP, based on a multidimensional database, and ROLAP, based on a relational database. MOLAP organizes and stores data multidimensionally, while ROLAP uses existing relational database technology to simulate multidimensional data. In data warehouse applications, OLAP tools generally serve as the front-end tools of the warehouse, and they can also be combined with data mining and statistical analysis tools to strengthen decision analysis.
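To make the ROLAP idea concrete, the sketch below represents multidimensional data relationally as a small star schema (one fact table plus dimension tables) and answers a multidimensional question with an ordinary join-and-group SQL statement; the schema is invented, and SQLite is used only for self-containment.

```python
# ROLAP-style star schema: fact_sales references dim_date and dim_product.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (date_id INTEGER, product_id INTEGER, amount REAL);
INSERT INTO dim_date    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_product VALUES (1, 'phone'), (2, 'laptop');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 1, 15.0);
""")

# A multidimensional question expressed as plain relational SQL.
rows = conn.execute("""
    SELECT d.year, d.month, p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_date d    ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.year, d.month, p.category
""").fetchall()
print(rows)
```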
3. GFS
GFS is a scalable distributed file system designed for large-scale, distributed applications that access large amounts of data. It runs on cheap commodity hardware (this is what makes it special and impressive) and provides fault tolerance, delivering high aggregate performance to a large number of clients.
4. Hive
Hive is a data warehouse tool built on Hadoop. It can map structured data files onto database tables and provides a simple SQL query capability, converting SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce-style statistics can be produced quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it very suitable for statistical analysis on a data warehouse.
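As a hedged sketch of this workflow, the example below expresses a statistic as a single SQL statement instead of a hand-written MapReduce program; Spark SQL with Hive support stands in for a Hive deployment here, and the web_logs table is assumed to already exist in the metastore.

```python
# Hive-style analysis: one SQL statement replaces a custom MapReduce job.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-style-stats")
         .enableHiveSupport()   # read tables registered in the Hive metastore
         .getOrCreate())

spark.sql("""
    SELECT event_date, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv
    FROM web_logs
    GROUP BY event_date
""").show()
```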
5. DMP
A DMP (Data Management Platform) is one of the most important back-end systems for an Internet company. It integrates scattered, multi-party data into a unified technology platform, standardizes and segments that data, and then pushes the segmentation results into the existing interactive marketing environment.
