Big data benchmark systems and scientific issues

SQL queries can be executed in single-user and multi-user modes.
NoSQL benchmarks
NoSQL databases handle semi-structured and unstructured data efficiently, which makes them well suited to large data sets in which unstructured data makes up a large share. Yahoo developed the Yahoo! Cloud Serving Benchmark (YCSB) for evaluating NoSQL databases. YCSB consists of a workload-generating client and a package of standard workloads that cover parts of the performance space, such as read-heavy, write-heavy, and scan workloads. These three kinds of workload can be run against four data storage systems: Cassandra, HBase, PNUTS, and a simple sharded MySQL setup. Other studies have extended the YCSB framework with advanced features such as pre-splitting, bulk loading, and server-side filtering.
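The idea of a client that generates operations from a configurable workload mix can be sketched in a few lines of Python. This is only a toy illustration, not YCSB itself: the names `WORKLOADS`, `generate_operations`, and `run_against` are hypothetical, the operation proportions are illustrative assumptions, and a plain dict stands in for the storage system under test.

```python
import random
from collections import Counter

# Hypothetical operation mixes for read-heavy, write-heavy, and scan
# workloads. The proportions are illustrative assumptions, not taken
# from the YCSB distribution.
WORKLOADS = {
    "read_heavy": {"read": 0.95, "update": 0.05},
    "write_heavy": {"read": 0.50, "update": 0.50},
    "scan": {"scan": 0.95, "insert": 0.05},
}


def generate_operations(workload, count, key_space=1000, seed=0):
    """Return a list of (operation, key) pairs drawn from the given mix."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)  # seeded so that runs are repeatable
    ops = rng.choices(list(mix), weights=list(mix.values()), k=count)
    return [(op, f"user{rng.randrange(key_space)}") for op in ops]


def run_against(store, operations):
    """Apply the operations to a dict that stands in for a data store."""
    counts = Counter()
    for op, key in operations:
        if op == "read":
            store.get(key)
        elif op in ("update", "insert"):
            store[key] = "value"
        elif op == "scan":
            # Read up to 10 keys in sorted order, starting at `key`.
            _ = [k for k in sorted(store) if k >= key][:10]
        counts[op] += 1
    return counts
```

Calling `run_against({}, generate_operations("read_heavy", 1000))` returns the number of operations of each type that were issued. A real benchmark would also time each operation and report throughput and latency.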
Ghazal et al. first proposed an end-to-end big data benchmark, BigBench, based on a retail product model. It consists of two main components: a data generator and a workload query specification. The data generator produces raw data of three types: structured, semi-structured, and unstructured. The query specification defines the query types, data processing languages, and analysis algorithms, following the characteristics of a typical retailer as described in a McKinsey report. BigBench covers the "3Vs" characteristics of big data systems.
Second, scientific issues in big data
Big data systems face many challenges that still need to be studied and solved. Across the whole big data lifecycle, from platforms and processing modes to application scenarios, several directions are worth investigating.
Big data platform
Although Hadoop has become the main framework for big data analysis, big data platforms are still far from mature compared with RDBMS systems, which have more than 40 years of development behind them. First, Hadoop needs to integrate real-time data acquisition and transmission mechanisms and provide fast processing outside batch mode. Second, Hadoop offers a simplified programming interface that hides the complex details of the underlying execution, and this simplification reduces processing performance to some extent. More advanced interfaces, similar to those of DBMS systems, should be designed so that Hadoop performance can be optimized from multiple angles. Third, a large-scale Hadoop cluster consists of thousands or even hundreds of thousands of servers and consumes a great deal of energy. Whether Hadoop can be widely deployed depends on its energy efficiency. In addition, platform research also covers distributed storage and management of massive data, real-time indexing and querying, the energy consumption of real-time big data platforms, and the collection, transmission, and processing of massive data. Hu proposed an SDN-based big data platform for analyzing social TV data.
Big data applications
Research on big data has only just begun. Studying typical big data applications can bring profits to businesses, improve government efficiency, and advance science. The main scenarios are: parallel computing models and frameworks for graph data, social network analysis, ranking and recommendation, web mining and information retrieval, media analysis and retrieval, and natural language processing.
Processing mode
The existing batch mode struggles to meet the real-time requirements of processing massive data, so new real-time processing modes need to be designed. In conventional batch mode, the data is stored first, and the whole data set is then scanned to produce the analysis results. A great deal of time is wasted on transmitting, storing, and rescanning the data. A new real-time processing mode can reduce this waste. For example, in-situ analysis avoids the overhead of transferring data to a centralized storage infrastructure and so improves real-time performance. Big data is a systems problem, and many factors have to be considered in the processing mode. Solving a problem is not only a matter of the algorithm; transmission and storage are involved as well. Analysis in terms of computational complexity alone is not enough, because an algorithm with low theoretical complexity does not necessarily run fast on a real machine. Moreover, because big data has low value density, dimensionality reduction or sampling-based analysis can be used to reduce the amount of data processed. Specifically, research on processing modes covers visual computational analysis of big data, the computational complexity of big data processing, parallel deep machine learning and data mining algorithms, heterogeneous data fusion, sampling of massive data with low value density, and dimensionality reduction of high-dimensional massive data.
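Sampling-based analysis in a single pass, without storing and rescanning the data, can be illustrated with reservoir sampling. The sketch below is one well-known technique chosen as an example, not a method named in the text above; the function names `reservoir_sample` and `estimate_mean` are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length.

    The stream is read once and only k items are held in memory.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = item
    return sample


def estimate_mean(stream, k, seed=0):
    """Estimate the mean of a numeric stream from a k-item sample."""
    sample = reservoir_sample(stream, k, seed)
    if not sample:
        raise ValueError("cannot estimate the mean of an empty stream")
    return sum(sample) / len(sample)
```

For instance, `estimate_mean(range(100000), 1000)` approximates the true mean of 49999.5 while holding only 1000 values in memory. The trade-off is the usual one for sampling: less data to process in exchange for an approximate answer.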
Big Data privacy
Privacy is also an important issue in the big data field. Information may be exposed, such as a company's marketing strategy or an individual's spending habits. Privacy protection is especially important in e-commerce, e-government, and healthcare, where access control needs to be strengthened. At the same time, a balance has to be struck between stronger access control and the convenience of data processing.
"Unlimited" data
With the development of cloud computing, the Internet of Things, mobile terminals, wearable devices, and other technologies, we have entered the era of big data, and the amount of data generated will keep growing. Today's big data will be small data in the near future. The most accurate description of future big data may therefore be "unlimited" data. Accordingly, learning from incrementally arriving data is an important issue. For example, a classifier trained on the current 1 billion samples performs well, but in the future the number of samples grows to 1.5 billion, and the earlier 1 billion samples can no longer fully express the characteristics of the data. A question then arises: should a classifier be retrained on all 1.5 billion samples, or should the 500 million newly added samples be used to correct the classifier already trained on the original 1 billion? Retraining incurs excessive time and space overhead and scales poorly. In the past, to avoid relearning historical samples and to shorten subsequent training, incremental learning was adopted, that is, using the results of earlier learning together with the newly added samples to correct the classifier. But in the face of constantly evolving "unlimited" big data, new incremental learning methods are needed that adapt dynamically and keep the model's predictions accurate. This will probably be an important problem for the future development of big data.
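The difference between retraining and correcting an existing classifier can be shown with a very simple incremental learner. The sketch below is a nearest-centroid classifier whose class centroids are running means, so new samples update the model without revisiting old ones. It is a minimal illustration of the classic incremental idea, not the new adaptive method the paragraph calls for, and the class name `IncrementalCentroidClassifier` and its methods are hypothetical.

```python
class IncrementalCentroidClassifier:
    """Nearest-centroid classifier with centroids kept as running means."""

    def __init__(self):
        self.counts = {}     # label -> number of samples seen so far
        self.centroids = {}  # label -> current mean vector

    def partial_fit(self, samples, labels):
        """Update the centroids with new samples only; old data is not needed."""
        for x, y in zip(samples, labels):
            n = self.counts.get(y, 0) + 1
            c = self.centroids.get(y, [0.0] * len(x))
            # Running-mean update: c_new = c_old + (x - c_old) / n
            self.centroids[y] = [ci + (xi - ci) / n for ci, xi in zip(c, x)]
            self.counts[y] = n
        return self

    def predict(self, x):
        """Return the label whose centroid is closest to x."""
        def squared_distance(c):
            return sum((ci - xi) ** 2 for ci, xi in zip(c, x))

        return min(self.centroids,
                   key=lambda y: squared_distance(self.centroids[y]))
```

Calling `partial_fit` first on the original samples and later on the newly added ones gives the same centroids as training once on all samples in the same order, but the second call touches only the new data. This simple scheme weights every past sample equally, so it does not by itself handle data whose characteristics drift over time.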
That concludes today's introduction to big data benchmark systems and scientific issues. More will follow in later posts. A better understanding and awareness of big data opens up more room for personal development.

Origin blog.51cto.com/14465882/2424135