Big Data: Concepts, Challenges, Algorithms, Processing, and Research Progress


1. Basic Concepts of Big Data

Big data refers to data sets whose size exceeds the ability of commonly used software tools to collect, manage, and process within a tolerable run time. It is a relative concept: a data set counts as "big" when current storage and computing modes and capacities cannot adequately store and process it.

Preprocessing of Big Data

Preprocessing mainly performs extraction, cleaning, and similar operations on the received data.

(1) Extraction: acquired data may come in many structures and types. The extraction step converts these complex, heterogeneous data into a single structure, or one that is convenient to process, so that they can be analyzed rapidly.

(2) Cleaning: not all big data is valuable. Some data is simply of no interest, and some is entirely erroneous interference. The data must therefore be "de-noised": filtered so that only the valid data is extracted.
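The two steps above can be sketched in a few lines of Python. This is a minimal illustration only; the record formats, the field names `temp` and `sensor`, and the validity range are hypothetical examples, not taken from the text:

```python
import json

# Hypothetical raw input: the same kind of record arriving in different shapes.
raw_records = [
    '{"temp": "21.5", "sensor": "A"}',  # JSON string
    "temp=19.0;sensor=B",               # key=value text
    '{"temp": "-999", "sensor": "A"}',  # sentinel value: noise
    "corrupted line",                   # unparseable: noise
]

def extract(record):
    """Extraction: normalize heterogeneous inputs into a single dict structure."""
    try:
        return json.loads(record)
    except json.JSONDecodeError:
        pass
    try:
        return dict(pair.split("=") for pair in record.split(";"))
    except ValueError:
        return None  # unparseable record

def clean(rec):
    """Cleaning ("de-noising"): drop records with missing or out-of-range fields."""
    if rec is None or "temp" not in rec:
        return None
    temp = float(rec["temp"])
    if not (-50.0 <= temp <= 60.0):  # hypothetical plausibility range
        return None
    return {"temp": temp, "sensor": rec.get("sensor")}

valid = [c for c in (clean(extract(r)) for r in raw_records) if c is not None]
print(valid)  # only the well-formed, in-range records survive
```

After extraction every surviving record has the same structure, so the cleaning rule can be written once instead of once per input format.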

Mathematical Problems Raised by Big Data

From a mathematical point of view, data sets in computers simply keep growing larger; "absolutely big" data does not exist, since every data set in a computer is a finite set.

Sampling of big data - make big data smaller: find the smallest sample set an algorithm can adapt to, and quantify how sampling error affects the algorithm's results

Representation of big data - the chosen representation determines storage and affects both presentation and algorithm efficiency

Inconsistency in big data - inconsistent data can make an algorithm fail or leave a problem without a solution; how can inconsistencies be resolved?

Ultra-high dimensionality of big data - ultra-high-dimensional data is sparse, which increases algorithm complexity

Uncertain dimensionality of big data - data of several different dimensionalities coexists, making tasks defined over a fixed dimension difficult

Ill-posedness of big data - very high dimensionality can leave the solution of a problem ill-determined

Characteristics of Big Data

Dense and sparse coexist: locally dense, globally sparse

Redundant and missing coexist: heavy redundancy alongside partial gaps

Explicit and implicit coexist: large amounts of explicit information together with rich implicit information

Static and dynamic coexist: static data is associated with dynamic evolution

Diverse and heterogeneous coexist: many varieties of data in heterogeneous forms

Size and usability in contradiction: volume is large, value density is low, and usable data is scarce

The Current Scope of Big Data

The size threshold for big data is an evolving indicator:

For a single data-processing task, the data currently ranges from tens of TB up to around ten PB (TB ≪ PB ≪ EB ≪ ZB).
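To make these unit scales concrete, here is a quick sketch of the byte count behind each unit, using binary prefixes (powers of 1024):

```python
# Byte counts for the storage units in the TB-to-ZB range mentioned above.
units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
for i, unit in enumerate(units):
    print(f"1 {unit} = 1024^{i} bytes = {1024**i:,} bytes")
```

Each step is a factor of 1024, so moving from TB-scale to PB-scale tasks means handling roughly a thousand times more data.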

The "reasonable time" for processing depends on how long the task's objective can wait:

Earthquake prediction from seismic data is useful only within a few minutes; meteorological data must be processed within hours; data on a missing aircraft must be processed within about seven days; general data mining is usually expected to finish within 12 hours.

2. The Big Data Paradox

Big data has been called the fourth paradigm of scientific inquiry. Following millennia of experimental science, centuries of theoretical science, and decades of computational science, today's data explosion has given birth to data-intensive science, unifying the theoretical, experimental, and computational-simulation paradigms. Big data has also been described as a "non-rivalrous" factor of production: it is "inexhaustible", continuously releasing its potential value through reuse, recombination, and extension, and constantly creating new wealth through wide publication and sharing. The root value of big data lies in predicting the unknown and in breaking through long-standing, widespread social problems driven by uncertain future factors. Yet today's big data technologies and applications remain limited to correlation analysis of historical and real-time data, and to meeting short-term, specific market needs. Herein lies the paradox; and it is in the process of resolving it that new theories and methods emerge. People's efforts to resolve the paradox are precisely the driving force that lets big data take root.

The Absence of Methodology

Since "Nature" launched its "Big Data" special issue in 2008, discussion of the concept has moved from academic debate, to the digital transformation of enterprises, and on to strategic "open government data" initiatives. Yet scale alone cannot distinguish big data from the earlier "mass data" or "ultra-large-scale data", because no fixed threshold of orders of magnitude separates them.

The absence of methodology is the biggest obstacle. The core driving force behind big data stems from people's hunger to measure, record, and analyze the world, a desire met by three elements: data, technology, and thinking. Today, with computing and communication technologies mature, and cheap, convenient digital storage ubiquitous, data is everywhere and technology is offered in standardized, commercialized form. It is therefore thinking and methodology that determine the success or failure of big data; for now, however, the methodology that would bridge academia and industry, technology and application, remains imperfect.

Digging for Gold in Social Problems

Just as the three crises in the history of mathematics led to the axiomatization of geometry, the creation of set theory, and the development of modern mathematics, paradox is a tremendous impetus for progress in theory, technology, and application. Resolving the big data paradox will likewise promote the popularization of big data applications and the release of their social value. After the media hype and the academic conferences, big data as a technology trend has suddenly fallen to the bottom, and many start-ups have become precarious. According to Gartner's famous hype cycle, big data has passed through its infancy and the bubble of speculation and entered a trough expected to last the next three to five years.

The Market Chasm

The big data market will pass through five stages of adoption: innovators, early adopters, early majority, late majority, and laggards. Between these five stages lie four cracks, of which the largest and most dangerous sits between the early market and the mainstream market; this is what we call the "chasm".

The mainstream big data market comes from the pragmatic, conservative early majority and late majority, each of which holds about one third of the market. Both groups have good information-technology infrastructure, deep accumulations of big data, and a clear understanding of its social and economic value. They differ in that the former wants to see proven solutions and successful applications, and consists mostly of public-service sectors such as finance, energy, and telecommunications, while the latter requires more secure and reliable data protection and a broad social basis for applications, and is mostly committed to public administration of social issues such as environment, energy, and health.

Innovators' enthusiasm for big data technologies and applications is obvious, and early-market support is easy to obtain. But those who join because big data is "fashionable" will quit once it seems "out of date", before the mainstream market becomes the main source of big data gold. Unfortunately, many companies may become "victims of the chasm" and miss the arrival of the real big data application market.

Whole-Product Planning

Theodore Levitt, a founder of modern marketing, proposed the concept of the "whole product". By this concept, a big data product should comprise four parts: the generic product as the "core attraction", the expected product meeting basic psychological needs, the augmented product achieving higher-order participation, and the potential product enabling self-realization.


Source: blog.csdn.net/sdddddddddddg/article/details/91631608