Five key technologies of big data processing and their applications

Extracting value from massive data is a complex task. Predictive analysis is one of its most valuable applications: through data visualization, statistical pattern recognition, data mining and other forms of data description, it helps data scientists understand the data better and make predictive decisions based on the results of mining. The main work areas include:

  big data acquisition, big data preprocessing, big data storage and management, big data analysis and mining, and big data presentation and application (big data retrieval, big data visualization, big data applications, and big data security).

  First, big data acquisition technology

  Data is obtained from sources such as RFID radio-frequency data, sensor data, social-network interaction data and mobile Internet data, yielding massive structured, semi-structured (also called weakly structured) and unstructured data; this data is the foundation of the big data knowledge service model. The key tasks are to achieve breakthroughs in big data collection technologies such as distributed, high-speed, high-reliability data crawling or acquisition and high-speed full-data imaging; in big data integration technologies such as high-speed data parsing, transformation and loading; and in designing quality-assessment models and developing data-quality technologies.
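  The reliable-acquisition idea can be illustrated on a small scale. Below is a minimal single-machine sketch (not the distributed crawling systems referred to above), assuming Python with only the standard library; the URLs are placeholders. It fetches several sources concurrently and retries failed requests for reliability.

    # Minimal sketch: concurrent, retrying data acquisition (standard library only).
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URLS = ["https://example.com/feed/1", "https://example.com/feed/2"]  # placeholders

    def fetch(url, retries=3):
        """Fetch one source, retrying a few times for reliability."""
        for attempt in range(retries):
            try:
                with urlopen(url, timeout=10) as resp:
                    return url, resp.read()
            except OSError:
                if attempt == retries - 1:
                    return url, None
        return url, None

    # Acquire all sources concurrently instead of one by one.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, body in pool.map(fetch, URLS):
            print(url, "->", "failed" if body is None else f"{len(body)} bytes")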

  Big data collection is generally divided into:

  1) Big data smart-sensing layer: includes the data sensing system, network communication system, sensor adaptation system, intelligent identification system and hardware resource access and control system, which together perform intelligent identification, positioning, tracking, access, transmission, signal conversion, monitoring, preliminary processing and management of structured, semi-structured and unstructured massive data. The focus is on intelligent identification, sensing, adaptation, transmission and access technologies for big data sources.

  2) Basic support layer: provides the virtual servers; databases of structured, semi-structured and unstructured data; network resources; and other infrastructure needed by the big data service platform. The focus is on distributed virtual storage technology; visualization interfaces for big data acquisition, storage, organization, analysis and decision-making; network transmission and compression technology for big data; and big data privacy protection technology.

  Second, big data preprocessing technology

  This stage performs identification, extraction, cleaning and other operations on the received data.

  1) Extraction: the acquired data may have many structures and types. The extraction process converts these complex data into a single or easily processed form so that they can be analyzed quickly.

  2) Cleaning: not all big data is valuable. Some of it is of no interest, and some of it is simply erroneous interference. The data therefore has to be filtered and "de-noised" to extract the valid records.
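  As an illustration of the extraction and cleaning steps above, here is a minimal sketch assuming Python with pandas; the file name, column names and validity thresholds are hypothetical.

    # Minimal preprocessing sketch: extract fields of interest, then "de-noise".
    import pandas as pd

    # Extraction: load a semi-structured source into a single tabular structure
    # and keep only the fields relevant to the analysis.
    raw = pd.read_json("sensor_readings.json")
    raw = raw[["device_id", "timestamp", "temperature"]]

    # Cleaning: drop missing values and filter out obviously invalid readings.
    clean = raw.dropna()
    clean = clean[(clean["temperature"] > -50) & (clean["temperature"] < 80)]

    print(f"kept {len(clean)} of {len(raw)} records after cleaning")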

  Third, big data storage and management technology

  Big data storage and management uses storage systems to persist the collected data, builds appropriate databases, and manages and retrieves the data. It focuses on technologies for managing and processing complex structured, semi-structured and unstructured big data, and addresses key issues such as the storability and representability of big data, reliable processing and efficient transmission. This includes developing reliable distributed file systems (DFS), energy-efficient storage, in-storage computing, big data de-duplication and other cost-efficient big data storage technologies; breakthroughs in distributed non-relational big data management and processing, fusion of data with different structures, data organization and big data modeling; breakthroughs in big data indexing; breakthroughs in big data migration, backup and replication; and the development of big data visualization technology.

  New database technologies are also being developed. Databases fall into relational databases, non-relational databases and database caching systems. Non-relational databases mainly refer to NoSQL databases, which are divided into key-value databases, column-store databases, document databases and graph databases. Relational databases include traditional relational database systems and NewSQL databases.
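  As a rough illustration of the NoSQL data models listed above, the same customer record can be laid out in each style using plain Python structures (no real database involved); the field values are made up.

    # Key-value store: an opaque value looked up by a single key.
    kv_store = {"customer:42": '{"name": "Alice", "city": "Beijing"}'}

    # Column(-family) store: values grouped per column, addressed by row key.
    column_store = {"name": {"42": "Alice"}, "city": {"42": "Beijing"}}

    # Document store: a nested, self-describing document per key.
    document_store = {"42": {"name": "Alice", "address": {"city": "Beijing"}}}

    # Graph store: nodes plus explicit relationships (edges) between them.
    graph_nodes = {"42": {"label": "Customer", "name": "Alice"},
                   "beijing": {"label": "City", "name": "Beijing"}}
    graph_edges = [("42", "LIVES_IN", "beijing")]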

  Big data security technology is being developed as well: improving data destruction, transparent encryption and decryption, distributed access control and data auditing technologies; and achieving breakthroughs in privacy protection and inference control, data authenticity identification and forensics, data integrity verification and related technologies.
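  Of the techniques named above, data integrity verification is the simplest to sketch: keep a cryptographic digest of each stored block and recompute it on read. A minimal example using Python's standard hashlib module (the data block is made up):

    import hashlib

    def digest(data: bytes) -> str:
        # SHA-256 digest used as an integrity fingerprint of a data block.
        return hashlib.sha256(data).hexdigest()

    original = b"big data block #1"
    stored_digest = digest(original)      # kept alongside (or apart from) the data

    # Later, verify that the stored or transmitted copy was not altered.
    received = b"big data block #1"
    print("integrity ok" if digest(received) == stored_digest else "block was modified")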

  Fourth, big data analysis and mining technology

  Big data analysis technology: improve existing data mining and machine learning technologies; develop new data mining technologies such as network data mining, group-specific mining and graph mining; achieve breakthroughs in big data fusion technologies such as object-based data connection and similarity join; and achieve breakthroughs in domain-oriented big data mining technologies such as user interest analysis, network behavior analysis and sentiment and semantic analysis.

  Data mining is the process of extracting hidden, previously unknown but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy and random data drawn from practical applications.

  Data mining involves many technologies and can be classified in several ways. By mining task, it can be divided into classification or prediction model discovery, data summarization, clustering, association rule discovery, sequential pattern discovery, dependency or dependency-model discovery, anomaly detection and trend discovery. By mining object, it can be divided into relational databases, object-oriented databases, spatial databases, temporal databases, text data sources, multimedia databases, heterogeneous databases, legacy databases and the Web. By mining method, it can be roughly divided into machine learning methods, statistical methods, neural network methods and database methods.

  Machine learning methods can be subdivided into inductive learning (decision trees, rule induction, etc.), example-based learning and genetic algorithms. Statistical methods can be subdivided into regression analysis (multiple regression, autoregression, etc.), discriminant analysis (Bayesian classifiers, Fisher discriminant analysis, non-parametric discrimination, etc.), cluster analysis (hierarchical clustering, dynamic clustering, etc.) and exploratory analysis (principal component analysis, correlation analysis, etc.). Neural network methods can be subdivided into feed-forward neural networks (the BP algorithm, etc.) and self-organizing neural networks (SOM, competitive learning, etc.). Database methods are mainly multidimensional data analysis or OLAP methods, along with attribute-oriented induction.
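  As one concrete instance from the taxonomy above, here is a minimal sketch of an inductive-learning method (a decision tree), assuming scikit-learn is installed; the iris data set is used only because it ships with the library.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Load a small labelled data set and hold out part of it for testing.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Induce a shallow decision tree (a classification / prediction model).
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train, y_train)
    print("test accuracy:", tree.score(X_test, y_test))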

  The main data mining process is: according to the analysis and mining objectives, extract data from the database, organize it through ETL into a wide table suitable for the mining algorithm, and then mine it with data mining software. Traditional data mining software generally supports only small-scale processing on a single machine; constrained by this limitation, traditional data mining analysis usually applies sampling to reduce the size of the data to be analyzed.
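  A minimal sketch of the ETL step described above, assuming pandas; the event log and its columns are hypothetical. Raw event records are reorganized into a wide table with one row per customer, which is the shape most mining algorithms expect.

    import pandas as pd

    # A small raw event log (hypothetical).
    events = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "event":       ["view", "buy", "view", "view", "buy"],
        "amount":      [0.0, 35.0, 0.0, 0.0, 12.5],
    })

    # Wide table: one row per customer, one feature column per event type.
    wide = events.pivot_table(index="customer_id", columns="event",
                              values="amount", aggfunc=["count", "sum"],
                              fill_value=0)
    print(wide)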

  The computational complexity and flexibility that data mining requires far exceed those of the first two categories. First, because data mining problems are open-ended, mining involves computing large numbers of derived variables, and the variability of those derived variables makes data preprocessing computationally complex. Second, many data mining algorithms are themselves computationally intensive; in particular, many machine learning algorithms must iterate repeatedly to find an optimal solution, such as the K-means clustering algorithm and the PageRank algorithm.
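  To make the iterative nature of such algorithms concrete, here is a minimal K-means sketch in plain NumPy (not a production implementation): each pass re-assigns points to their nearest center and then recomputes the centers, repeating until the result stabilizes.

    import numpy as np

    def kmeans(points, k, iterations=20, seed=0):
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iterations):                     # iterative refinement
            # Assignment step: each point joins the cluster of its nearest center.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its assigned points.
            centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        return labels, centers

    # Two made-up clusters of two-dimensional points.
    rng = np.random.default_rng(1)
    points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels, centers = kmeans(points, k=2)
    print(centers)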

  From the perspective of mining methods and mining tasks, the focus is on breakthroughs in the following areas:

  1) Visual analysis. Data visualization is the most basic requirement for ordinary users and data analysts alike. Visualizing data lets the data speak for itself and lets users perceive the results intuitively.

  2) Data mining algorithms. If visualization translates data into pictures people can read, data mining is the machine's mother tongue. Segmentation, clustering, outlier analysis and a wide variety of other algorithms let us refine value from all kinds of data. These algorithms must be able to cope with huge volumes of data and must also run at high speed.

  3) Predictive analysis. Predictive analytics lets analysts make forward-looking judgments based on the results of visual analysis and data mining.

  4) Semantic engines. A semantic engine must be designed with enough artificial intelligence to extract information from data proactively. Related language processing technologies include machine translation, sentiment analysis, public opinion analysis, intelligent input and question answering systems.

  5) Data quality and data management. Data quality and data management are management best practices; processing data through standardized processes and tools ensures a predefined quality of analytical results.
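  For item 5), a preset, repeatable quality check can be as simple as validating every incoming batch against the same rules before it is analyzed. A minimal sketch in plain Python (the rules and field names are made up):

    def quality_report(records):
        """Return a list of (record index, problem) pairs for a batch."""
        issues = []
        for i, r in enumerate(records):
            if r.get("id") is None:
                issues.append((i, "missing id"))
            if not (0 <= r.get("age", -1) <= 120):
                issues.append((i, "age out of range"))
        return issues

    batch = [{"id": 1, "age": 34}, {"id": None, "age": 29}, {"id": 3, "age": 400}]
    print(quality_report(batch))   # -> [(1, 'missing id'), (2, 'age out of range')]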

  Seven secrets of successful predictive analysis

  Predicting the future has always been a risky proposition. Fortunately, the emergence of predictive analytics allows users to forecast future outcomes from historical data and analysis techniques (such as statistical modeling and machine learning), making predictions and trend forecasts far more reliable than they were only a few years ago.
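  In its simplest form, this means fitting a statistical model to historical data and extrapolating it forward. A minimal sketch using NumPy (the monthly figures are made up; real predictive analytics would use far richer models and data):

    import numpy as np

    history = np.array([120, 135, 160, 158, 190, 240])   # past monthly sales
    t = np.arange(len(history))

    # Fit a simple linear trend to the historical values ...
    slope, intercept = np.polyfit(t, history, deg=1)

    # ... and extrapolate it one step into the future.
    next_month = slope * len(history) + intercept
    print(f"predicted next value: {next_month:.1f}")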

  Nevertheless, as with any emerging technology, realizing the full potential of predictive analytics is difficult. What can make the challenge even harder is that inaccurate or misleading results caused by flawed strategies or the misuse of predictive analysis tools may not surface for weeks, months or even years.

  Predictive analytics has the potential to revolutionize many industries and businesses, including retail, manufacturing, supply chain, network management, financial services and healthcare. Bob Friday, co-founder and chief technology officer of AI networking company Mist Systems, predicts: "Deep learning and predictive AI technologies will change every part of our society, just as the Internet and cellular technology have done over the past ten years."

  Here are seven suggestions to help your organization take full advantage of its predictive analysis program.

  1. Obtain high-quality, well-understood data

  Predictive analytics applications require large amounts of data and rely on the information provided by feedback loops to keep improving. Soumendra Mohanty, chief data and analytics officer at global IT solutions and services provider Infotech, commented: "Data and predictive analytics reinforce each other."

  Understanding the type of data flowing into the predictive model is very important. "What kind of data will a person generate?" asked Eric Feigl-Ding, an epidemiologist, nutritionist and health economist who is currently a visiting scientist at the Harvard T.H. Chan School of Public Health. "Is it real-time data collected every day on Facebook and Google, or medical records that are required but hard to access?" To make accurate predictions, the model must be designed to handle the specific type of data it ingests.

  Simply throwing large amounts of data and computing resources at predictive modeling is doomed to fail. "Because so much data is available, most of it may not be relevant to a particular question, yet it may appear correlated in a given sample," explained Henri Waelbroeck, vice president and research director of portfolio management and trading solutions at FactSet, a financial data and software company. "If you do not understand the process that generated the data, a model trained on biased data may be completely wrong."

  2. Find the right model

  Richard Mooney, senior product manager for analytics at SAP, pointed out that everyone is obsessed with algorithms, but an algorithm is only as good as the data fed into it. "If you cannot find a suitable model, then they are useless," he wrote. "Most data sets have their hidden patterns."

  Patterns are usually hidden in two ways:

  1) A pattern in the relationship between two columns. For example, a pattern might be discovered by comparing deal expiration-date data with the open rate of related e-mails. Mooney said: "If a deal is about to close, the open rate of e-mails should rise substantially, because many people on the buyer's side will need to read and review the contract."

  2) A pattern in how the relationship between variables changes over time. "In the example above, knowing that a customer has opened 200 e-mails in total is not as useful as knowing they opened 175 of them in the last week," Mooney said.
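  A minimal sketch of this second kind of pattern, assuming pandas (the open events and dates are made up): the total number of opens is computed alongside the number of opens in the most recent week, so the time dimension is not lost.

    import pandas as pd

    opens = pd.DataFrame({
        "customer": ["a"] * 6,
        "opened_at": pd.to_datetime([
            "2024-01-02", "2024-02-10", "2024-03-05",
            "2024-03-28", "2024-03-30", "2024-03-31",
        ]),
    })

    now = pd.Timestamp("2024-04-01")
    total = opens.groupby("customer").size()
    recent = (opens[opens["opened_at"] >= now - pd.Timedelta(days=7)]
              .groupby("customer").size())
    print(pd.DataFrame({"total_opens": total, "opens_last_week": recent}))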

  3. Focus on manageable tasks that can yield a positive return on investment

  Michael Urmeneta, director of analytics and business intelligence at the New York Institute of Technology, said: "Today, people want to apply machine learning algorithms to massive amounts of data in order to obtain deeper insights." The problem with this approach, he said, is that it is like trying to cure every form of cancer at once. Urmeneta explained: "That makes the problem too big and the data too messy; there is not enough money and not enough support, so it is impossible to succeed."

  When the task is narrowly focused, the likelihood of success is much greater. Urmeneta pointed out: "If there is a problem, we can reach experts who understand the complex relationships involved." "This way, we are likely to have a better or clearer understanding of the data and its processing."

  4. Use the right method for the job

  The good news is that there are almost countless methods for producing accurate predictive analytics. That, however, is also the bad news. Angela Fontes, director of the behavioral and economic analysis and decision-making practice at NORC at the University of Chicago (formerly the National Opinion Research Center), said: "New and popular analysis methods appear every day, and it is very easy to get excited about using the newest one." "However, in my experience, the most successful projects are those that think hard about the results they need and let that guide their choice of method, even if the most appropriate method is not the sexiest or latest one."

  Shanchieh Jay Yang, associate professor and head of the Department of Computer Engineering at the Rochester Institute of Technology, suggested: "Users must carefully select the methods appropriate to their needs." "You need an efficient and interpretable technique, one that uses the sequential data and the statistical properties of the time data and then extrapolates to the most likely future," Yang said.

  5. Build models with a well-defined goal

  This may seem obvious, but many predictive analytics projects begin by building a grand model without a clear plan for how it will ultimately be used. "There are plenty of great models that are never used, because no one knows how to use them to realize or provide value," commented Jason Verlen, senior vice president of product management at CCC Information Services, a SaaS provider for the automotive, insurance and collision repair industries.

  Fontes agreed. "Using the right tool certainly helps ensure we get usable results from the analysis... because it forces us to be very clear about our goals," she explained. "If we are not clear about the goal of the analysis, we will never really get what we want."

  6. Establish a close working relationship between IT and the relevant business units

  Building strong partnerships between business and technology organizations is essential. Paul Lasserre, vice president of product management for artificial intelligence at customer experience technology provider Genesys, said: "You should understand how a new technology can address a business challenge or improve an existing business environment." Then, once the goal is set, you can test the model in a limited-scope application to determine whether the solution actually delivers the required value.

  7. Do not be misled by poorly designed models

  Models are designed by people, so they often contain potential pitfalls. The wrong model, or a model built on incorrect or inappropriate data, can be very misleading and, in extreme cases, can produce completely wrong predictions.

  Selection bias from inadequate randomization can confound predictions. For example, in a hypothetical diet study, 50% of participants might drop out before the follow-up weight measurement, and those who drop out have a different weight trajectory from those who stay. This makes the analysis harder because, in such a study, the people who stick with the program are usually the ones who are really losing weight, while those who experience little or no weight loss are the ones most likely to quit. So even if weight loss in the full population were causal and predictable, in a limited database with a 50% drop-out rate the true weight-loss effect may be hidden.
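  The dropout effect can be made concrete with a small simulation (all numbers are made up): every participant truly loses about 2 kg on average, but because the people who lose the least are the most likely to quit before follow-up, the average computed only over completers looks much better than the truth.

    import numpy as np

    rng = np.random.default_rng(0)
    true_loss = rng.normal(loc=2.0, scale=3.0, size=10_000)   # kg lost, everyone

    # Staying in the study is more likely the more weight was actually lost,
    # so roughly half of the participants drop out before follow-up.
    stay_prob = 1 / (1 + np.exp(-(true_loss - 2.0)))
    stayed = rng.random(10_000) < stay_prob

    print("true average loss, whole group:  %.2f kg" % true_loss.mean())
    print("observed loss, completers only:  %.2f kg" % true_loss[stayed].mean())
    print("drop-out rate: %.0f%%" % (100 * (1 - stayed.mean())))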

  Fifth, big data presentation and application technology

  Big data technology can dig out the information and knowledge hidden in massive data, providing a basis for human social and economic activities, thereby improving operational efficiency in every field and greatly increasing the intensiveness of the economy as a whole.

  In China, big data will be applied mainly in three areas: business intelligence, government decision-making and public services. Examples include: business intelligence technology; government decision-making technology; telecommunications data and information processing and mining technology; Internet data and information processing and mining technology; meteorological information analysis technology; environmental monitoring technology; police-cloud applications (road monitoring, video surveillance, network monitoring, intelligent transportation, anti-telecom-fraud, command and dispatch, and other public security information systems); large-scale gene sequence analysis and comparison technology; Web information mining technology; multimedia data parallel processing technology; film and television production and rendering technology; and massive data processing applications of cloud computing in various other industries.
