DAMA Data Management Body of Knowledge (DMBOK) - Reading Notes 14

Chapter 14 Big Data and Data Science

I. Introduction

Big data refers not only to the volume of data, but also to its variety (structured and unstructured data: documents, files, audio, video, streaming data, etc.) and to the speed at which the data is generated.

Data scientists combine methods from mathematics, statistics, computer science, signal processing, probability modeling, pattern recognition, machine learning, uncertainty modeling, and data visualization to predict behavior from large data sets and obtain more information.

1.1 Business drivers

Big data can spark innovation by making more, and larger, data sets available for exploration. Those data sets can be used to define predictive models that anticipate customer needs and enable personalized presentation of products and services. Machine learning algorithms can automate complex, time-consuming activities, improving organizational efficiency, cutting costs, and reducing risk.

1.2 Basic concepts

1.2.1 Data Science

Data science merges data mining, statistical analysis, and machine learning with data integration and data modeling capabilities to build predictive models and explore patterns in data content.

1.2.2 The process of data science

  • Define a big data strategy and business needs. Define measurable requirements that will produce tangible benefits.
  • Select data sources. Identify gaps in the current data asset base and find data sources to fill them.
  • Collect and extract the data. Collect the data and load it for use.
  • Develop data hypotheses and methods. Explore data sources via profiling, visualization, and mining. Define the inputs to model algorithms, the model types, or the model design and analysis methods.
  • Integrate and align data for analysis. A model's feasibility depends in part on the quality of the source data. Use reliable sources and apply appropriate data integration and cleansing techniques to improve the quality and usability of the prepared data sets.
  • Explore the data using models. Apply statistical analysis and machine learning algorithms to the integrated data to validate, train, and evolve models over time.
  • Deploy and monitor. Models that produce useful information can be deployed to production for ongoing monitoring of their value and effectiveness (a minimal end-to-end sketch follows this list).
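
As a rough illustration of this workflow, the sketch below walks from data collection through model training to a monitoring hook. It uses scikit-learn and a synthetic data set, both of which are assumptions for illustration only; the text does not prescribe any particular tool.

```python
# A minimal, hypothetical walk-through of the process steps above:
# acquire data, hold some out, fit a model, and track a quality metric.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# "Collect and extract data": a synthetic stand-in for a real source.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)

# "Integrate and align data for analysis": hold out data for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# "Explore the data using models": fit a simple predictive model.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# "Deploy and monitor": track a quality metric over time.
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```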

1.2.3 Big data

  • Volume (large amount of data). Big data often has thousands of entities or elements in billions of records.
  • Velocity (data updates quickly). Refers to the speed at which data is captured, generated, or shared.
  • Variety/variability (diverse data types). Refers to the forms in which data is captured or delivered. Big data requires storage in multiple formats.
  • Viscosity (data is hard to use). Refers to how difficult the data is to use or integrate.
  • Volatility (data changes frequently). Refers to how often the data changes and, consequently, how short its period of validity is.
  • Veracity (lower accuracy). Refers to the data being less reliable.

1.2.4 Data Lake

A data lake is an environment in which massive amounts of data of different types and structures can be ingested, stored, assessed, and analyzed. It can serve several purposes, for example:

  • An environment where data scientists can mine and analyze data.
  • A central storage area for raw data, requiring only minimal (if any) transformation.
  • Alternate storage for detailed historical data from the data warehouse.
  • An online archive for records.
  • An environment for ingesting streaming data, with patterns identified through automated models.

A data lake can be implemented as a configuration of data handling tools such as Hadoop or other data storage systems, cluster services, data transformation, and data integration.

1.2.5 Service-based architecture

Service-based architecture (SBA) is emerging as a way to provide data immediately while using the same source to update a complete, accurate historical data set. SBA is somewhat similar to data warehouse architectures that send data to an operational data store component.

  • Batch processing layer. The data lake serves as the batch layer, containing both recent and historical data.
  • Acceleration layer. Contains only real-time data.
  • Service layer. Provides an interface that joins data from the batch and acceleration layers.

1.2.6 Machine Learning

Machine learning explores the construction and study of learning algorithms. It can be viewed as the union of unsupervised and supervised learning methods. Unsupervised learning is often equated with data mining, while supervised learning is based on complex mathematical theory, especially statistics, combinatorics, and operations research. A third branch, reinforcement learning, is taking shape, in which a goal is optimized without explicit approval from a teacher, as in learning to drive a vehicle. Programming machines to learn quickly from queries and adapt to changing data sets has introduced a field within big data called machine learning.
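
To make the supervised/unsupervised distinction concrete, the sketch below fits an unlabeled clustering model and a labeled classifier on the same toy data. scikit-learn and the synthetic blobs are illustrative assumptions, not something the text specifies.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans                   # unsupervised: labels not used
from sklearn.linear_model import LogisticRegression  # supervised: learns from labels

X, y = make_blobs(n_samples=300, centers=3, random_state=7)

# Unsupervised learning: discover structure without looking at the labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

# Supervised learning: learn a mapping from features to the known labels.
classifier = LogisticRegression(max_iter=1_000).fit(X, y)

print("cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
print("training accuracy:", classifier.score(X, y))
```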

1.2.7 Semantic analysis

Media monitoring and text analysis are automated methods for deriving insight from large volumes of unstructured or semi-structured data to gauge how people feel and what they think about a brand, product, service, or other kind of topic. Natural language processing (NLP) is used to analyze phrases or sentences, detect sentiment semantically, and reveal changes in sentiment in order to predict possible scenarios.
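
As one hedged example of sentiment scoring, the sketch below uses NLTK's VADER lexicon, a common rule-based choice; this is an illustrative assumption, not the method prescribed by the text, and the sample sentences are hypothetical.

```python
# Sketch of rule-based sentiment scoring with NLTK's VADER lexicon.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

for text in ["The new release is fantastic!",
             "Support never answered my ticket, very disappointing."]:
    scores = analyzer.polarity_scores(text)  # neg/neu/pos/compound scores
    print(f"{scores['compound']:+.2f}  {text}")
```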

1.2.8 Data and text mining

Data mining is a key activity during the exploration phase: it helps quickly identify the data elements to be studied, reveals relationships that were previously unknown, unclear, or unclassified, and provides a classification structure for the data elements under study. Text mining uses text analysis and data mining techniques to analyze documents and automatically classify their content into workflow-oriented and domain-expert-oriented knowledge ontologies. Data and text mining use a range of techniques, including:

  • Profiling. Profiling attempts to describe the typical behavior of an individual, group, or population. It is used to establish behavioral norms for anomaly-detection applications such as fraud detection and computer intrusion detection. Profiling results are inputs to many unsupervised learning components.
  • Data reduction. Data reduction replaces a large data set with a smaller one that retains most of the important information of the larger set. The smaller data set may be easier to analyze or process.
  • Association. Association is an unsupervised learning process that studies the elements involved in transactions to find correlations between them. Examples include frequent-itemset mining, rule discovery, and market basket analysis; recommendation systems on the Internet also use this process (a small frequent-pair sketch follows this list).
  • Clustering. Clustering groups data elements into distinct clusters based on their shared characteristics. Customer segmentation is an example of clustering.
  • Self-organizing maps. Self-organizing maps are a neural network method of cluster analysis, similar to multidimensional scaling. Reducing dimensionality is like removing a variable from an equation without affecting the result: it makes the problem easier to solve and the data easier to present.
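
The sketch below illustrates the association technique with a toy market basket analysis in plain Python: it counts item pairs that frequently occur together. The baskets and support threshold are hypothetical.

```python
# A toy frequent-pair miner for market basket analysis (no library assumptions).
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs", "beer"},
    {"bread", "milk", "beer"},
]

pair_counts = Counter()
for basket in baskets:
    # Count every unordered pair of items bought together.
    pair_counts.update(combinations(sorted(basket), 2))

min_support = 0.5  # a pair must appear in at least half of the baskets
for pair, count in pair_counts.most_common():
    support = count / len(baskets)
    if support >= min_support:
        print(pair, f"support={support:.2f}")
```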

1.2.9 Predictive analysis

Predictive analytics is a sub-field of supervised learning in which users attempt to model data elements and predict future outcomes by evaluating probability estimates. Rooted in statistics, it shares many components with unsupervised learning, the prescribed difference being that results are measured against expected predicted outcomes.
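
Because predictive analytics evaluates probability estimates rather than just labels, the sketch below trains a supervised model and reads out predicted probabilities. The model choice, the library (scikit-learn), and the synthetic data are assumptions for illustration.

```python
# Sketch: a supervised model that outputs probability estimates for outcomes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2_000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
outcome_probability = model.predict_proba(X_test)[:, 1]  # P(outcome = 1)
print("first five probability estimates:", outcome_probability[:5].round(3))
```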

1.2.10 Prescriptive analytics

Prescriptive analytics goes one step further than predictive analytics by defining actions that will affect outcomes, rather than just predicting outcomes based on actions that have already occurred. Prescriptive analysis predicts what will happen, when it will happen, and suggests why it will happen. Because prescriptive analytics can show the implications of various decisions, it can suggest how to exploit opportunities or avoid risks. Prescriptive analytics is constantly ingesting new data to re-forecast and re-prescribe. This process improves forecast accuracy and provides better scenarios.

1.2.11 Unstructured data analysis

Unstructured data analysis combines text mining, association analysis, cluster analysis, and other unsupervised learning techniques to process large data sets. Supervised learning techniques can also be used to provide direction and oversight during the process, with human intervention to resolve ambiguity when necessary.

Scanning and tagging is one way to add "hooks" to unstructured data so that related structured data can be linked and filtered. Knowing which tags to generate under which conditions is difficult, and it is an iterative process: proposed tag conditions are identified, tags are assigned as data is ingested, and the tagged data is then analyzed to validate the tag conditions. This may lead to changes in the tag conditions or to additional tags.
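
A minimal sketch of such tagging follows, using keyword conditions over free text. The tag names, patterns, and sample document are hypothetical; in practice the conditions would be reviewed and refined iteratively as described above.

```python
# Toy "scanning and tagging": apply keyword-based tag conditions to
# unstructured text so related structured data can later be linked/filtered.
import re

TAG_CONDITIONS = {
    "billing": re.compile(r"\b(invoice|refund|charge)\b", re.IGNORECASE),
    "outage":  re.compile(r"\b(down|outage|unavailable)\b", re.IGNORECASE),
}

def tag_document(text: str) -> list[str]:
    """Return the tags whose conditions match the text."""
    return [tag for tag, pattern in TAG_CONDITIONS.items() if pattern.search(text)]

print(tag_document("Customer reports the portal is down and wants a refund."))
# -> ['billing', 'outage']
```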

1.2.12 Operational analysis

Operational analytics includes user segmentation, sentiment analysis, geocoding, and other techniques applied to data sets for marketing campaign analysis, sales breakthroughs, product promotion, asset optimization, and risk management.

Operational analytics includes tracking and integrating real-time information flows, drawing conclusions based on behavioral predictive models, and triggering automated responses and alerts. Designing the models, triggers, and responses required for successful analysis requires more analysis of the data itself. Operational analytics solutions include the preparation of multi-behavioral models pre-populated with required historical data.

1.2.13 Data visualization

Data visualization helps understand underlying data through a visual overview. Data visualization compresses and encapsulates characteristic data, making it easier to view. In this way, it helps to discover business opportunities, identify risks or highlight information.

Data visualizations can be delivered in static formats or in more interactive online formats. Some support interaction with the end user, where drill-down or filtering capabilities facilitate analysis of the data within the visualization; others allow users to change the display as needed through innovative presentation methods.
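
As a small example of a static deliverable, the sketch below plots a trend and highlights an anomaly with matplotlib. The tool choice and the monthly order figures are assumptions for illustration; interactive formats would use different tooling.

```python
# Minimal static visualization: a trend line with an annotated anomaly.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
orders = [120, 135, 128, 260, 150, 158]   # hypothetical monthly order counts

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(range(len(months)), orders, marker="o")
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.annotate("anomaly", xy=(3, 260), xytext=(1, 240),
            arrowprops={"arrowstyle": "->"})
ax.set_title("Monthly orders (illustrative data)")
ax.set_ylabel("orders")
fig.tight_layout()
fig.savefig("monthly_orders.png")  # static format for distribution
```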

1.2.14 Data mashup

Data mashups combine data and services to visually display insights or analysis results. This technique is easily applied on the web, and secure data mashup technology enables the sharing of personal or confidential information across suppliers or providers. Mashups can be combined with artificial intelligence learning algorithms to deliver Internet-based public services through natural language interfaces.

2. Activities

2.1 Define big data strategy and business needs

A big data strategy must include the following evaluation criteria:

  • What problem the organization is trying to solve, and what it needs to analyze. The organization can decide to use the data to understand the business or the business environment, to prove an idea about the value of a new product, to explore the unknown, or to invent a new way of doing business.
  • What data sources to use or acquire. Internal sources may be readily available but limited in scope; external sources may be useful but lie outside the organization's control.
  • The timeliness and scope of the data to provision. There is a large difference between algorithms that compute on data at rest and algorithms that compute on streaming data; low-latency data is desirable but often comes at the expense of machine learning capability. Establish the minimum level of integration required to meet downstream data consumption requirements.
  • Effects on and dependencies on other data structures. It may be necessary to modify the structure or content of other data structures to make them suitable for big data integration.
  • Impact on existing modeling data. Includes expanded knowledge of customers, products, or marketing methods.

2.2 Select data source

Big data environments capture large amounts of data quickly, and that data requires ongoing management over time. This requires understanding the following basic facts:

  • Data source
  • Data format
  • What the data elements represent
  • How the data connects to other data
  • Data update frequency

The value and reliability of the data need to be assessed:

  • Foundational data. Consider foundational data components, such as point-of-sale (POS) data in a sales analysis.
  • Granularity. Ideally, obtain data in its most granular form so that it can be aggregated for a variety of purposes.
  • Consistency. If possible, select data that can be presented appropriately and consistently, within visualization and other cognitive limits.
  • Reliability. Choose data sources that are stable and reliable over time; use trusted, authoritative sources.
  • Inspect/profile new data sources. Changes need to be tested before new data sets are added.

2.3 Acquire and ingest data sources

Iteratively identify gaps in the current data asset base and the data sources that fill them, then explore those sources using profiling, visualization, mining, or other data science methods to define model algorithm inputs or model hypotheses.

2.4 Develop data hypotheses and methods

Developing data science solutions involves building statistical models that find correlations and trends within and between data elements and data sets. A question may have multiple answers depending on the inputs to the model.

2.5 Integrate and align data for analysis

Preparing data for analysis includes understanding what is in the data, finding links between data from the various sources, and aligning common data for use. In the initial stages, the data is typically examined to understand how it can be analyzed: clustering helps determine groupings of the data output, and other methods find correlations that will be used to build the model and display results. Using these techniques early helps shape how the model will present results once it is published.
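
A minimal sketch of this alignment step follows, using pandas to standardize keys, join two hypothetical sources (a CRM extract and an order feed), and clean obvious gaps before analysis. The column names and values are assumptions for illustration.

```python
# Sketch of aligning data from two hypothetical sources before analysis.
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "region": ["north", "South", None]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 4],
                       "amount": [100.0, 250.0, None, 75.0]})

# Align naming and formats across sources.
orders = orders.rename(columns={"cust_id": "customer_id"})
crm["region"] = crm["region"].str.lower().fillna("unknown")

# Integrate, then handle missing values surfaced by the join.
combined = crm.merge(orders, on="customer_id", how="left")
combined["amount"] = combined["amount"].fillna(0.0)

print(combined)
```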

2.6 Use models to explore data

2.6.1 Populating the predictive model

Configuring a predictive model includes pre-populating it with historical information about the customer, market, product, or other factors included in the model beyond the triggering factor. Pre-population calculations are usually performed in advance so that the model can respond to triggering events as quickly as possible.

2.6.2 Training the model

The model must be trained against the data. Training involves repeated runs of the model against the data to verify assumptions, and it will result in changes to the model. Training requires balance: avoid overfitting by training against a limited data fold.

Model validation must be completed before moving to production. Address any population imbalance or data bias with model offsets that are trained and validated; these can be tweaked in production as the initial offset is gradually adjusted through actual population data. Feature-mix optimization can be achieved through Bayesian co-selection, classifier inversion, or rule induction. Models can also be combined in ensemble learning, where a stronger predictive model is built by combining simpler models. Identifying outliers or anomalies is critical to evaluating the model.
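
The sketch below shows one way to keep that balance: hold out a validation set, cross-validate to guard against overfitting, and offset class imbalance. scikit-learn, the synthetic imbalanced data, and the class-weight approach are illustrative assumptions, not the techniques named in the text.

```python
# Sketch of the train/validate balance: holdout data, cross-validation,
# and a simple offset for population imbalance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# A synthetic, imbalanced population: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=1)

# class_weight="balanced" is one simple way to address population imbalance.
model = LogisticRegression(max_iter=1_000, class_weight="balanced")

# Repeated evaluation on different folds helps detect overfitting.
print("cross-validated accuracy:",
      cross_val_score(model, X_train, y_train, cv=5).mean())
print("holdout accuracy:", model.fit(X_train, y_train).score(X_valid, y_valid))
```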

2.6.3 Evaluating the model

Once the data is loaded onto the platform and ready for analysis, the data science begins. The model is built, evaluated, and validated against the training sets. Data scientists run queries and algorithms against the data to see whether any insights emerge, often trying many different mathematical functions to see whether anything useful is found. During this period they frequently discover new insights across iterations; through these processes, models are developed that reveal correlations between data elements as well as insights.

Data science practice has an ethical component that must be applied when evaluating models: models may produce unexpected results or unintentionally reflect the assumptions and biases of the people who build them.

2.6.4 Creating data visualizations

Data visualizations built for a model must address specific needs related to the purpose of the model, and each visualization should answer a question or provide an insight. Establish the purpose and parameters of the visualization: point-in-time status, trends versus exceptions, relationships between moving parts, geographical differences, and so on.

Choose the visual form that best serves the purpose and make sure the visualization meets the needs of its audience; adjust layout and complexity to highlight and simplify accordingly. Not all audiences are ready for complex interactive charts, so the visualization should be supported in a form the audience can absorb.

Visualizations should take the form of storytelling. Data "storytelling" links new questions to the context of the data exploration; the best results come from pairing the data story with the relevant data visualizations.

2.7 Deployment and monitoring

Models that meet business needs must be deployed into production in a feasible way for continuous monitoring. Models can provide batch processing as well as real-time consolidated messages, and they can also be embedded in analytics software as input to decision management systems, historical analysis, or performance management dashboards.

2.7.1 Revealing insights and discoveries

Presenting findings and data insights through data visualization is the final step of a data science effort. Insights should be connected to action items so that the organization benefits from the data science work. New relationships can be explored through data visualization techniques, and as models are used, changes in the underlying data and data relationships may surface, telling new stories about the data.

2.7.2 Iterating with additional data sources

Demonstrating discoveries and data insights often leads to new questions, which in turn trigger new research processes. Data science is an iterative process, so big data development requires iterative support. The process of learning from a specific set of data sources often results in the need for different or additional data sources to support the conclusions drawn and to add insights to existing models.

3. Tools

Massively parallel processing (MPP) provides the means to analyze huge amounts of information in a relatively short period of time. Other technologies that are changing the way we view data and information include:

  • Advanced in-database analytics
  • Unstructured data analytics (Hadoop, MapReduce)
  • Integration of analysis results with operational systems
  • Data visualization across multiple media and devices
  • Semantics for linking structured and unstructured information
  • New data sources from the Internet of Things
  • Advanced visualization capabilities
  • Data expansion capabilities
  • Collaboration among technologies and toolsets

3.1 MPP shared nothing technology and architecture

The shared-nothing database technology of massively parallel processing (MPP) has become the standard platform for data science analysis of big data sets. In an MPP database, data is partitioned across multiple processing servers, each with its own dedicated memory for processing local data. Communication between processing servers is controlled by a management node and occurs over a network interconnect. Because there is no disk sharing and no memory contention, the architecture is called "shared-nothing".

3.2 Distributed file-based database

Distributed file solutions, such as the open-source Hadoop, are an inexpensive way to store huge amounts of data in different formats. Hadoop stores files of any type: structured, semi-structured, and unstructured. Files are spread across processing servers in a configuration similar to MPP shared-nothing. Because of its relatively low cost, Hadoop has become the first choice for many organizations. The processing model used in file-based solutions is called MapReduce, which has three main steps (a toy illustration follows the list):

  • Map. Identify and obtain the data to be analyzed.
  • Shuffle. Combine the data according to the desired analytical pattern.
  • Reduce. Remove duplicates or perform aggregations so that the resulting data set is only as large as needed.
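
The word-count sketch below runs the three steps in a single Python process purely to illustrate the flow; a real MapReduce job would distribute each step across a Hadoop cluster, and the documents here are hypothetical.

```python
# A toy, single-process illustration of map, shuffle, and reduce.
from collections import defaultdict

documents = ["big data big value", "data science uses big data"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate each group down to only what is needed.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 3, 'value': 1, ...}
```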

3.3 In-database algorithm

In-database algorithms build on MPP principles: because each processor in an MPP shared-nothing architecture can run queries independently, analytical processing can be pushed down to the compute-node level. In-database implementations provide mathematical and statistical functions, and open-source libraries of scalable in-database algorithms are available for machine learning, statistics, and other analytical tasks.

3.4 Big data cloud solution

Vendors provide cloud storage and integration capabilities for big data, including analytics. Based on defined standards, customers load data into the cloud environment. Vendors enhance the data through open data sets or those provided by other organizations. Customers can use combined datasets for analytics and data science activities.

3.5 Statistical Computing and Graphical Languages

The R language is an open source scripting language and environment for statistical computing and graphics. It provides a wide variety of statistical techniques such as linear and nonlinear modeling, classical statistical tests, time series analysis, classification and clustering.

3.6 Data Visualization Toolset

Advanced visualization and discovery tools use in-memory architectures that let users interact with data and reveal patterns that are difficult to identify in large data sets. Sophisticated displays can expose visual patterns quickly even when thousands of data points are loaded.

Many toolsets now support information visualization methods such as radar plots, parallel coordinate plots, label plots, heat maps, and data maps. Compared with traditional visualization tools, these tools have the following advantages:

  • A rich set of analysis and visualization types, such as trellis charts, sparklines, heat maps, histograms, waterfall charts, and bullet charts.
  • Built-in visualization best practices.
  • Interactivity enables visual discovery.

4. Method

4.1 Analytical modeling

Analytical models are associated with different depths of analysis:

  • Descriptive modeling summarizes or represents data structures in a compact manner. This approach does not always verify causal hypotheses or predict outcomes, but it does enable the use of algorithms to define or improve relationships between variables, thereby providing input for such analysis.
  • Explanatory modeling is the application of statistical models to data, primarily to test causal hypotheses about theoretical constructs. Although it uses techniques similar to data mining and predictive analytics, its purpose is different: it does not predict outcomes, but matches model results to existing data (a minimal contrast of the two approaches is sketched below).
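
The sketch contrasts the two depths on the same synthetic data: a compact descriptive summary, then an explanatory regression whose coefficient p-values test a hypothesis. pandas, statsmodels, and the ad_spend/sales variables are illustrative assumptions.

```python
# Descriptive summary vs. explanatory model on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"ad_spend": rng.uniform(0, 100, 200)})
df["sales"] = 5 + 0.8 * df["ad_spend"] + rng.normal(0, 10, 200)

# Descriptive modeling: summarize the structure of the data compactly.
print(df.describe())
print("correlation:\n", df.corr())

# Explanatory modeling: fit a statistical model and test the hypothesis
# that ad_spend has no effect (p-value on its coefficient).
ols = sm.OLS(df["sales"], sm.add_constant(df["ad_spend"])).fit()
print(ols.summary().tables[1])   # coefficients, standard errors, p-values
```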

4.2 Big data modeling

The main driver for physically modeling a data warehouse is to enable data population for query performance. The value of data modeling lies in its ability to make the content of the data understandable. Applying proven data modeling techniques means considering the variety of sources at the same time and developing a subject area model, at least in a summarized way.

5. Implementation Guide

Many of the same general rules for managing data warehouse data apply to managing big data: ensuring the data source is reliable, having sufficient metadata to support data usage, managing data quality, determining how to integrate data from disparate sources, and ensuring the data is secure and protected. The differences in implementing a big data environment relate to a set of unknown issues: how the data is used, which data is valuable, and how long it needs to be retained.

5.1 Strategic consistency

Any big data or data science project should be strategically aligned with organizational goals. The strategy documents the objectives, approach, and governance principles. Leveraging big data requires building organizational skills and capabilities; use capability management to align business and IT plans and to develop a roadmap. Strategy deliverables should address managing the following elements:

  • Information lifecycle
  • Metadata
  • Data quality
  • Data collection
  • Data access and security
  • Data governance
  • Data privacy
  • Learning and adoption
  • Operations

5.2 Readiness Assessment/Risk Assessment

Assess organizational readiness relative to critical success factors, including:

  • Business relevance. How well do the big data/data science initiatives and their corresponding use cases align with the company's business? To succeed, they must strongly support business functions and processes.
  • Business readiness. Are the business partners prepared for long-term incremental delivery? Have they committed to building a center of excellence to support the product in future releases? How large is the average knowledge or skill gap in the target group, and can it be closed within a single increment?
  • Economic feasibility. Does the proposed solution conservatively account for both tangible and intangible benefits? Has the cost of ownership been assessed for buying or leasing the solution versus building it from scratch?
  • Prototype. Can a prototype solution be provided to a small group of end users for a limited time to demonstrate the proposed value?
  • Perhaps the most challenging decisions will be around data procurement, platform development, and resourcing:
  • There are many sources and stores of digital data, and not all of them need to be owned and operated in-house.
  • There are many tools and techniques on the market; matching them to general needs will be a challenge.
  • Securing staff with specialized skills in a timely manner and retaining top talent during implementation may require consideration of alternatives, including professional services, cloud sourcing, or collaboration.
  • Building internal talent may take longer than the delivery window allows.

5.3 Organizational and cultural changes

Like DW/BI, a big data implementation will bring together many key cross-functional roles, including:

  • Big data platform architect. Hardware, operating systems, file systems and services.
  • Data Ingestion Architect. Data analysis, system logging, data modeling and data mapping. Provide or support mapping sources to Hadoop clusters for query and analysis.
  • Metadata Expert. Metadata interface, metadata schema and content.
  • Analysis and design lead. End-user analytics design, best-practice guidance on implementation using the relevant toolsets, and simplification of end-user result sets.
  • Data scientist. Provides theoretical knowledge grounded in statistics and computability, delivers suitable tools and techniques, and applies them by consulting on architecture and model design for functional requirements.

6. Big data and data science governance

Like other data, big data also requires governance. Sourcing, provenance analysis, extraction, enrichment and publishing processes require business and technical controls to address the following issues:

  • Sourcing. What to source, when to source it, and what the best sources are for a particular study.
  • Sharing. Data sharing agreements and contracts, terms and conditions, both inside and outside the organization.
  • Metadata. What the data means on the source side and how to interpret the results on the output side.
  • Enrichment. Whether to enrich the data, how to do so, and the benefits of enrichment.
  • Access. What to publish, to whom, how, and when.

6.1 Visualization channel management

Depending on the size and nature of the organization, a number of different visualization tools may be applied across various processes. Make sure users understand the relative complexity of the visualization tools; experienced users will have increasingly sophisticated needs. Coordination between enterprise infrastructure, portfolio management, and operations teams is necessary to control visualization channels within and across the portfolio.

6.2 Data Science and Visualization Standards

Best practice is to establish a community that defines and publishes visualization standards and guidance, and reviews work within developed delivery methods. This is particularly important for client-facing and regulatory content. Standards may include:

  • Tool standards for analysis paradigms, user groups, and subject areas
  • Request for new data.
  • Dataset process standards.
  • Use a neutral, professional presentation process to avoid biasing results and to ensure that all elements are handled in a fair and consistent manner, including: data inclusion and exclusion, assumptions in the model, statistical validity of results, and valid interpretation of results, using appropriate methods.

6.3 Data security

Having reliable data protection processes is an organizational asset in itself, and policies for handling and protecting big data should be established and monitored. Consideration should be given to how to prevent the misuse of personal data and protect it throughout its life cycle.

Securely provision appropriate levels of data to authorized personnel, and deliver subscription data according to agreed-upon levels. Align services with user communities so that special services can provision private data to the communities permitted to ingest it while masking the data from everyone else.

Recombination, the ability to reconstruct sensitive or private data, must be measured and managed as part of big data security practices. Analysis results can violate privacy even when the actual data elements can only be inferred. Understanding the consequences at the metadata management level is critical to avoiding this and other potential security breaches.

6.4 Metadata

Metadata needs to be carefully managed as part of data extraction, otherwise the data lake will quickly become a data swamp. Metadata characterizes the structure, content, and quality of data, including the origin of the data, the lineage of the data, the definition of the data, and the intended use of entities and data elements. Technical metadata can be obtained from a variety of big data tools, including data storage layers, data integration, MDM and even source file systems.

6.5 Data quality

Data quality is a measure of deviation from expected results: the smaller the difference, the better the data meets expectations and the higher the quality. An initial evaluation is necessary to understand the data, and from this evaluation, measurements are identified for subsequent instances of the data set. Data quality assessment will produce valuable metadata that will be an essential tool for any integration to consolidate data.

Most mature big data organizations use data quality toolsets to scan data input sources in order to understand the information they contain. Most advanced data quality toolsets provide functionality that enables an organization to test assumptions and build knowledge about its data, such as (a minimal profiling sketch follows this list):

  • Discovery. Where within the data set the information resides.
  • Classification. What types of information are present, based on standardized patterns.
  • Profiling. How the data is populated and structured.
  • Mapping. What other data sets can be matched to these values.
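
The sketch below approximates these checks with a basic pandas profiling pass over a hypothetical data set: structure and completeness per column, plus one classification-style pattern rule. The data, thresholds, and rule are assumptions for illustration.

```python
# A minimal profiling pass: structure, completeness, distinctness,
# and a simple classification-style pattern check.
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 2, None],
                   "email": ["a@x.com", "b@x.com", "b@x.com", "not-an-email"],
                   "balance": [10.5, -3.0, 99.9, 10.5]})

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),   # how the data is structured
    "nulls": df.isna().sum(),         # completeness
    "distinct": df.nunique(),         # candidate keys / duplicates
})
print(profile)

# Classification-style rule: does the column look like an email address?
looks_like_email = df["email"].str.contains(
    r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)
print("rows failing email pattern:", int((~looks_like_email).sum()))
```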

6.6 Metrics

Metrics are critical to any management process: they not only quantify activity, they also define the variance between what is observed and what is expected.

6.6.1 Technical usage indicators

Many big data tools offer insightful administrator reporting capabilities that interact directly with content queried by the user community. Use technical analysis to find data hot spots to manage data distribution and maintain performance. Growth rates also aid in capacity planning.

6.6.2 Loading and scanning indicators

Load and scan metrics define the ingestion rate and the interaction with the user community. When a new data source is ingested, expect load metrics to spike while the source is fully ingested and then level off. Real-time feeds may be served through service queries but may also be processed as scheduled extracts; for these feeds, expect a steadily increasing data load.

The application layer may provide optimal data usage metrics from execution logs. Monitor consumption or access with available metadata, displaying the most frequently occurring query execution plans to guide usage analysis.

6.6.3 Learning and Story Scenarios

To show value, big data and data science projects must measure tangible outcomes that justify the cost of developing the solutions and managing process change. Metrics can include quantification of benefits, cost prevention or avoidance, and the length of time between initiation and realized benefits. Commonly used measurements include:

  • Number and accuracy of developed models.
  • Revenue realized from identified opportunities.
  • Cost reduction from avoiding identified threats.

Origin blog.csdn.net/baidu_38792549/article/details/124978275