Introduction to Big Data Hadoop 01 - Introduction to Big Data

Table of contents

  • Enterprise data analysis direction
  • Basic steps of data analysis
  • Big Data Era
  • Distributed and clustered

What is data

  • Data is the set of symbols that record and identify objective events: physical symbols, or combinations of them, that capture the nature, state, and relationships of objective things. These symbols are identifiable and abstract.
  • Data is not limited to numbers in the narrow sense. It can also be a meaningful combination of words, letters, and numeric symbols, as well as graphics, images, video, and audio. It is an abstract representation of the attributes, quantities, positions, and relationships of objective things. For example, "0, 1, 2, ...", "overcast, rain, fall", and "student records, goods shipments" are all data.

How data is generated

Data is generated whenever objective things are measured and recorded.

1. The direction of enterprise data analysis

  • Data analysis condenses and extracts the information hidden in data, summarizes the underlying patterns of the subject under study, and helps managers make sound judgments and decisions.
  • In the day-to-day operational analysis of an enterprise, data analysis has three main directions:
  • Current status analysis (analysis of current data): the overall situation at the current stage, the proportion, development, and changes of each part;
  • Cause analysis (analysis of past data): why a certain situation occurs, determine the cause, and make adjustments and optimizations;
  • Predictive analysis (combining data to predict the future): combining existing data to predict future development trends.

1.1 Status Analysis

  • Real-time analysis (real-time processing, also called stream processing)

Real-time analysis faces the present and analyzes data as it is generated.
"Real time" means that the interval from data generation through analysis to application is short, down to seconds or even milliseconds.
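
To make "analyzing data as it is generated" concrete, here is a minimal Python sketch of a sliding-window counter. The class name `SlidingWindowCounter`, the one-second window, and the click timestamps are all made up for illustration; a production system would normally rely on a dedicated stream processing engine rather than hand-written code like this.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts the events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the count inside the current window."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Hypothetical click timestamps (in seconds), handled one at a time as they "arrive".
counter = SlidingWindowCounter(window_seconds=1.0)
click_times = [0.10, 0.35, 0.90, 1.20, 2.50]
counts = [counter.add(t) for t in click_times]
print(counts)  # [1, 2, 3, 3, 1]
```

The point of the sketch is that each event updates the result immediately, so the answer is always current instead of waiting for a scheduled job.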

1.2 Cause Analysis

  • Offline analysis (Batch Processing)

Offline analysis faces the past and analyzes historical data that already exists.
The work proceeds in distinct batches along the time dimension, for example one analysis per week (T+7) or one per day (T+1), which is why it is also called batch processing.

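As a rough illustration of a T+1 job, the following Python sketch summarizes one finished day of orders in a single pass. The order records, the field layout, and the function name `daily_report` are hypothetical and only serve the example.

```python
from datetime import date

# Hypothetical order log: (order date, amount). In a T+1 job, the records for a
# given day are analyzed together after that day has ended.
orders = [
    (date(2022, 8, 1), 30.0),
    (date(2022, 8, 1), 12.5),
    (date(2022, 8, 2), 8.0),
    (date(2022, 8, 2), 20.0),
    (date(2022, 8, 2), 2.0),
]


def daily_report(records, day):
    """Summarize every record that belongs to `day` in one batch."""
    amounts = [amount for order_day, amount in records if order_day == day]
    return {"day": day.isoformat(), "orders": len(amounts), "revenue": sum(amounts)}


# Imagine this runs on 2022-08-03 and processes the previous day (T+1).
report = daily_report(orders, date(2022, 8, 2))
print(report)  # {'day': '2022-08-02', 'orders': 3, 'revenue': 30.0}
```

Unlike the real-time case, nothing is computed until the whole batch is available, which trades freshness for simplicity.
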
1.3 Predictive Analysis

  • Machine Learning

Predictive analysis forecasts what will happen in the future based on historical data and current real-time data. It focuses on applying mathematical algorithms such as classification, clustering, association, and prediction.
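
One of the simplest prediction techniques is fitting a straight line to historical values and extending it forward. The sketch below does this with ordinary least squares in plain Python. The weekly sales figures and the function name `fit_line` are invented for illustration, and real predictive analysis would usually involve a machine learning library and far more data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales for weeks 1 to 5.
weeks = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]

slope, intercept = fit_line(weeks, sales)
forecast_week_6 = slope * 6 + intercept
print(round(forecast_week_6, 2))  # 200.0
```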

2. Basic steps of data analysis

  • A defined sequence of steps (a process) matters because it gives strong logical support for how to carry out data analysis;
  • In "Six Steps of Data Analysis", Zhang Wenlin says that a typical data analysis should include the following steps:

Step1: Clarify the analysis purpose and ideas

  • The purpose is the starting point of the entire analysis process, providing clear guidance for data collection, processing and analysis;
  • The idea is what systematizes the analysis framework: what to analyze first and what to analyze next, so that the analysis points are logically connected. This ensures that the analysis dimensions are complete and that the results are valid and correct, and it requires the support of a data analysis methodology;
  • Data analysis methodology here refers to theories from marketing and management, such as user behavior theory, PEST analysis, and 5W2H analysis.

Step2: Data collection

  • Creating data from nothing: for example, sensors collect weather data, and event tracking (tracking points embedded in an application) collects user behavior;
  • Transferring and moving data: for example, loading database data into the data analysis platform.

Step3: Data processing

  • To be precise, it should be called data preprocessing.
  • Data preprocessing processes and organizes the collected data into a form suitable for analysis. It mainly includes data cleaning, data conversion, data extraction, and data calculation;
  • Data preprocessing ensures the consistency and validity of the data and turns it into clean, regular structured data.
    Think:
  1. Does the data that enterprises currently analyze consist mostly of text, or of pictures and video?
  2. What is clean, regular structured data? Is there also unstructured data?
    Technically speaking, structured data is data in a two-dimensional table, organized in rows and columns;
    informally, it is data with a clear format that is easy to interpret.
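
The preprocessing described in this step can be sketched in a few lines of Python. The example below shows cleaning (dropping incomplete and duplicate records), conversion (normalizing text and types), and calculation (a derived total), and ends with rows that all share the same columns. The raw records, field names, and the function name `preprocess` are hypothetical.

```python
# Hypothetical raw records as they might arrive from collection: inconsistent
# spacing and casing, a missing value, and a duplicate.
raw_rows = [
    {"user": " Alice ", "city": "beijing", "amount": "30.5"},
    {"user": "Bob", "city": "Shanghai", "amount": ""},
    {"user": " Alice ", "city": "beijing", "amount": "30.5"},
    {"user": "Carol", "city": "SHANGHAI", "amount": "12"},
]


def preprocess(rows):
    """Return clean, regular rows: one dict per record with typed columns."""
    seen = set()
    clean = []
    for row in rows:
        # Cleaning: skip records with a missing amount.
        if not row["amount"].strip():
            continue
        # Conversion: normalize text and turn the amount into a number.
        user = row["user"].strip()
        city = row["city"].strip().title()
        amount = float(row["amount"])
        # Cleaning: drop exact duplicates.
        key = (user, city, amount)
        if key in seen:
            continue
        seen.add(key)
        clean.append({"user": user, "city": city, "amount": amount})
    return clean


table = preprocess(raw_rows)

# Calculation: derive a per-city total from the cleaned table.
totals = {}
for row in table:
    totals[row["city"]] = totals.get(row["city"], 0.0) + row["amount"]

print(table)
print(totals)  # {'Beijing': 30.5, 'Shanghai': 12.0}
```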

Step4: Data analysis

  • Data analysis applies appropriate methods and tools to the processed data, extracts valuable information, and forms effective conclusions;
  • It requires mastering a range of data analysis methods and being familiar with data analysis software.

Step5: Data display

  • Data display, also called data visualization, presents analysis results graphically, because people are visual creatures;
  • Data visualization is one kind of data application;
  • Note that visualization is not the only use of analysis results: they can also feed further work such as data mining and ad hoc queries.

Step6: Report writing

  • The data analysis report summarizes and presents the entire data analysis process;
  • It presents the origin, process, results, and suggestions of the analysis in full, for decision makers to consult;
  • It needs a clear conclusion, ideally with a suggestion or solution.

Summary

  1. Everything in the analysis process revolves around data
  2. In plain terms: where does the data come from, and where does it go
  3. Core steps: collection, processing, analysis, application

3. The era of big data

1. Background

  • McKinsey, a world-renowned consulting company, was the first to announce the arrival of the "big data" era. It said: "Data has penetrated into every industry and business function today and has become an important factor of production. The mining and use of massive data heralds a new wave of productivity growth and consumer surplus."
  • In 2019, CCTV launched "Big Data Era", the first documentary in China themed on the big data industry, covering the changes and impacts that big data has brought.

2. Definition of big data

  • Big data refers to a collection of data that cannot be captured, managed, and processed with conventional software tools within a given time frame;
  • It is a massive, fast-growing, and diverse information asset that requires new processing models in order to deliver stronger decision-making power, insight and discovery, and process optimization capabilities.

3. 5V characteristics of big data

  • Five words beginning with V describe the characteristics of big data accurately and vividly from five aspects.

3.1. Large data volume (Volume)

  1. Large amount of collected data
  2. Large amount of stored data
  3. Large amount of computed data
  4. Starting at the TB and PB level

3.2. Diversified types and sources (Variety)

  1. Type: structured, semi-structured, unstructured
  2. Source: Log text, image, audio, video
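
The three data types can be contrasted with a short Python sketch that uses only the standard library. The sample records below are invented for illustration: a CSV table stands in for structured data, a JSON document for semi-structured data, and a free-text comment for unstructured data.

```python
import csv
import io
import json

# Structured: fixed columns, and every row has the same fields (a two-dimensional table).
structured = io.StringIO("id,name,age\n1,Alice,30\n2,Bob,25\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing, but fields may be nested or vary per record.
semi_structured = json.loads(
    '{"id": 3, "name": "Carol", "tags": ["vip"], "address": {"city": "Beijing"}}'
)

# Unstructured: free text with no predefined fields; any structure must be extracted.
unstructured = "The delivery was late but the support team was very helpful."
words = unstructured.lower().split()

print(rows[1]["age"])                      # 25 (a string, since CSV carries no types)
print(semi_structured["address"]["city"])  # Beijing
print(len(words))                          # 11
```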

3.3. Low value density (Value)

  1. Massive information but low value density
  2. Deep and complex mining analysis requires the participation of machine learning

3.4. Fast speed (Velocity)

  1. Fast data growth
  2. Fast data acquisition
  3. Fast data processing

3.5. Data quality (Veracity)

  1. Data accuracy
  2. Data trustworthiness

Application Scenario

  • E-commerce field
    Precisely targeted ad slots, personalized recommendation, big-data price discrimination against regular customers

  • Media field
    Precision marketing, "guess what you like" suggestions, interactive recommendation

  • Finance field
    Wealth management and investment: by assessing personal credit and risk tolerance, platforms aggregate many wealth management products and recommend suitable investment and wealth management products.

  • Transportation field
    Congestion prediction, intelligent traffic lights, optimal navigation planning

  • Telecom field
    Base station site selection optimization, public opinion monitoring, customer profiling

  • Security field
    Crime prevention, Skynet monitoring

  • Medical field
    Smart medical care, disease prevention, disease source tracking

Thinking:
  1. How can massive data be stored in big data scenarios?
  2. How can massive data be computed?
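
A back-of-envelope calculation shows why these two questions are hard for a single machine. The Python sketch below estimates how long it takes to read 1 PB once. The disk throughput of 200 MB/s and the count of 1,000 machines are assumptions chosen only for illustration, and the model ignores network and coordination overhead.

```python
PB = 10**15  # bytes in one petabyte (decimal units)

# Assumption for illustration only: one disk reads sequentially at 200 MB/s.
DISK_BYTES_PER_SECOND = 200 * 10**6


def scan_days(total_bytes, machines=1):
    """Days needed to read `total_bytes` once, split evenly across `machines`."""
    seconds = total_bytes / (DISK_BYTES_PER_SECOND * machines)
    return seconds / 86_400


single = scan_days(PB)
spread = scan_days(PB, machines=1_000)
print(f"1 machine: {single:.1f} days; 1000 machines: {spread * 24 * 60:.1f} minutes")
# 1 machine: 57.9 days; 1000 machines: 83.3 minutes
```

Under these assumptions, simply reading the data takes about two months on one machine, which is the motivation for spreading both storage and computation across many machines.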

4. Distributed and clustered

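Concept

  • Distributed and clustered are two different concepts, but they are often confused in everyday speech. Broadly, a distributed system spreads different parts of a job across multiple machines, while a cluster runs the same service on multiple machines.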

Point of confusion

  • What distributed systems and clusters have in common: both are made up of multiple machines (servers);
  • So when the two terms are used interchangeably in speech, the intended contrast is with a single stand-alone machine.

Application

  • As big data explodes, scenarios that process massive data run into problems, chiefly how to store and how to compute so much data.
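
The general idea of using multiple machines for computation can be sketched as split, process independently, then merge. In the Python example below, threads stand in for separate machines, and the log lines and the function names `partition` and `count_words` are hypothetical. It is a toy model of the approach, not how any particular framework implements it.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log lines; in a real system each partition would live on a different machine.
lines = [
    "error disk full",
    "info job started",
    "error network timeout",
    "info job finished",
]


def partition(items, parts):
    """Split `items` into `parts` roughly equal slices."""
    return [items[i::parts] for i in range(parts)]


def count_words(chunk):
    """Work done independently on one partition."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


# Threads stand in for separate machines here.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_words, partition(lines, 2)))

# Merge the partial results into one answer.
total = sum(partials, Counter())
print(total["error"], total["job"])  # 2 2
```

Because each partition is counted without looking at the others, adding more workers lets the same job handle more data, and only the small partial results need to be combined at the end.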

Origin blog.csdn.net/gongzi_9/article/details/126412922