Review of Online Education Projects in Big Data Education Data Warehouse

01: Online Education Project Requirements

  • Goal: Master the requirements of the online education project

  • Implementation

    • Routine requirements: analyze and process the data to produce indicators that reflect the facts and support operational decisions
    • Industry: online education
    • Product: courses
    • Demand: improve the student registration conversion rate and achieve sustainable operation
      • Requirement 1: analyze student retention and churn at each step from visit to registration, find the problems in each step, solve them, and raise the registration rate
        • Access analysis
        • Consulting analysis
        • Intent analysis
        • Registration analysis
        • By analyzing each step, we can find the causes of churn at that step, fix them, and improve the conversion rate of every step
      • Requirement 2: sustainable development requires building a good product reputation and controlling students' learning quality, through the management of exams, attendance, and homework
        • Attendance analysis
  • Summary

    • Master the requirements of the online education project
  • Interview: Project Introduction

02: Division of Requirement Topics

  • Goal: Master the division of requirement topics in online education
  • Implementation
    • How a data warehouse organizes its data
      • Data warehouse [DW]: stores all the data of the entire company
        • Data mart / subject domain [DM]: divided according to business needs: by department, business line, or requirement
          • Topic: each topic is oriented to a final business analysis requirement
    • Requirement topics in online education
      • Data warehouse: business system data [customer service system, CRM system, student management system]
        • Business Data Warehouse: Structured Data
      • Data Mart/Subject Domain
        • Operation Management Mart / Operations Domain
        • Sales Management Mart / Sales Domain
        • Student Management Mart / User Domain
        • Product Management Mart / Product Domain
        • Advertising Domain
        • ……
      • Data topics
        • Source Analysis Topic, Access Analysis Topic, Consulting Analysis Topic
        • Sales Analysis Topic, Lead Analysis Topic, Intent Analysis Topic, Sign Up Analysis Topic
        • Attendance Analysis Topic, Exam Analysis Topic, Homework Analysis Topic
        • Product Access Topic, Product Sales Topic, Product Payment Topic
        • Table naming convention: [layer]_[domain]_[topic]_[dimension]
  • Summary
    • Master the division of requirement topics in online education
    • Interview: What subject domains are divided in the project and what are the themes?

03: Data Sources

  • Goal: Master the data sources of the online education platform
  • Implementation
    • Access Analysis Topic, Consulting Analysis Topic
      • Data source: the customer service system database
      • Requirement: count visiting users and consulting users across different dimensions (see the indicator sketch after the table list below)
        • Indicators: UV, PV, IP count, sessions, bounce rate, second-jump rate
        • Dimensions: time, region, source channel, search source, source page
      • web_chat_ems
      • web_chat_text_ems
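      • As a rough illustration of these indicators, a minimal HiveQL sketch (the column names user_id, session_id, ip and the dt partition are assumptions, not the real web_chat_ems schema):

        -- hypothetical sketch: column names are assumed, not the actual schema
        SELECT
            COUNT(DISTINCT user_id)    AS uv,       -- visiting users (UV)
            COUNT(1)                   AS pv,       -- visit records (PV)
            COUNT(DISTINCT ip)         AS ip_cnt,   -- distinct client IPs
            COUNT(DISTINCT session_id) AS sessions  -- distinct sessions
        FROM web_chat_ems
        WHERE dt = '2021-05-18';                    -- assumed date partition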
    • Lead Analysis Topic, Intent Analysis Topic, Registration Analysis Topic
      • Data source: the CRM (marketing) system database
      • Requirement: count intended users, registered users, and valid leads across different dimensions
        • Dimensions: time, region, source channel, online/offline, new/old students, campus, subject, sales department
      • customer_relationship: intent and registration information table
      • customer_clue: lead information table
      • customer: student information table
      • itcast_school: campus information table
      • itcast_subject: subject information table
      • employee: employee information table
      • scrm_department: department information table
      • itcast_clazz: registration class information table
    • Attendance Analysis Topic
      • Data source: the student management system database
      • Requirement: compute student attendance indicators across different dimensions: attendance count, attendance rate, lateness, leave, absenteeism
      • tbh_student_signin_record: student sign-in record table
      • student_leave_apply: student leave application table
      • tbh_class_time_table: class schedule table
      • course_table_upload_detail: uploaded course schedule detail table
      • class_studying_student_count: total number of students per class
  • summary
    • Remember the core tables and fields
    • Interview: What are the data sources?

04: Data Warehouse Design

  • Goal: Master the data warehouse implementation process for each business analysis topic

  • Implementation

    • Access Analysis Topics

      • ODS: web_chat_ems, web_chat_text_ems
      • DWD: merge the two tables and apply ETL
      • DWS: count users, sessions, and IPs over all access data by different dimensions (a hedged sketch follows)
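      • A sketch of what the DWS aggregation could look like, using Hive's GROUPING SETS to cover several dimension combinations in one pass (the dwd_web_chat and dws_visit_count names and their columns are assumptions):

        -- sketch only: table and column names are assumed
        INSERT OVERWRITE TABLE dws_visit_count PARTITION (dt = '2021-05-18')
        SELECT
            origin_channel,                        -- source channel dimension
            area,                                  -- region dimension
            COUNT(DISTINCT user_id)    AS uv,
            COUNT(DISTINCT session_id) AS sessions,
            COUNT(DISTINCT ip)         AS ip_cnt,
            GROUPING__ID               AS group_id -- marks which dimension combination the row belongs to
        FROM dwd_web_chat
        WHERE dt = '2021-05-18'
        GROUP BY origin_channel, area
        GROUPING SETS ((origin_channel), (area), (origin_channel, area));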
    • Consulting Analysis Topics

      • ODS: web_chat_ems, web_chat_text_ems
      • DWD: directly reuses the DWD of access analysis
      • DWS: count users, sessions, and IPs over all consulting data [msg_count > 0] by different dimensions
    • Intent Analysis Topic

      • ODS: customer_relationship, customer_clue
      • DIM: customer, employee, scrm_department, itcast_school, itcast_subject
      • DWD: apply ETL to customer_relationship
      • DWM: join all the tables, putting every dimension and fact field into one wide table (see the sketch below)
      • DWS: aggregate by different dimensions to get the number of intended users
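      • A minimal sketch of the DWM wide-table join (the dwd_/dwm_ table names and the join keys are assumptions inferred from the table roles above, not the real schema):

        -- sketch only: join keys are assumed
        INSERT OVERWRITE TABLE dwm_intention_wide
        SELECT
            cr.id,
            cr.create_date_time,
            c.area   AS customer_area,    -- region dimension
            e.name   AS employee_name,    -- sales employee dimension
            d.name   AS department_name,  -- sales department dimension
            sch.name AS school_name,      -- campus dimension
            sub.name AS subject_name      -- subject dimension
        FROM dwd_customer_relationship cr
        LEFT JOIN customer        c   ON cr.customer_id       = c.id
        LEFT JOIN employee        e   ON cr.creator           = e.id
        LEFT JOIN scrm_department d   ON e.tdepart_id         = d.id
        LEFT JOIN itcast_school   sch ON cr.itcast_school_id  = sch.id
        LEFT JOIN itcast_subject  sub ON cr.itcast_subject_id = sub.id;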
    • Registration Analysis Topic

      • ODS: customer_relationship
      • DIM: customer, employee, scrm_department, itcast_clazz
      • DWD: apply ETL to customer_relationship and filter for registration data
      • DWM: join the four tables, putting every dimension and fact field into one wide table
      • DWS: aggregate on the hour dimension, combined with the other dimensions, to obtain the indicators
      • APP: accumulate the hourly results to obtain the facts at the day, month, and year grain (see the sketch below)
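      • A hedged sketch of the hour-to-day accumulation in the APP layer (dws_signup_hour, app_signup_day, and their columns are assumed names):

        -- sketch only: hourly counts roll up to the day grain by a simple SUM;
        -- month and year follow the same pattern
        INSERT OVERWRITE TABLE app_signup_day
        SELECT
            origin_type,
            area,
            SUBSTR(hour_code, 1, 10) AS day_code,   -- e.g. '2021-05-18 09' -> '2021-05-18'
            SUM(signup_cnt)          AS signup_cnt
        FROM dws_signup_hour
        GROUP BY origin_type, area, SUBSTR(hour_code, 1, 10);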
    • Attendance Management Topic

      • ODS: tbh_student_signin_record, student_leave_apply
      • DIM: tbh_class_time_table, course_table_upload_detail, class_studying_student_count
      • DWD: none
      • DWM
        • Student attendance status table: based on the student sign-in record table
        • Class attendance status table: based on the student attendance status table
        • Class leave table: derived from the leave application table
        • Class absenteeism table: total students - attendance count - leave count (see the sketch below)
      • DWS: at the day grain, build the attendance indicators under the day + class dimension (24 indicators)
      • APP: accumulate (SUM) the daily headcounts to recompute the monthly and yearly attendance indicators
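      • A small sketch of the "total - attendance - leave" absenteeism rule in the DWM layer (the dwm_ table names and columns are assumptions):

        -- sketch only: derive class absenteeism from the other three counts
        SELECT
            a.class_id,
            a.class_date,
            s.studying_student_count                              AS total_cnt,
            a.attend_cnt,
            l.leave_cnt,
            s.studying_student_count - a.attend_cnt - l.leave_cnt AS truant_cnt
        FROM dwm_class_attend a
        JOIN dwm_class_leave  l ON a.class_id = l.class_id AND a.class_date = l.class_date
        JOIN class_studying_student_count s ON a.class_id = s.class_id;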
  • summary

    • Master the data warehouse implementation process for each business analysis topic
    • Interview: How is the layering designed?
      • ODS: raw data layer: stores the raw data
      • DWD: detailed data layer: detailed data after ETL
      • DWM: light summary layer: builds the topic's transaction facts, joins the fact tables to obtain the topic facts, and computes some basic indicators
      • DWS: summary data layer: builds wide tables of facts and dimensions for the whole subject domain
      • APP: splits result sub-tables by dimension for each topic
      • DIM: dimension data layer: all the dimension tables

05: Technical Architecture

  • Goal: Master the technical architecture of the entire project

  • Implementation

    • Data source: MySQL database
    • Data collection: Sqoop
    • Data Storage: Hive: Offline Data Warehouse
    • Data processing: HiveSQL [MapReduce] => later this can be swapped for SparkSQL and similar tools
    • Data result: MySQL
    • Data report: FineBI
    • Coordination service: Zookeeper
    • Visual interaction: Hue
    • Task flow scheduling: Oozie
    • Cluster Management Monitoring: Cloudera Manager
    • Project version management: Git
  • Summary

    • Master the technical architecture of the entire project
    • Interview: Project introduction or technical architecture of the project?

06: Project Optimization

  • Goal: Master common Hive optimizations

  • Implementation

    • Property optimization

      • local mode

        hive.exec.mode.local.auto=true;
        
      • JVM reuse

        mapreduce.job.jvm.numtasks=10
        
      • speculative execution

        mapreduce.map.speculative=true
        mapreduce.reduce.speculative=true
        hive.mapred.reduce.tasks.speculative.execution=true
        
      • Fetch task conversion (run simple queries without MapReduce)

        hive.fetch.task.conversion=more
        
      • parallel execution

        hive.exec.parallel=true
        hive.exec.parallel.thread.number=16
        
      • compression

        hive.exec.compress.intermediate=true
        hive.exec.orc.compression.strategy=COMPRESSION
        mapreduce.map.output.compress=true
        mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
        
      • vectorized query

        hive.vectorized.execution.enabled = true;
        hive.vectorized.execution.reduce.enabled = true;
        
      • zero copy

        hive.exec.orc.zerocopy=true;
        
      • Correlation optimization

        hive.optimize.correlation=true;
        
      • CBO optimizer

        hive.cbo.enable=true;
        hive.compute.query.using.stats=true;
        hive.stats.fetch.column.stats=true;
        hive.stats.fetch.partition.stats=true;
        
      • small file processing

        # Input format for the underlying MapReduce read: combine small files into larger input splits
        hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
        # For map-only jobs, merge the small files produced by the MapTasks
        hive.merge.mapfiles=true;
        # Also merge the small output files of full MapReduce jobs
        hive.merge.mapredfiles=true;
        hive.merge.size.per.task=256000000;
        hive.merge.smallfiles.avgsize=16000000;
        
      • index optimization

        hive.optimize.index.filter=true
        
      • Predicate Pushdown PPD

        hive.optimize.ppd=true;
        

        • For Inner Join and Full Outer Join, there is no performance difference between putting the condition in ON or in WHERE
        • In a Left Outer Join, filters on the right table go in ON and filters on the left table go in WHERE, which improves performance (see the sketch after this list)
        • In a Right Outer Join, filters on the left table go in ON and filters on the right table go in WHERE, which improves performance
        • If the SQL statement contains a non-deterministic function, the predicate cannot be pushed down
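        • A small illustration of the Left Outer Join rule (stu and exam are made-up table names):

          -- right-table filter in ON: applied before the join, stu rows are preserved
          -- left-table filter in WHERE: applied to the preserved side after the join
          SELECT a.id, b.score
          FROM stu a
          LEFT OUTER JOIN exam b
            ON a.id = b.stu_id AND b.score > 60
          WHERE a.grade = '2021';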
      • Map Join

        hive.auto.convert.join=true
        hive.auto.convert.join.noconditionaltask.size=512000000
        
      • Bucket Join

        hive.optimize.bucketmapjoin = true;
        hive.auto.convert.sortmerge.join=true;
        hive.optimize.bucketmapjoin.sortedmerge = true;
        hive.auto.convert.sortmerge.join.noconditionaltask=true;
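        A hedged DDL sketch of what a Bucket Join requires: both sides bucketed (and sorted) on the join key into the same number of buckets (the table and column names are assumptions):

        -- sketch only: names are assumed
        CREATE TABLE dwd_signup (
            id          STRING,
            customer_id STRING
        )
        CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 8 BUCKETS
        STORED AS ORC;
        -- the other join side (e.g. a customer table) must be bucketed the same
        -- way on its join key for Bucket Map Join / SMB Join to apply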
        
      • Task memory

        mapreduce.map.java.opts=-Xmx6000m;
        mapreduce.map.memory.mb=6096;
        mapreduce.reduce.java.opts=-Xmx6000m;
        mapreduce.reduce.memory.mb=6096;
        
      • buffer size

        mapreduce.task.io.sort.mb=100
        
      • Spill threshold

        mapreduce.map.sort.spill.percent=0.8
        
      • Merge factor (number of streams merged at once)

        mapreduce.task.io.sort.factor=10
        
      • Reduce pull parallelism

        mapreduce.reduce.shuffle.parallelcopies=8
        mapreduce.reduce.shuffle.read.timeout=180000
        
    • SQL optimization

      • Core idea: filter first, then process

        • Proper use of WHERE vs HAVING
        • Placement of conditions in ON vs WHERE in joins
        • Filter a large table down to a small table before joining (see the sketch below)
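      • A minimal sketch of "filter first, then join" (table and column names are assumptions):

        -- shrink the big table with partition and row filters before the join
        SELECT t.customer_id, c.area
        FROM (
            SELECT customer_id
            FROM dwd_customer_relationship
            WHERE dt = '2021-05-18'          -- partition pruning
              AND clue_state IS NOT NULL     -- row-level filter
        ) t
        JOIN customer c ON t.customer_id = c.id;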
    • Design Optimization

      • Partitioned tables: reduce the MapReduce input and avoid unnecessary filtering

      • Bucketed tables: reduce the number of comparisons, classify the data, split big data sets, and enable Bucket Map Join

      • File storage: prefer columnar formats: Parquet, ORC (see the DDL sketch below)
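      • A hedged DDL sketch combining the three design points above (the table and column names are assumptions):

        -- partitioned for input pruning, bucketed for bucket joins, ORC for columnar storage
        CREATE TABLE dwd_web_chat (
            session_id STRING,
            user_id    STRING,
            ip         STRING
        )
        PARTITIONED BY (dt STRING)
        CLUSTERED BY (session_id) INTO 8 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('orc.compress' = 'SNAPPY');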

  • Summary

    • Master Hive optimizations
    • Interview: What optimizations have been made in the project? What optimizations has Hive made?

07: Project Questions

  • Goal: Master the common memory and data skew problems in Hive

  • Implementation

    • Memory problems: the symptom is that the program fails to run

      • OOM: out of memory

      • Insufficient heap memory: allocate more memory to the Task process

        mapreduce.map.java.opts=-Xmx6000m;
        mapreduce.map.memory.mb=6096;
        mapreduce.reduce.java.opts=-Xmx6000m;
        mapreduce.reduce.memory.mb=6096;
        
      • Insufficient physical memory

        • Allow the NodeManager to use more memory
        • Expand hardware resources: add physical memory
        • Adjust the code: process partition by partition, and avoid Map Join
      • Insufficient virtual memory: adjust the virtual memory ratio (yarn.nodemanager.vmem-pmem-ratio, default 2.1)

    • Data skew problem: the program runs for a long time and gets stuck at 99% or 100%

  • Phenomenon

    • When running a job, one of its tasks keeps running after all the other tasks have finished, and the progress is stuck at 99% or 100%
  • Direct cause

    • The load on that ReduceTask is higher than on the other tasks

      • The data distribution across ReduceTasks is unbalanced

  • Root cause: the partitioning rule

    • Default partitioning: partition = hash(K2) % numReduceTasks

      • Advantage: the same K2 is always processed by the same reducer
      • Disadvantage: may lead to data skew
  • Scenarios with skewed data

    • group by / count(distinct)
    • join
  • Solutions

    • group by / count(distinct)

      • Enable the Combiner (map-side aggregation)

        hive.map.aggr=true
        
      • Random partitioning

        • Method 1: Enable parameters

          hive.groupby.skewindata=true
          
          • After enabling this parameter, Hive automatically runs two MapReduce jobs

          • The first job randomly partitions the data

          • The second job performs the final aggregation

        • Method 2: Manually specify

          distribute by rand(): writes rows into random partitions

          distribute by 1: writes all rows into a single partition
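          A hedged sketch of the manual two-stage (salted) aggregation these tricks enable; dwd_web_chat and its area column are assumed names:

          -- stage 1: a random salt splits each hot key across up to 10 groups
          -- stage 2: the partial counts are merged per key
          SELECT area, SUM(cnt) AS total_cnt
          FROM (
              SELECT area, salt, COUNT(1) AS cnt
              FROM (
                  SELECT area, CAST(rand() * 10 AS INT) AS salt
                  FROM dwd_web_chat
              ) s
              GROUP BY area, salt
          ) t
          GROUP BY area;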
          
    • join

      • Solution 1: avoid Reduce Join as much as possible

        • Map Join: filter out data that does not need to participate in the join, turning the large table into a small table
        • Build a Bucket Map Join
      • Solution 2: skewjoin: avoids data skew during the Reduce Join process

            -- enable skew join handling at runtime
            set hive.optimize.skewjoin=true;
            -- a key is treated as skewed if it appears more than this many times
            set hive.skewjoin.key=100000;
            -- also check at compile time whether the query will produce data skew
            set hive.optimize.skewjoin.compiletime=true;
            -- do not merge the union results, to improve performance
            set hive.optimize.union.remove=true;
            -- if Hive runs on MapReduce, this must be enabled for the no-merge option to take effect
            set mapreduce.input.fileinputformat.input.dir.recursive=true;
        

  • Summary

    • Master the common memory overflow and data skew problems in Hive
    • Interview: How do you solve data skew?
      • Increase the number of partitions: repartition
      • When joining, broadcast the smaller data set
      • Custom partitioning rules: per the five characteristics of RDDs, a partitioner can be specified for key-value RDDs
        • e.g. reduceByKey(new HashPartitioner(n), func)
  • Technical interviews: theory-oriented

    • Hadoop: HDFS read/write principles, job execution flow, port numbers, daemon processes, how MapReduce runs on YARN

    • Hive: SQL statements, function usage

      • String functions, date functions, conditional functions, window functions
