Big Data Training Camp Course Outline & Project Introduction

Course Outline

Module 1: "Troika" of big data: HDFS, MapReduce/YARN, HBase

Teaching objectives:
Hadoop is the cornerstone of the big data platform system. By working through the "troika" of the Hadoop ecosystem, this module will help you:

  1. Master the distributed system framework from the perspective of storage and computing;
  2. Understand how to build, manage, use and monitor clusters;
  3. Understand how to effectively solve big data problems;
  4. Learn how to design and implement a distributed system by learning the excellent designs and source codes of HDFS, MapReduce, and YARN.

Pain points in study and work:

  1. Only a shallow understanding of the Hadoop technology ecosystem: not knowing which scenarios cause problems, how to avoid pitfalls, or how to optimize the system;
  2. Skills stop at the usage level, with no ability to design and implement a stable distributed system;
  3. The Hadoop ecosystem is complex, with many components and a large code base, making it hard to know where to start;
  4. Self-study is inefficient: what you read is quickly forgotten, understanding stays shallow, and you cannot get straight to the point and apply what you have learned.

Core competencies acquired through learning:

  1. Comprehensive understanding of big data processing frameworks and models;
  2. Systematically study the architectures of HDFS, MapReduce, and YARN, and understand technical principles such as block storage, read-write separation, scheduler, finite state automaton, and WAL;
  3. Learn the architecture of HDFS and YARN, HA model, Federation architecture, etc.;
  4. Master the ideas and techniques for troubleshooting Hadoop, understand how to choose a suitable architecture, and know how to monitor and manage the platform;
  5. Master the design principles of distributed systems and design and implement your own distributed system by learning the excellent source code of Hadoop;
  6. After completing the course, you will be able to take on the roles of Hadoop big data platform engineer and operation and maintenance engineer.

Detailed content: Introductory course (2 lessons) + 13 lessons

  1. Overview of Hadoop development history and ecosystem;
  2. Overview of the distributed file system HDFS, including its functions, roles, advantages, application status and development trends;
  3. Detailed explanation of HDFS's core technologies, design essence, and basic working principles, including its system architecture, file storage model, and storage and throughput scaling (see the HDFS API sketch after this list);
  4. An overview of the data-parallel computing framework MapReduce, with a detailed explanation of its working mechanism, underlying principles, performance tuning techniques, etc.;
  5. Analysis of parallel computing processing ideas and functional programming technology principles in big data platforms;
  6. System architecture, core functional modules, and MapReduce programming application development practices of the MapReduce parallel processing platform;
  7. Learn the architecture and various scheduling algorithms of the resource scheduler YARN;
  8. Explain YARN’s disaster recovery mechanism, multi-tenant model, etc.;
  9. Case: taking a company's big data platform as an example, we share the actual configuration scheme of a PB-scale cluster and a recommended deployment topology for the cluster's data center.
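
As a taste of the hands-on content above, here is a minimal sketch of talking to HDFS through the Hadoop FileSystem API from Scala. The NameNode address (`hdfs://namenode:8020`) and the `/tmp/hello.txt` path are assumptions for illustration; the point is to see how a file maps onto blocks and replica hosts, which is the block-storage model discussed in item 3.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBlockDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020") // assumed NameNode address

    val fs   = FileSystem.get(conf)
    val file = new Path("/tmp/hello.txt")            // illustrative path

    // Write a small file; HDFS splits larger files into fixed-size blocks
    val out = fs.create(file, true)                  // overwrite if it exists
    out.writeBytes("hello hdfs\n")
    out.close()

    // Ask the NameNode how the file is laid out: block offsets and replica hosts
    val status = fs.getFileStatus(file)
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach(b => println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}"))

    fs.close()
  }
}
```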

Module 2: Data warehouse practice in the big data era: Hive

Teaching objectives:
Hive has become the de facto standard for data warehouses in the big data ecosystem and the default choice for data warehouse construction at major Internet companies.
This module will take you:

  1. Relearn the basics and principles behind Hive;
  2. In-depth analysis of how to use Hive;
  3. Master HQL syntax and commonly used warehouse pattern design;
  4. Master Hive optimization methods;
  5. Understand Hive’s advanced features and future development trends;
  6. Consolidate learning content through case practice.

Pain points in study and work:

  1. The business side can write SQL but does not know how the SQL is executed underneath and cannot write efficient SQL;
  2. The platform team cannot read the business side's SQL and cannot help optimize it, so platform resources are wasted;
  3. SQL error messages are hard to interpret; slow queries get blamed on a lack of resources, and the root cause is never found.

Core competencies acquired through learning:

  1. Master the basic principles of Hive;
  2. Master the basic use of Hive;
  3. Master the basic syntax and common optimization measures of HiveQL;
  4. Understanding the Hive data warehouse design method will enable you to perform big data analysis and data development tasks in most Internet scenarios.

Details: 10 lessons

  1. Hive version evolution and current status, Hive installation and deployment, HiveServer and JDBC/ODBC, Hive’s basic architecture;
  2. The basic data types supported by Hive, the file formats supported by Hive and their advantages and disadvantages, and the design of Hive’s common patterns;
  3. HiveQL data definition, data operation and data query (Select/Where/Group By/Join/Order By/Sort By/Cluster By/Distribute By);
  4. Hive tuning: using EXPLAIN to view the execution plan and controlling the number of map and reduce tasks (see the JDBC sketch after this list);
  5. Hive speculative execution mechanism, Join optimization strategy, general solution to data skew problem, dynamic partition optimization;
  6. Case: integrate the content learned in this module through the practice of advertising user behavior analysis.
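
To connect HiveServer2/JDBC (item 1) with EXPLAIN-based tuning (item 4), below is a minimal sketch of running an EXPLAIN through the Hive JDBC driver from Scala. The host `hiveserver:10000`, the credentials and the `ad_click_log` table are invented for illustration; the only requirement is the standard Hive JDBC driver (`org.apache.hive:hive-jdbc`) on the classpath.

```scala
import java.sql.DriverManager

object HiveExplainDemo {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (hive-jdbc must be on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Assumed HiveServer2 address, database and credentials
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hive", "")
    val stmt = conn.createStatement()

    // EXPLAIN shows the stages Hive compiles the query into, without running anything
    val rs = stmt.executeQuery("EXPLAIN SELECT dt, count(*) FROM ad_click_log GROUP BY dt")
    while (rs.next()) println(rs.getString(1))

    rs.close(); stmt.close(); conn.close()
  }
}
```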

Module 3: Faster data processing engine: Spark

Teaching objectives:
Spark, as a new generation of big data processing engine, is the first choice of many Internet companies for offline data processing. It is also widely used in real-time computing, machine learning and other fields. This module will take you:

  1. Understand and master the basic concepts and underlying principles of Spark;
  2. Master Spark practical skills and be able to perform data processing analysis, data development and other tasks;
  3. Master knowledge on performance tuning, big data migration, etc. based on enterprise-level cases;
  4. Stimulate interest in using big data to solve practical problems.

Pain points in study and work:

  1. Don’t know the Scala language and can’t use the Spark API;
  2. Without understanding the Spark programming model, you cannot convert MR jobs into Spark jobs;
  3. Unclear about how each Spark component works underneath, unable to use Spark flexibly for data analysis, and unclear why Spark tasks run slowly;
  4. Spark tasks often hit OOM, suffer low resource utilization, and run slowly on large jobs, yet you cannot quickly locate the cause and optimize them.

Core competencies acquired through learning:

  1. Understand the design principles and working principles of Spark and the use and deployment of various basic components;
  2. Master how to query Spark's API and how to use the documentation to develop your own Spark application;
  3. Master the ability to identify performance bottlenecks by viewing Spark job runtime status and parameters, and conduct targeted analysis and optimization;
  4. Have preliminary ideas and skills to solve complex enterprise-level big data problems;
  5. After completing the course, you will be able to take on the roles of data development engineer and data application development engineer.

Details: 5 lessons

  1. What is Spark, Spark’s application scenarios and ideas;
  2. Spark basic concepts such as programming model, RDD, data processing flow, data storage format, resource allocation algorithm, etc.;
  3. Why Spark can be up to 100 times faster than Hive on MapReduce;
  4. How to build a Spark cluster environment and how to monitor it;
  5. Lifecycle management and performance optimization of Spark jobs;
  6. Detailed explanation of the Spark API and hands-on practice writing Spark programs (see the word-count sketch after this list);
  7. Spark scheduler, Spark Shuffle optimization;
  8. Learn about Spark machine learning and Spark streaming computing.
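
As a small illustration of the programming model in items 2 and 6, here is the classic word count sketched with the RDD API in Scala. The input path is an assumption, and `local[*]` is only for experimenting; on a real cluster the master is set by `spark-submit`.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] is only for trying the API; on a cluster the master comes from spark-submit
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///tmp/input.txt")    // assumed input path
      .flatMap(_.split("\\s+"))                          // transformation: lazily builds the RDD lineage
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                // wide dependency: triggers a shuffle when executed

    counts.take(10).foreach(println)                     // action: actually runs the job
    spark.stop()
  }
}
```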

Module 4: Reconstructing a modern data warehouse: Spark SQL

Teaching objectives:
Spark SQL is the most important module of Spark. More than 80% of Spark usage scenarios are SQL. At the same time, with its compatibility with HiveQL, it is gradually replacing Hive's position in the data warehouse. This module will take you:

  1. Relearn the basic concepts and principles behind SQL;
  2. Master practical skills such as Spark SQL syntax and common warehouse pattern design;
  3. Master Spark SQL optimization ideas and methods;
  4. Master the advanced features of Spark SQL and the new features of Spark 3.0.

Pain points in study and work:

  1. Data analysts who can write SQL do not understand how Spark SQL works, so their SQL runs slowly or not at all;
  2. "Bad SQL" causes waste of platform resources and even application OOM;
  3. Faced with complex SQL execution plans, I have no idea where to start, I don’t know where the bottleneck is, and I don’t know how to solve it;
  4. I have a certain foundation in Spark development, but I can’t understand the source code of Spark SQL and don’t know how to fix bugs.

Core competencies acquired through learning:

  1. Master the basic concepts and underlying principles of SQL;
  2. Master Spark SQL practical tuning skills and the principles behind them;
  3. Can quickly locate the cause of slow SQL execution and write faster SQL;
  4. Have a working understanding of Spark SQL's logical plan optimization, physical planning and code generation, and be able to add simple rules to extend the SQL optimizer;
  5. Have the ability to build and optimize data warehouse engines;
  6. Be able to do secondary development on the engine internals to build a more stable, higher-throughput platform;
  7. After learning, you will be able to perform SQL optimization and advanced data development tasks.

Details: 10 lessons

  1. Basic concepts of SQL, table connection methods;
  2. Spark SQL logical plan optimization and physical plan optimization (see the explain sketch after this list);
  3. Optimization of Spark data skew;
  4. Spark 3.0 new features;
  5. Spark SQL best practices;
  6. Practice: How to smoothly migrate petabyte-level commercial data warehouses to Spark;
  7. Spark TPC benchmark;
  8. Spark Web UI Debug;
  9. Case: Spark job management enterprise case;
  10. Case: Spark data warehouse migration enterprise case.
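
Tying items 2 and 8 together, the sketch below builds a tiny DataFrame, runs a SQL query over it, and prints the parsed, analyzed and optimized logical plans plus the physical plan with `explain(true)`. The `orders` data is invented purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExplainDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented data, just enough to produce a non-trivial plan
    val orders = Seq((1, "cn", 10.0), (2, "us", 20.0), (3, "cn", 3.0)).toDF("id", "region", "amount")
    orders.createOrReplaceTempView("orders")

    val q = spark.sql(
      "SELECT region, SUM(amount) AS total FROM orders WHERE amount > 5 GROUP BY region")

    // extended = true prints parsed, analyzed and optimized logical plans plus the physical plan,
    // which is where optimizations such as predicate pushdown become visible
    q.explain(true)
    q.show()
    spark.stop()
  }
}
```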

Module 5: OLAP battle: Presto, Kylin, ClickHouse

Teaching objectives:
OLAP technology is a very important part of the field of big data analysis. There are many excellent computing engines. How to choose one has become a big problem. This module will take you:

  1. Based on Spark SQL, expand the OLAP knowledge system;
  2. Learn about three commonly used OLAP engines: Presto, Kylin, and ClickHouse;
  3. Compare and study the application scenarios, principles behind and selection points of these three OLAP engines.

Pain points in study and work:

  1. The platform provides multiple OLAP technologies, but it is not clear which one to use;
  2. Without understanding the differences between engines, it is easy to assume that because some queries are fast, all queries will be fast;
  3. The technology stack is too broad: what you learn is soon forgotten, with no horizontal comparison or systematic summary.

Core competencies acquired through learning:

  1. Master the basics of OLAP technology;
  2. Master the basic working principles of Presto, Kylin, and ClickHouse engines;
  3. Understand the usage scenarios and comparison of advantages and disadvantages of different engines;
  4. Be able to grasp the commonalities and characteristics among various OLAP engines, and better select and use them in practice;
  5. Understand the future development direction and core issues of OLAP technology.

Details: 7 lessons

  1. What OLAP is and its common operations: roll-up, drill-down, slice and dice (see the roll-up sketch after this list);
  2. Analysis of the architecture of Presto, Kylin, and ClickHouse;
  3. Introduction to Presto, Kylin, and ClickHouse query optimizers and analysis of execution processes;
  4. Compare the features and usage scenarios of the three OLAP engines side by side, and grasp their commonalities and differences;
  5. Master the key points in selecting OLAP engine technology solutions.
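
To make the common OLAP operations in item 1 concrete without standing up a Presto, Kylin or ClickHouse cluster, here is a roll-up sketched with Spark's DataFrame API from the previous modules. The `sales` data is invented; the aggregation levels it produces correspond to the cuboids that an engine such as Kylin pre-computes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object RollupDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RollupDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented sales data: one row per (month, region, channel)
    val sales = Seq(
      ("2023-01", "cn", "mobile", 100.0),
      ("2023-01", "us", "web",     80.0),
      ("2023-02", "cn", "web",     60.0)
    ).toDF("month", "region", "channel", "amount")

    // Roll-up aggregates at the (month, region), (month) and grand-total levels in one pass;
    // this is the same pre-aggregation idea that Kylin materializes as cuboids
    sales.rollup($"month", $"region")
      .agg(sum($"amount").as("total"))
      .show()

    spark.stop()
  }
}
```

Replacing `rollup` with `cube` would produce all grouping-set combinations, which is closer to a full cuboid materialization.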

Module 6: Streaming processing and real-time computing: Kafka, Flink
Teaching objectives:
Now that computing at scale is no longer the core problem, real-time processing has become a development focus in the big data field. This module will take you:

  1. Master the application practices and principles behind Kafka, a typical representative of stream processing technology;
  2. Master the application practices and principles behind the popular real-time computing engine Flink;
  3. Step back from the implementation details and master the working principles and essence of real-time computing systems.

Pain points in study and work:

  1. How to develop real-time applications to meet business indicators;
  2. What to do when you hit "jitter", and how to handle disaster recovery and application degradation;
  3. How does the platform ensure high performance and reliability;
  4. For high real-time scenarios, how to make good use of the real-time computing engine.

Core competencies acquired through learning:

  1. Master the basic principles of Kafka and Flink;
  2. Master the development process of real-time applications;
  3. Monitor and set up alerting for both real-time applications and the system itself;
  4. By learning the enterprise application practices of real-time computing, you can design your own real-time applications and form best practices.

Details: 5 lessons

  1. What is Kafka, Kafka’s application scenarios and ideas;
  2. The design principles behind Kafka’s high performance and reliability;
  3. Tips on the Kafka API and how to use it for program development (see the producer sketch after this list);
  4. Monitoring, operation and maintenance and performance optimization of Kafka;
  5. The basic principles and architecture of Flink;
  6. How to think about the Flink API and how to use it for program development;
  7. Practice Flink in multiple application scenarios and make good use of Flink in practice.
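
For the Kafka API in item 3, here is a minimal producer sketch using the official Kafka Java client from Scala. The broker address `localhost:9092` and the `events` topic are assumptions; error handling, batching and idempotence settings are omitted for brevity.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Records with the same key land in the same partition, which preserves per-key ordering
    producer.send(new ProducerRecord[String, String]("events", "user-1", "click"))
    producer.send(new ProducerRecord[String, String]("events", "user-1", "purchase"))

    producer.flush()
    producer.close()
  }
}
```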

Module 7: Data development system: ETL, Data Visualization

Teaching objectives:
The data development system is an indispensable and important component in the field of big data. It involves many knowledge points, the most basic and important of which are ETL and data visualization. This module will take you:

  1. Master ETL application practice and selection ideas;
  2. Master the design principles and operating principles of ETL;
  3. Master the application skills and platform construction practices of data visualization;
  4. Integrate with previous modules and master the ability to develop closed-loop big data applications.

Pain points in study and work:

  1. How to better select and customize the development of ETL framework;
  2. How to optimize the scheduling system and task dependencies;
  3. How to design metadata management;
  4. How to combine data visualization with applications;
  5. How to simplify the development costs of data visualization;
  6. How to reduce data development costs through data visualization.

Core competencies acquired through learning:

  1. Master the operating principles of ETL's scheduling system and task dependency design principles;
  2. Select multiple open source scheduling systems based on your own business needs;
  3. Understand the design of task templates and metadata management technology in ETL;
  4. Master how to display data visually and build a data visualization platform based on cases;
  5. After learning, you will be able to complete the closed-loop construction of the entire big data platform.

Details: 3 lessons

  1. Selection of scheduling systems in ETL, introduction to scheduling systems such as Oozie, Azkaban, and Airflow;
  2. Task scheduling system design, scheduled task design and processing solutions in ETL;
  3. How does the scheduling system automatically resolve ETL task dependencies;
  4. ETL task design, data extraction, and implementation of loading tools;
  5. Implementation of ETL task templates and metadata design;
  6. Introduction to data visualization tools;
  7. HUE construction and use;
  8. Case: Introduction to the Superset architecture developed by Airbnb and its use cases.

Module 8: Data Lake, the next revolution in big data: DeltaLake, Hudi, Iceberg
Teaching objectives:
Data lake technology has been one of the hottest topics in big data over the past two years. It is likely to change how enterprises plan their data as a whole, and it also improves the existing data warehouse architecture. This module will take you:

  1. Understand the knowledge system of data lakes, and compare and master the differences between data warehouses and data lakes;
  2. Master the basic principles and application practices of three current excellent data lake software: DeltaLake, Hudi, and Iceberg;
  3. Master the key points of technology selection and better use and optimize it in data lake development practice.

Pain points in study and work:

  1. Staying at the conceptual level without understanding the essence of data lake technology;
  2. Performance degradation caused by using data lake software out of the box, without tuning;
  3. Rejecting, abusing, or misusing new technologies;
  4. Being unable to effectively connect previously learned knowledge and systems with data lake technology.

Core competencies acquired through learning:

  1. Understand the nature of a data lake and how it differs from a data warehouse;
  2. Master the basic principles of DeltaLake, Hudi, and Iceberg;
  3. Ability to appropriately select different data lake software;
  4. Able to carry out secondary development of data lake software;
  5. Can combine Spark, Flink and data lake software to build data lake applications.

Detailed content: 5 to 10 class hours

  1. Introduce the nature of data lakes and how they differ from data warehouses;
  2. Conduct architectural analysis and comparison on DeltaLake, Hudi, and Iceberg respectively;
  3. Practice: apply Spark + DeltaLake to build data lake applications (see the sketch after this list);
  4. Practice: Use Flink + Hudi/Iceberg to build data lake applications;
  5. DeltaLake’s performance optimization and secondary development practices.
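
A minimal sketch of item 3 (Spark + DeltaLake): write a Delta table, append to it, then read an older version back via time travel. It assumes the Delta Lake package (`io.delta:delta-core` or `delta-spark`, depending on version) is on the classpath; the `/tmp/delta/events` path is just for the demo.

```scala
import org.apache.spark.sql.SparkSession

object DeltaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DeltaDemo")
      .master("local[*]")
      // Standard Delta Lake session setup; requires the Delta package on the classpath
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/delta/events" // illustrative local path

    // Version 0: initial write as a Delta table (Parquet files plus a transaction log)
    Seq((1, "click"), (2, "view")).toDF("id", "event")
      .write.format("delta").mode("overwrite").save(path)

    // Version 1: an ACID append recorded in the transaction log
    Seq((3, "purchase")).toDF("id", "event")
      .write.format("delta").mode("append").save(path)

    // Time travel: read the table as it was at version 0
    spark.read.format("delta").option("versionAsOf", "0").load(path).show()

    spark.stop()
  }
}
```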

Module 9: Hadoop and Spark core source code explanation

Teaching objectives:
Hadoop and Spark are the cornerstones of the entire big data system, and source code learning will allow us to have a deeper understanding of their underlying design principles, and also improve our development capabilities and system architecture capabilities. This module will:

  1. Explain the core source code of Hadoop and Spark and teach you how to perform secondary development;
  2. Debug Hadoop and Spark to help your development capabilities advance and make breakthroughs.

Pain points in study and work:

  1. Eager to learn new projects but don’t know where to start when faced with a lot of code;
  2. Can't figure out the logic of the code and always get stuck in details;
  3. I don’t know how to debug open source code, and I don’t know how to contribute to the community;
  4. I don’t know how to use Git, and the secondary development of source code is not standardized.

Core competencies acquired through learning:

  1. Understand the structure of Hadoop and Spark core source code and be able to quickly locate source code problems;
  2. Focus on understanding and mastering the coding principles of Spark SQL;
  3. Learn how to use Git for secondary development of code;
  4. Learn how to contribute patches to the community and participate in community work.

Details: 3 lessons

  1. Introduce the use and techniques of Git;
  2. Obtain the source code and prepare the code environment to solve some dependency issues;
  3. Introducing the code structure and system of Hadoop, and reading key modules;
  4. Introduce the code structure and system of Spark, and walk through the SQL module;
  5. How to submit patches to the Hadoop and Spark open source communities, how to participate in community work, etc.

Module 10: Interview Clearance: How to Become an Excellent Big Data Development Engineer

Teaching objectives:
The field of big data changes by the day, and new technology stacks keep emerging. The requirements for big data developers therefore differ from those of traditional application or software development: they demand stronger learning ability and a research mindset. This module will bring you:

  1. Understand the necessary hard skills, soft skills and growth paths of big data engineers;
  2. Master practical learning methods about big data, as well as experience in growing and avoiding pitfalls;
  3. Develop real big data thinking and write more efficient and usable code.

Pain points in study and work:

  1. Believing in "just make it work" and being used to copy-and-paste, which has dug plenty of holes;
  2. Lacking big data thinking and constantly misjudging the scale of the problem: the code looks fine, but crashes as soon as it reaches production;
  3. Often having no direction for learning, no idea what to do next, and lacking a research mindset;
  4. Getting lost in the sprawling big data technology stack, not knowing how to locate problems and settle on solutions;
  5. I lack practical experience in big data and don’t know how to stand out in big data-oriented interviews.

Core competencies acquired through learning:

  1. Real big data development thinking; ideas for locating problems and solving them;
  2. A methodology for quickly researching and mastering new technologies;
  3. How to draw lessons from older technologies;
  4. How to read literature;
  5. How to solve production problems by combining theory with practice.

Details: 2 lessons

  1. Overview of the history, development and future directions of the field of big data;
  2. Extract and summarize future trends and career directions in the field of big data;
  3. The necessary soft skills and experience in avoiding growth pitfalls for big data engineers;
  4. Sharing experience in developing big data technical thinking;
  5. How to go from being able to solve big data problems to being able to ask questions;
  6. How to search and read literature, how to go from paper to implementation;
  7. Skill trees and techniques required for interviewing big data development.

Practical projects

Project 1: Hadoop cluster cloud host construction and health management

Practice goals:
Core skills for big data platform engineers. By building a cluster on cloud hosts, you deepen your knowledge and understanding of the open source big data components and how each module works. While mastering the basics, you also get hands-on exposure to advanced practices such as HA, Federation, and RBF.
Core points:

  1. Build a Hadoop cluster and its related components on Alibaba Cloud servers and in Docker containers, respectively;
  2. Through this hands-on work, understand the relationships and roles of the modules across the Hadoop platform, and assess cluster health by studying the cluster's various metrics;
  3. Using this cluster, learn the workflows of a data platform and the automation practices of an enterprise big data platform;
  4. Technologies involved: HDFS, MapReduce, YARN, Docker, Hive, Spark, Prometheus.


Project 2: Construction of data visualization and interactive self-service analysis platform

Practice goals:
Core skills for big data platform engineers: learn how to build an interactive self-service analysis platform and data visualization services, and how to provide highly available, high-performance query services.
Core points:

  • Complete the construction of the interactive analysis platform;
  • Build an offline analysis platform using OLAP engines such as Spark SQL, Presto, and Kylin;
  • Use Airflow for job flow management;
  • Use ThriftServer, JDBC, etc. for Session management;
  • Use HUE, Tableau, etc. to build data visualization applications;
  • Technologies involved: Presto, ClickHouse, Kylin, Spark SQL, Airflow, Superset, HUE, ThriftServer, HiveServer2.


Project 3: Using Spark to analyze large-scale e-commerce user data

Practical goal:
For data developers and data analysts: master big data analysis methods hands-on with the Spark computing engine. Through JDBC, learn how to use and optimize SQL and how to find and fix performance problems.
Core points:

  • Use Kafka, Spark Streaming, Flink, Spark SQL, Hive and other computing engines to analyze real large-scale e-commerce user data;
  • Simulate the usage scenarios of real enterprises in the field of big data, and complete corresponding data analysis tasks by writing programs or SQL.
  • Technologies involved: Spark SQL, Hive, JDBC.

Project 4: Hadoop and Spark source code learning

Practice goals:
Explain the core source code, learn how to do secondary development, and debug Hadoop and Spark, helping students make a breakthrough in their development skills.
Core points:

  • Learn the core source code of Hadoop and Spark, focusing on the structure of Spark SQL source code;
  • By studying the structure of the Spark SQL source code, derive the database architecture and design patterns behind it, and use that as a foundation for learning more complex systems, especially the underlying design language of data systems;
  • Technologies involved: Hadoop, Spark Internals.
