Big Data Theoretical System

Preamble

Between 2018 and 2021, the author read more than 200 books and columns on big data.

This column is the product of the author's reading notes from those years, combined with his own hands-on experience in big data development.

Please follow the column. Thank you very much!

Companion columns

1000 questions to solve the big data technology system

100 questions to solve the Java virtual machine

Technology sharing PPT

Here are the slides the author has used for a technology-sharing talk on getting started with big data:

Big data from 0 to 1.pptx

Table of Contents

  1. What is big data?
  2. What are the characteristics of big data?
  3. What is the relationship between big data and cloud computing?
  4. What is the relationship between big data and artificial intelligence?
  5. How did big data develop?
  6. What is the basic process of big data processing?
  7. How to measure data quality?
  8. What is ETL?
  9. What does big data development mainly do?
  10. What are the types of big data technical frameworks?
  11. Why is it said that the code moves while the data stays put? Is moving computation more cost-effective than moving data?
  12. What are the benefits of DAG for big data processing?
  13. How to distinguish between batch processing and stream processing? How to distinguish between bounded data and unbounded data?
  14. How to improve CPU utilization in batch processing?
  15. What are event time and processing time?
  16. What does the Workflow design pattern refer to?
  17. What is a distributed lock? How is it implemented?
  18. What are distributed transactions? How are they implemented?
  19. What is the difference between distributed locks and distributed transactions?
  20. What is the CAP theorem?
  21. What is BASE theory?
  22. What are the metrics for distributed systems?
  23. What are the consistency models?
  24. What are SLAs?
  25. How to estimate system QPS?
  26. What do you think of the publish-subscribe model?
  27. What is the difference between the publish-subscribe pattern and the observer pattern?
  28. What are the methods of data sharding in distributed systems?
  29. What is consistent hashing?
  30. Why serialize data?
  31. How to choose a data compression algorithm?
  32. How to choose a serialization framework in a distributed system?
  33. What is Protobuf?
  34. What is Apache Thrift?
  35. What is Apache Avro?
  36. What is Kryo?
  37. What is the difference between columnar storage and row storage?
  38. How to choose a columnar storage format?
  39. What is ORCFile?
  40. What is Parquet?
  41. What is a data warehouse?
  42. What is the difference between a data warehouse and a database?
  43. What is the difference between OLTP and OLAP?
  44. How is the data warehouse layered?
  45. How is a data warehouse modeled?
  46. What are fact tables and dimension tables?
  47. What is Business Intelligence (BI)?
  48. From the perspective of system architecture, how should servers be classified?
  49. What is MPPDB?
  50. What is the difference between MPPDB and Hadoop?
  51. Which server architecture should a data warehouse choose?
  52. What are the parallel computing models?
  53. What is the difference between BSP and MapReduce?
  54. What are the implementation methods of OLAP?
  55. What is Cube technology?
  56. What is NoSQL?
  57. What is load balancing?
  58. What are the load balancing algorithms?
  59. How to implement forwarding in a distributed system?
  60. What is the role of the big data resource scheduling framework?
  61. What are the technical difficulties in resource scheduling?
  62. What is multi-tenancy technology?
  63. What do you think are the flaws in the traditional YARN and Mesos scheduling schemes?
  64. What is an inverted index?
  65. What is enterprise data?
  66. What is a data lake? Why do you need a data lake?
  67. What is the lifecycle of data in a data lake?
  68. What is the difference between a data warehouse, a data mart, and a data lake?
  69. What is Lambda architecture?
  70. What is the Kappa architecture?
  71. How to apply the Lambda architecture to the data lake? What are the functional modules in the data lake?
  72. What challenges do enterprise data lakes face?
  73. What exactly is RAID technology?
  74. Why do you need a workflow scheduling system?
  75. Why have a message queue/message engine system?
  76. What is a cloud native database?
  77. What is the future development trend of the database field?



Origin blog.csdn.net/Shockang/article/details/115609804