Preamble
From 2018 to 2021, the author read more than 200 books and columns on big data.
This column distills years of reading notes, combined with the author's own experience in big data development.
If you find it useful, please follow the column. Thank you!
Companion columns
1000 questions to solve the big data technology system
100 questions to solve the Java virtual machine
Tech talk slides
Here are the slides the author has used for a tech talk on getting started with big data:
Directory Structure
- What is big data?
- What are the characteristics of big data?
- What is the relationship between big data and cloud computing?
- What is the relationship between big data and artificial intelligence?
- How did big data develop?
- What is the basic process of big data processing?
- How to measure data quality?
- What is ETL?
- What does big data development mainly do?
- What are the types of big data technical frameworks?
- Why is it said that the code moves while the data stays put? Is moving computation more cost-effective than moving data?
- What are the benefits of DAG for big data processing?
- How to distinguish between batch processing and stream processing? How to distinguish between bounded data and unbounded data?
- How to improve CPU utilization in batch processing?
- What are event time and processing time?
- What does the Workflow design pattern refer to?
- What is a distributed lock? How is it implemented?
- What are distributed transactions? How are they implemented?
- What is the difference between distributed locks and distributed transactions?
- What is the CAP theorem?
- What is BASE theory?
- What are the metrics for distributed systems?
- What are the consistency models?
- What are SLAs?
- How to estimate system QPS?
- How should the publish-subscribe model be understood?
- What is the difference between the publish-subscribe pattern and the observer pattern?
- What are the methods of data sharding in distributed systems?
- What is consistent hashing?
- Why serialize data?
- How to choose a data compression algorithm?
- How to choose a serialization framework in a distributed system?
- What is Protobuf?
- What is Apache Thrift?
- What is Apache Avro?
- What is Kryo?
- What is the difference between columnar storage and row storage?
- How to choose a columnar storage format?
- What is ORCFile?
- What is Parquet?
- What is a data warehouse?
- What is the difference between a data warehouse and a database?
- What is the difference between OLTP and OLAP?
- How is the data warehouse layered?
- How is a data warehouse modeled?
- What are fact tables and dimension tables?
- What is Business Intelligence (BI)?
- From the perspective of system architecture, how should servers be classified?
- What is MPPDB?
- What is the difference between MPPDB and Hadoop?
- Which server architecture should a data warehouse choose?
- What are the parallel computing models?
- What is the difference between BSP and MapReduce?
- What are the implementation methods of OLAP?
- What is Cube technology?
- What is NoSQL?
- What is load balancing?
- What are the load balancing algorithms?
- How to implement forwarding in a distributed system?
- What is the role of the big data resource scheduling framework?
- What are the technical difficulties in resource scheduling?
- What is multi-tenancy technology?
- What are the flaws in the traditional YARN and Mesos scheduling schemes?
- What is an inverted index?
- What is enterprise data?
- What is a data lake? Why do you need a data lake?
- What is the lifecycle of data in a data lake?
- What is the difference between a data warehouse, a data mart, and a data lake?
- What is Lambda architecture?
- What is the Kappa architecture?
- How to apply the Lambda architecture to the data lake? What are the functional modules in the data lake?
- What challenges do enterprise data lakes face?
- What exactly is RAID technology?
- Why do you need a workflow scheduling system?
- Why have a message queue/message engine system?
- What is a cloud native database?
- What is the future development trend of the database field?
References
- Geek Time column "Learning Big Data from Scratch" by Li Zhihui
- Geek Time column "Large-Scale Data Processing in Practice" by Cai Yuannan
- "Big Data Technology and Application in Cloud Computing" by Liang Fan
- "Big Data Development and Application" edited by Qingdao Yinggu Education Technology Co., Ltd., Shandong Business and Technology College
- "Detailed Explanation of Big Data Technology System: Principles, Architecture and Practice" by Dong Xicheng
- "Hadoop Big Data Mining from Beginner to Advanced Practice (Video Tutorial Edition)" edited by Deng Jie
- "Detailed Explanation of Big Data Architecture: From Data Acquisition to Deep Learning" edited by Zhu Jie and Luo Hualin
- "Kafka: The Definitive Guide" by Neha Narkhede, Gwen Shapira, and Todd Palino; translated by Xue Mingdeng
- "Hadoop Big Data Technology Principles and Applications" by Heima Programmer
- "Enterprise Data Lake" by Tomcy John and Pankaj Misra; translated by Zhang Shiwu, Li Xiang, and Zhang Haolin
- "Big Data Technology and Application Research" by Hu Pei, Han Pu
- "Hadoop & Spark Big Data Development Practice" edited by Xiao Rui and Lei Gangyue
- CS-Notes
- ClickHouse official website
- ClickHouse in-depth reveal
- What are distributed transactions and what are the solutions?
- Distributed theory (2) - Base theory
- Distributed System Metrics
- Baidu Encyclopedia Sequential Consistency Model
- An easy-to-understand guide to the differences and connections between strong consistency, weak consistency, eventual consistency, read-your-writes consistency, monotonic reads, and causal consistency
- Distributed system learning - data sharding
- Learning data sharding in distributed systems with questions
- Baidu Encyclopedia Consistent Hash
- Detailed Explanation of Apache Thrift Series (1) - Overview and Getting Started
- A preliminary study on the use of Protostuff
- High-performance serialization and deserialization: simple use of kryo
- Small perspective of big data 2: ORCFile and Parquet, the business behind the open source circle
- A new generation of columnar storage format Parquet
- Those things about Parquet (1) Basic principles
- Let’s talk about the Parquet columnar storage format
- Introduction to MPP (Massively Parallel Processing)
- MPP architecture
- Baidu Encyclopedia NoSQL
- Compression of several common compression formats in big data
- zstd, the future data compression algorithm
- Is zstd splitabble in hadoop/spark/etc?
- Aliyun Li Feifei: What is a cloud-native database