"Designing Data-Intensive Applications": reading notes

It took me about half a year to read "Designing Data-Intensive Applications" from cover to cover. It really is a good book, worth a 10.0 on Douban.

In a single book, the author explains almost every component of data storage and data processing: MySQL, Redis, Solr, MongoDB, Elasticsearch, Kafka, Hive, HBase, Spark, Flink, MapReduce, Neo4j, Titan, InfiniteGraph and so on. Covering virtually every type of database, queue, NoSQL store, batch processing component, and stream processing component in one volume gives the book the sweeping feel of the old lines "pointing at the rivers and mountains, stirring the times with words, counting the great lords of the day as dirt".

The book walks through the progression from the document data model to the relational data model to the graph data model, and the problems each one faces; it explains the core problems of distributed systems; and it introduces the business scenarios that batch processing and stream processing can handle. Of course, because the book covers so much technical ground, it does not go through the features and functions of each component in detail. Instead it describes, at a more macro level, the characteristics of each component and the business scenarios it can cover.

The following points in the book impressed me most:

1 Data Model

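The book introduces document databases, relational databases, and graph databases. Document databases are schemaless, with MongoDB and Elasticsearch as typical examples; relational databases need no introduction, MySQL being the most widely used; graph databases are used less often, and Neo4j is the typical one. Going from documents to graphs means going from no relationships to relationships everywhere. As a result, MongoDB, Elasticsearch, MySQL, and Neo4j are no longer isolated points in my knowledge but links in a connected chain.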

2 Shell Version of a Database

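The author uses a database implemented in shell to explain indexes: from hash indexes to tree indexes, and from B-trees to LSM-trees. Hash indexes cannot handle range queries, binary trees read poorly when the index lives on disk, B-trees have write performance problems... After finishing the book I was confused for a while, probably because my wish to explore index details more deeply conflicted with my original plan.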
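The idea behind the shell database and its hash index can be sketched in Python. This is only an illustrative sketch, not the author's shell code: the class name `LogStore`, the `key,value` record format, and the file name are assumptions made for the example. Writes append to a log file, and an in-memory dict maps each key to the byte offset of its latest record.

```python
import os
import tempfile


class LogStore:
    """Append-only key-value log with an in-memory hash index.

    Hypothetical sketch: keys must not contain commas or newlines,
    and the index is not rebuilt when the process restarts.
    """

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record
        open(path, "ab").close()  # create the log file if it is missing

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()  # the new record starts at the current end
            f.write(record)
        self.index[key] = offset  # older records stay in the log, unreferenced

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # one seek instead of scanning the whole log
            line = f.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value


def demo():
    path = os.path.join(tempfile.mkdtemp(), "database.log")
    db = LogStore(path)
    db.set("42", "San Francisco")
    db.set("7", "London")
    db.set("42", "Berlin")  # overwrites by appending a newer record
    return db.get("42"), db.get("7"), db.get("missing")
```

Because the dict has no ordering, a query such as "all keys between 10 and 50" would have to examine every key, which is the range query weakness of hash indexes mentioned above.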

3 Shell Version of MapReduce

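The author analyzes logs by combining a few simple commands (cat, awk, sort, uniq, head) and then goes on to explain MapReduce. This entry point is a classic, and far more interesting than the usual MapReduce word-count example.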
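The same pipeline can be sketched in Python to show how it lines up with MapReduce: extracting a field is the map step, sorting is the shuffle, and counting adjacent duplicates is the reduce step. This is a rough sketch under assumptions: the function names are made up for the example, and the log is assumed to be whitespace-separated with the requested URL in the seventh field, as in a typical web server access log.

```python
from itertools import groupby


def map_phase(lines):
    """Like awk '{print $7}': emit (url, 1) for each log line."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 7:
            yield fields[6], 1


def shuffle(pairs):
    """Like sort: bring identical keys next to each other."""
    return sorted(pairs, key=lambda kv: kv[0])


def reduce_phase(sorted_pairs):
    """Like uniq -c: sum the counts of each run of identical keys."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def top_urls(lines, n=5):
    """Like sort -r -n | head -n 5: the n most requested URLs."""
    counts = reduce_phase(shuffle(map_phase(lines)))
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:n]


def make_line(url):
    # Hypothetical sample line in a common access log layout.
    return f'127.0.0.1 - - [27/Feb/2015:17:55:11 +0000] "GET {url} HTTP/1.1" 200 3377'


SAMPLE = [make_line(u) for u in ["/a", "/b", "/a", "/c", "/c", "/a"]]
```

Calling `top_urls(SAMPLE, n=2)` keeps only the two most requested URLs. In a real MapReduce job the three phases run on different machines, but the data flow is the same.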

4 Consistency in Distributed Systems

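The author explains the problems that distributed systems face, chiefly consistency, and from there arrives at the core of election algorithms: consensus. It is very incisive. Unfortunately, I have always kept a respectful distance from the details of distributed algorithms; the pit looks rather deep, and I am in no hurry to jump in.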

There are too many highlights to list. If you first get to know the details of some of these systems and then read the book, you will probably take away even more from it.

The overall thread of the book is clear:

Any system must face three questions: reliability, scalability, and maintainability. Starting from these three core points, the book explains how data-intensive applications achieve all three goals.

At the data model level, provide a unified abstraction for data access, such as SQL.
Based on the data model, design and implement various indexes to guarantee system performance.
At the network transmission level, design various data encodings that balance the conflict between ease of use and performance.

When a single node cannot meet the concurrency needs of the business, scale the system with replicas. This is where consistency problems begin to appear.
When a single storage node cannot meet the needs of the business, scale the system with sharding to remove the storage bottleneck, whether in capacity or in performance.
Use atomic transactions to solve the problem of inconsistent data.
The book then lists the various troubles of distributed systems and the algorithms that address them.
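The sharding step in the list above can be illustrated with a small sketch of hash partitioning. The names `shard_for` and `ShardedStore` are hypothetical, each shard is just an in-process dict standing in for a storage node, and MD5 is used only to get a hash that is stable across runs, not for security.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard number in [0, num_shards)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that spreads keys over several shards."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate storage node (an assumption).
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for user in ["alice", "bob", "carol", "dave"]:
    store.put(user, user.upper())
```

Taking the hash modulo the number of shards is the simplest possible scheme. Its weakness is that changing the number of shards moves most keys to a different shard, which is one reason real systems tend to use more careful partitioning schemes.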

Finally, the book turns to derived data and explains batch processing and stream processing.

In effect, the author has painted a panorama of data processing components. With so many data components available, choosing among them when designing a system architecture is a real test. Fortunately, once you understand the similarities and differences among the components, you can learn by analogy and carry what you know about one over to the others.

Origin blog.51cto.com/sbp810050504/2406541