Apache Cassandra and Apache Ignite: Thoughts on Simplifying the Architecture

Apache Cassandra is becoming increasingly popular among software architects and engineers, and many of them trust it with their data. This well-known NoSQL database now has a large number of deployments. There is no doubt that Cassandra lives up to its reputation and achieves its goal in a simple way: practically unlimited scalability and high availability, together with fast writes.

These two core capabilities have helped Cassandra rise rapidly, because it solves problems that traditional relational databases cannot. Such problems call for horizontal scaling, high availability, fault tolerance, and uninterrupted 24/7 operation, which traditional relational databases cannot fully deliver, at least for now (distributed relational databases such as Google Spanner and CockroachDB are possible exceptions).

However, scalability and high availability come at a price. Spoiled by the simple design principles and rich features of relational databases, we have to learn how to use Cassandra properly, how to model data properly, and how to solve problems without advanced SQL features.

In this article, I'll show a lesser-known side of Cassandra's data modeling concepts, give an overview of a Cassandra-based architecture, and then suggest how to simplify that architecture by looking at the rich features that modern databases can offer.

Get data modeling right

Of course, mastering Cassandra's data modeling approach takes time (not a big deal), and there are plenty of resources out there. The approach is based on a denormalization strategy that requires us to guess in advance every query that will run against the database. Frankly, that is feasible: imagine you have a set of queries, you optimize Cassandra's tables for those queries, and then you go into production.

This is called a query-driven approach: application development is driven by queries, and you cannot develop the application without knowing what the queries look like. The approach is a bit tricky, but in return it gives you fast and cheap writes in Cassandra.

For example, suppose an application wants to track all vehicles produced by a manufacturer and then evaluate the production capacity of each manufacturer. In a relational world, the data model might be as follows:

[Figure: relational data model (diagram not reproduced)]

It is technically possible to use the same schema in Cassandra, but architecturally it is not practical, because Cassandra cannot join tables, and we definitely need to combine the vehicle, manufacturer, and production data into a single result set. To do that in Cassandra, you have to abandon the relational model and adopt the denormalization strategy.

This strategy leads us to list the queries the application needs and then design the model around them. It is actually pretty simple. The following example illustrates the strategy for those unfamiliar with Cassandra.

Suppose the application needs to support the following queries:

Q1: Get the models produced by a manufacturer within a specific time window (newest first)

To execute this query efficiently in Cassandra, create the following table, which uses vendor_name as the partition key and production_year and car_model as the clustering columns:

CREATE TABLE cars_by_vendor_year_model (
 vendor_name text,
 production_year int,
 car_model text,
 total int,
 PRIMARY KEY ((vendor_name), production_year, car_model)
) WITH CLUSTERING ORDER BY (production_year DESC, car_model ASC);

After that, you can execute a Cassandra query based on Q1 defined earlier:

select car_model, production_year, total from cars_by_vendor_year_model where vendor_name = 'Ford Motors' and production_year >= 2017

On this basis, the table can also satisfy the following operations:

  • Get all models produced by a manufacturer:
select * from cars_by_vendor_year_model where vendor_name = 'Ford Motors'
  • Get the production of a specific model in a given year:
select * from cars_by_vendor_year_model where vendor_name = 'Ford Motors' and production_year = 2016 and car_model = 'Explorer'
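
To see why all three queries are cheap, it can help to picture a partition as a list of rows kept sorted by the clustering columns. The Python sketch below is a simplified in-memory model under that assumption, not Cassandra's actual storage engine; the sample rows and totals are made up, and the helper names (sort_key, slice_from_year, find_exact) are hypothetical. Because the table declares production_year DESC, the model sorts on the negated year, so production_year >= 2017 becomes one contiguous run at the front of the partition.

```python
from bisect import bisect_left

# Made-up rows for one partition (vendor_name = 'Ford Motors'):
# (production_year, car_model, total)
rows = [
    (2016, "Explorer", 120),
    (2017, "Edge", 90),
    (2018, "Edge", 110),
    (2017, "Explorer", 130),
    (2016, "Edge", 80),
]

def sort_key(row):
    """Clustering order of the table: production_year DESC, car_model ASC."""
    year, model, _total = row
    return (-year, model)

# The partition is stored sorted by the clustering columns.
partition = sorted(rows, key=sort_key)
keys = [sort_key(r) for r in partition]

def slice_from_year(min_year):
    """Q1: rows with production_year >= min_year form a prefix of the partition."""
    # A 1-tuple sorts before every 2-tuple that starts with the same value,
    # so this finds the first row whose year is below min_year.
    end = bisect_left(keys, (-(min_year - 1),))
    return partition[:end]

def find_exact(year, model):
    """Point lookup on the full clustering key via binary search."""
    i = bisect_left(keys, (-year, model))
    if i < len(keys) and keys[i] == (-year, model):
        return partition[i]
    return None

q1 = slice_from_year(2017)             # newest first, no filtering needed
exact = find_exact(2016, "Explorer")   # one binary search, one row
print(q1)
print(exact)
```

Reading the whole partition (the first bullet) is simply the full list, already in the declared order.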

Next, you repeat this exercise for every query the application needs to support, and once that is done, you can deploy the application to production. The job is done, so get ready for your bonus!

Shortcomings

Well, you may not get that bonus after all.

Applications in production often run into the shortcomings of a Cassandra-based architecture. This usually happens when someone steps outside the box we defined and wants to quickly add a new operation to enhance the application. This is where Cassandra falls short.

If the data model is relational, you can prepare an SQL query, create an index (if necessary), and patch the production environment. With Cassandra it is not that simple. If the query cannot be executed, or cannot be executed efficiently, because of the constraints of the schema, you need to create a new Cassandra table, define its partition key and clustering columns to suit that specific query, and copy the necessary data from the existing table.

Let's go back to the vehicle and manufacturer application, which is now used by a large number of users, and suppose it has to meet a new requirement:

Q2: Get the production output of a specific model from a manufacturer.

After thinking about it for a while, you might conclude that a new query can be built on the cars_by_vendor_year_model table created earlier. The query is written and then executed:

select production_year, total from cars_by_vendor_year_model where vendor_name = 'Ford Motors' and car_model = 'Edge'

However, the query fails with the following exception:

InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column "car_model" cannot be restricted (preceding column "production_year" is not restricted)"

The exception says that the production year must be restricted before the data can be filtered by car_model. But the year is unknown, so a new, different table has to be created to satisfy Q2:

CREATE TABLE cars_by_vendor_model (
 vendor_name text,
 car_model text,
 production_year int,
 total int,
 PRIMARY KEY ((vendor_name), car_model, production_year)
);

Finally, the query that satisfies Q2 is executed successfully:

select production_year, total from cars_by_vendor_model where vendor_name = 'Ford Motors' and car_model = 'Edge'
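
A rough way to see why reordering the clustering columns fixes Q2 is to look at where the rows for one model end up in each layout. The Python sketch below is a simplified in-memory model with made-up rows, not Cassandra internals, and the helper names (positions, is_contiguous) are hypothetical. In the first layout the 'Edge' rows are interleaved with other models, so they cannot be read as a single range; in the second layout they sit next to each other.

```python
# Made-up rows for the 'Ford Motors' partition:
# (production_year, car_model, total)
rows = [
    (2016, "Edge", 80),
    (2016, "Explorer", 120),
    (2017, "Edge", 90),
    (2017, "Explorer", 130),
    (2018, "Edge", 110),
]

def positions(sorted_rows, model):
    """Indexes of the rows for `model` within a sorted partition."""
    return [i for i, (_year, m, _total) in enumerate(sorted_rows) if m == model]

def is_contiguous(indexes):
    """True when the indexes form one unbroken run, i.e. a single range read."""
    return bool(indexes) and indexes[-1] - indexes[0] + 1 == len(indexes)

# cars_by_vendor_year_model: clustered by production_year DESC, car_model ASC
by_year_model = sorted(rows, key=lambda r: (-r[0], r[1]))
# cars_by_vendor_model: clustered by car_model, production_year (default ASC)
by_model_year = sorted(rows, key=lambda r: (r[1], r[0]))

edge_in_first = positions(by_year_model, "Edge")
edge_in_second = positions(by_model_year, "Edge")

print(edge_in_first, is_contiguous(edge_in_first))    # [0, 1, 3] False
print(edge_in_second, is_contiguous(edge_in_second))  # [0, 1, 2] True
```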

Now take a step back, compare the structure of cars_by_vendor_year_model and cars_by_vendor_model, and see how many differences you can spot. Essentially there is just one: the clustering columns were rearranged! And yet, just to satisfy Q2, the following had to happen:

  • A new table was created and the existing data was copied into it;
  • Batches were embedded in the application to keep both tables updated atomically;
  • The application architecture became more complex.
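
The first two items can be sketched as follows. This Python snippet only builds CQL text and bind values; it does not connect to a cluster, and the function names (build_dual_write, build_backfill) are hypothetical. It assumes a logged batch (BEGIN BATCH ... APPLY BATCH) with ? bind markers as used with prepared statements. A logged batch gives all-or-nothing application across both tables, though not isolation between them.

```python
# Both tables from the article hold the same columns, in a different key order.
TABLES = ("cars_by_vendor_year_model", "cars_by_vendor_model")

def build_dual_write(vendor_name, production_year, car_model, total):
    """Return (cql, params) for one logged batch that writes a row to both tables."""
    insert = (
        "  INSERT INTO {table} (vendor_name, production_year, car_model, total) "
        "VALUES (?, ?, ?, ?);"
    )
    statements = [insert.format(table=t) for t in TABLES]
    cql = "BEGIN BATCH\n" + "\n".join(statements) + "\nAPPLY BATCH;"
    row = (vendor_name, production_year, car_model, total)
    # One set of bind values per INSERT, in statement order.
    return cql, row * len(TABLES)

def build_backfill(existing_rows):
    """Return (cql, params) pairs that copy rows read from the old table
    (given here as dicts) into cars_by_vendor_model."""
    cql = (
        "INSERT INTO cars_by_vendor_model "
        "(vendor_name, car_model, production_year, total) VALUES (?, ?, ?, ?);"
    )
    return [
        (cql, (r["vendor_name"], r["car_model"], r["production_year"], r["total"]))
        for r in existing_rows
    ]

batch_cql, batch_params = build_dual_write("Ford Motors", 2018, "Edge", 110)
print(batch_cql)
print(batch_params)
```

Every new query shape that needs its own table adds another entry to TABLES and another backfill, which is how the complexity in the third item accumulates.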

This story will repeat itself again and again unless the application stops evolving, which in practice does not happen, at least for the first few years. So be prepared to sink into an ever more complex architecture. Is there a way to avoid it? Absolutely. Can a miracle happen with Cassandra's existing functionality? No.

Can Apache Ignite solve it?

SQL queries with joins are not cheap. Even a relational database running on a single machine may "choke" as the workload grows, which is why many people turn to Cassandra, even though they then have to put up with the shortcomings of its data modeling technique and the architectural complexity it brings.

The market for distributed storage, databases, and platforms is growing substantially. It should therefore be possible to find a database that offers the scalability and high availability of Cassandra while still letting applications be built on a relational model.

Looking around among the Apache Software Foundation (ASF) projects, we find Apache Ignite, a memory-centric data store that can be used as a distributed cache or database with built-in SQL, key-value, and compute APIs. Ignite is still in the shadow of Cassandra, its older ASF sibling. Although some people still choose Cassandra for scalability, high availability, and persistence, Ignite's users report that it is hard to beat for SQL, distributed transactions, and in-memory storage. Additionally, those who rely heavily on Cassandra in production can use Ignite as a caching layer to speed it up, which could be an intermediate step toward replacing Cassandra with Ignite's own storage.

Have you joined the Ignite community like me? Then look forward to the next article, in which Ignite will be used to build an architecture based on a simple relational model, with an example that relies on affinity collocation, partitioning, efficient collocated SQL joins, and other techniques. If you can't wait or want to figure this out yourself, you can read Parts 1 and 2 to learn about the main features and concepts of Ignite.

This article is translated from Denis Magda's blog.

