nodetool repair for daily maintenance of Cassandra

The premise
Cassandra operations keyed on the partition key are very fast, which is its strength, but multi-condition queries are weak, and they become truly painful once deletes are involved. A delete in Cassandra is not a real deletion; it is an insert. The inserted record is called a tombstone, and it stores the key of the deleted record and the deletion time. When you query by condition, Cassandra first reads every record that matches the condition, tombstones included, then filters out the deleted records and returns the rest to you.
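
A minimal cqlsh sketch of this behaviour (the demo keyspace, table and values are invented for illustration, and the exact trace wording differs between versions):

CREATE KEYSPACE IF NOT EXISTS demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS demo.events (a int, b int, PRIMARY KEY (a, b));

INSERT INTO demo.events (a, b) VALUES (1, 1);
-- the DELETE is itself a write: it inserts a tombstone for row (1, 1)
DELETE FROM demo.events WHERE a = 1 AND b = 1;

-- with tracing on, the read shows the tombstone being scanned and filtered out,
-- e.g. a trace line like "Read 0 live and 1 tombstoned cells" on 2.x
TRACING ON;
SELECT * FROM demo.events WHERE a = 1;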

Phenomenon
Suppose your table mykeyspace.t_table has 3 replicas and its primary key is (a, b, c). You insert 4000 rows with a=1, then delete 3999 of them, and then query with a=1. You will find a warning like this in the Cassandra log:

WARN [ReadStage:18926] 2015-02-05 07:18:02,869 SliceQueryFilter.java Read 1 live and 11997 tombstoned cells in mykeyspace.t_table (see tombstone_warn_threshold)....

This warning is driven by the tombstone_warn_threshold configured in your cassandra.yaml. In other words, the read scanned 11997 = 3999 * 3 cells and found only one live one. This is the danger of tombstones: as you delete more and more data, once the number of scanned tombstones reaches tombstone_failure_threshold, the query fails outright and you will see the ERROR log below.

ERROR [ReadStage:219774] 2015-02-04 00:31:55,713 SliceQueryFilter.java (line 200) Scanned over 100000 tombstones in mykeyspace.t_table; query aborted (see tombstone_fail_threshold)
ERROR [ReadStage:219774] 2015-02-04 00:31:55,713 CassandraDaemon.java (line 199) Exception in thread Thread[ReadStage:219774,5,main]
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1916)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
        at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:202)
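
For reference, the scenario can be reproduced with CQL roughly as follows (the column types and replication settings are assumptions inferred from the article; in a stock cassandra.yaml the thresholds default to tombstone_warn_threshold: 1000 and tombstone_failure_threshold: 100000):

CREATE KEYSPACE IF NOT EXISTS mykeyspace
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS mykeyspace.t_table (
    a int,
    b int,
    c int,
    PRIMARY KEY (a, b, c)
);

-- insert 4000 rows into the partition a=1, then delete 3999 of them
-- (abbreviated here; in practice generated by a script or driver)
INSERT INTO mykeyspace.t_table (a, b, c) VALUES (1, 1, 1);
-- ... 3999 more inserts with a=1 ...
DELETE FROM mykeyspace.t_table WHERE a = 1 AND b = 1 AND c = 1;
-- ... 3998 more deletes with a=1 ...

-- this read has to scan past the 3999 tombstones (per replica) before
-- returning the single live row, which triggers the WARN/ERROR above
SELECT * FROM mykeyspace.t_table WHERE a = 1;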

The deleted data is only physically removed at two points:

The first is when a new SSTable file is generated during compaction, and even then only tombstones older than gc_grace_seconds are removed.
gc_grace_seconds is a per-table option that can be changed with ALTER TABLE. This also means that if one of your nodes goes down for longer than gc_grace_seconds, it misses the deletes; once the other replicas have purged their tombstones, the deleted data can reappear.
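
A quick illustration of inspecting and changing it from cqlsh (the 864000-second value is just Cassandra's 10-day default, shown here for illustration):

-- show the current table options, including gc_grace_seconds
DESCRIBE TABLE mykeyspace.t_table;

-- set the tombstone grace period to 10 days
ALTER TABLE mykeyspace.t_table WITH gc_grace_seconds = 864000;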

The second comes from the routine nodetool repair operations. Run a repair at least once per gc_grace_seconds cycle, for example from cron as sketched below.
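
A minimal crontab sketch for that routine, assuming gc_grace_seconds is at its 10-day default so a weekly run per node stays well inside the cycle (the schedule and host are assumptions; adjust both for your cluster):

# every Sunday at 02:00, repair the example table on this node
0 2 * * 0  nodetool -h 192.168.1.101 repair mykeyspace t_table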

The basic syntax of nodetool repair is as follows:
nodetool -h host repair [keyspace] [cfnames]
It repairs all the data in keyspace.cfnames whose partition-key token falls on this node, both the ranges it owns as master and the copies it holds for other nodes. After you have run it on every node, with three replicas each piece of data has effectively been repaired three times, so this pass takes a very long time.
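
Filled in with the example cluster and table from this article (the host is one of the nodes in the ring output further down; run the same command on each node in turn), a single invocation looks like:

nodetool -h 192.168.1.101 repair mykeyspace t_table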



To repair only the master data
You can use the -pr (primary range) option, which repairs only the master data, that is, the data whose primary token range falls on this node. Of course, you still need to run it on every node; a loop for that is sketched after the command.
nodetool -h host -pr repair [keyspace] [cfnames]
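
A minimal shell sketch of running that across the whole cluster, using the three node addresses from the ring output below (your own address list and table names would differ):

# run a primary-range repair on each node in turn
for host in 192.168.1.101 192.168.1.102 192.168.1.103; do
    nodetool -h "$host" -pr repair mykeyspace t_table
done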

Repair by token range
When your data volume is large enough, even this is still very slow, and you may wait anxiously with no idea when it will finish. Instead, you can repair one token range at a time:
nodetool -h host -st xxx -et xxxx repair [keyspace] [cfnames]
The token ranges themselves can be obtained with the nodetool ring command:

Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: datacenter1
==========
Address         Rack        Status State   Load            Owns                Token
                                                                               9192997390010868737
192.168.1.101  rack1       Up     Normal  98.1 MB         35.79%              -9196740398802827883
192.168.1.102  rack1       Up     Normal  98.55 MB        30.03%              -9124289757510820389
192.168.1.103  rack1       Up     Normal  98.09 MB        34.18%              -9088595326201594476
192.168.1.102  rack1       Up     Normal  98.55 MB        30.03%              -9084487345633494070
192.168.1.101  rack1       Up     Normal  98.1 MB         35.79%              -9061596030643872312
192.168.1.102  rack1       Up     Normal  98.55 MB        30.03%              -9056941391849010003
192.168.1.101  rack1       Up     Normal  98.1 MB         35.79%              -9055818090063560183

This means that the master data for tokens from -9088595326201594476 to -9084487345633494070 lives on 192.168.1.102.
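
Plugging those values into the command form used above, repairing just that slice on that node for the example table would look like:

nodetool -h 192.168.1.102 -st -9088595326201594476 -et -9084487345633494070 repair mykeyspace t_table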

Multi-threaded parallel repair
You can add the -par option, for example:
nodetool -h host -pr -par repair [keyspace] [cfnames]
nodetool -h host -st xxx -et xxxx -par repair [keyspace] [cfnames]
This makes the repair several times faster, but you have to take your cluster's load into account.

Incremental repair
You can add the -inc option (note: this option is only available in versions 2.1 and later). Every piece of data carries its own timestamp, so after one repair, the next one only needs to cover data written after the last timestamp the previous repair reached. For example:
nodetool -h host -pr -inc repair [keyspace] [cfnames]
nodetool -h host -st xxx -et xxxx -inc repair [keyspace] [cfnames]
This also speeds up the repair. Note that -inc and -par cannot be used together.
