How to debug disk full errors in Redshift

When I first used Amazon Redshift, I quickly realized that it was different from other relational databases.
You gain new tools such as COPY and UNLOAD, and you lose familiar helpers such as enforced key constraints.
Compared with traditional databases, you can process larger data sets faster, but to make full use of Redshift you have to learn its quirks.

One of the problems we ran into early on was the dreaded disk full error, especially when we knew we had free disk space.
Over the last year we have collected a number of resources on how to manage disk space in Redshift.
We're sharing what we've learned to help you quickly debug your Redshift cluster and get the most out of it.

If you encounter a disk full error when running a query, one thing is certain: one or more nodes in the cluster ran out of disk space while the query was running.
This can happen either because the query used a lot of memory and spilled to disk, or because the query was fine but there is simply too much data on the cluster's disks.
You can find out which case you're in by querying the stv_partitions table to see how much space each node is using.
I like to use this query from FlyData.
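The linked query is more complete, but a minimal sketch of the same idea sums used and total capacity per node from stv_partitions (both columns are reported in 1 MB blocks):

```sql
-- Disk space used vs. available, per node
SELECT owner AS node,
       SUM(used)     AS used_mb,
       SUM(capacity) AS capacity_mb,
       ROUND(SUM(used)::numeric / SUM(capacity) * 100, 1) AS pct_used
FROM stv_partitions
GROUP BY owner
ORDER BY pct_used DESC;
```

A node sitting well above the others in pct_used is the first hint that skew, not overall capacity, is the problem.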

Ideally, you won't use more than 70% of your cluster's disk capacity.
Redshift can keep working even above 80%, but it may still cause you problems.
If you seem to have enough space, continue to the next section; if you're using more than 90% of your capacity, you should definitely skip ahead to the section on encoding.


If the failed query contains a join clause, that join is most likely the cause of your error.
When Redshift joins rows from different tables, it has several strategies to choose from.
By default it performs a "hash join": it builds a hash of the join key for each table and redistributes those hashes across the other nodes in the cluster.
This means each node has to store a hash for every row involved in the join.
When joining large tables, this quickly fills up disk space.
If, however, the rows being joined already live on the same nodes, the join can be performed locally without redistributing any data.
By setting up your tables so that their dist keys match, you can avoid disk full errors.
Watch out for skew when choosing a dist key, though, which we'll discuss in the next section.
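As a sketch, with hypothetical users and subscriptions tables, distributing both on the joining column looks like this:

```sql
-- Both tables distributed on the join column, so matching rows
-- land on the same node and the join needs no redistribution
CREATE TABLE users (
    id    BIGINT,
    email VARCHAR(256)
)
DISTKEY (id);

CREATE TABLE subscriptions (
    id      BIGINT,
    user_id BIGINT,
    plan    VARCHAR(64)
)
DISTKEY (user_id);

-- EXPLAIN should report DS_DIST_NONE for this join,
-- meaning no data movement between nodes
SELECT u.email, s.plan
FROM users u
JOIN subscriptions s ON s.user_id = u.id;
```

You can confirm the effect by running EXPLAIN on the join before and after changing the dist keys.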

If you can't change the dist key (perhaps because it's already optimized for other queries, or because a new key would cause skew issues), you may still be able to rework the query so it can run.
Here are some things you can try:

Use subqueries instead of joins.
Some queries that use joins only need data from one table and use the join merely to verify some condition.
In these cases, the join can usually be replaced with an IN clause and a subquery.
For example, a common query for us is to get some information about users who have subscriptions.
Instead of selecting from both tables, we can select users whose IDs appear in the subscriptions table.
Holding the results of the subquery still takes some memory, but usually far less than a hash join.
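Using hypothetical users and subscriptions tables, the rewrite might look like this:

```sql
-- Join version: Redshift may hash and redistribute both tables
SELECT u.*
FROM users u
JOIN subscriptions s ON s.user_id = u.id;

-- Subquery version: only the user_id values from subscriptions
-- need to be held in memory
SELECT *
FROM users
WHERE id IN (SELECT user_id FROM subscriptions);
```

The subquery version also avoids the duplicate rows the join would produce for users with more than one subscription.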

Create and join subtables.
In many cases we only need a small subset of the data from the tables we're joining, yet we hash-join the entire tables.
In this situation you can create a table, usually a temporary one, that contains just the subset you need, with any necessary filtering already applied.
That way the hashes are much smaller, since you're joining two small tables.
You can even distribute the subtables so that no hash join happens at all.
Again, this option uses some memory, but far less than a hash join of the full tables.
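A sketch of this pattern, with hypothetical users and subscriptions tables and a status column to filter on:

```sql
-- Filter first into a small temp table, distributed on the join key
CREATE TEMP TABLE active_subscriptions DISTKEY (user_id) AS
SELECT user_id, plan
FROM subscriptions
WHERE status = 'active';

-- The join now involves a much smaller table; if users is also
-- distributed on its id, the join can run without redistribution
SELECT u.email, a.plan
FROM users u
JOIN active_subscriptions a ON a.user_id = u.id;
```

Temp tables are dropped at the end of the session, so they don't add to long-term disk usage.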

If you have set dist keys, you may be dealing with skew: one node holding more rows of a table than the others.
Significant skew can cause disk full errors even on routine queries, because any extra disk space a query uses can push an overloaded node into throwing an error.
This query from Amazon is great for checking for skewed tables.
As described in the link, a high value in the skew column, or a low value in the slices populated column, especially for large tables, means you should probably rethink the dist strategy for those tables.
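Amazon's linked query is more thorough, but as a quick first pass you can read the skew_rows column from the svv_table_info system view:

```sql
-- skew_rows is the ratio of rows on the fullest slice to the
-- emptiest slice; values far above 1 mean an uneven dist key
SELECT "table", diststyle, tbl_rows, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC NULLS LAST
LIMIT 20;
```

Tables with a KEY diststyle and high skew_rows are the first candidates for a new dist key.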

When we set up distribution for a large table on our cluster, we chose a key with a large number of possible values, so rows should have been spread evenly across the nodes.
What we hadn't realized, however, was that this column was null for many of the rows.
All of those rows were then stored on the same node of the cluster, and that node would throw a disk full error on almost every query, even though overall we were only using 75% of our disk space.
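All rows with a NULL dist key hash to the same slice, so a simple null count on the dist key column catches this failure mode before it bites (orders and customer_id here are hypothetical names):

```sql
-- If this count is large, a dist key on customer_id will pile
-- all of these rows onto a single node
SELECT COUNT(*) AS null_dist_key_rows
FROM orders
WHERE customer_id IS NULL;
```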

One feature that sets Redshift apart from traditional SQL databases is that columns can be encoded to take up less space.
However, there is no automatic encoding, so the user has to choose how columns will be encoded when creating a table.
You can read about the many encoding options in the Amazon documentation.
The easiest way to get started with encoding is to use Amazon's Python script to analyze your tables and get recommendations.
If you're running out of disk space and haven't encoded your tables, you can recover a sizable amount of space this way.
If you have already encoded your tables, it's worth checking the svv_table_info table to see whether any unencoded tables have been added, or rerunning the script to see whether any table's encoding should change.
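Redshift can also suggest encodings directly with ANALYZE COMPRESSION, and svv_table_info flags tables with no encoding at all (users here is a hypothetical table name):

```sql
-- Ask Redshift to recommend an encoding for each column
ANALYZE COMPRESSION users;

-- Find large tables that have no column encoding defined
SELECT "table", size AS size_mb, encoded
FROM svv_table_info
WHERE encoded = 'N'
ORDER BY size DESC;
```

Note that ANALYZE COMPRESSION only reports recommendations; applying them means recreating the table with the new encodings.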

When new rows are added to a Redshift table, they are not inserted in the specified sort order, which matters for some encoding types to work well, and when rows are deleted, the space is not automatically freed.
Vacuuming handles both of these problems.
Running the vacuum command on a table sorts it and frees the space used by deleted rows.
If you have added or removed a large number of rows from a table, vacuuming that table will free up some space.
You can read about how to run the vacuum command and its options here.
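The basic forms look like this (events is a hypothetical table name):

```sql
-- Full vacuum: re-sort rows and reclaim space from deleted rows
VACUUM FULL events;

-- Cheaper variants when you only need one of the two effects
VACUUM SORT ONLY events;
VACUUM DELETE ONLY events;
```

If disk space is the immediate problem, DELETE ONLY reclaims it without paying for the sort.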

The idea of vacuuming comes from Redshift's parent project Postgres, but if you're familiar with Postgres you may be surprised to find that vacuuming doesn't happen automatically and the command must be run manually.
It's also worth noting that only one table can be vacuumed at a time, so you need to schedule your vacuums carefully.
To save yourself from having to vacuum, prefer dropping the table or using the "truncate" command over the "delete" command when removing large amounts of data, since those commands free up disk space automatically and no vacuum is needed.
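The difference in practice, with a hypothetical staging_events table:

```sql
-- DELETE marks rows dead; the space comes back only after a vacuum
DELETE FROM staging_events;
VACUUM staging_events;

-- TRUNCATE (or DROP TABLE) frees the space immediately
TRUNCATE staging_events;
```

One caveat: unlike DELETE, TRUNCATE commits immediately and cannot be rolled back.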

If you've followed this guide, hopefully you have enough space on your cluster and have stopped seeing disk full errors.
If they persist, though, you really have only two options left: delete data or buy another node.

Don't be afraid to look into deleting data.
We've occasionally saved space by auditing our tables and clearing out data that was used in experiments and now-defunct projects.
Just remember to vacuum any table you delete rows from.
If you have questions about managing a Redshift cluster, or if you've found another way of managing space, let us know.
