Origin
This article grew out of troubleshooting a problem in a field deployment.
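First, note that by default ES stores its data under a nodes directory. Assuming the data directory specified in the ES configuration file is /tmp/elasticsearch, ES stores its data in the following directory:
/tmp/elasticsearch/nodes
Normally the nodes directory contains a single directory named 0, that is:
/tmp/elasticsearch/nodes/0
All ES data lives under this 0 directory.
In the field, however, ES environments often end up with a directory 1 created under nodes as well: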
/tmp/elasticsearch/nodes/0
/tmp/elasticsearch/nodes/1
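Each of these directories represents one ES instance. /tmp/elasticsearch/nodes/0 is where ES instance 0 stores its data, and /tmp/elasticsearch/nodes/1 is where ES instance 1 stores its data.
The odd part is that each node in the field is only supposed to run one ES instance, so why were two created? It turned out that ES in the field is managed by a watchdog, and the watchdog sometimes launches ES again while an instance already exists, which results in multiple instances being created. This can easily lead to loss of ES data, for example because an instance that comes up on nodes/1 does not use the data stored under nodes/0.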
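To check whether a host has already ended up with more than one instance directory, the layout described above can be inspected directly. The following is a minimal sketch, not part of ES itself: the class name NodeDirCheck and the method listNodeOrdinals are hypothetical, and the default path /tmp/elasticsearch is simply the example path used in this article.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; it is not an ES class.
public class NodeDirCheck {

    /** Returns the numeric instance directories found under dataPath/nodes, sorted ascending. */
    static List<Integer> listNodeOrdinals(Path dataPath) throws IOException {
        Path nodes = dataPath.resolve("nodes");
        List<Integer> ordinals = new ArrayList<>();
        if (!Files.isDirectory(nodes)) {
            return ordinals; // no nodes directory yet, so no instances
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(nodes)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                // Only directories with a short numeric name count, e.g. nodes/0 and nodes/1
                if (Files.isDirectory(p) && name.matches("\\d{1,9}")) {
                    ordinals.add(Integer.parseInt(name));
                }
            }
        }
        Collections.sort(ordinals);
        return ordinals;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the data path is passed as the first argument, else the article's example path
        Path dataPath = Paths.get(args.length > 0 ? args[0] : "/tmp/elasticsearch");
        List<Integer> ordinals = listNodeOrdinals(dataPath);
        System.out.println("instance directories: " + ordinals);
        if (ordinals.size() > 1) {
            System.out.println("warning: more than one instance directory exists under " + dataPath);
        }
    }
}
```

Run against the data path of a node, a result with more than one entry suggests that a second instance was started there at some point.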
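Solution
So how can the creation of multiple instances be avoided?
The maximum number of ES instances that may share a data path is set by the node.max_local_storage_nodes parameter in elasticsearch.yml. If this parameter is set to 1 and one ES instance already exists, an attempt by the watchdog to bring up another ES instance will fail, which guarantees that the node runs at most one ES instance. In ES 2.x the default value of this setting is 50. The field deployment runs ES 2.x and did not configure this parameter, which is why multiple instances could be created. It is worth noting that ES 5.x recognized how dangerous this default is and changed it to 1; see https://github.com/elastic/elasticsearch/pull/19964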
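As a configuration fragment, the fix described above amounts to a single line in elasticsearch.yml. The setting name is the one read by the ES 2.3 source quoted later in this article; everything else in the file is left as it is.

```yaml
# elasticsearch.yml
# Allow at most one ES instance to use this data path
node.max_local_storage_nodes: 1
```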
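Source code analysis
Next, let's take a closer look at the key source code behind ES multi-instance behavior.
This code is located in the constructor of the NodeEnvironment class.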
ES2.3
// Read the node.max_local_storage_nodes setting; use the default of 50 if it is not set
int maxLocalStorageNodes = settings.getAsInt("node.max_local_storage_nodes", 50);
for (int possibleLockId = 0; possibleLockId < maxLocalStorageNodes; possibleLockId++) {
    for (int dirIndex = 0; dirIndex < environment.dataWithClusterFiles().length; dirIndex++) {
        // Create the directory in which this instance stores its data, for example /tmp/elasticsearch/nodes/0
        Path dir = environment.dataWithClusterFiles()[dirIndex].resolve(NODES_FOLDER).resolve(Integer.toString(possibleLockId));
        Files.createDirectories(dir);
        try (Directory luceneDir = FSDirectory.open(dir, NativeFSLockFactory.INSTANCE)) {
            logger.trace("obtaining node lock on {} ...", dir.toAbsolutePath());
            try {
                // The current instance tries to obtain the file lock for this instance directory. If the directory
                // is already in use by another instance, the attempt fails and the outer loop moves on to the next id
                locks[dirIndex] = luceneDir.obtainLock(NODE_LOCK_FILENAME);
                nodePaths[dirIndex] = new NodePath(dir, environment);
                localNodeId = possibleLockId;
            } catch (LockObtainFailedException ex) {
                logger.trace("failed to obtain node lock on {}", dir.toAbsolutePath());
                // release all the ones that were obtained up until now
                releaseAndNullLocks(locks);
                break;
            }
        } catch (IOException e) {
            logger.trace("failed to obtain node lock on {}", e, dir.toAbsolutePath());
            lastException = new IOException("failed to obtain lock on " + dir.toAbsolutePath(), e);
            // release all the ones that were obtained up until now
            releaseAndNullLocks(locks);
            break;
        }
    }
    // If the file lock was obtained, leave the loop
    if (locks[0] != null) {
        // we found a lock, break
        break;
    }
}
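The effect of the loop above can be reproduced outside ES with plain JDK file locks. The sketch below is a simplified model under stated assumptions, not the ES implementation: it uses java.nio FileChannel.tryLock instead of Lucene's NativeFSLockFactory, handles a single data path only, and assumes the lock file is called node.lock (the value of NODE_LOCK_FILENAME is not shown in the excerpt). The names NodeLockSketch, Claim and claim are hypothetical. Because JDK file locks are held per process, the sketch also keeps an in-JVM set of held lock files so that several simulated instances can run inside one JVM.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the node-lock loop; not an ES class.
public class NodeLockSketch {
    // Lock files already held inside this JVM (OS file locks are per process).
    private static final Set<Path> HELD = ConcurrentHashMap.newKeySet();

    /** One simulated instance's hold on a nodes/<ordinal> directory. */
    static final class Claim implements AutoCloseable {
        final int ordinal;
        private final Path lockFile;
        private final FileChannel channel;
        private final FileLock lock; // kept referenced for as long as the claim lives

        Claim(int ordinal, Path lockFile, FileChannel channel, FileLock lock) {
            this.ordinal = ordinal;
            this.lockFile = lockFile;
            this.channel = channel;
            this.lock = lock;
        }

        @Override
        public void close() throws IOException {
            try {
                channel.close(); // closing the channel also releases the file lock
            } finally {
                HELD.remove(lockFile);
            }
        }
    }

    /** Tries ids 0..maxLocalStorageNodes-1; returns null if every directory is already locked. */
    static Claim claim(Path dataPath, int maxLocalStorageNodes) throws IOException {
        for (int id = 0; id < maxLocalStorageNodes; id++) {
            Path dir = dataPath.resolve("nodes").resolve(Integer.toString(id));
            Files.createDirectories(dir); // created before the lock attempt, as in the loop above
            Path lockFile = dir.resolve("node.lock").toAbsolutePath().normalize();
            if (!HELD.add(lockFile)) {
                continue; // held by another simulated instance in this JVM, try the next id
            }
            FileChannel channel = null;
            Claim claim = null;
            try {
                channel = FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                FileLock lock = channel.tryLock(); // null when another process holds the lock
                if (lock != null) {
                    claim = new Claim(id, lockFile, channel, lock);
                    return claim;
                }
            } finally {
                if (claim == null) { // lock not obtained: undo the bookkeeping
                    if (channel != null) {
                        channel.close();
                    }
                    HELD.remove(lockFile);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempDirectory("es-data");
        try (Claim first = claim(data, 50); Claim second = claim(data, 50)) {
            System.out.println("first instance id:  " + first.ordinal);
            System.out.println("second instance id: " + second.ordinal);
            Claim third = claim(data, 1); // same data path, but with the limit set to 1
            System.out.println("start with limit 1 succeeded: " + (third != null));
        }
    }
}
```

With a limit of 50, the second start quietly moves on to nodes/1 while the first instance still holds nodes/0, which is the situation seen in the field. With a limit of 1 there is no second directory to fall back to, so the extra start gets no lock at all.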
ES5.6.4
Compared with ES 2.3, this part of the ES 5.6.4 code contains no significant changes other than the default value of MAX_LOCAL_STORAGE_NODES_SETTING being changed to 1.
int maxLocalStorageNodes = MAX_LOCAL_STORAGE_NODES_SETTING.get(settings);
for (int possibleLockId = 0; possibleLockId < maxLocalStorageNodes; possibleLockId++) {
    for (int dirIndex = 0; dirIndex < environment.dataFiles().length; dirIndex++) {
        Path dataDirWithClusterName = environment.dataWithClusterFiles()[dirIndex];
        Path dataDir = environment.dataFiles()[dirIndex];
        // TODO: Remove this in 6.0, we are no longer going to read from the cluster name directory
        if (readFromDataPathWithClusterName(dataDirWithClusterName)) {
            DeprecationLogger deprecationLogger = new DeprecationLogger(startupTraceLogger);
            deprecationLogger.deprecated("ES has detected the [path.data] folder using the cluster name as a folder [{}], " +
                "Elasticsearch 6.0 will not allow the cluster name as a folder within the data path", dataDir);
            dataDir = dataDirWithClusterName;
        }
        Path dir = resolveNodePath(dataDir, possibleLockId);
        Files.createDirectories(dir);
        try (Directory luceneDir = FSDirectory.open(dir, NativeFSLockFactory.INSTANCE)) {
            startupTraceLogger.trace("obtaining node lock on {} ...", dir.toAbsolutePath());
            try {
                locks[dirIndex] = luceneDir.obtainLock(NODE_LOCK_FILENAME);
                nodePaths[dirIndex] = new NodePath(dir);
                nodeLockId = possibleLockId;
            } catch (LockObtainFailedException ex) {
                startupTraceLogger.trace(
                    new ParameterizedMessage("failed to obtain node lock on {}", dir.toAbsolutePath()), ex);
                // release all the ones that were obtained up until now
                releaseAndNullLocks(locks);
                break;
            }
        } catch (IOException e) {
            startupTraceLogger.trace(
                (Supplier<?>) () -> new ParameterizedMessage("failed to obtain node lock on {}", dir.toAbsolutePath()), e);
            lastException = new IOException("failed to obtain lock on " + dir.toAbsolutePath(), e);
            // release all the ones that were obtained up until now
            releaseAndNullLocks(locks);
            break;
        }
    }
    if (locks[0] != null) {
        // we found a lock, break
        break;
    }
}