Beats: Introducing Filestream fingerprint mode

Author: Denis Rechkunov

In Filebeat 8.10.0 and 7.17.12, we introduced a new fingerprint mode that lets users identify files by a hash of their contents rather than by file system metadata. This mode is available in the filestream input.

What is Filestream?

Filestream is an input type in Filebeat that is used to ingest files from a given path.

Filestream architecture

To explain what fingerprint mode is and where exactly it fits into Filestream, we first need to look at the basic architecture of the Filestream input.

Peeling this onion from the top component down:

  • File Scanner collects information about all files that match the input's paths.
  • File Watcher scans the file system every few seconds, as specified by the prospector.scanner.check_interval setting, and compares the file system state between checks. If something has changed, it emits an event describing the change.
  • Prospector decides what to do with these file system events: start or stop harvesting a file, add, update, or remove file state, and so on.
    • To start processing a file and manage its state in the registry, the Prospector needs a unique ID for the file, which it obtains from the file identity provider selected by the file_identity parameter.
    • All file state (such as offsets) is stored in the registry, an in-memory store that is flushed to disk at the registry.flush interval configured in Filebeat. On disk, it is stored as a log of operations.
    • Harvesters do the actual file ingestion and send the lines they read to the event processing pipeline, which performs enrichment, transformation, queuing, and batching, and finally passes the events to the output.
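The division of labor between the File Scanner and the File Watcher can be sketched in a few lines of Python. This is only an illustration of the idea, not Filebeat's implementation: the function names scan and diff, the snapshot layout, and the event names are assumptions made for this example.

```python
import os
import tempfile


def scan(directory):
    """File Scanner stand-in: map each file's ID to its (path, size)."""
    snapshot = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            st = os.stat(entry.path)
            file_id = f"{st.st_ino}-{st.st_dev}"  # metadata-based ID
            snapshot[file_id] = (entry.path, st.st_size)
    return snapshot


def diff(previous, current):
    """File Watcher stand-in: compare two snapshots, return (event, path) pairs."""
    events = []
    for file_id, (path, size) in current.items():
        if file_id not in previous:
            events.append(("create", path))
            continue
        old_path, old_size = previous[file_id]
        if old_path != path:
            events.append(("rename", path))
        if size > old_size:
            events.append(("write", path))
        elif size < old_size:
            events.append(("truncate", path))
    for file_id, (path, _size) in previous.items():
        if file_id not in current:
            events.append(("delete", path))
    return events


with tempfile.TemporaryDirectory() as workdir:
    old_path = os.path.join(workdir, "app.log")
    with open(old_path, "w", encoding="utf-8") as f:
        f.write("first line\n")
    before = scan(workdir)

    # Simulate a rotation followed by one more write to the rotated file.
    new_path = os.path.join(workdir, "app.log.1")
    os.rename(old_path, new_path)
    with open(new_path, "a", encoding="utf-8") as f:
        f.write("second line\n")
    after = scan(workdir)

    print(diff(before, after))
```

Note that the rename is only detected because the file keeps the same ID across both scans; the rest of this article is about what happens when that assumption breaks.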

When default methods are not enough

By default, the File Scanner uses file system metadata to compare files when it looks for renames and moves: the <inode>-<device_id> string on Unix systems (more information about inodes can be found here) and the <idxhi>-<idxlo>-<vol> string on Windows (nFileIndexHigh, nFileIndexLow, and dwVolumeSerialNumber respectively; see Microsoft's official documentation for more information). The same string is returned by the file identity provider as the unique file identifier, and it is used as the registry key under which the current state of each file is stored.
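As a rough illustration of the Unix-style identifier, the following Python sketch builds an <inode>-<device_id> string from os.stat. The helper name native_file_id is hypothetical, and the exact string Filebeat produces may differ in detail.

```python
import os
import tempfile


def native_file_id(path):
    """Build an '<inode>-<device_id>' string from file system metadata."""
    st = os.stat(path)
    return f"{st.st_ino}-{st.st_dev}"


with tempfile.TemporaryDirectory() as workdir:
    log = os.path.join(workdir, "app.log")
    with open(log, "w", encoding="utf-8") as f:
        f.write("first line\n")
    id_before = native_file_id(log)

    # Appending to the file changes its content and size, but not its identity.
    with open(log, "a", encoding="utf-8") as f:
        f.write("second line\n")
    id_after = native_file_id(log)

print(id_before, id_after)
```

The identifier says nothing about the content of the file: it is entirely up to the file system which numbers it hands out and when.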

The whole point of the unique file identifier returned by the file identity provider is that it must be stable, meaning that it does not change while Filestream is ingesting the file. It must be stable because Filestream uses this identifier to keep track of file metadata, including the current offset in the file, so that it knows where to continue ingesting.

What if the identifier is unstable? It can lead to data loss or data duplication.

Data loss example:

  • A file ID now matches a different file (one that has not been ingested before).
  • Filestream does not read this file from offset 0; instead, it applies the stale offset information to it.
  • Filestream continues reading too far into the file and skips log lines. These lines never reach the output.

Data duplication example:

  • The file ID of an existing file changes.
  • The file now looks like a new file to Filestream.
  • Filestream reads (re-ingests) the file starting from offset 0.
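Both failure modes can be reproduced with a toy registry, sketched here in Python. The harvest function and the file IDs are made up for this illustration; the real registry stores more than a single offset per file.

```python
def harvest(registry, file_id, content):
    """Read `content` from the offset stored under `file_id`, then update the offset."""
    offset = registry.get(file_id, 0)
    new_data = content[offset:]
    registry[file_id] = len(content)
    return new_data


registry = {}

# Normal case: the ID is stable, so only the appended line is read.
harvest(registry, "100-1", b"line1\n")
tail = harvest(registry, "100-1", b"line1\nline2\n")

# Data loss: a brand-new file is handed the old ID "100-1".
# The stale offset (12) is applied and the first two lines are skipped.
lost_case = harvest(registry, "100-1", b"boot1\nboot2\nboot3\n")

# Data duplication: the original file reappears under a new ID "200-1".
# No offset is known for that ID, so everything is read again.
duplicated_case = harvest(registry, "200-1", b"line1\nline2\n")

print(tail, lost_case, duplicated_case)
```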

Unfortunately, not all file systems produce stable device_id and inode values.

The file system caches inodes and reuses them

If you run the following script on different file systems, you may see different results:

#!/bin/bash
# Shows whether the file system reuses the inode of a deleted file.

FILENAME=inode-test

touch "$FILENAME"
INODE=$(ls -i "$FILENAME")
echo "$FILENAME created with inode '$INODE'"

COPY_FILENAME="$FILENAME-copy"
cp -a "$FILENAME" "$COPY_FILENAME"
COPY_INODE=$(ls -i "$COPY_FILENAME")
echo "Copied $FILENAME->$COPY_FILENAME, the new inode for the copy '$COPY_INODE'"

rm "$FILENAME"
echo "$FILENAME has been deleted"

# Expected to fail: confirms that the original file is really gone.
ls "$FILENAME"

cp -a "$COPY_FILENAME" "$FILENAME"
NEW_INODE=$(ls -i "$FILENAME")

echo "After copying $COPY_FILENAME back to $FILENAME the inode is '$NEW_INODE'"

# Clean up.
rm "$FILENAME" "$COPY_FILENAME"

For example, on macOS (APFS) you will see:

inode-test created with inode '112076744 inode-test'
Copied inode-test->inode-test-copy, the new inode for the copy '112076745 inode-test-copy'
inode-test has been deleted
After copying inode-test-copy back to inode-test the inode is '112076746 inode-test'

As you can see, on APFS, all three files have different inode values: 112076744, 112076745 and 112076746. So this works as expected.

However, if you run the same script in an Ubuntu Docker container, you will see:

inode-test created with inode '1715023 inode-test'
Copied inode-test->inode-test-copy, the new inode for the copy '1715026 inode-test-copy'
inode-test has been deleted
ls: cannot access 'inode-test': No such file or directory
After copying inode-test-copy back to inode-test the inode is '1715023 inode-test'

You can see that the file system cached the inode value of the first file, which we deleted, and reused it for the second copy, which has the same file name: 1715023, 1715026, and 1715023.

It doesn't even have to be the same filename; different files can reuse the same inode:

# touch x
# ls -i x
1715023 x # <-
# rm x
# touch y
# ls -i y
1715023 y # <-

We mostly observe these issues in container/virtualized environments, but whether inodes are cached and reused depends on the file system implementation. In theory, it can happen anywhere.

Inode values may change on non-Ext file systems

Ext file systems (such as ext4) store the inode number in the i_ino field of struct inode and write it to disk. In this case, as long as the file is the same (not another file with the same name), the inode number is guaranteed to be the same.

If the file system is not Ext, the inode number is generated by the inode operations defined by the file system driver. Since such file systems have no native concept of an inode, they have to mimic all of the inode's internal fields to comply with the VFS, so this number may be different after a reboot or, in theory, even after closing and reopening the file.

Some file processing tools change inode values

  • We have seen customers run into issues with rsync changing inodes.
  • Also, not everyone knows that sed -i creates a temporary file and then moves it to the location of the original file, which changes the inode value (it is essentially a new file). For example, some users run sed -i to mask credentials in their logs.
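The write-a-temporary-file-then-move pattern described for sed -i is easy to reproduce. The Python sketch below imitates that pattern; it is not sed's actual code, and the edit_in_place helper and the sample log line are invented for the example.

```python
import os
import tempfile


def edit_in_place(path, transform):
    """Write the transformed text to a temporary file, then move it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding="utf-8") as src:
        text = src.read()
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(transform(text))
    os.replace(tmp_path, path)  # the original name now points at the new file


with tempfile.TemporaryDirectory() as workdir:
    log = os.path.join(workdir, "app.log")
    with open(log, "w", encoding="utf-8") as f:
        f.write("login user=alice password=hunter2\n")
    inode_before = os.stat(log).st_ino

    edit_in_place(log, lambda text: text.replace("hunter2", "***"))

    inode_after = os.stat(log).st_ino
    with open(log, "r", encoding="utf-8") as f:
        masked = f.read()

print(inode_before, inode_after)
```

Because the temporary file exists alongside the original before the move, it necessarily has a different inode. With a metadata-based identity, Filestream would treat the masked file as new and ingest it again from offset 0.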

Device ID can change

In addition to the inode issues, the device_id may change after a reboot, depending on how the disk drive is mounted. For this particular problem, however, we introduced a solution some time ago: file_identity: inode_marker.

What is fingerprint mode?

A new fingerprint mode has been implemented in the file scanner component to avoid the above issues.

The new fingerprint mode switches the file scanner from its default behavior of using file system metadata to using a SHA-256 hash of a given byte range of the file. By default, the range is the first 1024 bytes (offset 0, length 1024), but it can be changed through the offset and length configuration parameters.
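A minimal Python sketch of such a fingerprint is shown below. The function name fingerprint is hypothetical, and returning None for files shorter than the configured range is an assumption of this sketch (the idea being that such a file cannot be fingerprinted until it grows).

```python
import hashlib
import os
import shutil
import tempfile


def fingerprint(path, offset=0, length=1024):
    """Return the hex SHA-256 of `length` bytes starting at `offset`, or None if too short."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) < length:
        return None  # assumption: not enough data to fingerprint yet
    return hashlib.sha256(data).hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "app.log")
    with open(original, "wb") as f:
        f.write(b"2023-09-01T10:00:00Z service started\n" * 40)  # 1480 bytes

    copy = os.path.join(workdir, "app-copy.log")
    shutil.copyfile(original, copy)

    short = os.path.join(workdir, "short.log")
    with open(short, "wb") as f:
        f.write(b"too short\n")

    original_fp = fingerprint(original)
    same_content = original_fp == fingerprint(copy)
    different_inode = os.stat(original).st_ino != os.stat(copy).st_ino
    short_fp = fingerprint(short)

print(original_fp, same_content, different_inode, short_fp)
```

The copy has a different inode but the same fingerprint, which is exactly the property that makes the fingerprint independent of file system metadata.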

Now that the file scanner has this fingerprint, it is propagated with every file system event, and the hash can be used as a unique file identifier by the file identity provider. That is why there is now also a new file_identity: fingerprint option, which makes the fingerprint value the main file identifier in the registry.
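Put together, an input that uses both the fingerprint scanner mode and the fingerprint file identity might be configured roughly as follows. The input ID and the path are placeholders, and the option spelling follows the Filebeat reference documentation as I understand it, so check it against the reference for your version before relying on it.

```yaml
filebeat.inputs:
  - type: filestream
    id: my-app-logs                # placeholder input ID
    paths:
      - /var/log/my-app/*.log      # placeholder path
    # Compare files by a hash of their content instead of inode and device ID.
    prospector.scanner.fingerprint:
      enabled: true
      offset: 0
      length: 1024
    # Use the same hash as the file's key in the registry.
    file_identity.fingerprint: ~
```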

What should you pay attention to when using fingerprint mode + fingerprint file identity?

The following points must be considered before starting to use this new feature:

  • All log files must be unique within the configured byte range. Because of timestamps and the very nature of logs, this holds for most log files, but you still need to inspect your logs and choose the fingerprint offset and length accordingly.
  • Once you start using file_identity: fingerprint, you can no longer change the fingerprint offset and length without consequences: doing so causes a complete re-ingestion of all files that match the input's paths.
  • There is a performance cost. The performance aspects of this feature deserve their own section in this article.
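One way to check the uniqueness requirement before enabling the feature is to look for files that share a fingerprint. The Python sketch below does this on in-memory contents to stay self-contained; find_collisions is a hypothetical helper, and for real logs the byte strings would be read from disk.

```python
import hashlib
from collections import defaultdict


def find_collisions(files, offset=0, length=1024):
    """Return groups of names whose bytes in [offset, offset + length) hash identically."""
    groups = defaultdict(list)
    for name, content in files.items():
        chunk = content[offset:offset + length]
        if len(chunk) < length:
            continue  # too short to fingerprint with this range
        groups[hashlib.sha256(chunk).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Two logs that start with the same 1024-byte banner and differ only afterwards.
banner = b"#" * 1024
files = {
    "a.log": banner + b"2023-09-01 first request\n",
    "b.log": banner + b"2023-09-02 second request\n",
}

with_defaults = find_collisions(files)
past_banner = find_collisions(files, offset=1024, length=16)

print(with_defaults, past_banner)
```

With the default range both files collide, because the range covers only the shared banner; moving the offset past the banner makes them unique. It also shows why the range must be settled up front: a different offset or length yields a different hash for every file, so all previously stored identities stop matching.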

Performance

From the early stages of developing this feature, there were concerns about the performance impact it would have on the File Scanner. After all, for every file that matches a glob in the paths, we need to open the file, read the number of bytes set by the prospector.scanner.fingerprint.length configuration option, and compute a SHA-256 hash of them.

One thing to note here: in order to implement this new functionality, the File Scanner had to undergo extensive changes. So while I was writing the code, I took the opportunity to refactor some parts of the File Scanner and made some optimizations, mainly reducing the number of syscalls. I also added a lot of tests to verify the expected behavior. One would therefore expect the new File Scanner (with fingerprint mode disabled) to be faster, because it no longer makes as many system calls.

After fingerprint mode was finally merged into main, I ran some benchmarks, and the results were interesting, to say the least.

Several conclusions can be drawn here:

  • The optimizations made to the File Scanner (described above) improved performance by 84%.
  • The new fingerprint mode is 76% slower than the default device_id+inode mode in the new scanner, which still makes it 8% faster than the default mode in the old File Scanner. As a result, our customers will see a faster Filestream even with fingerprint mode enabled.
  • Neither the hashing algorithm nor the fingerprint length has a big impact on overall performance; most of the time is spent opening and closing the files for reading. So the default settings of fingerprint mode appear to be fine.
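To get a feel for where the time goes on your own system, a rough comparison can be sketched in Python. This is not the benchmark used for the numbers above: the helper names, the file count, and the file sizes are arbitrary choices for this example, and the results depend heavily on the file system and caches.

```python
import hashlib
import os
import tempfile
import time


def time_scan(paths, identify, rounds=5):
    """Return the best wall-clock time (seconds) for identifying every path once."""
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        for path in paths:
            identify(path)
        best = min(best, time.perf_counter() - start)
    return best


def by_metadata(path):
    """Identify a file with a single stat call."""
    st = os.stat(path)
    return f"{st.st_ino}-{st.st_dev}"


def by_fingerprint(path, length=1024):
    """Identify a file by opening it, reading `length` bytes, and hashing them."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(length)).hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for i in range(200):
        path = os.path.join(workdir, f"file-{i}.log")
        with open(path, "wb") as f:
            f.write(f"{i:04d} ".encode() * 512)  # 2560 bytes, unique per file
        paths.append(path)

    metadata_seconds = time_scan(paths, by_metadata)
    fingerprint_seconds = time_scan(paths, by_fingerprint)

print(f"metadata: {metadata_seconds:.6f}s  fingerprint: {fingerprint_seconds:.6f}s")
```

Changing the length argument or swapping the hash function in by_fingerprint is a quick way to see how much those two factors matter compared to the open and close calls on a given machine.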

Conclusion

The new fingerprint mode solves many problems caused by unstable file system metadata, and it is even faster than the default mode was in previous versions of Filebeat.

Additionally, the default mode has become faster in the new Filebeat versions, so it clearly pays off to refactor some old code every now and then and to run benchmarks and analysis to see how performance changes.

We will continue to monitor the performance of Filebeat. Stay tuned.

What else is new in Elastic 8.10? Check out the 8.10 announcement post for more information.

Read more: Beats: Read activity log files faster and easier using filestream input in Filebeat

Origin blog.csdn.net/UbuntuTouch/article/details/133353975