In recent years, cloud computing has become mainstream. Whether to protect their own interests, to avoid being locked in by a single cloud provider, to keep business and data redundant, or to optimize costs, enterprises often try to migrate some or all of their business from on-premises data centers to the cloud, or from one cloud platform to another. Service migration involves data migration. JuiceFS already supports a wide range of object storage APIs and also implements data synchronization logic, so let's take a look at the JuiceFS sync command.
What is JuiceFS Sync
JuiceFS's sync subcommand is a full-featured data synchronization utility that can synchronize or migrate data concurrently, using multiple threads, between all object stores supported by JuiceFS. It supports data migration between object storage and JuiceFS, as well as migration across clouds and regions between one object storage and another. Similar to rsync, in addition to object storage, it also supports synchronizing local directories and accessing remote directories through SSH, HDFS, WebDAV, and so on, and it provides advanced features such as full synchronization, incremental synchronization, and conditional pattern matching.
Basic usage
Command format
juicefs sync [command options] SRC DST
This synchronizes SRC to DST. Both directories and files can be synchronized.
Where:
- SRC is the source address and path
- DST is the destination address and path
- [command options] are optional synchronization options; see the command reference for details.
The address format is [NAME://][ACCESS_KEY:SECRET_KEY@]BUCKET[.ENDPOINT][/PREFIX]
Where:
- NAME is the storage type, e.g. s3, oss. See all supported storage services for details
- ACCESS_KEY and SECRET_KEY are the API access keys for the object storage
- BUCKET[.ENDPOINT] is the access address of the object storage
- PREFIX is optional and limits synchronization to directory names with that prefix.
The following is an example address for Amazon S3 object storage:
s3://ABCDEFG:[email protected]
In particular, SRC and DST are treated as directories if they end with /, for example movies/. If they do not end with /, they are treated as prefixes and matched according to prefix-matching rules. For example, if the current directory contains two directories, test and text, the following command synchronizes them to the target path ~/mnt/:
juicefs sync ./te ~/mnt/te
Here, the sync command treats te as a prefix and matches every directory or file in the current path whose name starts with it, i.e. test and text. In the target path ~/mnt/te, te is also a prefix: it replaces the prefix of every synchronized directory and file. In this example te is replaced with te, so the names stay unchanged. If you change the prefix of the target path, for example to ab:
juicefs sync ./te ~/mnt/ab
The test directory becomes abst, and text becomes abxt.
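The prefix replacement described above can be illustrated with a small Python sketch. This is only a model of the naming rule, not JuiceFS code; the helper name map_prefix and the extra movies entry are assumptions made for illustration.

```python
def map_prefix(names, src_prefix, dst_prefix):
    """Return {source name: destination name} for names starting with src_prefix."""
    return {
        name: dst_prefix + name[len(src_prefix):]
        for name in names
        if name.startswith(src_prefix)
    }


# Hypothetical directory listing; "movies" does not start with "te" and is skipped.
entries = ["test", "text", "movies"]

print(map_prefix(entries, "te", "te"))  # {'test': 'test', 'text': 'text'}
print(map_prefix(entries, "te", "ab"))  # {'test': 'abst', 'text': 'abxt'}
```

Only the leading te is swapped; the rest of each name is carried over unchanged.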
List of resources
The examples below assume the following storage resources:
- Object Storage A
  - Bucket name: aaa
  - Endpoint: https://aaa.s3.us-west-1.amazonaws.com
- Object Storage B
  - Bucket name: bbb
  - Endpoint: https://bbb.oss-cn-hangzhou.aliyuncs.com
- JuiceFS file system
  - Metadata storage: redis://10.10.0.8:6379/1
  - Object storage: https://ccc-125000.cos.ap-beijing.myqcloud.com
The access keys for all of these storage services are:
- ACCESS_KEY: ABCDEFG
- SECRET_KEY: HIJKLMN
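As a rough illustration of how these resources fit the address format [NAME://][ACCESS_KEY:SECRET_KEY@]BUCKET[.ENDPOINT][/PREFIX], here is a minimal Python sketch that assembles an address from its parts. The helper build_address is hypothetical and is not part of JuiceFS; it assumes the endpoint is passed without the bucket name and without the https:// scheme.

```python
def build_address(name, bucket, endpoint="", access_key="", secret_key="", prefix=""):
    """Assemble [NAME://][ACCESS_KEY:SECRET_KEY@]BUCKET[.ENDPOINT][/PREFIX]."""
    address = f"{name}://" if name else ""
    if access_key and secret_key:
        address += f"{access_key}:{secret_key}@"
    address += bucket
    if endpoint:
        address += f".{endpoint}"
    if prefix:
        address += f"/{prefix}"
    return address


# Object Storage A, limited to the movies/ directory
print(build_address("s3", "aaa", "s3.us-west-1.amazonaws.com",
                    "ABCDEFG", "HIJKLMN", "movies/"))
# Object Storage B, whole bucket
print(build_address("oss", "bbb", "oss-cn-hangzhou.aliyuncs.com",
                    "ABCDEFG", "HIJKLMN"))
```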
Sync between Object Storage and JuiceFS
Sync the movies directory of Object Storage A to the JuiceFS file system:
# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Run the sync
juicefs sync s3://ABCDEFG:[email protected]/movies/ /mnt/jfs/movies/
Sync the images directory of the JuiceFS file system to Object Storage A:
# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Run the sync
juicefs sync /mnt/jfs/images/ s3://ABCDEFG:[email protected]/images/
Sync between object storage and object storage
Synchronize all data of object store A to object store B:
juicefs sync s3://ABCDEFG:[email protected] oss://ABCDEFG:[email protected]
Advanced usage
Incremental synchronization and full synchronization
The sync command works incrementally by default: it first compares the source and target paths, then synchronizes only the differences. Use the --update or -u option to update the mtime of files.
For a full sync, i.e. to resync regardless of whether the same file already exists on the target path, use --force-update or -f. For example, to fully synchronize the movies directory of Object Storage A to the JuiceFS file system:
# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Run a full sync
juicefs sync --force-update s3://ABCDEFG:[email protected]/movies/ /mnt/jfs/movies/
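The difference between the two modes can be sketched as a per-object decision. This is a simplified model, not the actual JuiceFS implementation: it assumes that "different" means the object is missing at the destination or its size differs, and the function name needs_copy is hypothetical.

```python
def needs_copy(src_size, dst_size, force_update=False):
    """Decide whether one object should be copied.

    dst_size is None when the object does not exist at the destination.
    Assumption: in incremental mode, objects are compared by size only.
    """
    if force_update:
        return True  # full sync: always copy again
    if dst_size is None:
        return True  # missing at the destination
    return src_size != dst_size  # incremental: copy only when different


print(needs_copy(100, 100))                     # False
print(needs_copy(100, None))                    # True
print(needs_copy(100, 100, force_update=True))  # True
```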
Pattern matching
The pattern matching of the sync command is similar to that of rsync. Rules can exclude or include certain kinds of files, and combining multiple rules lets you synchronize any set of files. The rules are as follows:
- A pattern ending in / matches only directories; otherwise it matches files, links, or devices;
- When a pattern contains *, ?, or [ characters, it is matched as a wildcard pattern; otherwise it is matched as a plain string;
- * matches any non-empty path component, stopping at /;
- ? matches any character except /;
- [ matches a set of characters, such as [a-z] or [[:alpha:]];
- In a wildcard pattern, a backslash can be used to escape a wildcard, but in a pattern without wildcards it is matched literally;
- Patterns are always matched recursively as a prefix.
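To make the wildcard rules more concrete, here is a minimal Python sketch that translates a wildcard pattern into a regular expression. It is a simplified model written for this article, not the matcher used by JuiceFS or rsync: it handles only *, ?, simple sets such as [a-z], and backslash escapes, it does not handle POSIX classes such as [[:alpha:]], trailing / rules, or recursive prefix matching, and the names wildcard_to_regex and matches are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regex (simplified model)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped character, taken literally
            i += 2
        elif ch == "*":
            parts.append("[^/]+")  # non-empty run that stops at "/"
            i += 1
        elif ch == "?":
            parts.append("[^/]")  # exactly one character other than "/"
            i += 1
        elif ch == "[" and "]" in pattern[i + 1:]:
            end = pattern.index("]", i + 1)
            parts.append(pattern[i:end + 1])  # simple set such as [a-z]
            i = end + 1
        else:
            parts.append(re.escape(ch))  # ordinary character
            i += 1
    return re.compile("".join(parts) + r"\Z")


def matches(pattern, name):
    """Return True when the whole name matches the wildcard pattern."""
    return wildcard_to_regex(pattern).match(name) is not None


print(matches("*.png", "4.png"))      # True
print(matches("*.png", "pic/4.png"))  # False: * stops at /
print(matches(".*", ".git"))          # True
```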
Exclude files/directories
Use the --exclude option to specify directories or files to exclude. For example, to fully sync the JuiceFS file system to Object Storage A without hidden files and directories, that is, all names starting with a dot (.):
# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Full sync, excluding hidden files and directories
juicefs sync --exclude '.*' /mnt/jfs/ s3://ABCDEFG:[email protected]/
This option can be repeated to match more rules. For example, to exclude all hidden files, the pic/ directory, and the 4.png file:
juicefs sync --exclude '.*' --exclude 'pic/' --exclude '4.png' /mnt/jfs/ s3://ABCDEFG:[email protected]
Include files/directories
Use the --include option to specify directories or files to include (that is, not exclude). For example, to sync only the pic/ directory and the 4.png file and exclude everything else:
juicefs sync --include 'pic/' --include '4.png' --exclude '*' /mnt/jfs/ s3://ABCDEFG:[email protected]
When using include/exclude rules, options that appear earlier take precedence, so --include should come first. If --exclude '*' comes first and excludes all files, the subsequent --include 'pic/' --include '4.png' rules will not take effect.
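The effect of rule order can be sketched as a "first matching rule wins" loop. This is a simplified model rather than JuiceFS code: it uses Python's fnmatch on plain names as a stand-in for the real matcher, writes directory names with a trailing / by convention, and assumes that a name matched by no rule is synchronized. The function name is_synced is hypothetical.

```python
from fnmatch import fnmatchcase


def is_synced(name, rules):
    """rules is a list of ("include" | "exclude", pattern) in command-line order."""
    for action, pattern in rules:
        if fnmatchcase(name, pattern):
            return action == "include"  # the first matching rule decides
    return True  # assumption: unmatched names are synchronized


names = ["pic/", "4.png", "5.png"]

good = [("include", "pic/"), ("include", "4.png"), ("exclude", "*")]
bad = [("exclude", "*"), ("include", "pic/"), ("include", "4.png")]

print([n for n in names if is_synced(n, good)])  # ['pic/', '4.png']
print([n for n in names if is_synced(n, bad)])   # []
```

With the exclude rule first, every name matches '*' before any include rule is consulted, so nothing is synchronized.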
Multithreading and bandwidth throttling
juicefs sync uses 10 threads by default to perform synchronization tasks; you can change this with the --threads option.
In addition, to limit the bandwidth used by a synchronization task, set the --bwlimit option. The unit is Mbps, and the default value 0 means no limit.
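As a rough aid for choosing a --bwlimit value, the following Python sketch estimates the minimum transfer time for a given data size. It assumes that 1 Mbps means 1,000,000 bits per second and ignores protocol overhead and request latency; the function name estimate_seconds is hypothetical.

```python
def estimate_seconds(size_bytes, bwlimit_mbps):
    """Return the minimum transfer time in seconds, or None when unlimited (0)."""
    if bwlimit_mbps == 0:
        return None  # 0 means no bandwidth limit
    bits = size_bytes * 8
    return bits / (bwlimit_mbps * 1_000_000)


# 10 GB (10 * 10^9 bytes) limited to 100 Mbps
print(estimate_seconds(10 * 10**9, 100))  # 800.0
```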
Directory structure and file permissions
By default, the sync command only synchronizes file objects and directories containing file objects. Empty directories are not synchronized. To sync empty directories, use the --dirs option.
Also, to preserve file permissions when synchronizing between file systems such as local, SFTP, and HDFS, use the --perms option.
Copy symbolic links
When synchronizing between local directories, the sync command supports the --links option, which copies a symbolic link itself rather than the object it points to. The path stored in the synchronized symbolic link is the original path stored in the source link; it is not converted, regardless of whether that path is reachable before or after synchronization.
A few other details to note:
- The mtime of the symbolic link itself is not copied;
- The --check-new and --perms options are ignored when a symbolic link is encountered.
Multi-machine concurrent synchronization
In essence, synchronizing data between two object stores is to pull data from one end and push it to the other end. As shown in the figure below, the efficiency of synchronization depends on the bandwidth between the client and the cloud.
When synchronizing a large amount of data, the bandwidth of a single machine is often saturated and becomes a bottleneck. To address this, JuiceFS Sync supports multi-machine concurrent synchronization, as shown in the following figure.
The Manager executes the sync command, and the --worker option defines multiple Worker hosts. JuiceFS dynamically splits the synchronization workload according to the total number of Workers and distributes it to each host for simultaneous execution. In other words, the synchronization work originally handled by one host is divided into multiple parts and processed by multiple hosts at the same time, so more data is processed per unit of time and the total available bandwidth is multiplied.
When configuring multi-machine concurrent synchronization, you need to set up passwordless SSH login from the Manager host to each Worker host in advance, so that the client and the tasks can be distributed to the Workers.
The Manager distributes the JuiceFS client program to the Worker hosts. To avoid client compatibility issues, make sure that the Manager and the Workers use the same operating system type and architecture.
For example, to synchronize Object Storage A to Object Storage B using multi-machine concurrent synchronization:
juicefs sync --worker [email protected],[email protected] s3://ABCDEFG:[email protected] oss://ABCDEFG:[email protected]
The current host and the two Worker hosts [email protected] and [email protected] will share the data synchronization task between the two object stores.
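The idea of sharing one task across several hosts can be illustrated with a small Python sketch. Note that this uses a static round-robin split purely for illustration, whereas JuiceFS splits the workload dynamically; the host names and object keys below are hypothetical.

```python
def split_round_robin(keys, hosts):
    """Assign each key to a host in turn; returns {host: [keys]}."""
    plan = {host: [] for host in hosts}
    for index, key in enumerate(keys):
        plan[hosts[index % len(hosts)]].append(key)
    return plan


hosts = ["manager", "worker-1", "worker-2"]      # hypothetical host names
keys = [f"movies/{n}.mp4" for n in range(1, 8)]  # hypothetical object keys

for host, assigned in split_round_robin(keys, hosts).items():
    print(host, assigned)
```

Each host handles roughly a third of the objects, which is why the aggregate throughput can grow with the number of hosts as long as the object stores themselves are not the bottleneck.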
If the SSH service on a Worker host does not use the default port 22, set the SSH port for that host in the .ssh/config configuration file.
Application scenarios
Off-site disaster recovery backup
Off-site disaster recovery backup targets the files themselves, so the files stored in JuiceFS should be synchronized to another object storage. For example, to synchronize the files in the JuiceFS file system to Object Storage A:
# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Run the sync
sudo juicefs sync /mnt/jfs/ s3://ABCDEFG:[email protected]/
After synchronization, all files can be seen directly in Object Storage A.
Create a JuiceFS data copy
Unlike disaster recovery backup of the files themselves, the purpose of creating a JuiceFS data copy is to build a mirror of the JuiceFS data storage with the same content and structure. When the object storage in use fails, you can switch to the data copy by modifying the configuration and continue working. Note that only the data of the JuiceFS file system is copied here, not the metadata, so the metadata engine still needs its own backup.
This requires operating directly on the underlying object storage of JuiceFS and synchronizing it with the target object storage. For example, to use Object Storage B as a data copy for the JuiceFS file system:
juicefs sync cos://ABCDEFG:[email protected] oss://ABCDEFG:[email protected]
After synchronization, what you see in Object Store B is exactly the same content and structure as the Object Store used by JuiceFS.
If it is helpful, please follow our project Juicedata/JuiceFS ! (0ᴗ0✿)