Data: Copy or sync?

I have often heard this question from students and colleagues: should I back up my files to an external disk using copy (e.g. from the graphical interface), or are there better ways (e.g. from the terminal)?

Why copying files is NOT the way to go

When you copy a file, you clone an existing file into a new file, i.e. the content of the new file is identical to the old one, but its metadata is different. A simple example: the creation date. In many cases (e.g. when backing up a large number of files), you would like to preserve the original creation/modification dates of your files, because they can help you find specific files. A typical example with academics and students is “which one was the latest version of my manuscript/essay?” (psst! There’s version control for that, you know?). Being able to check the original date of a file might help you find the right one.

Furthermore, when you copy a file (from the terminal with the command ‘cp’ or from the graphical interface with copy and paste), the operating system does not check whether the file has been copied correctly. Of course, in most cases (e.g. no space left on device) the system will report an error if things go wrong, but things get tricky when there are tens of thousands of files: 1) you don’t want to check manually that each file went fine, and 2) if the copy dies midway, you don’t want to work out by hand which of those thousands of files were copied successfully and which were not.

Do not copy, sync!

Nobody loves sync more than I do. I even did a PhD about sync! Well, in this case we are not talking about synchrony of minds, but more simply about synchrony between your source data and your copied data. If you sync a file instead of copying it, you make sure that 1) metadata are preserved (modification time, permissions, etc.), 2) identity is preserved, i.e. the copied file is bit-by-bit identical to the original, and 3) if a sync process dies in the middle, you can easily resume from where it stopped (very useful when syncing data over the internet).

How to do that? There are graphical programs for Mac, Windows and Linux.

However, here I just show the non-graphical solution with a one-liner from the terminal (works on Linux and Mac).

For local files:

rsync -avztcp /path/to/source/ /path/to/destination/
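
To see the difference from a plain copy, here is a minimal sketch on a throwaway folder. The folder and file names are made up for illustration, and the demo only uses -a, the option that preserves metadata:

```shell
# Throwaway demo folders (hypothetical names); nothing outside them is touched.
demo=$(mktemp -d)
mkdir "$demo/source" "$demo/copied" "$demo/synced"

# Create a file and backdate its modification time to 1 January 2020.
echo "draft" > "$demo/source/essay.txt"
touch -t 202001010000 "$demo/source/essay.txt"

# Plain copy: the new file gets the current time as its modification time.
cp "$demo/source/essay.txt" "$demo/copied/essay.txt"

# Sync: -a (archive) keeps the modification time of the original.
rsync -a "$demo/source/" "$demo/synced/"

# Compare the timestamps of the three files.
ls -l "$demo"/*/essay.txt
```

If you are unsure what a command will do, adding -n (--dry-run) makes rsync list what it would transfer without copying anything.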

Syncing over the internet from remote to local:

rsync -avztcp -e "ssh" /local_destination/

or from local to remote:

rsync -avztcp -e "ssh"  /local_source/
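
When syncing over ssh, the remote folder is written as user@host:/path. A minimal sketch of both directions, assuming a hypothetical account alice on a hypothetical server server.example.org (replace them, and the folder names, with your own):

```shell
# Hypothetical placeholders: replace with your own account, host and folders.
REMOTE="alice@server.example.org"
REMOTE_DIR="/remote_folder/"
LOCAL_DIR="/local_folder/"

# Remote to local: the remote folder is the source.
pull_from_remote() {
    rsync -avztcp -e "ssh" "${REMOTE}:${REMOTE_DIR}" "${LOCAL_DIR}"
}

# Local to remote: the remote folder is the destination.
push_to_remote() {
    rsync -avztcp -e "ssh" "${LOCAL_DIR}" "${REMOTE}:${REMOTE_DIR}"
}
```

Wrapping the two directions in small functions is only a convenience; the rsync line inside each function can be typed directly in the terminal.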

Disentangling the options of rsync is beyond the scope of this simple explanation, but basically the options I wrote here make sure that everything in folder 1 (source) ends up, identical, in folder 2 (destination). Furthermore, the sync is incremental: a file that is deleted in the source folder will NOT be deleted in the destination folder (however, a file that is modified in the source folder will be modified in the destination folder). Check the rsync manual pages for more explanations. Please remember the trailing slash in your folder paths: with a trailing slash on the source, rsync copies the contents of the folder; without it, rsync copies the folder itself into the destination.
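
The trailing slash and the "nothing gets deleted" behaviour are easy to try out on a throwaway folder. A minimal sketch, with made-up folder and file names:

```shell
demo=$(mktemp -d)
mkdir "$demo/source" "$demo/with_slash" "$demo/without_slash"
echo "a" > "$demo/source/a.txt"
echo "b" > "$demo/source/b.txt"

# Trailing slash on the source: the CONTENTS of source/ land in the destination,
# i.e. with_slash/a.txt and with_slash/b.txt.
rsync -a "$demo/source/" "$demo/with_slash/"

# No trailing slash: the folder itself is created inside the destination,
# i.e. without_slash/source/a.txt and without_slash/source/b.txt.
rsync -a "$demo/source" "$demo/without_slash/"

# Delete a file in the source and sync again: the destination keeps its copy.
rm "$demo/source/b.txt"
rsync -a "$demo/source/" "$demo/with_slash/"
ls "$demo/with_slash"

# Only with --delete does rsync remove files that no longer exist in the source.
# -n (--dry-run) just reports what would happen, without deleting anything.
rsync -a -n -v --delete "$demo/source/" "$demo/with_slash/"
```

Running the last command without -n would make the destination an exact mirror of the source, so it is worth keeping the dry run until you are sure that is what you want.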

Fingerprint of a file

A side note: how does rsync manage to do that? It is simple: before and after copying the files, rsync computes a “checksum” for each file, i.e. a fingerprint of the file. It is useful to learn how to compute the fingerprint of a file, for example to check that two files you downloaded are identical, or to make sure the content you have downloaded is identical to the content stored on the remote server.

To do that from the terminal just type:

md5sum  filename

and that will generate an output similar to:

c5dd35983645a36e18d619d865d65846  filename

The long string is the MD5 fingerprint of the file. There are other types of fingerprints, e.g. SHA-1 (sha1sum); read more about checksums on the internetz. When downloading data from open repositories, data files are usually accompanied by MD5 checksums so that you can actually check that the download was successful. For example, if you have downloaded the Human Connectome Project data, you should have a folder with many zip files, and for each zip file there is a corresponding .md5 file containing its fingerprint. The following for loop just checks that the fingerprint of each downloaded file is identical to its original fingerprint.

for f in *.md5; do md5sum -c "$f"; done

If you keep a log of the output you can then search for those files that are not ok:

## loop through all md5 checksum files and store the output in a log file
for f in *.md5; do md5sum -c "$f"; done 2>&1 | tee md5checks.log

## show which lines of the log file are not OK
grep -v 'OK$' md5checks.log
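
If you want to see what such a check looks like when it fails, here is a minimal sketch that creates its own checksum file in a throwaway folder. The file name data.zip is just a stand-in for a real download:

```shell
demo=$(mktemp -d)
cd "$demo"

# A small stand-in for a downloaded file (hypothetical name).
echo "some data" > data.zip

# Store the fingerprint next to the file, in the "checksum  filename" format.
md5sum data.zip > data.zip.md5

# Verification succeeds while the file is untouched.
md5sum -c data.zip.md5          # reports: data.zip: OK

# Simulate a corrupted download by appending a line to the file.
echo "x" >> data.zip

# Now the check fails and md5sum exits with a non-zero status.
md5sum -c data.zip.md5 || echo "data.zip does not match its checksum"
```

The same two steps (md5sum to create the fingerprint, md5sum -c to verify it) work for your own backups, not only for downloads.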

Do I really need to stop using copy???

Well, let’s be clear, you don’t HAVE to always use rsync. If you are just storing a copy of a single file on your USB stick, it is easier to just copy and paste and (maybe) check manually that the file was copied correctly. But you will agree with me that when there are more than a dozen files, rsync is the way to go.

Advanced options

There are plenty of websites with advanced options for rsync; here are a couple that I have found convenient:

  1. Sync all files with extension “.m” and include all subfolders. Do not include the “Downloads” subfolder (its exclude rule must come first, because rsync applies the first rule that matches)
    rsync -avztcp --exclude 'Downloads/' --include '*/' --include '*.m' --exclude '*' /sourcefolder/ /destinationfolder/
  2. Sync files that are smaller than 100MB.
    rsync -avztcp --max-size=100m /sourcefolder/ /destinationfolder/
  3. When you need to set specific group permissions on the destination files (--chown needs rsync 3.1.0 or newer)
    rsync -avztcp -e "ssh" --chmod=g+s,g+rw --chown=:braindata /sourcefolder/ /destinationfolder/
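
The order of the filter rules in example 1 matters, because rsync applies the first rule that matches each file or folder. A minimal sketch on a throwaway folder (all file and folder names are made up for illustration):

```shell
demo=$(mktemp -d)
mkdir -p "$demo/source/analysis" "$demo/source/Downloads" "$demo/dest"
touch "$demo/source/analysis/model.m" "$demo/source/analysis/notes.txt"
touch "$demo/source/Downloads/toolbox.m"

# rsync applies the FIRST filter rule that matches, so the Downloads exclusion
# has to come before the rule that includes every folder ('*/').
rsync -a \
    --exclude 'Downloads/' \
    --include '*/' --include '*.m' --exclude '*' \
    "$demo/source/" "$demo/dest/"

# List what arrived in the destination.
find "$demo/dest" -type f
```

Only analysis/model.m should be listed: notes.txt is caught by the final exclude, and the Downloads folder is skipped before rsync ever looks inside it.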
