Data: Copy or sync?

I have often heard this question from students and colleagues: should I back up my files to an external disk using copy (e.g. from the graphical interface), or are there better ways (e.g. from the terminal)?

Why copying files is NOT the way to go

When you copy a file, you clone an existing file into a new file, i.e. the content of the new file is identical to the old file, but the metadata of the new file is different. Simple example: the creation date. In many cases (e.g. when backing up a large number of files), you would like to preserve the original creation/modification date of your files, because it can help in finding specific files. A typical example with academics and students is “which one was the latest version of my manuscript/essay?” (psst! There’s version control for that, you know?). Being able to check the original date of a file might help you find the right one.
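Here is a minimal sketch of the metadata problem, assuming GNU coreutils on Linux (touch -d, stat -c and date -d are GNU-specific). The file names are made up for the demo, and I look at the modification time because many Linux file systems do not expose the creation time in an easy way.

```shell
# Sketch: a plain copy gets a fresh modification time (GNU coreutils assumed).
workdir=$(mktemp -d)                                   # throwaway folder for the demo
echo "my essay draft" > "$workdir/essay.txt"           # hypothetical file name
touch -d '2020-01-01 12:00:00' "$workdir/essay.txt"    # pretend the file is old

cp "$workdir/essay.txt" "$workdir/essay_copy.txt"      # plain copy, no options

orig_mtime=$(stat -c %Y "$workdir/essay.txt")          # seconds since the epoch
copy_mtime=$(stat -c %Y "$workdir/essay_copy.txt")
echo "original modified: $(date -d "@$orig_mtime" +%F)"
echo "copy modified:     $(date -d "@$copy_mtime" +%F)"
```

The content of the two files is the same, but the copy carries today's date, so sorting by date no longer tells you which file is the old one.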

Furthermore, when you copy a file (from the terminal with the command ‘cp’ or from the graphical interface with copy and paste), the operating system does not verify that the file has been copied correctly. Of course, in most cases (e.g. no space left on device) the system will report an error to the user if things go wrong, but things can get tricky when there are tens of thousands of files: 1) you don’t want to manually check that each file went fine; 2) if the copy dies midway, you don’t want to manually check which files were successfully copied and which were not, out of thousands of files.
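To give an idea of what “checking manually” means, here is a small sketch that copies a folder with cp and then compares the two trees with diff. The folder and file names are invented for the example; with thousands of real files you would have to remember to run such a check yourself after every copy.

```shell
# Sketch: verify a plain copy by hand with a recursive diff.
src=$(mktemp -d); dst=$(mktemp -d)        # throwaway source and destination
mkdir "$src/data"
for i in 1 2 3; do echo "sample $i" > "$src/data/file$i.txt"; done

cp -r "$src/data" "$dst/"                 # the copy itself does no verification

# diff -r walks both trees; -q only reports which files differ.
if diff -rq "$src/data" "$dst/data"; then
  copy_status="identical"
else
  copy_status="different"
fi
echo "copy is $copy_status"
```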

Do not copy, sync!

Nobody loves sync more than I do. I even did a PhD about sync! Well, in this case we are not talking about synchrony of minds, but more simply about synchrony between your source data and your copied data. If you sync a file instead of copying it, you make sure that 1) metadata are preserved (modification time, permissions, etc.; creation time only where the file system and your version of rsync support it); 2) identity is preserved, i.e. the copied file is actually identical bit-by-bit to the original one; 3) if a sync process dies in the middle, you can easily resume from where it stopped (very useful when syncing data over the internet).

How to do that? There are graphical programs for Mac http://www.chriswrites.com/how-to-sync-files-and-folders-on-the-mac/, Windows http://www.online-tech-tips.com/free-software-downloads/sync-two-folders-windows/ and Linux http://www.opbyte.it/grsync/.

However, here I just show the non-graphical solution with a one-liner from the terminal (works on Linux and Mac).

For local files:

rsync -avztcp /path/to/source/ /path/to/destination/

Syncing over the internet from remote to local:

rsync -avztcp -e "ssh" user@server.com:/remote_source/ /local_destination/

or from local to remote:

rsync -avztcp -e "ssh" /local_source/ user@server.com:/remote_destination/

Disentangling all the options of rsync is beyond the scope of this simple explanation, but basically the options I wrote here make sure that every file in folder 1 (source) ends up in folder 2 (destination) with identical content and metadata. Furthermore, the sync is incremental and non-destructive: a file that is deleted in the source folder will NOT be deleted in the destination folder, while a file that is modified in the source folder will also be modified in the destination folder. (If you do want deletions to propagate, rsync has a --delete option; use it with care.) Note that -a already implies -t and -p, so spelling them out is harmless but redundant. Check the rsync manual pages for more explanations. Please remember the ending slash in your folder paths (see here for the differences: http://qdosmsq.dunbar-it.co.uk/blog/2013/02/rsync-to-slash-or-not-to-slash/).
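The incremental behaviour can be tried safely in two throwaway folders. This is just a sketch with invented file names; it assumes rsync is installed and skips the demo otherwise.

```shell
# Sketch: deletions in source are not propagated, modifications are.
src=$(mktemp -d); dst=$(mktemp -d)      # throwaway source and destination
echo "v1" > "$src/keep.txt"             # hypothetical file names
echo "temporary" > "$src/old.txt"

if command -v rsync >/dev/null 2>&1; then
  rsync -avztcp "$src/" "$dst/"         # first sync: both files arrive in destination
  rm "$src/old.txt"                     # delete one file in source...
  echo "v2" > "$src/keep.txt"           # ...and modify the other
  rsync -avztcp "$src/" "$dst/"         # second sync, same options
  ls "$dst"                             # old.txt is still there
  cat "$dst/keep.txt"                   # keep.txt has the new content
else
  echo "rsync is not installed, skipping the demo"
fi
```

Note that the modified file has the same size as before and may have been changed within the same second, so a comparison based only on size and time could miss it; the -c option makes rsync compare checksums instead.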

Fingerprint of a file

A side note: how does rsync manage to do that? It is simple: with the -c option, rsync computes a “checksum”, i.e. a fingerprint, for each file on both sides to decide what needs to be transferred, and it also verifies each transferred file against its checksum. It is useful to learn how to compute the fingerprint of a file, for example to check that two files you downloaded are identical or to make sure the content you have downloaded is identical to the content stored on the remote server.

To do that from the terminal just type:

md5sum  filename

and that will generate an output similar to:

c5dd35983645a36e18d619d865d65846  filename

The long string is the MD5 fingerprint of the file. There are other types of fingerprints, e.g. SHA-1 (command sha1sum); read more about checksums on the internetz. When downloading data from open repositories, data files are usually accompanied by MD5 checksums so that you can actually check that the download was successful. For example, if you have downloaded the Human Connectome Project data, you should have a folder with many zip files, and for each file.zip there is a corresponding file.zip.md5 containing the fingerprint of the zip file. The following for loop just checks that the fingerprint of each downloaded file is identical to the original fingerprint of the file.

for f in *.md5; do md5sum -c "$f"; done

If you keep a log of the output you can then search for those files that are not ok:

## loop through all md5 checksum files and store the output in a log file
for f in *.md5; do md5sum -c "$f"; done 2>&1 | tee md5checks.log

## show which lines of the log file are not OK
grep -v 'OK$' md5checks.log
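If you want to see what a failed check looks like before trusting the loop on real data, here is a small self-contained sketch. The two “zip” files are just text files with invented names standing in for real downloads, and one of them is deliberately damaged after its fingerprint has been stored.

```shell
# Sketch: create two fake downloads with .md5 files, damage one, run the check.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "good data" > good.zip          # stand-ins for downloaded archives
echo "more data" > bad.zip
md5sum good.zip > good.zip.md5       # store the "original" fingerprints
md5sum bad.zip > bad.zip.md5
echo "corrupted" >> bad.zip          # simulate a damaged download

# same loop as above: check every file and keep a log
for f in *.md5; do md5sum -c "$f"; done 2>&1 | tee md5checks.log

# everything that is not OK, including md5sum's own warning line
failed=$(grep -v 'OK$' md5checks.log)
echo "$failed"
```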

Do I really need to stop using copy???

Well, let’s be clear, you don’t HAVE to always use rsync. If you are just storing a copy of a single file on your USB stick, it is easier to just copy and paste and (maybe) check manually that the file was copied correctly. But I hope you agree with me that when there are more than a dozen files, rsync is the way to go.

Advanced options

There are plenty of websites with advanced options for rsync; here are a couple that I have found convenient:

  1. Sync all files with extension “.m” and include all subfolders. Do not include the “Downloads” subfolder
    rsync -avztcp --exclude 'Downloads/' --include '*/' --include '*.m' --exclude '*' /sourcefolder/ /destinationfolder/
  2. Sync files that are smaller than 100MB.
    rsync -avztcp --max-size=100m /sourcefolder/ /destinationfolder/
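  3. When you need to set specific group permissions on the destination files (here the group is “braindata”; --chown needs rsync 3.1.0 or newer, and you need to be a member of that group)
    rsync -avztcp -e "ssh" --chmod=g+s,g+rw --chown=:braindata /sourcefolder/ /destinationfolder/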
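A word on the order of the include/exclude options: rsync applies the filter rules in the order they are written and the first matching rule wins, so in this sketch the exclusion of “Downloads” comes before the rule that includes all subfolders. The folder and file names are invented for the demo, and the demo is skipped if rsync is not installed.

```shell
# Sketch: sync only *.m files, keep subfolders, leave out the Downloads folder.
src=$(mktemp -d); dst=$(mktemp -d)               # throwaway source and destination
mkdir -p "$src/analysis" "$src/Downloads"
echo "% my script" > "$src/analysis/model.m"     # should be synced
echo "notes" > "$src/analysis/notes.txt"         # wrong extension, should be skipped
echo "% downloaded" > "$src/Downloads/other.m"   # excluded folder, should be skipped

if command -v rsync >/dev/null 2>&1; then
  # rules are checked left to right, first match wins
  rsync -avztcp --exclude 'Downloads/' --include '*/' --include '*.m' --exclude '*' "$src/" "$dst/"
  find "$dst" -type f                            # list what actually arrived
else
  echo "rsync is not installed, skipping the demo"
fi
```

Adding -n (dry run) to the rsync line is a cheap way to test a new set of filters before anything is actually transferred.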

Linux for (data) scientists – my collection of Linux FUC = frequently used commands

This is just a dump of Linux commands that I often use. They should be useful to the average (data) scientist, and a source of inspiration for those who are learning Linux. Do not fear the terminal, it’s the most powerful computer tool for a scientist! Sometimes I don’t remember the correct options, so this page is also useful for me rather than always googling for ad-hoc solutions. This page will be updated frequently and the list will grow.

P.S.: A useful resource is also http://www.bashoneliners.com/.

 

  1. List contents of zip file without unzipping
    unzip -l filename.zip
  2. List contents of tar file without untarring
    tar -tf filename.tar.gz
  3. Print the content of a single file inside a zip, without extracting the archive to disk
    unzip -c filename.zip file1.txt| less
  4. Save a remote file locally (-O), resume download if connectivity breaks (-C -), limit the download speed (--limit-rate). This is useful for downloading huge datasets in the background.
    curl -L -O -C - --limit-rate 200k http://remotefile
  5. Replace new lines with a symbol (e.g. a semicolon)
    cat file.txt|tr '\n' ';'
  6. Run a command and store the output to a log file (including standard errors)
    ./command.sh 2>&1|tee logfile.log
  7. Get a list of the users who have processes on a machine and how many processes each user is running (root is usually the user with the most processes… but not always :)). Note the command sed: it replaces any number of consecutive spaces with a semicolon, i.e. it creates a csv out of the output of the previous command. Then cut extracts column number 1, sort + uniq -c counts multiple occurrences, and a final sort orders the output by count.
    ps -fA|sed 's/ \+/;/g'|cut -d\; -f1|sort|uniq -c|sort -nr
  8. This one is to test the md5 checksum for downloaded data (this works for the Human Connectome Project). Imagine you have downloaded file1.zip and file1.zip.md5. The second file contains the so-called md5 checksum, a unique fingerprint of your file. You want to test that the file you downloaded gives the same fingerprint that is written in the md5 file. To do this for all the hundreds of files of the HCP, you can just run:
    for n in *.md5; do f="${n%.md5}"; echo -n "$f "; nn=$(md5sum "$f"); nnf=$(cat "$n"); if [ "$nn" = "$nnf" ]; then echo "OK!"; else echo "$nn $nnf"; fi; done 2>&1 | tee hcp_checks.log
  9. Commands to make sure that shared folders/files are accessible by the group. 1st one is for folders and the second is for files.
    find . -type d -exec chmod g+rwxs {} \;
    
    find . -type f -exec chmod g+rw {} \;
  10. Get Nth line from a txt file. NUM is a variable with the number.
    sed "${NUM}q;d" file
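  11. Get the number of each column of a csv file, based on the header (the header line contains the string that is grepped). Here the columns are separated by '|'; change the first argument of tr if your file uses commas.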
    cat bigfile.csv |grep header_of_a_column|tr '|' '\n'|sed '/./='|sed '/./N; s/\n/ /'
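To make a few of the text-processing one-liners above more concrete (items 5, 10 and 11), here is a small sketch that runs them on tiny sample files. The file contents and column names are invented for the demo.

```shell
# Sketch: try items 5, 10 and 11 on small sample files.
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/file.txt"

# Item 5: replace new lines with a semicolon
joined=$(tr '\n' ';' < "$workdir/file.txt")
echo "$joined"

# Item 10: get the Nth line (here the 2nd)
NUM=2
second=$(sed "${NUM}q;d" "$workdir/file.txt")
echo "$second"

# Item 11: number the columns of a '|'-separated file using its header line
printf 'subject|age|score\ns01|23|0.9\n' > "$workdir/bigfile.csv"
columns=$(grep subject "$workdir/bigfile.csv" | tr '|' '\n' | sed '/./=' | sed '/./N; s/\n/ /')
echo "$columns"

# pick the number of one specific column, e.g. "age"
age_col=$(echo "$columns" | grep -w age | cut -d' ' -f1)
echo "age is column $age_col"
```

In item 11 the first sed prints a line number before every column name and the second sed joins each number with the name that follows it, so you get one “number name” pair per line.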