Project management == Data management

Over the past 2 years there has been a growing number of initiatives to address data management issues in scientific research. While this is no news in some fields like genetics, where input data as well as derived data (results) are shared in standardised digital formats, in other fields researchers have been mainly left alone to develop their own data structure and data management plan, i.e. “reinventing the wheel” multiple times. In Finland, the last round of Academy of Finland application was – for the first time – explicitly asking for a data management plan (DMP). To help the applicants, web tools such as https://www.dmptuuli.fi/ were developed to specifically follow Academy of Finland guidelines as well as other DMP templates such as ERC’s Horizon 2020, NIH, Wellcome Trust. DMPtuuli is based on https://dmponline.dcc.ac.uk/ delivered by http://www.dcc.ac.uk/. Another important resource in Europe is the EUDAT platform https://www.eudat.eu/ that goes beyond DMP, to also include data sharing, data preservation, data processing, meta-data indexing. For those more interested in the hot topic of data management in academia (or other data initiatives), I recommend following my dedicated Twitter list about data.

TL;DR summary of the post

  1. Use GIT for each project
  2. Physical storage separation between
    • source data
    • code/text
    • scratch data

Let’s get to work!

Management plans are important, but when plans turn into actual work, it becomes clear that data management is tightly intertwined with project management. Project management is another monster by itself (ps: I cannot recommend enough Basecamp, the best project and intranet communication management tool for small companies and small labs), but in scientific research the two fuse together as soon as you start planning how to actually store your raw data, derived data, figures, manuscripts, etc., and how to share the workload with collaborators or (for PIs) monitor the work of your lab members.

I often notice tweets on the topic of project+data management (here a recent one that comes to my mind) and, so far, there is no optimal agreed way to organize folder structure for scientific projects. Clearly, raw data have similar issues and, specifically to neuroscience, data formats like BIDS are finally solving the issue on how to structure raw data in a standardised format to facilitate data sharing and pipelined data processing. Similarly for results, a new format called NIDM (NeuroImaging Data Model) aims at providing a structured machine-readable description of neuroimaging statistical results.

Everyone agrees, however, that a project should live in a single folder, and this is the approach of packages like cookiecutter https://github.com/audreyr/cookiecutter. With cookiecutter you can create a standardised directory tree to store the relevant parts of your project. Interesting cookiecutter templates for data scientists are https://github.com/drivendata/cookiecutter-data-science and https://github.com/mkrapp/cookiecutter-reproducible-science (the figure below is the folder tree structure for cookiecutter-data-science).


Tree folder structure from https://drivendata.github.io/cookiecutter-data-science/

Similarly, the project folder structure by Nikola Vukovic gained some popularity http://nikola.me/folder_structure.html (contains script for automatically creating the folder structure, see figure below).


Figure from http://nikola.me/folder_structure.html

Richard Darst at Aalto university computer science dept. has suggested a simple guideline for project folder management; a folder structure like:

  • PROJECT/code/ – backed up and tracked in a version control system.
  • PROJECT/original/ – original and irreplaceable data. Backed up at the same time it is placed here.
  • PROJECT/scratch/ – bulk data, can be regenerated from code+original
  • PROJECT/doc/ – final outputs, which should be kept for a very long term.
  • PROJECT/doc/paper1/
  • PROJECT/doc/paper2/
  • PROJECT/doc/opendata/

and variations for individual sub-projects within a project:

  • PROJECT/USER1/…. – each user directory has their own code/, scratch/, and doc/ directories. Code is synced via the version control system. People use the original data straight from the shared folder in the project.
  • PROJECT/USER2/….
  • PROJECT/original/ – this is the original data.
  • PROJECT/scratch/ – shared intermediate files, if they are stable enough to be shared.

(for Aalto users, more details in the wiki page https://wiki.aalto.fi/display/Triton/Data+management).

The crunch: what should I do for project+data management?

Q: I am a new PI and I am starting a new lab, which folder structure should I use?

I personally like the barebone approach proposed by Aalto’s computer scientists which aims at keeping the folder structure as simple as possible, yet standardised across projects so that a new person joining an existing project already knows where the relevant bits are stored. Specifically, at Aalto Science there are three types of storage systems:

  1. /archive/ – long term preservation, backed-up periodically, as much disk space as it is needed
  2. /project/ – backed up daily, ideal for storing smaller files related to a project, limited disk space
  3. /scratch/ – not backed up, derived data related to a project, “infinite” disk space

For those who are not at Aalto University, a similar system at home would be:

  1. Archive folder is an external hard disk, backed up twice, with only raw data
  2. Project folder is a directory under your local Google Drive or Dropbox folder for automatic back up of (smallish) files
  3. Scratch folder is a huge external disk (e.g. 1TB) of derived data, no back-up but easy to recreate by running code from project folder.

The project+data management procedure I am using goes as follows:

1) start a git repository

Go to github or version.aalto.fi and start a repository with a meaningful project name that we will call myprojectnickname. Make sure that the same name has not been used in the shared project folder (in our group, folder /m/nbe/project/braindata/, in your home computer /Users/username/Google\ Drive/project/). Git is scary for some people (check my 10 minutes introduction to GIT), but what I require my colleagues to AT LEAST do is to create an empty repository with just a single file called README.md. They can edit it via the github web interface, for example. README.md is the barebone digital brother of the “lab notebook”, so that anyone joining the project immediately gets an idea of what the project is about, along with other important notes (where the data are, what has been done so far, a to-do list of next steps).

2) go to the main project storage system and clone the newly created git repository

From the command line in Linux or Mac

cd /m/nbe/project/braindata/
git clone URL_OF_YOUR_REPOSITORY

This will make sure that your code and other relevant documents are backed up daily (so that even if you do not want to use GIT, you still get your files and code backed up).

3) Create subfolder code

Here goes all the code that is needed. This means that everything you do is done with scripts. If you use graphical interfaces (e.g. to create pictures) you should write down the steps.

Side note: always use simple text files!
Use simple text files when possible and avoid using MS-Word or Open Office for writing notes about your project and your data. Similarly, CSV or TSV are better than Excel files. The reason is that in 10 or 20 years those file formats might not be easily readable, while text files will always stay with future us. The so-called markdown format (as used in README.md) is a good example of a simple text file with a bit of formatting explicitly written in the file. See a markdown quick introduction here: https://guides.github.com/features/mastering-markdown/

4) Create subfolder original

For the original data, what you actually create is a link to a folder on /archive/, the long term back up disk system.

cd /m/nbe/project/braindata/myprojectnickname/
ln -s /m/nbe/archive/braindata/myprojectnickname/ original

Side note for those using Google Drive or Dropbox
Google Drive ignores links to folders and does not back them up (which is good in our case since they contain huge files). Dropbox however also backs up links, so you need to explicitly tell Dropbox not to synchronise your “big data” folders.

Now the subfolder original is not a real subfolder but just a link to a subfolder in the archive file system. Note: before running the above, make sure the subfolder on archive exists.

5) Create subfolder scratch

Similarly, the subfolder for scratch is just a link to a real folder under the scratch filesystem

mkdir /m/nbe/scratch/braindata/myprojectnickname/
cd /m/nbe/project/braindata/myprojectnickname/
ln -s /m/nbe/scratch/braindata/myprojectnickname/ scratch

6) That’s it. You have the bare minimum needed for starting the project!

Reward yourself with a cookie.
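The steps above can be sketched as one small script. This is only a minimal sketch under loud assumptions: a temporary directory stands in for the /m/nbe/ root so that it can be run anywhere, the git clone step is left out, and the README.md content is a made-up placeholder.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a temporary directory stands in for the /m/nbe/ root,
# so the sketch can be run anywhere without touching real storage.
root="$(mktemp -d)"
name="myprojectnickname"

# Stand-ins for the three storage systems: archive, project and scratch.
mkdir -p "$root/archive/braindata/$name"
mkdir -p "$root/scratch/braindata/$name"
mkdir -p "$root/project/braindata/$name/code"

cd "$root/project/braindata/$name"

# README.md is the minimal digital lab notebook (normally it arrives via git clone).
printf '# %s\n\nWhere the data are, what has been done, what comes next.\n' "$name" > README.md

# original and scratch are links to the other file systems, not real subfolders.
ln -s "$root/archive/braindata/$name" original
ln -s "$root/scratch/braindata/$name" scratch

# List the result: code/, README.md and the two links.
ls -l
```

On a real system you would replace the temporary root with your own archive, project and scratch locations.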


I think it is then up to the user or research group to go as deep as they want to define standards for the subfolder structures. A PI with many lab members will want to make sure that also other aspects of the projects folder structure are standardised (results, figures, etc etc, see the cookiecutter figure above) so that it is easier to check the status of a project without asking the project owner where file X is (and please remember that the README.md file should explain the subfolder structure for the project for exactly these special cases).

What if I want to just play with data and don’t have a project?

In our group I have also thought about those cases where a user just needs a sandbox to play with data without having a clear project. For this, each user can create a folder:

mkdir /m/nbe/project/braindata/shared/username

and then what goes under there is up to the user (as long as the disk usage stays small, e.g. only a few gigabytes). Similarly for an “infinite” disk space sandbox:

mkdir /m/nbe/scratch/braindata/shared/username/

When sharing a file system with many others, the shared folder is useful to share resources that everybody uses. So for this case we have a folder

/m/nbe/scratch/braindata/shared/toolboxes/

where all the external tools (SPM, FSL, Freesurfer, etc) are stored and kept up to date (i.e. a single user does not have to re-download a toolbox that is already present in the system).

Conclusions

I think this blog entry is just a starting point and I believe I will edit this in the future with useful comments from colleagues and internet people. At this stage the procedure I described is manual, which is fine for a small lab (always remember https://xkcd.com/1205/), but young PIs might want to seriously consider using cookiecutter with cookiecutter-data-science template from day zero to automate the creation of multiple subfolders [ps: there is even one template for fMRI or more in general neuroscience projects https://github.com/fatmai/cookiecutter-fmri].

Data: Copy or sync?

I have often heard this question from students and colleagues: should I back-up my files to an external disk using copy (e.g. from the graphical interface) or are there better ways (e.g. from the terminal)?

Why copying files is NOT the way to go

When you copy a file, you clone an existing file into a new file. I.e. the content of the new file is identical to the old file, but the metadata of the new file is different. Simple example: the creation date. In many cases (e.g. when backing up large amount of files), you would like to preserve the original creation date/modification date of your files, because it can help in finding specific files. A typical example with academics and students is “which one was the latest version of my manuscript/essay?” (psst! There’s version control for that, you know?). Being able to check the original creation date of a file, might help you find the right file.
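To see the metadata issue in practice, here is a small sketch that compares a plain copy with a copy made with cp -p (the standard cp option that preserves timestamps and permissions). The file names are invented for the demo, and it looks at the modification time rather than the creation time, because that is what the standard shell tests expose.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
cd "$work"

# Create a file and pretend it was last modified on 1 January 2015.
echo "first draft" > essay.txt
touch -t 201501011200 essay.txt

cp essay.txt plain_copy.txt          # plain copy: gets a fresh modification time
cp -p essay.txt preserved_copy.txt   # -p asks cp to preserve timestamps and permissions

# -nt is the shell test for "newer than".
if [ plain_copy.txt -nt essay.txt ]; then
    echo "plain copy: modification time was reset to now"
fi
if [ ! preserved_copy.txt -nt essay.txt ] && [ ! essay.txt -nt preserved_copy.txt ]; then
    echo "cp -p copy: modification time is unchanged"
fi
```

Preserving timestamps this way still does not verify the content or let you resume an interrupted copy.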

Furthermore, when you copy a file (from the terminal with the command ‘cp’ or from the graphical interface with copy and paste) the operating system does not check if the file has been copied correctly. Of course in most cases (e.g. no space left on device) the system will report an error to the user if things go wrong, but things can get tricky when there are tens of thousands of files: 1) you don’t want to start manually checking that each file went fine 2) if the copy dies midway, you don’t want to manually check, out of thousands of files, which ones were successfully copied and which were not.

Do not copy, sync!

Nobody loves sync more than I do. I even did a PhD about sync! Well in this case we are not talking about synchrony of minds, but more simply synchrony between your source data and your copied data. If you sync a file instead of copying it, you make sure that 1) metadata are preserved (creation time, modification time, etc) 2) identity is preserved, i.e. the copied file is actually identical bit-by-bit to the original one 3) if a sync process dies in the middle, you can easily resume from where it stopped (very useful when syncing data over the internet)

How to do that? There are graphical programs for Mac http://www.chriswrites.com/how-to-sync-files-and-folders-on-the-mac/, Windows http://www.online-tech-tips.com/free-software-downloads/sync-two-folders-windows/ and Linux http://www.opbyte.it/grsync/.

However, here I just show the non-graphical solution with a one liner from the terminal (works on Linux and Mac).

For local files:

rsync -avztcp /path/to/source/ /path/to/destination/

Syncing over the internet from remote to local:

rsync -avztcp -e "ssh" user@server.com:/remote_source/ /local_destination/

or from local to remote:

rsync -avztcp -e "ssh"  /local_source/ user@server.com:/remote_destination/

Disentangling the options of rsync is beyond the scope of this simple explanation, but basically the options I wrote here make sure that every file in folder 1 (source) ends up as an identical copy in folder 2 (destination). Furthermore, the sync is incremental: a file that is deleted in the source folder will NOT be deleted in the destination folder (however, a file that is modified in the source folder will be modified in the destination folder). Check the rsync manual pages for more explanations. Please remember the ending slash in your folder paths (see here what the differences are http://qdosmsq.dunbar-it.co.uk/blog/2013/02/rsync-to-slash-or-not-to-slash/)
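The incremental behaviour is easy to try out on throwaway folders. The sketch below is only an illustration: the folder and file names are invented, and the extra -n option (rsync's dry run) is added to preview a sync without changing anything.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demo gracefully if rsync is not installed.
command -v rsync >/dev/null || { echo "rsync not found, skipping demo"; exit 0; }

work="$(mktemp -d)"
mkdir -p "$work/source" "$work/destination"
echo "raw data" > "$work/source/keep.txt"
echo "old notes" > "$work/source/removed_later.txt"

# First sync: both files are copied to the destination.
rsync -avztcp "$work/source/" "$work/destination/"

# Delete one file and modify the other in the source.
rm "$work/source/removed_later.txt"
echo "raw data, second version" > "$work/source/keep.txt"

# -n (--dry-run) only lists what would be transferred, without changing anything.
rsync -avztcpn "$work/source/" "$work/destination/"

# Second sync: the modified file is updated, the deleted one stays in the destination.
rsync -avztcp "$work/source/" "$work/destination/"
ls "$work/destination"
```

After the second sync the destination still contains removed_later.txt, while keep.txt holds the new content.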

Fingerprint of a file

A side note: how does rsync manage to do that? It is simple, rsync – before and after copying all the files – computes a “checksum” for each file, i.e. a fingerprint of the file. It is useful to learn how to compute the fingerprint of a file, for example to check that two files you downloaded are identical or to make sure the content you have downloaded is identical to the content stored in the remote server.

To do that from the terminal just type:

md5sum  filename

and that will generate an output similar to:

c5dd35983645a36e18d619d865d65846  filename

The long string is the MD5 fingerprint of the file. There are other types of fingerprints e.g. sha1sum, read more about checksums on the internetz. When downloading data from open repositories, data files are usually accompanied by MD5 checksums so that you can actually check that the download was successful. For example, if you have downloaded the Human Connectome Project data, you should have a folder with many zip files and for each file.zip file there is a corresponding file.zip.md5 containing the fingerprint of the zip file. The following for loop just checks that the fingerprint of the downloaded file is identical to the original fingerprint of the file.

for f in *.md5; do md5sum -c "$f"; done

If you keep a log of the output you can then search for those files that are not ok:

## loop through all md5 checksum files and store the output in a log file
for f in *.md5; do md5sum -c "$f"; done 2>&1 | tee md5checks.log

## show which lines of the log file are not OK
grep -v 'OK$' md5checks.log
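If you have no downloaded dataset at hand, the whole check can be rehearsed on fake files. In this sketch the zip files are just small text files with invented names, one of them is corrupted on purpose, and md5sum from GNU coreutils is assumed to be available (as on most Linux systems).

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
cd "$work"

# Two hypothetical "downloads" and their fingerprints, as a repository would publish them.
echo "subject 1 data" > sub1.zip
echo "subject 2 data" > sub2.zip
for f in sub1.zip sub2.zip; do
    md5sum "$f" > "$f.md5"
done

# Simulate a corrupted download by changing one file after its fingerprint was taken.
echo "garbage" >> sub2.zip

# Check every fingerprint; "|| true" keeps the loop going when a check fails.
for f in *.md5; do
    md5sum -c "$f" || true
done 2>&1 | tee md5checks.log

# Show only the lines that are not OK.
grep -v 'OK$' md5checks.log
```

The same md5sum file > file.md5 step is what you would run before sharing your own data, so that others can verify their download.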

Do I really need to stop using copy???

Well, let’s be clear, you don’t HAVE to always use rsync. If you are just storing a copy of a single file to your USB stick, it is easier to just copy and paste and (maybe) check manually that the file was copied correctly. But you will agree with me that when there are more than a dozen files, rsync is the way to go.

Advanced options

There are plenty of websites with advanced options for rsync; here are a couple that I have found convenient:

  1. Sync all files with extension “.m” and include all subfolders. Do not include the “Downloads” subfolder. The order matters: rsync applies the first rule that matches, so the exclusion of “Downloads” has to come before the rule that includes all folders.
    rsync -avztcp --exclude 'Downloads' --include '*/' --include '*.m' --exclude '*' /sourcefolder/ /destinationfolder/
  2. Sync files that are smaller than 100MB.
    rsync -avztcp --max-size=100m /sourcefolder/ /destinationfolder/
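Include and exclude rules are easy to get wrong, because rsync tests them in order and the first match wins. A safe way to check a rule set is to try it on a throwaway tree first. The sketch below uses an invented folder layout and places the exclusion of Downloads before the rule that includes every folder.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demo gracefully if rsync is not installed.
command -v rsync >/dev/null || { echo "rsync not found, skipping demo"; exit 0; }

work="$(mktemp -d)"
mkdir -p "$work/src/analysis" "$work/src/Downloads" "$work/dst"
echo "% matlab" > "$work/src/main.m"
echo "% matlab" > "$work/src/analysis/stats.m"
echo "notes"    > "$work/src/analysis/notes.txt"
echo "% matlab" > "$work/src/Downloads/toolbox.m"

# Rules are tested in order and the first match wins:
# 1) skip Downloads, 2) descend into every other folder,
# 3) take *.m files, 4) skip everything else.
rsync -avztcp --exclude 'Downloads' --include '*/' --include '*.m' --exclude '*' \
    "$work/src/" "$work/dst/"

# Only main.m and analysis/stats.m should have been copied.
find "$work/dst" -type f
```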

GIT in 10 minutes

This short tutorial gets you started with GIT from zero. It is intended for a non-technical audience that wants to adopt GIT as an efficient way to track a project. Here are some slides on the topic: https://users.aalto.fi/~eglerean/git.pdf

Table of contents
What?
Why?
Set me up
Get things done
Fix errors

What is GIT?

  • GIT is a version control system: a simple way to put a time-stamp to your files so that you can keep track of all the changes.
  • A copy of the files is stored remotely on the GIT repository server, so GIT can also function as a backup for your files.
  • You can always revert to a previous version of a file.
  • GIT was born to track code for software projects, so it works best with simple text files. It also works with binary files.
  • GIT forces you to add a comment every time you want to back-up the status of your project; If you write meaningful comments, it will be easier to track changes.
  • GIT was born to work with other collaborators, however the two persons you collaborate with the most are your past self and your future self.

Why use GIT?

Some motivating scenarios

  1. You are a programmer: most likely you are working in a team and/or releasing software for public use. Then you need git to keep colleagues and users updated on changes and new features
  2. You want to become a programmer: there is no better CV than showing that you can actually write code and solve problems. Companies look for your github profile to see how you code. Start using GIT for all your school homework or class projects and with no effort you have a huge programming portfolio.
  3. You are not a programmer, you do not want to become a programmer, but you are working on a project: even if you work alone, you are still collaborating with your past self and with your future self. GIT keeps a status of your files, your notes and your project so that you don’t have to rely on file names such as projectReport_final_verylastversion_thisisthefinal.doc.
  4. You are a (data) scientist: it doesn’t matter if you code or not, you need to have a lab notebook to keep track of anything that happened throughout your project, from the request of ethical permit, to piloting, to final analysis and writing the paper. GIT can become your digital lab notebook.

GIT in practice part 1: set-up your project

This guide is for Linux/Mac users: it requires the use of the terminal to fully understand the steps involved. Once you have understood what happens under the hood, you can switch to a graphical user interface if you prefer.

  1. Create an account on github, bitbucket or any other GIT repository (for my colleagues and students, click here).
    If you deal with sensitive data, make your project private (e.g. by asking your IT team to set up a git repository for you).
  2. Set-up ssh keys.
    Ssh keys are just a way to avoid using passwords. From experience, using GIT via SSH is i) faster, ii) less prone to server limitations (important when uploading large files) and iii) more secure. Check if you already have ssh keys on your computer. If not, you can generate a public/private ssh key pair. Follow the steps here: https://help.github.com/articles/generating-an-ssh-key/
    Then upload the public key to the website of the GIT repository, usually under your account settings. Then open the terminal and validate the key by typing something like:
ssh -T git@github.com

For my colleagues and students:

ssh -T git@git.becs.aalto.fi

 

  • Create a new repository from the web interface.
    Give it a meaningful name (imagine you want to share it with others). Do not use spaces or hyphens. With GITLAB, if you are out of own projects, create a project under the group.
  • Set up the local copy of the repository on your machine.
    From the web-interface, copy the SSH string for your project (something like: git@github.com:username/projectname.git); Open a terminal, go to the folder where you want your project to live and type:

 

mkdir projectname
cd projectname
git init
git remote add origin git@github.com:username/projectname.git

 

  • Create a file called README.md and store it in the projectname folder. This is your main project file. I use it to let others know what is inside this project and where to look for references. I also use it to write down experiment choices, parameters used in the analysis, todo lists. The extension .md is for “markdown”, a formatting language to set things like headers, lists etc. The file README.md is displayed by default on your project GIT page, so it is nice to add some formatting. Here a simple formatting to copy paste:
    # Project main title
    *A small one-line subtitle*
    
    Here a link with a picture. The picture is stored in a subfolder called figures.
    [
    ![NAME FOR THE LINK](figures/demo_figure.png)
    ](http://the-actual-link-whatever-it-is.com)
    
    Code released under [MIT License](https://en.wikipedia.org/wiki/MIT_License) (see LICENSE file).
    
    ## WHAT IS IT
    Describe what is it about
    
    ## WHAT IS WHERE
    Describe organization of subfolders and files
    
    ## HOW TO INSTALL
    A manual, if it's a software.
    
    ## HOW TO CITE
    A link to the publication
    
    ### 15/07/2015 TODO
    A simple todo list for your future self.
    
  • Add the file to the repository.
    Tell GIT that we want to keep track of this file. This is done only once per file with command:

 

git add README.md
  • Commit the file to the repository.
    Tell GIT that we want to permanently store the current version of the file, with command:

    git commit -m "A meaningful message that explains what you did" README.md
  • Synchronize your local copy of the repository with the remote one
    This is where the file is uploaded to the remote repository and backed up. Use command

    git push -u origin master

    Sometimes it’s enough to just type

    git push
  • Check that the web repository has the new file
    Hurray! You have a fully functioning repository
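If you want to rehearse part 1 without touching a real server, the whole sequence can be run locally. This is a sketch under assumptions: a local bare repository plays the role of the remote, the user name and e-mail are made up for the demo, and git push -u origin HEAD is used so that it works whether your default branch is called master or main.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demo gracefully if git is not installed.
command -v git >/dev/null || { echo "git not found, skipping demo"; exit 0; }

work="$(mktemp -d)"

# Assumption: a local bare repository plays the role of the remote server.
git init --quiet --bare "$work/remote.git"

mkdir "$work/projectname"
cd "$work/projectname"
git init --quiet
git remote add origin "$work/remote.git"

# Identity used for commits in this demo only (hypothetical name and address).
git config user.name "Demo User"
git config user.email "demo@example.com"

# Create the README, track it and commit it.
printf '# Project main title\n*A small one-line subtitle*\n' > README.md
git add README.md
git commit --quiet -m "Add README with project title"

# Push the current branch, whatever its name is, and remember the remote for later pushes.
git push --quiet -u origin HEAD

# The remote now lists the same commit as the local repository.
git ls-remote origin
```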

GIT in practice part 2: working on an existing repository

You are all set-up and just want to start adding files, editing them, keeping track of them.

  1. Before doing anything, make sure your local repository is up to date with the remote one.
    Open the terminal and go to the local GIT folder for the project. To get a simple summary of the local status, just type:
git status

To see if there are any differences between local and remote, type:

git remote update

and/or

git diff origin/master

Update local version with the remote one:

git pull
  • Add a new file. Create a new file or copy an existing file in a folder (or subfolder) of your local GIT repository and tell GIT that you want to track this file (just once per file):
    git add subfolder/name_of_file.txt
  • Commit file. Tell your local GIT repository to store the current version of the file (nothing is sent to the remote repository until you push).
    git commit -m "useful message here" subfolder/name_of_file.txt
  • Push the file to the remote repository
    git push
  • Editing. When editing an existing file, you just need to run git commit and git push. Before committing you can always check the latest changes to the local file by running:
    git diff name_of_file.txt
  • Delete. Remove a file with
    git rm filename
    git commit -m "deleting because..." filename
    git push
  • Re-start from scratch or get somebody else’s code
    Sometimes your local configuration gets messy, sometimes you change machine and you need to start from where you have left it. Go to an empty folder and just type

    git clone git@github.com:username/projectname

    This will create a subfolder called projectname and will download all remote files into it. The same command can be used to download someone else’s git repository. In the latter case, however, you might not have permission to push changes unless the repository owner gives you write access. For these cases you can also “fork” somebody else’s project. Just visit an existing project and you will see the fork button.
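The daily cycle described in this list can be rehearsed locally as well. Again this is just a sketch: a local bare repository stands in for the remote, the identity is invented, and --no-pager is added so that the output is printed directly when run as a script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demo gracefully if git is not installed.
command -v git >/dev/null || { echo "git not found, skipping demo"; exit 0; }

work="$(mktemp -d)"
git init --quiet --bare "$work/remote.git"   # assumption: local stand-in for the remote

# Start from a fresh clone, as in "re-start from scratch".
git clone --quiet "$work/remote.git" "$work/projectname"
cd "$work/projectname"
git config user.name "Demo User"             # hypothetical identity for this demo
git config user.email "demo@example.com"

# Add a new file, commit it and push it.
mkdir subfolder
echo "first line" > subfolder/name_of_file.txt
git add subfolder/name_of_file.txt
git commit --quiet -m "Add first notes"
git push --quiet -u origin HEAD

# Edit the file, inspect the change, then commit and push again.
echo "second line" >> subfolder/name_of_file.txt
git --no-pager diff subfolder/name_of_file.txt
git commit --quiet -m "Add second line to notes" subfolder/name_of_file.txt
git push --quiet

# Delete the file and record the deletion.
git rm --quiet subfolder/name_of_file.txt
git commit --quiet -m "Remove notes, no longer needed"
git push --quiet

# Three commits: add, edit, delete.
git --no-pager log --oneline
```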

GIT troubleshooting

In general, google the error you get: there is always an answer. If the local configuration gets messy, it is easier to start from scratch with git clone (see above). Sometimes the remote repository is modified (e.g. via the web interface) and the local repository also has changes. This can happen especially when collaborating with others. When both repositories have changes, they need to be merged. When you locally run git push, the command will complain that the remote contains work that you do not have locally. You can resolve these situations with:

  1. Run
    git fetch origin
    git merge origin/master
  2. Open the conflicting files. Conflicts will look like:
    <<<<<<< HEAD
    code on your version that is not in the remote version
    =======
    code on the remote version that is not in your local version
    >>>>>>> origin/master
  3. Remove the “<<<” “===” “>>>” lines and fix the conflicting changes. You can then git add the fixed files, git commit and git push again.
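A conflict is less scary once you have produced one on purpose. The sketch below builds the situation from scratch under invented assumptions: a local bare repository stands in for the remote, a second clone called theirs plays the collaborator, and the file params.txt and the set_identity helper are hypothetical names used only here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demo gracefully if git is not installed.
command -v git >/dev/null || { echo "git not found, skipping demo"; exit 0; }

work="$(mktemp -d)"
git init --quiet --bare "$work/remote.git"   # assumption: local stand-in for the remote

# Hypothetical helper that sets a demo identity inside a clone.
set_identity() {
    git -C "$1" config user.name "Demo User"
    git -C "$1" config user.email "demo@example.com"
}

# "mine" is your local copy; "theirs" plays a collaborator or the web interface.
git clone --quiet "$work/remote.git" "$work/mine"
set_identity "$work/mine"
cd "$work/mine"
echo "threshold = 0.05" > params.txt
git add params.txt
git commit --quiet -m "Add analysis parameters"
git push --quiet -u origin HEAD
branch="$(git rev-parse --abbrev-ref HEAD)"

git clone --quiet "$work/remote.git" "$work/theirs"
set_identity "$work/theirs"
echo "threshold = 0.01" > "$work/theirs/params.txt"
git -C "$work/theirs" commit --quiet -am "Use a stricter threshold"
git -C "$work/theirs" push --quiet

# Meanwhile the same line is changed locally, so a plain push is rejected.
echo "threshold = 0.10" > params.txt
git commit --quiet -am "Use a looser threshold"
git push --quiet 2>/dev/null || echo "push rejected: remote has new commits"

# Fetch the remote changes and merge them; the merge stops with a conflict.
git fetch --quiet origin
git merge "origin/$branch" || true
cat params.txt    # shows the <<<<<<< ======= >>>>>>> markers

# Resolve by writing the agreed value, then add, commit and push.
echo "threshold = 0.01" > params.txt
git add params.txt
git commit --quiet -m "Merge remote changes, keep stricter threshold"
git push --quiet
```

In real life you would edit the conflicting file by hand instead of overwriting it, but the add, commit and push steps stay the same.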

GIT dump of some advanced commands

From https://help.github.com/articles/changing-a-remote-s-url/ sometimes you have an existing local repository and want to back it up to a new remote repo. You first create your remote empty repo with the new location and then:

cd existing_repo
git remote set-url origin URLOFNEWREPO
git push -u origin --all
git push -u origin --tags
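The move to a new remote can also be tested safely before doing it for real. In this sketch two local bare repositories stand in for the old and the new remote (both are assumptions for the demo, as are the identity and the tag name v0.1).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demo gracefully if git is not installed.
command -v git >/dev/null || { echo "git not found, skipping demo"; exit 0; }

work="$(mktemp -d)"
# Assumption: two local bare repositories stand in for the old and the new remote.
git init --quiet --bare "$work/old_remote.git"
git init --quiet --bare "$work/new_remote.git"

# An existing local repository with one commit and one tag.
git clone --quiet "$work/old_remote.git" "$work/existing_repo"
cd "$work/existing_repo"
git config user.name "Demo User"             # hypothetical identity for this demo
git config user.email "demo@example.com"
echo "# demo" > README.md
git add README.md
git commit --quiet -m "Add README"
git tag v0.1

# Point origin at the new location and push every branch and tag there.
git remote set-url origin "$work/new_remote.git"
git push --quiet -u origin --all
git push --quiet -u origin --tags

# Check where origin points now and what the new remote contains.
git remote -v
git ls-remote origin
```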