Can you really get resting state in the brain?

I am currently teaching a course on Human Brain Connectivity and I have asked the students to ask me questions so that we can go deeper into what we discussed during the contact sessions. Here is one of the questions that came up.



Can you really get resting state in the brain? Isn’t everyone thinking about something, and in that way doing something with their brain, all the time?

As you can feel yourself when you lie down and stop moving, your mind keeps wandering and performing all sorts of mental tasks, so what is it that resting state really measures? There is indeed a whole field of neuroscience that explores this, along with mind-wandering, rumination, and what it means to measure the brain at rest.

In the jargon of the field, people talk about intrinsic activity (not driven by a stimulus) versus extrinsic activity (driven by and/or time-locked to external stimuli). There is one network called the default mode network (DMN, more on this in contact session #4) that is found to be active in the absence of a task. For this reason it was called “task negative” in the early days of resting state research, but today we know that it is actually involved in many tasks.

So what is the brain doing during rest, or in the absence of stimuli? Well, clearly mental tasks with no external stimuli, such as 1) remembering the events of your day, 2) subtracting numbers, or 3) (silently) singing lyrics, produce different connectivity patterns that can be decoded with 84% accuracy (https://www.ncbi.nlm.nih.gov/pubmed/21616982).

However, when subjects are asked to rest in the scanner they are usually not given specific instructions like “subtract numbers”. Some people, like Jonathan Smallwood, took things a bit further and started exploring the possibility that activity at rest reflects active rumination and mind-wandering (here is a recent review: https://www.ncbi.nlm.nih.gov/pubmed/25293689). Things, however, are not that simple: another recent meta-analysis found that mind-wandering involves not only the DMN but also other networks, including networks engaged during active task performance (https://www.sciencedirect.com/science/article/pii/S1053811915001408).

A recent paper in PNAS (http://www.pnas.org/content/113/48/13899) proposes that active mind-wandering is not DMN-specific, leaving the DMN more related to stable factors (put simply, the DMN seems to reflect who you are rather than what you are specifically thinking about). [Check also the reply to that PNAS paper, which tries to reconcile past and current literature: http://www.pnas.org/content/early/2017/07/12/1705108114.short]

When it comes to clinical conditions, pain rumination at rest has been associated with stronger connectivity within DMN regions (http://www.jneurosci.org/content/34/11/3969.short). Here is another paper comparing active rumination and “free” rest, which finds differences between healthy controls and major depressive disorder patients, with the note that the mood of the subjects has a stronger effect than the rumination itself (https://www.sciencedirect.com/science/article/pii/S1053811914007691).

Finally, one might argue that the true resting brain is the sleeping brain. Tagliazucchi and colleagues first learned, from simultaneous EEG + fMRI, the EEG pattern that sets in when people drift into sleep, and then used it to detect when people were falling asleep during a resting state fMRI paradigm. When applying this classifier to open datasets, they found that in about 30% of public resting state data the participants were falling asleep in the scanner! (https://www.sciencedirect.com/science/article/pii/S0896627314002505)

So… is resting state the way to go? I think the answer, as with everything, is both yes and no. Resting state is good if combined with other basic tasks, as well as with mood sampling of the participant and active monitoring of the participant’s mental state (e.g. by tracking eye movements during rest to see if the person falls asleep, and by asking the subject about her mental state after scanning). On the other hand, it is also very helpful that even somebody who cannot perform any task (a sedated patient, or somebody in a coma [although it is more intriguing – and somewhat shocking – to put patients in a coma inside a scanner and show them a movie: http://www.pnas.org/content/111/39/14277]) can still undergo resting state fMRI: the resulting networks are mostly related to the individual rather than to momentary fluctuations, so they can still work as a “fingerprint” (https://www.nature.com/articles/nn.4135) for future diagnostic purposes.


Please contribute to the answer with comments below.


Assessing the similarity between time series – Part 1

In this series of posts I will discuss how to assess the similarity between a pair of time series using frequentist methods, i.e. methods that in the end provide a p-value and a similarity score (e.g. a correlation) using parametric and/or non-parametric approaches. Most explanations below are intuitive; you can read the details in books on signal processing and random processes. I will mostly show examples using Pearson’s correlation, for two reasons: 1) Pearson’s r is equal to the beta coefficient of the linear regression between two signals with zero mean and unit variance (i.e. z-scored signals); 2) if you think Pearson’s is not appropriate because you have a few extreme spikes (= noise) in your data, then I would rather remove the spikes (e.g. with a sliding median filter or with linear regression) and still work with Pearson. However, if the spikes are actually part of your signal of interest, then you need to work with Spearman’s r, and it is not difficult to adapt what follows to the Spearman case.
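
To convince yourself of point 1, here is a minimal sketch (written for this post, not part of the original code): regressing one z-scored signal on the other returns exactly Pearson’s r.

rng(1)
T = 200;
x = randn(T,1);
y = 0.5*x + randn(T,1);        % two correlated signals
r = corr(x,y)                  % Pearson's correlation
beta = zscore(x) \ zscore(y)   % slope of the regression between the z-scored signals
% r and beta coincide up to numerical precision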

1. Similarity between two “white” signals

If the time series we are dealing with are “basically white” (i.e. the time points come from a Gaussian process with no dependence between time points), then a simple correlation and the parametric p-value obtained from it are enough to claim statistical similarity.

rng(0)          % for reproducibility
T=1000;         % number of time points
x = randn(T,1); % two independent white (Gaussian) time series
y = randn(T,1);
[r p]=corr(x,y) % Pearson's correlation and its parametric p-value

If you copy-paste this into Matlab, it will give:

r =
-0.0093

p =
0.7695

How can you assess whether the signal you are dealing with is white? There are tests to check whether the data points are well fit by a Gaussian distribution (Enrico to add link to examples here). A white signal means that the frequency power spectrum (i.e. the power spectral density, PSD) is roughly a flat line (i.e. all frequencies have approximately the same power). Furthermore, the inverse Fourier transform of the PSD is the autocorrelation of the time series. As the PSD is a flat line, the autocorrelation must be (approximately) a Dirac delta function, i.e. a “stick” function that is different from zero at the origin and almost zero at every other lag.
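
As a rough practical check (a minimal sketch, written for this post): for a truly white signal of length T, the sample autocorrelations at nonzero lags should mostly fall within the approximate 95% bounds of ±1.96/sqrt(T).

rng(0)
T = 1000;
x = randn(T,1);                 % candidate signal to test
maxlag = 20;
ac = zeros(maxlag,1);
for lag = 1:maxlag
    ac(lag) = corr(x(1:end-lag), x(1+lag:end)); % sample autocorrelation at this lag
end
bound = 1.96/sqrt(T);           % approximate 95% confidence bound for white noise
fraction_outside = mean(abs(ac) > bound)  % should be around 0.05 for a white signal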

Autocorrelation

Put intuitively, every real signal is autocorrelated, and autocorrelation means information. Signals measured from nature tend to have a 1/f power spectrum: what happens at a specific time point is strongly correlated with what happens at the next (= immediately following) time point. Some signals are faster than others, so the spacing between “immediately following” time points matters: it is set by the sampling frequency, and you want to make sure that it is high enough to keep most of the frequencies of your signal within the sampling range (go read about the Nyquist–Shannon sampling theorem if you have not heard of it before).
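
To see this in practice, here is a minimal sketch (not from the original post) comparing the power spectrum of white noise with that of a random walk, whose power piles up at low frequencies:

rng(0)
T = 1000;
white = randn(T,1);             % white signal: roughly flat spectrum
walk  = cumsum(randn(T,1));     % strongly autocorrelated signal
f = (1:T/2-1)'/T;               % normalised frequencies, DC excluded
Pwhite = abs(fft(white)).^2; Pwhite = Pwhite(2:T/2);
Pwalk  = abs(fft(walk)).^2;  Pwalk  = Pwalk(2:T/2);
loglog(f, Pwhite, f, Pwalk)
xlabel('frequency'); ylabel('power')
legend('white noise (flat)', 'random walk (power at low frequencies)')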

Long story short, you can estimate the autocorrelation of a time series by computing the correlation between the signal and a lagged copy of itself. Repeat for multiple lags and you get the autocorrelation plot. With Matlab:

subplot(2,1,1)
autocorr(x)
subplot(2,1,2)
autocorr(y)

[Figure: sample autocorrelation plots of x and y]

And indeed we see that the two time series have an autocorrelation that basically looks like a Dirac delta, i.e. maximum correlation at lag = 0 (an obvious result) and very small correlations at all other lags.

2. Relationship between r, p, t-distribution, autocorrelation, and degrees of freedom

Now, to understand how the p-value is computed parametrically from an r value, one has to look at the distribution used. Matlab uses this kind of mapping:

Code: https://version.aalto.fi/gitlab/BML/bramila/blob/master/external/sandbox/pvalPearson.m

function p = pvalPearson(tail, rho, n)
%PVALPEARSON Tail probability for Pearson's linear correlation.
%   p = pvalPearson(tail, rho, n)
%   tail: 'b' (both), 'r' (right) or 'l' (left)
t = sign(rho) .* Inf;
k = (abs(rho) < 1);
% map r to a t-statistic with n-2 degrees of freedom
t(k) = rho(k).*sqrt((n-2)./(1-rho(k).^2));
switch tail
    case 'b' % 'both' or 'ne'
        p = 2*tcdf(-abs(t),n-2);
    case 'r' % 'right' or 'gt'
        p = tcdf(-t,n-2);
    case 'l' % 'left' or 'lt'
        p = tcdf(t,n-2);
end

Here n is the total number of time points, which corresponds to n-2 degrees of freedom. So if your signals are white, the degrees of freedom in your data are indeed very close to the number of time points, since each time point is independent of the others. But real signals are autocorrelated, which intuitively means that even if you downsampled the signal (e.g. keeping 1 time point every 10) you would still retain most of the information.
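
A minimal illustration of that intuition (written for this post): the correlation between two random walks barely changes if you keep only every 10th time point, because the discarded points carry little independent information.

rng(0)
T = 1000;
x = zscore(cumsum(randn(T,1))); % two random walks (strongly autocorrelated)
y = zscore(cumsum(randn(T,1)));
r_full = corr(x, y)                       % using all 1000 time points
r_down = corr(x(1:10:end), y(1:10:end))   % keeping 1 time point every 10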

This means that if your signals are not white (i.e. all the signals you sample in real life), the p-value returned by the corr function is wrong, because it overestimates the degrees of freedom used in the t-distribution. How do we solve this problem? One solution is to estimate the actual degrees of freedom of an autocorrelated signal.

Prof. Petri Toiviainen spent some time digging through the literature for a paper we were coauthoring years ago, found this gem in the Canadian Journal of Fisheries and Aquatic Sciences, “Comparison of methods to account for autocorrelation in correlation analyses of fish data”, and summed it up in Appendix B of our paper.
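
The gist of that correction, as a minimal sketch written for this post (see bramila_autocorr.m below for the actual implementation; the maximum-lag heuristic here is my own assumption): the effective number of samples shrinks as the lag-wise products of the two autocorrelation functions grow, and the t-test then uses N_eff - 2 degrees of freedom.

function df = effective_df_sketch(x, y)
% EFFECTIVE_DF_SKETCH  effective degrees of freedom for correlating two
% autocorrelated time series: 1/N_eff ~ 1/N + (2/N)*sum_j rxx(j)*ryy(j)
N = length(x);
maxlag = round(N/5);                 % heuristic choice of how many lags to include
rxx = sample_autocorr(x, maxlag);
ryy = sample_autocorr(y, maxlag);
Neff = 1 / (1/N + (2/N) * sum(rxx .* ryy));
df = Neff - 2;                       % degrees of freedom for the t-test on r
end

function ac = sample_autocorr(s, maxlag)
% sample autocorrelation at lags 1..maxlag
s = s - mean(s);
ac = zeros(maxlag,1);
for lag = 1:maxlag
    ac(lag) = sum(s(1:end-lag).*s(1+lag:end)) / sum(s.^2);
end
end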

You can test the formula yourself with the code I wrote here: https://version.aalto.fi/gitlab/BML/bramila/blob/master/bramila_autocorr.m

Using our white signals from above:

bramila_autocorr(x,y)
ans =
999.3657

Indeed, for these two white time series the estimated number of degrees of freedom is almost equal to the total number of time points T.

3. Example with two autocorrelated signals

Here is a final example using two autocorrelated time series, showing how to compute the correct p-value parametrically by plugging the estimated degrees of freedom into the t-distribution.

T=1000;
rng(0)
x=zscore(cumsum(randn(T,1))); % random walks: strongly autocorrelated signals
y=zscore(cumsum(randn(T,1)));

[r p]=corr(x,y)               % naive correlation and (wrong) parametric p-value

df=bramila_autocorr(x,y)      % effective degrees of freedom

p_actual = pvalPearson('b', r, df+2) % corrected two-tailed p-value

which produces

r =
-0.5026

p =
3.8770e-65

df =
8.0402

p_actual =
0.1376

As we can see, the WRONG p-value obtained by blindly applying Matlab's corr function would tell us that the two time series are strongly and significantly correlated, with a minuscule p-value of 10^-65. In reality the effective degrees of freedom are just about 8, which makes the p-value non-significant (as it should be, since the two signals are just random noise).

In the next parts I will show how to obtain p-values without resorting to the parametric distributions used here. Specifically, one can generate surrogate signals that have properties similar to the original signals (e.g. the same autocorrelation) but in which the temporal order of events is destroyed. Examples are: resampling in frequency by randomising the phase of the Fourier-transformed signal, circular shifting in time, and block resampling in time.
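
As a preview, here is a minimal sketch of the first of these surrogates (phase randomisation), written for this post: the surrogate keeps the amplitude spectrum of x (and hence its autocorrelation) but destroys its temporal structure.

function xs = phase_randomise(x)
% PHASE_RANDOMISE  surrogate of a real-valued signal x with the same
% amplitude spectrum but random Fourier phases
T = length(x);
X = fft(x);
half = floor((T-1)/2);
phases = 2*pi*rand(half,1);                  % random phases
% apply them with conjugate symmetry so that the inverse FFT stays real
X(2:half+1)       = abs(X(2:half+1))       .* exp( 1i*phases);
X(end-half+1:end) = abs(X(end-half+1:end)) .* exp(-1i*flipud(phases));
xs = real(ifft(X));                          % real() strips numerical residue
end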

 

Project management == Data management

Over the past two years there has been a growing number of initiatives to address data management issues in scientific research. While this is nothing new in some fields like genetics, where input data as well as derived data (results) are shared in standardised digital formats, in other fields researchers have mainly been left alone to develop their own data structures and data management plans, i.e. “reinventing the wheel” multiple times. In Finland, the last round of Academy of Finland applications – for the first time – explicitly asked for a data management plan (DMP). To help the applicants, web tools such as https://www.dmptuuli.fi/ were developed to follow the Academy of Finland guidelines as well as other DMP templates (ERC/Horizon 2020, NIH, Wellcome Trust). DMPtuuli is based on https://dmponline.dcc.ac.uk/, delivered by http://www.dcc.ac.uk/. Another important resource in Europe is the EUDAT platform https://www.eudat.eu/, which goes beyond DMPs to also include data sharing, data preservation, data processing, and meta-data indexing. For those more interested in the hot topic of data management in academia (or other data initiatives), I recommend following my dedicated Twitter list about data.

TL;DR summary of the post

  1. Use GIT for each project
  2. Physical storage separation between
    • source data
    • code/text
    • scratch data

Let’s get to work!

Management plans are important, but when plans translate into doing the actual work, it becomes clear that data management is highly intertwined with project management. Project management is another monster by itself (PS: I cannot recommend Basecamp enough, the best project and intranet communication management tool for small companies and small labs), but in scientific research data management and project management fuse together as soon as you start planning how to actually store your raw data, derived data, figures, manuscripts, etc., and how to share the workload with other collaborators or (for PIs) monitor the work of your lab members.

I often notice tweets on the topic of project + data management (here is a recent one that comes to mind) and, so far, there is no agreed optimal way to organise the folder structure of a scientific project. Clearly, raw data have similar issues and, specifically in neuroscience, data formats like BIDS are finally solving the issue of how to structure raw data in a standardised format to facilitate data sharing and pipelined data processing. Similarly for results, a new format called NIDM (NeuroImaging Data Model) aims at providing a structured, machine-readable description of neuroimaging statistical results.

Everyone agrees, however, that a project should live in a single folder, and this is the approach of packages like cookiecutter https://github.com/audreyr/cookiecutter. With cookiecutter you can create a standardised directory tree to store the relevant parts of your project. Interesting cookiecutter templates for data scientists are https://github.com/drivendata/cookiecutter-data-science and https://github.com/mkrapp/cookiecutter-reproducible-science (the figure below shows the folder tree structure of cookiecutter-data-science).


Tree folder structure from https://drivendata.github.io/cookiecutter-data-science/

Similarly, the project folder structure by Nikola Vukovic has gained some popularity: http://nikola.me/folder_structure.html (the page contains a script for automatically creating the folder structure, see figure below).


Figure from http://nikola.me/folder_structure.html

Richard Darst at the Aalto University computer science department has suggested a simple guideline for project folder management, with a folder structure like:

  • PROJECT/code/ – backed up and tracked in a version control system.
  • PROJECT/original/ – original and irreplaceable data. Backed up at the same time it is placed here.
  • PROJECT/scratch/ – bulk data, can be regenerated from code+original
  • PROJECT/doc/ – final outputs, which should be kept for a very long term.
  • PROJECT/doc/paper1/
  • PROJECT/doc/paper2/
  • PROJECT/doc/opendata/

and variations for individual sub-projects within a project:

  • PROJECT/USER1/…. – each user directory has their own code/, scratch/, and doc/ directories. Code is synced via the version control system. People use the original data straight from the shared folder in the project.
  • PROJECT/USER2/….
  • PROJECT/original/ – this is the original data.
  • PROJECT/scratch/ – shared intermediate files, if they are stable enough to be shared.

(for Aalto users, more details in the wiki page https://wiki.aalto.fi/display/Triton/Data+management).

The crunch: what should I do for project+data management?

Q: I am a new PI and I am starting a new lab, which folder structure should I use?

I personally like the barebone approach proposed by Aalto’s computer scientists which aims at keeping the folder structure as simple as possible, yet standardised across projects so that a new person joining an existing project already knows where the relevant bits are stored. Specifically, at Aalto Science there are three types of storage systems:

  1. /archive/ – long term preservation, backed up periodically, as much disk space as needed
  2. /project/ – backed up daily, ideal for storing smaller files related to a project, limited disk space
  3. /scratch/ – not backed up, derived data related to a project, “infinite” disk space

For those who are not at Aalto University, a similar system at home would be:

  1. Archive folder is an external hard disk, backed up twice, with only raw data
  2. Project folder is a directory under your local Google Drive or Dropbox folder for automatic back up of (smallish) files
  3. Scratch folder is a huge external disk (e.g. 1TB) with derived data, no back-up but easy to recreate by running the code from the project folder.

The project + data management procedure I am using goes as follows:

1) start a git repository

Go to github or version.aalto.fi and start a repository with a meaningful project name, which we will call myprojectnickname. Make sure that the same name has not already been used in the shared project folder (in our group /m/nbe/project/braindata/, on your home computer /Users/username/Google\ Drive/project/). Git is scary for some people (check my 10-minute introduction to GIT below), but what I require my colleagues to AT LEAST do is create an empty repository with just a single file called README.md. They can edit it via the github web interface, for example. README.md is the barebone digital brother of the “lab notebook”, so that anyone joining the project immediately gets an idea of what the project is about, along with other important notes (where the data are, what has been done so far, a to-do list of next steps).

2) go to the main project storage system and clone the newly created git repository

From the command line in Linux or Mac

cd /m/nbe/project/braindata/
git clone git@github.com:username/myprojectnickname.git   # use the SSH URL of the repository you created in step 1

This will make sure that your code and other relevant documents are backed up daily (so that even if you do not want to use GIT, you still get your files and code backed up).

3) Create subfolder code

All the code that is needed goes here. This means that everything you do is done with scripts. If you use graphical interfaces (e.g. to create pictures), you should write down the steps.

Side note: always use simple text files!
Use simple text files when possible and avoid MS Word or Open Office for writing notes about your project and your data. Similarly, CSV or TSV files are better than Excel files. The reason is that in 10 or 20 years those file formats might not be easily readable, while plain text files will always stay with our future selves. The so-called markdown format (as used in README.md) is a good example of a simple text file with a bit of formatting explicitly written in the file. See a quick introduction to markdown here: https://guides.github.com/features/mastering-markdown/

4) Create subfolder original

For the original data, what you actually do is create a link to a folder on /archive/, the long-term backed-up disk system.

cd /m/nbe/project/braindata/myprojectnickname/
ln -s /m/nbe/archive/braindata/myprojectnickname/ original

Side note for those using Google Drive or Dropbox
Google Drive ignores links to folders and does not back them up (which is good in our case, since they contain huge files). Dropbox, however, also backs up links, so you need to explicitly tell Dropbox not to synchronise your “big data” folders.

Now the subfolder original is not a real subfolder but just a link to a subfolder in the archive file system. Note: before running the above, make sure the subfolder on archive exists.

5) Create subfolder scratch

Similarly, the subfolder for scratch is just a link to a real folder under the scratch filesystem:

mkdir /m/nbe/scratch/braindata/myprojectnickname/
cd /m/nbe/project/braindata/myprojectnickname/
ln -s /m/nbe/scratch/braindata/myprojectnickname/ scratch

6) That’s it. You have the bare minimum needed for starting the project!

Reward yourself with a cookie.


I think it is then up to the user or research group to decide how deep they want to go in defining standards for the subfolder structure. A PI with many lab members will want to make sure that other aspects of the project folder structure are also standardised (results, figures, etc., see the cookiecutter figure above), so that it is easier to check the status of a project without asking the project owner where file X is (and please remember that the README.md file should explain the subfolder structure of the project for exactly these cases).

What if I want to just play with data and don’t have a project?

In our group I have also thought about those cases where a user just needs a sandbox to play with data without having a clear project. For this, each user can create a folder:

mkdir /m/nbe/project/braindata/shared/username

and then what goes under there is up to the user (as long as the disk space used is not big, e.g. only a few gigabytes). Similarly, for an “infinite” disk space sandbox:

mkdir /m/nbe/scratch/braindata/shared/username/

When sharing a file system with many others, a shared folder is useful for resources that everybody uses. For this case we have a folder

/m/nbe/scratch/braindata/shared/toolboxes/

where all the external tools (SPM, FSL, Freesurfer, etc) are stored and kept up to date (i.e. a single user does not have to re-download a toolbox that is already present in the system).

Conclusions

I think this blog entry is just a starting point, and I will likely edit it in the future with useful comments from colleagues and internet people. At this stage the procedure I described is manual, which is fine for a small lab (always remember https://xkcd.com/1205/), but young PIs might want to seriously consider using cookiecutter with the cookiecutter-data-science template from day zero to automate the creation of the subfolders [PS: there is even a template for fMRI or, more generally, neuroscience projects: https://github.com/fatmai/cookiecutter-fmri].

Data: Copy or sync?

I have often heard this question from students and colleagues: should I back up my files to an external disk using copy (e.g. from the graphical interface), or are there better ways (e.g. from the terminal)?

Why copying files is NOT the way to go

When you copy a file, you clone an existing file into a new file, i.e. the content of the new file is identical to the old one, but the metadata of the new file is different. A simple example: the creation date. In many cases (e.g. when backing up large amounts of files), you would like to preserve the original creation/modification dates of your files, because they can help in finding specific files. A typical example for academics and students is “which one was the latest version of my manuscript/essay?” (psst! There’s version control for that, you know?). Being able to check the original creation date of a file might help you find the right one.

Furthermore, when you copy a file (from the terminal with the command ‘cp’ or from the graphical interface with copy and paste), the operating system does not check whether the file has been copied correctly. Of course in most cases (e.g. no space left on device) the system will report an error if things go wrong, but things get tricky when there are tens of thousands of files: 1) you don’t want to start manually checking that each file went fine; 2) if the copy dies midway, you don’t want to manually figure out which files were copied successfully and which were not.

Do not copy, sync!

Nobody loves sync more than I do. I even did a PhD about sync! Well, in this case we are not talking about synchrony between minds, but more simply about synchrony between your source data and your copied data. If you sync a file instead of copying it, you make sure that 1) metadata are preserved (creation time, modification time, etc.); 2) identity is preserved, i.e. the copied file is actually identical bit-by-bit to the original one; 3) if a sync process dies in the middle, you can easily resume from where it stopped (very useful when syncing data over the internet).

How to do that? There are graphical programs for Mac http://www.chriswrites.com/how-to-sync-files-and-folders-on-the-mac/, Windows http://www.online-tech-tips.com/free-software-downloads/sync-two-folders-windows/ and Linux http://www.opbyte.it/grsync/.

However, here I just show the non-graphical solution with a one liner from the terminal (works on Linux and Mac).

For local files:

rsync -avztcp /path/to/source/ /path/to/destination/

Syncing over the internet from remote to local:

rsync -avztcp -e "ssh" user@server.com:/remote_source/ /local_destination/

or from local to remote:

rsync -avztcp -e "ssh"  /local_source/ user@server.com:/remote_destination/

Disentangling all the options of rsync is beyond the scope of this simple explanation, but basically the options I wrote here make sure that folder 1 (source) will be identical to folder 2 (destination). Furthermore, the sync is incremental: a file that is deleted in the source folder will NOT be deleted in the destination folder (however, a file that is modified in the source folder will be modified in the destination folder). Check the rsync manual pages for more explanations. Please remember the trailing slash in your folder paths (see here for what difference it makes: http://qdosmsq.dunbar-it.co.uk/blog/2013/02/rsync-to-slash-or-not-to-slash/).
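
One habit worth adding (not in the original list): run rsync with --dry-run first, so that it only reports what it would transfer without touching anything.

rsync -avztcp --dry-run /path/to/source/ /path/to/destination/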

Fingerprint of a file

A side note: how does rsync manage to do that? It is simple: rsync – before and after copying the files – computes a “checksum” for each file, i.e. a fingerprint of the file. It is useful to learn how to compute the fingerprint of a file yourself, for example to check that two files you downloaded are identical, or to make sure the content you have downloaded is identical to the content stored on the remote server.

To do that from the terminal just type:

md5sum  filename

and that will generate an output similar to:

c5dd35983645a36e18d619d865d65846  filename

The long string is the MD5 fingerprint of the file. There are other types of fingerprints, e.g. sha1sum; read more about checksums on the internetz. When downloading data from open repositories, data files are usually accompanied by MD5 checksums so that you can actually check that the download was successful. For example, if you have downloaded the Human Connectome Project data, you should have a folder with many zip files, and for each file.zip there is a corresponding file.zip.md5 containing the fingerprint of the zip file. The following for loop checks that the fingerprint of each downloaded file is identical to the original fingerprint of the file.

for f in $(ls *.md5); do md5sum -c $f;done

If you keep a log of the output you can then search for those files that are not ok:

## loop through all md5 checksum files and stores the output on a log file
for f in $(ls *.md5); do md5sum -c $f;done 2>&1|tee md5checks.log

## show which lines of the log file are not OK
cat md5checks.log|grep -v OK$

Do I really need to stop using copy???

Well, let’s be clear: you don’t HAVE to always use rsync. If you are just storing a copy of a single file on your USB stick, it is easier to copy and paste and (maybe) check manually that the file was copied correctly. But you will agree with me that when there are more than a dozen files, rsync is the way to go.

Advanced options

There are plenty of websites with advanced options for rsync; here are a couple I have found convenient:

  1. Sync all files with extension “.m” and include all subfolders, but skip the “Downloads” subfolder. Note that rsync applies include/exclude rules in order (first match wins), so the Downloads exclusion has to come before the catch-all include of directories:
    rsync -avztcp --exclude 'Downloads/' --include '*/' --include '*.m' --exclude '*' /sourcefolder/ /destinationfolder/
  2. Sync files that are smaller than 100MB.
    rsync -avztcp --max-size=100m /sourcefolder/ /destinationfolder/
  3. When you need to set specific group permissions on the destination files
    rsync -avztcp -e "ssh" --chmod=g+s,g+rw --group=braindata /sourcefolder/ /destinationfolder/

Linux for (data) scientists – my collection of Linux FUC = frequently used commands

This is just a dump of Linux commands that I often use. They should be useful to the average (data) scientist and a source of inspiration for those who are learning Linux. Do not fear the terminal: it is the most powerful computer tool for a scientist! Sometimes I don’t remember the correct options, so this page is also useful for me, rather than always googling for ad-hoc solutions. This page will be updated frequently and the list will grow.

P.S.: A useful resource is also http://www.bashoneliners.com/.

 

  1. List contents of zip file without unzipping
    unzip -l filename.zip
  2. List contents of tar file without untarring
    tar -tf filename.tar.gz
  3. Extract content of a file in a zip
    unzip -c filename.zip file1.txt| less
  4. Save a remote file locally (-O), resume the download if connectivity breaks (-C -), and limit the download speed (--limit-rate). This is useful for downloading huge datasets in the background.
    curl -L -O -C - --limit-rate 200k http://remotefile
  5. Replace new lines with a symbol (e.g. a semicolon)
    cat file.txt|tr '\n' ';'
  6. Run a command and store the output to a log file (including standard errors)
    ./command.sh 2>&1|tee logfile.log
  7. Get a list of who is connected to a machine and how many processes each user is running (root is usually the user with most processes… but not always :)). Note the sed command: it replaces any number of consecutive spaces with a semicolon, i.e. it creates a CSV out of the previous command’s output. Then cut extracts column number 1, sort + uniq -c counts repeated occurrences, and a final sort orders the output.
    ps -fA|sed 's/ \+/;/g'|cut -d\; -f1|sort|uniq -c|sort -nr
  8. This one tests the md5 checksums of downloaded data (this works for the Human Connectome Project). Imagine you have downloaded file1.zip and file1.zip.md5. The second file contains the so-called md5 checksum, a unique fingerprint of your file. You want to test that the file you downloaded gives the same fingerprint that is written in the md5 file. To do this for all the hundreds of files of the HCP, you can just run:
    for n in $(ls *.md5);do f=$(echo $n|sed 's/.md5//g');echo -n $f" ";nn=$(md5sum $f);nnf=$(cat $n);if [ "$nn" == "$nnf" ];then echo "OK!";else echo "$nn $nnf";fi;done 2>&1|tee hcp_checks.log
  9. Commands to make sure that shared folders/files are accessible by the group. 1st one is for folders and the second is for files.
    find . -type d -exec chmod g+rwxs {} \;
    
    find . -type f -exec chmod g+rw {} \;
  10. Get the Nth line from a text file. NUM is a variable containing the line number.
    sed "${NUM}q;d" file
  11. Get the number of a column in a delimited file (here pipe-separated), based on the header (the header line contains the string that is grepped)
    cat bigfile.csv |grep header_of_a_column|tr '|' '\n'|sed '/./='|sed '/./N; s/\n/ /'
  12. List what is going on with your network over TCP (you can remove the -sTCP:LISTEN option for a full list of TCP connections)
    sudo lsof -PiTCP -sTCP:LISTEN

GIT in 10 minutes

This short tutorial gets you started with GIT from zero. It is intended for a non-technical audience that wants to adopt GIT as an efficient way to track a project. Here are some slides on the topic: https://users.aalto.fi/~eglerean/git.pdf

Table of contents
What?
Why?
Set me up
Get things done
Fix errors

What is GIT?

  • GIT is a version control system: a simple way to put a time-stamp on your files so that you can keep track of all the changes.
  • A copy of the files is stored remotely on the GIT repository server, so GIT can also function as a backup for your files.
  • You can always revert to a previous version of a file.
  • GIT was born to track code for software projects, so it works best with simple text files. It also works with binary files.
  • GIT forces you to add a comment every time you want to back up the status of your project; if you write meaningful comments, it will be easier to track changes.
  • GIT was born for working with other collaborators; however, the two people you collaborate with the most are your past self and your future self.

Why use GIT?

Some motivating scenarios

  1. You are a programmer: most likely you are working in a team and/or releasing software for public use. Then you need git to keep colleagues and users updated on changes and new features.
  2. You want to become a programmer: there is no better CV than showing that you can actually write code and solve problems. Companies look at your github profile to see how you code. Start using GIT for all your school homework or class projects and, with no extra effort, you will have a large programming portfolio.
  3. You are not a programmer, you do not want to become a programmer, but you are working on a project: even if you work alone, you are still collaborating with your past self and your future self. GIT keeps the status of your files, your notes and your project, so that you don’t have to rely on file names such as projectReport_final_verylastversion_thisisthefinal.doc.
  4. You are a (data) scientist: it doesn’t matter whether you code or not, you need a lab notebook to keep track of everything that happens throughout your project, from the request for an ethical permit, to piloting, to the final analysis and the writing of the paper. GIT can become your digital lab notebook.

GIT in practice part 1: set up your project

This guide is for Linux/Mac users: it requires the use of the terminal to fully understand the steps involved. Once you have understood what happens under the hood, you can switch to a graphical-user-interface solution if you prefer.

  1. Create an account on github, bitbucket or any other GIT repository service (for my colleagues and students, click here).
    If you deal with sensitive data, make your project private (e.g. by asking your IT team to set up a git repository for you).
  2. Set up SSH keys.
    SSH keys are just a way to avoid using passwords. From experience, using GIT via SSH is i) faster, ii) less prone to server limitations (important when uploading large files) and iii) more secure. Check whether you already have SSH keys on your computer; alternatively, you can generate a public/private SSH key pair (see the example after this step). Follow the steps here: https://help.github.com/articles/generating-an-ssh-key/
    Then upload the public key to the website of the GIT repository, usually under your account settings. Then open the terminal and validate the key by typing something like:
ssh -T git@github.com

For my colleagues and students:

ssh -T git@git.becs.aalto.fi
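
If you do not have keys yet, generating a pair typically looks like this (a sketch; ed25519 is a sensible default and the comment string is just a label):

ssh-keygen -t ed25519 -C "your.email@example.com"
# the public key to upload is then in ~/.ssh/id_ed25519.pub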

 

  • Create a new repository from the web interface.
    Give it a meaningful name (imagine you want to share it with others). Do not use spaces or hyphens. With GITLAB, if you have run out of personal projects, create the project under the group.
  • Set up the local copy of the repository on your machine.
    From the web interface, copy the SSH string for your project (something like git@github.com:username/projectname.git); open a terminal, go to the folder where you want your project to live, and type:

 

mkdir projectname
cd projectname
git init
git remote add origin git@github.com:username/projectname.git

 

  • Create a file called README.md and store it in the projectname folder. This is your main project file. I use it to let others know what is inside the project and where to look for references. I also use it to write down experimental choices, parameters used in the analysis, and to-do lists. The extension .md stands for “markdown”, a formatting language to set things like headers, lists, etc. The file README.md is displayed by default on your project’s GIT page, so it is nice to add some formatting. Here is a simple template to copy-paste:
    # Project main title
    *A small one-line subtitle*
    
    Here a link with a picture. The picture is stored in a subfolder called figures.
    [
    ![NAME FOR THE LINK](figures/demo_figure.png)
    ](http://the-actual-link-whatever-it-is.com)
    
    Code released under [MIT License](https://en.wikipedia.org/wiki/MIT_License) (see LICENSE file).
    
    ## WHAT IS IT
    Describe what is it about
    
    ## WHAT IS WHERE
    Describe organization of subfolders and files
    
    ## HOW TO INSTALL
    A manual, if it's a software.
    
    ## HOW TO CITE
    A link to the publication
    
    ### 15/07/2015 TODO
    A simple todo list for your future self.
    
  • Add the file to the repository.
    Tell GIT that we want to keep track of this file. This is done only once per file with command:

 

git add README.md
  • Commit the file to the repository.
    Tell GIT that we want to permanently store the current version of the file, with command:

    git commit -m "A meaningful message that explains what you did" README.md
  • Synchronize your local copy of the repository with the remote one
    This is where the file is uploaded to the remote repository and backed up. Use command

    git push -u origin master

    Sometimes it’s enough to just type

    git push
  • Check that the web repository has the new file
    Hurray! You have a fully functioning repository

GIT in practice part 2: working on an existing repository

You are all set-up and just want to start adding files, editing them, keeping track of them.

  1. Before doing anything, make sure your local repository is up to date with the remote one.
    Open the terminal and go to the local GIT folder for the project. To get a simple summary of the local status, just type:
git status

To see if there are any differences between local and remote, type:

git remote update

and/or

git diff origin/master

Update local version with the remote one:

git pull
  • Add a new file. Create a new file or copy an existing file into a folder (or subfolder) of your local GIT repository and tell GIT that you want to track this file (just once per file):
    git add subfolder/name_of_file.txt
  • Commit file. Tell remote GIT repository to store the current version of the file.
    git commit -m "useful message here" subfolder/name_of_file.txt
  • Push the file to the remote repository
    git push
  • Editing. When editing an existing file, you just need to run git commit and git push. Before committing, you can always check the latest changes to the local file by running:
    git diff name_of_file.txt
  • Delete. Remove a file with
    git rm filename
    git commit -m "deleting because..." filename
    git push
  • Re-start from scratch or get somebody else’s code
    Sometimes your local configuration gets messy, or you change machine and need to start from where you left off. Go to an empty folder and just type

    git clone git@github.com:username/projectname

    This will create a subfolder called projectname and download all remote files into it. The same command can be used to download someone else’s git repository locally. In the latter case, however, you might not have permission to push changes if the repository owner does not give you write access. For these cases you can also “fork” somebody else’s project: just visit an existing project and you will see the fork button.

GIT troubleshooting

In general, google the error you get: there is always an answer. If the local configuration gets messy, it is easier to start from scratch with git clone (see above). Sometimes the remote repository has been modified (e.g. via the web interface) while the local repository also has changes; this can happen especially when collaborating with others. When both repositories have changes, git will try to merge them: when you run git push locally, the command will complain that it needs to do a merge first. You can resolve these situations as follows:

  1. Run
    git merge origin/master
  2. Open the conflicting files. Conflicts will look like:
    <<<<<<< HEAD
    code on your version that is not in the remote version
    =======
    code on the remote version that is not in your local version
    >>>>>>> origin/master
  3. Remove the “<<<” “===” “>>>” lines and fix the conflicting changes. You can then git commit and git push again.

GIT dump of some advanced commands

From https://help.github.com/articles/changing-a-remote-s-url/ : sometimes you have an existing local repository and want to back it up to a new remote repository. You first create the new empty remote repository at the new location, and then:

cd existing_repo
git remote set-url origin URLOFNEWREPO
git push -u origin --all
git push -u origin --tags