Linux for (data) scientists – my collection of Linux FUC = frequently used commands

This is a dump of Linux commands that I often use. They should be useful to the average (data) scientist, and a source of inspiration for those who are learning Linux. Do not fear the terminal: it’s the most powerful computer tool for a scientist! I don’t always remember the correct options, so this page also saves me from googling for ad-hoc solutions every time. This page will be updated frequently and the list will grow.

P.S.: A useful resource is also http://www.bashoneliners.com/.

 

  1. List contents of zip file without unzipping
    unzip -l filename.zip
  2. List contents of tar file without untarring
    tar -tf filename.tar.gz
  3. Print the content of a single file inside a zip (to the screen, without unpacking the archive)
    unzip -c filename.zip file1.txt | less
  4. Save a remote file locally (-O), follow redirects (-L), resume the download if the connection breaks (-C -), and limit the download speed (--limit-rate). This is useful for downloading huge datasets in the background.
    curl -L -O -C - --limit-rate 200k http://remotefile
  5. Replace new lines with a symbol (e.g. a semicolon)
    tr '\n' ';' < file.txt
  6. Run a command and store its output in a log file (including standard error) while still seeing it on screen
    ./command.sh 2>&1|tee logfile.log
  7. Get a list of the users running processes on a machine and how many processes each one is running (root is usually the user with the most processes… but not always :)). Note the sed command: it replaces every run of consecutive spaces with a semicolon, i.e. it turns the output of ps into a csv. Then cut extracts column 1, sort + uniq -c counts how many times each user appears, and the final sort -nr orders the result from most to fewest processes.
    ps -fA|sed 's/ \+/;/g'|cut -d\; -f1|sort|uniq -c|sort -nr
  8. This one tests the md5 checksum of downloaded data (it works for the Human Connectome Project). Imagine you have downloaded file1.zip and file1.zip.md5. The second file contains the so-called md5 checksum, a unique fingerprint of your file. You want to test that the file you downloaded gives exactly the fingerprint written in the md5 file. To do this for all the hundreds of files of the HCP, you can run the following (the comparison assumes the md5 file holds exactly the line that md5sum prints, i.e. the hash followed by the file name):
    for n in *.md5; do f="${n%.md5}"; echo -n "$f "; nn=$(md5sum "$f"); nnf=$(cat "$n"); if [ "$nn" = "$nnf" ]; then echo "OK!"; else echo "$nn $nnf"; fi; done 2>&1 | tee hcp_checks.log
  9. Commands to make sure that shared folders/files are accessible by the group. The first one is for folders (the s also sets the setgid bit, so files created inside inherit the folder’s group) and the second is for files.
    find . -type d -exec chmod g+rwxs {} \;
    
    find . -type f -exec chmod g+rw {} \;
  10. Get the Nth line from a text file. NUM is a variable holding the line number: sed deletes (d) every line before it, then prints line NUM and quits (q).
    sed "${NUM}q;d" file
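  11. Get the number of a column of a delimited text file, based on the header (the grepped string must appear in the header line). The example assumes | as the field separator; for a comma-separated file use tr ',' '\n' instead. The first sed prints a line number before each field name and the second joins each number with its name.
    grep header_of_a_column bigfile.csv | tr '|' '\n' | sed '/./=' | sed '/./N; s/\n/ /'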
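
The md5 check in item 8 compares whole lines, so it only reports OK when the .md5 file has exactly the layout that md5sum prints. A slightly more defensive variant could compare just the digests. The following is a minimal sketch, assuming each .md5 file starts with the lowercase hex digest; the function name check_md5_dir and the log file name in the usage line are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: verify every FILE against its FILE.md5 companion in a directory.
# Assumption: each .md5 file starts with the hex digest; anything after the
# first whitespace (such as a file name) is ignored.

check_md5_dir() {
    local dir="${1:-.}" md5file target expected actual
    for md5file in "$dir"/*.md5; do
        [ -e "$md5file" ] || continue      # no .md5 files: the glob stayed literal
        target="${md5file%.md5}"           # strip only the trailing ".md5"
        if [ ! -f "$target" ]; then
            echo "$target MISSING"
            continue
        fi
        # First field of the first line of the .md5 file is the expected digest.
        expected=$(awk 'NR == 1 { print $1 }' "$md5file")
        # md5sum prints "<digest>  <name>"; keep the digest only.
        actual=$(md5sum "$target" | awk '{ print $1 }')
        if [ "$actual" = "$expected" ]; then
            echo "$target OK"
        else
            echo "$target FAILED (expected $expected, got $actual)"
        fi
    done
}

# Usage (hypothetical log name):
#   check_md5_dir . 2>&1 | tee md5_checks.log
```

If the .md5 files already follow the md5sum layout and the file names inside them match, md5sum -c file1.zip.md5 performs the same check without any scripting.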
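
The column lookup in item 11 lists every header field with its position and leaves the matching to the eye. When only the number is needed (for example to feed it to cut -f), an awk version could return it directly. This is a sketch under assumptions: the header is the first line of the file, fields contain no quoted separators, and the function name col_number is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the 1-based position of a column, looked up by exact header name.
# Assumptions: the header is line 1 and no field contains the separator itself.

col_number() {
    # $1 = header name, $2 = file, $3 = single-character separator (default: comma)
    awk -F"${3:-,}" -v name="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++)
                if ($i == name) print i
            exit                      # only the header line is inspected
        }' "$2"
}

# Usage (file and column names are placeholders):
#   n=$(col_number age bigfile.csv)
#   cut -d, -f"$n" bigfile.csv | head
```

Unlike the grep version, this matches the header name exactly rather than as a substring, and it prints nothing when the column does not exist.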