Advice to a young bioinformatician

What rules of thumb would you suggest to a fledgling bioinformatician? Or if you could travel back in time, what would you tell your younger self?

I thought it would be fun to try to get down some of the things I’ve learnt. Some are practical and some are a bit more philosophical.

If any bioinformaticians are reading, I’d be interested to hear your thoughts!

1) Always “nohup” your analysis runs

Basic, but easy to forget. How many times has a long-running analysis died, usually just before the end, because my laptop went on standby or the ’net connection went down, killing an SSH session?

Write a script that calls your program and execute it with nohup <script-name> & to run it in the background. By default nohup appends the output to a file called nohup.out. If it sucks major CPU time then renice the process, remembering that higher nice values mean lower priority.
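If the script itself is written in Python, it can also protect itself in roughly the way nohup and renice would. This is only a minimal sketch, assuming a Unix-like system; the function name detach_and_be_nice and the default increment of 10 are illustrative choices, not part of any standard tool.

```python
import os
import signal


def detach_and_be_nice(increment: int = 10) -> int:
    """Ignore hangups and lower this process's priority (Unix only).

    Returns the new nice value; higher values mean lower priority.
    """
    # Ignoring SIGHUP is essentially what nohup arranges before running a command.
    signal.signal(signal.SIGHUP, signal.SIG_IGN)
    # os.nice() adds the increment to the current nice value and returns the result.
    return os.nice(increment)


if __name__ == "__main__":
    niceness = detach_and_be_nice()
    print(f"Ignoring SIGHUP, running at nice value {niceness}")
    # ... the long-running analysis would go here ...
```

Unlike nohup, this does not redirect anything, so send stdout and stderr to a log file when you launch the script.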

2) Name your directories and your files carefully

It’s no fun returning to an analysis some months or even years after you did it to find a directory called ‘stuff’ filled with files like ‘out’, ‘results’ and ‘crunch’.

Name your scripts according to function, e.g. ‘parse-glimmer-output-to-genbank’. Organise your directories into logical hierarchies, and then date-stamp and number each set of results, e.g. ‘amazing-genome-work-23052009-1’.
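The date-stamp-and-serial scheme is easy to automate. Here is a minimal sketch, assuming the DDMMYYYY stamp used in the example name above; the function name next_results_dir and its parameters are made up for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional


def next_results_dir(parent: Path, label: str, today: Optional[date] = None) -> Path:
    """Create and return parent/<label>-<DDMMYYYY>-<serial>, using the first free serial."""
    stamp = (today or date.today()).strftime("%d%m%Y")
    serial = 1
    while True:
        candidate = parent / f"{label}-{stamp}-{serial}"
        try:
            # mkdir raises if the directory already exists, so old results are never reused.
            candidate.mkdir(parents=True)
            return candidate
        except FileExistsError:
            serial += 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = next_results_dir(Path(tmp), "amazing-genome-work", date(2009, 5, 23))
        second = next_results_dir(Path(tmp), "amazing-genome-work", date(2009, 5, 23))
        print(first.name)   # amazing-genome-work-23052009-1
        print(second.name)  # amazing-genome-work-23052009-2
```

Swapping the format string for "%Y%m%d" would make directory listings sort chronologically, if you prefer that.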

3) Put all your scripts and primary data into Subversion or an equivalent source control system

This solves two problems at once: how to keep a copy of all your useful scripts on each server you work with and on your personal machine, and how to keep a distributed backup of all your work.

Put primary data sources in Subversion, but exclude your results directory by listing it in the ‘svn:ignore’ property of its parent directory (the counterpart of git’s ‘.gitignore’ file), so you don’t store analysis which can easily be reconstructed by your scripts.

Subversion is just an example; you might prefer git or bazaar or some other new-fangled system. It doesn’t matter so much what you use, but that you use it.

4) Write your scripts for the guy in ‘Memento’

You remember the guy who can’t remember anything older than about 5 minutes ago? That’s you.

Make your scripts parse the command line properly and output a friendly message when you run them with no arguments. Use a consistent scheme for input and output, either using command-line flags like “--in” and “--out”, or make all your scripts use redirection. Try not to mix different schemes.

Put a README in the directory to remind you of the order in which you run the scripts, and any environment variables or other gotchas when running.
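As one way of doing the command-line part, here is a minimal sketch using Python’s standard argparse module. The flag names come from the text above; the function names and the placeholder behaviour (it only reports what it would do) are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Example script with consistent --in/--out flags."
    )
    # "in" is a Python keyword, so store the values under explicit attribute names.
    parser.add_argument("--in", dest="in_path", required=True, help="input file")
    parser.add_argument("--out", dest="out_path", required=True, help="output file")
    return parser


def main(argv: list) -> int:
    parser = build_parser()
    if not argv:
        # No arguments at all: print the full help text instead of a terse error.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    print(f"Would read {args.in_path} and write {args.out_path}")
    return 0
```

A real script would end by calling sys.exit(main(sys.argv[1:])), so that running it with no arguments prints the usage message and returns a non-zero exit status.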

5) Be able to work effectively on both your laptop and a server

Install an environment on your laptop similar to your server environment (probably Linux), with enough tools that you can develop stand-alone.

That way when you are at the airport or on a plane or train, in a tedious seminar you can’t get out of, or when your Internet connection has gone down, you can still get some work done. Point 3) helps with that.

6) Split scripts by task, and allow check-points

There is nothing more annoying than a script that takes a week to run crapping out in the last few minutes due to a bug and there being no way to restart the analysis except from the beginning. Break down tasks into modules, write a separate script for each part of the analysis, and make it so you can start from any point in your analysis pipeline.
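One simple way to get check-points is to leave a marker file behind after each completed step. The following is a minimal sketch of that idea, not a full workflow system; run_pipeline, the ‘.done’ marker convention and the step names in the demo are all made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

# A step is a name plus a function that does its work inside the working directory.
Step = Tuple[str, Callable[[Path], None]]


def run_pipeline(steps: List[Step], workdir: Path) -> List[str]:
    """Run each step in order, skipping any whose checkpoint marker exists.

    Returns the names of the steps that were actually executed.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, func in steps:
        marker = workdir / f"{name}.done"
        if marker.exists():
            continue  # finished in an earlier run
        func(workdir)
        # Write the marker only after the step succeeds, so a crash leaves it unmarked.
        marker.touch()
        executed.append(name)
    return executed


if __name__ == "__main__":
    def predict_genes(work: Path) -> None:
        (work / "genes.txt").write_text("gene1\ngene2\n")

    def count_genes(work: Path) -> None:
        n = len((work / "genes.txt").read_text().splitlines())
        (work / "count.txt").write_text(f"{n}\n")

    demo_steps = [("predict-genes", predict_genes), ("count-genes", count_genes)]
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(demo_steps, Path(tmp)))  # both steps run
        print(run_pipeline(demo_steps, Path(tmp)))  # nothing left to do
```

To restart from a given point, delete the marker for that step and for every step after it, then run the pipeline again.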

7) Beware the “ultimate system” which solves everything

An error that affects newbies and experienced bioinformaticians alike, in my experience. In your mind there is a gleaming, shiny “system” that can solve all known bioinformatics problems. It probably has an elaborate web interface, RPC bindings, a fully normalised RDBMS and probably a full Turing AI as well. This system will never exist. Don’t start building it. Resist.

Instead build something that does something useful NOW, document it, release it and move on. If the thing you built is really useful to someone else you will start getting emails. Then you can start extending it, if necessary!

Like all advice, it is easier to dish out than to take, particularly the last one!

4 thoughts on “Advice to a young bioinformatician”

  1. …. I made a perfect system for ecological data modelling. It was even edible in case of environmental disasters.

    There’s a really tiny wee smiley right at the bottom of the ‘#comments’ page

  2. Number 7 makes me think of a chapter in the book (by a guy called Brooks) called ‘The Mythical Man-Month’. Get hold of a copy if you can. My last copy went walkies, so I can’t lend it to you.
