What rules of thumb would you suggest to a fledgling bioinformatician? Or, if you could travel back in time, what would you tell your younger self?
I thought it would be fun to try and get down some of the things I’ve learnt. Some are practical and some are a bit more philosophical.
If any bioinformaticians are reading, I’d be interested to hear your thoughts!
1) Always “nohup” your analysis runs
Basic, but easy to forget. How many times has a long-running analysis died, usually just before the end, because my laptop went on standby, or the ’net connection went down and killed an SSH session?
Write a script that calls your program and execute it with nohup <script-name> & to run it in the background, so that it ignores the hang-up signal sent when your session closes. If it sucks major CPU time then remember to renice the process, bearing in mind that higher nice values mean lower priority.
2) Name your directories and your files carefully
It’s no fun returning to an analysis some months or even years after you did it to find a directory called ‘stuff’ filled with files like ‘out’, ‘results’ and ‘crunch’.
Name your scripts according to function, e.g. ‘parse-glimmer-output-to-genbank’. Organise your directories into logical hierarchies, and then date-stamp and number each set of results, e.g. ‘amazing-genome-work-23052009-1’.
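As a rough illustration of the naming scheme above, here is a minimal Python sketch that creates the next free date-stamped, numbered results directory. The function name next_results_dir and its arguments are made up for this example; the DDMMYYYY stamp simply mirrors the example name above.

```python
import datetime
import pathlib


def next_results_dir(base, label, today=None):
    """Create and return base/<label>-<DDMMYYYY>-<serial>.

    Hypothetical helper: the serial starts at 1 and increases until an
    unused directory name is found for the given date.
    """
    today = today or datetime.date.today()
    stamp = today.strftime("%d%m%Y")  # e.g. 23052009
    base = pathlib.Path(base)
    serial = 1
    while (base / f"{label}-{stamp}-{serial}").exists():
        serial += 1
    path = base / f"{label}-{stamp}-{serial}"
    path.mkdir(parents=True)
    return path
```

Calling next_results_dir("results", "amazing-genome-work") twice on the same day would give directories ending in -1 and -2, so an earlier run is never overwritten.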
3) Put all your scripts and primary data into Subversion or equivalent source control system
This solves two problems at once: how to keep a copy of all your useful scripts on each server you work with as well as on your personal machine, and how to keep a distributed backup of all your work.
Put primary data sources in Subversion, but exclude your results directory by setting the svn:ignore property on its parent directory, so you don’t store analysis which can be reconstructed easily by your scripts.
Subversion is just an example; you might prefer git or bazaar or some other new-fangled system. It doesn’t matter so much what you use, only that you use it.
4) Write your scripts for the guy in ‘Memento’
You remember the guy who can’t remember anything from more than about five minutes ago? That’s you.
Make your scripts parse the command line properly and output a friendly message when you run them with no arguments. Use a consistent scheme for input and output: either use command-line flags like “--in” and “--out”, or make all your scripts use redirection. Try not to mix different schemes.
Put a README in the directory to remind you of the order you run the scripts, and any environment variables or other gotchas when running.
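For the command-line advice above, a minimal sketch using Python’s standard argparse module might look like the following. The script description and the infile/outfile names are assumptions for illustration; the only parts taken from the text are the --in and --out flags and the friendly message when no arguments are given.

```python
import argparse
import sys


def build_parser():
    parser = argparse.ArgumentParser(
        description="Parse Glimmer output into GenBank format (illustrative only)."
    )
    # 'in' is a Python keyword, so store the value under a different name.
    parser.add_argument("--in", dest="infile", required=True, help="input file")
    parser.add_argument("--out", dest="outfile", required=True, help="output file")
    return parser


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # No arguments at all: show the help text instead of a terse error.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    print(f"Reading {args.infile}, writing {args.outfile}")
    return 0

# In a real script, finish with a call such as sys.exit(main()).
```

Using the same two flags in every script means you never have to re-read the source to work out how to call it.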
5) Be able to work effectively on both your laptop and a server
Install an environment on your laptop similar to your server environment (probably Linux), with enough software that you can develop stand-alone.
That way when you are at the airport or on a plane or train, in a tedious seminar you can’t get out of, or when your Internet connection has gone down, you can still get some work done. Point 3) helps with that.
6) Split scripts by task, and allow check-points
There is nothing more annoying than a script that takes a week to run crapping out in the last few minutes due to a bug and there being no way to restart the analysis except from the beginning. Break down tasks into modules, write a separate script for each part of the analysis, and make it so you can start from any point in your analysis pipeline.
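One simple way to get check-points, sketched here in Python under some assumptions, is to write a marker file after each step finishes and skip any step whose marker already exists. The run_pipeline function, the “.done” marker convention and the start_from argument are all invented for this example, not a standard tool.

```python
import pathlib


def run_pipeline(steps, workdir, start_from=None):
    """Run (name, function) steps in order, skipping finished ones.

    A step counts as finished when workdir/<name>.done exists.
    Passing start_from forces a rerun from that step onwards.
    Returns the names of the steps that were actually executed.
    """
    workdir = pathlib.Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    names = [name for name, _ in steps]
    if start_from is not None:
        # Clear the markers for the chosen step and everything after it.
        for name in names[names.index(start_from):]:
            (workdir / f"{name}.done").unlink(missing_ok=True)
    executed = []
    for name, func in steps:
        marker = workdir / f"{name}.done"
        if marker.exists():
            continue  # finished in an earlier run
        func(workdir)
        marker.touch()  # only reached if func returned without raising
        executed.append(name)
    return executed
```

If a step raises an exception, its marker is never written, so fixing the bug and running the pipeline again resumes at the step that failed rather than at the beginning.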
7) Beware the “ultimate system” which solves everything
An error that affects newbies and experienced bioinformaticians alike, in my experience. In your mind there is a gleaming, shiny “system” that can solve all known bioinformatics problems. It probably has an elaborate web interface, RPC bindings, a fully normalised RDBMS and probably a full Turing AI as well. This system will never exist. Don’t start building it. Resist.
Instead build something that does something useful NOW, document it, release it and move on. If the thing you built is really useful to someone else you will start getting emails. Then you can start extending it, if necessary!
Like all advice, it is easier to dish out than to take, particularly the last one!