Needing to draw a rarefaction curve to demonstrate gene accumulation when sequencing increasing numbers of bacterial genomes I decided it was time to get stuck into learning a bit of R. Luckily, as it turned out there is a useful R library called vegan, designed for vegetation ecologists, which does much of the painful work. The included function specaccum is designed for a slightly different application – figuring out the diversity of species when sampling in the field (remember using quadrats back at school?). But if we think proteins instead of pretty flowers and genomes instead of quadrats it works just the same.
Figuring out how you get your data ready to be used in R took a few hours, so I thought I’d post the step-by-step in case anyone else might benefit from this information in future (or can suggest improvements). Disclaimer: I did this for the first time today, so there might be problems with it. If you find any, please post in the comments section!
Before you can draw your curve, you need to compute your orthologues. I did this with OrthoMCL which is pretty user-friendly and has been shown to yield good results. I decided on a 70% minimum length match and a BLAST e-value cut-off less than 1e-5. I discard proteins less than 50 nucleotides in length as these are likely to be spurious and caused by misannotation. Following the OrthoMCL instructions a few hours later I ended up with a file of COGs (clusters of orthologous groups) in a file called groups.txt. Now, to business.
First one must convert the groups.txt file into a table file suitable for loading into R. I decided to create a matrix with binary values (1 or 0) depending on the presence or absence of a particular COG in a genome. Because the groups.txt will only list clusters containing 2 or more protein sequences, I had to add some code to my script to also output the singletons which make up the species pan-genome.
This Python script was what I ended up with:
import sys class Cluster: def __init__(self, name): self.name = name self.members = {} def add(self, bug, member): self.members[bug] = member all_proteins = {} clusters = [] bugs = {} for ln in open(sys.argv[1]): if ln.startswith('>'): protein = ln[1:].rstrip() all_proteins[protein] = False n = 0 for ln in open(sys.argv[2]): prefix, proteins = ln.rstrip().split(':') proteins = proteins.split() c = Cluster(prefix) for p in proteins: bug, id = p.split("|") c.add(bug, id) bugs[bug] = bug all_proteins[p] = True clusters.append(c) # add the singletones i = 1 for p, already_counted in all_proteins.iteritems(): if already_counted == False: c = Cluster("single%d" % (i)) bug, id = p.split("|") c.add(bug, id) i += 1 clusters.append(c) first_cluster = clusters[0] print " ".join(['"%s"' % (c.name) for c in clusters]) i = 0 for b in sorted(bugs): print '"%s"' % (b), for c in clusters: print " ", if b in c.members: print "1", else: print "0", print i += 1
This script is called with the path to the OrthoMCL output files goodProteins.fasta as the first argument and groups.txt file as the second argument. Redirect the output to a file which is suitable for loading into R. This should give you a file in the format:
"cog1" "cog2" "cog3" "cog4" .. "single1" "single2" ..
genome1 1 0 1 1 1 0 0genome2 1 0 1 1 0 0 1genome3 1 1 0 1 1 1 1...
This file is now suitable to loading into R with the read.table function.
OK, on to the curve. Firstly you must install the vegan library, which gives access to the specaccum function. I put my table file into c:workspaceR and called it pangenome.table.
The following R code then produces a rarefaction curve with an overlying box-plot showing the error bars and standard deviation. We run specaccum twice
library(vegan)
setwd(‘c:/workspace/R’)
mydata <- read.table(‘pangenome.table’)
sp <- specaccum(mydata, “random”, permutations=”100″)
summary(sp)
plot(sp, ci.type=”poly”, col=”blue”, lwd=2, ci.lty=0, ci.col=”lightblue”, xlab=”Genomes”, ylab=”Proteins”, main=”Gene accumulation plot”)
boxplot(sp, col=”yellow”, add=TRUE)
The second call to specaccum picks the order genomes are added randomly 100 times, to estimate the standard deviation and error rates.
The final output looks like this:
How exciting! It is left as an exercise for the reader to work out how to do a core-genome plot!
nice job! but for the less gifted among us: how do you do a core-genome plot? and since we are at it already: what about the pan-genome?
i am really kind of stuck with this and don’t want to invent the wheel all over…
Nice work.. but I have the same question as Carsten.. How do you do core genome plot from the 0-1 matrix file???
This is awesome. I managed to change the plots to ggplots2. However, I also have the same question as the other two fellas from 2010 and 2013. I need a little hint on doing the core genome plot from your matrix. I generated a VennDiagram and the totals do not sum up to the genomes. I think something is wrong with your python script.