Gene accumulation curves in R


Needing to draw a rarefaction curve to demonstrate gene accumulation when sequencing increasing numbers of bacterial genomes I decided it was time to get stuck into learning a bit of R. Luckily, as it turned out there is a useful R library called vegan, designed for vegetation ecologists, which does much of the painful work. The included function specaccum is designed for a slightly different application – figuring out the diversity of species when sampling in the field (remember using quadrats back at school?). But if we think proteins instead of pretty flowers and genomes instead of quadrats it works just the same.

Figuring out how you get your data ready to be used in R took a few hours, so I thought I’d post the step-by-step in case anyone else might benefit from this information in future (or can suggest improvements). Disclaimer: I did this for the first time today, so there might be problems with it. If you find any, please post in the comments section!

Before you can draw your curve, you need to compute your orthologues. I did this with OrthoMCL which is pretty user-friendly and has been shown to yield good results. I decided on a 70% minimum length match and a BLAST e-value cut-off less than 1e-5. I discard proteins less than 50 nucleotides in length as these are likely to be spurious and caused by misannotation. Following the OrthoMCL instructions a few hours later I ended up with a file of COGs (clusters of orthologous groups) in a file called groups.txt. Now, to business.

First one must convert the groups.txt file into a table file suitable for loading into R. I decided to create a matrix with binary values (1 or 0) depending on the presence or absence of a particular COG in a genome. Because the groups.txt will only list clusters containing 2 or more protein sequences, I had to add some code to my script to also output the singletons which make up the species pan-genome.

This Python script was what I ended up with:

import sys

class Cluster:
    def __init__(self, name): = name
        self.members = {}

    def add(self, bug, member):
        self.members[bug] = member

all_proteins = {}
clusters = []
bugs = {}

for ln in open(sys.argv[1]):
    if ln.startswith('>'):
        protein = ln[1:].rstrip()
        all_proteins[protein] = False

n = 0
for ln in open(sys.argv[2]):
    prefix, proteins = ln.rstrip().split(':')
    proteins = proteins.split()

    c = Cluster(prefix)

    for p in proteins:
        bug, id = p.split("|")
        c.add(bug, id)
        bugs[bug] = bug

        all_proteins[p] = True


# add the singletones
i = 1
for p, already_counted in all_proteins.iteritems():
    if already_counted == False:
        c = Cluster("single%d" % (i))
        bug, id = p.split("|")
        c.add(bug, id)
        i += 1

first_cluster = clusters[0]
print " ".join(['"%s"' % ( for c in clusters])

i = 0
for b in sorted(bugs):
    print '"%s"' % (b),

    for c in clusters:
        print " ",
        if b in c.members:
            print "1",
            print "0",

    i += 1

This script is called with the path to the OrthoMCL output files goodProteins.fasta as the first argument and groups.txt file as the second argument. Redirect the output to a file which is suitable for loading into R. This should give you a file in the format:

"cog1" "cog2" "cog3" "cog4" .. "single1" "single2" ..
genome1  1 0 1 1 1 0 0
genome2  1 0 1 1 0 0 1
genome3  1 1 0 1 1 1 1

This file is now suitable to loading into R with the read.table function.

OK, on to the curve. Firstly you must install the vegan library, which gives access to the specaccum function. I put my table file into c:workspaceR and called it pangenome.table.

The following R code then produces a rarefaction curve with an overlying box-plot showing the error bars and standard deviation. We run specaccum twice

mydata <- read.table(‘pangenome.table’)
sp <- specaccum(mydata, “random”, permutations=”100″)
plot(sp, ci.type=”poly”, col=”blue”, lwd=2, ci.lty=0, ci.col=”lightblue”, xlab=”Genomes”, ylab=”Proteins”, main=”Gene accumulation plot”)
boxplot(sp, col=”yellow”, add=TRUE)

The second call to specaccum picks the order genomes are added randomly 100 times, to estimate the standard deviation and error rates.

The final output looks like this:


How exciting! It is left as an exercise for the reader to work out how to do a core-genome plot!

Loaf Club


Hannah makes a lovely loaf most weeks but it is still a bit of an effort to make bread at home. So I was excited to see that the appropriately named Tom Baker has set up a community bakery down the road from us in Cotteridge. The scheme is simple – you pay £11/month and each Friday evening he supplies either a sourdough or a rye loaf. This is perfect for us as we can have our after-work pint and then pop down and get it. He’s running this as a non-profit scheme and I really like the idea. I think he bakes it on a clay oven he built in his garden. This is exactly the kind of thing I’d love to do if I had time!