Biopython is awesome! :)

Table of Contents

Concatenate fasta files

I need to merge hundreds of fasta files into one big file by concatenating sequences with the same identifier.
My fasta files look like the following: exampleFiles

Biopython makes it so easy to merge these sequences together.

First, cat all the fasta files:

cat *.fasta > all.fasta

Then we can use a dictionary to map all the sequence names (the code was adapted from here):

concat_seq.py
1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/python

from Bio import SeqIO
from collections import defaultdict

sequence_map = defaultdict(str)

for sequence in SeqIO.parse('all.fasta', "fasta"):
  sequence_map[sequence.name] += str(sequence.seq)

for key in sorted(sequence_map.iterkeys()):
  print '>' + key
  print sequence_map[key]

And we will get:

>Seq1
TTTCTTATT
>Seq2
TCTCCTACT
>Seq3
TATCATAAT

Concatenate alignment files

Similarly, use AlignIO to read in sequence alignment files and concatenate them into a meta-alignment file. For example, these files are in relaxed phylip format (the sequence identifiers can be longer than 10 characters):

# algn1.phy
     3    15
homo_sapiens                 GATAATGCTG ACTAC
ailuropoda_melanoleuca       GATAATGCTG ACTAT
felis_catus                  GATAATGCTG ACTAT
# algn2.phy
     3    15
homo_sapiens                 AGATTATTTC AGAAA
ailuropoda_melanoleuca       AGATTATTTC AGAAA
felis_catus                  AGATTATTTC AGAAA

Python code to concatenate them:

concat_algn.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#!/usr/bin/python

from Bio import AlignIO
import glob

filelist = glob.glob('./*.phy')
i = 0
for file in filelist:
  print file
  alignment = AlignIO.read(file, "phylip-relaxed")
  alignment.sort()  # sort the sequence identifiers so they match in all files
  print("Alignment of length %i" % alignment.get_alignment_length())
  if i==0:
    cat_algn = alignment
  else:
    cat_algn += alignment
  i += 1

print "Concatenated:"
print("Alignment of length %i" % cat_algn.get_alignment_length())

outfh = open("superalgn.phy", "w")
AlignIO.write(cat_algn, outfh, "phylip-relaxed")

outfh.close()
And the output looks like:

 3 30
ailuropoda_melanoleuca  GATAATGCTG ACTATAGATT ATTTCAGAAA 
felis_catus             GATAATGCTG ACTATAGATT ATTTCAGAAA 
homo_sapiens            GATAATGCTG ACTACAGATT ATTTCAGAAA