How to restore the truncated sequence identifiers.

In the previous post, I mentioned that the length of the sequence identifiers in the strict PHYLIP file format cannot exceed 10 characters. Longer sequence names will be truncated:

# strict.phy
     3    15
homo_sapie GATAATGCTG ACTAC
ailuropoda GATAATGCTG ACTAT
felis_catu GATAATGCTG ACTAT

We can probably still figure out that they are human 👶, panda 🐼 and cat 🐱. But it might get confusing when the data contains many more sequences.
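To make the truncation concrete, here is a minimal Perl sketch of how a strict-format writer might cut or pad a name to exactly 10 characters. The helper name `strict_name` is hypothetical and not part of any PHYLIP tool; the sequence fragment is taken from the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: cut or pad a name to the 10-character name field
# used by strict PHYLIP.
sub strict_name {
    my ($name) = @_;
    # %-10.10s: left-justify, pad to 10 characters, keep at most 10.
    return sprintf("%-10.10s", $name);
}

for my $name (qw(homo_sapiens ailuropoda_melanoleuca felis_catus)) {
    print strict_name($name), " GATAATGCTG\n";
}
```

Names shorter than 10 characters are padded with spaces, which is why the name column always has the same width in a strict file.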
I wrote a Perl script that restores the full sequence names and writes the alignment in relaxed PHYLIP format.
It needs a text file containing the full species names, one per line:

# species_list.txt
homo_sapiens
ailuropoda_melanoleuca
felis_catus
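The restoration boils down to a lookup keyed by the first 10 characters of each full name. A minimal sketch of that idea, assuming the names are already in memory; the helper name `build_prefix_map` is made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: map the first 10 characters of each full name
# to the full name itself.
sub build_prefix_map {
    my @full_names = @_;
    my %full_name_for;
    for my $name (@full_names) {
        # substr returns the whole name if it is shorter than 10 characters.
        $full_name_for{ substr($name, 0, 10) } = $name;
    }
    return %full_name_for;
}

my %full_name_for = build_prefix_map(
    qw(homo_sapiens ailuropoda_melanoleuca felis_catus)
);

# Look up the truncated names as they appear in strict.phy.
for my $short (qw(homo_sapie ailuropoda felis_catu)) {
    print "$short -> $full_name_for{$short}\n";
}
```

This only works when every full name has a distinct 10-character prefix, which holds for the three species here.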

The code:

change_phy_species_names.pl
#!/usr/bin/perl
use strict;
use warnings;
# Restore species names in a .phy file from 10 characters to full length.

my ( $phyfile, $speciesfile, $readfh, $line, %Species_length, @array, %Firstten,
   $num_species, $seq_length, $num_blocks, $i, $species, $ten_char, $species_length,
   @sort, $key, $speciesname_length, $space, $num, $j, $outfile, $outfh, $k
);

die "Usage: $0 strict.phy species_list.txt\n" unless @ARGV == 2;
$phyfile = $ARGV[0];
$speciesfile = $ARGV[1];
# The output goes to the current directory, so pass a bare file name.
$outfile = "relaxed_".$phyfile;

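# Read the full species names and index each one by its first 10 characters,
# which is all that the strict PHYLIP file kept.
open $readfh,'<',$speciesfile or die "Cannot open $speciesfile: $!";
while ($line = <$readfh>) {
    $line =~ s/[\r\n]+//g;
    next if $line eq '';
    $Species_length{$line} = length($line);
    $ten_char = substr($line, 0, 10);
    warn "Names '$Firstten{$ten_char}' and '$line' share the prefix '$ten_char'\n"
        if exists $Firstten{$ten_char} && $Firstten{$ten_char} ne $line;
    $Firstten{$ten_char} = $line;
}
close $readfh;

# Find the longest species name, capped at 30 characters.
@sort = sort { $a <=> $b } values %Species_length;
die "No species names found in $speciesfile\n" unless @sort;
if ($sort[-1] < 30) {
    $speciesname_length = $sort[-1];
} else {
    $speciesname_length = 30;
}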
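open $readfh,'<',$phyfile or die "Cannot open $phyfile: $!";
open $outfh,'>',$outfile or die "Cannot open $outfile: $!";

# Header line: number of species and alignment length, copied unchanged.
$line = <$readfh>;
$line =~ s/[\r\n]+//g;
print $outfh qq{$line\n};
# split ' ' ignores the leading whitespace of the header.
($num_species, $seq_length) = split ' ', $line;
# Number of interleaved blocks after the first one, assuming 50 sites per block.
$num_blocks = int (($seq_length - 1) / 50);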

# First block: each line starts with a truncated species name.
for ($i = 0; $i < $num_species; $i++) {
    $space = '';
    $line = <$readfh>;
    $line =~ s/[\r\n]+//g;
    @array = split /\s/, $line;
    $ten_char = shift(@array);
    # Use the full name when it is known; otherwise keep the truncated one.
    $species = exists $Firstten{$ten_char} ? $Firstten{$ten_char} : $ten_char;
    # Pad shorter names so that the sequences line up in one column.
    $num = $speciesname_length - length($species);
    if ($num >= 0) {
        $space = " " x ($num + 2);
    }
    print $outfh qq{$species$space };
    foreach (@array) {
        print $outfh qq{$_ };
    }
    print $outfh qq{\n};
}
    
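# Remaining interleaved blocks: sequence data only, no names.
for ($k = 0; $k < $num_blocks; $k++) {
    $line = <$readfh>;    # skip the blank line between blocks
    print $outfh qq{\n};
    for ($i = 0; $i < $num_species; $i++) {
        $line = <$readfh>;
        $line =~ s/[\r\n]+//g;
        # split ' ' drops any leading indentation, so every field is sequence data.
        @array = split ' ', $line;
        # Indent to the same column as the sequences in the first block.
        print $outfh " " x ($speciesname_length + 3);
        foreach (@array) {
            print $outfh qq{$_ };
        }
        print $outfh qq{\n};
    }
}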
close $readfh;
close $outfh;
Usage: ./change_phy_species_names.pl strict.phy species_list.txt

The output is written to a file named relaxed_ plus the input file name:

# relaxed_strict.phy
     3    15
homo_sapiens                 GATAATGCTG ACTAC
ailuropoda_melanoleuca       GATAATGCTG ACTAT
felis_catus                  GATAATGCTG ACTAT
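
One caveat with this approach: two full names that share the same first 10 characters cannot be told apart after truncation. Here is a small sketch of a check that could be run on the species list beforehand. The helper name `find_ambiguous_prefixes` is hypothetical, and the example names are illustrative and not from the data above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 10-character prefixes shared by more
# than one distinct full name, each mapped to the colliding names.
sub find_ambiguous_prefixes {
    my @full_names = @_;
    my %names_for;
    for my $name (@full_names) {
        $names_for{ substr($name, 0, 10) }{$name} = 1;
    }
    my %ambiguous;
    for my $prefix (keys %names_for) {
        my @names = sort keys %{ $names_for{$prefix} };
        $ambiguous{$prefix} = \@names if @names > 1;
    }
    return %ambiguous;
}

# Illustrative list: the two Drosophila names share the prefix "drosophila".
my %ambiguous = find_ambiguous_prefixes(
    qw(homo_sapiens drosophila_melanogaster drosophila_simulans)
);
for my $prefix (sort keys %ambiguous) {
    print "$prefix: @{ $ambiguous{$prefix} }\n";
}
```

If the check reports anything, the truncated names in the strict file would need to be disambiguated by hand before restoring them.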