Handy little PERL scripts for processing GenBank formatted files
Genbank entries generally look something like this:
LOCUS LCA_HUMAN 142 aa PRI 15-DEC-1999
DEFINITION ALPHA-LACTALBUMIN PRECURSOR (LACTOSE SYNTHASE B PROTEIN).
ACCESSION P00709
PID g126001
VERSION P00709 GI:126001
DBSOURCE swissprot: locus LCA_HUMAN, accession P00709;
class: standard.
created: Jul 21, 1986.
sequence updated: Jul 21, 1986.
annotation updated: Dec 15, 1999.
xrefs: gi: 186830, gi: 307104, gi: 34212, gi: 296662, gi: 1335200,
gi: 67416, gi: 86817, gi: 999813, gi: 4929847
xrefs (non-sequence databases): MIM 149750, PROSITE PS00128, PFAM
PF00062
KEYWORDS Milk; Lactose; Glycoprotein; Calcium-binding; Signal; 3D-structure.
.... (goes on for pages)...
BASE COUNT 27054 a 13837 c 14525 g 26268 t
ORIGIN
1 aagcttcttg taattataat ttattaaaaa gcgttagtct tagtcttgac ataatgcgaa
61 ggtatcaaat gttttaaatg catttctaaa gaaatgtggc actatagttt atgcatgatg
121 tatataatca aatcaaaaga atataaattg tcctggtttt atggcttaat tggtcatcac
181 caataaagtg gccgcaaccg cattatccag tgcacattgc tataaagaga agattttatt
241 aaagatccaa acgaaaatat aataagttga tttagaatat atatagtttt atacccagtg
301 gaaggagacc ataagacttt ttgtatatca ttatctacta tgaagtttaa ttaaagagta
...(continues on for thousands of lines)...
For extracting all “keywords” (Go terms) from Genbank file, in list format:
#!/usr/bin/perl
use strict;
use warnings;
my $l;while ($_ = ) {
chomp $_;
$l.=$_;
}while ($l =~ /KEYWORDS\s{4}([a-z|A-Z|0-9].*?)\./g){
my $match = $1;
$match =~s/;\s*/;/g;
$match = lc $match;
my @keywords = split “;”, $match;
print join “\n”, @keywords;
print “\n”;
}
Extracting and sorting only unique Go terms:
#!/usr/bin/perl
use strict;
use warnings;
my $l;
my @allkeywords;while ($_ = ) {
chomp $_;
$l.=$_;
}while ($l =~ /KEYWORDS\s{4}([a-z|A-Z|0-9].*?)\./g){
my $match = $1;
$match =~s/;\s*/;/g;
$match = lc $match;
my @keywords = split “;”, $match;
push @allkeywords, @keywords;
}my %hash = map {$_,1} @allkeywords;
my @single = keys %hash;
my @sorted = sort {$a cmp $b} @single;
print join “\n”, @sorted;
print “\n”;
Generating simple gene model with exon-intron boundaries and gene id (FASTA format):
#!/usr/bin/perl
use strict;
use warnings;
my $l;while ($_ = ) {
chomp $_;
$l.=$_;
}
my $DNA;
my $m = $l;
my $n = $l;while ($l =~ /ORIGIN(.*?)\/\//g){
$DNA = $1;
$DNA =~s/\s//g;
$DNA =~s/[0-9]//g;
}#open(OUTFILE,”>job3output.txt”);
while ($m =~ /CDS\s+join\((.*?)\)\s+\/gene=\”(.*?)\”/g){
my $COOR = $1;
my $GeneID = $2;$COOR =~s/\s//g;
print “>”,$GeneID;
#print OUTFILE “>”,$GeneID;
my $CR1;
my $CR2;
my $exons=0;
while ($COOR =~ /(\d+)\.\.(\d+)/g){
$CR1 = $1.”:”.$1.”-”.$2.”-”;
$CR2 .= $CR1;
chomp $CR2;
$exons=$exons+1;
}$CR2 = $CR2.”e”;
my @CRarray = split “:”, $CR2;print ” (number of exons: “,$exons,”)”,”\n”;
#print OUTFILE ” (number of exons: “,$exons,”)”,”\n”;my @final;
for (@CRarray) {
my $printA = ”;
my $printB = ”;if ($_ =~ /(\d+)-(\d+)-(\d+)/){
my $print1 = uc substr ($DNA,$1-1,$2-$1+1);
my $print2 = substr ($DNA,$2,$3-$2-1);
$printA = $print1.$print2;
push @final, $printA;
}if ($_ =~ /(\d+)-(\d+)-e/){
my $lastexonlength = $2-$1+1;
$printB = uc substr ($DNA,$1-1,$lastexonlength);
push @final, $printB;
}}
my $finalsequence=”;
for (@final) {
chomp $_;
$finalsequence.=$_;
}print $finalsequence,”\n”;
#print OUTFILE $finalsequence,”\n”;}
while ($n =~ /CDS\s+complement\((.*?)\)\s+\/gene=\”(.*?)\”/g){
my $COOR = $1;
my $GeneID = $2;$COOR =~s/\s//g;
print “>”,$GeneID;
#print OUTFILE “>”,$GeneID;my $CR1;
my $CR2;
my $exons=0;
while ($COOR =~ /(\d+)\.\.(\d+)/g){
$CR1 = $1.”:”.$1.”-”.$2.”-”;
$CR2 .= $CR1;
chomp $CR2;
$exons=$exons+1;
}$CR2 = $CR2.”e”;
my @CRarray = split “:”, $CR2;print ” (RC, number of exons: “,$exons,”)”,”\n”;
#print OUTFILE ” (RC, number of exons: “,$exons,”)”,”\n”;my @final;
for (@CRarray) {
my $printA = ”;
my $printB = ”;if ($_ =~ /(\d+)-(\d+)-(\d+)/){
my $print1 = uc substr ($DNA,$1-1,$2-$1+1);
my $print2 = substr ($DNA,$2,$3-$2-1);
$printA = $print1.$print2;
push @final, $printA;
}if ($_ =~ /(\d+)-(\d+)-e/){
my $lastexonlength = $2-$1+1;
$printB = uc substr ($DNA,$1-1,$lastexonlength);
push @final, $printB;
}}
my $finalsequence=”;
for (@final) {
chomp $_;
$finalsequence.=$_;
}my $finalsequence2 = reverse $finalsequence;
$finalsequence2 =~ tr/ACGTacgt/TGCAtgca/;
print $finalsequence2,”\n”;#print OUTFILE $finalsequence2,”\n”;
}
Calculating codon-usage statistics (exon-only):
#!/usr/bin/perl
use strict;
use warnings;
my $l;while ($_ = ) {
chomp $_;
$l.=$_;
}
my $DNA;
my $m = $l;
my $n = $l;while ($l =~ /ORIGIN(.*?)\/\//g){ #finds and stores entire DNA string in $DNA
$DNA = $1;
$DNA =~s/\s//g;
$DNA =~s/[0-9]//g;
}my $allforwardexons = ”;
while ($m =~ /CDS\s+join\((.*?)\)\s+\/gene=\”(.*?)\”/g){
my $COOR = $1;
my $GeneID = $2;
$COOR =~s/\s//g;my $CR1;
my $CR2;
while ($COOR =~ /(\d+)\.\.(\d+)/g){
$CR1 = $1.”:”.$2.”;”;
$CR2 .= $CR1;
chomp $CR2;
}
chop $CR2;my @CRarray = split “;”, $CR2;
my @final;
for (@CRarray) {
my $printA = ”;
my $printB = ”;
if ($_ =~ /(\d+):(\d+)/){
my $exon = uc substr ($DNA,$1-1,$2-$1+1);
push @final, $exon;
}
}my $forwardexons=”;
for (@final) {
chomp $_;
$forwardexons.=$_;
}$allforwardexons= $allforwardexons.$forwardexons;
}my $allreverseexons = ”;
while ($n =~ /CDS\s+complement\((.*?)\)\s+\/gene=\”(.*?)\”/g){
my $COOR = $1;
my $GeneID = $2;
$COOR =~s/\s//g;my $CR1;
my $CR2;
while ($COOR =~ /(\d+)\.\.(\d+)/g){
$CR1 = $1.”:”.$2.”;”;
$CR2 .= $CR1;
chomp $CR2;
}
chop $CR2;my @CRarray = split “;”, $CR2;
my @final;
for (@CRarray) {
my $printA = ”;
my $printB = ”;
if ($_ =~ /(\d+):(\d+)/){
my $exon = uc substr ($DNA,$1-1,$2-$1+1);
$exon = reverse $exon;
$exon=~ tr/ACGT/TGCA/;
push @final, $exon;
}
}
@final = reverse @final;
my $reverseexons=”;
for (@final) {
chomp $_;
$reverseexons.=$_;
}$allreverseexons= $allreverseexons.$reverseexons;
}my $ALLEXONS = $allforwardexons.$allreverseexons;
#print $ALLEXONS, “\n”;my @DNA_codons;
while ($ALLEXONS =~ /([ATCG]{3})/g){
push @DNA_codons, $1;
}my %DNA_codon_counters;
for (@DNA_codons) {
$DNA_codon_counters{$_}++;
}my @printarray;
while (my ($key, $value) = each %DNA_codon_counters) {
my $output = “$key ===> $value”;
push @printarray, $output;
}my @sorted = sort {$a cmp $b} @printarray;
print join “\n”, @sorted;
print “\n”;
About this entry
You’re currently reading “Handy little PERL scripts for processing GenBank formatted files,” an entry on Ramblings & Musings
- Published:
- September 20, 2007 / 6:53 pm
- Category:
- Science.
- Tags:
No comments yet
Jump to comment form | comments rss [?] | trackback uri [?]