Fasta files: remove header after the gi identifier


#1

I work with FASTA files. After I download them from NCBI, they have a long header like this:

>gi|550003|gb|U14742.1|PRU14742 Papaya ringspot virus isolate Vietnam coat protein mRNA, partial cds
CTGGTCTAAATGAAAAGCTCAAAGAAAAGGAAAAACAGAAAGAAAAAGAAAAAGAAAAAGATAAACAAAA
AGATAAAGATAACGATGGAGCTAGTGACGAAAATGATGTGTCAACTAGCACAAAGACTGGAGAGAGAGAT
AGAGATGTCAACGCCGGAACTAGCGGAACTTTCACTGTTCCAAGGATAAAGTCTTTTACTGATAAGATGA
TTTTACCAAGAATTAAGGGAAAGTCTGTCCTTAATTTGAATCATCTTCTTCAGTATAATCCGCAACAAAT
(The first line, which starts with the > symbol, is treated as a label and is not used in the analysis.)

I would like to shorten the label to >gi|550003 (on the first line)

>gi|550003
CTGGTCTAAATGAAAAGCTCAAAGAAAAGGAAAAACAGAAAGAAAAAGAAAAAGAAAAAGATAAACAAAA
AGATAAAGATAACGATGGAGCTAGTGACGAAAATGATGTGTCAACTAGCACAAAGACTGGAGAGAGAGAT
AGAGATGTCAACGCCGGAACTAGCGGAACTTTCACTGTTCCAAGGATAAAGTCTTTTACTGATAAGATGA
TTTTACCAAGAATTAAGGGAAAGTCTGTCCTTAATTTGAATCATCTTCTTCAGTATAATCCGCAACAAAT

My text file contains some 600 FASTA sequences. How can I do the above in Atom? Right now I select and delete by hand. I will keep a duplicate copy of the file, so I can trace the full label by the gi number when needed.

This feature would help all bioinformaticians. Thanks.


#2

A package could be written to make this a built-in feature. If I wrote it could I get credit as a junior bio-engineer or something? :smile:

I suppose you would also like to jump directly to those headers by typing them in, or at least have a key to hop to the next one. The labels could be highlighted to make them easy to see when you scroll fast.

We could make it a standard file type like a language. Does it have a specific file suffix?


#3

If I wrote it could I get credit as a junior bio-engineer or something?
You will be the Principal bio-engineer.

like to jump directly to those headers by typing them in. or at least a key to hop to the next one.
This is not available in any program so far, but it would be immensely useful.

The labels could be highlighted to make them easy to see when you scroll fast.
This can be done by searching for |gb (but only |gb will be highlighted, not the gi).

We could make it a standard file type like a language. Does it have a specific file suffix?
Sir, these are plain text files (the file names end in *.fasta), and any text editor will open them.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the “>” and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence.
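
The description above can be turned into a small reader. This is only a minimal sketch in Python: the function name `parse_fasta` and the (identifier, description, sequence) layout are my own choices for illustration, and the two sequence lines are shortened from the Papaya ringspot virus example earlier in the thread.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each record in FASTA text."""
    identifier, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new ">" line ends the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            # The first word after ">" is the identifier, the rest is the description.
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        yield identifier, description, "".join(chunks)


example = (
    ">gi|550003|gb|U14742.1|PRU14742 Papaya ringspot virus isolate Vietnam "
    "coat protein mRNA, partial cds\n"
    "CTGGTCTAAATGAAAAGCTCAAAGAAAAGG\n"
    "AGATAAAGATAACGATGGAGCTAGTGACGA\n"
)
records = list(parse_fasta(example))
for ident, desc, seq in records:
    print(ident, len(seq))
```

Note that the whole `gi|...|PRU14742` string counts as the identifier here, because it contains no space.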

The database tags for GenBank (|gb|), EBI (|emb|) and DDBJ in Japan (|dbj|) are also seen in FASTA file headers.
Unfortunately, the number of digits after >gi| varies: >gi|62378| has 5 digits, and others have 6 to even 8.

Sir, thank you for your kindness and for replying.


#4

I’ll do it for the experience. But I’d like to get credit for it whenever possible. :smile:

I’ll work up a simple version this week and then you can collaborate with me on how to polish it up. I’ll contact you via private message when I have questions.


#5

Sir, I will definitely be honoured to assist you, with regards, Dr.D.K.Samuel, Senior Virologist, Indian Institute of Horticultural Research, Bangalore, India


#6

Respected Sir, thanks for your idea of using regular expressions. Your code works very well. Thank you so much, you are a good wizard. I tried it on many FASTA files, both nucleotide and protein.

^>gi\| works fine
^(>gi\|\d+).* works fine. I just made a small modification to highlight gi numbers of 4 to 12 digits: ^(>gi\|)\d{4,12}
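
For anyone who prefers to do the same shortening outside the editor, here is a minimal sketch in Python. It uses the pattern above with the pipe escaped as `\|` (an unescaped `|` means "or" in a regular expression) and keeps only the captured `>gi|digits` part. The name `shorten_headers` is made up for this example, and the sequence line is shortened from the first post.

```python
import re

# Match a whole header line; group 1 is ">gi|" followed by the gi number.
# re.MULTILINE makes ^ and $ apply to every line, not just the whole text.
GI_HEADER = re.compile(r"^(>gi\|\d+).*$", re.MULTILINE)


def shorten_headers(fasta_text):
    """Replace each gi header line with just its '>gi|<number>' prefix."""
    return GI_HEADER.sub(r"\1", fasta_text)


sample = (
    ">gi|550003|gb|U14742.1|PRU14742 Papaya ringspot virus isolate Vietnam "
    "coat protein mRNA, partial cds\n"
    "CTGGTCTAAATGAAAAGCTCAAAGAAAAGG\n"
)
shortened = shorten_headers(sample)
print(shortened)
```

Sequence lines never start with `>gi|`, so they pass through unchanged. Because `\d+` accepts any number of digits, the varying length of the gi number does not matter.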

Sir, I have one request: I want the text matched by ^(>gi\|)\d{4,12}$ highlighted in color. I did this in PSPad, but I am not able to do it in Atom. Thank you so much for your kind help; it saved me a lot of cutting and deleting. Samuel


#7

Text highlighting in color would require a package to be written. But you can outline all of them at once by using the find/replace. Just search for ^>gi\|\d+$ and they will appear boxed while you scroll through the file.
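
The same search can also be run as a script, for example to list where the shortened headers sit in a file. This is just a sketch in Python; `header_line_numbers` is a made-up name and the sample text is a toy input built from gi numbers mentioned in this thread.

```python
import re

# Same pattern as the search above: a line that is only ">gi|" plus digits.
SHORT_HEADER = re.compile(r"^>gi\|\d+$")


def header_line_numbers(fasta_text):
    """Return the 1-based line numbers of already-shortened gi headers."""
    return [
        number
        for number, line in enumerate(fasta_text.splitlines(), start=1)
        if SHORT_HEADER.match(line)
    ]


sample = ">gi|550003\nCTGGTCTAAA\n>gi|62378\nAGATAAAGAT\n"
lines = header_line_numbers(sample)
print(lines)
```

Because of the `$`, a header that still carries its long description is not matched, so this also works as a quick check that every label was shortened.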


#8

Yes Sir, I saw the box. Sir, would any color add-on for Atom be suitable? Thanks, Samuel


#9

I find this thread strangely fascinating.