Need help with regex find and replace


#1

Hello,

Can someone please help me write a find-and-replace regex for the following? (Please note that I have no background in coding or programming, and so far I’ve been changing the names in my files manually.)

I need a regular expression in order to automatically change the string:

gi|323508378|emb|CBQ68249.1| related to carbon source-regulated protein (putative arabinase) [Sporisorium reilianum SRZ2]

to the string:

CBQ68249.1_Sporisorium reilianum

I have about 4000 of these and I can’t do them all manually. Thanks in advance.


#2

Search string …

^(.*\|){3}(.*)\|.*\[(\S+\s\S+).*\]$

Replace string …

$2_$3

Edit: If you ever decide to learn regexes, I highly recommend http://regviz.org/ for that. Of course, the regex above looks ridiculous, but when you break it down into parts it isn’t so bad.

Edit2: I simplified the regex a bit.

Edit3: I simplified the regex a bit more.
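
For anyone who wants to see that breakdown, here is a minimal sketch in Python. Python is my assumption here, since the thread doesn't say which editor or tool is being used, and the names HEADER and rename are made up for illustration. The pattern is the search string above, unchanged, just spread over several lines with a comment on each part. Note that Python writes the replacement $2_$3 as \g<2>_\g<3>.

```python
import re

# The search string from this post, split into commented parts.
# re.VERBOSE makes Python ignore the layout whitespace and the # comments.
HEADER = re.compile(r"""
    ^(.*\|){3}      # the first three fields, each ending in a pipe
    (.*)\|          # group 2: the fourth field (the accession) and its pipe
    .*\[            # the description, up to the last opening bracket
    (\S+\s\S+)      # group 3: the first two words inside the brackets
    .*\]$           # whatever else is in the brackets, then the closing one
    """, re.VERBOSE)


def rename(line):
    """Return 'accession_Genus species'; lines that don't match come back unchanged."""
    return HEADER.sub(r"\g<2>_\g<3>", line)


example = ("gi|323508378|emb|CBQ68249.1| related to carbon source-regulated "
           "protein (putative arabinase) [Sporisorium reilianum SRZ2]")
print(rename(example))  # CBQ68249.1_Sporisorium reilianum
```

Group 1 is only there to repeat "some text followed by a pipe" three times, which is why the replacement uses groups 2 and 3 and never group 1.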


#3

This one’s nice as well:

https://www.debuggex.com


#4

I just tried https://www.debuggex.com and I get the railroad diagram but I couldn’t find any place to put the test string. I could only find the place for the regex and the results box.


#5

Yeah, confusingly, the text goes into the box underneath “results”. It’s especially useful for viewing branches and complex grouping.


#6

It worked! Thank you so much!! :smile:
One question: the name of the species is in the square brackets, and I would like to keep only the first 2 words. Some of my files have a different ending, and I don’t know how to fix the string so that it removes the end part. Ex.

gi|347009817|gb|AEO57303.1| glycoside hydrolase family 43 protein [Myceliophthora thermophila ATCC 42464]

I got

AEO57303.1_Myceliophthora thermophila ATCC

instead of

AEO57303.1_Myceliophthora thermophila


#7

Try the edited version above.

Edit: So all species have two-word names?


#8

Thanks for your reply, Mark. The edited version doesn’t work. Yes, all species have two-word names. Here are some more examples of what I have; the bolded portions are what I would like to keep, separated by an underscore:

gi|113649137|dbj|BAF29649.1| Os12g0406100 [Oryza sativa Japonica Group]
gi|28924091|gb|EAA33248.1| predicted protein [Neurospora crassa OR74A]
gi|62952904|gb|AAY23175.1| putative xylosidase/arabinosidase, partial [Penicillium chrysogenum]
gi|392315282|gb|AFM57364.1| beta-xylosidase, partial [Phaeosphaeria avenaria f. sp. tritici 4 MM-2012]
gi|211581717|emb|CAP79831.1| endoarabinanase abnc-Penicillium chrysogenum [Penicillium rubens Wisconsin 54-1255]


#9

That one was untested. I fixed it.
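
If you want to check the pattern against a batch of lines before running it on all 4000, something like the following Python sketch would do it. Python is an assumption on my part (any tool with regex find and replace works the same way), and rename_headers and SAMPLES are hypothetical names. The sample lines are the five examples posted above.

```python
import re

# Search string from earlier in the thread; $2_$3 becomes \g<2>_\g<3> in Python.
PATTERN = re.compile(r"^(.*\|){3}(.*)\|.*\[(\S+\s\S+).*\]$")

SAMPLES = [
    "gi|113649137|dbj|BAF29649.1| Os12g0406100 [Oryza sativa Japonica Group]",
    "gi|28924091|gb|EAA33248.1| predicted protein [Neurospora crassa OR74A]",
    "gi|62952904|gb|AAY23175.1| putative xylosidase/arabinosidase, partial [Penicillium chrysogenum]",
    "gi|392315282|gb|AFM57364.1| beta-xylosidase, partial [Phaeosphaeria avenaria f. sp. tritici 4 MM-2012]",
    "gi|211581717|emb|CAP79831.1| endoarabinanase abnc-Penicillium chrysogenum [Penicillium rubens Wisconsin 54-1255]",
]


def rename_headers(lines):
    """Yield each line rewritten as 'accession_Genus species'.

    Lines that don't match the pattern are passed through unchanged,
    so nothing is silently lost.
    """
    for line in lines:
        yield PATTERN.sub(r"\g<2>_\g<3>", line)


for new_name in rename_headers(SAMPLES):
    print(new_name)
# BAF29649.1_Oryza sativa
# EAA33248.1_Neurospora crassa
# AAY23175.1_Penicillium chrysogenum
# AFM57364.1_Phaeosphaeria avenaria
# CAP79831.1_Penicillium rubens
```

The same function could be fed the lines of a file instead of the SAMPLES list.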


#10

It worked perfectly! Thank you so much Mark, you are the best!!


#11

OK, I must be too drunk or something. Can you tell me where in this I paste the text to be tested against?


#12

After reload, there is a hint where to put the text: “My test data”. But the hint goes away quickly.


#13

OK, so you put the test text in the Result box. Stupid me. I thought the Result box was for results. I’ll try again.