Problem with regex for full-line replacement when the file uses Windows line endings


#1

There is a problem (or a misunderstanding on my part) with regex full-line replacement when the file uses Windows-style line endings.

My base text:

Uther
Jaina
Thrall
Arthas

In regex mode I search for an entire line:

^(.*)$ 

And replace it with:

$1 is a famous character named $1

If the file uses Unix line endings (LF), the result is as expected:

Uther is a famous character named Uther
Jaina is a famous character named Jaina
Thrall is a famous character named Thrall
Arthas is a famous character named Arthas

But if the file uses Windows line endings (CR+LF), the result is not what I expect:

Uther is a famous character named Uther is a famous character named 
Jaina is a famous character named Jaina is a famous character named 
Thrall is a famous character named Thrall is a famous character named 
Arthas is a famous character named Arthas

In this case, the replacement text is applied twice on every line except the last.
What is the problem?


#2

I think the best thing to do would be to file an issue on find and replace and then link the issue here.


#3

I created an issue: Find-and-replace issue #468


#4

I think the behaviour is correct, but perhaps unintuitively so.

If you search for ^(.*)$ you ask for zero or more repetitions of any character (except a newline character) between the start and end of a string. With CR+LF line endings there are two newline characters at the end of a line, so there is a zero-length match between the carriage return and the line feed.

To visualize this, consider the following example, where ¤ denotes a carriage return and ¬ a line feed. I use the ^ to indicate the start and end of the matches.

Uther¤¬
^    ^^

Instead, if you search for ^(.+)$ you ask for one or more repetitions, which means there will be no match between the carriage return and the line feed and your replacement will work as expected.
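
Here is a minimal TypeScript sketch of the difference. It assumes Atom's find-and-replace behaves like a JavaScript regular expression with the global and multiline flags, which I have not verified; the helper name applyReplacement is made up for the illustration.

```typescript
// Sample text with Windows (CR+LF) line endings and no trailing newline.
const crlfText: string = ["Uther", "Jaina", "Thrall", "Arthas"].join("\r\n");

// Hypothetical helper: apply the replacement from post #1 with a given pattern.
function applyReplacement(text: string, pattern: RegExp): string {
  return text.replace(pattern, "$1 is a famous character named $1");
}

// `.*` also matches the empty string between \r and \n, so every CR+LF
// pair receives an extra replacement.
const withStar: string = applyReplacement(crlfText, /^(.*)$/gm);

// `.+` needs at least one character, so the empty position is skipped.
const withPlus: string = applyReplacement(crlfText, /^(.+)$/gm);

// JSON.stringify makes the \r and \n characters visible in the output.
console.log(JSON.stringify(withStar));
console.log(JSON.stringify(withPlus));
```

With .* the extra replacement text lands between the \r and the \n, which is why it shows up at the end of each line in the editor.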

Regular expressions that can match zero length are a bit of a pain sometimes. :slight_smile:


#5

Thanks for your reply @Alchiadus, I fully understand the behavior now :smile:

And the tip to use ^(.+)$ works perfectly!

I was used to Notepad++, which does not have this kind of problem.
But I’ve just realized that it actually has an option that was hiding this behavior: '.' matches newline


#6

It shouldn’t. It should match anything except a newline. The tricky part here is that with CR+LF line endings, there is a second string starting at ¤ and ending at ¬. The length of this string is 0, which a .* will match.

Perhaps the visualization below illustrates this better. The square brackets indicate the start of string and end of string.

[Uther]¤[]¬
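
The same picture can be obtained by listing where each match starts. This is only a sketch, again assuming JavaScript regex semantics with the multiline flag; listMatches is a hypothetical helper, not something Atom provides.

```typescript
// Hypothetical helper: collect [start index, matched text] for every match.
function listMatches(text: string, pattern: RegExp): Array<[number, string]> {
  // Rebuild the pattern with the global and multiline flags.
  const re = new RegExp(pattern.source, "gm");
  const found: Array<[number, string]> = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    found.push([m.index, m[0]]);
    // A zero-length match does not advance lastIndex, so step past it manually.
    if (m[0].length === 0) {
      re.lastIndex++;
    }
  }
  return found;
}

const sample = "Uther\r\nJaina";

// Index 6 is the position between \r (index 5) and \n (index 6).
console.log(JSON.stringify(listMatches(sample, /^(.*)$/)));
// prints [[0,"Uther"],[6,""],[7,"Jaina"]]

console.log(JSON.stringify(listMatches(sample, /^(.+)$/)));
// prints [[0,"Uther"],[7,"Jaina"]]
```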

#7

Yes, that’s what I get for each line:

[Uther] is a famous character named [Uther] []is a famous character named[]
 ^ Before ¤                                 ^ after ¤ before ¬ 

And for the last line, there is no ¬ so I have no duplicate.

Arthas is a famous character named Arthas

So the Notepad++ option '.' matches newline is not related to this behavior after all.


But in Notepad++, the behavior does not differ depending on the line endings.

So is this unintuitive behavior a bug in Atom or not?

What is the truth about regex? I found two definitions:

  • Metacharacter $
    • End of line
    • or end of string

  • Metacharacter ^
    • Start of line
    • or start of string

Regex engines differ, but aren't the rules universal?


#8

Ahh, I just realized I misunderstood what you meant by ‘.’ matches newline; I didn’t know it was an option in Notepad++. I honestly don’t know the official definition, which can also depend on the standard that is implemented. Conceptually I always think of it as start of string and end of string, with a string being something that starts after a newline and ends just before the next newline.
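
For what it’s worth, both readings of ^ and $ exist in JavaScript-style regex, depending on the multiline flag. A small TypeScript sketch (other engines may draw the line differently, so treat this as an illustration of one engine only):

```typescript
const twoLines = "Uther\nJaina";

// Without the m flag, ^ and $ only match at the start and end of the whole string.
const wholeString: boolean = /^Jaina$/.test(twoLines);

// With the m flag, ^ and $ also match right after and right before a line terminator.
const perLine: boolean = /^Jaina$/m.test(twoLines);

// In JavaScript, \r is a line terminator on its own, so $ can match before it.
const beforeCr: boolean = /^Uther$/m.test("Uther\r\nJaina");

console.log(wholeString, perLine, beforeCr); // prints: false true true
```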

Treating the carriage return followed by a line feed as one newline character instead of two would solve the issue. I can’t imagine a scenario where one would like to consider them as separate newline characters, but I could easily be overlooking something.
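
As a rough sketch of what treating CR+LF as a single line ending could look like (this is not how Atom works, and mapLines is a name I made up), one can split on the full line ending first and transform each line separately:

```typescript
// Hypothetical helper: transform each line while treating \r\n as ONE line ending.
function mapLines(text: string, transform: (line: string) => string): string {
  // The capturing group keeps the separators in the array:
  // even indices hold line contents, odd indices hold line endings.
  // \r\n is listed first so it is never split into \r and \n.
  const parts: string[] = text.split(/(\r\n|\r|\n)/);
  return parts
    .map((part, i) => {
      const isEnding = i % 2 === 1;
      // Leave the empty piece after a trailing line ending untouched.
      const isTrailingEmpty = i > 0 && i === parts.length - 1 && part === "";
      return isEnding || isTrailingEmpty ? part : transform(part);
    })
    .join("");
}

const describe = (line: string): string => `${line} is a famous character named ${line}`;

console.log(JSON.stringify(mapLines("Uther\r\nJaina\r\n", describe)));
```

The original line endings are written back unchanged, so a CR+LF file stays a CR+LF file.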

I wouldn’t consider it a bug, but I definitely agree it’s tricky and not at all what you’d expect.


#9

This doesn’t make sense. Atom shouldn’t behave like that. In a DOS/Windows file, \r\n is the end of line marker, and you can’t behave as if the \r was a Mac end-of-line, followed by an empty line, followed by a Unix end-of-line \n.