Regular expression to remove line breaks

2023-02-12 18:58

I am a complete newbie to Python, and I'm stuck with a regex problem. I'm trying to remove the line break character at the end of each line in a text file, but only if it follows a lowercase letter, i.e. [a-z]. If a line ends in a lowercase letter, I want to replace the line break/newline character with a space.

This is what I've got so far:

import re
import sys

textout = open("output.txt","w")
textblock = open(sys.argv[1]).read()
textout.write(re.sub("[a-z]\z","[a-z] ", textblock, re.MULTILINE) )
textout.close()

Try

re.sub(r"(?<=[a-z])\r?\n"," ", textblock)

\Z only matches at the end of the string, after the last linebreak, so it's definitely not what you need here. \z is not recognized by the Python regex engine.
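To make the difference concrete, here is a small sketch comparing where \Z and a multiline $ can match. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text with two lines, each ending in a newline.
text = "abc\ndef\n"

# \Z can match only at the very end of the whole string.
end_positions = [m.start() for m in re.finditer(r"\Z", text)]

# With re.MULTILINE, $ matches before every newline and at the end of the string.
multiline_positions = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

print(end_positions)        # [8]
print(multiline_positions)  # [3, 7, 8]
```

So \Z never sees the line breaks inside the text, which is why it cannot be used to find the end of each line.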

(?<=[a-z]) is a positive lookbehind assertion that checks whether the character before the current position is a lowercase ASCII letter. Only then will the regex engine try to match a line break.

Also, always use raw strings with regexes; it makes backslashes easier to handle.
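As a quick check of the lookbehind version, here is a minimal sketch run on an in-memory string instead of a file. The function name join_after_lowercase and the sample text are assumptions made for this example only.

```python
import re

def join_after_lowercase(text):
    # Replace a line break (\n or \r\n) with a space,
    # but only when the preceding character is a lowercase ASCII letter.
    return re.sub(r"(?<=[a-z])\r?\n", " ", text)

# Hypothetical sample: two lines end in a lowercase letter, two do not.
sample = "first line\nSECOND LINE\nthird\r\nend.\n"
result = join_after_lowercase(sample)

print(repr(result))  # 'first line SECOND LINE\nthird end.\n'
```

Note also that the fourth positional parameter of re.sub is count, not flags, so the re.MULTILINE in the question's code is interpreted as a replacement limit. If flags are needed, pass them by keyword, e.g. flags=re.MULTILINE.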

Just as an alternative answer, although it takes more lines, I think the following may be clearer since the regular expression is simpler:

import re
import sys

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            if re.search('[a-z]$',line):
                ofp.write(line.rstrip("\n\r")+" ")
            else:
                ofp.write(line)

... and that avoids loading the whole file into a string. If you want to use fewer lines, but still avoid positive lookbehind, you could do:

import re
import sys

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ofp.write(re.sub('(?m)([a-z])[\r\n]+$','\\1 ',line))

The parts of that regular expression are:

(?m) [turn on multiline matching]
([a-z]) [match a single lower case character as the first group]
[\r\n]+ [match one or more of carriage returns or newlines, to cover \n, \r\n and \r]
$ [match the end of the string]

... and if that matches the line, the lowercase letter and line ending are replaced by \\1 followed by a space, where \\1 is the lowercase letter captured by the first group.
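To see the substitution on individual lines without touching any files, here is a small sketch. The helper name join_line and the sample lines are assumptions for illustration.

```python
import re

def join_line(line):
    # Same pattern as in the loop above: a lowercase letter (group 1)
    # followed by one or more CR/LF characters at the end of the line.
    return re.sub('(?m)([a-z])[\r\n]+$', '\\1 ', line)

print(repr(join_line("ends in lowercase\n")))   # 'ends in lowercase '
print(repr(join_line("ENDS IN UPPERCASE\n")))   # 'ENDS IN UPPERCASE\n'
print(repr(join_line("windows ending\r\n")))    # 'windows ending '
```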

"My point was that avoiding positive lookbehind might make the code more readable."

OK. Though, personally, I don't find it less readable. It's a matter of taste.

In your EDIT:

First, (?m) is not necessary, since for line in ifp: selects one line at a time, so there is only one newline, at the end of each line's string.
Secondly, $, as it is placed, serves no purpose because it will always match at the end of the string line.
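A rough way to check both points is to run the pattern with and without (?m) and $ over a few sample lines and compare the results. The sample lines are made up, and the check assumes that each line contains line-ending characters only at its end, as lines read from a file iterator normally do.

```python
import re

with_extras = '(?m)([a-z])[\r\n]+$'   # pattern as posted
without_extras = '([a-z])[\r\n]+'     # same pattern without (?m) and $

# Hypothetical lines, shaped like what "for line in ifp" would yield.
lines = ["ends lower\n", "ENDS UPPER\n", "crlf ending\r\n", "no newline"]

results_with = [re.sub(with_extras, '\\1 ', line) for line in lines]
results_without = [re.sub(without_extras, '\\1 ', line) for line in lines]

print(results_with == results_without)  # True
```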

Anyway, adopting your point of view, I found two ways to avoid the lookbehind assertion:

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ante_newline,lower_last = re.match('(.*?([a-z])?$)',line).groups()
            ofp.write(ante_newline+' ' if lower_last else line)

and

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ofp.write(line.rstrip('\r\n')+' ' if re.search('[a-z]$',line) else line)

The second one is better: only one line, a simple match to test, no need for groups(), and a natural flow of logic.

EDIT: Oh, I realize that this second snippet is simply your first code rewritten in one line, Longair.

Tags: python python-2.7 regex
