What is the pythonic way to remove trailing spaces from a string?

The parameter to the function satisfies these rules:

  1. It does not have any leading whitespace.
  2. It might have trailing whitespace.
  3. There might be runs of whitespace inside the string.

Goal: collapse duplicate whitespace inside the string and strip trailing whitespace.

This is how I am doing it now:

# toks - a priori no leading space
def squeeze(toks):
  import re
  p = re.compile(r'\W+')
  a = p.split( toks ) 
  for i in range(0, len(a)):
    if len(a[i]) == 0:
      del a[i]
  return ' '.join(a) 

>>> squeeze('Mary  Decker   is hot   ')
'Mary Decker is hot'

Can this be improved? Is it Pythonic enough?


This is how I would do it:

" ".join(toks.split())

PS. Is there a subliminal message in this question? ;-)


Can't you use rstrip()?

some_string.rstrip() 

or strip() for stripping the string from both sides?

In addition, the strip() methods accept an optional argument listing the characters to strip:

str.strip([chars]) -> str

Related: if you need to collapse whitespace in between, split the string, strip the terms, and re-join them.
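A minimal sketch of the methods mentioned above, assuming a made-up sample string; the variable names are only for illustration.

```python
text = "  Mary  Decker   is hot   "

right_only = text.rstrip()      # trailing whitespace removed
both_sides = text.strip()       # leading and trailing whitespace removed
custom = "--Mary--".strip("-")  # the argument is a set of characters to strip

# Split on single spaces, strip each term, drop the empty ones, re-join.
terms = [term.strip() for term in text.split(" ")]
collapsed = " ".join(term for term in terms if term)

print(repr(right_only))
print(repr(both_sides))
print(repr(custom))
print(repr(collapsed))
```

Note that rstrip() and strip() only touch the ends of the string; the runs of spaces in the middle survive until the split and re-join step.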

Reading the API helps!


To answer your questions literally:

Yes, it could be improved. The first improvement would be to make it work.

>>> squeeze('x    !    y')
'x y' # oops

Problem 1: You are using \W+ (non-word characters) when you should be using \s+ (whitespace characters)

>>> toks = 'x  !  y  z  '
>>> re.split(r'\W+', toks)
['x', 'y', 'z', '']
>>> re.split(r'\s+', toks)
['x', '!', 'y', 'z', '']

Problem 2: The loop to delete empty strings works, but only by accident. If you wanted a general-purpose loop to delete empty strings in situ, you would need to work backwards, otherwise your subscript i would get out of whack with the number of elements remaining. It works here because re.split() without a capturing group can produce empty elements only at the start and end. You have defined away the start problem, and the end case doesn't cause a problem because there have been no prior deletions. So you are left with a very ugly loop which could be replaced by two lines:

if a and not a[-1]: # guard against empty list
    del a[-1]
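To illustrate the point about working backwards, here is a rough sketch of both loops. The function names drop_empty_forwards and drop_empty_backwards are hypothetical and the sample lists are made up.

```python
def drop_empty_forwards(items):
    # Mirrors the loop in the question: the index range is fixed up front,
    # so after a deletion the later indices can run past the end of the list.
    for i in range(len(items)):
        if len(items[i]) == 0:
            del items[i]
    return items


def drop_empty_backwards(items):
    # Walk the indices from the end, so deleting items[i] never shifts
    # an element that has not been examined yet.
    for i in range(len(items) - 1, -1, -1):
        if not items[i]:
            del items[i]
    return items


print(drop_empty_backwards(["", "a", "", "", "b", ""]))

try:
    drop_empty_forwards(["", "a", ""])
except IndexError as exc:
    print("forward loop failed:", exc)
```

The forward loop only survives when the single empty string is the last element, which is exactly the case the question's precondition leaves.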

However, unless your string is very long and you are worried about speed (in which case you probably shouldn't be using re), you'd probably want to allow for leading whitespace (assertions like "my data doesn't have leading whitespace" are ignored by convention) and just filter out the empty strings on the fly:

a = [x for x in p.split(toks) if x]

Next step is to avoid building the list a:

return ' '.join(x for x in p.split(toks) if x)

Now you did mention "Pythonic" ... so let's throw out all that re import and compile overhead, and the genexp, and just do this:

return ' '.join(toks.split())
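As a sanity check that the regex version and the final one-liner agree, here is a small sketch. It assumes the \s+ pattern from Problem 1; the names squeeze_re and squeeze_split and the sample inputs are made up for illustration.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def squeeze_re(toks):
    # Split on runs of whitespace and drop the empty strings at the ends.
    return " ".join(x for x in _WHITESPACE.split(toks) if x)


def squeeze_split(toks):
    # str.split() with no argument already discards the empty strings.
    return " ".join(toks.split())


samples = ["x  !  y  z  ", "  leading", "", "tabs\tand\nnewlines"]
for sample in samples:
    print(repr(squeeze_re(sample)), repr(squeeze_split(sample)))
```

Both functions also cope with leading whitespace, so the precondition from the question is no longer needed.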


Well, I tend not to use the re module if I can do the job reasonably with the built-in functions and features. For example:

def toks(s):
    return ' '.join([x for x in s.split(' ') if x])

... seems to accomplish the same goal with only the built-in split, join, and a list comprehension to filter out empty elements of the split string.

Is that more "Pythonic?" I think so. However my opinion is hardly authoritative.

This could be done as a lambda expression as well; and I think that would not be Pythonic.

Incidentally this assumes that you want to ONLY squeeze out duplicate spaces and trim leading and trailing spaces. If your intent is to munge all whitespace sequences into single spaces (and trim leading and trailing) then change s.split(' ') to s.split() -- passing no argument, or None, to the split() method is different than passing it a space.
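The difference between the two calls is easy to see on a string that mixes spaces with other whitespace. This is only a sketch with a made-up input:

```python
s = "a  b\t\tc\n"

# split(' ') treats each single space as a separator and keeps empty strings;
# tabs and newlines are left inside the pieces.
by_space = s.split(" ")

# split() splits on any run of whitespace and drops empty strings.
by_any_whitespace = s.split()

print(by_space)
print(by_any_whitespace)
print(repr(" ".join(x for x in by_space if x)))
print(repr(" ".join(by_any_whitespace)))
```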


To make your code more Pythonic, note that with a[i] being a string, it is better to keep the elements for which a[i] != '' than to delete the ones for which a[i] == ''.

So, instead of

def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split( toks )
    for i in range(0, len(a)):
        if len(a[i]) == 0:
            del a[i]
    return ' '.join(a)

write

def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split( toks )
    a = [x for x in a if x]
    return ' '.join(a)

and then

def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join([x for x in p.split( toks ) if x])

Then, taking into account that join() accepts a generator as well as a list:

def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join((x for x in p.split( toks ) if x))

and that the doubled parentheses are not required:

def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join(x for x in p.split( toks ) if x)

Additionally, instead of making Python execute the import statement (a lookup of re in sys.modules plus a local rebinding) every time squeeze() is called, which is what the code above does, you can import re once at module level and pass it in as a default argument:

import re
def squeeze(toks,re = re):
    p = re.compile(r'\W+')
    return ' '.join(x for x in p.split( toks ) if x)

and, even better:

import re
def squeeze(toks,p = re.compile(r'\W+')):
    return ' '.join(x for x in p.split( toks ) if x)
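A short sketch of what the default-argument trick does, using the \W+ pattern copied from the snippet above. The override pattern and the sample string are assumptions made for illustration.

```python
import re


def squeeze(toks, p=re.compile(r"\W+")):
    # The default for p is evaluated once, when the def statement runs,
    # so every call without a second argument reuses the same pattern object.
    return " ".join(x for x in p.split(toks) if x)


# The compiled pattern is stored on the function object itself.
default_pattern = squeeze.__defaults__[0]
print(default_pattern.pattern)

# Because p is an ordinary parameter, a caller can also override it.
print(squeeze("a-b  c "))
print(squeeze("a-b  c ", re.compile(r"\s+")))
```

One design consequence is that the pattern becomes part of the function's signature, which may or may not be what you want.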

Remark: the if x part of the expression serves only to discard the leading '' and trailing '' that appear in the list p.split(toks) when toks begins or ends with characters matched by the pattern.

But instead of splitting on what is unwanted, it is just as good to match what is wanted:

import re
def squeeze(toks,p = re.compile(r'\w+')):
    return ' '.join(p.findall(toks))

All that said, the pattern r'\W+' in your question is wrong for your purpose, as John Machin pointed out.

If you want to compress internal whitespace and remove trailing whitespace, where whitespace means the characters ' ', '\f', '\n', '\r', '\t', '\v' (see \s in re; for str patterns \s also matches other Unicode whitespace), you must replace your splitting with this:

import re
def squeeze(toks,p = re.compile(r'\s+')):
    return ' '.join(x for x in  p.split( toks ) if x)

or, keeping the right substrings:

import re
def squeeze(toks,p = re.compile(r'\S+')):
    return ' '.join(p.findall(toks))

which is nothing more than the simpler and faster expression ' '.join(toks.split())
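If you want to check the speed claim on your own machine, a rough timeit sketch could look like this. The function names, the sample string and the repetition counts are arbitrary assumptions, and the timings will vary from run to run.

```python
import re
import timeit

_NON_WHITESPACE = re.compile(r"\S+")


def squeeze_findall(toks):
    return " ".join(_NON_WHITESPACE.findall(toks))


def squeeze_split(toks):
    return " ".join(toks.split())


sample = "  Mary  Decker   is hot   " * 100

# Time 1000 calls of each variant on the same input.
t_findall = timeit.timeit(lambda: squeeze_findall(sample), number=1000)
t_split = timeit.timeit(lambda: squeeze_split(sample), number=1000)
print(f"findall: {t_findall:.4f}s  split: {t_split:.4f}s")
```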

But if you in fact want to compress and trim only the characters ' ' and '\t', leaving newlines untouched, use

import re
def squeeze(toks,p = re.compile(r'[^ \t]+')):
    return ' '.join(p.findall(toks))

which a plain toks.split() cannot do, since it would swallow the newlines as well.
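A small sketch of the difference, on a made-up input with a newline inside it; the name squeeze_keep_newlines is hypothetical.

```python
import re


def squeeze_keep_newlines(toks, p=re.compile(r"[^ \t]+")):
    # Keep every run of characters that are neither a space nor a tab,
    # so newlines stay inside the tokens they belong to.
    return " ".join(p.findall(toks))


text = "a  b\nc\t d  "

print(repr(squeeze_keep_newlines(text)))
print(repr(" ".join(text.split())))
```

The first result still contains the newline, while the split() version turns it into a space.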


I know this question is old, but why not use a regex?

import re

result = '  Mary  Decker   is hot   '
print(f"=={result}==")

result = re.sub(r'\s+$', '', result)
print(f"=={result}==")

result = re.sub(r'^\s+', '', result)
print(f"=={result}==")

result = re.sub(r'\s+', ' ', result)
print(f"=={result}==")

The output is

==  Mary  Decker   is hot   ==
==  Mary  Decker   is hot==
==Mary  Decker   is hot==
==Mary Decker is hot==
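The three substitutions can also be folded into one function. This is only a sketch, not the approach shown above: it swaps the two anchored substitutions for str.strip(), and the name squeeze is borrowed from the question.

```python
import re


def squeeze(text):
    # Collapse every run of whitespace to one space, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()


result = squeeze("  Mary  Decker   is hot   ")
print(f"=={result}==")
```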