What's a fast way to split a Unicode string into a list, using whitespace OR punctuation as a separator?

2023-02-16 04:49 Q&A author:

 # -*- coding: utf-8 -*-
 import string
 s = u"Dobre uczynki są jak dobre poematy. Można łatwo uchwycić, ku czemu zmierzają, lecz nie zawsze da się je racjonalnie wytłumaczyć."
 exclude = set(string.punctuation)
 s = ''.join(ch for ch in s if ch not in exclude)
 s = s.split()
 print s

prints...

 [u'Dobre', u'uczynki', u's\u0105', u'jak', u'dobre', u'poematy', u'Mo\u017cna', u'\u0142atwo', u'uchwyci\u0107', u'ku', u'czemu', u'zmierzaj\u0105', u'lecz', u'nie', u'zawsze', u'da', u'si\u0119', u'je', u'racjonalnie', u'wyt\u0142umaczy\u0107']

It seems to be not only inelegant, but also slow.

Can you find any better solution? Maybe using regex?

Use re.split, and compile the pattern with re.compile first if you will have to reuse it.

import string
import re

s = u"Dobre uczynki są jak dobre poematy. Można łatwo uchwycić, ku czemu zmierzają, lecz nie zawsze da się je racjonalnie wytłumaczyć."
regex = re.compile(r'[%s\s]+' % re.escape(string.punctuation))
print regex.split(s)
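Two things to watch for with this pattern, shown here as a rough Python 3 sketch (the answer above is Python 2). First, re.split returns an empty string when the text starts or ends with a separator, which this sentence does. Second, string.punctuation only lists ASCII punctuation, so characters such as the Polish quotation marks „ ” are not treated as separators. The helper name split_words is made up for illustration.

```python
import re
import string

s = "Dobre uczynki są jak dobre poematy. Można łatwo uchwycić, ku czemu zmierzają."

# Same pattern as above: a run of ASCII punctuation and/or whitespace.
separators = re.compile(r"[%s\s]+" % re.escape(string.punctuation))

def split_words(text):
    # Hypothetical helper: drop the empty strings that re.split yields
    # when the text starts or ends with a separator.
    return [word for word in separators.split(text) if word]

words = split_words(s)
print(words)

# Non-ASCII punctuation is not in string.punctuation, so it stays attached.
quoted = split_words("„Tak”, nie")
print(quoted)
```

If the input can contain non-ASCII punctuation, the \W-based answer is probably the safer choice.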

Use regular expressions!

# -*- coding: utf-8 -*-
import re
s = u"Dobre uczynki są jak dobre poematy. Można łatwo uchwycić, ku czemu zmierzają, lecz nie zawsze da się je racjonalnie wytłumaczyć."
print re.split(r'\W+', s, flags=re.UNICODE)

You could use a more specific regex of course, but this one seems to fit the bill.

Note that the flags parameter of re.split requires Python 2.7+. If you're using an earlier version, put the inline flag at the start of the pattern instead: '(?u)\W+'.
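For anyone on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str patterns are already Unicode-aware, so the flag can be dropped. It also shows re.findall with \w+ as an alternative that matches the words rather than the separators. Keep in mind that \w includes the underscore, so neither version splits on "_".

```python
import re

s = "Można łatwo uchwycić, ku czemu zmierzają."

# Python 3 str patterns are Unicode-aware by default, so no flag is needed.
# The trailing "." produces an empty string at the end, which is filtered out.
by_split = [w for w in re.split(r"\W+", s) if w]

# Matching the words instead of the separators never yields empty strings.
by_findall = re.findall(r"\w+", s)

# \w treats the underscore as a word character, so "_" is not a separator,
# while the hyphen is.
with_underscore = re.findall(r"\w+", "snake_case, kebab-case")

print(by_split)
print(by_findall)
print(with_underscore)
```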

This should do it. (I'll admit that I haven't tested it yet though)

import re, string
s = 'your string ...'
re.split(r'[\s%s]+' % re.escape(string.punctuation), s)

EDIT: Don't use mine. Mahmoud Abdelkader's answer above is better.

I would use nltk and here's why:

# coding: utf-8
from time import time
import nltk
import string
import re


def bench(n, s):
    # Time n splits with the compiled regex.
    start_1 = time()
    regex = re.compile(r'[%s\s]+' % re.escape(string.punctuation))
    for i in xrange(n):
        regex.split(s)
    stop_1 = time()
    # Time n calls to nltk's tokenizer on the same string.
    start_2 = time()
    for i in xrange(n):
        nltk.wordpunct_tokenize(s)
    stop_2 = time()
    print "Regex: {} nltk:{}".format((stop_1-start_1), (stop_2-start_2))


bench(1000000, u"Dobre uczynki są jak dobre poematy. Można łatwo uchwycić, ku czemu zmierzają, lecz nie zawsze da się je racjonalnie wytłumaczyć.")

Regex: 22.7673070431 nltk:16.1646370888
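One caveat when reading these numbers: the two calls do not return the same thing. As far as I know, nltk.wordpunct_tokenize is a regex tokenizer built on the pattern \w+|[^\w\s]+, so it keeps punctuation as separate tokens instead of discarding it. The Python 3 sketch below reproduces that pattern with plain re (an assumption about nltk's internals, so nltk itself is not needed) and uses timeit instead of time(). The iteration count is arbitrary and the timings will differ from machine to machine.

```python
import re
import string
import timeit

s = "Dobre uczynki są jak dobre poematy. Można łatwo uchwycić, ku czemu zmierzają."

split_regex = re.compile(r"[%s\s]+" % re.escape(string.punctuation))
# Assumption: this mirrors the pattern behind nltk.wordpunct_tokenize.
wordpunct_regex = re.compile(r"\w+|[^\w\s]+")

def bench(n, text):
    """Return the seconds taken by n runs of each approach."""
    t_split = timeit.timeit(lambda: split_regex.split(text), number=n)
    t_wordpunct = timeit.timeit(lambda: wordpunct_regex.findall(text), number=n)
    return t_split, t_wordpunct

# The outputs differ: the second keeps "." and "," as tokens.
split_tokens = split_regex.split(s)
wordpunct_tokens = wordpunct_regex.findall(s)

t_split, t_wordpunct = bench(10000, s)
print("split: {:.3f}s  wordpunct pattern: {:.3f}s".format(t_split, t_wordpunct))
```

If you only want the words, you would still have to filter the punctuation tokens out of the tokenizer's result, and that extra pass belongs in the comparison too.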
