
How to filter all words that contain N or more characters?

I would like to process a text file to find all words which contain more than N characters. Any solution in Bash (grep, awk) or Python (re) is welcome! However, the shortest one is preferred.


egrep -o '[^ ]{N,}' <filename>

Find all non-space constructs at least N characters long. If you're concerned about "words" you might try [a-zA-Z].
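
To see how the choice of character class changes the result, here is a minimal Python sketch of the same idea. The helper name `long_tokens` and the sample string are made up for illustration; they are not part of the answer above.

```python
import re

def long_tokens(text, n, char_class=r"\S"):
    """Return every run of at least n characters drawn from char_class."""
    return re.findall(r"%s{%d,}" % (char_class, n), text)

sample = "Hello, world! foo-bar baz"  # illustrative input, not from the question

# Non-space runs keep attached punctuation and hyphens.
print(long_tokens(sample, 6))               # ['Hello,', 'world!', 'foo-bar']
# Letters only: punctuation and hyphens end a word.
print(long_tokens(sample, 5, r"[a-zA-Z]"))  # ['Hello', 'world']
```

Which class is right depends on whether punctuation should count toward the length.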


Python

import fileinput
N = 5
for line in fileinput.input():
    for word in line.split():
        # strictly longer than N; use >= for "N or more"
        if len(word) > N:
            print(word)


import re; [s for s in re.findall(r"\w+", open(filename, "r").read()) if len(s) >= N]


Output words longer than 5 characters, along with the line number:

awk -F ' ' '{for(i=1;i<=NF;i++){ if(length($i)>=6) print NR, $i }}' your_file
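
For comparison, a rough Python counterpart of the awk one-liner might look like the sketch below. The function name `long_words_with_line_numbers` and the sample lines are assumptions for illustration; any iterable of lines (such as an open file) could be passed instead.

```python
def long_words_with_line_numbers(lines, min_len=6):
    """Yield (line_number, word) for each whitespace-separated word
    that is at least min_len characters long."""
    for line_number, line in enumerate(lines, start=1):  # NR in awk is 1-based too
        for word in line.split():
            if len(word) >= min_len:
                yield line_number, word

sample_lines = ["short words only", "another example sentence"]  # illustrative
for line_number, word in long_words_with_line_numbers(sample_lines):
    print(line_number, word)
```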


#!/usr/bin/env python

import sys, re

def morethan(n, file_or_string):
    # Treat the argument as a filename; fall back to using it as literal text.
    try:
        with open(file_or_string, 'r') as f:
            content = f.read()
    except IOError:
        content = file_or_string
    # Match runs of n or more word characters.
    pattern = re.compile(r"\w{%s,}" % n)
    return pattern.findall(content)

if __name__ == '__main__':
    try:
        print(morethan(*sys.argv[1:]))
    except Exception:
        sys.stderr.write('Usage: %s [COUNT] [FILENAME]\n' % sys.argv[0])

Example usage (via this gist):

$ git clone -q git://gist.github.com/763574.git && \
     cd 763574 && python morethan.py 7 morethan.py

['stackoverflow', 'questions', '4585255', 'contain', ...


You could use a simple grep, but it would return the entire lines:

grep '[^ ]\{N\}'

Where N is your number.

I don't know how to get the single words in grep or awk, but it's easy in Python:

import re
N = 5  # your number
with open(filename, 'r') as f:
    text = f.read()
# \S (rather than [^ ]) also stops at newlines and tabs
big_words = re.findall(r'\S{%d,}' % N, text)

Again, N is your number. big_words will be a list containing your words.


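In this example, replace the value of 5 with whatever length you're looking for. The second example shows it as a function. Note that \w{5} matches exactly five characters, so longer words are cut into five-character chunks (see 'conse', 'ctetu' in the output of the first example); write \w{5,} to match whole words of five or more characters.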

1)

>>> import re
>>> filename = r'c:\temp\foo.txt'
>>> re.findall('\w{5}', open(filename).read())
['Lorem', 'ipsum', 'dolor', 'conse', 'ctetu', 'adipi', 'scing', 'digni', 'accum', 'congu', ...]

2)

def FindAllWordsLongerThanN(n=5, file='foo.txt'):
    # {n,} matches n or more word characters, so whole words are returned
    return re.findall(r'\w{%s,}' % n, open(file).read())

FindAllWordsLongerThanN(7, r'c:\temp\foo.txt')
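
The difference between an exact repeat count and an open-ended one is easy to miss, so here is a small sketch on a made-up sample string (not the contents of foo.txt):

```python
import re

sample = "Lorem ipsum dolor sit amet consectetur"  # illustrative text

# {5} takes exactly five characters at a time, so long words are chopped up.
exactly_five = re.findall(r"\w{5}", sample)
# {5,} takes five or more, so each long word is returned whole.
five_or_more = re.findall(r"\w{5,}", sample)

print(exactly_five)  # ['Lorem', 'ipsum', 'dolor', 'conse', 'ctetu']
print(five_or_more)  # ['Lorem', 'ipsum', 'dolor', 'consectetur']
```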


re.findall(r'\w'*N+r'\w+',txt)
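
The one-liner above builds its pattern by repeating \w N times and appending \w+, which matches words of more than N characters. As a sketch, assuming an illustrative N and txt, it behaves the same as a counted quantifier with N + 1 as the minimum:

```python
import re

N = 4
txt = "a quick brown fox jumped"  # illustrative input

repeated = r"\w" * N + r"\w+"   # expands to \w\w\w\w\w+
counted = r"\w{%d,}" % (N + 1)  # \w{5,}

# Both patterns find the same words.
assert re.findall(repeated, txt) == re.findall(counted, txt)
print(re.findall(repeated, txt))  # ['quick', 'brown', 'jumped']
```

The counted form is usually easier to read once N gets large.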


Try this:

N = 5 #Threshold
with open('test.txt','r') as f:
  for line in f:
    print(" ".join([w for w in line.split() if len(w) >= N]))


For completeness (although the regexp solution probably is better in this case):

>>> from string import punctuation
>>> with open('foreword.rst', 'rt') as infile:
...    for line in infile:
...       for x in line.split():
...           x = x.strip(punctuation)
...           if len(x) > 5:
...              print(x)

This assumes you really mean "filter", that is, each word is printed every time it occurs. If you want each word only once, I'd do this:

>>> from string import punctuation
>>> result = set()
>>> with open('foreword.rst', 'rt') as infile:
...    for line in infile:
...       for x in line.split():
...           x = x.strip(punctuation)
...           if len(x) > 5:
...              if x not in result:
...                  result.add(x)
...                  print(x)


I believe this is a nice solution using lambda functions. The first parameter is N:

def main():
    # t(n, s) returns the words of s that are longer than n characters
    t = lambda n, s: list(filter(lambda w: len(w) > n, s.split()))
    with open("file.txt") as p_file:
        for line in p_file:
            print(t(3, line))
if __name__ == '__main__':
    main()


Pure Bash:

N=10; set -o noglob; for word in $(<inputfile); do ((${#word} > N)) && echo "$word"; done; set +o noglob

If your inputfile doesn't contain any globbing characters (*, ?, [), you can omit the set commands.
