Grep search strings with line breaks

2022-12-13 05:40 Asked by:

How can I use grep to output occurrences of the string 'export to excel' in the input files given below? Specifically, how do I handle line breaks that fall in the middle of the search string? Is there a switch in grep that can do this, or some other command?

Input files:

File a.txt:

blah blah ... export to
excel ...
blah blah..

File b.txt:

blah blah ... export to excel ...
blah blah..

Do you just want to find files that contain the pattern, ignoring linebreaks, or do you want to actually see the matching lines?

If the former, you can use tr to convert newlines to spaces:

tr '\n' ' ' < a.txt | grep 'export to excel'

If the latter, you can do the same thing, but you may want to use grep's -o flag to print only the actual match. You'll then want to adjust your regex to include any extra context you want.
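As a rough illustration of the "flatten, then print only the match plus some context" idea, here is a minimal Python sketch. The function name matches_with_context and the context parameter are made up for this example; they are not part of grep or tr.

```python
import re

# Hypothetical helper for illustration; not part of the original answer.
PATTERN = re.compile(r"export to excel")


def matches_with_context(text, context=10):
    """Return each match with up to `context` characters on either side."""
    # Same effect as: tr '\n' ' '
    flat = text.replace("\n", " ")
    results = []
    for m in PATTERN.finditer(flat):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(flat))
        results.append(flat[start:end])
    return results


if __name__ == "__main__":
    sample = "blah blah ... export to\nexcel ...\nblah blah..\n"
    for hit in matches_with_context(sample):
        print(hit)
```

Like the tr pipeline, this loses the original line boundaries, so it cannot report which line a match came from.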

I don't know how to do this in grep. I checked the man page for egrep(1) and it can't match with a newline in the middle either.

I like the solution @Laurence Gonsalves suggested, of using tr(1) to wipe out the newlines. But as he noted, it will be a pain to print the matching lines if you do it that way.

If you want to match despite a newline and then print the matching line(s), I can't think of a way to do it with grep, but it would not be too hard in Python, AWK, Perl, or Ruby.

Here's a Python script that solves the problem. I decided that, for lines that only match when joined to the previous line, I would print a --> arrow before the second line of the match. Lines that match outright are always printed without the arrow.

This is written assuming that /usr/bin/python is Python 2.x. You can trivially change the script to work under Python 3.x if desired.

#!/usr/bin/python

import re
import sys

s_pat = r"export\s+to\s+excel"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        f = open(fname, "rt")
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)

    prev_line = ""
    i_last = -10
    for i, line in enumerate(f):
        # is ete within current line?
        if pat.search(line):
            print "%s:%d: %s" % (fname, i+1, line.strip())
            i_last = i
        else:
            # construct an extended line that includes the previous line
            # note the previous line's newline is stripped
            s = prev_line.strip("\n") + " " + line
            # is ete within extended line?
            if pat.search(s):
                # matched ete in extended so want both lines printed
                # did we print prev line?
                if i_last != (i - 1):
                    # no so print it now
                    print "%s:%d: %s" % (fname, i, prev_line.strip())
                # print cur line with special marker
                print "-->  %s:%d: %s" % (fname, i+1, line.strip())
                i_last = i
        # make sure we don't match ete twice
        prev_line = re.sub(pat, "", line)

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
            "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])

EDIT: added comments.

I went to some trouble to make it print the correct line number on each line, using a format similar to what you would get with grep -Hn.
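For readers on Python 3, the same line-by-line logic might look roughly like the sketch below. It is my own restatement, not the script above: the generator name matching_lines and its fname default are assumptions for this example, and it takes any iterable of lines instead of opening the file itself.

```python
import re

PAT = re.compile(r"export\s+to\s+excel")


def matching_lines(lines, fname="<input>"):
    """Yield grep -Hn style strings; '-->' marks a line that only
    matches when joined to the line before it."""
    prev_line = ""
    i_last = -10  # index of the last line we emitted
    for i, line in enumerate(lines):
        if PAT.search(line):
            yield "%s:%d: %s" % (fname, i + 1, line.strip())
            i_last = i
        else:
            # Join with the previous line, dropping its newline.
            joined = prev_line.rstrip("\n") + " " + line
            if PAT.search(joined):
                if i_last != i - 1:
                    # Previous line has not been emitted yet.
                    yield "%s:%d: %s" % (fname, i, prev_line.strip())
                yield "-->  %s:%d: %s" % (fname, i + 1, line.strip())
                i_last = i
        # Remove matches so the same occurrence is not reported twice.
        prev_line = PAT.sub("", line)


sample = ["blah blah ... export to\n", "excel ...\n", "blah blah..\n"]
for out in matching_lines(sample, "a.txt"):
    print(out)
```

An open file object can be passed in place of the list, since iterating over a file yields its lines with the trailing newline included.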

It could be much shorter and simpler if you don't need line numbers, and you don't mind reading in the whole file at once into memory:

#!/usr/bin/python

import re
import sys

# \s matches any whitespace, including "\n", so this pattern can
# match across a line break. (re.MULTILINE would not change that;
# it only affects ^ and $.)
# The parts of the group around the ete pattern use a character class
# that matches anything but "\n", to grab the rest of the line(s)
# around the match.
s_pat = r"([^\n]*export\s+to\s+excel[^\n]*)"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        text = open(fname, "rt").read()
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)

    for s_match in re.findall(pat, text):
        print s_match

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
            "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])
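If you read the whole file at once but still want line numbers, one option is to count the newlines before each match. This is a small Python 3 sketch of that idea, not part of the script above; the function name find_with_line_numbers is invented for the example.

```python
import re

PAT = re.compile(r"export\s+to\s+excel")


def find_with_line_numbers(text):
    """Return (first_line, last_line, matched_text) for every match."""
    hits = []
    for m in PAT.finditer(text):
        # Line number = newlines before the offset, plus one.
        first = text.count("\n", 0, m.start()) + 1
        last = text.count("\n", 0, m.end()) + 1
        hits.append((first, last, m.group(0)))
    return hits


sample = "blah blah ... export to\nexcel ...\nblah blah..\n"
for first, last, matched in find_with_line_numbers(sample):
    print("lines %d-%d: %r" % (first, last, matched))
```

Counting from the start of the text for every match is quadratic in the worst case, which should be fine for small files but is worth keeping in mind for large ones.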

grep -A1 "export to" filename | grep -B1 "excel"

I have tested this a little and it seems to work:

sed -n '/export to excel/{p; b}; $b; N; /export to\nexcel/{p; b}; D' filename

You can allow for some extra whitespace at the end and beginning of the lines like this (\s is a GNU sed extension):

sed -n '/export to excel/{p; b}; $b; N; /export to\s*\n\s*excel/{p; b}; D' filename

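Use gawk. Set the record separator to excel, then check whether the record ends with "export to". RT is non-empty only when the record was actually terminated by the separator.

gawk -v RS="excel" '/export[[:space:]]+to[[:space:]]*$/ && RT != "" {print "found export to excel at record: " NR}' file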
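The record-separator trick can be sketched in Python as well. The version below is deliberately strict, which is my own assumption and not something the gawk one-liner necessarily does: a record counts only if it ends in "export to" and is actually followed by the separator. The function name records_ending_in_export_to is made up for illustration.

```python
import re

# A record qualifies if it ends in "export to" plus optional whitespace.
TAIL = re.compile(r"export\s+to\s*$")


def records_ending_in_export_to(text, separator="excel"):
    """Return 1-based numbers of records that end in 'export to'."""
    records = text.split(separator)
    # The last chunk has no separator after it, so it is skipped.
    return [n for n, rec in enumerate(records[:-1], start=1)
            if TAIL.search(rec)]


sample = "blah blah ... export to\nexcel ...\nblah blah..\n"
print(records_ending_in_export_to(sample))
```

Because the line break sits inside the record and not inside the separator, the check works the same whether "export to" and "excel" are on one line or two.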

gawk '/export.*to.*excel/{print}
/export to/&&!/excel/{
  s=$0
  getline line
  if (line~/excel/){
   printf "%s\n%s\n",s,line
  } 
}' file
