How do I crop a very large text file between first and last occurrence of a string? (Linux)

On a Linux system, I have a very large text file, and I need to create a new text file containing every line between the first and last occurrence of a particular sessionId (those lines included).

I guess I probably need to use sed or something?

As a bonus, sometimes I won't know which log file will contain the session trace. So a script that can work with regular expressions would be ideal. In this case I would expect the script to find the first file with the sessionId in it and then crop that file before exiting.

Example log file, looking for sessionId 1111-ABCD-1111-SOME-GUID:

line one containing other session id: 2222-ABCD-1111-SOME-GUID blaa blaa blaa
line two blaa blaa blaa
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line four containing other session id: 2222-ABCD-1111-SOME-GUID
line five blaa blaa blaa
line six containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line seven containing other session id: 2222-ABCD-1111-SOME-GUID
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line nine containing other session id: 3333-ABCD-1111-SOME-GUID
line ten containing my session id: 1111-ABCD-1111-SOME-GUID
line eleven
line twelve containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line thirteen containing my session id: 1111-ABCD-1111-SOME-GUID
line fourteen blaa blaa blaa
line fifteen containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa

The output file would contain lines three to thirteen inclusive.


I'd propose something like this:

# Find all occurrences of the session id in the input file
grep -n "<session id>" "<input file>" > /tmp/grep.$$

# Get the line number of the first occurrence
FIRST_LINE=$(head -1 /tmp/grep.$$ | cut -d: -f1)

# Get the line number of the last occurrence
LAST_LINE=$(tail -1 /tmp/grep.$$ | cut -d: -f1)

# Print only the part (inclusive) between the first and last occurrence
sed -n "${FIRST_LINE},${LAST_LINE}p" "<input file>"

This retrieves the line numbers of the first and last occurrences of your pattern in the input file and then, with sed, prints only the lines between them (inclusive). It could be optimised to read the file only once, but it should work as is.
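That single-read optimisation can be sketched as one awk pass that records the first and last matching line numbers, after which sed prints the range. A sketch under assumed names: `/tmp/sample.log` is a throwaway stand-in for your real log file.

```shell
sid="1111-ABCD-1111-SOME-GUID"

# Throwaway sample input standing in for the real log file.
cat > /tmp/sample.log <<'EOF'
other session: 2222-ABCD-1111-SOME-GUID
filler
my session: 1111-ABCD-1111-SOME-GUID
filler
my session: 1111-ABCD-1111-SOME-GUID
trailing filler
EOF

# One pass: remember the first and last line numbers containing $sid.
range=$(awk -v sid="$sid" 'index($0, sid) { if (!first) first = NR; last = NR }
                           END { if (first) print first "," last }' /tmp/sample.log)

# Print the inclusive range (here: lines 3 to 5); do nothing if there was no match.
[ -n "$range" ] && sed -n "${range}p" /tmp/sample.log
```

Using `index()` rather than a regex avoids any surprises if the session id ever contains regex metacharacters.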


The following script does all of what you asked, including the bonus. Put it in the top-level directory that contains all the files which might hold the 'uid' you want to crop. The script recursively searches that directory and crops every file that matches, writing the result to a new file with a .crp extension (see example below). I took special care to make sure this script works with whatever filename you throw at it, whether it contains spaces, newlines, or anything else.

#!/bin/bash
uid="1111-ABCD-1111-SOME-GUID"

while IFS= read -r -d $'\0' file; do
    printf "%s\n" "?$uid?+1,\$d" "1,/$uid/-1d" "%p" | ex -s "$file" > "$file".crp
    echo "$file being cropped"
done < <(grep -lZR --exclude="${0#*/}" --exclude="*.crp" "$uid" .)

Result

$ ./uid.sh
./sample1.txt being cropped
./subdir/sample2.txt being cropped

$ cat ./sample1.txt.crp
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line four containing other session id: 2222-ABCD-1111-SOME-GUID
line five blaa blaa blaa
line six containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line seven containing other session id: 2222-ABCD-1111-SOME-GUID
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line nine containing other session id: 3333-ABCD-1111-SOME-GUID
line ten containing my session id: 1111-ABCD-1111-SOME-GUID
line eleven
line twelve containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line thirteen containing my session id: 1111-ABCD-1111-SOME-GUID

$ cat ./subdir/sample2.txt.crp
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
foo
bar
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
baz
line ten containing my session id: 1111-ABCD-1111-SOME-GUID

As you can see in the example above, my script found two files which matched, one of which was in a sub-directory below the top-level directory.


I'd probably do this using cat and awk, buffering lines until the session id appears again. Something like:

cat *.log | awk '
    /1111-ABCD-1111-SOME-GUID/ {
        # Line with the session id: flush anything buffered since the
        # previous occurrence, then print this line as well.
        for (i = 1; i <= n; i++) print buf[i]
        n = 0
        print
        found = 1
        next
    }
    found { buf[++n] = $0 }   # buffered; printed only if the id appears again
'


Either a few lines of Perl, or:

grep -no <session_ID> <log_file>

(make a note of the first and last line numbers that contain your session ID)

awk 'NR==3,NR==935' <log_file>

(where 3 and 935 are the first and last line numbers returned from the grep command)

I can't currently think of a way to make that a one-liner.
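For what it's worth, one way to get close to a one-liner is to let awk read the file twice: the first pass notes the first and last matching line numbers, and the second pass prints that range. A sketch with a throwaway sample file (`/tmp/trace.log`) standing in for the real log:

```shell
sid="1111-ABCD-1111-SOME-GUID"

# Throwaway sample file standing in for the real log.
cat > /tmp/trace.log <<'EOF'
before
my session: 1111-ABCD-1111-SOME-GUID
middle
my session: 1111-ABCD-1111-SOME-GUID
after
EOF

# Pass 1 (NR == FNR): note the first/last matching lines.
# Pass 2: print every line in that inclusive range.
awk -v sid="$sid" '
    NR == FNR { if (index($0, sid)) { if (!first) first = FNR; last = FNR }; next }
    FNR >= first && FNR <= last
' /tmp/trace.log /tmp/trace.log
```

Reading the file twice costs an extra pass over a very large file, but it avoids buffering anything in memory.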


gawk 'BEGIN{ c = 0 }
/1111-ABCD-1111-SOME-GUID/{
    f = 1
    for (i = 1; i <= c; i++) print _[i]   # flush lines buffered since the last match
    print                                 # then print the matching line itself
    delete _
    c = 0
}
!/1111-ABCD-1111-SOME-GUID/ && f { _[++c] = $0 }   # buffer; only printed if the id appears again
' file


The following Perl script (session_id.pl) does the job:

#!/usr/bin/perl

my $session_id = '1111-ABCD-1111-SOME-GUID';
my @buffer;

while ( <> ) {
    if ( /\Q$session_id\E/ ) {
        print @buffer, $_;    # flush lines held since the previous match
        @buffer = ();
        $seen = 1;
    }
    elsif ( $seen ) {
        push @buffer, $_;     # kept only until we know another match follows
    }
}

Make it executable and run it:

./session_id.pl < session.data


What about:

sed -n "/$session_id/,\$p" file.txt | tac | sed -n "/$session_id/,\$p" | tac

? (Print from the first occurrence to the end of the file, reverse with tac, print from the last occurrence onward, then reverse back.)
