How to replace pairs of strings in two files to identical IDs?

2022-12-28 00:35

[Update2] As often happens, the scope of the task expanded quite a bit as I understood it better. The obsolete parts are crossed out, and you will find the updated explanation below. [/Update2]

I have a pair of rather large log files with very similar content, except that some strings are different between the two. A couple of examples:

~~UnifiedClassLoader3@19518cc | UnifiedClassLoader3@d0357a JBossRMIClassLoader@13c2d7f | JBossRMIClassLoader@191777e~~

~~That is, wherever the first file contains UnifiedClassLoader3@19518cc, the second contains UnifiedClassLoader3@d0357a, and so on. [Update] There are about 40 distinct pairs of such identifiers.[/Update]~~

UnifiedClassLoader3@19518cc | UnifiedClassLoader3@d0357a
JBossRMIClassLoader@13c2d7f | JBossRMIClassLoader@191777e
Logi18n@177060f             | Logi18n@12ef4c6
LogFactory$1@15e3dc4        | LogFactory$1@2942da

That is, wherever the first file contains UnifiedClassLoader3@19518cc, the second contains UnifiedClassLoader3@d0357a, and so on. Note that all these strings are inside long lines of text, and they appear in many rows, intermixed with each other. There are about 4000 distinct pairs of such identifiers, and the size of each file is about 34 MB. So performance became an issue as well.

I want to replace these with identical IDs so that I can spot the really important differences between the two files. That is, I want to replace all occurrences of both UnifiedClassLoader3@19518cc in file1 and UnifiedClassLoader3@d0357a in file2 with UnifiedClassLoader3@1; all occurrences of both Logi18n@177060f in file1 and Logi18n@12ef4c6 in file2 with Logi18n@2, etc. The counters 1 and 2 are arbitrary choices - the only requirement is that there is a one-to-one mapping between the old and new strings (i.e. the same string is always replaced by the same value, and no two different strings are replaced by the same value).

Using the Cygwin shell, so far I managed to list all different identifiers occurring in one of the files with

~~grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' file1.log | sort | uniq~~

grep -o -e '[A-Z][A-Za-z0-9]*\(\$[0-9][0-9]*\)*@[0-9a-f][0-9a-f]*' file1.log \
    | sort | uniq

However, now the original order is lost, so I don't know which ID pairs with which ID in the other file. With grep -n I can get the line number, so the sort would preserve the order of appearance, but then I can't weed out the duplicate occurrences. Unfortunately, grep cannot print only the first match of a pattern.
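One possible way around the lost order, as a sketch: awk can drop repeated lines while keeping the order of first appearance, so no sort is needed. The function name `first_seen_ids` is made up for illustration; the pattern is the one from the grep command above, and `grep -o` is assumed to be available (it is in GNU grep, which Cygwin ships).

```shell
#!/bin/sh
# Hypothetical helper: print each identifier found in file "$1" exactly once,
# in order of first appearance.
first_seen_ids() {
    grep -o -e '[A-Z][A-Za-z0-9]*\(\$[0-9][0-9]*\)*@[0-9a-f][0-9a-f]*' "$1" |
        awk '!seen[$0]++'   # true only the first time a given line is seen
}
```

Running `first_seen_ids file1.log` and `first_seen_ids file2.log` gives two lists whose Nth entries pair up, provided the identifiers first appear in the same order in both files.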

I figured I could save the list of identifiers produced by the above command into a file, then iterate over the patterns in the file with grep -n | head -n 1, concatenate the results and sort them again. The result would be something like

2 ClassLoader3@19518cc
137 ClassLoader@13c2d7f
563 ClassLoader3@1267649
...

Then I could (using sed itself) massage this into a sed command like

sed -e 's/ClassLoader3@19518cc/ClassLoader3@2/g' \
    -e 's/ClassLoader@13c2d7f/ClassLoader@137/g' \
    -e 's/ClassLoader3@1267649/ClassLoader3@563/g' \
    file1.log > file1_processed.log

and similarly for file2.
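With thousands of identifiers, the substitutions may be easier to handle as a generated sed script file than as one long command line. A minimal sketch, assuming the identifiers arrive one per line, already in order of first appearance; `make_sed_script` and the file names `ids1.txt` and `file1.sed` are hypothetical. It numbers the identifiers by position in the list rather than by line number, and keeps the full class name in front of the `@`.

```shell
#!/bin/sh
# Hypothetical helper: read identifiers on stdin (one per line, in order of
# first appearance) and print one sed substitution per identifier.
make_sed_script() {
    awk -F@ '{
        pattern = $0
        gsub(/[$]/, "[$]", pattern)   # keep "$" literal in the sed pattern
        # $1 is the class name before "@"; NR is the position in the list.
        printf "s/%s/%s@%d/g\n", pattern, $1, NR
    }'
}

# Usage sketch:
#   make_sed_script < ids1.txt > file1.sed
#   sed -f file1.sed file1.log > file1_processed.log
```

One caveat: an identifier that is a prefix of a longer one (for example `Foo@12` and `Foo@123`) would also match inside the longer one, so such cases would need extra care.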

However, before I start, I would like to verify that my plan is the simplest possible working solution to this.

Is there any flaw in this approach? Is there a simpler way?

I think this does the trick, or at least comes close

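#!/bin/sh
# For each log, number the identifiers in order of first appearance and
# replace every occurrence with ClassLoader@<number>.
for PREFIX in file1 file2
do
    cp "${PREFIX}.log" /tmp/filter.$$.txt
    # Matches in file order, repeats dropped (plain uniq only drops adjacent
    # repeats), then numbered by grep -n as "N:identifier".
    FILE_MAP=$(grep -E -o -e 'ClassLoader[0-9a-f]*@[0-9a-f]+' "${PREFIX}.log" | awk '!seen[$0]++' | grep -n .)
    for MAP in $FILE_MAP
    do
        NUMBER=$(echo "$MAP" | cut -d : -f 1)
        WORD=$(echo "$MAP" | cut -d : -f 2)
        # One full pass over the working copy per identifier.
        sed -e "s/$WORD/ClassLoader@$NUMBER/g" /tmp/filter.$$.txt > "${PREFIX}_processed.log"
        cp "${PREFIX}_processed.log" /tmp/filter.$$.txt
    done
    rm /tmp/filter.$$.txt
done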

Let me know if you have questions on how it works and why.

Here's my test data and the output

file1.log:

A1
UnifiedClassLoader3@a45bc1
A2
UnifiedClassLoader3@a45bc1
A3
UnifiedClassLoader3@a45bc1
A4
JBossRMIClassLoader@bc450a
A5
JBossRMIClassLoader@bc450a
A6
JBossRMIClassLoader@bc450a

B1
UnifiedClassLoader3@a45bc2
B2
UnifiedClassLoader3@a45bc2
B3
UnifiedClassLoader3@a45bc2
B4
JBossRMIClassLoader@bc450b
B5
JBossRMIClassLoader@bc450b
B6
JBossRMIClassLoader@bc450b

C1
UnifiedClassLoader3@a45bc3
C2
UnifiedClassLoader3@a45bc3
C3
UnifiedClassLoader3@a45bc3
C4
JBossRMIClassLoader@bc450c
C5
JBossRMIClassLoader@bc450c
C6
JBossRMIClassLoader@bc450c

file2.log (Similar patterns except the "C" set repeats the "A" set)

A1
UnifiedClassLoader3@d0357a
A2
UnifiedClassLoader3@d0357a
A3
UnifiedClassLoader3@d0357a
A4
JBossRMIClassLoader@191777e
A5
JBossRMIClassLoader@191777e
A6
JBossRMIClassLoader@191777e

B1
UnifiedClassLoader3@d0357b
B2
UnifiedClassLoader3@d0357b
B3
UnifiedClassLoader3@d0357b
B4
JBossRMIClassLoader@191777f
B5
JBossRMIClassLoader@191777f
B6
JBossRMIClassLoader@191777f

C1
UnifiedClassLoader3@d0357a
C2
UnifiedClassLoader3@d0357a
C3
UnifiedClassLoader3@d0357a
C4
JBossRMIClassLoader@191777e
C5
JBossRMIClassLoader@191777e
C6
JBossRMIClassLoader@191777e

And after processing you get file1_processed.log

A1
UnifiedClassLoader@1
A2
UnifiedClassLoader@1
A3
UnifiedClassLoader@1
A4
JBossRMIClassLoader@2
A5
JBossRMIClassLoader@2
A6
JBossRMIClassLoader@2

B1
UnifiedClassLoader@3
B2
UnifiedClassLoader@3
B3
UnifiedClassLoader@3
B4
JBossRMIClassLoader@4
B5
JBossRMIClassLoader@4
B6
JBossRMIClassLoader@4

C1
UnifiedClassLoader@5
C2
UnifiedClassLoader@5
C3
UnifiedClassLoader@5
C4
JBossRMIClassLoader@6
C5
JBossRMIClassLoader@6
C6
JBossRMIClassLoader@6

and file2_processed.log

A1
UnifiedClassLoader@1
A2
UnifiedClassLoader@1
A3
UnifiedClassLoader@1
A4
JBossRMIClassLoader@2
A5
JBossRMIClassLoader@2
A6
JBossRMIClassLoader@2

B1
UnifiedClassLoader@3
B2
UnifiedClassLoader@3
B3
UnifiedClassLoader@3
B4
JBossRMIClassLoader@4
B5
JBossRMIClassLoader@4
B6
JBossRMIClassLoader@4

C1
UnifiedClassLoader@1
C2
UnifiedClassLoader@1
C3
UnifiedClassLoader@1
C4
JBossRMIClassLoader@2
C5
JBossRMIClassLoader@2
C6
JBossRMIClassLoader@2
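With about 4000 identifiers and 34 MB per file, the loop in the script above rereads the whole file once per identifier, which may be slow. A single-pass alternative, as a rough sketch rather than a drop-in replacement: awk assigns each identifier a number the first time it sees it and rewrites the line as it goes. The function name `normalize_ids` is hypothetical, and unlike the script above this version uses the broader identifier pattern from the question and keeps the full class name (for example `UnifiedClassLoader3@1`).

```shell
#!/bin/sh
# Hypothetical helper: rewrite NAME@hex identifiers in file "$1" to NAME@N in
# a single pass, where N counts distinct identifiers in order of first
# appearance.
normalize_ids() {
    awk '{
        out = ""
        rest = $0
        # Walk through the line, one identifier match at a time.
        while (match(rest, /[A-Z][A-Za-z0-9]*([$][0-9]+)*@[0-9a-f]+/)) {
            id = substr(rest, RSTART, RLENGTH)
            if (!(id in num)) num[id] = ++count   # first sighting: next number
            name = id
            sub(/@.*/, "", name)                  # keep the part before "@"
            out = out substr(rest, 1, RSTART - 1) name "@" num[id]
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$1"
}

# Usage sketch:
#   normalize_ids file1.log > file1_processed.log
#   normalize_ids file2.log > file2_processed.log
```

As with the script above, the two outputs only line up if the identifiers first appear in the same order in both files.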
