Extracting word from file using grep or sed

I have a file in the format below:

File                  : \\dvtbbnkapp115\nautilus\030db28a-f241-4054-a0e3-9bfa7e002535.dip was
 processed. 
Entries Found         : 0
Unarchived Documents  : 1 
            File Size : 1 K 

Error : The following line could not be processed.  Bad Document Type.

Error : Marketing and Contact preference change
        update||7000003735||078ef1f3-db6b-46a8-bb0d-c40bb2296ab5.pdf



File                  : \\dvtbbnkapp115\nautilus\078ef1f3-db6b-46a8-bb0d-c40bb2296ab5.dip was
 processed. 
Entries Found         : 0
Unarchived Documents  : 1 
            File Size : 1 K 

Error : The following line could not be processed.  Bad Document Type.

Error : Declined - Bureau Data (process)||7000003723|252204|2f1d71f4-052c-49f1-95cf-9ca9b4268f0c.pdf



File                  : \\dvtbbnkapp115\nautilus\2f1d71f4-052c-49f1-95cf-9ca9b4268f0c.dip was
 processed. 
Entries Found         : 0
Unarchived Documents  : 1 
            File Size : 1 K 

Error : The following line could not be processed.  Bad Document Type.

Error : Unable to call - please
        contact|40640510016710|7000003180||3e6a792f-c136-4a4b-a654-37f4476ccef8.pdf

I need to extract just the PDF file names after the double pipe and write them to a file. I am a novice when it comes to Unix/sed/grep commands. I have tried, but with no luck. Any ideas or examples I could use to extract the information above?

Thanks


Give this a try if you only want PDF filenames that come directly after a double pipe and are the last thing on the line:

sed -n 's/.*||\([^|]*\.pdf\)$/\1/p' inputfile

The second PDF filename in your example follows a single pipe character, but there is an earlier set of double pipes on that line. This version accommodates both styles: on any line that contains a double pipe, it prints the last pipe-delimited field if that field ends in .pdf:

sed -n '/||/ s/.*|\([^|]*\.pdf\)$/\1/p' inputfile

If your filenames consist of only hex digits and hyphens, you can be a little more selective like this:

sed -n '/||/ s/.*|\([[:xdigit:]-]*\.pdf\)$/\1/p' inputfile
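
To see how a strict pattern and a lenient one differ, here is a small sketch that runs both against a trimmed copy of the error lines from the question and writes the results to files, as the question asks. The file names sample.log, pdf_strict.txt and pdf_all.txt are made up for the illustration, and the lenient pattern is one possible way to cover all three line shapes, not the only one.

```shell
#!/bin/sh
# Trimmed copy of the error lines from the question (sample.log is a hypothetical name).
cat > sample.log <<'EOF'
Error : Marketing and Contact preference change
        update||7000003735||078ef1f3-db6b-46a8-bb0d-c40bb2296ab5.pdf
Error : Declined - Bureau Data (process)||7000003723|252204|2f1d71f4-052c-49f1-95cf-9ca9b4268f0c.pdf
Error : Unable to call - please
        contact|40640510016710|7000003180||3e6a792f-c136-4a4b-a654-37f4476ccef8.pdf
EOF

# Strict: the name must sit directly after a double pipe at the end of the line.
sed -n 's/.*||\([^|]*\.pdf\)$/\1/p' sample.log > pdf_strict.txt

# Lenient: any line that contains a double pipe; keep the last pipe-delimited field.
sed -n '/||/ s/.*|\([^|]*\.pdf\)$/\1/p' sample.log > pdf_all.txt

echo "strict:";  cat pdf_strict.txt
echo "lenient:"; cat pdf_all.txt
```

On this sample the strict pattern writes two names (the 078ef1f3... and 3e6a792f... files) and the lenient one writes all three, so the choice depends on whether the "Declined" style of line should count.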


If I understood your request correctly, this should do it:

grep -o -E "\|\|[^|]*\.pdf" < input | cut -f 3 -d "|"

grep -o prints only the part of each line that matches: a double pipe followed by a PDF name. cut then splits that text on the delimiter and selects the n-th field.

To get every PDF name that is on a line containing a double pipe (not only the names directly after one):

grep "||" < input | cut -f 5 -d "|" > output

Edit: after seeing the comment I think you wanted something else, so I adjusted the answer. I am keeping both answers, since it seems this is the simple case.
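
The field numbers passed to cut are easier to follow with the intermediate output in view. The sketch below is only an illustration: demo.log and the three output file names are hypothetical, and the two input lines are copied from the question.

```shell
#!/bin/sh
# Two error lines in the shapes shown in the question (demo.log is a hypothetical name).
cat > demo.log <<'EOF'
        update||7000003735||078ef1f3-db6b-46a8-bb0d-c40bb2296ab5.pdf
Error : Declined - Bureau Data (process)||7000003723|252204|2f1d71f4-052c-49f1-95cf-9ca9b4268f0c.pdf
EOF

# Step 1: -o keeps only the matched text, so the two leading pipes are still attached.
grep -o -E '\|\|[^|]*\.pdf' demo.log > matched.txt

# Step 2: splitting on "|" yields two empty fields first, so the name is field 3.
cut -d '|' -f 3 matched.txt > after_double_pipe.txt

# Whole-line variant: both lines contain exactly four pipes, so the name is field 5.
grep '||' demo.log | cut -d '|' -f 5 > on_double_pipe_lines.txt

cat matched.txt after_double_pipe.txt on_double_pipe_lines.txt
```

Note that the field 5 variant relies on every matching line having exactly four pipe characters before the name, which holds for the lines in the question but is an assumption about the rest of the log.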


This will only extract the filenames that come immediately after a '||' sequence.

grep -o '||[^|]*\.pdf' YOUR_FILE | tr -d '|'

EDIT: I removed the ${...} to make it more readable.


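Why not simply send your input through sed? In a basic regular expression the pipe is a literal character and needs no backslash (GNU sed treats \| as alternation), so like this:

sed -n -e '/||.*\.pdf$/ { s/.*||//; p; }'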


Ruby (1.9+)

$ ruby -F'\|\|' -ane 'print $F[-1] if $_["\.pdf"] && !$F[1].include?("|") ' file
078ef1f3-db6b-46a8-bb0d-c40bb2296ab5.pdf
3e6a792f-c136-4a4b-a654-37f4476ccef8.pdf