
What's the best tool to do text processing in Linux or Mac? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.


Closed 7 years ago.


I generally need to do a fair amount of text processing for my research, such as removing the last token from all lines, extracting the first two tokens from each line, splitting each line into tokens, etc.

What is the best way to perform this? Should I learn Perl for this? Or should I learn some kind of shell commands? The main concern is speed. If I need to write long code for such stuff, it defeats the purpose.

EDIT:

I started learning sed on @Mimisbrunnr's recommendation and could already do what I needed. But it seems people favor awk more, so I will try that. Thanks for all your replies.


Perl and awk come to mind, although Python will do, if you'd rather not learn a new language.

Perl is a general-purpose language; awk is more oriented to text processing of the kind you've described.


For doing simple stream editing, sed is a great utility that comes standard on most *nix boxes, but for anything much more complex than that I would suggest getting into Perl. The learning curve isn't that bad, and it's great for most forms of text parsing. A great reference can be found here.
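As a sketch of the kind of simple stream edits sed handles well, here are one-liners for two of the tasks from the question (the `-E` flag for extended regular expressions is supported by GNU and BSD sed):

```shell
# Delete the last whitespace-separated token on each line
echo 'a b c' | sed 's/[[:space:]]*[^[:space:]]*$//'    # prints: a b

# Keep only the first two tokens of each line
echo 'a b c d' | sed -E 's/^([^ ]+ [^ ]+).*/\1/'       # prints: a b
```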


#!/usr/bin/env python3
# process.py
import fileinput

for line in fileinput.input():  # pass inplace=True here to edit files in place
    words = line.split()        # split on whitespace
    all_except_last = words[:-1]
    print(' '.join(all_except_last))
    # or, to keep only the first two tokens instead:
    # first_two = words[:2]
    # print(' '.join(first_two))

Examples:

$ echo a b c | python process.py
$ ./process.py input.txt another.txt


*nix tools such as awk/grep/tail/head/sed etc. are good file-processing tools. If you want to search for patterns in files and process them, you can use awk. For big files, you can use a combination of grep + awk: grep for its speed in pattern searching, and awk for its ability to manipulate text. As for sed, what sed does, awk can usually do as well, so I find it redundant to use sed for file processing.

In terms of speed of processing files, awk is often on par, or sometimes better than Perl or other languages.

Also, two very good tools for getting the front and back portions of a file fast are head and tail. So to get the last lines, you can use tail.
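The combinations described above can be sketched on a small sample file (the filename `sample.txt` and its contents are made up for illustration):

```shell
# Create a throwaway sample file
printf 'one 1\ntwo 2\nthree 3\n' > sample.txt

awk '{print $1}' sample.txt                 # first token of each line
grep 'two' sample.txt | awk '{print $2}'    # grep filters lines, awk extracts a field -> 2
head -n 1 sample.txt                        # first line:  one 1
tail -n 1 sample.txt                        # last line:   three 3

rm sample.txt
```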


The best tool depends on the task to be performed, of course. Besides the usual *nix tools like sed/awk and the programming languages (Perl, Python) cited by others, for text processing where the original data format doesn't follow rigid parsing rules but may vary slightly, I get on very well with Vim macros and Vimscript functions that I call inside the Vim editor.

Something like this (for the Vim uninitiated): you write the processing function(s), e.g. TxtProcessingToBeDone1(), in a file script.vim, source it with :source script.vim, then open the file(s) you want to edit and run:

:call TxtProcessingToBeDone1()

on the whole buffer at once, or as a one-shot operation to be repeated on the spot with the @: and @@ keys. Multiple buffers/files can also be processed at the same time with :bufdo and :argdo.

With a Vimscript function you can repeat all the tasks you would do on a regular editing session (search a pattern, reg-ex, substitution, move to, delete, yank, etc, etc), automate it and also apply some programming control flow (if/then).

Similar considerations apply to other advanced scriptable editors as well.

