I'd like to apply a regex efficiently to an entire file

Posted 2023-02-03 23:01

I have a complex regex, and I'd like to match it against the contents of an entire huge file. The main concern is efficiency, since the file is indeed very big and running out of memory is a distinct possibility.

Is there a way I can somehow "buffer" the contents while pumping it through a regex matcher?

Yes. Pattern.matcher() accepts any CharSequence, so the input does not have to be a String.

If your input is already in a charset that uses exactly 2 bytes to represent each character, without any 'prologue' (such as UTF-16 with no byte-order mark), you need only:

ByteBuffer bb = ...; // acquire memory mapped byte buffer
CharBuffer cb = bb.asCharBuffer();  // get a char[] 'view' of the bytes

... and since CharBuffer implements CharSequence, you're done.
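As a minimal sketch of that two-byte case, assuming the file is UTF-16BE with no byte-order mark and fits in a single mapping (at most Integer.MAX_VALUE bytes). The class and method names here are made up for illustration:

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MappedRegexSearch {

    /** Counts matches in a UTF-16BE file (no BOM) without copying it onto the heap. */
    static int countMatches(Path file, Pattern pattern) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Throws IllegalArgumentException if the file is larger than Integer.MAX_VALUE bytes.
            MappedByteBuffer bb = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            bb.order(ByteOrder.BIG_ENDIAN);      // the default, stated explicitly
            CharBuffer cb = bb.asCharBuffer();   // each pair of bytes is viewed as one char
            Matcher m = pattern.matcher(cb);     // CharBuffer is a CharSequence
            int count = 0;
            while (m.find()) {
                count++;
            }
            return count;
        }
    }
}
```

For a little-endian file you would set ByteOrder.LITTLE_ENDIAN before calling asCharBuffer().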

On the other hand, if you need to decode the bytes from some other charset, you'll have your work cut out: CharBuffer is charset-agnostic, and CharsetDecoder.decode(ByteBuffer) internally allocates a new CharBuffer roughly the same size as the input bytes, which defeats the purpose of memory mapping.

Whether or not you'll be able to get away with a smaller buffer depends a fair bit on your regex and on what you want to do with the match results. The basic approach would be to implement CharSequence yourself, wrapping the memory-mapped ByteBuffer, a smaller CharBuffer for 'working space', and a CharsetDecoder. You'd use CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) to decode the bytes 'on demand', and hope that the regex matcher generally moves 'forward' through the input and that the matches you're interested in come in fairly small chunks.
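To illustrate the 'on demand' decoding on its own, here is a hedged sketch that pushes a ByteBuffer through a small, reused CharBuffer and only counts the decoded chars. ChunkedDecode and countChars are hypothetical names, and a real consumer would read the chars out of the working buffer before clearing it:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

final class ChunkedDecode {

    /** Decodes all of 'bytes' through a small working buffer and returns the number of chars. */
    static long countChars(ByteBuffer bytes, Charset cs, int bufferSize) throws CharacterCodingException {
        if (bufferSize < 2) {
            // Room for a surrogate pair, otherwise the decoder could never make progress.
            throw new IllegalArgumentException("bufferSize must be at least 2");
        }
        CharsetDecoder decoder = cs.newDecoder();
        CharBuffer work = CharBuffer.allocate(bufferSize);
        long total = 0;
        CoderResult result;
        do {
            // endOfInput = true because 'bytes' already holds all remaining input.
            result = decoder.decode(bytes, work, true);
            if (result.isError()) {
                result.throwException();
            }
            total += work.position();   // chars produced by this call
            work.clear();               // reuse the same small buffer
        } while (result.isOverflow());  // overflow means the working buffer filled up
        do {
            result = decoder.flush(work); // emit anything the decoder still holds
            total += work.position();
            work.clear();
        } while (result.isOverflow());
        return total;
    }
}
```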

As a rough start:

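import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

class MyCharSequence implements CharSequence {

    public MyCharSequence(File file, Charset cs, int bufferSize) throws IOException {
        // The mapping stays valid after the stream and channel are closed.
        try (FileInputStream input = new FileInputStream(file);
             FileChannel channel = input.getChannel()) {
            // A single mapping is limited to Integer.MAX_VALUE bytes;
            // toIntExact fails loudly instead of silently truncating the size.
            this.fileLength = Math.toIntExact(channel.size());
            this.bytes = channel.map(FileChannel.MapMode.READ_ONLY, 0, fileLength);
        }
        this.charBuffer = CharBuffer.allocate(bufferSize);
        this.decoder = cs.newDecoder();
    }

    public int length() {
        // ouch! have to decode the lot, even if you don't choose to keep it all handy
        throw new UnsupportedOperationException("length() is left to implement");
    }

    public char charAt(final int index) {
        // Outline: while the target char has not been decoded into charBuffer yet,
        // call this.decoder.decode(this.bytes, this.charBuffer, true) to decode more.
        // don't assume 2-bytes == a char unless that's true for your charset!
        throw new UnsupportedOperationException("charAt() is left to implement");
    }

    public CharSequence subSequence(final int start, final int end) {
        // this'll be fun, too
        throw new UnsupportedOperationException("subSequence() is left to implement");
    }

    private int fileLength;
    private MappedByteBuffer bytes;
    private CharBuffer charBuffer;
    private CharsetDecoder decoder;

}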
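One hypothetical way to fill in a skeleton like the one above is sketched here. It is not a tuned implementation: WindowedCharSequence is a made-up name, it takes any ByteBuffer (for example a mapped one), it replaces malformed input rather than reporting it, it skips the decoder's flush step, and it restarts decoding from the beginning whenever the matcher asks for a char before the current window, which can be very slow for patterns that backtrack a lot.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class WindowedCharSequence implements CharSequence {
    private final ByteBuffer bytes;        // independent view starting at the caller's position
    private final CharBuffer window;       // holds chars [windowStart, windowStart + window.limit())
    private final CharsetDecoder decoder;
    private int windowStart = 0;
    private int length = -1;               // computed lazily, then cached

    WindowedCharSequence(ByteBuffer source, Charset cs, int bufferSize) {
        if (bufferSize < 2) {
            throw new IllegalArgumentException("bufferSize must be at least 2");
        }
        this.bytes = source.slice();
        this.window = CharBuffer.allocate(bufferSize);
        this.window.limit(0);              // nothing decoded yet
        this.decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    /** Decodes the next window; returns false once the input is exhausted. */
    private boolean advance() {
        windowStart += window.limit();
        window.clear();
        decoder.decode(bytes, window, true);
        window.flip();
        return window.limit() > 0;
    }

    /** Starts decoding again from the first byte. */
    private void rewind() {
        bytes.rewind();
        decoder.reset();
        windowStart = 0;
        window.limit(0);
    }

    @Override
    public int length() {
        if (length < 0) {
            while (advance()) {
                // decode the lot once, keeping only the running char count
            }
            length = windowStart;
        }
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (index < windowStart) {
            rewind();                      // the matcher moved backwards past the window
        }
        while (index >= windowStart + window.limit()) {
            if (!advance()) {
                throw new IndexOutOfBoundsException("index " + index);
            }
        }
        return window.get(index - windowStart);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end < start || end > length()) {
            throw new IndexOutOfBoundsException("start " + start + ", end " + end);
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();              // match results are copied onto the heap
    }
}
```

You would pass an instance to Pattern.matcher() like any other CharSequence. Keeping the previous window around as well would avoid most of the restarts, at the cost of more bookkeeping.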

It might be instructive to wrap a fully-decoded CharBuffer in a much simpler CharSequence wrapper of your own, and log how the methods are actually called for your given input, when you run it with a big heap on your development box. That will give you an idea if this approach is going to work for your particular scenario.
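A rough sketch of such a diagnostic wrapper follows. Instead of logging every call (which would be enormous for a big file), this hypothetical ProbingCharSequence keeps counters and the largest backwards jump the matcher made, which hints at how big the 'working space' would need to be:

```java
final class ProbingCharSequence implements CharSequence {
    private final CharSequence delegate;   // e.g. a fully decoded CharBuffer
    long lengthCalls;
    long charAtCalls;
    long subSequenceCalls;
    int highestIndex = -1;                 // furthest char requested so far
    int maxBackwardJump;                   // largest distance behind highestIndex

    ProbingCharSequence(CharSequence delegate) {
        this.delegate = delegate;
    }

    @Override
    public int length() {
        lengthCalls++;
        return delegate.length();
    }

    @Override
    public char charAt(int index) {
        charAtCalls++;
        if (index > highestIndex) {
            highestIndex = index;
        } else {
            maxBackwardJump = Math.max(maxBackwardJump, highestIndex - index);
        }
        return delegate.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        subSequenceCalls++;
        return delegate.subSequence(start, end);
    }

    @Override
    public String toString() {
        return delegate.toString();
    }
}
```

After running your real regex over a representative file, a small maxBackwardJump relative to the file size suggests the windowed approach is viable; a value close to the file length suggests it is not.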

I don't know Java, but do you expect a single match to span the entire contents of the file, as with /^.+$/ ?
Or does the file break into chunks according to your regex, and you just don't know where the boundaries fall?
Regex engines vary here; if Java's can run against a memory-mapped file, that would be a good start.

Let's see your regex. Typically you can examine a regex, identify two anchor points, and use them as the cutoff for a floating buffer: the overflow (overlap) is carried over and the window is moved further down the file.

I've done this several times in my Perl modules. Unless the regex is anchored at both the beginning and the end of the file, it's easy to do.
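In Java, the floating-buffer idea might look roughly like the sketch below. It rests on assumptions that are not in the thread: no match is longer than maxMatchLength chars, matches are non-empty, and the pattern uses no anchors or lookaround that could behave differently at a window edge. SlidingWindowMatcher and findAll are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SlidingWindowMatcher {

    /** Finds all matches while holding only about chunkSize + maxMatchLength chars in memory. */
    static List<String> findAll(Reader in, Pattern pattern, int chunkSize, int maxMatchLength)
            throws IOException {
        List<String> matches = new ArrayList<>();
        StringBuilder window = new StringBuilder();
        char[] chunk = new char[chunkSize];
        boolean eof = false;
        while (!eof) {
            int n = in.read(chunk);
            if (n < 0) {
                eof = true;
            } else {
                window.append(chunk, 0, n);
            }
            // A match starting at or after safeEnd could be cut off by the window edge,
            // so it is deferred until more input has arrived (or the input has ended).
            int safeEnd = eof ? window.length() : Math.max(0, window.length() - maxMatchLength);
            int consumed = 0;
            Matcher m = pattern.matcher(window);
            while (m.find() && m.start() < safeEnd) {
                matches.add(m.group());
                consumed = m.end();
            }
            // Carry the overlap over and move the window further down the input.
            window.delete(0, Math.max(consumed, safeEnd));
        }
        return matches;
    }
}
```

The Reader takes care of charset decoding, so this variant sidesteps the CharSequence problem entirely, at the price of needing a known upper bound on match length.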
