
Performing regex on a stream

I have some large text files on which I'm going to perform consecutive matching (just capturing, not replacing). I don't think it's a good idea to keep the whole file in memory, so I'd rather use a Reader.

What I know about the input is that a match, if there is one, will not span more than 5 lines. So my idea was to have some sort of buffer that keeps just these 5 lines or so, do the first search, and continue. But for this to work it has to "know" where the regex match ended; e.g. if the match ends at line 2, the next search should start from there. Is it possible to do something like this in an efficient way?


You could use a Scanner and the findWithinHorizon method:

Scanner s = new Scanner(new File("thefile"));
String nextMatch = s.findWithinHorizon(yourPattern, 0);

From the API documentation for findWithinHorizon:

If horizon is 0, then the horizon is ignored and this method continues to search through the input looking for the specified pattern without bound. In this case it may buffer all of the input searching for the pattern.

A side note: When matching on multiple lines, you might want to look at the constants Pattern.MULTILINE and Pattern.DOTALL.
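As a rough sketch of how a bounded horizon could keep memory use small, the loop below calls findWithinHorizon repeatedly and, when nothing is found within the horizon, skips one line and tries again (a failed search leaves the scanner position unchanged, so something has to advance it). The class name HorizonSearch, the helper findAll and the BEGIN...END sample pattern are made up for illustration. The sketch assumes the pattern never matches the empty string and that the horizon is at least the longest line plus the longest possible match; otherwise a match could be skipped or cut short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class HorizonSearch {

    // Collects the matches of `pattern` in `source`, never searching more than
    // `horizon` characters beyond the scanner's current position.
    // Assumption: `pattern` does not match the empty string.
    static List<String> findAll(Readable source, Pattern pattern, int horizon) {
        List<String> matches = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            while (true) {
                String match = scanner.findWithinHorizon(pattern, horizon);
                if (match != null) {
                    // The scanner is now positioned just after the match,
                    // so the next search starts where this one ended.
                    matches.add(match);
                } else if (scanner.hasNextLine()) {
                    // No match within the horizon: drop one line and retry.
                    scanner.nextLine();
                } else {
                    break; // end of input
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "noise\nBEGIN a\nb END\nnoise\nnoise\nBEGIN c END\n";
        // DOTALL lets '.' match line terminators, so a match may span lines.
        Pattern p = Pattern.compile("BEGIN.*?END", Pattern.DOTALL);
        for (String m : findAll(new StringReader(text), p, 64)) {
            System.out.println("[" + m + "]");
        }
    }
}
```

For a real file, the StringReader would be replaced by a Reader or File opened on it, and the horizon chosen from what is known about the input (here, roughly five lines' worth of characters plus one line).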


Streamflyer is able to apply regular expressions on character streams.

Note that I'm the author of it.


Java's regular expression engine does not look well suited to stream processing.

I would instead advocate another approach based on "derivative combinators".

The researcher Matt Might has published relevant posts about derivative combinators on his blog and suggests a Scala implementation here:

  • http://matt.might.net/articles/parsing-with-derivatives/
  • http://matt.might.net/articles/nonblocking-lexing-toolkit-based-on-regex-derivatives/

For my part, I managed to extend this implementation with a "capture" ability, but I suspect it could have a significant impact on memory consumption.
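To give an idea of why derivatives suit streams, here is a minimal Java sketch of a Brzozowski-style derivative matcher. It is not the Scala code from the posts above and has no capture support; the class Re and its helpers (chr, seq, alt, star, matches) are names invented for this illustration. It only answers whether the whole input matches, and it does little simplification, so the derivative term can grow for some patterns.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

abstract class Re {
    abstract boolean nullable();   // does this regex accept the empty string?
    abstract Re derive(char c);    // regex for what may follow after reading c

    static final Re EMPTY = new Re() {   // matches nothing at all
        boolean nullable() { return false; }
        Re derive(char c) { return this; }
    };
    static final Re EPS = new Re() {     // matches only the empty string
        boolean nullable() { return true; }
        Re derive(char c) { return EMPTY; }
    };

    static Re chr(final char x) {        // a single literal character
        return new Re() {
            boolean nullable() { return false; }
            Re derive(char c) { return c == x ? EPS : EMPTY; }
        };
    }

    static Re seq(final Re a, final Re b) {   // a followed by b
        if (a == EMPTY || b == EMPTY) return EMPTY;
        if (a == EPS) return b;
        if (b == EPS) return a;
        return new Re() {
            boolean nullable() { return a.nullable() && b.nullable(); }
            Re derive(char c) {
                Re left = seq(a.derive(c), b);
                // if a can be empty, c may also be the first character of b
                return a.nullable() ? alt(left, b.derive(c)) : left;
            }
        };
    }

    static Re alt(final Re a, final Re b) {   // a or b
        if (a == EMPTY) return b;
        if (b == EMPTY) return a;
        return new Re() {
            boolean nullable() { return a.nullable() || b.nullable(); }
            Re derive(char c) { return alt(a.derive(c), b.derive(c)); }
        };
    }

    static Re star(final Re a) {              // zero or more repetitions of a
        return new Re() {
            boolean nullable() { return true; }
            Re derive(char c) { return seq(a.derive(c), this); }
        };
    }

    // Feeds the reader to the regex one character at a time. Only the current
    // derivative is kept; the text read so far is never stored.
    static boolean matches(Re re, Reader in) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            re = re.derive((char) c);
            if (re == EMPTY) return false;   // dead state: nothing can match now
        }
        return re.nullable();
    }

    public static void main(String[] args) throws IOException {
        // a(b|c)*d built by hand, since this sketch has no regex parser
        Re re = seq(chr('a'), seq(star(alt(chr('b'), chr('c'))), chr('d')));
        System.out.println(matches(re, new StringReader("abcbd")));
        System.out.println(matches(re, new StringReader("abca")));
    }
}
```

Because each step depends only on the current derivative and the next character, the input can come from any Reader without buffering. Adding captures means remembering positions or text alongside the derivative, which is where extra memory use would come from.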


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScannerReader {

    public static void main(String[] args) {
        try {
            readDataFromFileTestRegex("[A-Za-z_0-9-%$!]+@[A-Za-z_0-9-%!$]+\\.[A-Za-z]{2,4}",
                                      "C:\\Users\\Admin\\Desktop\\TextFiles\\Emails.txt",
                                      "C:\\Users\\Admin\\Desktop\\TextFiles\\output.txt");
        } catch (IOException e) {
            System.out.println("Could not read or write the file");
            e.printStackTrace();
        }
    }

    // Matches the regex against the input one line at a time, so only a single
    // line is held in memory; a match spanning several lines will not be found.
    public static void readDataFromFileTestRegex(String theReg, String fileToRead, String fileToWrite) throws IOException {
        Pattern p = Pattern.compile(theReg);
        // try-with-resources closes both files even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(fileToRead));
             PrintWriter pout = new PrintWriter(fileToWrite)) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    if (m.group().length() != 0) {
                        System.out.println(m.group().trim());
                    }
                    System.out.println("Start index: " + m.start());
                    System.out.println("End index  : " + m.end());
                    pout.println(m.group());  // print the result to the output file
                }
            }
        }
    }
}


Maybe Scanner.findAll() (available since Java 9) is what you are looking for. It simplified my code.

try(var scanner = new Scanner(Path.of(path), StandardCharsets.UTF_8)){
    var result = scanner.findAll(PATTERN)
                .map(MatchResult::group)
                .collect(Collectors.toSet());
}


With Java 8 you can do this pretty simply, and possibly in parallel:

// Create the pattern
private static final Pattern emailRegex = Pattern.compile("([^,]+?)@([^,]+)");

// Read the content of the file (note: this loads the whole file into one string)
String fileContent = Files.lines(Paths.get("/home/testFile.txt"))
                          .collect(Collectors.joining(" "));
// Apply the pattern; matcherStream is a helper that is not shown here and is
// assumed to return a Stream<String[]> holding the groups of each match
List<String> results = matcherStream(emailRegex.matcher(fileContent))
                           .map(b -> b[2])
                           .collect(Collectors.toList());

Another way can be:

List<String> results = Files.lines(Paths.get("/home/testFile.txt"))
                            .parallel()
                            .filter(s -> emailRegex.matcher(s).find())  // use the regex on each line
                            .collect(Collectors.toList());