
Splitting up visual blocks of text in Java

I have a block of text I'm trying to interpret in Java (or with grep/awk/etc.) that looks like the following:

   Somewhat differently, plaques of the rN8 and rN9 mutants            and human coronavirus OC43 as well as the more divergent
   were of fully wild-type size, indicating that the suppressor mu-    SARS-CoV, human coronavirus HKU1, and bat coronaviruses
   tations, in isolation, were not noticeably deleterious to the       HKU4, HKU5, and HKU9 (Fig. 6B). Thus, not only do mem-
   --
   able effect on the viral phenotype. A potentially related obser-    sented for the existence of an interaction between nsp9
   vation is that the mutation A2U, which is also neutral by itself,   nsp8 (56). A hexadecameric complex of SARS-CoV nsp8 and
   is lethal in combination with the AACAAG insertion (data not        nsp7 has been found to bind to double-stranded RNA. The

What I'd like to do is split it into two parts: left and right. I'm having trouble coming up with a regex or any other method that would split a block of text that is obviously split visually, but not obviously so to a programming language. The lengths of the lines are variable.

I've considered looking for the first block and then finding the second by looking for multiple spaces, but I'm not sure that that's a robust solution. Any ideas, snippets, pseudocode, links, etc.?

Text Source


The text has been run through pdftotext as follows: pdftotext -layout MyPdf.pdf


Blur the text and come up with an array of the character density per column of text. Then look for gaps and split there.

import java.util.ArrayList;
import java.util.List;

// Replace single spaces between non-space characters with '.' so that only
// runs of two or more spaces survive as candidate column gaps.
String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
String[] blurredLines = blurredText.split("\r\n?|\n");

int maxRowLength = 0;
for (String blurredLine : blurredLines) {
  maxRowLength = Math.max(maxRowLength, blurredLine.length());
}

int[] columnCounts = new int[maxRowLength];
for (String blurredLine : blurredLines) {
  for (int i = 0, n = blurredLine.length(); i < n; ++i) {
    if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; } 
  }
}    

// Look for runs of zero of at least length 3.
// Alternatively, you might look for the n longest runs of zeros.
// Alternatively, you might look for runs of length min(columnCounts) to ignore
// horizontal rules.

int minBreakLen = 3;  // A tuning parameter.
List<Integer> breaks = new ArrayList<Integer>();
for (int i = 0; i < maxRowLength - minBreakLen; ++i) {
  if (columnCounts[i] != 0) { continue; }
  int runLength = 1;
  while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
    ++runLength;
  }
  if (runLength >= minBreakLen) {
    breaks.add(i);
  }
  i += runLength - 1;
}

System.out.println(breaks);
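Once the break columns are known, each original line can be cut at the first gap to recover the two blocks. A minimal sketch of that step (my own addition, not part of the answer above; it assumes at least one break was found and simply keeps short lines in the left column):

if (!breaks.isEmpty()) {
  int breakCol = breaks.get(0);  // start column of the first detected gap
  StringBuilder left = new StringBuilder();
  StringBuilder right = new StringBuilder();
  for (String line : text.split("\r\n?|\n")) {
    String l = line.length() > breakCol ? line.substring(0, breakCol) : line;
    String r = line.length() > breakCol ? line.substring(breakCol) : "";
    left.append(l.trim()).append('\n');   // left-hand column text
    right.append(r.trim()).append('\n');  // right-hand column text
  }
  System.out.println(left);
  System.out.println(right);
}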


I doubt there is any robust solution to this. I would go for some sort of heuristic approach.

Off the top of my head, I would calculate a histogram of the column index of the first character of each word, and split on the column with the highest score (the idea being to find lots of words whose first characters line up in the same column). I might also choose to weight this based on the number of preceding spaces.
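A minimal sketch of that heuristic in Java (my own illustration, assuming the raw block is in a String named text; the weighting by preceding spaces is left out):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Histogram the column index of every word that is preceded by two or more
// spaces, then treat the most frequent column as the start of the right block.
String[] lines = text.split("\r\n?|\n");
int width = 0;
for (String line : lines) {
  width = Math.max(width, line.length());
}
int[] histogram = new int[width + 1];
Pattern wordAfterGap = Pattern.compile("  +(\\S)");
for (String line : lines) {
  Matcher m = wordAfterGap.matcher(line);
  while (m.find()) {
    ++histogram[m.start(1)];
  }
}
int splitCol = 0;
for (int c = 0; c < histogram.length; ++c) {
  if (histogram[c] > histogram[splitCol]) { splitCol = c; }
}
// Everything before splitCol is the left column; the rest is the right column.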


I work in this general area. I am surprised that a double-column bioscience text of recent times (SARS, etc.) would be rendered in double-column monospace as its original form - it would be typeset in a proportional font or in HTML. So I suspect your text came from some other format (such as PDF). If so, then you should try to get that format. PDF is horrible to parse, but PDF flattened to monospace is probably worse.

If you possibly can, find someone who has worked in the area and see what they have done. If you have multiple documents (e.g. from different journals or reports) then your problem is worse. Yes, I could write an algorithm to solve the example you have posted, but my guess is it will break on the next set of documents. You will end up customising this for each different source (I and others have had to do this).

UPDATE: Thanks. As it's PDF, I would start by asking around. We collaborate with the group at Penn State (who have also done CiteSeer). I also have colleagues at Cambridge who have spent months on a PDF reader.

If you want to do it yourself - and it will take time - then I'd start with PDFBox. I've done quite a lot with this and I think it's better for this than pdf2text or pdftotext. I can't remember whether it has a double-column option - I think so.
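For illustration, a minimal extraction sketch with PDFBox, assuming the 2.x API (class and package names differ between versions); it only pulls the text out, so column splitting such as the approaches above would still be a separate step:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Extract raw text from a PDF with PDFBox (sketch, assuming PDFBox 2.x).
public static String extractText(File pdf) throws IOException {
  try (PDDocument doc = PDDocument.load(pdf)) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);  // keep output close to the page layout
    return stripper.getText(doc);
  }
}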

UPDATE: Here is a recent answer covering several ways of tackling double-column PDF: http://metaoptimize.com/qa/questions/3943/methods-for-extracting-two-column-text-from-a-pdf I'd certainly see what other people have done.

FWIW I spend a lot of time trying to convince people that scientists should not create their output in PDF because it destroys machine parsing - as you and I have found.

UPDATE: You get the PDFs from your PI (== Principal Investigator?), in which case you'll get lots of different sources, which makes it worse.

What is the real problem you are trying to solve? I may be able to help.
