
Splitting up visual blocks of text in Java

I have a block of text I'm trying to interpret in Java (or with grep/awk/etc.) that looks like the following:

   Somewhat differently, plaques of the rN8 and rN9 mutants            and human coronavirus OC43 as well as the more divergent
   were of fully wild-type size, indicating that the suppressor mu-    SARS-CoV, human coronavirus HKU1, and bat coronaviruses
   tations, in isolation, were not noticeably deleterious to the       HKU4, HKU5, and HKU9 (Fig. 6B). Thus, not only do mem-
   --
   able effect on the viral phenotype. A potentially related obser-    sented for the existence of an interaction between nsp9
   vation is that the mutation A2U, which is also neutral by itself,   nsp8 (56). A hexadecameric complex of SARS-CoV nsp8 and
   is lethal in combination with the AACAAG insertion (data not        nsp7 has been found to bind to double-stranded RNA. The

What I'd like to do is split it into two parts: left and right. I'm having trouble coming up with a regex or any other method that would split a block of text that is obviously split visually, but not obviously so to a programming language. The lengths of the lines are variable.

I've considered looking for the first block and then finding the second by looking for multiple spaces, but I'm not sure that that's a robust solution. Any ideas, snippets, pseudocode, links, etc.?

Text Source


The text has been run through pdftotext as follows: pdftotext -layout MyPdf.pdf


Blur the text and come up with an array of the character density per column of text. Then look for gaps and split there.

import java.util.ArrayList;
import java.util.List;

// Replace single spaces between non-space characters with '.' so that only
// runs of two or more spaces survive as candidate column gaps.
String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
String[] blurredLines = blurredText.split("\r\n?|\n");

int maxRowLength = 0;
for (String blurredLine : blurredLines) {
  maxRowLength = Math.max(maxRowLength, blurredLine.length());
}

int[] columnCounts = new int[maxRowLength];
for (String blurredLine : blurredLines) {
  for (int i = 0, n = blurredLine.length(); i < n; ++i) {
    if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; } 
  }
}    

// Look for runs of zero of at least length 3.
// Alternatively, you might look for the n longest runs of zeros.
// Alternatively, you might look for runs of length min(columnCounts) to ignore
// horizontal rules.

int minBreakLen = 3;  // A tuning parameter.
List<Integer> breaks = new ArrayList<Integer>();
for (int i = 0; i < maxRowLength - minBreakLen; ++i) {
  if (columnCounts[i] != 0) { continue; }
  int runLength = 1;
  while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
    ++runLength;
  }
  if (runLength >= minBreakLen) {
    breaks.add(i);
  }
  i += runLength - 1;
}

System.out.println(breaks);
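Once the break columns are known, each original line can be cut at the first gap to recover the two blocks. A minimal sketch of that step (my own addition, not part of the answer above; it assumes at least one break was found and simply keeps short lines in the left column):

if (!breaks.isEmpty()) {
  int breakCol = breaks.get(0);  // start column of the first detected gap
  StringBuilder left = new StringBuilder();
  StringBuilder right = new StringBuilder();
  for (String line : text.split("\r\n?|\n")) {
    String l = line.length() > breakCol ? line.substring(0, breakCol) : line;
    String r = line.length() > breakCol ? line.substring(breakCol) : "";
    left.append(l.trim()).append('\n');   // left-hand column text
    right.append(r.trim()).append('\n');  // right-hand column text
  }
  System.out.println(left);
  System.out.println(right);
}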


I doubt there is any robust solution to this. I would go for some sort of heuristic approach.

Off the top of my head, I would calculate a histogram of the column index of the first character of each word, and split on the column with the highest score (the idea being to find lots of words whose first characters line up in the same column). I might also choose to weight this based on the number of preceding spaces.
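A minimal sketch of that heuristic in Java (my own illustration, assuming the raw block is in a String named text; the weighting by preceding spaces is left out):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Histogram the column index of every word that is preceded by two or more
// spaces, then treat the most frequent column as the start of the right block.
String[] lines = text.split("\r\n?|\n");
int width = 0;
for (String line : lines) {
  width = Math.max(width, line.length());
}
int[] histogram = new int[width + 1];
Pattern wordAfterGap = Pattern.compile("  +(\\S)");
for (String line : lines) {
  Matcher m = wordAfterGap.matcher(line);
  while (m.find()) {
    ++histogram[m.start(1)];
  }
}
int splitCol = 0;
for (int c = 0; c < histogram.length; ++c) {
  if (histogram[c] > histogram[splitCol]) { splitCol = c; }
}
// Everything before splitCol is the left column; the rest is the right column.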


I work in this general area. I am surprised that a double-column bioscience text of recent times (SARS, etc.) would be rendered in double-column monospace as its original form - it would be typeset in a proportional font or in HTML. So I suspect your text came from some other format (such as PDF). If so, then you should try to get that format. PDF is horrible to parse, but PDF flattened to monospace is probably worse.

If you possibly can, find someone who has worked in the area and see what they have done. If you have multiple documents (e.g. from different journals or reports) then your problem is worse. Yes, I could write an algorithm to solve the example you have posted, but my guess is it will break on the next set of documents. You will end up customising this for each different source (I and others have had to do this).

UPDATE: Thanks. As it's PDF, I would start by asking around. We collaborate with the group at Penn State (who have also done CiteSeer). I also have colleagues at Cambridge who have spent months on a PDF reader.

If you want to do it yourself - and it will take time - then I'd start with PDFBox. I've done quite a lot with this and I think it's better for this than pdf2text or pdftotext. I can't remember whether it has a double-column option - I think so.
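For illustration, a minimal extraction sketch with PDFBox, assuming the 2.x API (class and package names differ between versions); it only pulls the text out, so column splitting such as the approaches above would still be a separate step:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Extract raw text from a PDF with PDFBox (sketch, assuming PDFBox 2.x).
public static String extractText(File pdf) throws IOException {
  try (PDDocument doc = PDDocument.load(pdf)) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);  // keep output close to the page layout
    return stripper.getText(doc);
  }
}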

UPDATE: Here is a recent answer covering several ways of tackling double-column PDF: http://metaoptimize.com/qa/questions/3943/methods-for-extracting-two-column-text-from-a-pdf I'd certainly see what other people have done.

FWIW I spend a lot of time trying to convince people that scientists should not create their output in PDF because it destroys machine parsing - as you and I have found.

UPDATE: You get the PDFs from your PI (== Principal Investigator?), in which case you'll get lots of different sources, which makes it worse.

What is the real problem you are trying to solve? I may be able to help.
