开发者

Detecting empty pages in scanned documents

So we need to detect whether an image, created by a scanner, represents an empty page. I'm way out of my depth when it comes to image processing, so I have to run this by the community.

Here's what I have come up with so far:

  • Empty pages can be glaringly white, gray recycled paper, or yellowed old paper. The current idea is to create a histog开发者_如何学编程ram for a page, look for a steep increase of the curve, and get the percentage of pixels are darker than that. If that exceeds a threshold, the page is likely not empty.

  • Since this would likely classify a page containing a single line of text at the top as empty, we would tile the page and gather statistics about each tile.

  • We would need to detect scanned staplers and holes from binding (likely only in certain tiles), but this can be put off to some later stage. However, if you have an idea what to look out for besides these two, please mention it in a comment.

  • This needs to be fast. It's part of a document processing workflow that processes (tens of) thousands of pages per day. If processing a page takes ten seconds longer, than our customers will have to tell their customers that they'll have to wait several days longer for their results. (If this results in more false positives, some customers would rather have someone check a few dozen found "empty" pages, than have their customer wait one more day.)

So here's my questions:

  1. Is it a good idea to take this route or is there something better?

  2. If we do it this way, how would I do this? What's a good (cheap) algorithm for finding a threshold for a page? Could we gain significant speed by assuming a similar threshold for a batch of documents? To which precision could brightness values be rounded, before getting logged? What quirks could we expect?


If you know that a scanned page is going to fill the image entirely, then calculating the standard deviation might be a good way of doing this.

I would suggest blurring page slightly to reduce some noise. Then calculate the SD for the page, in theory, a page the is more or less all one colour will have a low SD and one with lots of text will have a higher SD. Then it's a case of 'training' the system to work out when a page is plain and when it is text. You might find that certain pages are hard for it to tell.

You could have it trained by having it process a vast number of pages, and it goes through them all, and you say if it is plain or not.

EDIT

ok, a white page with black text, if we have just the page and no surrounding stuff, will have a mean colour of grey, probably a fairly light grey. Getting the average is a for loop through all the pixels, adding their values and then dividing by the number of pixels. I'm not good with this o(logN) stuff, but suffice to say, it will not that long. Unless you have HUGE images.

SD is a second for loop, this time we are counting up how different each pixel is from the mean, and then dividing by the mean. This will take a bit longer then the mean, as we have to do something like

diff = thispixel - mean;
if(diff < 0) {
    diff = -diff;
}
runningTotal += diff;

For a plain coloured page, each pixel will be close to the mean value, thus our SD will be low. If the SD is below a certain value, we can assume that this means the page is all one colour.

This might have problems if their is very minimal amount of text, as it will not have a large influence on the SD, so maybe like you suggested in the question, break the page into sections. I suggest strips horizontally, as text tends to go this way. If we do one of these strips one at a time, once one strip suggests it has text, we can stop as we don't care if the rest is blank or not.

Blurring the page will help reduce noise, as the odd pixel of noise will be reduced in its impact, thus give you a 'tighter' SD. You could also use it to reduce the resolution of your image.

Say you sauce image is 300 wide by 900 high, you could sample pixels in blocks of nine, 3 *3, and thus end up with an image that is 100 wide by 300 high, so it can actually be used to reduce the amount of calculations you need to do, in this case by a ninth!

The main problem is going to be in working out how high an SD can be with just a plain page. Maybe have it find the SD of a load of blank pages.

By the sounds of it, you are probably going to want to have a middle ground that lets it be unsure and ask for human intervention, possibly letting the human value train the system to get better?


Perform some sort of simple edge detection. If the number of pixels constituting edges is below some threshold, then there's going to be a high probability the page is empty. This could be improved by classifying certain edges that correspond with high certainty (by shape and location) to punched holes and staples as trivial and discounting them from the metric.


When I worked for a document processor (~8 years ago), we handled client projects varying from very "clean" only-US-letter-sized pages to cover-/cardstock of irregular shapes mixed with normal pages. Operators fed pre-sorted files into scanning machines and only had to watch for folded corners and similar mechanical problems. Their output was multiple streams of hundreds of images corresponding to a range of files. A single scanner operator could easily scan 15k pieces of paper in a shift (that's only 0.60 pages/sec, while a scanner at speed could handle 3 pages/sec and still scan both sides). Later operators processed those looking for key pages to mark file start and end. (Image recognition can be used here, sometimes, but people also provide a quality check on the first operators.) We had many variables that could be set per client project.

I'm basing the rough outline below on that experience and how it appears that your goals and workflow are similar.

(Terminology: By client I mean our client, e.g. a specific bank. A project or client project is a set of documents from that client that contains many files, e.g. all mortgages handled by a specific branch in a given year. A file is a logical arrangement that would normally be a physical file folder for one of the client's customers, e.g. all mortgage papers for one address.)

  1. Cut off the top, bottom, sides, and corners. Throw these out of your calculations (even though you'll probably store them in the final image). This will cover staple holes, binder holes, but also (tiny) folded corners and very minute torn edges which appear as black spots. Depending on how you're scanning, the last two may be less of a problem.

    Vary the sizes of these cuts for each client project, as required. For example, even a very thin edge slice, say 1-2mm, will eliminate most ragged edges without increasing false positive rate.

  2. Convert to black and white, 1 bit per pixel. I suspect you are already doing this for some client projects anyway, so doing this efficiently and effectively, which can be subtle, should be no extra work. (Even if you don't store the 1bpp image as the deliverable result, the conversion will be helpful in empty page detection.) Eliminate noise by dropping any black pixels with none or only one black neighbor (using all surrounding 8 as neighbors).

    After cutting extremities (#1) and this simplistic noise reduction, blank pages will have a very low number of black pixels; most blanks will have no black pixels at all – exempting exceptionally poor page quality, inked stamps (when scanning back-sides, mentioned more below), or other circumstances across the whole project, and so forth.

  3. Depending on client project, you may set hotspots to be watched – the converse of cutting off the sides. For example, watching a 1" strip where a single line at the top of the page would appear may reduce false positives. A low contrast scan or faded hardcopy (perhaps even pencil, which can be common on back-sides) with only one line of text will be distinguished from a blank page this way.

    What sections are worth watching depends on each project, but do not try to divide the page up into tiles and then subdivide those tiles into areas of interest. Instead, parallelize this on the page level; e.g. 1 worker per core, each worker handles a full page at a time.

  4. Depending on how you're keying individual files, you may find it helpful to drop blanks (before marking start-of-file pages, which is still often a manual process even at high volume) then watch for blank pages at unexpected points after files have been keyed (e.g. expected would be the last page of the file, without being two blanks in a row, etc.).

    For example, if a particular project is only scanning one side of each page, then detecting two blank pages in a row is a good indication that a couple pages in a file were flipped upside-down (clients often hand over hardcopy files like this). Either the sorters (who remove things like staples and paperclips) or the first machine operators should have caught this mistake, but, regardless, it will now need a manual check to verify.

    On the other hand, there were projects that had very clean files so sorters could insert (usually colored) blank pages marking file boundaries. In this case, the second set of people still did the keying by file number, but only had to examine the first page of each file. This wasn't rare, but not common either.

Before I start rambling a bit, I hope my main point comes across: you have to decide how to mitigate rates of false positives (= data loss) and false negatives (= annoying blanks and otherwise harmless, but a maximum allowed rate may still be specified in the project contract). That varies drastically by project and the type of files/documents you're handling, but it guides you in how to do the detection. You will get much better results from a tailored approach than trying one-size-fits-all, even if the tailored approaches are 80-98% similar.

If you're delivering 1bpp images to the client, for example, you might not even want/need to eliminate blanks as filesize (and ultimately size of the delivered dataset) won't be an issue. This can be an acceptable trade-off when eliminating most blanks is harder while maintaining a low false positive rate; such as for files with inked stamps ("received on", "accepted", "due date", etc.; they bleed through to the back) or other problems, for example.


My fall class does a bunch of image-processing projects. Here's what I would try:

  1. Project from color to grayscale
  2. Pour all the pixels into a simple histogram with say 100 buckets between 0 and 1
  3. Find a local minimum in the histogram such that the absoluete value of above - below is as small as possible, where above is the number of brighter pixels and below is the number of darker pixels
  4. Force the above pixels to white and the below pixels to black
  5. If you like, as an extra step you could remove black edges
  6. If there are hardly any black pixels, the page is blank

The first two steps should be combined, and they are the only time-consuming steps; on a 600dpi images you may have to touch many millions of pixels. The rest will be lightning fast. I'd be very surprised if you can't classify multiple images per second—especially if you know there will be no black edges.

The only part that requires training or experiment is the last step. It's also possible that you will need to fiddle around with the number of buckets in the histogram; if there are too many buckets, you may have a bad local minimum.

Good luck, and report back to us how you make out!


Check out this line detection algorithm: http://homepages.inf.ed.ac.uk/rbf/HIPR2/linedet.htm. In addition to a detailed explanation of how the algo works there's a demo where you can use your own image and see the results. I tried two images: 1) a B&W scan of a receipt, 2) the B&W, "blank" back side of that same receipt. All of the edge detection algorithms I tried found edges on the "blank" page. But, this line detection algorithm was the only algorithm that correctly found lines on the front page and yet didn't find anything on the "blank" back page.


It looks as if you're trying to convert all paperwork for a company into digital documents. Some of this paper can be really old.

Say your text is black, and any other color is the background. If you take two weighted averages, one consisting of what you think is the text, and one consisting of the background, you can compare those two and see if they're distant enough to consider further evaluation. This will removing any uneven aging of the paper.

Staple holes and punched holes in paper are pretty standard in size, but they'd show up as gray or not at all if you're scanning on a white background. If not, then you can guess where these are and remove them.

Now, we look at areas of high interest, areas where the black pixels are the most dense. Select a portion of that and OCR it. Place the starting top-left closest to an area where text begins. On a typical document, a solid blank linear area going left-to-right and another going top-to-bottom denotes the top and left sides of a paragraph. You can be sure that you got a line of text because below a line of text is another blank left-to-right area. So you don't need to worry about selecting a portion that will chop text in half.


You could take the mean gray level (integer) of each few rows of the scanned image (depending on the resolution and how many lines are required to capture one line of text), then consider the spread of row means. If there is no text on the page, the spread of means should be small (i.e. background ranges from 250-255), and if there is text on the whole page or on part of the page, the spread would be much larger (i.e. 15 for text to 250 for background).

Seems to me like the solution should be computationally simple due to the large number of pages to check. Approaches requiring further processing (edge detection, filtering, etc) seem like overkill, and will take much longer to run.

There is no need to process pixel by pixel, using matrices will help this be more efficient, for example using Numpy you can calculate means, sums, etc. for entire rows, columns or matrices at once much more efficiently. There is also no need to process EVERY pixel, a good sample of rows should be able to accomplish the task with similar accuracy. 8bit accuracy should be fine, and you could even resample to large pixels before running this processing algorithm.


You can do a noisy trim, i.e. blur the image and do an auto-trim (without actually modifying the image). If the width or height of the trim result is below a threshold (e.g. 80 to 100 for a 600 dpi image) then the page is empty.

A proof of concept using the ImageMagick command line front-end:

$ convert scan.png -shave 300x0 -virtual-pixel White -blur 0x15 -fuzz 15% \
    -trim info:

The above command assumes a 600 dpi DIN A4 black and white (1 Bit) image. It also ignores a margin of 300 pixels such that artifacts like perforation holes don't yield false negatives.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜