Optimized OCR black/white pixel algorithm
I am writing a simple OCR solution for a finite set of characters. That is, I know exactly how all 26 letters of the alphabet will look. I am using C# and can easily determine whether a given pixel should be treated as black or white.
I am generating a matrix of black/white pixels for every single character. So for example, the letter I (capital i), might look like the following:
01110
00100
00100
00100
01110
Note: all points used later in this post assume that the top-left pixel is (0, 0) and the bottom-right pixel is (4, 4). 1's represent black pixels, and 0's represent white pixels.
I would create a corresponding matrix in C# like this:
CreateLetter("I", new List<List<bool>>() {
new List<bool>() { false, true, true, true, false },
new List<bool>() { false, false, true, false, false },
new List<bool>() { false, false, true, false, false },
new List<bool>() { false, false, true, false, false },
new List<bool>() { false, true, true, true, false }
});
I know I could probably optimize this part by using a multi-dimensional array instead, but let's ignore that for now; this is for illustrative purposes. Every letter has exactly the same dimensions, 10px by 11px (those are the actual dimensions of a character in my real program; I simplified to 5px by 5px in this post since it is much easier to "draw" the letters using 0's and 1's on a smaller image).
Now when I give it a 10px by 11px part of an image to analyze with OCR, it would need to check every pixel (10 * 11 = 110) against every letter (26), which means 2,860 (26 * 110) iterations (in the worst case) for every single character.
I was thinking this could be optimized by defining the unique characteristics of every character. So, for example, let's assume that the set of characters only consists of 5 distinct letters: I, A, O, B, and L. These might look like the following:
01110 00100 00100 01100 01000
00100 01010 01010 01010 01000
00100 01110 01010 01100 01000
00100 01010 01010 01010 01000
01110 01010 00100 01100 01110
After analyzing the unique characteristics of every character, I can significantly reduce the number of tests that need to be performed to identify one. For example, for the "I" character, I could define its unique characteristic as having a black pixel at coordinate (3, 0), since no other character has that pixel as black. So instead of testing 110 pixels for a match on the "I" character, I reduced it to a 1-pixel test.
This is what it might look like for all these characters:
var LetterI = new OcrLetter() {
Name = "I",
BlackPixels = new List<Point>() { new Point (3, 0) }
}
var LetterA = new OcrLetter() {
Name = "A",
WhitePixels = new List<Point>() { new Point(2, 4) }
}
var LetterO = new OcrLetter() {
Name = "O",
BlackPixels = new List<Point>() { new Point(3, 2) },
WhitePixels = new List<Point>() { new Point(2, 2) }
}
var LetterB = new OcrLetter() {
Name = "B",
BlackPixels = new List<Point>() { new Point(3, 1) },
WhitePixels = new List<Point>() { new Point(3, 2) }
}
var LetterL = new OcrLetter() {
Name = "L",
BlackPixels = new List<Point>() { new Point(1, 1), new Point(3, 4) },
WhitePixels = new List<Point>() { new Point(2, 2) }
}
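For reference, here is a minimal sketch of what the OcrLetter type and a corresponding match check could look like; the Matches helper and the bool[,] region parameter are my own assumptions, not part of the question's code:

using System.Collections.Generic;
using System.Drawing;

public class OcrLetter
{
    public string Name { get; set; }
    public List<Point> BlackPixels { get; set; } = new List<Point>();
    public List<Point> WhitePixels { get; set; } = new List<Point>();

    // region[x, y] == true means the pixel at (x, y) is black.
    public bool Matches(bool[,] region)
    {
        foreach (var p in BlackPixels)
            if (!region[p.X, p.Y]) return false;   // expected black, found white
        foreach (var p in WhitePixels)
            if (region[p.X, p.Y]) return false;    // expected white, found black
        return true;
    }
}

Identifying a character would then mean calling Matches on each OcrLetter until one returns true.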
This is challenging to do manually for 5 characters and gets much harder as more letters are added. You also want to guarantee that you have the minimum set of unique characteristics for each letter, since you want it to be as optimized as possible.
I want to create an algorithm that will identify the unique characteristics of all the letters and would generate similar code to that above. I would then use this optimized black/white matrix to identify characters.
How do I take the 26 letters with all their black/white pixels filled in (e.g. the CreateLetter code block) and convert them to an optimized set of unique characteristics that define each letter (e.g. the new OcrLetter() code block)? And how would I guarantee that it is the most efficient set of unique characteristics (e.g. instead of defining 6 points as the unique characteristics, there might be a way to do it with 1 or 2 points, as the letter "I" in my example was able to)?
An alternative solution I've come up with is using a hash table, which reduces it from 2,860 iterations to 110, a 26-fold reduction. This is how it might work:
I would populate it with data similar to the following:
Letters["01110 00100 00100 00100 01110"] = "I";
Letters["00100 01010 01110 01010 01010"] = "A";
Letters["00100 01010 01010 01010 00100"] = "O";
Letters["01100 01010 01100 01010 01100"] = "B";
Now when I reach a location in the image to process, I convert it to a string such as "01110 00100 00100 00100 01110" and simply look it up in the hash table. This solution seems very simple; however, it still requires 110 iterations to generate the string for each character.
In big O notation, the algorithm is the same since O(110N) = O(2860N) = O(N) for N letters to process on the page. However, it is still improved by a constant factor of 26, a significant improvement (e.g. instead of it taking 26 minutes, it would take 1 minute).
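A rough sketch of that lookup, assuming the 5x5 region has already been thresholded into a bool[,] (BuildKey is a hypothetical helper, and the key format matches the strings above):

using System.Collections.Generic;
using System.Text;

static string BuildKey(bool[,] region)
{
    // Builds "01110 00100 ..." row by row, matching the dictionary keys above.
    var sb = new StringBuilder();
    for (int y = 0; y < region.GetLength(1); y++)
    {
        if (y > 0) sb.Append(' ');
        for (int x = 0; x < region.GetLength(0); x++)
            sb.Append(region[x, y] ? '1' : '0');
    }
    return sb.ToString();
}

var letters = new Dictionary<string, string>
{
    { "01110 00100 00100 00100 01110", "I" },
    { "00100 01010 01110 01010 01010", "A" },
    // ... remaining letters
};

string name;
if (letters.TryGetValue(BuildKey(region), out name))
{
    // matched the letter `name`
}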
Update: Most of the solutions provided so far have not addressed the issue of identifying the unique characteristics of a character and rather provide alternative solutions. I am still looking for this solution which, as far as I can tell, is the only way to achieve the fastest OCR processing.
I just came up with a partial solution:
For each pixel in the grid, store the letters that have it as a black pixel.
Using these letters:
I A O B L
01110 00100 00100 01100 01000
00100 01010 01010 01010 01000
00100 01110 01010 01100 01000
00100 01010 01010 01010 01000
01110 01010 00100 01100 01110
You would have something like this:
CreatePixel(new Point(0, 0), new List<Char>() { });
CreatePixel(new Point(1, 0), new List<Char>() { 'I', 'B', 'L' });
CreatePixel(new Point(2, 0), new List<Char>() { 'I', 'A', 'O', 'B' });
CreatePixel(new Point(3, 0), new List<Char>() { 'I' });
CreatePixel(new Point(4, 0), new List<Char>() { });
CreatePixel(new Point(0, 1), new List<Char>() { });
CreatePixel(new Point(1, 1), new List<Char>() { 'A', 'B', 'L' });
CreatePixel(new Point(2, 1), new List<Char>() { 'I' });
CreatePixel(new Point(3, 1), new List<Char>() { 'A', 'O', 'B' });
// ...
CreatePixel(new Point(2, 2), new List<Char>() { 'I', 'A', 'B' });
CreatePixel(new Point(3, 2), new List<Char>() { 'A', 'O' });
// ...
CreatePixel(new Point(2, 4), new List<Char>() { 'I', 'O', 'B', 'L' });
CreatePixel(new Point(3, 4), new List<Char>() { 'I', 'A', 'L' });
CreatePixel(new Point(4, 4), new List<Char>() { });
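Generating these buckets automatically is straightforward. A rough sketch, assuming the full letter matrices are available in a letterBitmaps dictionary (a name I'm making up) keyed by character:

// For every coordinate, collect the letters that have a black pixel there.
var buckets = new Dictionary<Point, List<char>>();
for (int y = 0; y < 5; y++)
    for (int x = 0; x < 5; x++)
        buckets[new Point(x, y)] = new List<char>();

foreach (var entry in letterBitmaps)          // e.g. 'I' -> bool[5, 5]
    for (int y = 0; y < 5; y++)
        for (int x = 0; x < 5; x++)
            if (entry.Value[x, y])
                buckets[new Point(x, y)].Add(entry.Key);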
Now for every letter, in order to find its unique characteristics, you look at which buckets it belongs to, as well as the number of other characters in each bucket. Take the example of "I": we go through all the buckets it belongs to (1,0; 2,0; 3,0; ...; 3,4) and see that the one with the fewest other characters is (3,0). In fact, it contains only one character, meaning it must be an "I" in this case, and we have found our unique characteristic.
You can also do the same for pixels that would be white. Notice that bucket (2,0) contains all the letters except for "L", this means that it could be used as a white pixel test. Similarly, (2,4) doesn't contain an 'A'.
Buckets that either contain all the letters or none of the letters can be discarded immediately, since these pixels can't help define a unique characteristic (e.g. 0,0; 4,0; 0,1; 4,4).
It gets trickier when you don't have a 1 pixel test for a letter, for example in the case of 'O' and 'B'. Let's walk through the test for 'O'...
It's contained in the following buckets:
// Bucket Count Letters
// 2,0 4 I, A, O, B
// 3,1 3 A, O, B
// 3,2 2 A, O
// 2,4 4 I, O, B, L
Additionally, we have a few white pixel tests that can help (I only listed those missing at most 2 letters). The Missing Count is calculated as 5 - Bucket.Count.
// Bucket Missing Count Missing Letters
// 1,0 2 A, O
// 1,1 2 I, O
// 2,2 2 O, L
// 3,4 2 O, B
So now we can take the shortest black pixel bucket (3,2) and see that when we test for (3,2) we know it is either an 'A' or an 'O'. So we need an easy way to tell the difference between an 'A' and an 'O'. We could either look for a black pixel bucket that contains 'O' but not 'A' (e.g. 2,4) or a white pixel bucket that contains an 'O' but not an 'A' (e.g. 1,1). Either of these could be used in combination with the (3,2) pixel to uniquely identify the letter 'O' with only 2 tests.
This seems like a simple algorithm when there are 5 characters, but how would I do this when there are 26 letters and a lot more pixels overlapping? For example, let's say that after the (3,2) pixel test, it found 10 different characters that contain the pixel (and this was the least from all the buckets). Now I need to find differences from 9 other characters instead of only 1 other character. How would I achieve my goal of getting the least amount of checks as possible, and ensure that I am not running extraneous tests?
I don't have an answer, but here are some bounds on your eventual solution:
If you want a straight-up "use X pixels as a key" approach, then you'll need at least ceiling(log2(number of characters)) pixels; for 26 letters that is 5. You won't be able to disambiguate letters with fewer bits. In your case, finding those 5 pixels is equivalent to finding 5 pixels that split the letters into independent partitions. It probably isn't that easy.
You can also use Moron's (heheh) suggestion and build a tree based on the letter frequencies of the language you are scanning, similar to Huffman coding. Some letters would then need more than 5 bits, but the average number of tests would probably be smaller, assuming a power-law distribution of letter usage. I would go with this approach, as it lets you search for a specific partition at each node rather than searching for a set of partitions.
You could create a tree.
Pick a pixel, and divide the letters into two buckets, based on the pixel being white or black. Then pick a second pixel, split the buckets into two buckets each based on that pixel and so on.
You could try to optimize the depth of the tree by choosing pixels which give buckets which are approximately equal in size.
Creating the tree is a one time preprocess step. You should not have to do it multiple times.
Now when you get an alphabet to match, follow the tree based on the pixels set/not set and get your letter.
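A rough sketch of building such a tree, choosing at each node the pixel whose black/white split over the remaining letters is closest to even (the bool[,] bitmaps and the helper shapes are my assumptions):

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

class TreeNode
{
    public char? Letter;          // set on leaves
    public Point Pixel;           // pixel tested at inner nodes
    public TreeNode White, Black; // subtree when the pixel is white / black
}

static class OcrTree
{
    public static TreeNode Build(List<KeyValuePair<char, bool[,]>> letters, List<Point> pixels)
    {
        if (letters.Count == 1)
            return new TreeNode { Letter = letters[0].Key };

        // Pick the unused pixel whose black count is closest to half the letters.
        Point best = default(Point);
        int bestScore = int.MaxValue;
        foreach (var p in pixels)
        {
            int blackCount = letters.Count(l => l.Value[p.X, p.Y]);
            if (blackCount == 0 || blackCount == letters.Count) continue;   // doesn't split
            int score = Math.Abs(letters.Count - 2 * blackCount);           // 0 = perfect split
            if (score < bestScore) { bestScore = score; best = p; }
        }
        if (bestScore == int.MaxValue)
            throw new InvalidOperationException("two letters have identical bitmaps");

        var remaining = pixels.Where(p => p != best).ToList();
        return new TreeNode
        {
            Pixel = best,
            Black = Build(letters.Where(l => l.Value[best.X, best.Y]).ToList(), remaining),
            White = Build(letters.Where(l => !l.Value[best.X, best.Y]).ToList(), remaining),
        };
    }

    // Matching: walk the tree, testing one pixel per level.
    public static char? Classify(TreeNode node, bool[,] region)
    {
        while (node.Letter == null)
            node = region[node.Pixel.X, node.Pixel.Y] ? node.Black : node.White;
        return node.Letter;
    }
}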
I don't have an algorithm to give you the key features, but here are some things that might help.
First, I wouldn't worry too much about looking for a characteristic pixel for each character because, on average, checking whether a given character matches a given 5x5 swath of the binary image shouldn't take more than 5-7 pixel checks to tell that there isn't a match. Why? Probability. For 7 binary pixels, there are 2**7 = 128 different possibilities, which means there is a 1/128 < 1% chance of a character matching even up to 7 pixels. Just make sure that you stop the comparisons as soon as you find a mismatch.
Second, if you don't want to do a hash table, then you might consider using a trie to store all of your character data. It will use less memory, and you'll be checking all of the characters at once. It won't be quite as fast to search through as a hash table, but you also won't have to convert to a string. At each node in the tree, there can only be at most 2 descendants. For instance, if you have two 2x2 characters (let's call them A and B):
A B
01 00
10 11
Your trie would have only one descendant at the first node: only to the left (the 0 branch). We proceed to this next node. It has two descendants: the left (0) branch leads to the rest of B and the right (1) branch leads to the rest of A. You get the picture. Let me know if this part isn't clear.
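A minimal sketch of such a binary trie (my own illustration, not the answerer's code): each node has at most two children, indexed by whether the next pixel in row-major order is white (0) or black (1).

using System.Collections.Generic;

class TrieNode
{
    public TrieNode[] Next = new TrieNode[2];  // [0] = white, [1] = black
    public char? Letter;                       // set on the node reached after the last pixel
}

static class OcrTrie
{
    public static void Insert(TrieNode root, IEnumerable<bool> pixels, char letter)
    {
        var node = root;
        foreach (bool black in pixels)
        {
            int i = black ? 1 : 0;
            if (node.Next[i] == null) node.Next[i] = new TrieNode();
            node = node.Next[i];
        }
        node.Letter = letter;
    }

    public static char? Find(TrieNode root, IEnumerable<bool> pixels)
    {
        var node = root;
        foreach (bool black in pixels)
        {
            node = node.Next[black ? 1 : 0];
            if (node == null) return null;     // mismatch: no stored letter continues this way
        }
        return node.Letter;
    }
}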
Why not just treat the image as a 25-bit integer? A 32-bit int would work. For example, the letter 'I' can be treated as the integer 14815374 in decimal, since its binary representation is 0111000100001000010001110. It is then convenient to compare two images with a single '==' on two integers.
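A sketch of that packing, assuming row-major order with pixel (0, 0) as the most significant bit, which is the order that yields 14815374 for the 'I' above:

// Packs a 5x5 black/white matrix into a 25-bit integer (fits easily in an int).
static int Pack(bool[,] region)
{
    int value = 0;
    for (int y = 0; y < 5; y++)
        for (int x = 0; x < 5; x++)
            value = (value << 1) | (region[x, y] ? 1 : 0);
    return value;
}

// Two regions (or a region and a stored letter) match when their packed ints are equal.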
One way would be to identify a pixel that's black in roughly half of the letters and white in the other set. This can then be used to split the letters into two groups, using the same algorithm on both halves recursively, until you have reached individual characters.
If you can't find a single pixel that splits the sets into two, you may have to go to a group of two or more pixels, but hopefully using single pixels ought to be good enough.
To find the pixel, start with an array of integers, the same size as your letters, initialize all elements to 0, then increment the elements if the corresponding pixel in a letter is (say) black. The ones you're interested in are the ones in the (roughly) 10≤sum≤16 range (for the top level, lower levels would need to use other bounds).
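A small sketch of that counting step (letterBitmaps stands in for the 26 full matrices and is my own placeholder name):

// counts[x, y] = how many letters have a black pixel at (x, y)
var counts = new int[5, 5];
foreach (bool[,] bitmap in letterBitmaps.Values)
    for (int y = 0; y < 5; y++)
        for (int x = 0; x < 5; x++)
            if (bitmap[x, y]) counts[x, y]++;

// Candidate splitting pixels are those whose count is close to half the letters,
// e.g. 10 <= counts[x, y] <= 16 when starting from all 26 letters.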
Alright, I figured out the solution.
You search over every pixel combination (the pseudo code below uses a queue of partial results) until you find the set of unique characteristics of the letter. While performing the search, make sure you don't start back at x=0 and y=0 each time, since you only want to process each combination once; instead, keep incrementing the x and y values as you go.
I created a helper object which contains these properties:
public Point LastPoint { get; set; }
public List<OcrChar> CharsWithSimilarProperties { get; set; }
public List<Point> BlackPixels { get; set; }
public List<Point> WhitePixels { get; set; }
For every iteration, if I couldn't find a unique characteristic (e.g. all other letters have this pixel as black while this letter has it as white, or the inverse), I add all subsequent pixels to a queue for processing, creating an instance of the object above with its properties set accordingly.
Some pseudo code:
rootNode.LastPoint = new Point(-1, -1)
rootNode.CharsWithSimilarProperties = all letters in alphabet except for this one
queue.Add(rootNode)

while queue.HasNodes()
    node = queue.Dequeue()
    for each pixel after node.LastPoint
        if node.IsBlackPixel(pixel) && node.CharsWithSimilarProperties.IsAlwaysWhite(pixel)
            node.BlackPixels.Add(pixel)
            return node.BlackPixels and node.WhitePixels

        if node.IsWhitePixel(pixel) && node.CharsWithSimilarProperties.IsAlwaysBlack(pixel)
            node.WhitePixels.Add(pixel)
            return node.BlackPixels and node.WhitePixels

        newNode = new Node()
        newNode.BlackPixels = node.BlackPixels.Copy()
        newNode.WhitePixels = node.WhitePixels.Copy()
        newNode.LastPoint = pixel
        if node.IsBlackPixel(pixel)
            newNode.BlackPixels.Add(pixel)
            newNode.CharsWithSimilarProperties = chars from node.CharsWithSimilarProperties that also have this pixel as black
        else
            newNode.WhitePixels.Add(pixel)
            newNode.CharsWithSimilarProperties = chars from node.CharsWithSimilarProperties that also have this pixel as white
        queue.Add(newNode)
To determine if "node.CharsWithSimilarProperites.IsAlwaysWhite()" or "IsAlwaysBlack()", you can generate a compositeMap in each iteration of the queue:
for each pixel after node.LastPoint
    for each char in node.CharsWithSimilarProperties
        if char.IsBlackPixel(pixel)
            compositeMap[pixel].Add(char)
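A small C# sketch of that composite map (my own guess at the types; PixelsAfter is the same hypothetical row-major enumerator used in the sketch at the end of this answer):

// For the current node, record which of the remaining "similar" characters
// are black at each pixel that has not been examined yet.
var compositeMap = new Dictionary<Point, List<OcrChar>>();
foreach (var pixel in PixelsAfter(node.LastPoint, width, height))
{
    var blackHere = new List<OcrChar>();
    foreach (var c in node.CharsWithSimilarProperties)
        if (c.IsBlackPixel(pixel))
            blackHere.Add(c);
    compositeMap[pixel] = blackHere;
}

// IsAlwaysBlack(pixel) <=> compositeMap[pixel].Count == node.CharsWithSimilarProperties.Count
// IsAlwaysWhite(pixel) <=> compositeMap[pixel].Count == 0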
Before doing all this, I also processed the entire alphabet to find pixels that are always white or always black, since these can never be used. I added them to a List<Point> ignoredPixels, and every time I iterate over pixels, I always use if (ignoredPixels[x, y]) continue;.
This works perfectly and is really fast, though keep in mind that this part of my solution doesn't need to be fast at all, since it is a one-time optimization that helps me later on. In my test cases of at most 8 chars per "alphabet" set, it usually produces one or two characteristics per character. I have yet to run it on a full set of 26 characters.
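Putting the pieces together, here is a rough C# sketch of the search; this is my own interpretation of the pseudo code above, the OcrChar/Node shapes and the PixelsAfter helper are assumptions, and the composite map is replaced by a direct filter:

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

class OcrChar
{
    public char Name;
    public bool[,] Pixels;                                  // letter bitmap, true = black
    public bool IsBlackPixel(Point p) { return Pixels[p.X, p.Y]; }
}

class Node
{
    public Point LastPoint;
    public List<OcrChar> CharsWithSimilarProperties;
    public List<Point> BlackPixels = new List<Point>();
    public List<Point> WhitePixels = new List<Point>();
}

static class CharacteristicFinder
{
    // Searches for a small set of black/white pixels that distinguishes
    // `letter` from every letter in `others`.
    public static Node FindCharacteristics(OcrChar letter, List<OcrChar> others, int width, int height)
    {
        var queue = new Queue<Node>();
        queue.Enqueue(new Node { LastPoint = new Point(-1, -1), CharsWithSimilarProperties = others });

        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var pixel in PixelsAfter(node.LastPoint, width, height))
            {
                bool black = letter.IsBlackPixel(pixel);
                var stillSimilar = node.CharsWithSimilarProperties
                                       .Where(c => c.IsBlackPixel(pixel) == black)
                                       .ToList();

                if (stillSimilar.Count == 0)
                {
                    // No other letter matches all pixels collected so far plus this one:
                    // we have found a unique characteristic set.
                    (black ? node.BlackPixels : node.WhitePixels).Add(pixel);
                    return node;
                }

                var newNode = new Node
                {
                    LastPoint = pixel,
                    CharsWithSimilarProperties = stillSimilar,
                    BlackPixels = new List<Point>(node.BlackPixels),
                    WhitePixels = new List<Point>(node.WhitePixels),
                };
                (black ? newNode.BlackPixels : newNode.WhitePixels).Add(pixel);
                queue.Enqueue(newNode);
            }
        }
        return null;   // only possible if two letters have identical bitmaps
    }

    // Enumerates pixels in row-major order, starting just after `last`.
    static IEnumerable<Point> PixelsAfter(Point last, int width, int height)
    {
        int start = last.Y * width + last.X + 1;
        for (int i = Math.Max(start, 0); i < width * height; i++)
            yield return new Point(i % width, i / width);
    }
}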
I am going down a similar track trying to invent an algorithm that will give me a minimal number of tests I can use to match an image to one I've seen previously. My application is OCR but in a limited domain of recognising an image from a fixed set of images as fast as possible.
My basic assumption (which I think is the same as yours, or was the same) is that if we can identify one unique pixel (where a pixel is defined as a point within an image plus a color) then we have found the perfect (fastest) test for that image. In your case you want to find letters.
If we cannot find one such pixel then we (grudgingly) look for two pixels that in combination are unique. Or three. And so on, until we have a minimal test for each of the images.
I should note that I have a strong feeling that in my particular domain I will be able to find such unique pixels. It might not be the same for your application where you seem to have a lot of "overlap".
After considering comments in this other question (where I'm just starting to get a feel for the problem) and comments here I think I might have come up with a workable algorithm.
Here is what I've got so far. The method I describe below is written in the abstract but in my application each "test" is a pixel identified by a point plus a color, and a "result" represents the identity of an image. Identification of these images is my end goal.
Consider the following tests numbered T1 to T4.
- T1: A B C
- T2: B
- T3: A C D
- T4: A D
This list of tests can be interpreted as follows;
- If test T1 is true we conclude that we have a result of A or B or C.
- If test T2 is true we conclude that we have a result of B.
- If test T3 is true we conclude that we have a result of A or C or D.
- If test T4 is true we conclude that we have a result of A or D.
For each individual result A, B, C, D, we want to find a combination of tests (ideally just one test) that will allow us to test for an unambiguous result.
Applying intuition and with a bit of squinting at the screen we can fumble our way to the following arrangement of tests.
For A we can test for a combination of T4 (either A or D) AND T1 (A but not D)
B is easy since there is a test T2 that gives result B and nothing else.
C is a bit harder, but eventually we can see that a combination of T3 (A or C or D) and NOT T4 (not A and not D) gives the desired result.
And similarly, D can be found with a combination of T4 and (not T1).
In summary
A <- T4 && T1
B <- T2
C <- T3 && ¬T4
D <- T4 && ¬T1
(where <- should be read as 'can be found if the following tests evaluate to true')
Intuition and squinting is fine, but we probably won't get these techniques built into the language until at least C# 5.0, so here is an attempt at formalising the method for implementation in lesser languages.
To find a result R:

1. Find the test Tr that gives the desired result R and the fewest unwanted results (ideally no others).
2. If that test gives the result R and nothing else, we are finished. We can match for R where Tr is true.
3. For every unwanted result X in the test Tr:
   (a) Find the shortest test Tn that gives R but not X. If we find such a test we can then match for R where (Tr && Tn).
   (b) If no test matches condition (a), then find the shortest test Tx that includes X but does not include R. (Such a test would eliminate X as a result from test Tr.) We can then test for R where (Tr && ¬Tx).
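Here is a rough C# sketch of those rules, purely as an illustration of the method and not the answerer's code; each test is represented by the set of results it can produce:

using System;
using System.Collections.Generic;
using System.Linq;

static class TestSelector
{
    // tests: test name -> the set of results that test produces when true.
    // Prints, for the desired result r, a conjunction of (possibly negated)
    // tests that isolates r, following rules (1), (3a) and (3b) above.
    public static void Isolate(Dictionary<string, HashSet<char>> tests, char r)
    {
        // Rule (1): the test containing r with the fewest unwanted results.
        var tr = tests.Where(t => t.Value.Contains(r))
                      .OrderBy(t => t.Value.Count)
                      .First();

        var expr = new List<string> { tr.Key };
        foreach (char x in tr.Value.Where(c => c != r))          // rule (3): each unwanted result
        {
            // Rule (3a): shortest test that gives r but not x.
            var tn = tests.Where(t => t.Value.Contains(r) && !t.Value.Contains(x))
                          .OrderBy(t => t.Value.Count)
                          .FirstOrDefault();
            if (tn.Key != null) { expr.Add(tn.Key); continue; }

            // Rule (3b): shortest test that gives x but not r, used negated.
            var tx = tests.Where(t => t.Value.Contains(x) && !t.Value.Contains(r))
                          .OrderBy(t => t.Value.Count)
                          .First();                               // assumed to exist in this sketch
            expr.Add("¬" + tx.Key);
        }
        Console.WriteLine(r + " <- " + string.Join(" && ", expr.Distinct()));
    }
}

// Example usage with the four tests above:
// var tests = new Dictionary<string, HashSet<char>> {
//     { "T1", new HashSet<char> { 'A', 'B', 'C' } },
//     { "T2", new HashSet<char> { 'B' } },
//     { "T3", new HashSet<char> { 'A', 'C', 'D' } },
//     { "T4", new HashSet<char> { 'A', 'D' } },
// };
// foreach (char r in "ABCD") TestSelector.Isolate(tests, r);   // e.g. prints "B <- T2"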
Now I will try to follow these rules for each of the desired results, A, B, C, D.
Here are the tests again for reference;
- T1: A B C
- T2: B
- T3: A C D
- T4: A D
For A
According to rule (1) we start with T4, since it is the shortest test that gives result A. But it also gives result 'D', which is an unwanted result. According to rule (3a) we can use test T1, since it includes 'A' but does not include 'D'.
Therefore we can test for A with
A <- T4 && T1
For B
To find 'B' we quickly find test T2 which is the shortest test for 'B' and since it gives only result 'B' we are finished.
B <- T2
For C
To find 'C' we start with T1 and T3, the shortest tests that give 'C'; neither gives 'C' on its own.
Take T3, whose unwanted results are 'A' and 'D'. According to (3a) we would need a test that gives 'C' but not 'A', and another that gives 'C' but not 'D'; no test satisfies either condition.
Falling back to (3b), we look for a test that gives those unwanted results but not 'C'. Test T4 gives 'A' and 'D' but not 'C', so negating it eliminates both unwanted results at once. (Starting from T1 instead would leave the unwanted result 'B', which would need a third test to remove.) Therefore we can test for C with
C <- T3 && ¬T4
For D
To find D we start with T4. T4 includes unwanted result A. There are no other tests that give the result D but not A so we look for a test that gives A but not D. Test T1 satisfies this condition so therefore we can test for D with
D <- T4 && ¬T1
These results are good but I don't think I've quite debugged this algorithm enough to have 100% confidence. I'm going to think about it a bit more and maybe code up some tests to see how it holds up. Unfortunately the algorithm is just complex enough that it will take more than a few minutes to implement carefully. It might be days before I conclude anything further.
Update
I found that it is optimal to simultaneously look for tests that satisfy (a) OR (b) rather than look for (a) and then (b). If we look first for (a) we might get a long list of tests when we might have got a shorter list by allowing some (b) tests.