Big O Notation for string matching algo
What would the big O notation of the function foo be?
int foo(char *s1, char *s2)
{
int c=0, s, p, found;
for (s=0; s1[s] != '\0'; s++)
{
for (p=0, found=0; s2[p] != '\0'; p++)
{
if (s2[p] == s1[s])
{
found = 1;
break;
}
}
if (!found) c++;
}
return c;
}
What is the efficienc开发者_如何转开发y of the function foo?
a) O(n!)
b) O(n^2)
c) O(n lg(base2) n )
d) O(n)
I would have said O(MN)...?
It is O(n²)
where n = max(length(s1),length(s2)) (which can be determined in less than quadratic time - see below). Let's take a look at a textbook definition:
f(n) ∈ O(g(n)) if a positive real number c and positive integer N exist such that f(n) <= c g(n) for all n >= N
By this definition we see that n represents a number - in this case that number is the length of the string passed in. However, there is an apparent discrepancy, since this definition provides only for a single variable function f(n)
and here we clearly pass in 2 strings with independent lengths. So we search for a multivariable definition for Big O. However, as demonstrated by Howell in "On Asymptotic Notation with Multiple Variables":
"it is impossible to define big-O notation for multi-variable functions in a way that implies all of these [commonly-assumed] properties."
There is actually a formal definition for Big O with multiple variables however this requires extra constraints beyond single variable Big O be met, and is beyond the scope of most (if not all) algorithms courses. For typical algorithm analysis we can effectively reduce our function to a single variable by bounding all variables to a limiting variable n
. In this case the variables (specifically, length(s1) and length(s2)) are clearly independent, but it is possible to bound them:
Method 1
Let x1 = length(s1)
Let x2 = length(s2)
The worst case scenario for this function occurs when there are no matches, therefore we perform x1 * x2 iterations.
Because multiplication is commutative, the worst case scenario foo(s1,s2) == the worst case scenario of foo(s2,s1). We can therefore assume, without loss of generality, that x1 >= x2. (This is because, if x1 < x2 we could get the same result by passing the arguments in the reverse order).
Method 2 (in case you don't like the first method)
For the worst case scenario (in which s1 and s2 contain no common characters), we can determine length(s1) and length(s2) prior to iterating through the loops (in .NET and Java, determining the length of a string is O(1) - but in this case it is O(n)), assigning the greater to x1 and the lesser to x2. Here it is clear that x1 >= x2.
For this scenario, we will see that the extra calculations to determine x1 and x2 make this O(n² + 2n) We use the following simplification rule which can be found here to simplify to O(n²):
If f(x) is a sum of several terms, the one with the largest growth rate is kept, and all others omitted.
Conclusion
for n = x1
(our limiting variable), such that x1 >= x2
, the worst case scenario is x1 = x2
.
Therefore: f(x1) ∈ O(n²)
Extra Hint
For all homework problems posted to SO related to Big O notation, if the answer is not one of:
O(1)
O(log log n)
O(log n)
O(n^c), 0<c<1
O(n)
O(n log n) = O(log n!)
O(n^2)
O(n^c)
O(c^n)
O(n!)
Then the question is probably better off being posted to https://math.stackexchange.com/
In big-O notation, we always have to define what the occuring variables mean. O(n)
doesn't mean anything unless we define what n
is. Often, we can omit this information because it is clear from context. For example if we say that some sorting algorithm is O(n log(n))
, n
always denotes the number of items to sort, so we don't have to always state this.
Another important thing about big-O notation is that it only gives an upper limit -- every algorithm in O(n)
is also in O(n^2)
. The notation is often used as meaning "the algorithm has the exact asymptotic complexity given by the expression (up to a constant factor)", but it's actual definition is "the complexity of the alogrithm is bounded by the given expression (up to a constant factor)".
In the example you gave, you took m
and n
to be the respective lengths of the two strings. With this definition, the algorithm is indeed O(m n)
. If we define n
to be the length of the longer of the two strings though, we can also write this as O(n^2)
-- this is also an upper limit for the complexity of the algorithm. And with the same definition of n
, the algorithm is also O(n!)
, but not O(n)
or O(n log(n))
.
O(n^2)
The relevant part of the function, in terms of complexity, is the nested loops. The maximum number of iterations is the length of s1 times the length of s2, both of which are linear factors, so the worst-case computing time is O(n^2), i.e. the square of a linear factor. As Ethan said, O(mn) and O(n^2) are effectively the same thing.
Think of it this way:
There are two inputs. If the function simply returned, then it's performance is unrelated to the arguments. This would be O(1).
If the function looped over one string, then the performance is linearly related to the length of that string. Therefore O(N).
But the function has a loop within a loop. The performance is related to the length of s1 and the length of S2. Multiply those lengths together and you get the number of loop iterations. It's not linear any more, it follows a curve. This is O(N^2).
精彩评论