An interview question - Split text into sub-strings according to rules

Split text into sub-strings according to the rules below:

  • a) The length of each sub-string must be less than or equal to M
  • b) The length of a sub-string must be less than or equal to N (N < M) if the sub-string contains any numeric character
  • c) The total number of sub-strings should be as small as possible
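Rules a) and b) can be stated as a small checker. This is only a sketch of how I read the rules: the function name `is_valid_split` is made up for illustration, and "numeric char" is assumed to mean the ASCII digits 0-9.

```python
DIGITS = "0123456789"  # assumption: "numeric char" means an ASCII digit

def is_valid_split(text, chunks, m, n):
    """Return True if `chunks` re-assembles `text` and obeys rules a) and b)."""
    if "".join(chunks) != text:
        return False
    for chunk in chunks:
        if not chunk:
            return False  # empty chunks only inflate the count
        # Rule b) applies when the chunk holds a digit, rule a) otherwise.
        limit = n if any(c in DIGITS for c in chunk) else m
        if len(chunk) > limit:
            return False
    return True
```

Rule c) is not checked here; it only says that among all valid splits, one with the fewest chunks is wanted.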

I have no clue how to solve this question; I guess it is related to "dynamic programming". Can anybody help me implement it using C# or Java? Thanks a lot.


Idea

A greedy approach is the way to go:

  • If the current text is empty, you're done.
  • Take the first N characters (or all of the remaining text, if it is shorter). If any of them is a digit, then this is a new substring. Chop it off and go back to the beginning.
  • Otherwise, extend the digitless segment to at most M characters, stopping before the first digit. This is a new substring. Chop it off and go back to the beginning.

Proof

Here's a reductio-ad-absurdum proof that the above yields an optimal solution. Assume there is a better split than the greedy split. Let's skip to the point where the two splits start to differ and remove everything before this point.

Case 1) A digit among the first N characters.

Assume that there is an input for which chopping off the first N characters cannot yield an optimal solution.

Greedy split:   |--N--|...
A better split: |---|--...
                      ^
                      +---- this segment can be shortened from the left side

However, the second segment of the putative better solution can always be shortened from the left side, and the first one extended to N characters, without altering the number of segments. Therefore, a contradiction: this split is not better than the greedy split.

Case 2) No digit among the first K (N <= K <= M) characters.

Assume that there is an input for which chopping off the first K characters cannot yield an optimal solution.

Greedy split:   |--K--|...
A better split: |---|--...
                      ^
                      +---- this segment can be shortened from the left side

Again, the "better" split can be transformed, without altering the number of segments, to the greedy split, which contradicts the initial assumption that there is a better split than the greedy split.

Therefore, the greedy split is optimal. Q.E.D.
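Since the question mentions dynamic programming, one way to sanity-check the proof is to compare the greedy result against a straightforward DP on small inputs. The following is a sketch, not part of the original answer: the function names are made up, digits are assumed to be ASCII 0-9, and the exhaustive check only covers short strings over a two-letter alphabet.

```python
from itertools import product

DIGITS = "0123456789"  # assumption: "numeric char" means an ASCII digit

def greedy_split(text, m, n):
    """Greedy split as described in the Idea section."""
    chunks, j = [], 0
    while j < len(text):
        i, j = j, min(len(text), j + n)
        if not any(c in DIGITS for c in text[i:j]):
            # Digit-free so far: extend to at most m characters.
            while j < len(text) and j - i < m and text[j] not in DIGITS:
                j += 1
        chunks.append(text[i:j])
    return chunks

def min_chunks_dp(text, m, n):
    """Fewest valid chunks; best[k] is the optimum for the prefix text[:k]."""
    best = [0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        has_digit = False
        # Try every chunk text[start:end] of length 1..m, growing leftwards.
        for start in range(end - 1, max(-1, end - m - 1), -1):
            has_digit = has_digit or text[start] in DIGITS
            if has_digit and end - start > n:
                break  # longer chunks would still contain the digit
            best[end] = min(best[end], best[start] + 1)
    return best[len(text)]

def greedy_matches_dp(max_len=8, alphabet="a1"):
    """Compare both methods on every string up to max_len, for all 1 <= n < m <= 5."""
    for m in range(2, 6):
        for n in range(1, m):
            for length in range(max_len + 1):
                for letters in product(alphabet, repeat=length):
                    text = "".join(letters)
                    if len(greedy_split(text, m, n)) != min_chunks_dp(text, m, n):
                        return False
    return True
```

Calling `greedy_matches_dp()` is no substitute for the proof, but a `False` result would point at a concrete counter-example.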

Implementation (Python)

import sys

# Arguments: M N TEXT
m, n, text = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
textLen, isDigit = len(text), [c in '0123456789' for c in text]

chunks, i, j = [], 0, 0
while j < textLen:
    # Tentatively take the next n characters.
    i, j = j, min(textLen, j + n)
    if not any(isDigit[i:j]):
        # No digit so far: extend up to m characters, stopping before a digit.
        while j < textLen and j - i < m and not isDigit[j]:
            j += 1
    chunks.append(text[i:j])
print(chunks)

Implementation (Java)

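import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// The test below assumes JUnit 4 is on the classpath.
import org.junit.Assert;
import org.junit.Test;

public class SO {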
   public List<String> go(int m, int n, String text) {
      if (text == null)
         return Collections.emptyList();
      List<String> chunks = new ArrayList<String>();

      int i = 0;
      int j = 0;
      while (j < text.length()) {
         i = j;         
         j = Math.min(text.length(), j + n);
         boolean ok = true;
         for (int k = i; k < j; k++) 
            if (Character.isDigit(text.charAt(k))) {
               ok = false;              
               break;                   
            }                   
         if (ok)        
            while (j < text.length() && j - i < m && !Character.isDigit(text.charAt(j)))
               j++;                     
         chunks.add(text.substring(i, j));
      }         
      return chunks;
   }    

   @Test
   public void testIt() {
      Assert.assertEquals(
         Arrays.asList("asdas", "d332", "4asd", "fsdxf", "23"),
         go(5, 4, "asdasd3324asdfsdxf23"));
   }    
}


Bolo has provided a greedy algorithm in his answer and asked for a counter-example. Well, there is no counter-example, because the approach is perfectly correct. Here's the proof. It's a bit wordy, but a proof is often longer than the algorithm itself :)

Let's imagine we have an input of length L and have constructed an answer A with our algorithm. Now, suppose there's a better answer B, i.e. B has fewer segments than A does.

Say the first segment has length la in A and lb in B. Then la >= lb, because we've chosen the first segment in A to have the maximum possible length. If lb < la, we can extend the first segment in B to length la by taking characters from the segments that follow it. Those segments only get shorter (or disappear), so they stay valid and the overall number of segments in B does not increase. This gives us another optimal solution B' that has the same first segment as A.

Now, remove that first segment from A and B' and repeat the operation for the remaining text of length L' < L. Do this until there are no segments left. It follows that answer A is equal to some optimal solution.
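The key step is la >= lb: the greedy first segment is the longest prefix that is valid at all. A brute-force comparison makes this easy to try out. This is a sketch with made-up function names, assuming digits are ASCII 0-9.

```python
DIGITS = "0123456789"  # assumption: "numeric char" means an ASCII digit

def is_valid(chunk, m, n):
    """A chunk is valid if it is non-empty and within the limit that applies."""
    limit = n if any(c in DIGITS for c in chunk) else m
    return 0 < len(chunk) <= limit

def longest_valid_prefix(text, m, n):
    """Brute force: try every prefix and keep the longest valid one."""
    lengths = (k for k in range(1, len(text) + 1) if is_valid(text[:k], m, n))
    return max(lengths, default=0)

def greedy_first_length(text, m, n):
    """Length of the first segment chosen by the greedy algorithm (la)."""
    j = min(len(text), n)
    if not any(c in DIGITS for c in text[:j]):
        while j < len(text) and j < m and text[j] not in DIGITS:
            j += 1
    return j
```

For any text with 1 <= N < M, the two functions should return the same value, so no valid first segment lb can exceed la.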


The result of your computation will be a partitioning of the given text into short sub-strings containing numerics and long substrings not containing numerics. (This much you knew already).

You will essentially be partitioning off short subs around the numerics and then breaking everything else down into long subs as often as needed to fulfill the length criteria.

Your freedom, i.e. what you can manipulate to improve your result, is to select which characters to include with a numeric. If N = 3, then for every numeric you get the choice of XXN, XNX or NXX. If M is 5 and you have 6 characters before your first numeric, you'll want to include at least one of those characters in your short sub so you won't end up with two "long" strings to the left of your "short" one when you could have just one instead.
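To make the XXN / XNX / NXX choice concrete, here is a small sketch that lists every window of length N covering a given digit. The helper name `windows_containing` is made up for illustration.

```python
def windows_containing(text, pos, n):
    """All length-n sub-strings of `text` that include the character at index `pos`."""
    first = max(0, pos - n + 1)       # leftmost start that still covers pos
    last = min(pos, len(text) - n)    # rightmost start that fits inside text
    return [text[s:s + n] for s in range(first, last + 1)]

print(windows_containing("ab1cd", 2, 3))  # ['ab1', 'b1c', '1cd']
```

Near the edges of the text there are fewer choices, and if the text is shorter than N the list is empty, since no full-length window fits.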

As a first approximation, I'd go with extending your "short" strings leftward far enough to avoid redundant "long" strings. This is a typical "greedy" approach, and greedy approaches often yield optimal or almost-optimal results. Doing even better than that would not be easy, and I'm not going to try to figure out how to go about it.
