开发者

Ellipsizing a set of names

OK, I'm sure somebody, somewhere must have come up with an algorithm for this already, so I figured I'd ask before I go off to (re)invent it myself.

I have a list of arbitrary (user-entered) non-empty text strings. Each string can be any length (except 0), and they're all unique. I want to display them to the user, but I want to trim them to some fixed length that I decide, and replace part of them with an ellipsis (...). The catch is that I want all of the output strings to be unique.

For example, if I have the strings:

  • Microsoft Internet Explorer 6
  • Microsoft Internet Explorer 7
  • Microsoft Internet Explorer 8
  • Mozilla Firefox 3
  • Mozilla Firefox 4
  • Google Chrome 14

then I wouldn't want to trim the ends of the strings, because that's the unique part (don't want to display "Microsoft Internet ..." 3 times), but it's OK to cut out the middle part:

  • Microsoft...rer 6
  • Microsoft...rer 7
  • Microsoft...rer 8
  • Mozilla Firefox 3
  • Mozilla Firefox 4
  • Google Chrome 14

Other times, the middle part might be unique, and I'd want to trim the end:

  • Minutes of Company Meeting, 5/25/2010 -- Internal use only
  • Minutes of Company Meeting, 6/24/2010 -- Internal use only
  • Minutes of Company Meeting, 7/23/2010 -- Internal use only

could become:

  • Minutes of Company Meeting, 5/25/2010...
  • Minutes of Company Meeting, 6/24/2010...
  • Minutes of Company Meeting, 7/23/2010...

I guess it should probably never ellipsize the very beginning of the strings, even if that would otherwise be allowed, since that would look weird. And I guess it could ellipsize more than one place in the string, but within reason -- maybe 2 times would be OK, but 3 or more seems excessive. Or maybe the number of times isn't as important as the size of the chunks that remain: less than about 5 characters between ellipses would be rather pointless.

The inputs (both number and size) won'开发者_Python百科t be terribly large, so performance is not a major concern (well, as long as the algorithm doesn't try something silly like enumerating all possible strings until it finds a set that works!).

I guess these requirements seem pretty specific, but I'm actually fairly lenient -- I'm just trying to describe what I have in mind.

Has something like this been done before? Is there some existing algorithm or library that does this? I've googled some but found nothing quite like this so far (but maybe I'm just bad at googling). I have to believe somebody somewhere has wanted to solve this problem already!


It sounds like an application of the longest common substring problem.

Replace the longest substring common to all strings with ellipsis. If the string is still too long and you are allowed to have another ellipsis, repeat.

You have to realize that you might not be able to "ellipsize" a given set of strings enough to meet length requirements.


Sort the strings. Keep the first X characters of each string. If this prefix is not unique to the string before and after, then advance until unique characters (compared to the string before and after) are found. (If no unique characters are found, the string has no unique part, see bottom of post) Add ellipses before and after those unique characters.

Note that this still might look funny:

Microsoft Office -> Micro...ffice
Microsoft Outlook -> Micro...utlook

I don't know what language you're looking to do this in, but here's a Python implementation.

def unique_index(before, current, after, size):
    '''Returns the index of the first part of _current_ of length _size_ that is 
        unique to it, _before_, and _after_. If _current_ has no part unique to it,
        _before_, and _after_, it returns the _size_ letters at the end of _current_'''
    before_unique = False
    after_unique = False
    for i in range(len(current)-size):
        #this will be incorrect in the case mentioned below
        if i > len(before)-1 or before[i] != current[i]:
            before_unique = True
        if i > len(after)-1 or after[i] != current[i]:
            after_unique = True
        if before_unique and after_unique:
            return i

    return len(current)-size

def ellipsize(entries, prefix_size, max_string_length):
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this.

    #If you want to preserve order then make a copy and make a mapping from the copy to the original
    entries.sort()

    ellipsized = []

    # you could probably remove all this indexing with something out of itertools
    for i in range(len(entries)):
        current = entries[i]

        #entry is already short enough, don't need to truncate
        if len(current) <= max_string_length:
            ellipsized.append(current)
            continue

        #grab empty strings if there's no string before/after
        if i == 0:
            before = ''
        else:
            before = entries[i-1]
        if i == len(entries)-1:
            after = ''
        else:
            after = entries[i+1]

        #Is the prefix unique? If so, we're done.
        current_prefix = entries[i][:prefix_size]    
        if not before.startswith(current_prefix) and not after.startswith(current_prefix):
            ellipsized.append(current[:max_string_length] + '...') #again, possibly -3

        #Otherwise find the unique part after the prefix if it exists.
        else:
            index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size)
            if index == prefix_size:
                header = ''
            else:
                header = '...'
            if index + non_prefix_size == len(current):
                trailer = ''
            else:
                trailer = '...'
            ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer)
    return ellipsized

Also, you mention the string themselves are unique, but do they all have unique parts? For example, "Microsoft" and "Microsoft Internet Explorer 7" are two different strings, but the first has no part that is unique from the second. If this is the case, then you'll have to add something to your spec as to what to do to make this case unambiguous. (If you add "Xicrosoft", "MXcrosoft", "MiXrosoft", etc. to the mix with these two strings, there is no unique string shorter than the original string to represent "Microsoft") (Another way to think about it: if you have all possible X letter strings you can't compress them all to X-1 or less strings. Just like no compression method can compress all inputs, as this is essentially a compression method.)

Results from original post:

>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20):
    print entry

Google Chrome 14
Microso...et Explorer 6
Microso...et Explorer 7
Microso...et Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40):
    print entry

Minutes of Comp...5/25/2010 -- Internal use...
Minutes of Comp...6/24/2010 -- Internal use...
Minutes of Comp...7/23/2010 -- Internal use...
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜