Optimizing WordWrap Algorithm

2023-02-17 12:33 问答作者：

I have a word-wrap algorithm that basically generates lines of text that fit the width of the text. Unfortunately, it get开发者_JAVA百科s slow when I add too much text.

I was wondering if I oversaw any major optimizations that could be made. Also, if anyone has a design that would still allow strings of lines or string pointers of lines that is better I'd be open to rewriting the algorithm.

Thanks

void AguiTextBox::makeLinesFromWordWrap()
{
    textRows.clear();
    textRows.push_back("");
    std::string curStr;
    std::string curWord;

    int curWordWidth = 0;
    int curLetterWidth = 0;
    int curLineWidth = 0;

    bool isVscroll = isVScrollNeeded();
    int voffset = 0;
    if(isVscroll)
    {
        voffset = pChildVScroll->getWidth();
    }
    int AdjWidthMinusVoffset = getAdjustedWidth() - voffset;
    int len = getTextLength();
    int bytesSkipped = 0;
    int letterLength = 0;
    size_t ind = 0;

    for(int i = 0; i < len; ++i)
    {

        //get the unicode character
        letterLength = _unicodeFunctions.bringToNextUnichar(ind,getText());
        curStr = getText().substr(bytesSkipped,letterLength);


        bytesSkipped += letterLength;

        curLetterWidth = getFont().getTextWidth(curStr);

        //push a new line
        if(curStr[0] == '\n')
        {
            textRows.back() += curWord;
            curWord = "";
            curLetterWidth = 0;
            curWordWidth = 0;
            curLineWidth = 0;
            textRows.push_back("");
            continue;
        }



            //ensure word is not longer than the width
            if(curWordWidth + curLetterWidth >= AdjWidthMinusVoffset && 
                curWord.length() >= 1)
            {
                textRows.back() += curWord;

                textRows.push_back("");
                curWord = "";
                curWordWidth = 0;
                curLineWidth = 0;
            }

            //add letter to word
            curWord += curStr;
            curWordWidth += curLetterWidth;


        //if we need a Vscroll bar start over
        if(!isVscroll && isVScrollNeeded())
        {
            isVscroll = true;
            voffset = pChildVScroll->getWidth();
            AdjWidthMinusVoffset = getAdjustedWidth() - voffset;
            i = -1;
            curWord = "";
            curStr = "";
            textRows.clear();
            textRows.push_back("");
            ind = 0;

            curWordWidth = 0;
            curLetterWidth = 0;
            curLineWidth = 0;

            bytesSkipped = 0;
            continue;
        }

        if(curLineWidth + curWordWidth >= 
            AdjWidthMinusVoffset && textRows.back().length() >= 1)
        {
            textRows.push_back("");
            curLineWidth = 0;
        }

        if(curStr[0] == ' ' || curStr[0] == '-')
        {
            textRows.back() += curWord;
            curLineWidth += curWordWidth;
            curWord = "";
            curWordWidth = 0;
        }
    }

    if(curWord != "")
    {
        textRows.back() += curWord;
    }

    updateWidestLine();
}

There are two main things making this slower than it could be, I think.

The first, and probably less important: as you build up each line, you're appending words to the line. Each such operation may require the line to be reallocated and its old contents copied. For long lines, this is inefficient. However, I'm guessing that in actual use your lines are quite short (say 60-100 characters), in which case the cost is unlikely to be huge. Still, there's probably some efficiency to be won there.

The second, and probably much more important: you're apparently using this for a text-area in some sort of GUI, and I'm guessing that it's being typed into. If you're recomputing for every character typed, that's really going to hurt once the text gets long.

As long as the user is only adding characters at the end -- which is surely the most common case -- you can make effective use of the fact that with your "greedy" line-breaking algorithm changes never affect anything on earlier lines: so just recompute from the start of the last line.

If you want to make it fast even when the user is typing (or deleting or whatever) somewhere in the middle of the text, your code will need to do more work and store more information. For instance: whenever you build a line, remember "if you start a line with this word, it ends with that word and this is the whole resulting line". Invalidate this information when anything changes within that line. Now, after a little editing, most changes will not require very much recalculation. You should work out the details of this for yourself because (1) it's a good exercise and (2) I need to go to bed now.

(To save on memory, you might prefer not to store whole lines at all -- whether or not you implement the sort of trick I just described. Instead, just store here's-the-next-line-break information and build up lines as your UI needs to render them.)

It's probably more complication than you want to take on board right now, but you should also look up Donald Knuth's dynamic-programming-based line-breaking algorithm. It's substantially more complicated than yours but can still be made quite quick, and it produces distinctly better results. See, e.g., http://defoe.sourceforge.net/folio/knuth-plass.html.

Problems on algorithms often come with problem on data-structures.

Let's make a few observations, first:

paragraphs can be treated independently
editing at a given index only invalidates the current word and those that follow
it is unnecessary to copy the whole words when their index would suffice for retrieving them and only their length matter for the computation

Paragraph

I would begin by introducing the notion of paragraph, which are determined by user-introduced line-breaks. When an edition takes place, you need to locate which is the concerned paragraph, which requires a look-up structure.

The "ideal" structure here would be a Fenwick Tree, for a small text box however this seems overkill. We'll just have each paragraph store the number of displayed lines that make up its representation and you'll count from the beginning. Note that an access to the last displayed line is an access to the last paragraph.

The paragraphs are thus stored as a contiguous sequence, in C++ terms, well probably take the hit of an indirection (ie storing pointers) to save moving them around when a paragraph in the middle is removed.

Each paragraph will store:

its content, the simplest being a single std::string to represent it.
its display, in editable form (which we need to determine still)

Each paragraph will cache its display, this paragraph cache will be invalidated whenever an edit is made.

The actual rendering will be made for only a couple of paragraphs at a time (and better, a couple of displayed lines): those which are visible.

Displayed Line

A paragraph may be to displayed with at least one line, but there is no maximum. We need to store the "display" in editable form, that is a form suitable for edition.

A single chunk of characters with \n thrown in is not suitable. Changes imply moving lots of characters around, and users are supposed to be changing the text, so we need better.

Using lengths, instead of characters, we may actually only store a mere 4 bytes (if the string takes more than 3GB... I don't guarantee much about this algorithm).

My first idea was to use the character index, however in case of edition all subsequent indexes are changed, and the propagation is error prone. Lengths are offsets, so we have an index relative to the position of the previous word. It does pose the issue of what a word (or token) is. Notably, do you collapse multiple spaces ? How do you handle them ? Here I'll assume that words are separated from one another by a single whitespace.

For "fast" retrieval, I'll store the length of the whole displayed line as well. This allows quickly skipping the first displayed lines when an edit is made at character 503 of the paragraph.

A displayed line will thus be composed of:

a total length (inferior to the maximum displayed length of the box, once computation ended)
a sequence of words (tokens) length

This sequence should be editable efficiently at both ends (since for wrapping we'll push/pop words at both ends depending on whether an edit added or removed words). It's not so important if in the middle we're not that efficient, because only one line at a time is edited in the middle.

In C++, either a vector or deque should be fine. While in theory a list would be "perfect", in practice its poor memory locality and high memory overhead will offset its asymptotic guarantees. A line is composed of few words, so the asymptotic behavior does not matter and high constants do.

Rendering

For the rendering, pick up a buffer of already sufficient length (a std::string with a call to reserve will do). Normally, you'd clear and rewrite the buffer each time, so no memory allocation occurs.

You need not display what cannot be seen, but do need to know how many lines there are, to pick up the correct paragraph.

Once you get the paragraph:

set offset to 0
for each line hidden, increment offset by its length (+ 1 for the space after it)
a word is accessed as a substring of _content, you can use the insert method on buffer: buffer.insert(buffer.end(), _content[offset], _content[offset+length])

The difficulty is in maintaining offset, but that's what makes the algorithm efficient.

Structures

struct LineDisplay: private boost::noncopyable
{
  Paragraph& _paragraph;
  uint32_t _length;
  std::vector<uint16_t> _words; // copying around can be done with memmove
};

struct Paragraph:
{
  std::string _content;
  boost::ptr_vector<LineDisplay> _lines;
};

With this structure, implementation should be straightforward, and should not slow down as much when the content grows.

General change to the algorithm -

work out if you need the scroll bar as cheap as you can, ie. count the number of \n in the text and if it's greater then the vheight turn on the scroll, check lengths so on.
prepare the text into appropriate lines for the control now that you know you need a scroll bar or not.

This allows you to remove/reduce the test if(!isVscroll && isVScrollNeeded()) as is run on almost every character - isVScroll is probably not cheep, the example code doesn't seem to pass knowledge of lines to the function so can't see how it tells if it is needed.

Assuming textRows is a vector<string> - textrows.back() += is kind of expensive, looking up the back not so much as += on string not being efficient for strings. I'd change to using a ostrstream for gathering the row and push it in when it is done.

getFont().getWidth() are likely to be expensive - is the font changing? how greatly does the width differ between smallest and largest, shortcuts for fixed width fonts.

Use native methods where possible to get the size of a word since you don't want to break them - GetTextExtentPoint32

Often the will be sufficient space to allow for the VScroll when you change between. Restarting from the beginning with measuring could cost you up to twice the time. Store the width of the line with each line so you can skip over the ones that still fit. Or don't build the line strings directly, keep the words seperate with the size.

How accurate does it realy need to be? Apply some pragmatism...
Just assume VScroll will be needed, mostly wrapping won't change much even if it isn't (1 letter words at the end/start of a line)

try and work more with words than with letters - checking remaining space for each letter can waste time. assume each letter in the string is the longest letter, letters x longest < space then put it in.

继续阅读：algorithm optimization

Optimizing WordWrap Algorithm

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？