
Algorithm to split an article without breaking the reading flow or HTML code

I have a very large database of articles, of varying lengths. The articles have HTML elements in them. I have to insert some ads (simple <script> elements) in the body of each article when it is displayed (I know, I hate ads that interrupt my reading too).

Now, the problem is that each ad must be inserted at about the same position in each article. The simplest solution is to split the article on a fixed number of characters (without breaking words) and insert the ad code there. This, however, runs the risk of inserting the ad in the middle of an HTML tag.
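To make that risk concrete, here is a minimal sketch of the naive split. The helper names `splitAtWordBoundary` and `isInsideTag` are made up for illustration, and the tag check is a rough heuristic that assumes attribute values never contain `<` or `>`.

```javascript
// Hypothetical helper: move back from `offset` to the nearest whitespace so
// that no word is cut in half. Falls back to `offset` if none is found.
function splitAtWordBoundary(html, offset) {
  if (offset >= html.length) return html.length;
  let i = offset;
  while (i > 0 && !/\s/.test(html[i])) i--;
  return i > 0 ? i : offset;
}

// Hypothetical helper: true when `index` lies between a "<" and its ">".
// Heuristic only: a ">" inside an attribute value would fool it.
function isInsideTag(html, index) {
  if (index <= 0) return false;
  const lastOpen = html.lastIndexOf("<", index - 1);
  const lastClose = html.lastIndexOf(">", index - 1);
  return lastOpen > lastClose;
}

const sample = '<p>Hello <a href="x y">world</a> again</p>';
const cut = splitAtWordBoundary(sample, 21); // lands on the space in "x y"
console.log(cut, isInsideTag(sample, cut));  // 19 true
```

The word-boundary rule alone happily picks the space inside the `href` value, which is exactly the failure described above.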

I could go the regex way, but I was thinking about the following solution, using JS:

  1. Establish a character count threshold. For example, "the ad should be inserted at about 200 characters".
  2. Set accepted deviations in each direction, say -20, +20 characters.
  3. Loop through each text node inside the article, and while doing so, keep count of the total number of characters so far
  4. Once the count exceeds the threshold, make the following decision:

    4.1. If the count exceeds the threshold by less than the positive accepted deviation (for example, by 17 characters), insert the ad code just after the current text node.

    4.2. If the count is greater than the sum of the threshold and the deviation, roll back to the previous text node and look at the previous count instead. If that count is not lower than the difference between the threshold and the deviation, insert the ad between the current node and the previous one.

    4.3. If both 4.1 and 4.2 fail (meaning the previous node's count was too low and the current node's count is too high), insert the ad inside the current node, at whatever character offset is needed.

I know it's convoluted, but it's the first thing that came to mind, and it has the advantage that, by trying to insert the ad between text nodes, it may not break the flow of the article as badly as it would if I just stuck the ad in mid-text (as in the final 4.3 case).

Here is some pseudo-code I put together, since I don't trust my skills at explaining things in English:

threshold = 200
deviation = 20
current_count = 0

for each node in article_nodes {
    previous_count = current_count
    current_count = current_count + node.length
    if current_count < threshold {
        continue // next iteration
    }

    if current_count > threshold + deviation {
        if previous_count < threshold - deviation {
            // insert ad in current node
        } else {
            // insert ad between the current and previous nodes
        }
    } else {
        // insert ad after the current node
    }

    break;
}
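The pseudo-code above could be sketched in JavaScript roughly as follows. To keep it runnable outside a browser, this version takes an array of text-node lengths instead of real DOM nodes; the function name `chooseAdPosition` and the shape of the returned object are assumptions made for this example.

```javascript
// Decide where the ad goes, given the lengths of the article's text nodes in
// document order. Returns null when the article never reaches the threshold.
function chooseAdPosition(nodeLengths, threshold = 200, deviation = 20) {
  let current = 0;
  for (let i = 0; i < nodeLengths.length; i++) {
    const previous = current;
    current += nodeLengths[i];
    if (current < threshold) continue; // not there yet, next node

    if (current <= threshold + deviation) {
      // Case 4.1: the end of this node is close enough to the threshold.
      return { kind: "after", node: i };
    }
    if (previous >= threshold - deviation) {
      // Case 4.2: the start of this node is close enough to the threshold.
      return { kind: "before", node: i };
    }
    // Case 4.3: neither boundary is acceptable, so split inside the node.
    return { kind: "inside", node: i, offset: threshold - previous };
  }
  return null;
}

console.log(chooseAdPosition([150, 60]));  // { kind: 'after', node: 1 }
console.log(chooseAdPosition([190, 100])); // { kind: 'before', node: 1 }
console.log(chooseAdPosition([100, 300])); // { kind: 'inside', node: 1, offset: 100 }
```

In a browser, the lengths could be collected by walking the article with `document.createTreeWalker(article, NodeFilter.SHOW_TEXT)` and reading each node's `length`. For the "inside" case the offset should still be nudged to a word boundary before splitting.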

Am I over-complicating stuff, or am I missing a simpler, more elegant solution?

PS: both server side and client side solutions are OK for me.


Ideally, I would insert an ad only at a paragraph break (perhaps a p tag) or a line break (perhaps a br tag).

Failing that, at a word break. And failing that, force it in between characters (to cover weird corner cases).

So here's the K.I.S.S. solution:

Count letters, words, lines, AND paragraphs as you go.

Simply do a cascade failure towards your preferred solution:

  1. If you get to 2000 characters -- just force in an ad and start counting everything again from scratch.

That would never happen except in weird cases.

  2. If you get to 250 words -- just force in an ad and start counting everything again from scratch.

That would happen very infrequently, only with poorly formatted text, weird alien languages etc.

  3. If you get to 50 new lines -- just force in an ad and start counting everything again from scratch.

That would only happen occasionally, with writers who don't use paragraph breaks.

  4. And finally, if you get to 3 new paragraphs -- put in an ad and start counting everything again from scratch.

That's what would normally happen.
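The cascade above could be sketched like this. It is only an illustration under several assumptions: the function name `insertAds` and the `LIMITS` object are invented here, paragraphs are detected by a closing `</p>` and lines by `<br>`, and tags are found with a simple regex rather than a real HTML parser. All limits are assumed to be positive.

```javascript
// Tunable limits taken from the numbers in the answer.
const LIMITS = { chars: 2000, words: 250, lines: 50, paragraphs: 3 };

function insertAds(html, ad, limits = LIMITS) {
  const out = [];
  let count = { chars: 0, words: 0, lines: 0, paragraphs: 0 };
  // Placing an ad always resets every counter ("start again from scratch").
  const place = () => {
    out.push(ad);
    count = { chars: 0, words: 0, lines: 0, paragraphs: 0 };
  };

  // Split into tags and text runs; the capture group keeps the tags.
  for (const token of html.split(/(<[^>]*>)/)) {
    if (token === "") continue;
    if (token.startsWith("<")) {
      out.push(token);
      if (/^<\/p\s*>$/i.test(token)) {
        // Preferred spot: after a paragraph.
        if (++count.paragraphs >= limits.paragraphs) place();
      } else if (/^<br\s*\/?>$/i.test(token)) {
        // Next best: after a line break.
        if (++count.lines >= limits.lines) place();
      }
      continue;
    }
    // Text run: walk it word by word, keeping the whitespace as-is.
    for (const piece of token.split(/(\s+)/)) {
      if (piece === "") continue;
      if (/^\s+$/.test(piece)) { out.push(piece); continue; }
      let word = piece;
      // Last resort: cut mid-word when the character budget runs out.
      while (count.chars + word.length > limits.chars) {
        const room = limits.chars - count.chars;
        out.push(word.slice(0, room));
        word = word.slice(room);
        place();
      }
      out.push(word);
      count.chars += word.length;
      // Fallback: after a word.
      if (++count.words >= limits.words) place();
    }
  }
  return out.join("");
}

console.log(insertAds("<p>a</p><p>b</p><p>c</p><p>d</p>", "[AD]"));
// <p>a</p><p>b</p><p>c</p>[AD]<p>d</p>
```

Each check only fires at its own kind of boundary, so well-formatted articles get their ads after paragraphs and the cruder limits only kick in for unusual text.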

I would not bother with complicated ideas like backtracking in nearby cases, etc. It's just plain not worth it. A consistent, simple "cascading failures" approach almost always gives you a better overall long-term solution. Do the above and you're done!

It's much more art than science doing something like this. You'll enjoy the above, hope it helps!

Obviously, tune the numbers I put in the pseudocode above. Most of the work on a job like this is tuning parameters on an actual testbed. Writing the code itself is nothing; you need to create a good testbed so you can do it in front of your eyes and see it working (ideally include "dials" for the parameters, so you can see the results in real time, you know?). That's how you do it!
