Un-nest HTML Tags

We're creating a script to convert certain XHTML files into Word files; however, Word files and HTML files handle formatting quite differently.

For instance, we may have a section as follows:

<p>Title

    <ol>
        <li><p>List 1</p></li>
        <li><p>List 2</p></li>
    </ol>

Additional Information</p>

The structure varies between files: some are legacy files written before certain standards existed, and the files were written by different people, which creates inconsistencies. Many files are heavily nested, and many are not. The problem lies in detecting when a file is nested. A nested file may render perfectly in a web browser, but to be converted easily into the XML formatting used by Word, the HTML must be laid out like the following (using the previous example):

<p>Title</p>

<li>List 1</li>
<li>List 2</li>

<p>Additional Information</p>

This is because a Word document, under the OpenXML standards, relies heavily on each formatted section explicitly beginning and ending before a new section can be created. Unfortunately, this applies everywhere, even to bold or italic sections.

I've already created a small regular expression to convert lists into the proper format by finding what type of list it is, removing the p tags, and converting each li tag into an oli tag for ordered lists or a uli tag for unordered lists. This is then converted into the proper XML formatting for the Word document.
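The regular expression itself isn't shown here, so the following is only a minimal sketch of that kind of conversion. Python is an assumption (the script's language isn't stated), and the function name `convert_lists` is made up for illustration. It does not handle a list nested inside another list.

```python
import re

# Matches one <ol>...</ol> or <ul>...</ul> block (non-greedy, so nested
# lists are NOT handled by this sketch).
LIST_RE = re.compile(r"<(ol|ul)\b[^>]*>(.*?)</\1>", re.DOTALL | re.IGNORECASE)


def convert_lists(html_text):
    """Replace each list with flat <oli>/<uli> items (hypothetical helper)."""

    def repl(match):
        # Pick the item tag from the list type.
        kind = "oli" if match.group(1).lower() == "ol" else "uli"
        # Remove the p tags inside the list.
        body = re.sub(r"</?p\b[^>]*>", "", match.group(2), flags=re.IGNORECASE)
        # Rename the li tags.
        body = re.sub(r"<li\b[^>]*>", f"<{kind}>", body, flags=re.IGNORECASE)
        body = re.sub(r"</li\s*>", f"</{kind}>", body, flags=re.IGNORECASE)
        return body.strip()

    return LIST_RE.sub(repl, html_text)
```

For example, `convert_lists("<ol><li><p>List 1</p></li><li><p>List 2</p></li></ol>")` gives `<oli>List 1</oli><oli>List 2</oli>`.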

The problem I'm encountering is that it is much harder to detect whether, say, a p tag is nested, as in the above example, and if so, to inject a closing p tag before the first li tag and a new opening p tag after the list, creating the un-nested, linear tagging that we're looking for.

My question is whether anyone knows of a relatively simple way to do this, such as a regular expression, or whether it would generally be easier to go back through all of the legacy files and clean them up to the current standards to make them compatible. (This is not preferable, as we have a lot of these files and would not want any inconsistencies to slip through, producing improperly formatted Word documents before we can catch them.)

Generally, we don't use more tags than p, ol/ul/li, em, strong, table/th/tr/td, and a. I've also found some text that is not inside any HTML tags, which should preferably be wrapped in a p tag.
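With a tag set that small, one possible route is a tokenizing parser instead of a regular expression. The sketch below uses `html.parser` from the Python standard library (Python is an assumption, and `Flattener` and `flatten` are hypothetical names). It ends the current section whenever a p, li, ol, or ul boundary is seen, drops p tags that sit inside an li, and wraps any remaining loose text in a p. The inline tags em, strong, and a are passed through unchanged. Tables are not handled, and an inline tag that spans a section boundary would be left unbalanced.

```python
from html import escape
from html.parser import HTMLParser

INLINE_TAGS = {"em", "strong", "a"}   # passed through unchanged
LIST_TAGS = {"ol", "ul"}              # containers: they only end a section


class Flattener(HTMLParser):
    """Collects a flat list of (tag, body) sections from nested markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = []
        self.buffer = []
        self.li_depth = 0

    def flush(self):
        # Collapse whitespace runs and close off the section collected so far.
        body = " ".join("".join(self.buffer).split())
        self.buffer = []
        if body:
            self.sections.append(("li" if self.li_depth else "p", body))

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.flush()
            self.li_depth += 1
        elif tag in LIST_TAGS or (tag == "p" and not self.li_depth):
            self.flush()
        elif tag in INLINE_TAGS:
            self.buffer.append(self.get_starttag_text())
        # A p inside an li, or any other tag, is dropped; its text is kept.

    def handle_endtag(self, tag):
        if tag == "li":
            self.flush()
            self.li_depth = max(0, self.li_depth - 1)
        elif tag in LIST_TAGS or (tag == "p" and not self.li_depth):
            self.flush()
        elif tag in INLINE_TAGS:
            self.buffer.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text, since convert_charrefs decoded any entities.
        self.buffer.append(escape(data, quote=False))


def flatten(html_text):
    parser = Flattener()
    parser.feed(html_text)
    parser.close()
    parser.flush()  # trailing text outside any tag becomes a final <p>
    return "\n".join(f"<{tag}>{body}</{tag}>" for tag, body in parser.sections)
```

Fed the nested example from above, `flatten` returns four lines: `<p>Title</p>`, `<li>List 1</li>`, `<li>List 2</li>`, and `<p>Additional Information</p>`.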

Note: PDF is not an acceptable option, as we are looking for ease of use, and script size limits generally prohibit this.


I would suggest using an HTML library like htmLawed to remove the tags you don't want to deal with.

http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/

A snippet from their feature list:

  *  understands improperly spaced tag content (like, spread over more than a line) and properly spaces them
  *  attempts to balance tags for well-formedness
  *  understands when omitable closing tags like </p> (allowed in HTML 4, transitional, e.g.) are missing
  *  attempts to permit only validly nested tags
  *  option to remove or neutralize bad content
  *  attempts to rectify common errors of plain-text misplacement (e.g., directly inside blockquote)


I've found the easiest way to do this is to remove the ending tags in the content, then remove the first tag as well. Then replace each beginning tag with a generic section ending tag followed by its respective opening section tag. Finally, append the first opening tag and the last closing tag onto the beginning and end of the content respectively. It works fine now. Thank you all for the help.
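Read literally, those steps could look something like the sketch below. This is only one interpretation: Python is an assumption, the function name `linearize` is invented, and `</end>` is a placeholder because the actual generic ending tag isn't named. It treats every tag the same way, inline tags included.

```python
import re

OPEN_TAG = re.compile(r"<(?!/)[^>]+>")   # any tag that does not start with </
CLOSE_TAG = re.compile(r"</[^>]+>")      # any closing tag


def linearize(content, generic_end="</end>"):
    """Hypothetical helper; generic_end is a placeholder tag name."""
    opens = OPEN_TAG.findall(content)
    closes = CLOSE_TAG.findall(content)
    if not opens or not closes:
        return content  # nothing to un-nest
    first_open, last_close = opens[0], closes[-1]

    # 1. Remove every closing tag.
    body = CLOSE_TAG.sub("", content)
    # 2. Remove the first opening tag.
    body = body.replace(first_open, "", 1)
    # 3. Put the generic ending tag in front of each remaining opening tag.
    body = OPEN_TAG.sub(lambda m: generic_end + m.group(0), body)
    # 4. Restore the first opening tag and the last closing tag.
    return first_open + body + last_close
```

With `<p>Title<oli>List 1</oli><oli>List 2</oli></p>` as input, this gives `<p>Title</end><oli>List 1</end><oli>List 2</p>`. Because the closing tags are discarded in step 1, any text that followed the list inside the original p would end up in the last list item's section under this reading.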
