Un-nest HTML Tags

2023-02-10 22:17

We're creating a script to convert certain XHTML files into Word files. However, Word files and HTML files handle formatting quite differently.

For instance, we may have a section as follows:

<p>Title

    <ol>
        <li><p>List 1</p></li>
        <li><p>List 2</p></li>
    </ol>

Additional Information</p>

The structure varies between files: some are legacy files written before certain standards were in place, and each file was written by a different person, which creates inconsistencies. Many files are heavily nested, and many are not. The problem is detecting when a file is nested. While nested markup may render perfectly in a web browser, the HTML for a Word document must be flattened into something like the following before it can be easily converted into the XML formatting used by Word (using the previous example):

<p>Title</p>

<li>List 1</li>
<li>List 2</li>

<p>Additional Information</p>

This is because a Word document, following the OpenXML standards, relies heavily on each formatting section explicitly beginning and ending before a new section can be created. Unfortunately, this applies everywhere, even to bold or italic sections.

I've already created a small regular expression to convert lists into the proper format: it finds what type of list it is, removes the p tags, and converts each li tag into either an oli tag for ordered lists or a uli tag for unordered lists. The result is then converted into the proper XML formatting for the Word document.
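For readers following along, here is a minimal sketch of that kind of list conversion. It is written in Python purely for illustration and is not the regular expression from the question; the function name `convert_lists` is hypothetical, and it assumes lists are not nested inside other lists.

```python
import re


def convert_lists(html: str) -> str:
    """Flatten each <ol>/<ul> block into <oli>/<uli> items (non-nested lists only)."""

    def flatten(match: re.Match) -> str:
        # Pick the item tag from the list type: ol -> oli, ul -> uli.
        item_tag = "oli" if match.group(1).lower() == "ol" else "uli"
        items = re.findall(r"<li\b[^>]*>(.*?)</li\s*>", match.group(2),
                           flags=re.S | re.I)
        lines = []
        for item in items:
            # Remove the <p> wrappers inside each list item.
            text = re.sub(r"</?p\b[^>]*>", "", item, flags=re.I).strip()
            lines.append(f"<{item_tag}>{text}</{item_tag}>")
        return "\n".join(lines)

    # Match a whole list, from its opening tag to the matching closing tag name.
    return re.sub(r"<(ol|ul)\b[^>]*>(.*?)</\1\s*>", flatten, html,
                  flags=re.S | re.I)
```

Because the pattern is non-greedy and knows nothing about nesting, a list inside a list would be cut at the first inner closing tag, so this only suits the flat lists shown in the example.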

The problem I'm encountering is that it is much harder to detect whether, say, a p tag is nested, as in the above example, and if so, to inject a new closing p tag before the first li tag and a new opening p tag after the list, creating the un-nested, linear tagging that we're looking for.
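One way to approach this without a regular expression is to walk the tags in order and keep track of whether a p is currently open. The sketch below is only an illustration under assumptions: it uses Python's standard `html.parser`, the class and function names are made up, and the set of "block" tags (ol, ul, table) is a guess based on the tags listed in the question. Comments and doctype declarations are dropped, and character references are decoded and re-escaped.

```python
from html import escape
from html.parser import HTMLParser

# Assumed set of block elements that must not stay inside a <p>.
BLOCK_TAGS = {"ol", "ul", "table"}


class PUnnester(HTMLParser):
    """Close an open <p> before a block element and reopen it afterwards."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []           # output chunks, joined at the end
        self.in_p = False       # a top-level <p> is open in the output
        self.suspended = False  # a <p> was closed early and may need reopening
        self.depth = 0          # nesting depth inside BLOCK_TAGS

    def _reopen(self):
        # Reopen the suspended paragraph when inline content follows a block.
        if self.depth == 0 and self.suspended and not self.in_p:
            self.out.append("<p>")
            self.in_p, self.suspended = True, False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0 and self.in_p:
                # Inject </p> before the block, keeping whitespace outside it.
                tail = self.out.pop()
                text = tail.rstrip()
                self.out += [text, "</p>", tail[len(text):]]
                self.in_p, self.suspended = False, True
            self.depth += 1
        elif tag == "p" and self.depth == 0:
            if self.in_p:
                self.out.append("</p>")  # a <p> directly inside a <p>
            self.in_p, self.suspended = True, False
        else:
            self._reopen()
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._reopen()
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.depth = max(0, self.depth - 1)
        elif tag == "p" and self.depth == 0:
            was_open, self.in_p, self.suspended = self.in_p, False, False
            if not was_open:
                return  # drop the </p> whose paragraph was already closed early
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip() and self.depth == 0 and self.suspended and not self.in_p:
            body = data.lstrip()
            self.out.append(data[: len(data) - len(body)])  # whitespace stays outside <p>
            self._reopen()
            data = body
        self.out.append(escape(data, quote=False))


def unnest_paragraphs(html: str) -> str:
    parser = PUnnester()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Paragraphs inside list items are left alone here on purpose, since the list conversion described earlier already strips them. Applied to the example from the question, the outer paragraph is split into one paragraph before the list and one after it.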

My question is whether anyone knows a relatively simple way to do this, such as a regular expression, or whether it would generally be easier to go back through all of the legacy files and clean them up to the current standards to make them compatible. (This is not preferable: we have a lot of these files, and we don't want to miss any inconsistencies and end up with improperly formatted Word documents before we can catch them.)

Generally, we don't use any tags other than p, ol/ul/li, em, strong, table/th/tr/td, and a. I've also found some text that is not inside any HTML tags, which we would prefer to wrap in a p tag.
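As a rough illustration of the last point, loose text can be found by counting open elements while reading the file. This is a sketch in Python with hypothetical names, and it assumes well-formed XHTML in which every opened element is closed or self-closed, otherwise the depth count drifts.

```python
from html import escape
from html.parser import HTMLParser


class BareTextWrapper(HTMLParser):
    """Wrap text that sits outside every element in a <p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.depth = 0  # number of currently open elements

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br /> do not change the depth.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = escape(data, quote=False)
        if self.depth == 0 and data.strip():
            # Wrap only the visible text, leaving surrounding whitespace alone.
            body = text.strip()
            start = text.index(body)
            text = text[:start] + "<p>" + body + "</p>" + text[start + len(body):]
        self.out.append(text)


def wrap_bare_text(html: str) -> str:
    parser = BareTextWrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Loose text that is interrupted by an inline tag such as em would be wrapped in separate paragraphs on each side of that tag, so a real script would need an extra rule for that case.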

Note: PDF is not an acceptable option, as we are looking for ease of use, and script size limits generally prohibit this.

I would suggest using an HTML library like htmLawed to remove the tags you don't want to deal with.

http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/

A snippet from their feature list:

  *  understands improperly spaced tag content (like, spread over more than a line) and properly spaces them
  *  attempts to balance tags for well-formedness
  *  understands when omitable closing tags like </p> (allowed in HTML 4, transitional, e.g.) are missing
  *  attempts to permit only validly nested tags
  *  option to remove or neutralize bad content
  *  attempts to rectify common errors of plain-text misplacement (e.g., directly inside blockquote)

I've found the easiest way to do this is to remove the ending tags in the content, then remove the first tag as well. Then replace each beginning tag with a generic section ending tag followed by its respective opening section tag. Finally, append the first opening tag and the last closing tag onto the beginning and end of the content respectively. It works fine now. Thank you all for the help.
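The steps above can be read in more than one way, so the following is only one possible interpretation, sketched in Python rather than taken from the actual script. The function name `linearize` and the `end_tag` parameter (standing in for the "generic section ending tag") are assumptions.

```python
import re

OPEN_TAG = re.compile(r"<[a-zA-Z]\w*\b[^>]*>")
CLOSE_TAG = re.compile(r"</[a-zA-Z]\w*\s*>")


def linearize(html: str, end_tag: str = "</section>") -> str:
    """Turn nested tags into a flat run of sections, one per opening tag."""
    # 1. Remove every closing tag.
    body = CLOSE_TAG.sub("", html)
    first = OPEN_TAG.search(body)
    if first is None:
        return html  # nothing to linearize
    # 2. Remove the first opening tag.
    body = body[: first.start()] + body[first.end():]
    # 3. Put the generic end tag in front of every remaining opening tag.
    body = OPEN_TAG.sub(lambda m: end_tag + m.group(0), body)
    # 4. Put the first opening tag back and close the last section.
    return first.group(0) + body + end_tag
```

For example, `linearize("<p>One<p>Two<p>Three</p></p></p>", end_tag="</p>")` gives three sibling paragraphs. One consequence of this reading is that text following a closing tag stays in the section opened by the previous tag, so it suits content where every run of text has its own opening tag.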
