RegEx to convert Word output to an HTML ordered list
I need a tricky regex, and I don't know whether it can be written.
I'm trying to clean up some horrid HTML output from MS Word. Here's an example of the dandy job it does on an ordered (or numbered) list:
<p>1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi</p>
<p>2.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</p>
<p>3.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ac Nec Netus Penatibus Purus Cras Mollis</p>
Beautiful, isn't it? Paragraph tags and nonbreaking spaces...
I'm wondering if it's even feasible to write a regex to replace this with the following:
<ol>
<li>Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi</li>
<li>Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</li>
<li>Ac Nec Netus Penatibus Purus Cras Mollis</li>
</ol>
The difficulty is that the number of &nbsp;s can vary from none to just a few to a lot, and a list can be of varying lengths. Having no &nbsp;s seems to be rare, and it seems to happen only after a list gets larger (say when going from 9 to 10 or 99 to 100).
Anyway, if such a thing is possible, that would be awesome. As it stands, I can search for long strings of &nbsp;s and then manually apply list formatting, but it's not as fast as automatic.
First: all the standard replies apply to this question: you (should|can|may) not parse/process html (valid or not) using regex. For a wide range of reasons not to do this, I recommend searching the web and/or SO.
That said (and assuming your paragraph tags cannot be nested!), you cannot do this in one replacement. You will first have to wrap <ol> and </ol> tags around the paragraphs that "look like" ordered lists. I assume that a paragraph is an ordered-list item when it starts with <p> NUMBER. (a paragraph tag, optional white space, a number and a full stop).
regex : (?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)
replacement : <ol>$1</ol>
A short explanation:
// regex
(?s)                 # enable DOT-ALL matching
(                    # open capturing group 1
  (?:                # open non-capturing group
    <p>\s*\d+\.      # match '<p>', zero or more white space characters, a number and a full stop
    (?:(?!</p>).)*   # [when looking ahead, if there's no '</p>', only then match any character] zero or more times
    </p>             # match '</p>'
    \s*              # match zero or more white space characters
  )                  # close non-capturing group
  +                  # repeat the non-capturing group one or more times
)                    # close group 1
// replacement
<ol> # insert '<ol>'
$1 # insert what is matched by the regex in group 1
</ol> # insert '</ol>'
Now your string will contain:
<ol><p>1.
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </p>
<p>2.
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </p>
<p>3.
Ac Nec Netus Penatibus Purus Cras Mollis </p></ol>
Next, replace all the paragraphs (including their numbers!) with <li> and </li> tags:
regex : (?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>
replacement : <li>$1</li>
Again, a short explanation:
// regex
(?s) # enable DOT-ALL matching
<p> # match '<p>'
\s* # match zero or more white space characters
\d+ # match one or more digits
\. # match a dot
( # start group 1
(?:(?!</p>).)* # [when looking ahead, if there's no '</p>', only then match any character] zero or more times
) # end group 1
</p> # match '</p>'
// replacement
<li> # insert '<li>'
$1 # insert what is matched by the regex in group 1
</li> # insert '</li>'
Now your string will look like:
<ol><li>
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </li>
<li>
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </li>
<li>
Ac Nec Netus Penatibus Purus Cras Mollis </li></ol>
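For readers who want to try the two replacements end to end, here is a minimal Python sketch of them. The function name paragraphs_to_ol and the sample string are made up for illustration; the two patterns are the ones given above, with $1 written as \1 because that is how Python's re module spells a group reference. It carries the same assumptions as the regexes themselves: paragraph tags are not nested and carry no attributes.

```python
import re

# Step 1: wrap every run of numbered paragraphs in <ol>...</ol>.
LIST_RUN = re.compile(r"(?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)")
# Step 2: turn each numbered paragraph into a list item, dropping the number.
LIST_ITEM = re.compile(r"(?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>")


def paragraphs_to_ol(html: str) -> str:
    """Apply both replacements in order (hypothetical helper name)."""
    wrapped = LIST_RUN.sub(r"<ol>\1</ol>", html)
    return LIST_ITEM.sub(r"<li>\1</li>", wrapped)


# Illustrative input, not taken from a real Word export.
sample = "<p>1. First</p>\n<p>2. Second</p>\n<p>Not a list</p>"
print(paragraphs_to_ol(sample))
```

Whatever follows the full stop, including any leading spaces or &nbsp; entities, is kept inside the <li>; stripping it would need an extra pattern that is not part of the regexes above.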
But again: be very very careful. When there's one little mistake in an opening or closing tag, you may very well end up with something that is far worse than what you've started with!
Not quite what you're asking for, but the HTML output from Microsoft Word has long been regarded by many as very poor, and many people have found themselves trying to clean it up. As a result, there are a good number of HTML-cleaning tools out there, and a quick search on Google suggests that the HTML Tidy Library Project, or others, may help you out. Don't reinvent the wheel unless you have to!
No, it is not feasible as a regular expression, because HTML is not a regular language.
Instead, take any HTML parser, find subsequent <p> nodes that are inside a common parent node and the contents of which begin with ordered numerals, and put them as <li> nodes into a new <ol> node.
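A rough sketch of that idea, using only Python's standard-library html.parser rather than a full DOM library. The names ListRebuilder and rebuild_lists are invented for this example. It assumes a flat run of sibling <p> elements whose text starts directly with the number (not wrapped in a <span>), treats "common parent" loosely by closing the list as soon as anything other than white space appears between paragraphs, and drops comments and declarations.

```python
import html
import re
from html.parser import HTMLParser

# Leading "12." plus any white space after it (\s also matches U+00A0).
NUMBERED = re.compile(r"^\s*\d+\.\s*")


class ListRebuilder(HTMLParser):
    """Rewrite runs of numbered <p> elements as <ol><li>...</li></ol>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # finished output fragments
        self.buf = None       # inner markup of the <p> being read, else None
        self.p_open = ""      # original start tag of that <p>
        self.in_list = False  # True while an <ol> is open in the output

    def _emit(self, markup):
        if self.buf is not None:          # inside a paragraph: collect
            self.buf.append(markup)
            return
        if self.in_list and markup.strip():
            self.out.append("</ol>")      # real content ends the list
            self.in_list = False
        self.out.append(markup)

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.buf is None:
            self.buf, self.p_open = [], self.get_starttag_text()
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.buf is not None:
            inner, self.buf = "".join(self.buf), None
            self._finish_paragraph(inner)
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives decoded, so escape it again before writing it out.
        self._emit(html.escape(data, quote=False))

    def _finish_paragraph(self, inner):
        match = NUMBERED.match(inner)
        if match:
            if not self.in_list:
                self.out.append("<ol>")
                self.in_list = True
            self.out.append(f"<li>{inner[match.end():].strip()}</li>")
        else:
            self._emit(f"{self.p_open}{inner}</p>")

    def result(self):
        self.close()
        if self.in_list:
            self.out.append("</ol>")
            self.in_list = False
        return "".join(self.out)


def rebuild_lists(markup: str) -> str:
    parser = ListRebuilder()
    parser.feed(markup)
    return parser.result()


print(rebuild_lists("<p>1.&nbsp;&nbsp; First</p>\n<p>2. Second</p>\n<p>Plain</p>"))
```

Because the parser decodes &nbsp; before the number check runs, the varying count of non-breaking spaces after the number stops mattering, which is the part that is awkward to express in a single regex.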
I am using this JavaScript, wrapped in a function, to clean up a loaded .doc file before putting it into a DIV. It's by no means a total solution. Improvements are welcome.
h = h.replace(/<[/]?(font|st1|shape|path|lock|imagedata|stroke|formulas|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>/gi, '')
h = h.replace(/<([^>]*)style="([^>"]*)"([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style='([^>']*)'([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style=([^> ]*) ([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style=([^>]*)>/gi, '<$1>')
h = h.replace(/<([^>]*)class="([^>"]*)"([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class='([^>']*)'([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class=([^> ]*) ([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class=([^>]*)>/gi, '<$1>')
I also found this VB solution on Tim Mackey's helpful blog:
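' At the top of the file:
Imports System.Text.RegularExpressions

Private Function CleanHtml(ByVal html As String) As String
    ' Strip Word-specific and presentational tags, keeping their contents.
    ' The \b stops short names such as "a" or "m" from matching longer tags like <abbr> or <main>.
    html = Regex.Replace(html, "<[/]?(font|link|m|a|st1|meta|object|style|span|xml|del|ins|[ovwxp]:\w+)\b[^>]*?>", "", RegexOptions.IgnoreCase)
    ' Strip presentational attributes. Each pass removes one attribute per tag, so run it twice.
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    ' Strip conditional comments and commented-out style blocks.
    html = customClean(html, "<!--[if", "<![endif]-->")
    html = customClean(html, "<!-- /*", "-->")
    Return html
End Function

Private Function customClean(ByVal html As String, ByVal begStr As String, ByVal endStr As String) As String
    Dim i As Integer = html.IndexOf(begStr)
    While i >= 0
        ' Search for the end marker only after the start marker.
        Dim j As Integer = html.IndexOf(endStr, i + begStr.Length)
        If j < 0 Then Exit While ' unterminated block: leave the rest untouched
        html = html.Remove(i, (j - i) + endStr.Length)
        i = html.IndexOf(begStr, i)
    End While
    Return html
End Function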
Hope this helps.
All those &nbsp;s have no effect. What you need is this:
/<p>( *[0-9]+.*?)<\/p>/<li>\1<\/li>/
12 years later, Word-HTML still uses lots of &nbsp;'s for formatting list-items. Worse, those &nbsp;'s tend to be specified incorrectly. Consequently, Word-HTML's lists often have incorrect and inconsistent indentation.
I recently wrote a Python program that fixes these problems in Word-HTML, for bulleted and ordered-lists. The program is part of the open-source system WordWebNav (WWN).
In Word-HTML, each list-item is an HTML paragraph (<p>). WWN fixes the Word-HTML lists by correcting those HTML paragraphs, e.g., it ensures the correct number of &nbsp;'s are used. This seemed simpler than replacing HTML paragraphs with HTML list-items (<li>), as proposed in the OP.
Most of the Word-HTML parsing is too complex for regex
WWN uses BeautifulSoup to do the bulk of the HTML parsing and editing. This avoids the known problems from using regex to parse HTML. Those regex problems are described in other answers to the OP, and here.
Fixing the Word-HTML lists involved researching Word-HTML files, to discover the various ways incorrect HTML is generated. BeautifulSoup was used to parse and fix the buggy Word-HTML. There's a lot of variation in the Word-HTML for lists, and parsing that HTML with regex's would be especially problematic. For example, the HTML paragraph-tags (<p>) can contain randomly-placed span-tags with a "lang" attribute:
<span lang=EN-GB>...</span>
WWN uses regex for some HTML parsing, but it's only for small subsets of the HTML, where there's little variation in the content.
The Word-HTML research-results, and the parsing-code are too complex to fully describe here. Highlights are described below.
Word-HTML's bugs, that cause mis-formatted lists
For list-items, Word-HTML uses &nbsp;'s to set the indentation before the list-symbol (e.g., number). &nbsp;'s are also used to set the spacing between the list-symbol and the start of the list-item's text. The number of &nbsp;'s used is often incorrect and inconsistent. The problem is the worst with multi-level lists. WordWebNav's docs show examples.
With ordered lists, another cause for mis-formatted list-items is using incorrect values for the style attribute "text-indent". This affects the spacing before the list-symbol.
Example Word-HTML, for mis-formatted list-items
A bulleted-list list-item, with lots of &nbsp;'s after the bullet-symbol ("·"):
<p class=MsoListParagraphCxSpFirst style='margin-left:.25in;text-indent:-.25in'><span
style='font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>
</span></span>This is the list-item's text.</p>
An ordered-list list-item, with &nbsp;'s before and after the list-symbol ("i."):
<p class=MsoListParagraphCxSpMiddle style='margin-left:1.5in;text-indent:-1.5in'><span
style='font:7.0pt "Times New Roman"'>
</span>i.<span style='font:7.0pt "Times New Roman"'>
</span>This is the list-item's text.</p>
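To give a feel for the kind of small, low-variation subset where a regex is manageable, here is a hypothetical Python sketch that collapses each run of &nbsp; entities to a fixed count. This is not WWN's code: the name collapse_nbsp_runs, the keep parameter and the sample markup are assumptions made for illustration, and choosing the right count per list level is the hard part that it leaves out.

```python
import re

# Two or more consecutive &nbsp; entities, allowing line breaks between them.
NBSP_RUN = re.compile(r"(?:&nbsp;\s*){2,}")


def collapse_nbsp_runs(markup: str, keep: int = 1) -> str:
    """Replace every run of &nbsp; entities with `keep` of them."""
    return NBSP_RUN.sub("&nbsp;" * keep, markup)


# Illustrative list-item, shaped like the ordered-list example above.
item = "<p>i.<span style='font:7.0pt'>&nbsp;&nbsp;&nbsp;\n&nbsp;&nbsp;</span>Text</p>"
print(collapse_nbsp_runs(item))
```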
Bullet symbols that don't display properly in Firefox
For bulleted-lists, there are two list symbols that don't display properly in Firefox. They are shown in the WordWebNav doc's examples, cited earlier.
Using BeautifulSoup to fix the Word-HTML bugs in lists
WWN has a program create_web_page.py, and it fixes the bugs in the Word-HTML lists. The program also fixes other bugs in Word-HTML, and it adds features to the Word-HTML, to make it a more usable web-page (e.g., a Navigation Pane is added).
The code in create_web_page.py is commented, and it explains the parsing and fixes for the HTML bugs. The code-sections that process lists are identified by block comments, e.g.,
'''
######################
Code Section: Fix the list-items in ordered-lists
######################
'''