Transforming HTML with XQuery
I want to take the HTML generated by a QTextEdit editor and transform it into something friendlier for use in an actual web page. Unfortunately, the HTML generator that is part of the QTextEdit API is not public and cannot be modified. I'd rather not write a WYSIWYG HTML editor from scratch when most of what I need is already built in.
In a short discussion on the qt-interest mailing list, someone mentioned using XQuery via the QtXmlPatterns module.
As an example of the ugly HTML the editor outputs, it uses <span style=" font-weight:600"> for bold text, <span style=" font-weight:600; text-decoration: underline"> for bold and underlined text, etc. Here's a sample:
<html>
<head>
</head>
<body style=" font-family:'Lucida Grande'; font-size:14pt; font-weight:400; font-style:normal;">
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-weight:600;">bold text</span></p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-weight:600;"></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-style:italic;">italics text</span></p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-style:italic;"></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" text-decoration: underline;">underline text</span></p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-weight:600; text-decoration: underline;">bold underline text</span></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-weight:600;">bold text </span><span style=" font-weight:600; text-decoration: underline;">bold underline text</span></p>
</body>
</html>
What I'd like to transform this into is something along the lines of this:
<body>
<p>plain text</p>
<p/>
<p>plain text <b>bold text</b></p>
<p/>
<p>plain text <em>italics text</em></p>
<p/>
<p>plain text <u>underline text</u></p>
<p/>
<p>plain text <b>bold text <u>bold underline text</u></b></p>
</body>
I've gotten around 90% of the way there. I can correctly transform the first four cases, where each <span> style contains only one of the italic, bold, or underline properties. I'm having trouble when the span style has multiple properties, for instance both font-weight:600 and text-decoration: underline.
Here's the XQuery code I have so far:
declare function local:process_span_data($node as node())
{
  for $n in $node
  return (
    for $attr in $n/@style
    return (
      if(contains($attr, 'font-weight:600')) then (
        <b>{data($n)}</b>
      )
      else if(contains($attr, 'text-decoration: underline')) then (
        <u>{data($n)}</u>
      )
      else if (contains($attr, 'font-style:italic')) then (
        <em>{data($n)}</em>
      )
      else (
        data($n)
      )
    )
  )
};
declare function local:process_p_data($data as node()+)
{
  for $d in $data
  return (
    if ($d instance of text()) then $d
    else local:process_span_data($d)
  )
};
let $doc := doc('myfile.html')
for $body in $doc/html/body
return
  <body>
  {
    for $p in $body/p
    return (
      if (contains($p/@style, '-qt-paragraph-type:empty;')) then (
        <p />
      )
      else (
        if (count($p/*) = 0) then (
          <p>{data($p)}</p>
        )
        else (
          <p>
          {for $data in $p/node()
           return local:process_p_data($data)}
          </p>
        )
      )
    )
  }</body>
This gives ALMOST the correct result:
<body>
<p>plain text</p>
<p/>
<p>plain text <b>bold text</b>
</p>
<p/>
<p>plain text <em>italics text</em>
</p>
<p/>
<p>plain text <u>underline text</u>
</p>
<p/>
<p>plain text <b>bold underline text</b>
</p>
<p>plain text <b>bold text </b>
<b>bold underline text</b> <!-- NOT UNDERLINED!! -->
</p>
</body>
Can anyone point me in the right direction of achieving my desired output? Thanks in advance from an XQuery n00b!
Your approach is correct, but the transformation logic doesn't quite follow the functional style that XQuery favors.
Take a look at this:
xquery version '1.0-ml';
declare namespace mittai = "mittai";
declare function mittai:parse-thru($n as node())
{
  for $z in $n/node()
  return mittai:dispatch($z)
};
declare function mittai:dispatch($n as node())
{
  typeswitch($n)
    case text() return $n
    case element(p) return element{ fn:node-name($n) } {mittai:parse-thru($n)}
    case element(span) return element{ fn:node-name($n) } {mittai:parse-thru($n)}
    case element(body) return element{ fn:node-name($n) } {mittai:parse-thru($n)}
    default return element{ fn:node-name($n) } {$n/@*, mittai:parse-thru($n)}
};
let $d := doc('myfile.html')
return <html> {mittai:parse-thru($d)} </html>
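The typeswitch above rebuilds each span as a plain span, so the bold/underline information in the style attribute is not yet turned into elements. Below is a minimal sketch of how a span case might do that, assuming plain XQuery 1.0 rather than the MarkLogic dialect. The names local:wrap, local:span and local:dispatch are hypothetical helpers, not part of any library, and the file name is taken from the question:

```xquery
xquery version "1.0";

(: Hypothetical helper: wrap $content in one element per name in $names,
   outermost name first. With no names left, return the content as-is. :)
declare function local:wrap($names as xs:string*, $content as item()*) as item()*
{
  if (empty($names))
  then $content
  else element { $names[1] } { local:wrap(subsequence($names, 2), $content) }
};

(: Hypothetical helper: map a span to nested b/em/u elements,
   collecting every matching property instead of stopping at the first. :)
declare function local:span($span as element(span)) as item()*
{
  (: Remove all whitespace so "text-decoration: underline" and
     "text-decoration:underline" are treated the same. :)
  let $style := replace(string($span/@style), '\s+', '')
  let $names := (
    if (contains($style, 'font-weight:600')) then 'b' else (),
    if (contains($style, 'font-style:italic')) then 'em' else (),
    if (contains($style, 'text-decoration:underline')) then 'u' else ()
  )
  return local:wrap($names, for $c in $span/node() return local:dispatch($c))
};

(: Recursive dispatcher: text is copied, spans are mapped, other elements
   are rebuilt without attributes, anything else is dropped. :)
declare function local:dispatch($n as node()) as item()*
{
  typeswitch ($n)
    case text() return $n
    case element(span) return local:span($n)
    case element() return
      element { node-name($n) } { for $c in $n/node() return local:dispatch($c) }
    default return ()
};

for $body in doc('myfile.html')/html/body
return <body>{ for $c in $body/node() return local:dispatch($c) }</body>
```

With the sample input, a span carrying both font-weight:600 and text-decoration: underline should come out as <b><u>bold underline text</u></b>. Adjacent runs are not merged, so the last paragraph would yield two separate <b> elements rather than the single nested one in the desired output.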
This XQuery (using the common identity function):
declare variable $Prop as element()* :=
  (<prop name="em">font-style:italic</prop>,
   <prop name="strong">font-weight:600</prop>,
   <prop name="u">text-decoration:underline</prop>);
declare function local:copy($element as element()) {
  element {node-name($element)}
    {$element/@*,
     for $child in $element/node()
     return if ($child instance of element())
            then local:match($child)
            else $child
    }
};
declare function local:match($element as element()) {
  if ($element/self::span[@style])
  then local:replace($element)
  else local:copy($element)
};
declare function local:replace($element as element()) {
  let $prop := local:parse($element/@style)
  let $no-match := $prop[not(.=$Prop)]
  return element {node-name($element)}
    {$element/@* except $element/@style,
     if (exists($no-match))
     then attribute style
       {string-join($no-match,';')}
     else (),
     local:nested($Prop[.=$prop]/@name,$element)}
};
declare function local:parse($string as xs:string) {
  for $property in tokenize($string,';')[.]
  return
    <prop>{
      replace(normalize-space($property),'( )?:( )?',':')
    }</prop>
};
declare function local:nested($names as xs:string*,
                              $element as element()) {
  if (exists($names))
  then element {$names[1]}
    {local:nested($names[position()>1],$element)}
  else for $child in $element/node()
       return if ($child instance of element())
              then local:match($child)
              else $child
};
local:match(*)
Output:
<html>
<head> </head>
<body style=" font-family:'Lucida Grande'; font-size:14pt; font-weight:400; font-style:normal;">
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"/>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text
<span>
<strong>bold text</strong>
</span>
</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-weight:600;"/>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text
<span>
<em>italics text</em>
</span>
</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-style:italic;"/>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text
<span>
<u>underline text</u>
</span>
</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"/>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text
<span>
<strong>
<u>bold underline text</u>
</strong>
</span>
</p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text
<span>
<strong>bold text </strong>
</span>
<span>
<strong>
<u>bold underline text</u>
</strong>
</span>
</p>
</body>
</html>
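The result above still wraps each run in an attribute-less span and keeps the Qt style attributes on body and p. As a rough sketch, a second pass along these lines could strip both. Here local:clean is a hypothetical name, and 'output.html' is an assumed file holding the result above:

```xquery
xquery version "1.0";

(: Hypothetical clean-up pass:
   - a span with no attributes is replaced by its (cleaned) children;
   - a span that still has attributes is kept, attributes included;
   - any other element is rebuilt without its style attribute;
   - text, comments and other nodes are returned unchanged. :)
declare function local:clean($n as node()) as item()*
{
  typeswitch ($n)
    case element(span) return
      if (exists($n/@*))
      then element span { $n/@*, for $c in $n/node() return local:clean($c) }
      else for $c in $n/node() return local:clean($c)
    case element() return
      element { node-name($n) }
              { $n/@* except $n/@style,
                for $c in $n/node() return local:clean($c) }
    default return $n
};

(: 'output.html' is an assumed file name, not something from the answer. :)
local:clean(doc('output.html')/*)
```

If local:clean were declared in the same module as the functions above, the final expression could presumably become local:clean(local:match(*)) so that no intermediate file is needed. The result keeps strong rather than b, so it is closer to, not identical with, the desired output in the question.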