Removing styling from HTML

I have a database full of product descriptions riddled with horrible computer-generated HTML and littered with styling information: style attributes, font tags, background attributes...

I have to redesign the website, but first I need to remove all the styling from the product descriptions. Before someone suggests doing it manually: there are 100,000 products. I am thinking some creative regexes in PHP might do the trick.

Ideally I would like to remove all HTML and just have plain text, but the descriptions contain tables and tables of tables... so that would just end in tears.

Looking forward to your creative solutions :)

EDIT-

On second thoughts, I could also do it in VBA, as I can export the descriptions to an Excel sheet. So PHP or VBA solutions would be great.

EDIT-

    <div class="XXXX-template-06">
          <table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="694" id="AutoNumber1">
            <tbody><tr>
              <td width="516" height="18" bgcolor="#999966" align="center">
              <p align="center"><font face="Verdana" color="#FFFFFF"><b>Mont Blanc Scott Roof mounted cycle bike carrier<br>
              <br>
              Part Number: 728540</b></font></p></td>
              <td width="178" height="18" bgcolor="#999966" align="center">
              <a href="/shippingcalculator.html?SKU=728540" target="_blank"><img border="0" src="http://images.ZZZZpro.com/2145/" width="88" height="33"></a></td>
            </tr>
            <tr>
              <td width="694" height="57" bgcolor="#CCCC99" align="center" colspan="2">
              <b><font face="Verdana" size="2" class="CustomStyle-CycleCarrier">
    <script type="text/javascript">
    <!--function click() { if (event.button==2) { alert('All graphics, descriptions and other information, including the HTML code of this listing are the property of XXXX Limited and may not be reproduced in any form without the express permission of XXXX Limited. Email us: sales@XXXX.com'); } } document.onmousedown=click // -->
    <!---->
    <!---->
    <!---->
    <!---->
    <!---->
    <!---->
    <!---->
    <!---->
    <!---->
    <!---->
    <!---->
    <!----> -->
    </script>


    <div align="center">
      <center>
        <table height="336" background="http://images.ZZZZpro.com/2145/I/21/fade1.jpg" width="680" border="0">
          <tbody><tr>
            <td height="49" width="136"><p align="center"><img height="62" src="http://XXXXbiz.ipage.com/XXXX/Images/Mont%20Blanc/montblanc.jpg" width="165" border="0"></p></td>
            <td height="49" width="378"><p align="center"><font face="Verdana" color="#0000ff" size="5"><u><strong>Mont Blanc </strong></u></font><u><strong><font face="Verdana" color="#0000FF" size="5">Scott Roof Bar Rack 1 Cycle Carrier</font></strong></u></p></td>
            <td height="49" width="146"><img height="69" src="http://images.ZZZZpro.com/2145/I/20/logomed.gif" width="174" border="0"></td>
          </tr>
          <tr>
            <td height="241" colspan="3" width="672"><hr><p align="center"><img height="223" src="http://XXXXbiz.ipage.com/XXXX/Images/Mont%20Blanc/scottlrg.jpg" width="237" border="0"></p><p><font color="black"><b>Scott</b> </font></p><ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive oval carrying bar.<br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></li><li>Extra wide wheel holders take the fattest tyres<br></li><li>Strong Webbing straps fasten wheels securely to carrier<br></li><li><font size="3" color="black">Upright, roof bar mounted, locking cycle carrier<br></font></li><li><font size="3" color="black">&nbsp;Locks to roof rails and locks bikes<br></font></li><li><font size="3" color="black">&nbsp;Quick and easy to use<br></font></li><li><font size="3" color="black">Adjustable for most cycle styles</font></li></ul><center><table cellspacing="0" width="100%" cellpadding="20" border="0" height="1" class="featuretable">
                  <tbody><tr>
                    <td height="55" class="featuretd" width="110"><p align="center"><a target="_blank" href="http://www.montblancuk.co.uk/support/inst/scott.pdf"><img width="20" alt="Open document" src="http://espimages.biz/2145/I/20/mount_link.gif" border="0" height="20"></a></p></td>
                    <td height="55" class="featuretd">To view Fitting Instructions in PDF format please click the spanner</td>
                  </tr>
                </tbody></table>
                <table height="317">
                  <tbody><tr class="technicaltr" valign="top">
                    <td height="1" class="technicalfirstcolumn"><font class="technicalheader">Technical data</font></td>
                    <td height="1" class="technicalsecondcolumn"><p><font class="heading1">Mont </font>Blanc Scott</p><p align="center"><img height="107" src="http://XXXXbiz.ipage.com/XXXX/Images/Mont%20Blanc/scottfaint.jpg" width="127" border="0"></p></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Max number of bikes</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>1</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="18" class="technicalfirstcolumn"><div>Load capacity (kg)</div></td>
                    <td height="18" class="technicalsecondcolumn"><div>15 KG</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Weight (kg)</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>2.2KG</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Fits frame-dimensions (mm)</div></td>
                    <td height="21" class="technicalsecondcolumn">Up to 80mm</td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Fits wheel-dimensions</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>All</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Locks bikes to carrier</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>Yes</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Locks carrier to car</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>Yes</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Tilt function, with bikes</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>NA</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>TÜV/EuroBE approved</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>NA</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Fullfills City Crash norms</div></td>
                    <td height="21" class="technicalsecondcolumn"><div>NA</div></td>
                  </tr>
                  <tr class="technicaltr" valign="top">
                    <td height="21" class="technicalfirstcolumn"><div>Miscellaneous</div></td>
                    <td height="21" class="technicalsecondcolumn"><div><p>Fits all types of Roof Bars,</p></div></td>
                  </tr>
                </tbody></table>
                <p align="center">
                  <font size="2" face="Verdana">The cycle carrier is 
                  guaranteed for Five year from date of purchase.                  
<br>                  
<br>We stock a wide range of towbars and towing accessories.                   
<a href="mailto:sales@XXXX.com?subject=Witter ZX88 Cycle Carrier"><br>Click 
                  here to email us</a> if you require details of our other 
                  towing equipment.</font>
                </p>


<hr>                
              </center>

            </td>

          </tr>
        </tbody></table>
      </center>

    </div>

  <br>
              Please note that with the Type of cycle carrier where you mount it
              <br>
              onto a flange ball you may need the long reach ball which will <br>
              allow you enough clearance from the bumper</font></b></td>
            </tr>
            <tr>
              <td width="694" height="57" bgcolor="#CCCC99" align="center" colspan="2">
              <a href="http://www.XXXXeuro.ZZZZprostorefront.co.uk/products/728540-mont-blanc-scott-roof-mounted-cycle-bike-carrier-728540.html" target="_blank"><img border="0" src="http://images.ZZZZpro.com/2145/" width="55" height="40"></a>
              <b><font face="Verdana" size="2">Not from the UK ? Click the flag
              to purchase this item from our EU site </font></b><a href="http://www.XXXXeuro.ZZZZprostorefront.co.uk/products/728540-mont-blanc-scott-roof-mounted-cycle-bike-carrier-728540.html" target="_blank"><img border="0" src="http://images.ZZZZpro.com/2145/" width="57" height="40"></a></td>
            </tr>
          </tbody></table>
</div>

EDIT-

Looking through it I think I need to get rid of the following:

Attributes: style, bgcolor, background

Tags: font


I would recommend using XSLT to strip off all unwanted content. A simple identity template would be a good starting point.


What about PHP's strip_tags function?

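The annoying part is that you have to list every tag you want to preserve in the second argument (a string such as '<table><tr><td>', or an array as of PHP 7.4), but you only have to write it once. Note that strip_tags only removes tags; it leaves the attributes of the tags it keeps untouched.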
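As a minimal sketch of that idea, the snippet below keeps a small set of structural tags and drops everything else. The whitelist in `$allowed` and the sample string are assumptions made for illustration, not something taken from the real data.

```php
<?php
// Tags to keep for structure; every other tag (font, b, center, a, ...) is dropped.
// This whitelist is an assumption for illustration; adjust it to your data.
$allowed = '<table><tbody><tr><td><th><ul><ol><li><p><br><div>';

$html = '<td bgcolor="#999966"><font face="Verdana"><b>Part Number: 728540</b></font></td>';

// strip_tags() removes the tags that are not listed but keeps their inner text.
// Attributes on the tags it keeps (bgcolor here) are left alone.
$clean = strip_tags($html, $allowed);

echo $clean, PHP_EOL;
```

The bgcolor attribute survives on the kept td, so this only covers the "Tags: font" half of the job; the attributes still need a separate pass.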

For removing tag attributes such as bgcolor, somebody wrote a function here that could be worth a look, but mind the dodgy double quotes on that page. There's a link at the bottom to download the code without the WordPress formatting.
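If you would rather not depend on a third-party function, PHP's bundled DOM extension can do the same job. The following is only a sketch: `stripStyling`, the `strip-root` wrapper id and the sample input are made up for illustration, and it assumes each description is a UTF-8 fragment that libxml can parse (badly broken markup may come out rearranged).

```php
<?php
// Hypothetical helper: removes presentational attributes and unwraps <font> tags.
function stripStyling(string $html): string
{
    $dropAttributes = ['style', 'bgcolor', 'background'];

    $doc = new DOMDocument();
    // Legacy markup triggers many parser warnings, so collect them silently.
    $previous = libxml_use_internal_errors(true);
    // The wrapper div lets us find our own content again after libxml adds
    // <html>/<body>; the XML prolog hints that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="strip-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove the unwanted attributes wherever they occur.
    foreach ($dropAttributes as $name) {
        foreach (iterator_to_array($xpath->query('//@' . $name)) as $attr) {
            $attr->ownerElement->removeAttribute($name);
        }
    }

    // Replace each <font> element with its children, keeping the text.
    foreach (iterator_to_array($xpath->query('//font')) as $font) {
        while ($font->firstChild) {
            $font->parentNode->insertBefore($font->firstChild, $font);
        }
        $font->parentNode->removeChild($font);
    }

    // Serialise only the children of the wrapper element.
    $root = $xpath->query('//div[@id="strip-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripStyling('<p style="color:red"><font face="Verdana" color="#FFFFFF"><b>Part Number: 728540</b></font></p>');
```

Because this works on a parsed tree rather than on raw text, nested font tags and attributes in any order are handled without extra cases. Other attributes from the sample (width, height, align, border) could be added to `$dropAttributes` in the same way.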


Building on @Paul's idea, here is an example in Excel (VBA). It is very rough and will need to be modified depending on how you store your HTML in Excel, but hopefully it will get you started.

This example assumes a few things:

  1. You have first installed the TidyATL COM object (click the link that says 'wrapper'; you can register it on 64-bit Win 7 by first copying the DLL into C:\Windows\SysWOW64 and running regsvr32 C:\Windows\SysWOW64\TidyATL.dll).

  2. Your Excel Project has references to Microsoft XML 6.0, and Tidy 1.0 Type Library

  3. Your HTML is stored in Cell A1 of Sheet 1. Results are put into Cell B1. You can easily extend this idea to iterate through all used cells in a column and process all the HTML at once.

  4. I have zero experience writing XSLT. I ripped the 'identity template' directly from here. I had never used XSLT before today; so maybe someone who knows it can edit the XSLT to strip out the <font> nodes. This example just strips out all of the attributes.

This uses Tidy HTML to convert your ugly HTML into XHTML, then applies an XSLT template to the result.

EDIT: sorry, screwed up the "match" attribute in the XSLT. Was: match='@*|node()' should be: match='node()'

Here's the code I used:

Sub TidyUp()

    Dim t As TidyATL.TidyDocument

    Dim sXSLT As String

    sXSLT = "<?xml version='1.0' encoding='ISO-8859-1'?>" & _
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" & _
        "<xsl:template match='node()'>" & _
        "  <xsl:copy>" & _
        "    <xsl:apply-templates select='node()'/>" & _
        "  </xsl:copy>" & _
        "</xsl:template>" & _
        "</xsl:stylesheet>"

    Set t = New TidyATL.TidyDocument
    t.ParseString Sheet1.Range("A1").Value
    t.SetOptBool TidyXmlOut, True
    t.SetOptBool TidyXhtmlOut, True
    t.SetOptBool TidyNumEntities, True
    t.SetOptBool TidyXmlDecl, True

    t.CleanAndRepair


    Dim x As MSXML2.DOMDocument
    Dim x2 As MSXML2.FreeThreadedDOMDocument
    Dim xe As MSXML2.IXMLDOMParseError
    Set x = New MSXML2.DOMDocument
    Set x2 = New MSXML2.FreeThreadedDOMDocument

    'Load XHTML into a DOM
    x.LoadXML t.SaveString
    Set xe = x.parseError
    If xe.ErrorCode <> 0 Then
        MsgBox "Err: " & xe.reason
        Exit Sub  'leave this macro only; a bare End would reset the whole VBA project
    End If

    'Load XSLT into a DOM
    x2.LoadXML sXSLT
    Set xe = x2.parseError
    If xe.ErrorCode <> 0 Then
        MsgBox "Err: " & xe.reason
        Exit Sub
    End If


    Dim xt As XSLTemplate
    Set xt = New XSLTemplate
    Set xt.stylesheet = x2

    Dim xp As IXSLProcessor

    Set xp = xt.createProcessor
    xp.input = x
    xp.transform

    Sheet1.Range("B1").Value = xp.output
End Sub

Here's the result (still ugly but with no attributes):

<?xml version="1.0" encoding="UTF-16"?><html xmlns="http://www.w3.org/1999/xhtml"><head><meta></meta><title></title></head><body><div><table><tbody><tr><td><p><font><b>Mont
Blanc Scott Roof mounted cycle bike carrier<br></br><br></br>
 Part Number: 728540</b></font></p></td><td><a><img></img></a></td></tr><tr><td><b><font><script>
//
    &lt;!--function click() { if (event.button==2) { alert('All graphics, descriptions and other information, including the HTML code of this listing are the property of XXXX Limited and may not be reproduced in any form without the express permission of XXXX Limited. Email us: sales@XXXX.com'); } } document.onmousedown=click // --&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt;
    &lt;!----&gt; --&gt;
//</script></font></b><div><center><table><tbody><tr><td><p><img></img></p></td><td><p><font><u><strong>Mont Blanc</strong></u></font><u><strong><font>Scott Roof
Bar Rack 1 Cycle Carrier</font></strong></u></p></td><td><img></img></td></tr><tr><td><hr></hr><p><img></img></p><p><font><b>Scott</b></font></p><ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive
oval carrying bar.<br></br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></br></li><li>Extra wide wheel holders take the fattest tyres<br></br></li><li>Strong Webbing straps fasten wheels securely to
carrier<br></br></li><li><font>Upright, roof bar mounted, locking
cycle carrier<br></br></font></li><li><font> Locks to roof rails and
locks bikes<br></br></font></li><li><font> Quick and easy to
use<br></br></font></li><li><font>Adjustable for most cycle
styles</font></li></ul><center><table><tbody><tr><td><p><a><img></img></a></p></td><td>To view Fitting Instructions in
PDF format please click the spanner</td></tr></tbody></table><table><tbody><tr><td><font>Technical data</font></td><td><p><font>Mont</font> Blanc Scott</p><p><img></img></p></td></tr><tr><td><div>Max number of bikes</div></td><td><div>1</div></td></tr><tr><td><div>Load capacity (kg)</div></td><td><div>15 KG</div></td></tr><tr><td><div>Weight (kg)</div></td><td><div>2.2KG</div></td></tr><tr><td><div>Fits frame-dimensions (mm)</div></td><td>Up to 80mm</td></tr><tr><td><div>Fits wheel-dimensions</div></td><td><div>All</div></td></tr><tr><td><div>Locks bikes to carrier</div></td><td><div>Yes</div></td></tr><tr><td><div>Locks carrier to car</div></td><td><div>Yes</div></td></tr><tr><td><div>Tilt function, with bikes</div></td><td><div>NA</div></td></tr><tr><td><div>TÜV/EuroBE approved</div></td><td><div>NA</div></td></tr><tr><td><div>Fullfills City Crash norms</div></td><td><div>NA</div></td></tr><tr><td><div>Miscellaneous</div></td><td><div><p>Fits all types of Roof Bars,</p></div></td></tr></tbody></table><p><font>The cycle carrier
is guaranteed for Five year from date of purchase.<br></br><br></br>
 We stock a wide range of towbars and towing accessories.
<a><br></br>
Click here to email us</a> if you require details of our other
towing equipment.</font></p><hr></hr></center></td></tr></tbody></table></center></div><b><br></br>
 Please note that with the Type of cycle carrier where you mount
it<br></br>
 onto a flange ball you may need the long reach ball which
will<br></br>
 allow you enough clearance from the bumper</b></td></tr><tr><td><a><img></img></a><b><font>Not from the UK ? Click
   the flag to purchase this item from our EU site</font></b><a><img></img></a></td></tr></tbody></table></div></body></html>

EDIT: This XSLT seems to do the trick. It removes some tags together with their content and others while keeping their content, whichever you specify. Again, maybe someone with some XSLT knowledge can elaborate.

<?xml version='1.0' encoding='ISO-8859-1'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml="http://www.w3.org/1999/xhtml" >

<xsl:template match='node()|@*'>
  <xsl:copy>
    <xsl:apply-templates select='node()'/>
  </xsl:copy>
</xsl:template>

<!--these tags will be removed with their content-->
<xsl:template match='xhtml:script|xhtml:head'/>

<!--these tags will be removed but keep their content-->
<xsl:template match='xhtml:font|xhtml:p|xhtml:b|xhtml:u|xhtml:i|xhtml:center|xhtml:a|xhtml:img|xhtml:strong'><xsl:apply-templates/></xsl:template>
</xsl:stylesheet>

Result:

<?xml version="1.0" encoding="UTF-16"?><html xmlns="http://www.w3.org/1999/xhtml"><body><div><table><tbody><tr><td>Mont
Blanc Scott Roof mounted cycle bike carrier<br></br><br></br>
 Part Number: 728540</td><td></td></tr><tr><td><div><table><tbody><tr><td></td><td>Mont BlancScott Roof
Bar Rack 1 Cycle Carrier</td><td></td></tr><tr><td><hr></hr>Scott<ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive
oval carrying bar.<br></br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></br></li><li>Extra wide wheel holders take the fattest tyres<br></br></li><li>Strong Webbing straps fasten wheels securely to
carrier<br></br></li><li>Upright, roof bar mounted, locking
cycle carrier<br></br></li><li> Locks to roof rails and
locks bikes<br></br></li><li> Quick and easy to
use<br></br></li><li>Adjustable for most cycle
styles</li></ul><table><tbody><tr><td></td><td>To view Fitting Instructions in
PDF format please click the spanner</td></tr></tbody></table><table><tbody><tr><td>Technical data</td><td>Mont Blanc Scott</td></tr><tr><td><div>Max number of bikes</div></td><td><div>1</div></td></tr><tr><td><div>Load capacity (kg)</div></td><td><div>15 KG</div></td></tr><tr><td><div>Weight (kg)</div></td><td><div>2.2KG</div></td></tr><tr><td><div>Fits frame-dimensions (mm)</div></td><td>Up to 80mm</td></tr><tr><td><div>Fits wheel-dimensions</div></td><td><div>All</div></td></tr><tr><td><div>Locks bikes to carrier</div></td><td><div>Yes</div></td></tr><tr><td><div>Locks carrier to car</div></td><td><div>Yes</div></td></tr><tr><td><div>Tilt function, with bikes</div></td><td><div>NA</div></td></tr><tr><td><div>TÜV/EuroBE approved</div></td><td><div>NA</div></td></tr><tr><td><div>Fullfills City Crash norms</div></td><td><div>NA</div></td></tr><tr><td><div>Miscellaneous</div></td><td><div>Fits all types of Roof Bars,</div></td></tr></tbody></table>The cycle carrier
is guaranteed for Five year from date of purchase.<br></br><br></br>
 We stock a wide range of towbars and towing accessories.
<br></br>
Click here to email us if you require details of our other
towing equipment.<hr></hr></td></tr></tbody></table></div><br></br>
 Please note that with the Type of cycle carrier where you mount
it<br></br>
 onto a flange ball you may need the long reach ball which
will<br></br>
 allow you enough clearance from the bumper</td></tr><tr><td>Not from the UK ? Click
   the flag to purchase this item from our EU site</td></tr></tbody></table></div></body></html>


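This regex should give you the expected results, but I haven't tested it. It removes a double-quoted style attribute from a tag; [^>] and [^"] keep the match inside a single tag and a single attribute value, and the back-references are written $1 and $2:

preg_replace('/(<[^>]*?)\s+style="[^"]*"([^>]*>)/i', '$1$2', $yourhtml);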
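The same approach can be stretched to the three attributes listed in the question. This is a rough sketch rather than a tested solution for the real data: `stripAttributesRegex` is a made-up name, and the pattern assumes attribute values never contain a `>` character.

```php
<?php
// Hypothetical helper: strips style, bgcolor and background attributes with a regex.
function stripAttributesRegex(string $html): string
{
    // Matches one unwanted attribute (double-quoted, single-quoted or unquoted value)
    // that follows the start of an opening tag.
    $pattern = '/(<[a-z][^>]*?)\s+(?:style|bgcolor|background)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';

    // Each pass removes at most one attribute per tag, so repeat until nothing changes.
    do {
        $html = preg_replace($pattern, '$1', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

echo stripAttributesRegex('<td width="516" bgcolor="#999966" style="color:red" align="center">x</td>');
```

The loop matters because a single pass stops after the first unwanted attribute in each tag; a td carrying both bgcolor and style needs two passes.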


I think that the regex needed could be much simpler than you are imagining, but then again, I don't know what the product descriptions are like. What are the chances of encountering < and > in the descriptions, aside from as part of HTML tags? If the chances are very small, could something like this not do the trick?

$new_description = preg_replace('/<[^>]+>/', '', $description);