开发者

Minimize html, doubts and questions

Minimizing html is the only section on Google's Page Speed where there is still room for improvement.

My site is all dynamic and the HTML is already Deflated so there is no reason to put any more pressure on the server (I d开发者_高级运维on't want to minimize pages real time before sending).

What I could do was to minimize the template files. My templates files are a mix of PHP and HTML so I've come up with some code that I think is pretty safe but would like to be community revised.

// this will loop trough all template files
// php is cleaned first so that line-comments will not interfere with the regex
$original = file_get_contents($dir.'/'.$file);
$php_clean = php_strip_whitespace($dir.'/'.$file);
$minimized = preg_replace('/\s+/', ' ', $php_clean);

This will make my template files as a single very long file alternated with some places where DB content is inserted. Google's homepage source looks more or less like what I get so I wonder if they follow a similar approach.

Question 1: Do you antecipate potencial problems?

Question 2: Is there anyway better (more efficient to do this)?

And please remember that I'm not trying to validate HTML as the templates are not valid HTML (header and footer are includes, for example).

Edit: Do take into consideration that the template files will be minimized on deploy. As CSS and Javascript files are minimized and compressed using YUI Compressure and Closure, the template files would be minimized like-wise, on deploy. Not on client-request.

Thank you.


Google's own Closure Templates (Soy) strips whitespace at the end of the line by default, and the template designer explicitly inserts a space using {sp}. This probably isn't a good enough reason to switch away from PHP, but I just wanted to bring it to your attention.

In addition, realize that HTML 4 allows you to exclude some tags, as recommended by the Page Speed documentation on minifying HTML (http://code.google.com/p/page-speed/wiki/MinifyHtml). You can exclude </p>, </td>, </tr>, etc. For a complete list of elements for which you can omit the end tag, search for "- O" in the HTML 4 DTD (http://www.w3.org/TR/REC-html40/sgml/dtd.html). You can even omit the <html>, <head>, <body>, and <tbody> tags entirely, as both start and end tags are optional ("O O" in the DTD).

You can also omit the quotes around attributes (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2) such as id, class (with a single class name), and type that have simple content (i.e., matches /^[-A-Za-z0-9._:]+$/). For attributes that have a single possible value, you can exclude the value (e.g., say simply checked rather than checked=checked).

Some people may find these tips repulsive because we've been conditioned for so many years to prepare for the upcoming world of simple LALR parsers for XHTML. Thus, tools like Dave Raggett's HTML Tidy generate HTML with proper closing tags and quotes around attribute values. But let's face it, all the browsers already have parsers that understand HTML 4, any new browser will use the HTML 5 parser rather than XHTML, and we should get comfortable writing HTML that is optimized for size.

That being said, besides a couple large companies like Google and Facebook, my guess is that page size is a negligible component of latency, so if you're optimizing your own site it's probably because of your own obsessive tendencies rather than performance.


White space can be significant (e.g. in pre elements).

When I had a particularly large page (i.e. large enough that there was a benefit in minifying the HTML) I used HTML Tidy and cached the results.

tidy -c -n -omit -ashtml -utf8 --doctype strict \
    --drop-proprietary-attributes yes --output-bom no \
    --wrap 0


I think you'll end up running into issues with load time with this approach, as the get contents, strip whitespace, and preg replace calls are going to take a lot longer to do than whatever bandwidth the minified HTML is saving you.


I've been running tests on all my sites for a couple of weeks and I can say that this method is pretty consistent. It will only affect template content, so there is little risk of messing up with unknown <pre> or similar.

It is run before deploy so there is no impact on server - actually there should be a little speed up as the file becomes smaller.

Do remember that all content that comes from the database will not suffer any influence as, like said before, this runs before deploy and on template files only.

The method seams solid enough to pass it into production.

If anything goes wrong I'll post it here.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜