Compress (minify) HTML from Python
How can I compress (minify) HTML from Python? I know I can use some regexes to strip spaces and other things, but I want a real compressor written in pure Python (so it can be used on Google App Engine).
I ran a test with an online HTML compressor and it saved 65% of the HTML size. I want that, but from Python.
You can use htmlmin to minify your html:
import htmlmin
html = """
<!DOCTYPE html>
<html lang="en">
<head>
<title>Bootstrap Case</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
</head>
<body>
<div class="container">
<h2>Well</h2>
<div class="well">Basic Well</div>
</div>
</body>
</html>
"""
# On Python 3 html is already a text string; on Python 2, pass html.decode("utf-8") instead.
minified = htmlmin.minify(html, remove_empty_space=True)
print(minified)
htmlmin and html_slimmer are simple HTML minifying tools for Python. I have millions of HTML pages stored in my database, and by running htmlmin I am able to reduce the page size by between 5 and 50%. Neither of them does an optimal job of complete HTML minification (e.g. the font color #000000 can be reduced to #000), but it's a good start. I have a try/except block that runs htmlmin and, if that fails, falls back to html_slimmer, because htmlmin seems to provide better compression but does not support non-ASCII characters.
Example Code:
import htmlmin
from slimmer import html_slimmer  # or xhtml_slimmer, css_slimmer

try:
    html = htmlmin.minify(html, remove_comments=True, remove_empty_space=True)
except Exception:
    # Fall back to slimmer if htmlmin fails (e.g. on non-ASCII input).
    html = html_slimmer(html.strip().replace('\n', ' ').replace('\t', ' ').replace('\r', ' '))
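The color shortening mentioned above is something you could bolt on yourself. Here is a minimal sketch using only the standard library; the function name shorten_hex_colors is my own and is not part of htmlmin or slimmer. Note that it is a plain text substitution, so it would also rewrite a matching string that is not a color (for example a URL fragment like #aabbcc), so treat it as a starting point only.

```python
import re

# A 6-digit hex color whose digits come in equal pairs (e.g. #ffcc00),
# not followed by more hex digits (so 8-digit #rrggbbaa values are left alone).
HEX6 = re.compile(
    r"#([0-9a-fA-F])\1([0-9a-fA-F])\2([0-9a-fA-F])\3(?![0-9a-fA-F])"
)


def shorten_hex_colors(text):
    """Rewrite #aabbcc as #abc wherever each pair of digits is identical."""
    return HEX6.sub(r"#\1\2\3", text)


print(shorten_hex_colors('<font color="#000000">Hi</font>'))
```

Colors such as #123456, where the pairs differ, are left unchanged.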
Good Luck!
I suppose that on GAE there is no real need to minify your HTML, as GAE already gzips it: Caching & GZip on GAE (Community Wiki)
I did not test it, but the minified version of the HTML will probably gain only about 1% in size once both versions are compressed, since minification only removes whitespace.
If you want to save storage, for example when putting the page in memcache, you are better off gzipping it (even at a low compression level) than removing whitespace: the result will probably be smaller, and it will be faster because the compression runs in C instead of pure Python.
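If you would rather measure this than guess, here is a rough sketch that compares the sizes. It uses zlib from the standard library (the same DEFLATE algorithm that gzip uses); the sample page and the naive collapse_whitespace helper are made up for illustration, so the numbers for a real page will differ.

```python
import re
import zlib


def collapse_whitespace(html):
    """Naive minification: collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", html).strip()


def compressed_size(text, level=6):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), level))


# Hypothetical sample markup, repeated to get a page-sized input.
page = """
<div class="container">
    <h2>Well</h2>
    <div class="well">Basic Well</div>
</div>
""" * 50

minified = collapse_whitespace(page)

sizes = {
    "raw": len(page.encode("utf-8")),
    "minified": len(minified.encode("utf-8")),
    "raw + zlib": compressed_size(page),
    "minified + zlib": compressed_size(minified),
}
for label, size in sizes.items():
    print(label, size)
```

Run it against one of your own pages to see how much minification still saves after compression.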
import htmlmin
code='''<body>
Hello World
<div style='color:red;'>Hi</div>
</body>
'''
htmlmin.minify(code)
Output of the last line:
<body> Hello World <div style=color:red;>Hi</div> </body>
You can use this code to remove the spaces:
htmlmin.minify(code,remove_empty_space=True)
I wrote a build script that duplicates my templates into another directory and then I use this trick to tell my application to select the correct template in development mode, or in production:
DEV = os.environ['SERVER_SOFTWARE'].startswith('Development') and not PRODUCTION_MODE
TEMPLATE_DIR = 'templates/2012/head/' if DEV else 'templates/2012/output/'
Whether it is gzipped by your web server is not really the point; you should save every byte that you can for performance reasons.
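If you look at some of the biggest sites out there, they often write markup that a strict validator would reject in order to save bytes. For example, it is common to omit the double quotes around id attribute values (this is invalid in XHTML, although HTML5 permits unquoted values that contain no spaces or quote characters):
<!-- Unquoted: invalid in XHTML, allowed in HTML5 -->
<div id=mydiv> ... </div>
<!-- Quoted: valid everywhere -->
<div id="mydiv"> ... </div>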
And there are several examples like this one, but that's beyond the scope of the thread, I guess.
Back to the question, I put together a little build script that minifies your HTML, CSS and JS. Caveat: It doesn't cover the case of the PRE tag.
import os
import re
from subprocess import call

HEAD_DIR = 'templates/2012/head/'
OUT_DIR = 'templates/2012/output/'
REMOVE_WS = re.compile(r"\s{2,}").sub
YUI_COMPRESSOR = 'java -jar tools/yuicompressor-2.4.7.jar '
CLOSURE_COMPILER = 'java -jar tools/compiler.jar --compilation_level ADVANCED_OPTIMIZATIONS '


def ensure_dir(f):
    # Create the parent directory of path f if it does not exist yet.
    d = os.path.dirname(f)
    if d and not os.path.exists(d):
        os.makedirs(d)


def getTarget(fn):
    # Map a source path under HEAD_DIR to the matching path under OUT_DIR.
    return fn.replace(HEAD_DIR, OUT_DIR)


def processHtml(fn, tg):
    # Collapse every run of two or more whitespace characters into one space.
    with open(fn, 'r') as f:
        content = f.read()
    content = REMOVE_WS(" ", content)
    ensure_dir(tg)
    with open(tg, 'w') as d:
        d.write(content)
    return content


def processCSS(fn, tg):
    # Minify CSS with the YUI Compressor jar.
    cmd = YUI_COMPRESSOR + fn + ' -o ' + tg
    call(cmd, shell=True)


def processJS(fn, tg):
    # Minify JS with the Closure Compiler jar.
    cmd = CLOSURE_COMPILER + fn + ' --js_output_file ' + tg
    call(cmd, shell=True)


# Script starts here.
ensure_dir(OUT_DIR)
for root, dirs, files in os.walk(os.getcwd()):
    for dir in dirs:
        print("Processing " + os.path.join(root, dir))
    for file in files:
        fn = root + '/' + file
        # Skip files that are already in the output directory.
        if OUT_DIR in fn:
            continue
        tg = getTarget(fn)
        if file.endswith('.html'):
            processHtml(fn, tg)
        if file.endswith('.css'):
            processCSS(fn, tg)
        if file.endswith('.js'):
            processJS(fn, tg)
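As for the PRE caveat: one way to handle it would be to split the document on <pre> blocks and collapse whitespace only in the pieces outside them. This is a rough sketch, not part of the script above; the name collapse_outside_pre is made up, and it assumes <pre> elements are not nested and ignores <textarea> and inline scripts, which have the same problem.

```python
import re

# The capturing group makes split() keep each <pre>...</pre> block in the result.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)
MULTI_WS = re.compile(r"\s{2,}")


def collapse_outside_pre(html):
    """Collapse runs of whitespace everywhere except inside <pre> blocks."""
    parts = PRE_BLOCK.split(html)
    # With one capturing group, even indexes are the text between <pre>
    # blocks and odd indexes are the <pre> blocks themselves.
    for i in range(0, len(parts), 2):
        parts[i] = MULTI_WS.sub(" ", parts[i])
    return "".join(parts)


sample = "<p>a   b</p>\n\n<pre>x\n\n  y</pre>\n\n<p>c</p>"
print(collapse_outside_pre(sample))
```

In processHtml you could then call collapse_outside_pre(content) in place of REMOVE_WS(" ", content).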