Compress (minimize) HTML from Python

2023-02-22 21:33

How can I compress (minimize) HTML from Python? I know I can use some regex to strip spaces and other things, but I want a real compiler written in pure Python (so it can be used on Google App Engine).

I ran a test on an online HTML compressor and it saved 65% of the HTML size. I want that, but from Python.

You can use htmlmin to minify your html:

import htmlmin

html = """
<!DOCTYPE html>
<html lang="en">
<head>
  <title>Bootstrap Case</title>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
</head>
<body> 
<div class="container">
  <h2>Well</h2>
  <div class="well">Basic Well</div>
</div>
</body>
</html>
"""

# In Python 3 the literal above is already text (str), so no .decode() is needed.
minified = htmlmin.minify(html, remove_empty_space=True)
print(minified)

htmlmin and html_slimmer are simple HTML minifying tools for Python. I have millions of HTML pages stored in my database, and by running htmlmin I am able to reduce the page size by between 5 and 50%. Neither of them does an optimal job of complete HTML minification (e.g. the font color #000000 can be reduced to #000), but it's a good start. I use a try/except block that runs htmlmin and, if that fails, falls back to html_slimmer, because htmlmin seems to provide better compression but does not support non-ASCII characters.
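The color shortening mentioned above is not something either library is claimed to do here, but it can be sketched with a regular expression. This is only an illustration: `shorten_hex_colors` is a hypothetical helper, and because it works on raw text it would also rewrite a fragment link such as href="#aabbcc", so treat it as a starting point rather than a safe post-processor.

```python
import re

# Matches a 6-digit hex color whose three channel pairs each repeat one digit,
# e.g. #000000 or #ffcc00. The lookahead leaves longer hex runs untouched.
_HEX6 = re.compile(
    r"#([0-9a-fA-F])\1([0-9a-fA-F])\2([0-9a-fA-F])\3(?![0-9a-fA-F])"
)

def shorten_hex_colors(markup):
    """Rewrite #rrggbb as #rgb wherever the short form is lossless."""
    return _HEX6.sub(r"#\1\2\3", markup)

sample = (
    '<font color="#000000">a</font>'
    '<b style="color:#ffcc00">b</b>'
    '<i style="color:#123456">c</i>'
)
print(shorten_hex_colors(sample))
```

Colors such as #123456, which have no 3-digit equivalent, are left unchanged.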

Example Code:

import htmlmin
from slimmer import html_slimmer  # or xhtml_slimmer, css_slimmer

try:
    html = htmlmin.minify(html, remove_comments=True, remove_empty_space=True)
except Exception:
    # Fall back to slimmer after flattening line breaks and tabs to spaces.
    html = html_slimmer(html.strip().replace('\n', ' ').replace('\t', ' ').replace('\r', ' '))

Good Luck!

I suppose there is no real need to minify your HTML on GAE, since GAE already gzips it; see "Caching & GZip on GAE" (Community Wiki).

I did not test it, but the minified version of the HTML will probably save only about 1% in size once both versions are compressed, since minification only removes whitespace.

If you want to save storage, for example when putting the page in memcached, you gain more from gzipping it (even at a low compression level) than from removing whitespace: the result will probably be smaller, and it will be faster, because the compression runs in C rather than in pure Python.
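One way to check this claim on your own pages is to measure the four sizes directly. The sketch below is an assumption-laden illustration: it uses the standard zlib module (the same DEFLATE algorithm that gzip uses), a crude whitespace collapse as a stand-in for a real minifier, and an artificially repetitive sample page, so its numbers are not representative of real HTML. The names `strip_whitespace` and `sizes` are made up for this example.

```python
import re
import zlib

def strip_whitespace(markup):
    # Crude stand-in for a minifier: collapse every whitespace run to one space.
    return re.sub(r"\s+", " ", markup).strip()

def sizes(markup, level=1):
    """Return byte counts for raw, minified, compressed, and minified+compressed."""
    raw = markup.encode("utf-8")
    mini = strip_whitespace(markup).encode("utf-8")
    return {
        "raw": len(raw),
        "minified": len(mini),
        "compressed": len(zlib.compress(raw, level)),
        "minified+compressed": len(zlib.compress(mini, level)),
    }

# Synthetic page: one row repeated 200 times, so it compresses unusually well.
page = (
    "<html>\n  <body>\n"
    + '    <div class="well">Basic Well</div>\n' * 200
    + "  </body>\n</html>\n"
)

for name, size in sizes(page).items():
    print(name, size)
```

Running it against a sample of real pages would show how much minification still saves once compression is applied.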

import htmlmin

code='''<body>
    Hello World
    <div style='color:red;'>Hi</div>
    </body>
'''

htmlmin.minify(code)

Output of the last line:

<body> Hello World <div style=color:red;>Hi</div> </body>

To remove the empty space as well, use:

htmlmin.minify(code, remove_empty_space=True)

I wrote a build script that copies my templates into another directory, and I use this trick to make my application select the right templates in development mode or in production:

DEV = os.environ['SERVER_SOFTWARE'].startswith('Development') and not PRODUCTION_MODE

TEMPLATE_DIR = 'templates/2012/head/' if DEV else 'templates/2012/output/'
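The two lines above raise a KeyError when SERVER_SOFTWARE is not set, for example in a unit test outside App Engine. A slightly more defensive variant might look like the sketch below. The function name `template_dir` and its parameters are hypothetical, and it assumes, as the snippet above implies, that the development server reports a SERVER_SOFTWARE value starting with "Development".

```python
import os

def template_dir(environ=os.environ, production_mode=False):
    """Pick the template directory for the current environment."""
    # A missing variable is treated as "not the development server".
    software = environ.get("SERVER_SOFTWARE", "")
    dev = software.startswith("Development") and not production_mode
    # Unminified sources in development, minified build output otherwise.
    return "templates/2012/head/" if dev else "templates/2012/output/"

print(template_dir({"SERVER_SOFTWARE": "Development/2.0"}))
print(template_dir({}))
```

Passing the environment as an argument keeps the selection easy to test without touching os.environ.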

Whether it is gzipped by your web server is not really the point; you should save every byte that you can for performance reasons.

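If you look at some of the biggest sites out there, they often write non-strict or even invalid HTML to save bytes. For example, it is common to omit the double quotes around id attribute values in HTML tags (this is invalid in XHTML, although HTML5 permits it for simple values):

<!-- Unquoted attribute value: invalid XHTML, accepted by HTML5 -->
<div id=mydiv> ... </div>


<!-- Quoted attribute value: valid everywhere -->
<div id="mydiv"> ... </div>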

There are several more examples like this one, but that is beyond the scope of this thread, I guess.

Back to the question: I put together a little build script that minifies your HTML, CSS and JS. Caveat: it doesn't handle the PRE tag.

import os
import re

from subprocess import call

HEAD_DIR = 'templates/2012/head/'

OUT_DIR = 'templates/2012/output/'

# Collapse any run of two or more whitespace characters into a single space.
REMOVE_WS = re.compile(r"\s{2,}").sub

YUI_COMPRESSOR = 'java -jar tools/yuicompressor-2.4.7.jar '

CLOSURE_COMPILER = 'java -jar tools/compiler.jar  --compilation_level ADVANCED_OPTIMIZATIONS '

def ensure_dir(f):
    # Create the parent directory of f if it does not exist yet.
    d = os.path.dirname(f)
    if d and not os.path.exists(d):
        os.makedirs(d)

def getTarget(fn):
    return fn.replace(HEAD_DIR, OUT_DIR)

def processHtml(fn, tg):
    with open(fn, 'r') as f:
        content = f.read()
    content = REMOVE_WS(" ", content)
    ensure_dir(tg)
    with open(tg, 'w') as d:
        d.write(content)
    return content

def processCSS(fn, tg):
    ensure_dir(tg)
    cmd = YUI_COMPRESSOR + fn + ' -o ' + tg
    call(cmd, shell=True)

def processJS(fn, tg):
    ensure_dir(tg)
    cmd = CLOSURE_COMPILER + fn + ' --js_output_file ' + tg
    call(cmd, shell=True)

# Script starts here.
ensure_dir(OUT_DIR)
for root, dirs, files in os.walk(os.getcwd()):
    for d in dirs:
        print("Processing", os.path.join(root, d))
    for file in files:
        fn = os.path.join(root, file)
        # Skip files that are already in the output directory.
        if OUT_DIR in fn:
            continue
        tg = getTarget(fn)
        if file.endswith('.html'):
            processHtml(fn, tg)
        elif file.endswith('.css'):
            processCSS(fn, tg)
        elif file.endswith('.js'):
            processJS(fn, tg)
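
The PRE caveat comes from the script collapsing whitespace everywhere, including inside <pre> blocks where it is significant. One possible way around it, sketched here with a hypothetical helper `collapse_ws_keep_pre`, is to split the markup on <pre>...</pre> blocks and collapse whitespace only in the pieces between them. The regex is deliberately naive: it does not protect <textarea> or <script> content and does not understand comments.

```python
import re

# Split on <pre>...</pre>; the capturing group keeps those blocks in the result.
_PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)
_WS_RUN = re.compile(r"\s{2,}")

def collapse_ws_keep_pre(markup):
    """Collapse whitespace runs everywhere except inside <pre> blocks."""
    parts = _PRE_BLOCK.split(markup)
    # With one capturing group, even indexes hold the text between <pre>
    # blocks and odd indexes hold the <pre> blocks themselves.
    for i in range(0, len(parts), 2):
        parts[i] = _WS_RUN.sub(" ", parts[i])
    return "".join(parts)

src = "<div>\n\n  a</div><pre>  x\n\n  y</pre>\n\n<p>b</p>"
print(collapse_ws_keep_pre(src))
```

In the build script, this function could take the place of the REMOVE_WS call inside processHtml.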
