Best practice for HTML escaping user-supplied data with PHP (and ZF)
Note: I'm using Zend Framework, but I think most of this applies to PHP coding in general.
I'm trying to choose a strategy for writing views scripts, possibly with the help of a templating engine. Motivations: clarity and security. I'm just not happy with writing .phtml scripts. This syntax is awfully verbose to do the most often needed thing - outputting a variable:
<?php echo $this->escape($this->myVariable); ?>
In addition to the code being lengthy, IMHO the template author shouldn't have to remember (and bother) writing an escape call each time he/she wants to output a variable. Forgetting the call will almost definitely result in an XSS vulnerability.
I have two possible solutions for this problem:
Solution 1: A template engine with automatic escaping
I think at least Smarty has an option for automatically escaping html entities when outputting variables. There are points against Smarty, but maybe at least some of them are addressed in the upcoming 3.0 - I haven't checked yet.
XML based template engines like PHPTAL will also escape any data by default. They might look quite odd for a beginner, though. Maybe still worth trying?
Solution 2: Escape the data in the Model
Of course, the other option would be to escape the needed data already in the Model (or even the controller?). The Model should already know the content-type (mainly plain text or HTML text) of each field, so it would be kind of logical to escape the data there. The view could consider all data as safe HTML. This would allow eg. 开发者_运维技巧changing the datatype of a field from plain text to HTML without touching the view script - only by changing the Model.
But then again, it doesn't feel like good MVC practice. In addition, there are problems with this approach as well:
- sometimes the view only wants to print the first n characters, and we don't want to end up truncating the data
foo & bar
asfoo &am
(having first escaped it asfoo & bar
) - maybe the view wants to construct an URL with varName=$varName in the querystring - again, escaping already in the Model would be bad.
(These problems could be addressed by providing two versions of the data, or unescaping in the template. Seems bad to me.)
Ideas? Am I missing something? What do you consider "the best practice"?
PS. This post is about finding a general solution for any user-supplied plain-text data that may contain <
or >
or any other characters. So, filtering data before saving it to the database isn't the solution.
Update:
Thanks for all comments so far. I did some more research and will next evaluate Twig and possibly Open Power Template. Both seem interesting: Twig looks very straightforward, but the project is young. On the XML side, OPT's syntax looks a bit nicer than PHPTAL's. Both Twig and OPT are quite well documented.
Filter as soon as possible. You should ensure that all text input is proper UTF-8, to make your text manipulation functions work predictably.
But don't try to filter out "dangerous" characters or fragments! That doesn't work. Only fix or reject incorrect data on input. There's nothing incorrect in
<
or'
characters.Escape as late as possible. Add SQL escaping in your SQL query function (or better – use prepared statements). HTML-escape in your HTML templates. Quoted-Printable-escape in your e-mail generation functions, shell-escape when running CLI commands, etc.
Don't let escaped data spread all over your application, because the longer escaped data lives, the bigger chance you'll mix it up with unescaped data or break escaping during processing.
But then again, it doesn't feel like good MVC practice.
Totally agree, the Model's the wrong place for such presentation concerns and storing both an HTML and a raw version of every variable would make it easy for them to get out of sync. Forget solution 2.
That leaves you with alternative templating engines, or sticking with PHP and learning to bear the load of calling htmlspecialchars all the time. I'm open to the idea of alternative templating entries, but the ones I've tried so far I haven't really been happy with.
(Many discard PHP syntax and implement their own limited expression languages, which means you lose the advantage of the language you already know and are stuck with a noddy-language which makes more complex presentation logic impossible, so you end up doing it yourself in PHP with strings full of HTML, which is absolutely not a win.)
So for the moment I'd suggest a Solution 0a to add to the pile: define a global function with a short name to take the pain out of HTML-escaping:
<?php
function h($s) {
echo(htmlspecialchars($s, ENT_QUOTES));
}
?>
...
My lovely variable is <?php h($this->myVariable); ?>.
I've no idea why PHP doesn't define a shortcut for this, which is as you say by far the most common use case. Now they've dumped the short-tags for XML-PI-style tags, why isn't there one with another name to do the right thing, like say <?phph
?
This isn't a total solution, but one extremely helpful thing in this sort of situation is hungarian-style notation. Hungarian notation used all the time is just annoying, to me, but this is the kind of place where that sort of metadata in the variable name is very valuable. A good practice is to name your variables with a prefix that says what to expect from it...i.e. $rawUserInput, $escapedUserInput, etc.
This doesn't totally solve the problem, but it's a good coding practice. Then when you see a snippet of code that says
'SELECT * from table where username = ' + $rawUserName
it's immediately obvious that there's an injection vulnerability, because you know the raw prefix means you haven't escaped it.
There is a dozen ways of doing this. Here is a few:
- You could write your custom View class, as described in the Zend Framework Manual, and escape any variables when they are assigned to or requested from the View.
- In case of Datasets, you could wrap them into a custom ArrayIterator that does output escaping when fetching items from it, along with any other stuff you want to automate on output.
- Or you could use the View Script approach.
- Or, if you dont want to have your template authors write any PHP or template syntax whatsoever, you could ask them them to write just the structured HTML and then insert the values through the DomDocument extension.
As for PHP in a template being verbose, well.. it might not offer the shortest notation, but then again, it does provide a notation and it comes with no overhead. Even for non-PHP template authors it should be easy to learn a few method calls in PHP than a (often weird) template language that basically reinvents a subset of what PHP can do out of the box.
You could also use the Alternative PHP Syntax and NowDoc or HereDoc in your templates to get rid of <?php
and echo
calls, so you could end up with something like
<?php
// get some partial block done first
foreach($this->books as $book):
$loopdata = << LOOPDATA
<li> {$book->title} - {$book->author} - {$book->publisher}</li>
LOOPDATA;
endforeach;
// render entire template
echo << HTML
<h1>{$this->title}</h1>
<ul>{$loopdata}</ul>
HTML;
Personally, I dont find this too appealing, but as you can see, there is many ways to write your templates with PHP. Just pick one.
精彩评论