How to separate possible URI from other content in PHP?
What is the simplest and fastest way to check if string is single URL or TEXT (that might contain urls)
possible scenarios:
// successful scenario
$example[] = 'http://sub-domain.my-domain.com/folder/file.php?some=param';
// successful scenario
$example[] = '/assets/scripts/jquery.min.js?v=1.4';
// successful scenario
$example[] = 'jquery.min.js';
// this scenario should fail validation
$example[] = "http://www.domain.com welcome text\n and some other http://www.domain.com";
// this scenario should fail validation
$example[] = "scriptVar=50;";
I have tried to use native php functions like parse_url, filter_var but non of them work as expected.
UPDATE 1
To make it more clear, I'm trying to separate possible URI from script content that would be inserted as DOM element. All urls would go as SRC attribute and rest as content, example:
<script type="text/javascript" src="{$string}"></script>
<script type="text/javascript">{$string}</script>
UPDATE 2 By analysing possibl开发者_如何转开发e content I come to conclusion that string containing white space character or semicolon mean that string could not be URI, I presume that this pattern could solve my problem:
preg_match('/[\s]|[;]/', $string);
would it cover all possible javascript/css code?
$exampleData = Array(
'http://sub-domain.my-domain.com/folder/file.php?some=param',
'/assets/scripts/jquery.min.js?v=1.4',
'<a href="/assets/scripts/jquery.min.js?v=1.4">',
'<a href="assets/scripts/jquery.min.js?v=1.4">',
'http://www.domain.com welcome text\n and some other http://www.domain.com',
);
foreach($exampleData as $example)
{
echo "Trying \"" . $example . "\" -> ";
echo (preg_match('%((http(s)?://|www\.)[^ \r\n]+|<a.+?href=(\'|")(http(s)?://|www\.|[^#])[^\4\r\n]*?\4.*?>)%i', $example)) ?
"Match" : "No match";
echo "\r\n";
}
This would produce:
Trying "http://sub-domain.my-domain.com/folder/file.php?some=param" -> Match
Trying "/assets/scripts/jquery.min.js?v=1.4" -> No match
Trying "<a href="/assets/scripts/jquery.min.js?v=1.4">" -> Match
Trying "<a href="assets/scripts/jquery.min.js?v=1.4">" -> Match
Trying "http://www.domain.com welcome text\n and some other http://www.domain.com" -> Match
Update:
After reading your last update. If you want to parse HTML. Use a DOM-parser like:
http://simplehtmldom.sourceforge.net/
Example:
include_once('simple_html_dom.php');
$dom = file_get_html('http://www.stackoverflow.com/');
foreach($dom->find('script') as $scriptElement)
{
if(strlen(trim($scriptElement->src)) > 0)
{
// Script with URI set
echo "<strong>Found script with URI</strong>";
echo "<p>" . $scriptElement->src . "</p>";
}
else
{
// Script with content
echo "<strong>Found script with content</strong>";
echo("<p>" . nl2br(htmlspecialchars($scriptElement->innertext)) . "</p>");
}
}
Would output something like(HTML stripped):
Found script with URI
http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js
Found script with URI
http://sstatic.net/js/master.min.js?v=afc76d4deac3
Found script with content
var imagePath='http://sstatic.net/stackoverflow/img/';
var inboxUnviewedCount = -1;
...etc
This function will return true if the passed text is an URL. It is based on a regex seen here on SO.
function validate_url ($url)
{
$regex = '/^(https?|ftp):\/\/'; //protocol
$regex .= '(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'; //username
$regex .= '(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'; //password
$regex .= '@)?'; //auth requires @
$regex .= '((([a-z0-9][a-z0-9-]*[a-z0-9]\.)*'; //domain segments AND
$regex .= '[a-z][a-z0-9-]*[a-z0-9]'; //top level domain OR
$regex .= '|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}';
$regex .= '(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'; //IP address
$regex .= ')(:\d+)?'; //port
$regex .= ')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'; //path
$regex .= '(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'; //query string
$regex .= '?)?)?'; //path and query string optional
$regex .= '(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'; //fragment
$regex .= '$/i';
return (preg_match($regex, $url) ? true : false);
}
You can try it here: http://www.exorithm.com/algorithm/view/validate_url
EDIT in response to comment, this function will validate URL fragments like /index.php or index.php
function validate_url_fragment ($url)
{
$regex = '/^(((\/?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'; //path
$regex .= '(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'; //query string
$regex .= '?)?)?'; //path and query string optional
$regex .= '(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'; //fragment
$regex .= '$/i';
return (preg_match($regex, $url) ? true : false);
}
if (validate_url_fragment($url) || validate_url($url)) {
//is url
} else {
//not url
}
(note that the empty string is valid, so you may want a special case for that)
filter_var
should do what you want for a single URL:
<?php
$safe_url = filter_var( $unsafe_url, FILTER_SANITIZE_URL );
?>
精彩评论