Extract HTML-like tags with PHP
I'm trying to extract OUTERMOST special HTML-like tags from a given string. Here's a sample string:
sample string with <::Class id="some id\" and more">text with possible other tags inside<::/Class> some more text
I need to find where in a string a <::Tag starts and where it ends. The problem is that it might contain nested tags. Is there a simple loop-like algorithm to find the FIRST occurrence of the <::Tag and the length of the string up to the matching <::/Tag>? I've tried a different way, using a simple HTML tag instead and using DomDocument, but it cannot tell me the position of the tag in a string. I cannot use external libraries; I'm just looking for pointers as to how this could be solved. Maybe you've seen an algorithm that does exactly that; I'd like to have a look at it.
Thanks for the help. P.S. Regex solutions will not work since there are nested tags. Recursive regex solutions will not work either. I'm just looking for a very simple parsing algorithm for this specific case.
What you're talking about here is making a template. Regex for parsing templates tends to be slow. Instead, your template-reading/processing engine should be doing a string parse. It's not super-easy, but it's also not terribly hard. Still, my advice is to use another template library instead of reinventing the wheel.
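To make the "string parse" idea concrete, here is a minimal sketch of a depth-counting scan. The function name findOutermostTag is hypothetical, and the sketch assumes that neither '<::' nor '>' appears inside attribute values and that tags are properly balanced; it does not check that the closing name matches the opening name.

```php
<?php
// Hypothetical helper (not from the thread): returns array(start, length) for the
// first outermost <::Tag ...>...<::/Tag> span, or null if no balanced span is found.
// Assumption: '<::' and '>' never occur inside quoted attribute values.
function findOutermostTag(string $src): ?array
{
    $start = null;  // offset of the outermost opening tag
    $depth = 0;     // number of tags currently open
    $offset = 0;    // where the next search begins

    while (($pos = strpos($src, '<::', $offset)) !== false) {
        $close = strpos($src, '>', $pos);
        if ($close === false) {
            return null; // unterminated tag
        }
        $isClosing = substr($src, $pos + 3, 1) === '/';
        if (!$isClosing) {
            if ($depth === 0) {
                $start = $pos; // remember where the outermost tag begins
            }
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
            if ($depth === 0) {
                // matching close of the outermost tag: include the closing '>'
                return array($start, $close + 1 - $start);
            }
        }
        // a stray closing tag at depth 0 is simply skipped
        $offset = $close + 1;
    }
    return null; // no tag, or the outermost tag was never closed
}
```

Calling substr($src, $result[0], $result[1]) on a non-null result gives the whole outermost block, nested tags included.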
There's an open-source template engine in PHPBB that you could utilize or learn from. Or, use something like Smarty. If performance is a major deal, have a look at Blitz.
strpos + strrpos
(ouch...)
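$str = 'sample string with <::Class id="some id" and more">text with possible <::Strong>other<::/Strong> tags inside<::/Class> some more text';
$tag = '<::';
$first = strpos($str, $tag);  // offset of the first tag
$last = strrpos($str, $tag);  // offset of the last tag
$rtn = array();
$cnt = 0;
while ($first !== false && $first < $last)
{
    if (!$cnt)
    {
        // text before the first tag
        $rtn[] = substr($str, 0, $first);
    }
    ++$cnt;
    $next = strpos($str, $tag, $first + 1);
    if ($next === false)
    {
        break; // no further tag, so stop instead of looping forever
    }
    $pos = strpos($str, '>', $first);
    $rtn[] = substr($str, $first, $pos - $first + 1);  // the tag itself
    $rtn[] = substr($str, $pos + 1, $next - $pos - 1); // text up to the next tag
    $first = $next;
}
if ($last !== false)
{
    // the last tag together with whatever follows it
    $rtn[] = substr($str, $last);
}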
With the $rtn array, you can then do whatever you want. This code is not perfect yet.
array (
0 => 'sample string with ',
1 => '<::Class id="some id" and more">',
2 => 'text with possible ',
3 => '<::Strong>',
4 => 'other',
5 => '<::/Strong>',
6 => ' tags inside',
7 => '<::/Class> some more text',
)
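As one illustration of what could be done with $rtn, the sketch below walks the pieces with a depth counter and reassembles the outermost block together with its offset in the original string. The variable names $depth, $start and $outer are hypothetical, and the token list is copied from the output above; the sketch assumes balanced tags.

```php
<?php
// Token list as printed above.
$rtn = array(
    'sample string with ',
    '<::Class id="some id" and more">',
    'text with possible ',
    '<::Strong>',
    'other',
    '<::/Strong>',
    ' tags inside',
    '<::/Class> some more text',
);

$depth = 0;   // how many tags are currently open
$start = 0;   // offset of the outermost tag in the original string
$outer = '';  // the outermost tag with everything inside it

foreach ($rtn as $piece) {
    if (strpos($piece, '<::/') === 0) {
        $depth--;
        if ($depth === 0) {
            // the last piece carries trailing text, so keep only the closing tag
            $outer .= substr($piece, 0, strpos($piece, '>') + 1);
            break;
        }
    } elseif (strpos($piece, '<::') === 0) {
        $depth++;
    }
    if ($depth > 0) {
        $outer .= $piece;          // inside the outermost tag
    } else {
        $start += strlen($piece);  // text before the outermost tag
    }
}

$length = strlen($outer);
```

Here $start and $length are the position and length the question asks for, and $outer is the corresponding substring.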
So basically here's what I came up with. Something like ajreal's solution, only not as clean ;] I'm not even sure it works perfectly yet, but initial testing was successful.
protected function findFirstControl()
{
    $pos = strpos($this->mSource, '<::');
    if ($pos === false)
        return false;
    // get the control name
    $endOfName = false;
    $controlName = '';
    $len = strlen($this->mSource);
    $i = $pos + 3;
    while (!$endOfName && $i < $len)
    {
        $char = $this->mSource[$i];
        if (($char >= 'a' && $char <= 'z') || ($char >= 'A' && $char <= 'Z'))
            $controlName .= $char;
        else
            $endOfName = true;
        $i++;
    }
    if ($controlName == '')
        return false;
    $posOfEnd = strpos($this->mSource, '<::/' . $controlName, $i);
    $posOfStart = strpos($this->mSource, '<::' . $controlName, $i);
    if ($posOfEnd === false)
        return false;
    if ($posOfStart > $pos)
    {
        // every nested opening tag before the current closing tag pushes the match one closing tag further
        while ($posOfStart > $pos && $posOfEnd !== false && $posOfStart < $posOfEnd)
        {
            $i = $posOfStart + 1;
            $n = $posOfEnd + 1;
            $posOfStart = strpos($this->mSource, '<::' . $controlName, $i);
            $posOfEnd = strpos($this->mSource, '<::/' . $controlName, $n);
        }
    }
    if ($posOfEnd !== false)
    {
        // 5 = strlen('<::/') + strlen('>')
        $ln = $posOfEnd - $pos + strlen($controlName) + 5;
        return array($pos, $ln, $controlName, substr($this->mSource, $pos, $ln));
    }
    else
        return false;
}
Not an extendable solution, but it works.
$startPos = strpos($string, '<::Class');
$endPos = strrpos($string, '<::/Class>');
Note my use of strrpos to fix the nesting problem. Also, this will give you the start position of <::/Class>, not the end.
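Since strrpos returns where the closing tag starts, the length of the whole block needs the closing tag's own length added. A minimal sketch, assuming the tag name is known in advance and the string holds only one top-level Class block (with two sibling blocks, strrpos would span both); the sample string and the names $open, $close and $outer are made up for illustration:

```php
<?php
// Hypothetical sample with a nested block of the same name.
$string = 'sample string with <::Class id="a">text <::Class>inner<::/Class> more<::/Class> tail';
$open = '<::Class';
$close = '<::/Class>';

$startPos = strpos($string, $open);   // start of the first opening tag
$endPos = strrpos($string, $close);   // start of the LAST closing tag

$outer = null;
if ($startPos !== false && $endPos !== false && $endPos > $startPos) {
    // add the closing tag's length so the span ends after its '>'
    $length = $endPos + strlen($close) - $startPos;
    $outer = substr($string, $startPos, $length);
}
```

The strict !== false checks matter because a tag at offset 0 would otherwise be treated as "not found".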
Why don't you just use regular XML and the DOM? Or just an existing template engine like Smarty?