preg_match pattern to find the contents of a string between <html> and </html> tags

I'm working on a PHP script that reads the content of emails, and pulls out certain information to store in a database.

Using imap_fetchbody($imap_stream, $msg_number, 1), I can get the body of the email. In some cases (especially email sent as SMS from mobile phones), the body looks like this:

===------=_Part_110734_170079945.1283532109852
Content-Type: text/html;charset=UTF-8;
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

<html> 
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
        <title>Multimedia Message</title> 
    </head> 
    <body leftmargin="0" topmargin="0"> 


                <tr height="15" style="border-top: 1px solid #0F7BBC;"> 
                    <td> 
                        SMS to email test
                    </td> 
                </tr> 


     </body> 
</html> 


------=_Part_110734_170079945.1283532109852--===

I want to pull out the "content" of the email. So, my plan is this:

Check to see if the body contains the "html" tags. If not, I can read it normally (it's not an HTML email).

If it does, extract the content between the "html" tags. Then, eliminate all the other HTML tags, and the "content" is what's left.

However, I'm pretty clueless when it comes to regex patterns.

I tried this:

$pattern = '/<html[^>]*>(.*?)<\/html>/i';
preg_match($pattern, $body, $matches);
// my 'content' should be in $matches[1]

But that didn't work (probably because $body contains newlines and other whitespace). So then I tried this:

$pattern = '/<html[^>]*>([.\s]*?)<\/html>/i';
preg_match($pattern, $body, $matches);

But that didn't work either.

So, what $pattern can I use to extract all the text between the "html" tags?

UPDATE: I've stumbled into a workaround: collapse runs of whitespace (including newlines) into single spaces first, then match on the "body" tags:

$body = preg_replace('/\s\s+/', ' ', $body);
$pattern = '/<body[^>]*>(.*?)<\/body>/';

I suspect this isn't the fastest or most efficient method, but it works, and is the best I've got so far. I'd still appreciate a better solution if there is one, though.

UPDATE 2: Thanks to Gumbo's suggestions, I've tried a little harder to dig through the structure of the email to find the part I was looking for, instead of attempting to regex HTML. I finally found this: http://docstore.mik.ua/orelly/webprog/pcook/ch17_04.htm, which explains how to do exactly what I needed.


$pattern = '/<html[^>]*>([^\00]*?)<\/html>/i';

That will only break if there's a 0x00 byte in the content, which should never be the case.


[.\s] means either a literal . or a whitespace character. What you need is either (.|\s), or [\s\S], or you simply set the s modifier to have . also match line breaks.
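As a quick check of the point above, here is a minimal sketch that runs the question's two failing patterns and the three suggested fixes against a shortened, made-up body. The $body sample and the array labels are illustrative only, and the alternation is written with a non-capturing group, (?:.|\s), so that the content stays in group 1.

```php
<?php
// Shortened stand-in for the email body; note the embedded newlines.
$body = "<html>\n<body>\nSMS to email test\n</body>\n</html>";

$patterns = [
    'dot without s' => '/<html[^>]*>(.*?)<\/html>/i',         // . stops at newlines
    '[.\s]'         => '/<html[^>]*>([.\s]*?)<\/html>/i',     // literal dot or whitespace only
    '(.|\s)'        => '/<html[^>]*>((?:.|\s)*?)<\/html>/i',  // any character or whitespace
    '[\s\S]'        => '/<html[^>]*>([\s\S]*?)<\/html>/i',    // whitespace or non-whitespace
    's modifier'    => '/<html[^>]*>(.*?)<\/html>/si',        // . also matches newlines
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match returns 1 on a match and 0 when nothing matched.
    $results[$label] = preg_match($pattern, $body, $m);
    printf("%-14s %s\n", $label, $results[$label] === 1 ? 'match' : 'no match');
}
```

The first two patterns report no match on this input, and the last three match.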

But besides that, you should not use regular expressions to match HTML. Parts of HTML are not regular and thus you cannot use regular expressions to describe it.

But besides that, you should not guess where a part of a multipart message begins and ends when the message defines explicit delimiters. Those delimiters are not <html>…</html>; if the tags are missing, your attempt will fail. Use the delimiter defined by the message itself: the boundary value. Split the message on the boundary, then split each part at the first CRLF+CRLF sequence to separate its header from its body.
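The boundary approach could be sketched roughly as follows. split_multipart is a hypothetical helper, not a PHP built-in, and the sketch assumes the boundary string has already been read from the message's Content-Type header and that it never appears inside a part's content. It does not handle nested multiparts or decode any Content-Transfer-Encoding.

```php
<?php
// Hypothetical helper: split a raw multipart body into its parts,
// each returned as ['headers' => string, 'body' => string].
function split_multipart(string $raw, string $boundary): array
{
    $parts = [];
    // Every delimiter line is "--" followed by the boundary value.
    $chunks = explode('--' . $boundary, $raw);
    array_shift($chunks); // discard the preamble before the first delimiter

    foreach ($chunks as $chunk) {
        // The closing delimiter is "--boundary--", so its chunk starts with "--".
        if (strncmp($chunk, '--', 2) === 0) {
            break;
        }
        // The first blank line separates the part's headers from its body.
        $split = preg_split('/\r?\n\r?\n/', $chunk, 2);
        $parts[] = [
            'headers' => trim($split[0]),
            // Simplification: drops every trailing line break, not only the
            // one that belongs to the following delimiter.
            'body'    => rtrim($split[1] ?? '', "\r\n"),
        ];
    }
    return $parts;
}

// Made-up sample modelled on the message in the question.
$boundary = '----=_Part_110734_170079945.1283532109852';
$raw = "--{$boundary}\r\n"
     . "Content-Type: text/html;charset=UTF-8;\r\n"
     . "Content-Transfer-Encoding: 7bit\r\n"
     . "\r\n"
     . "<html><body>SMS to email test</body></html>\r\n"
     . "--{$boundary}--\r\n";

$parts = split_multipart($raw, $boundary);
echo $parts[0]['headers'], "\n\n", $parts[0]['body'], "\n";
```

Each part's headers say whether it is text/html or text/plain, so the decision no longer depends on finding "html" tags in the body.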

But besides that, why don't you use the IMAP functions to get the body? I'm not familiar with PHP's IMAP API, but there is probably a function that does exactly what you're looking for.


You can use an HTML parser such as http://php-html.sourceforge.net/

Or you can use strip_tags: php.net/strip_tags
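A minimal sketch of the strip_tags route, assuming the HTML part has already been isolated. html_to_text is a hypothetical helper name, and removing the head element with a regex first is an added step (not part of the suggestion above) so that the title text does not end up in the result.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to its visible text.
function html_to_text(string $html): string
{
    // Drop the <head> element so the <title> text is not kept.
    $html = preg_replace('/<head\b[^>]*>.*?<\/head>/si', '', $html);
    // Remove the remaining tags, keeping only the text between them.
    $text = strip_tags($html);
    // Turn entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample modelled on the message in the question.
$html = "<html>\n<head><title>Multimedia Message</title></head>\n"
      . "<body>\n<tr><td>\n SMS to email test\n</td></tr>\n</body>\n</html>";

$text = html_to_text($html);
echo $text, "\n"; // prints "SMS to email test"
```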


You just need to add the s modifier so that . also matches newlines:

$pattern = '/<html[^>]*>(.*?)<\/html>/si';
preg_match($pattern, $body, $matches);