preg_match pattern to find the contents of a string between <html> and </html> tags

2023-01-14 04:59

I'm working on a PHP script that reads the content of emails, and pulls out certain information to store in a database.

Using imap_fetchbody($imap_stream, $msg_number, 1), I'm able to get at the body of the email. In some cases (especially emails sent as SMS from mobile phones), the body of the email looks like this:

===------=_Part_110734_170079945.1283532109852
Content-Type: text/html;charset=UTF-8;
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

<html> 
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
        <title>Multimedia Message</title> 
    </head> 
    <body leftmargin="0" topmargin="0"> 


                <tr height="15" style="border-top: 1px solid #0F7BBC;"> 
                    <td> 
                        SMS to email test
                    </td> 
                </tr> 


     </body> 
</html> 


------=_Part_110734_170079945.1283532109852--===

I want to pull out the "content" of the email. So, my plan is this:

Check to see if the body contains the "html" tags. If not, I can read it normally (it's not an HTML email).

If it does, extract the content between the "html" tags. Then, eliminate all the other HTML tags, and the "content" is what's left.

However, I'm pretty clueless when it comes to regex patterns.

I tried this:

$pattern = '/<html[^>]*>(.*?)<\/html>/i';
preg_match($pattern, $body, $matches);
// my 'content' should be in $matches[1]

But that didn't work (probably because $body contains newlines and other whitespace). So then I tried this:

$pattern = '/<html[^>]*>([.\s]*?)<\/html>/i';
preg_match($pattern, $body, $matches);

But that didn't work either.

So, what $pattern can I use to extract all the text between the "html" tags?

UPDATE: I've stumbled into a workaround - strip all the whitespace first:

$body = preg_replace('/\s\s+/', ' ', $body);
$pattern = '/<body[^>]*>(.*?)<\/body>/';

I suspect this isn't the fastest or most efficient method, but it works, and is the best I've got so far. I'd still appreciate a better solution if there is one, though.

UPDATE 2: Thanks to Gumbo's suggestions, I've tried a little harder to dig through the structure of the email to find the part I was looking for, instead of attempting to regex HTML. I finally found this: http://docstore.mik.ua/orelly/webprog/pcook/ch17_04.htm, which explains how to do exactly what I needed.

$pattern = '/<html[^>]*>([^\00]*?)<\/html>/i';

That will only break if there's a 0x00 byte in the content, which there should not be.

[.\s] means either a literal . or a whitespace character. What you need is either (.|\s), or [\s\S], or you simply set the s modifier to have . also match line breaks.
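A minimal sketch of the difference between these variants, run against a made-up $body that contains line breaks (the sample string and the array labels are assumptions for illustration only):

```php
<?php
// Hypothetical sample body with line breaks between the tags.
$body = "<html>\n<body>\nSMS to email test\n</body>\n</html>";

$patterns = [
    'dot only'         => '/<html[^>]*>(.*?)<\/html>/i',      // . stops at newlines
    'dot or space set' => '/<html[^>]*>([.\s]*?)<\/html>/i',  // literal dot or whitespace only
    'space or non-space set' => '/<html[^>]*>([\s\S]*?)<\/html>/i', // any character
    's modifier'       => '/<html[^>]*>(.*?)<\/html>/si',     // . also matches newlines
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() returns 1 on a match and 0 when nothing matched.
    $results[$label] = preg_match($pattern, $body, $matches);
    echo $label, ': ', $results[$label] ? 'match' : 'no match', "\n";
}
```

Only the last two patterns should report a match here, because the text between the tags contains characters that are neither a literal dot nor whitespace.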

But besides that, you should not use regular expressions to match HTML. Parts of HTML are not regular and thus you cannot use regular expressions to describe it.
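As one possible alternative to a regex, here is a rough sketch that uses PHP's bundled DOM extension to pull the text out of the body element. The helper name html_to_text and the sample markup are assumptions, and the DOM extension has to be enabled:

```php
<?php
// Hypothetical helper: return the visible text of an HTML document using DOM, not regex.
function html_to_text(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Email HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body !== null ? $body->textContent : '';

    // Collapse runs of whitespace left over from the markup's indentation.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<html><head><title>Multimedia Message</title></head>'
      . '<body><table><tr><td> SMS to email test </td></tr></table></body></html>';
echo html_to_text($html), "\n";
```

Because only the body element is read, the title in the head does not end up in the result.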

But besides that, you should not try to guess the extent of a part in a multipart message when the message has distinct delimiters, and those delimiters are not <html>…</html>. What if the tags are missing? Then your attempt will fail. Use the delimiters defined by the message itself: the boundary value. Use the boundary to get the parts, and split each part at the first CRLF+CRLF sequence to separate the header from the body.
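The boundary approach could look roughly like the sketch below. The helper name split_multipart is made up, and the boundary value is read off the sample in the question (the delimiter line minus its leading "--"); in a real message it would come from the Content-Type header, which the question does not show. The sketch also assumes every part carries at least one header line.

```php
<?php
// Hypothetical helper: split a raw multipart body into parts using its boundary,
// then split each part into headers and content at the first blank line.
function split_multipart(string $raw, string $boundary): array
{
    $parts = [];
    // Each part is introduced by "--" followed by the boundary value.
    $chunks = explode('--' . $boundary, $raw);
    array_shift($chunks); // drop the preamble before the first delimiter

    foreach ($chunks as $chunk) {
        // The closing delimiter is "--boundary--", so what follows it starts with "--".
        if (strncmp($chunk, '--', 2) === 0) {
            break;
        }
        // Headers end at the first empty line (CRLF CRLF; bare LF is tolerated too).
        $pieces = preg_split('/\r?\n\r?\n/', ltrim($chunk, "\r\n"), 2);
        $parts[] = [
            'headers' => $pieces[0],
            'body'    => isset($pieces[1]) ? trim($pieces[1]) : '',
        ];
    }
    return $parts;
}

// Boundary assumed from the sample message in the question.
$boundary = '----=_Part_110734_170079945.1283532109852';
$raw = '--' . $boundary . "\r\n"
     . "Content-Type: text/html;charset=UTF-8;\r\n"
     . "Content-Transfer-Encoding: 7bit\r\n"
     . "\r\n"
     . "<html><body>SMS to email test</body></html>\r\n"
     . '--' . $boundary . "--\r\n";

$parts = split_multipart($raw, $boundary);
echo $parts[0]['headers'], "\n";
echo $parts[0]['body'], "\n";
```

This works even when a part has no <html> tags at all, since the part limits come from the message's own delimiters.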

But besides that, why don’t you use the IMAP functions to get the body? I’m not familiar with PHP’s IMAP API, but there probably is a function that does exactly what you’re looking for.
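One way this might look with the IMAP extension is to ask imap_fetchstructure() for the message layout, locate the text/html part, and pass its section number to imap_fetchbody(). The sketch below covers only the lookup step, so it can be tried without a mail server. The helper name find_html_section is an assumption, the mock object only imitates the fields the helper reads, and nested message/rfc822 parts are not handled.

```php
<?php
// Hypothetical helper: walk a structure object shaped like the one returned by
// imap_fetchstructure() and return the section number of the first text/html part.
function find_html_section(object $structure, string $prefix = ''): ?string
{
    // type 0 means text; subtype holds e.g. "PLAIN" or "HTML".
    if (($structure->type ?? null) === 0
        && strtoupper($structure->subtype ?? '') === 'HTML') {
        // A message that is not multipart has its body in section "1".
        return $prefix === '' ? '1' : $prefix;
    }
    foreach ($structure->parts ?? [] as $index => $part) {
        // Sections are numbered from 1 and nested with dots, e.g. "1.2".
        $section = ($prefix === '' ? '' : $prefix . '.') . ($index + 1);
        $found = find_html_section($part, $section);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

// Mock of a multipart/alternative message with a plain and an HTML part.
$structure = (object) [
    'type'    => 1,
    'subtype' => 'ALTERNATIVE',
    'parts'   => [
        (object) ['type' => 0, 'subtype' => 'PLAIN'],
        (object) ['type' => 0, 'subtype' => 'HTML'],
    ],
];
$section = find_html_section($structure);
echo $section, "\n";

// With a live connection the calls would be along these lines:
// $section = find_html_section(imap_fetchstructure($imap_stream, $msg_number));
// $html    = imap_fetchbody($imap_stream, $msg_number, $section);
```

The fetched part may still be base64 or quoted-printable encoded (the part's encoding field says which), so it can need decoding before any tags are stripped.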

You can use an HTML parser such as http://php-html.sourceforge.net/

Or you can use strip_tags: php.net/strip_tags
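A short sketch of the strip_tags route, using a fragment modeled on the body in the question (the sample string is an assumption, trimmed down for illustration):

```php
<?php
// Sample fragment modeled on the email body in the question.
$html = "<body leftmargin=\"0\" topmargin=\"0\">\n"
      . "    <tr height=\"15\">\n"
      . "        <td>\n"
      . "            SMS to email test\n"
      . "        </td>\n"
      . "    </tr>\n"
      . "</body>";

$text = strip_tags($html);                               // drop every tag, keep the text
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // turn &amp; and friends back into characters
$text = trim(preg_replace('/\s+/', ' ', $text));         // collapse indentation and line breaks

echo $text, "\n"; // SMS to email test
```

Note that strip_tags removes only the tags themselves, so text inside elements such as title or style is kept. Running it on the body portion rather than the whole document avoids picking that up.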

You just need to add the s modifier to allow . to match newlines:

$pattern = '/<html[^>]*>(.*?)<\/html>/si';
preg_match($pattern, $body, $matches);
