PHP: How do I search through an HTML document and extract certain strings?
I have an HTML document that I saved as a .txt file. I want to extract each string following /user/ and make a comma-separated list of all the extracted strings. So every time there's a "/user/boy34" in this txt file, I would like to extract the "boy34" part. I'm really new to PHP, but I've been reading about the preg_match_all() function and I think that's what I need to use.
Here's what I've come up with so far, but it doesn't work:
<?php
$str = file_get_contents("comment.txt");
preg_match_all ('/^(user\/)\/[A-Z0-9][A-Z0-9_-]+\"$/i', $str, $preg);
print_r ($preg);
?>
The output I get from this is:
Array ( [0] => Array ( ) [1] => Array ( ) )
Can somebody please help me?
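Using ^ in a regex means the match must start at the very beginning of the string (or of a line, in multiline mode). Likewise, the $ at the end means the string (or line) must end right after the match. So you will never find anything unless the text is nothing but /user/boy34. Your pattern also asks for two slashes after user, since (user\/)\/ matches user//. If you do keep ^ and $, you need the m flag (multiline mode) so that they apply to each line rather than to the whole string.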
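To see the effect of the anchors, here is a minimal sketch. The sample string is made up for illustration and is not taken from the asker's file; it only assumes the links look like href="/user/name".

```php
<?php
// Hypothetical sample input: two links on separate lines.
$str = "<a href=\"/user/boy34\">boy34</a>\n<a href=\"/user/girl7\">girl7</a>";

// Anchored: ^ and $ require the whole string to be exactly "/user/<name>".
$anchored = preg_match_all('/^\/user\/[A-Z0-9_-]+$/i', $str, $m1);

// Unanchored: matches "/user/<name>" wherever it occurs in the string.
$unanchored = preg_match_all('/\/user\/[A-Z0-9_-]+/i', $str, $m2);

var_dump($anchored);   // int(0)
var_dump($unanchored); // int(2)
```

preg_match_all() returns the number of full matches, so the anchored version finds nothing while the unanchored one finds both links.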
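You should also use the shortcuts, like \w (word characters, equivalent to [A-Za-z0-9_]). Note that \w does not include the hyphen that your character class allowed, so use [\w-]+ instead if usernames can contain hyphens.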
Try out this regex pattern: /"\/user\/(\w+)"/im
If you post an example of your HTML, I can actually test this out and get you a working regex pattern.
--- UPDATE ---
I tested using this HTML:
<html>
<body>
<a href="/user/boy30" />
<a href="/user/boy31" />
<a href="/user/boy32" />
</body>
</html>
and the regex mentioned above, and I got it to work in this very simple test. I used this site to test: http://www.spaweditor.com/scripts/regex/index.php
Here were my results:
Array
(
    [0] => Array
        (
            [0] => "/user/boy30"
            [1] => "/user/boy31"
            [2] => "/user/boy32"
        )
    [1] => Array
        (
            [0] => boy30
            [1] => boy31
            [2] => boy32
        )
)
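To get from that result to the comma-separated list asked for in the question, something along these lines should work. It is only a sketch: the inline sample string stands in for the contents of comment.txt so that the snippet runs on its own.

```php
<?php
// Stand-in for the saved file; in the real script this would be:
// $str = file_get_contents("comment.txt");
$str = '<a href="/user/boy30" /><a href="/user/boy31" /><a href="/user/boy32" />';

// Group 1 captures only the name that follows /user/.
preg_match_all('/"\/user\/(\w+)"/i', $str, $matches);

// $matches[0] holds the full matches, $matches[1] the captured names.
$list = implode(",", $matches[1]);

echo $list; // boy30,boy31,boy32
```

If the same user can appear more than once in the file, wrapping $matches[1] in array_unique() before implode() would drop the repeats.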
--- Regex Explanation ---
/
Opening delimiter, required at the start of any regex pattern in PHP
"
Looks for a double-quote character
\/user\/
Searches for /user/ (the forward slashes need to be escaped because / is also the delimiter)
(
Starts a group. Anything between parentheses will be grouped together in your results (leaving the parentheses out will not break the regex, it will still find the matches, but the group lets us extract "boy32" on its own)
\w+
Searches for 1 or more (+ means "1 or more") word characters (equivalent to [a-zA-Z0-9_])
)
Ends the grouping started before
"
Looks for another double-quote character
/
Closing delimiter, required at the end of any regex pattern and before any flags
i
Flag: Case-Insensitive Mode
m
Flag: Multi-Line Mode (makes ^ and $ match at the start and end of each line instead of only at the start and end of the whole string; this pattern has no anchors, so the flag is harmless here but not required)
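As a side note on the delimiters: PHP accepts other non-alphanumeric characters as delimiters too, which avoids having to escape the slashes. A small sketch comparing the two spellings, using a made-up sample string:

```php
<?php
// Hypothetical sample input for illustration.
$html = '<a href="/user/boy30" /> <a href="/user/boy31" />';

// Slash delimiters: every literal / inside the pattern must be escaped.
preg_match_all('/"\/user\/(\w+)"/i', $html, $withSlashes);

// Hash delimiters: the slashes can be written as-is.
preg_match_all('#"/user/(\w+)"#i', $html, $withHashes);

var_dump($withSlashes[1] === $withHashes[1]); // bool(true)
print_r($withHashes[1]);
```

Both patterns capture the same names; the second is just easier to read when the text being matched is full of slashes.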
You can use XPath. Check this thread out - Execute a XQuery with PHP. Hope that helps.
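For reference, here is a rough sketch of what the XPath route could look like with PHP's built-in DOM extension (DOMDocument and DOMXPath). It assumes the usernames only appear in the href attribute of a tags, which the question does not confirm, and the sample markup is invented for the example.

```php
<?php
// Hypothetical sample markup; the real input would come from the saved file.
$html = '<html><body>'
      . '<a href="/user/boy30"></a><a href="/user/boy31"></a><a href="/about"></a>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them,
// since real-world HTML is often not perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = [];
// Select the href attribute of every link whose href starts with /user/.
foreach ($xpath->query('//a[starts-with(@href, "/user/")]/@href') as $attr) {
    // Strip the leading "/user/" to keep only the name.
    $names[] = substr($attr->nodeValue, strlen('/user/'));
}

echo implode(",", $names); // boy30,boy31
```

Compared with a regex, this only looks at real link attributes, so a /user/ string that happens to appear in ordinary text or in a script would be ignored.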