Matching HTML attributes with regex in PHP
I'm trying to write an expression that will search through a page like how2bypass.co.cc and return the contents of the "action" attribute in the "form" tag, and the contents of the "name" and "type" attributes in any input tags. I can't use an HTML parser because my ultimate goal is to automatically detect whether a given page is a web proxy, and once sites catch on that I'm doing that, they're probably going to start doing silly things like writing the entire document with JavaScript to stop me from parsing it.
I'm using the code
preg_match_all('/<form.*action\="(.*?)".*>[^<]*<input.*type\=/i', $pageContents, $inputMatches);
which works fine for the action attribute, but once I put a " after type\= the code stops working. Why is this? Why does it work fine once, but not twice?
Regular expression quantifiers are greedy by default...
If you inspect the page source, the following is probably matching the first `<input` with the last `type=`, and capturing everything in between:
`<input.*type\=`
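To see the difference, here is a minimal sketch that runs the greedy pattern and a lazy variant against the same string. The sample markup in `$html` is made up for illustration and is not taken from the page in the question.

```php
<?php
// Hypothetical sample markup: two inputs on one line, no newlines.
$html = '<input type="text" name="q"><input type="submit" name="go">';

// Greedy: .* runs as far as it can, so the match ends at the LAST type=.
preg_match_all('/<input.*type=/i', $html, $greedy);

// Lazy: .*? stops at the first type= after each <input.
preg_match_all('/<input.*?type=/i', $html, $lazy);

echo 'greedy matches: ', count($greedy[0]), "\n";
echo 'lazy matches:   ', count($lazy[0]), "\n";
echo 'greedy text:    ', $greedy[0][0], "\n";
```

With the greedy version the single match swallows the first input tag whole, which is why adding a `"` after `type=` in a longer pattern can make it fail.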
You're not going to be able to capture the form and all inputs with your current expression because not every input is prefixed with the form markup. You need to approach it one of the following ways:
- Capture the entire form markup, `<form>...</form>`, and then use a second regex to match all the inputs in the capture.
- Adjust your current expression to be non-greedy, `.*?`, and allow for multiple captures of input markup.
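The first approach could look roughly like the sketch below. It assumes double-quoted attributes and a single form; the sample page in `$pageContents` and the `$inputs` result layout are invented for the example, so adjust them to the real markup.

```php
<?php
// Hypothetical sample page; real markup will vary.
$pageContents = <<<HTML
<html><body>
<form method="post" action="/browse.php">
  <input type="text" name="u">
  <input type="submit" name="go">
</form>
</body></html>
HTML;

$action = null;
$inputs = [];

// Step 1: capture the action attribute and the body of the first form.
// The s flag lets . cross newlines; .*? stops at the first </form>.
$formPattern = '/<form\b[^>]*\saction="([^"]*)"[^>]*>(.*?)<\/form>/is';
if (preg_match($formPattern, $pageContents, $form)) {
    $action = $form[1];

    // Step 2: collect every <input ...> tag inside the captured body.
    preg_match_all('/<input\b[^>]*>/i', $form[2], $tags);

    foreach ($tags[0] as $tag) {
        // Look up type and name separately so attribute order does not matter.
        preg_match('/\stype="([^"]*)"/i', $tag, $type);
        preg_match('/\sname="([^"]*)"/i', $tag, $name);
        $inputs[] = [
            'type' => $type[1] ?? null,
            'name' => $name[1] ?? null,
        ];
    }
}

echo $action, "\n";
print_r($inputs);
```

Matching each tag with `[^>]*` instead of `.*` keeps every match inside one tag, so greediness is no longer a problem there.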
Without seeing the target page that you want to extract from, there are only a few things to guess:
- The `type=` attribute might not have double quotes, as `type=text` is valid too. Or it might have single quotes instead, or some whitespace around the `=`.
- The `.*` placeholders might fail if there are newlines between or within the tags. Using the `/s` regex flag is advisable.
- And it's usually more reliable to use negated character classes like `[^<>]*` or `[^"]` instead of `.*` anyway.
- You don't need to escape the `\=` equal sign.
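Putting those points together, a more tolerant attribute match might look like the following sketch. `attrValue` is a hypothetical helper written for this example, not a PHP built-in, and the sample tags are invented. It accepts double quotes, single quotes, or no quotes, and allows whitespace (including newlines) around the `=`.

```php
<?php
// Hypothetical helper: returns the value of one attribute in a tag string,
// or null if the attribute is not present.
function attrValue(string $tag, string $attr): ?string
{
    // (?| ... ) is a PCRE branch-reset group: whichever alternative matches,
    // the value ends up in group 1.
    $pattern = '/\s' . preg_quote($attr, '/')
             . '\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'<>]+))/i';

    return preg_match($pattern, $tag, $m) ? $m[1] : null;
}

// Made-up tags covering the quoting styles mentioned above.
$tags = [
    '<input type="text" name="u">',
    "<input type='hidden' name='token'>",
    '<input type=submit name=go>',
    "<input\n  type = \"password\" name=pw>",
];

foreach ($tags as $tag) {
    echo attrValue($tag, 'type'), ' / ', attrValue($tag, 'name'), "\n";
}
```

The leading `\s` is there so that `name` does not accidentally match inside something like `data-name`; no `/s` flag is needed because the pattern uses no `.` at all.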
And maybe you should split it up. Use one regex to extract the `<form>..</form>` block. And then search for the `<input>` tags within.