
Regular expressions - finding and comparing the first instance of a word

I am currently trying to write a regular expression to pull links out of a page I have. The problem is that a link should be pulled out only if it is marked in a certain way, for example as being in stock. This is an outline of the markup I have:

<td class="prd-details">
   <a href="somepage">
   ...
   <span class="collect unavailable">
   ...
</td>

<td class="prd-details">
   <a href="somepage">
   ...
   <span class="collect available">
   ...
</td>

What I would like to do is pull out the links only if 'collect available' is in the tag. I have tried to do this with the regular expression:

(?s)prd-details[^=]+="([^"]+)" .+?collect{1}[^\s]+ available

However, when I run it, it finds the first 'prd-details' class and keeps going until it finds 'collect available', so it returns the wrong link. I thought that specifying {1} after the word collect would make it use only the first instance of the word it finds, but apparently I'm wrong. I've been trying different things, such as positive and negative lookaheads, but I can't seem to get anything to work.

Might anyone be able to help me with this issue?

Thanks,

Dan


You need an expression that knows "collect unavailable" is junk. You should be able to use a negative lookahead with your wildcard after the link capture. Something like:

prd-details[^=]+="([^"]+)"(.(?!collect un))+?collect available

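After the link, this consumes characters one at a time but refuses any character that is immediately followed by "collect un". The match therefore cannot run across a "collect unavailable" chunk to reach a later "collect available", so a cell marked unavailable produces no match at all.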

I tested in C# treating the text as a single line. You may need a slightly different syntax and options depending on your language and regex library.
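As a rough illustration of how the same pattern might look outside C#, here is a minimal sketch using Python's re module. The sample markup and the href values page-one and page-two are made up for the example, re.DOTALL stands in for "treating the text as a single line", and the inner group is written as non-capturing so that findall returns only the href.

```python
import re

# Hypothetical sample markup modelled on the question; the href values are invented.
html = '''
<td class="prd-details">
   <a href="page-one">
   <span class="collect unavailable">
</td>

<td class="prd-details">
   <a href="page-two">
   <span class="collect available">
</td>
'''

# re.DOTALL lets "." match newlines, so a cell can span several lines.
# (?:.(?!collect un))+? consumes characters that are not immediately
# followed by "collect un", so the match cannot cross an unavailable cell.
pattern = re.compile(
    r'prd-details[^=]+="([^"]+)"(?:.(?!collect un))+?collect available',
    re.DOTALL,
)

links = pattern.findall(html)
print(links)
```

With this sample, only the href from the cell marked "collect available" should be returned. Other regex libraries may need a different flag or inline option for the dot-matches-newline behaviour.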


If you insist on doing this with regex, I recommend a 2-step split-then-check approach:

  • First, split into each prd-details.
  • Then, within each prd-details, see if it contains collect available
    • If yes, then pull out the href

This is easier than trying to do everything in one step. Easier to read, write, and maintain.
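The split-then-check steps could be sketched in Python along these lines. The function name available_links, the sample markup, and the href values are assumptions for illustration, and the code assumes every cell of interest contains the literal text prd-details.

```python
import re

def available_links(html):
    """Return the href of each prd-details cell that is marked 'collect available'."""
    links = []
    # Step 1: split on the marker; drop whatever precedes the first cell.
    for chunk in html.split('prd-details')[1:]:
        # Step 2: check only this cell for the availability marker.
        if 'collect available' not in chunk:
            continue
        # Step 3: pull the first href out of the cell.
        match = re.search(r'href="([^"]+)"', chunk)
        if match:
            links.append(match.group(1))
    return links

# Hypothetical sample markup modelled on the question; the href values are invented.
sample = '''
<td class="prd-details">
   <a href="page-one">
   <span class="collect unavailable">
</td>

<td class="prd-details">
   <a href="page-two">
   <span class="collect available">
</td>
'''

result = available_links(sample)
print(result)
```

Because each chunk ends where the next prd-details begins, a check inside one cell cannot accidentally pick up the marker from a later cell, which is the problem the single-expression attempt ran into.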
