Finding all instances of a substring in a string

In my last question I asked about parsing the links out of an HTML page. Since I haven't found a solution yet, I thought I'd try something else in the meantime: search for every <a href= and copy whatever is there until I hit a </a>.

Now, my C is a bit rusty, but I do remember I can use strstr() to get the first instance of that string. How do I get the rest?

Any help is appreciated.

PS: No, this is not homework for school or anything like that. Just so you know.


You can use a loop:

char   *ptr = haystack;
size_t nlen = strlen (needle);

while (ptr != NULL) {
  ptr = strstr (ptr, needle);
  if (ptr != NULL) {
    // do whatever with ptr
    ptr += nlen;  // hat tip to @larsman
  }
}
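
For a version that compiles on its own, here is a minimal sketch of the same loop wrapped in a function. The name find_all and its parameters are made up for illustration and are not part of any library. It also guards against an empty needle: strstr() returns its first argument in that case, so the loop above would never advance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the answer above): records the byte offset
 * of each non-overlapping match of needle in haystack.
 * Stores at most max_hits offsets in hits and returns the total number of
 * matches found, which may be larger than max_hits. */
size_t find_all(const char *haystack, const char *needle,
                size_t *hits, size_t max_hits)
{
    assert(haystack != NULL && needle != NULL);

    size_t nlen  = strlen(needle);
    size_t count = 0;

    if (nlen == 0)      /* strstr() matches "" at every position */
        return 0;

    for (const char *p = strstr(haystack, needle);
         p != NULL;
         p = strstr(p + nlen, needle)) {   /* resume after this match */
        if (count < max_hits)
            hits[count] = (size_t)(p - haystack);
        count++;
    }
    return count;
}
```

Calling find_all(html, "<a href=", hits, 8) with a size_t hits[8] array should leave the start offset of each "<a href=" in hits and return how many were found.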


Why not use libxml which has a very good HTML parser built in?


Okay, the original answer and my comments seemed to require more information than fits comfortably in the comment section, so I decided to write a new answer.

First off, what you are attempting to do IS a programming task already, which WILL require some programming aptitude, depending on your exact needs.

Secondly, some of the answers suggest character-searching loops and regexps. Both are horribly error-prone ways to do this, as discussed, for example, here.

The normal way to parse HTML/XML nowadays is to use an external library designed for it. These libraries are by now more or less standard, and many programming languages ship one built in.

For your particular needs, I am rusty on both C and XPath, but it should work approximately like this:

  • Start up an XML/HTML parser.
  • Load your HTML document into it as a character string.
  • Tell the parser to find all instances of the tag you want (using XPath).
  • It will return a "set of nodes".
  • Process the set of nodes in a loop, doing whatever you need with each tag.

I found some other examples, maybe this one is better: http://xmlsoft.org/example.html

As you can see there, the example works on an XML document. HTML is not strictly a subset of XML (only well-formed XHTML is), but libxml2 also ships an HTML parser that tolerates real-world markup, so your HTML document should work too.

In Python or a similar language this would be extremely easy. In pseudocode it would look like this:

p=new HTMLParser
p->load(my html document)
resultset=p->XPath_Search("//a") # this will find all A elements in the HTML document
for each result of resultset:
   write(result.href)
end for

This would write out the HREF part of every A element in the document. A decent tutorial on what you can use XPath for is, e.g., here.

I am afraid in C this would be somewhat more convoluted, but the idea is the same and it IS a programming task.

If this is some quick-and-dirty work, you might use the suggested strstr() or regexp searches, with no external libraries. However, please keep in mind that, depending on your exact task, you are very likely to miss a number of outgoing links or misread their contents.


C strings are just pointers to the first character; to get the next match simply call it again and pass the pointer to the end of the previous match you got.
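
Where exactly you resume matters when matches can overlap. Here is a small sketch, with count_matches as a made-up helper name: resuming at the end of the previous match finds only non-overlapping hits, while resuming one character after its start finds overlapping ones as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts matches of needle in haystack.
 * allow_overlap == 0: resume at the end of the previous match.
 * allow_overlap != 0: resume one character after its start. */
size_t count_matches(const char *haystack, const char *needle,
                     int allow_overlap)
{
    assert(haystack != NULL && needle != NULL);

    size_t nlen = strlen(needle);
    if (nlen == 0)                  /* an empty needle would never advance */
        return 0;

    size_t step  = allow_overlap ? 1 : nlen;
    size_t count = 0;
    const char *p = haystack;

    while ((p = strstr(p, needle)) != NULL) {
        count++;
        p += step;                  /* stays inside the string: step <= nlen */
    }
    return count;
}
```

For example, searching "aaaa" for "aa" gives 2 matches in the first mode and 3 in the second. For a needle like "<a href=" the two modes agree, because that string cannot overlap with itself.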


Here is what I would do (not tested, just my idea):

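const char *hRef_start = "<a href=";
const char *hRef_end   = "</a>";

Assume your text is in

char text[1000];   /* filled with the HTML elsewhere */
char *first = strstr(text, hRef_start);
if (first)
{
    char *last = strstr(first, hRef_end);
    if (last)
    {
        /* copy from "<a href=" up to, but not including, "</a>" */
        size_t len = (size_t)(last - first);
        char *link = malloc(len + 1);   /* +1 for the terminating '\0' */
        if (link)
        {
            copy_link(link, first, last);
            /* use link here, then free(link) */
        }
    }
    else
    {
        /* Error here: no closing </a> found. */
    }
}

/* Copies the characters in [first, last) into link and null-terminates it.
   Must be declared before the code above calls it. */
void copy_link(char *link, const char *first, const char *last)
{
    while (first < last)
    {
        *link = *first;
        ++link;
        ++first;
    }
    *link = 0;
}

You should check whether malloc() succeeded, make sure you free() the buffer when you are done with it, and make sure in copy_link() that none of the arguments is NULL.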
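
Combining this idea with a loop gives every link rather than only the first. The following is only a sketch under the same assumptions as above: next_anchor is a hypothetical helper name, the match is case-sensitive, and it only sees anchors written exactly as <a href=, so <A HREF= or <a class="x" href= would be missed.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: finds the next "<a href=" at or after *cursor and
 * returns a malloc'd copy of the text from there up to, but not including,
 * the following "</a>". On success *cursor is moved past that "</a>".
 * Returns NULL if there is no further complete anchor or if malloc fails.
 * The caller must free() the returned string. */
char *next_anchor(const char **cursor)
{
    static const char open_tag[]  = "<a href=";
    static const char close_tag[] = "</a>";

    const char *first = strstr(*cursor, open_tag);
    if (first == NULL)
        return NULL;

    const char *last = strstr(first, close_tag);
    if (last == NULL)
        return NULL;                    /* unterminated anchor: give up */

    size_t len = (size_t)(last - first);
    char *copy = malloc(len + 1);       /* +1 for the terminating '\0' */
    if (copy == NULL)
        return NULL;

    memcpy(copy, first, len);
    copy[len] = '\0';

    *cursor = last + strlen(close_tag); /* resume after "</a>" */
    return copy;
}
```

A caller would set a const char *cur to the start of the text and then run while ((s = next_anchor(&cur)) != NULL), processing and freeing s on each pass.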
