Finding all instances of a substring in a string

2023-02-14 04:15

In my last question I asked about parsing the links out of an HTML page. Since I haven't found a solution yet, I thought I'd try something else in the meantime: search for every <a href= and copy whatever is there until I hit a </a>.

Now, my C is a bit rusty, but I do remember I can use strstr() to get the first instance of that string. How do I get the rest?

Any help is appreciated.

PS: No, this is not homework for school or anything like that. Just so you know.

You can use a loop:

char   *ptr = haystack;
size_t nlen = strlen (needle);

while (ptr != NULL) {
  ptr = strstr (ptr, needle);
  if (ptr != NULL) {
    // do whatever with ptr
    ptr += nlen;  // hat tip to @larsman
  }
}
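For completeness, here is the same loop wrapped in a function, as a minimal sketch. The name find_all and its parameters are made up for illustration. It records the offset of each match and returns the total count. Note that advancing by the needle length finds non-overlapping matches only, and that an empty needle has to be rejected up front, because strstr() matches an empty string at every position and the loop would never advance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores the offset of each non-overlapping match of
 * needle in haystack into offsets[] (at most max entries) and returns the
 * total number of matches found, which may exceed max.
 * Pass offsets == NULL to only count. */
size_t find_all(const char *haystack, const char *needle,
                size_t *offsets, size_t max)
{
    assert(haystack != NULL && needle != NULL);

    size_t nlen = strlen(needle);
    size_t count = 0;

    if (nlen == 0)
        return 0;   /* an empty needle would match forever without advancing */

    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + nlen, needle)) {
        if (offsets != NULL && count < max)
            offsets[count] = (size_t)(p - haystack);
        ++count;
    }
    return count;
}
```

If you do want overlapping matches (for example "aa" three times in "aaaa"), advance by 1 instead of nlen.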

Why not use libxml, which has a very good HTML parser built in?

Okay, the original answer and my comments seemed to require more information than fits comfortably in the comments section, so I decided to write a new answer.

First off, what you are attempting to do IS a programming task already, which WILL require some programming aptitude, depending on your exact needs.

Secondly, some of the answers suggest character-search loops and regexps. Both are horribly error-prone ways to do this, as discussed, for example, here.

The normal way to parse HTML/XML nowadays is to use an external library designed for it. In fact, these libraries are by now more or less standard, and many programming languages have one built in.

For your particular needs, I am rusty on both C and XPath, but it should work approximately like this:

1. Start up an XML/HTML parser.
2. Load your HTML document into it as a character string.
3. Tell the parser to find all instances of the <a> tag (using XPath).
4. It will return a "set of nodes" to you.
5. Process the set of nodes in a loop, doing whatever you need with each tag.

I found some other examples, maybe this one is better: http://xmlsoft.org/example.html

As you can see there, the example works on an XML document. Strictly speaking, HTML is not a subset of XML, and real-world HTML is often not well-formed, but since libxml also has an HTML parser built in, your HTML document should work too.

In Python or a similar language this would be extremely easy. In pseudocode it would look like this:

p=new HTMLParser
p->load(my html document)
resultset=p->XPath_Search("//a") # this will find all A elements in the HTML document
for each result of resultset:
   write(result.href)
end for

This would write out the HREF attribute of every A element in the document. A decent tutorial on what you can use XPath for is, e.g., here.

I am afraid in C this would be somewhat more convoluted, but the idea is the same and it IS a programming task.

If this is some quick-and-dirty work, you might use the suggested strstr() or regexp searches, with no external libraries. However, please keep in mind that, depending on your exact task, you are very likely to miss a number of outgoing links or misread their contents.
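To make the "missed links" point concrete, here is a small sketch. The helper name naive_has_link is made up for illustration; it does nothing more than the literal search for "<a href=" discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: the naive test. Returns true if the literal text
 * "<a href=" occurs anywhere in html. */
bool naive_has_link(const char *html)
{
    assert(html != NULL);
    return strstr(html, "<a href=") != NULL;
}
```

It finds <a href="x.html">, but it returns false for <A HREF="x.html"> (upper case), for <a class="nav" href="x.html"> (another attribute before href), and for an anchor where a newline or extra space separates a from href. All of these are links a browser would follow, and an HTML parser handles them for you.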

C strings are just pointers to the first character. To get the next match, simply call strstr() again and pass it the pointer to the end of the previous match.

Here is what I would do (not tested, just my idea):

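const char *hRef_start = "<a href=";
const char *hRef_end   = "</a>";

Assume your text is in

char text[1000];   /* must hold a NUL-terminated string */
char *first = strstr(text, hRef_start);
if (first)
{
    char *last = strstr(first, hRef_end);   /* points at the '<' of "</a>" */
    if (last)
    {
        /* room for the characters in [first, last) plus the terminating NUL */
        char *link = malloc((size_t)(last - first) + 1);
        if (link)
        {
            copy_link(link, first, last);
            /* ... use link here ... */
            free(link);
        }
    }
    else
    {
        /* Error here: no closing </a> was found. */
    }
}

/* Copies the characters in [first, last) to link and NUL-terminates it.
   Declare this before the code above that calls it. */
void copy_link(char *link, const char *first, const char *last)
{
    while (first < last)
    {
        *link = *first;
        ++link;    /* advance the destination as well as the source */
        ++first;
    }
    *link = 0;
}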

As in any such code, check that malloc() succeeded and free() the result when you are done with it. copy_link() also assumes that none of its arguments is NULL.
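Putting the two ideas from this thread together (the strstr() loop and the copy between the two markers), a rough sketch that collects every anchor rather than only the first might look like the following. The function name extract_anchors and its interface are made up for illustration, and it uses memcpy() in place of the hand-written copy_link(). Like the code above, it copies from "<a href=" up to, but not including, the closing "</a>".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies each span from "<a href=" up to (but not
 * including) the next "</a>" into a freshly malloc()ed string. Stores at
 * most max pointers in out[] and returns how many were stored; the caller
 * must free() each one. */
size_t extract_anchors(const char *text, char **out, size_t max)
{
    static const char open_tag[]  = "<a href=";
    static const char close_tag[] = "</a>";
    const size_t close_len = sizeof close_tag - 1;
    size_t count = 0;

    assert(text != NULL && out != NULL);

    const char *first = strstr(text, open_tag);
    while (first != NULL && count < max) {
        const char *last = strstr(first, close_tag);
        if (last == NULL)
            break;                      /* unterminated anchor: stop */

        size_t len = (size_t)(last - first);
        char *link = malloc(len + 1);   /* +1 for the terminating NUL */
        if (link == NULL)
            break;                      /* out of memory: return what we have */

        memcpy(link, first, len);
        link[len] = '\0';
        out[count++] = link;

        /* resume the search just past the closing tag */
        first = strstr(last + close_len, open_tag);
    }
    return count;
}
```

The same caveats apply as for any literal search: this only sees anchors written exactly as <a href=...>...</a>.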
