Find the nth occurrence of substring in a string

2022-12-13 15:30

This seems like it should be pretty trivial, but I am new at Python and want to do it the most Pythonic way.

I want to find the index corresponding to the n'th occurrence of a substring within a string.

There's got to be something equivalent to what I WANT to do, which is

mystring.find("substring", 2nd)

How can you achieve this in Python?

Here's a more Pythonic version of the straightforward iterative solution:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

Example:

>>> find_nth("foofoofoofoo", "foofoo", 2)
6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+1)
        n -= 1
    return start

Example:

>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3

This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:

Simple is better than complex.
Flat is better than nested.
Readability counts.

Mark's iterative approach would be the usual way, I think.

Here's an alternative with string-splitting, which can often be useful for finding-related processes:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)
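
Note that n is zero-based in this split version, unlike the one-based find_nth() above. A quick check, repeating the function so the snippet runs on its own (the sample string is made up for illustration):

```python
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

sample = "foo bar foo baz foo"
first = findnth(sample, "foo", 0)    # n=0 asks for the first occurrence
third = findnth(sample, "foo", 2)    # n=2 asks for the third occurrence
missing = findnth(sample, "foo", 3)  # there is no fourth occurrence
```

Here first is 0, third is 16 and missing is -1.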

And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:

'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')

This will find the second occurrence of substring in string.

def find_2nd(string, substring):
   return string.find(substring, string.find(substring) + 1)

Edit: I haven't thought much about the performance, but a quick recursion can help with finding the nth occurrence:

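def find_nth(string, substring, n):
    if n == 1:
        return string.find(substring)
    previous = find_nth(string, substring, n - 1)
    # Without this check, -1 + 1 would restart the search at index 0
    if previous == -1:
        return -1
    return string.find(substring, previous + 1)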

Understanding that regex is not always the best solution, I'd probably use one here:

>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence 
11

I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

In [1]: import _find_nth, find_nth, mmap

In [2]: f = open('bigfile', 'r')

In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:

In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let's benchmark findnth():

In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let's see how the approach based on str.find() does:

In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop

Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. Here comes in the second advantage of find_nth(): We can use it on mm directly, such that zero copies of the file are required:

In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth's total of 47 s.
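
For readers on Python 3, where xrange() is gone and mmap objects hold bytes, the zero-copy idea can be sketched roughly as follows. The temporary file and the name find_nth_bytes are made up for this illustration, and the needle has to be a bytes object. No timings are claimed for it.

```python
import mmap
import os
import tempfile

def find_nth_bytes(buf, needle, n=0, overlap=False):
    """Index of the nth (0-based) occurrence of needle in buf, or -1."""
    step = 1 if overlap else len(needle)
    i = -step
    for _ in range(n + 1):
        i = buf.find(needle, i + step)
        if i < 0:
            break
    return i

# Build a small stand-in for 'bigfile': ten lines of four bytes each.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc\n" * 10)
    path = tmp.name

try:
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mmap objects have their own find(), so no copy is made.
            third_newline = find_nth_bytes(mm, b"\n", 2)
            no_such_line = find_nth_bytes(mm, b"\n", 10)
finally:
    os.remove(path)
```

With this sample file, third_newline is 11 and no_such_line is -1.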

I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c:

#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
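        /* search only the bytes left after the previous match, so memchr() never reads past the buffer */
        b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));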
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}

Here is the setup.py file:

from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])

Install as usual with python setup.py install. The C code plays at an advantage here since it is limited to finding single characters, but let's see how fast this is:

In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop

In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop

In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop

In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop

Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): The additional "optimizations" in memchr() are apparently backfiring...

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.

Simplest way?

text = "This is a test from a test ok" 

firstTest = text.find('test')

print(text.find('test', firstTest + 1))

I'd probably do something like this, using the find function that takes an index parameter:

def find_nth(s, x, n):
    i, start = -1, 0  # the first search has to begin at index 0
    for _ in range(n):
        i = s.find(x, start)
        if i == -1:
            break
        start = i + len(x)
    return i

print(find_nth('bananabanana', 'an', 3))

It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:

def find_nth(s, x, n, i = 0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i 
    else:
        return find_nth(s, x, n - 1, i + len(x))

print(find_nth('bananabanana', 'an', 3))

It's a functional way to solve it, but I don't know if that makes it more Pythonic.

This will give you a list of the starting indices of the matches in yourstring:

import re
indices = [s.start() for s in re.finditer(':', yourstring)]

Then your nth entry would be:

n = 2
nth_entry = indices[n-1]

Of course you have to be careful with the index bounds. You can get the number of matches like this:

num_instances = len(indices)
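
One way to handle the bounds is sketched here. The helper name nth_match_start is hypothetical, and re.escape() is added so that the needle is treated as literal text rather than as a pattern:

```python
import re

def nth_match_start(yourstring, needle, n):
    """Start index of the nth (1-based) match, or -1 if there are fewer than n."""
    indices = [m.start() for m in re.finditer(re.escape(needle), yourstring)]
    if 1 <= n <= len(indices):
        return indices[n - 1]
    return -1

second_colon = nth_match_start("a:b:c", ":", 2)
missing_colon = nth_match_start("a:b:c", ":", 3)
```

Here second_colon is 3 and missing_colon is -1.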

For the special case where you search for the n'th occurrence of a character (i.e. substring of length 1), the following function works by building a list of all positions of occurrences of the given character:

def find_char_nth(string, char, n):
    """Find the n'th occurrence of a character within a string."""
    return [i for i, c in enumerate(string) if c == char][n-1]

If there are fewer than n occurrences of the given character, it will give IndexError: list index out of range.

This is derived from @Zv_oDD's answer and simplified for the case of a single character.

Here is another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary. Note that n is zero-based here, so n=2 finds the third occurrence.

import re
from itertools import dropwhile
needle='an'
haystack='bananabanana'
n=2
next(dropwhile(lambda x: x[0]<n, enumerate(re.finditer(needle,haystack))))[1].start()
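
The same early-stopping idea can also be written with itertools.islice, which some may find easier to read than dropwhile. This is only a sketch: the function name is invented here, n is zero-based as in the snippet above, and re.escape() is added on the assumption that the needle is plain text.

```python
import re
from itertools import islice

def nth_start_lazy(haystack, needle, n):
    """Start of the nth (0-based) non-overlapping match, or -1."""
    matches = re.finditer(re.escape(needle), haystack)
    # islice skips the first n matches and stops after the next one.
    match = next(islice(matches, n, n + 1), None)
    return match.start() if match is not None else -1

result = nth_start_lazy('bananabanana', 'an', 2)
```

For this haystack the matches start at 1, 3, 7 and 9, so result is 7.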

Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

import itertools
import re

def find_nth(haystack, needle, n = 1):
    """
    Find the starting index of the nth occurrence of ``needle`` in \
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``haystack`` is a ``str``
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``n`` an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if (hasattr(needle, 'finditer')):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1

Building on modle13's answer, but without the re module dependency.

def iter_find(haystack, needle):
    return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]

I kinda wish this was a builtin string method.

>>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
[5, 6, 24, 34, 42]
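
A lazy variant is possible with str.find(), which avoids slicing the haystack at every position. This is a sketch with an invented name, iter_find_lazy; like the list version it reports overlapping matches.

```python
def iter_find_lazy(haystack, needle):
    """Yield the start index of every (possibly overlapping) match."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)

slashes = list(iter_find_lazy("http://stackoverflow.com/questions/1883980/", "/"))
```

Combined with itertools.islice, next(islice(iter_find_lazy(s, sub), n - 1, None), -1) gives the index of the nth occurrence, or -1 if there are fewer than n.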

>>> s="abcdefabcdefababcdef"
>>> j=0
>>> for n,i in enumerate(s):
...   if s[n:n+2] =="ab":
...     print(n, i)
...     j=j+1
...     if j==2: print("2nd occurrence at index position: ", n)
...
0 a
6 a
2nd occurrence at index position:  6
12 a
14 a

Here is another "tricky" solution, which uses split and join.

In your example, we can use

len("substring".join([s for s in ori.split("substring")[:2]]))
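
Spelled out as a function, with a guard for the case where there are fewer than n occurrences (the bare expression would return the length of the whole string then). The function name and sample string below are illustrative only:

```python
def find_nth_split_join(s, sub, n):
    """Index of the nth (1-based) occurrence of sub in s, or -1."""
    parts = s.split(sub, n)  # at most n splits, so up to n + 1 parts
    if len(parts) <= n:
        return -1            # fewer than n occurrences
    # Everything before the nth occurrence, re-joined; its length is the index.
    return len(sub.join(parts[:n]))

ori = "a substring b substring c"
second = find_nth_split_join(ori, "substring", 2)
third = find_nth_split_join(ori, "substring", 3)
```

Here second is 14 and third is -1.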

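# return -1 if the nth occurrence of substr (0-indexed) does not exist, else return its index
def find_nth(s, substr, n):
    i = -1  # so that the first search starts at index 0
    while n >= 0:
        n -= 1
        i = s.find(substr, i + 1)
        if i == -1:
            break  # stop here, otherwise the next search would restart from index 0
    return i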

Solution without using loops and recursion.

Compile the required pattern, set the desired occurrence in the variable n, and the last statement prints the starting index of the nth occurrence of the pattern in the given string. The iterator returned by finditer is converted to a list, and the nth match is accessed directly at index n-1.

import re
n=2
sampleString="this is history"
pattern=re.compile("is")
matches=pattern.finditer(sampleString)
print(list(matches)[n-1].span()[0])

Here is my solution for finding the nth occurrence of b in string a:

from functools import reduce


def findNth(a, b, n):
    return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)

It is pure Python and iterative. If n is 0 or too large, it returns -1. It is a one-liner and can be used directly. Here is an example:

>>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
7

I used the findnth() function and ran into some issues, so I rewrote a faster version of it (no list splitting):

def findnth(haystack, needle, n):
    if not needle in haystack or haystack.count(needle) < n:
        return -1

    last_index = 0
    cumulative_last_index = 0
    for i in range(0, n):
        last_index = haystack[cumulative_last_index:].find(needle)
        cumulative_last_index += last_index
        
        # if not last element, then jump over it
        if i < n-1:
            cumulative_last_index += len(needle)

    return cumulative_last_index

The replace one-liner is great, but it only works because XXX and bar have the same length.

A good and general def would be:

def findN(s,sub,N,replaceString="XXX"):
    pos = s.replace(sub,replaceString,N-1).find(sub)
    # keep -1 for "not found"; otherwise undo the length change of the N-1 replacements
    return pos if pos == -1 else pos - (len(replaceString)-len(sub))*(N-1)

Def:

def get_first_N_words(mytext, mylen = 3):
    mylist = list(mytext.split())
    if len(mylist)>=mylen: return ' '.join(mylist[:mylen])

To use:

get_first_N_words('  One Two Three Four ' , 3)

Output:

'One Two Three'

This avoids a failure or incorrect output when the requested occurrence is higher than the actual number of occurrences. For example, the string 'overflow' contains only 2 occurrences of 'o', so asking for the 3rd occurrence makes the code below print this message:

Input Occurrence entered has exceeded the actual count of Occurrence.

def check_nth_occurrence(string, substr, n):

    # Count the occurrences of substr (this loop only handles a single-character substr)
    cnt = 0
    for ch in string:
        if ch == substr:
            cnt += 1

    # Check whether the requested occurrence exceeds the actual count
    if n > cnt:
        print('Input Occurrence entered has exceeded the actual count of Occurrence.')
        return

    # Get the index of the first occurrence of substr
    index = string.find(substr)

    # Step forward to the nth occurrence
    while index >= 0 and n > 1:
        index = string.find(substr, index + 1)
        n -= 1
    return index

Just in case anyone wants to find the n-th occurrence from the back:

def find_nth_reverse(haystack: str, needle: str, n: int) -> int:
    end = haystack.rfind(needle)

    while end >= 0 and n > 1:
        # the previous match has to end at or before the start of the current one
        end = haystack.rfind(needle, 0, end)
        n -= 1

    return end
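
If overlapping matches should count when searching from the back, the end bound passed to rfind() has to allow a match that starts before the previous one but runs into it. A sketch under that assumption (the name find_nth_reverse_overlapping is made up here):

```python
def find_nth_reverse_overlapping(haystack: str, needle: str, n: int) -> int:
    """Start of the nth occurrence counted from the end, overlaps allowed; -1 if absent."""
    end = haystack.rfind(needle)
    while end >= 0 and n > 1:
        # The next match must start before `end`, so it may extend
        # up to len(needle) - 1 characters past it.
        end = haystack.rfind(needle, 0, end + len(needle) - 1)
        n -= 1
    return end

second_from_back = find_nth_reverse_overlapping("foofoofoofoo", "foofoo", 2)
```

For this input the matches start at 0, 3 and 6, so second_from_back is 3.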

Here's a simple and fun way to do it:

def index_of_nth(text, substring, n) -> int:
    index = 0
    for _ in range(n):
        index = text.index(substring, index) + 1
    return index - 1

I solved it like this.

from typing import Optional

def second_index(text: str, symbol: str) -> Optional[int]:
    """
    returns the second index of a symbol in a given text
    """
    first = text.find(symbol)
    result = text.find(symbol, first + 1)
    if result > 0:
        return result
    return None

This is the answer you really want:

def Find(String, ToFind, Occurence=1):
    index = 0
    count = 0
    while index <= len(String) - len(ToFind):
        if String[index:index + len(ToFind)] == ToFind:
            count += 1
            if count == Occurence:
                return index
        index += 1
    # -1 rather than False, because False == 0 would be confused with a match at index 0
    return -1

A simple solution for those with basic programming knowledge:

# Function to find the nth occurrence of a single-character substring in a text
def findnth(text, substring, n):

    # variable to store current index in loop
    count = -1

    # n count
    occurrence = 0

    # loop through string
    for letter in text:

        # increment count
        count += 1

        # if current letter in loop matches substring target
        if letter == substring:

            # increment occurrence
            occurrence += 1

            # if this is the nth time the substring is found
            if occurrence == n:

                # return its index
                return count

    # otherwise indicate there is no match
    return "No match"

# example of how to call function
print(findnth('C$100$150xx', "$", 2))

How about:

import os
c = os.getcwd().split('\\')
print('\\'.join(c[0:-2]))
