开发者

Match longest substring in Python

Consider I have following string with a tab in between left & right part in a text file:

The dreams of REM (Geo) sleep         The sleep paralysis

I want to match the above string that match both left part & right part in each line of another following file:

The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep. 

If can not match with fill string, then try to match with substring.

I want to search with leftmost and rightmost pattern. eg.(leftmost cases)

The dreams of REM  sleep     paralysis
The dreams of REM  sleep     The 开发者_运维知识库sleep

eg.(Right most cases):

REM  sleep    The sleep paralysis
The dreams of   The sleep paralysis

Thanks a lot again for any kind of help.


(Ok, you clarified most of what you want. Let me restate, then clarify the points I listed below as remaining unclear... Also take the starter code I show you, adapt it, post us the result.)

You want to search, line-by-line, case-insensitive, for the longest contiguous matches to each of a pair of match-patterns. All the patterns seem to be disjoint (impossible to get a match on both patternX and patternY, since they use different phrases, e.g. can't match both 'frontal lobe' and 'prefrontal cortex').

Your patterns are supplied as a sequence of pairs ('dom','rang'), => let's just refer to them by their subscript [0] and [1, you can use string.split('\t') to get that.) The important thing is a matching line must match both the dom and rang patterns (fully or partially). Order is independent, so we can match rang then dom, or vice versa => use 2 separate regexes per line, and test d and r matched.

Patterns have optional parts, in parentheses => so just write/convert them to regex syntax using (optionaltext)? syntax already, e.g.: re.compile('Frontallobes of (leftside)? the brain', re.IGNORECASE)

The return value should be the string buffer with the longest substring match so far.

Now this is where several things remain to be clarified - please edit your question to explain the following:

  • If you find full matches to any pair of patterns, then return that.
  • If you can't find any full matches, then search for partial matches of both of the pair of patterns. Where 'partial match' means 'the most words' or 'the highest proportion(%) of words' from a pattern? Presumably we exclude spurious matches to words like 'the', in which case we lose nothing by simply omitting 'the' from your search patterns, then this guarantees that all partial matches to any pattern are significant.
  • We score the partial matches (somehow), e.g. 'contains most words from pattern X', or 'contains highest % of words from pattern X'. We should do this for all patterns, then return the pattern with the highest score. You'll need to think about this a little, is it better to match 2 words of a 5-word pattern (40%) e.g. 'dreams of', or 1 of 2 (50%) e.g. 'prefrontal BUT NOT cortex'? How do we break ties, etc? What happens if we match 'sleep' but nothing else?

Each of the above questions will affect the solution, so you need to answer them for us. There's no point in writing pages of code to solve the most general case when you only needed something simple. In general this is called 'NLP' (natural language processing). You might end up using an NLP library.

The general structure of the code so far is sounding like:

import re

# normally, read your input directly from file, but this allows us to test:
input = """The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
The optic tract is a part of the visual system in the brain.
The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain.
The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain, lying in front of the motor and premotor areas.
There are three possible ways to define the prefrontal cortex as the granular frontal cortex as that part of the frontal cortex whose electrical stimulation does not evoke movements.
This allowed the establishment of homologies despite the lack of a granular frontal cortex in nonprimates.
Modern  tracing studies have shown that projections of the mediodorsal nucleus of the thalamus are not restricted to the granular frontal cortex in primates.
""".split('\n')

patterns = [
    ('(dreams of REM (Geo)? sleep)', '(sleep paralysis)'),
    ('(frontal lobe)',            '(inferior frontal gyrus)'),
    ('(prefrontal cortex)',       '(frontallobes of (leftside )?(the )?brain)'),
    ('(modern tract)',            '(probably mediodorsal nucleus)') ]

# Compile the patterns as regexes
patterns = [ (re.compile(dstr),re.compile(rstr)) for (dstr,rstr) in patterns ]

def longest(t):
    """Get the longest from a tuple of strings."""
    l = list(t) # tuples can't be sorted (immutable), so convert to list...
    l.sort(key=len,reverse=True)
    return l[0]

def custommatch(line):
    for (d,r) in patterns:
        # If got full match to both (d,r), return it immediately...
        (dm,rm) = (d.findall(line), r.findall(line))
        # Slight design problem: we get tuples like: [('frontallobes of the brain', '', 'the ')]
        #... so return the longest match strings for each of dm,rm
        if dm and rm: # must match both dom & rang
            return [longest(dm), longest(rm)]
        # else score any partial matches to (d,r) - how exactly?
        # TBD...
    else:
        # We got here because we only have partial matches (or none)
        # TBD: return the 'highest-scoring' partial match
        return ('TBD... partial match')

for line in input:
    print custommatch(line)

and running on the 7 lines of input you supplied currently gives:

TBD... partial match
TBD... partial match
['frontal lobe', 'inferior frontal gyrus']
['prefrontal cortex', ('frontallobes of the brain', '', 'the ')]
TBD... partial match
TBD... partial match
TBD... partial match
TBD... partial match
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜