Python regexp find two keywords in a line

Posted: 2022-12-15 10:34

I'm having a hard time understanding this regex stuff...

I have a string like this:

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">

I want to use findall() and groups to get this:

['56242','saddelmageri']

I can match the number with something like "synset-[0-9]" and the word with something like "{(.*?)}" but how do I write it to get the above result?

And here's a follow-up question - some lines look like this:

<wn20schema:NounSynset rdf:about="&dn;synset-2589" rdfs:label="{cykel_3: trehjulet cykel; tricykel,1_1}">

In this case I want to extract the stuff between the {} with this result:

['2589', ['cykel', 'trehjulet cykel', 'tricykel']]

so that I can drop it in a dictionary later as a key(2589) : value(['cykel', 'trehjulet cykel', 'tricykel']) pair.

Any thoughts?

Please see the top answer to this question. It is generally a terrible idea to parse XML with regular expressions. XML parsers are built for this purpose.

The quickest way to do this would probably be Python's built-in minidom (xml.dom.minidom).

Since this appears to be XML data, you would be better off using an XML parser, since parsing XML with regular expressions is very, very difficult to do right.

However, since you specifically asked for a regular expression...

Your specifications are a bit imprecise, and with regular expressions you need to be very precise in what constitutes a match. For example, will the rdfs:label value always have a _1 that you want to strip off? Will there always only be one of these blocks of data per line, or multiple per line? Also, is the order of the result important?

Here's a quick hack that might give you close to what you want:

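import re

# Sample line from the question.
data = r'<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">'

# Group 1: the digits after "synset-"; group 2: the label text before "_1}".
matches = re.findall(r'synset-([0-9]+).*label="\{(.*)_1\}"', data)
print("matches:", matches)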

When I run the above, I get the following output: a list containing one two-tuple with the two strings you wanted (as a tuple inside a list rather than the flat list you asked for):

matches: [('56242', 'saddelmageri')]
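The pattern above hardcodes the `_1` suffix. As a rough sketch, assuming every label in the simple form ends in an underscore followed by digits, the suffix can be matched generically and the results dropped straight into a dictionary. The `lines` list and the `synsets` name are only for illustration:

```python
import re

# Illustrative input: the single sample line from the question.
lines = [
    '<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">',
]

# Group 1: digits after "synset-".
# Group 2: label text up to the trailing "_<digits>}" (assumed format).
pattern = re.compile(r'synset-(\d+).*?label="\{(.*?)_\d+\}"')

synsets = {}
for line in lines:
    match = pattern.search(line)
    if match:  # skip lines that do not look like a synset element
        synsets[match.group(1)] = match.group(2)

print(synsets)  # {'56242': 'saddelmageri'}
```

This still assumes one element per line with about before label, so it has the same fragility as any regex applied to XML.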

If you do a lot with this data, consider using a specialized RDF library (e.g. RDFLib). If not, an XML parser is definitely the way to go!

What if tomorrow it won't be on a single line?
What if tomorrow the label will come before the about?
There are at least a dozen more ways in which it can remain valid XML but break your regexp!

Anyway, I tried applying an XML parser, but I'm getting an "undefined entity error" for the &dn; there. Can you post the top of the file (doctype, namespace definitions, and the like)?
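For what it's worth, the error is easy to reproduce with the standard library, and it goes away once the entity is declared. The sketch below uses xml.etree.ElementTree. The value given to the dn entity and the wn20schema namespace URI are placeholders I made up, since the real ones would come from the top of your file; the rdf and rdfs URIs are the standard W3C ones.

```python
import xml.etree.ElementTree as ET

RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

# One element from the question, with namespace declarations added.
# "urn:example:wn20schema" is a placeholder, not the real namespace.
element = ('<wn20schema:NounSynset xmlns:wn20schema="urn:example:wn20schema" '
           'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
           'xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" '
           'rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}"/>')

# Without a declaration for &dn; the parser rejects the document.
try:
    ET.fromstring(element)
    error = None
except ET.ParseError as exc:
    error = str(exc)

# Declaring the entity in an internal DTD subset lets the parser expand it.
# "urn:example:dn#" is a placeholder value for the entity.
doctype = '<!DOCTYPE wn20schema:NounSynset [<!ENTITY dn "urn:example:dn#">]>'
root = ET.fromstring(doctype + element)
about = root.get('{%s}about' % RDF)

print(error)
print(about)
```

If your file already has a DOCTYPE with the entity declarations and you are parsing the whole file rather than a single line, the parser should pick them up by itself.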

You're doing two different kinds of parsing here, and you'll need to use two different tools.

First, you're parsing XML. For that, you're going to need to use an XML parser, not regular expressions. That's because the following elements are all functionally identical XML:

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">
</wn20schema:NounSynset>

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}"/>

<wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="&dn;synset-56242"/>

and conceivably even:

<NounSynset xmlns="my_wn20schema_namespace_urn" C:label='not_of_interest' A:label='{saddelmageri_1}' B:about='&dn;synset-56242'/>

To parse that element, you need to know the names of the namespaces that the element and the attributes you're interested in belong to, and then use an XML parser to find them - specifically, an XML parser that properly supports XML namespaces and XPath, like lxml.

You'll end up with something like this to find the attributes you're looking for (assuming that doc is the parsed XML document, and that variables ending in _urn are strings containing the various namespace URNs):

def find_attributes(doc):
    # lxml addresses namespaced attributes in Clark notation: '{namespace}localname'.
    for elm in doc.xpath('//x:NounSynset', namespaces={'x': wn20schema_namespace_urn}):
        yield (elm.get('{%s}about' % rdf_namespace_urn),
               elm.get('{%s}label' % rdfs_namespace_urn))
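If lxml isn't available, roughly the same lookup can be sketched with the standard library's xml.etree.ElementTree, which also addresses namespaced names as '{namespace}localname'. This is a swapped-in alternative to the lxml/XPath approach, not the same code. The wrapping rdf:RDF element, the wn20schema URI and the value of the dn entity are placeholders I invented so the snippet is self-contained. It feeds in the three equivalent spellings shown earlier and gets the same pair back for each:

```python
import xml.etree.ElementTree as ET

RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RDFS = 'http://www.w3.org/2000/01/rdf-schema#'
WN = 'urn:example:wn20schema'  # placeholder namespace URI

# Three spellings of the same element: open/close pair, self-closing,
# and self-closing with the attributes in the opposite order.
document = '''<!DOCTYPE rdf:RDF [<!ENTITY dn "urn:example:dn#">]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:wn20schema="urn:example:wn20schema">
  <wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">
  </wn20schema:NounSynset>
  <wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}"/>
  <wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="&dn;synset-56242"/>
</rdf:RDF>'''


def find_attributes(root):
    """Yield (about, label) for every NounSynset element under root."""
    for elm in root.iter('{%s}NounSynset' % WN):
        yield elm.get('{%s}about' % RDF), elm.get('{%s}label' % RDFS)


pairs = list(find_attributes(ET.fromstring(document)))
for pair in pairs:
    print(pair)
```

All three elements produce the same (about, label) pair, which is exactly what a line-based regular expression cannot guarantee.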

Now you can look at the second part of the problem, which is parsing the values you need out of the attribute values you have. For that, you would use regular expressions. To parse the about attribute, this might work:

re.match(r'[^\d]*(\d*)', about).groups()[0]

which returns the first series of digit characters found. And to parse the label attribute, you might use:

re.match(r'{([^_]*)', label).groups()[0]

which returns all characters in label after a leading left brace and up to but not including the first underscore. (As far as parsing the second form of label that you posted, you haven't posted enough information for me to guess what a regular expression to parse that would look like.)
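That said, here is one guess at handling both label forms, purely as a sketch. It assumes that the entries inside the braces are separated by ":" or ";" and that an entry may end in a suffix like "_3" or ",1_1" that should be dropped. Those rules are inferred from the two examples in the question, not from any specification, and the function names are made up:

```python
import re


def parse_about(about):
    """Return the first run of digits in the about value, or None."""
    match = re.search(r'\d+', about)
    return match.group(0) if match else None


def parse_label(label):
    """Return the words between the braces, without their numeric suffixes."""
    match = re.search(r'\{(.*?)\}', label)
    if match is None:
        return []
    words = []
    # Assumed separators between entries: ':' and ';'.
    for part in re.split(r'[:;]', match.group(1)):
        # Assumed suffix: optional ',<digits>' followed by '_<digits>'.
        word = re.sub(r'(,\d+)?_\d+$', '', part.strip())
        if word:
            words.append(word)
    return words


entry = [parse_about('&dn;synset-2589'),
         parse_label('{cykel_3: trehjulet cykel; tricykel,1_1}')]
print(entry)  # ['2589', ['cykel', 'trehjulet cykel', 'tricykel']]
```

The same parse_label call on '{saddelmageri_1}' gives a one-element list, so both forms can go into the dictionary as key : list-of-words pairs.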
