How can I access namespaced XML elements using BeautifulSoup?

2023-01-03 14:58

I have an XML document which reads like this:

<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>

My question is: how do I access these elements using a library like BeautifulSoup in Python?

Something like xmlDom.web["Web"].Total does not work.

BeautifulSoup isn't a DOM library per se (it doesn't implement the DOM APIs). To make matters more complicated, you're using namespaces in that xml fragment. To parse that specific piece of XML, you'd use BeautifulSoup as follows:

from bs4 import BeautifulSoup

xml = """<xml>
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""

# html.parser keeps the prefix as part of a lowercased tag name
doc = BeautifulSoup(xml, 'html.parser')
print(doc.find('web:total').string)
print(doc.find('web:offset').string)

If you weren't using namespaces, the code could look like this:

from bs4 import BeautifulSoup

xml = """<xml>
  <Web>
    <Total>4000</Total>
    <Offset>0</Offset>
  </Web>
</xml>"""

# Without prefixes, each lowercased tag name is a valid attribute name
doc = BeautifulSoup(xml, 'html.parser')
print(doc.xml.web.total.string)
print(doc.xml.web.offset.string)

The key here is that BeautifulSoup doesn't know (or care) anything about namespaces. Thus web:Web is treated as a web:web tag rather than as a Web tag belonging to the web namespace. While BeautifulSoup adds web:web to the xml element dictionary, Python syntax doesn't recognize web:web as a single identifier, so attribute-style access is not possible for prefixed names.
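
A minimal sketch of the point above, assuming BeautifulSoup 4 (the bs4 package) with Python's built-in html.parser, which likewise keeps the prefix as part of a lowercased tag name. The variable names are only illustrative:

```python
from bs4 import BeautifulSoup

xml = """<xml>
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""

doc = BeautifulSoup(xml, "html.parser")

# The whole string "web:web" is the tag name; there is no separate prefix.
web = doc.find("web:web")
print(web.name)

# doc.xml.web:web is not valid Python, so children are looked up by string.
total = web.find("web:total").string
offset = web.find("web:offset").string
print(total, offset)
```

Passing the name as a string to find() sidesteps the identifier problem, at the cost of treating the prefix as plain text rather than as a namespace.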

You can learn more about it by reading the documentation.

This is an old question, but it is worth noting that BeautifulSoup 4 handles namespaces well if you pass 'xml' as the second argument to the constructor:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>""", 'xml')

print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<xml>
 <Web>
  <Total>
   4000
  </Total>
  <Offset>
   0
  </Offset>
 </Web>
</xml>
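
Note that the prefixes are gone from the output above. The fragment never declares the web prefix, and the 'xml' parser appears to drop it in that case. A minimal sketch of what that means for lookups, assuming bs4 with lxml installed (the 'xml' feature relies on lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("""<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>""", "xml")

# The undeclared prefix is dropped, so the local name is what matches.
total = soup.find("Total").string
print(total)

# Searching with the prefix finds nothing in this undeclared-prefix case.
prefixed = soup.find("web:Total")
print(prefixed)
```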

Environment

import bs4
bs4.__version__
---
'4.10.0'

import sys
print(sys.version)
---
3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0]

BS4/XML Parser on XML with namespace definition

from bs4 import BeautifulSoup

xbrl_with_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl
    xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31"
>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""

soup = BeautifulSoup(xbrl_with_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant.prettify())
---
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>

BS4/XML Parser on XML without namespace definition

xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""

soup = BeautifulSoup(xbrl_without_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None

BS4/HTML Parser on XML without namespace definition

The BS4 HTML parser treats <namespace>:<tag> as a single tag name, and it also converts the name to lowercase.

soup = BeautifulSoup(xbrl_without_namespace, 'html.parser')
registrant = soup.find("dei:EntityRegistrantName".lower()) 

print(registrant)
---
<dei:entityregistrantname>
Hoge, Inc.
</dei:entityregistrantname>

A search that uses capital letters does not match, because the tag names have been converted to lowercase.

registrant = soup.find("dei:EntityRegistrantName") 
print(registrant)
---
None

Conclusion

Provide the namespace definitions in order to use namespaces with the XML parser, OR
Use the HTML parser and search with all-lowercase tag names.

You should explicitly declare your namespace on the root element, using the xmlns:prefix="URI" syntax (see examples here), and then access your tags via prefix:tag from BeautifulSoup. Keep in mind that you should also explicitly specify how BeautifulSoup should process your document, in this case:

xml = BeautifulSoup(xml_content, 'xml')
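
Put together, a minimal sketch of that advice might look as follows. The namespace URI http://example.com/web and the sample data are placeholders chosen for illustration, and the 'xml' parser assumes lxml is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical document: the URI is a placeholder, not a real schema.
xml_content = """<xml xmlns:web="http://example.com/web">
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""

soup = BeautifulSoup(xml_content, "xml")

# With the prefix declared on the root, prefix:tag lookups succeed.
total = soup.find("web:Total")
print(total.prefix, total.namespace, total.string)
```

The prefix and namespace attributes on the returned tag show which namespace the element was resolved to.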

For the examples below I’m assuming you:

have your namespaces declared at the top of your XML file: xmlns:ns_name="http://example.com"
have your XML parsed as xml: BeautifulSoup(data, 'xml')

Extracting known tags in a namespace

If <ns_name:tag_name> is known, the find() and find_all() methods will work just fine - as mentioned in this thread already.

# extract the first element with tag name
xml_soup.find('web:Web')

# extract all elements with tag name
xml_soup.find_all('web:Web')

Searching within a namespace with CSS selectors

BS4 also allows you to search within namespaces using CSS selectors by using a prefix: your namespace, a pipe symbol | and finally your CSS selector. Template: ns_name|css_selector.

# select all elements in the namespace 'web'
xml_soup.select('web|*')

# selecting specific elements within the namespace 'web'
xml_soup.select('web|Web > Total')
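
As a runnable sketch of the selectors above: the sample data and the URI are placeholders, and the prefix mapping is passed to select() explicitly through its namespaces argument. Recent bs4 versions may pick up declared prefixes on their own, but this sketch does not rely on that.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the URI is a placeholder for illustration.
data = """<xml xmlns:web="http://example.com/web">
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""

xml_soup = BeautifulSoup(data, "xml")
ns = {"web": "http://example.com/web"}

# Every element in the 'web' namespace: Web, Total and Offset.
all_web = xml_soup.select("web|*", namespaces=ns)
print(len(all_web))

# Total elements that are direct children of web:Web.
totals = xml_soup.select("web|Web > Total", namespaces=ns)
values = [tag.string for tag in totals]
print(values)
```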

More complex searches within a namespace

For anything more complex, you’ll want to write a custom boolean function:

import re

def ns_and_regex_match(tag) -> bool:
  # keep only tags in the 'web' namespace whose local name starts with 'Off'
  if tag.prefix != 'web':
    return False
  return bool(re.search('^Off.*$', tag.name))

xml_soup.find_all(ns_and_regex_match)
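
A variation on the function above, sketched under assumptions: it matches on the namespace URI (tag.namespace) rather than on the prefix, since a prefix is only a document-local alias for the URI. WEB_NS, the function name and the sample data are placeholders for illustration.

```python
import re
from bs4 import BeautifulSoup

WEB_NS = "http://example.com/web"  # placeholder URI, not a real schema

data = """<xml xmlns:web="http://example.com/web">
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""

xml_soup = BeautifulSoup(data, "xml")

def in_web_ns_and_starts_with_off(tag) -> bool:
    # Reject tags outside the namespace, whatever prefix they carry.
    if tag.namespace != WEB_NS:
        return False
    # tag.name is assumed to hold the local name, as in the function above.
    return bool(re.search(r"^Off", tag.name))

matches = xml_soup.find_all(in_web_ns_and_starts_with_off)
values = [tag.string for tag in matches]
print(values)
```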

