How to track the source line (location) of an XML element?
I assume that there is probably no satisfactory answer to this question, but I ask it anyway in case I missed something.
Basically, I want to find out the line in the source document from which a certain XML element originated, given the element instance. I want this only for better diagnostic error messages - the XML is part of a configuration file, and if there is something wrong with it, I want to be able to point the reader of the error message to exactly the right place in the XML document so he can correct the error.
I understand that the standard Scala XML support probably has no built-in feature like this. After all, it would be wasteful to annotate every single NodeSeq instance with such information, and not every XML element even has a source document from which it has been parsed. It seems to me that the standard Scala XML parser throws the line information away, and later on there is no way to retrieve it.
But switching to another XML framework is not an option. Adding another library dependency "only" for the sake of better diagnostic error messages seems inappropriate to me. Also, despite some shortcomings, I really like the built-in pattern matching support for XML.
My only hope is that you can show me a way to alter or subclass the standard Scala XML parser such that the nodes it produces will be annotated with the number of the source line. Maybe a special subclass of NodeSeq can be created for this. Or maybe only Atom can be subclassed because NodeSeq is too dynamic? I don't know.
Anyway, my hopes are close to zero. I don't think there is a place in the parser where we can hook in to change the way nodes are created, and that at that place the line information is available. Still, I wonder why I have not found this question before. Please point me to the original if this is a duplicate.
I had no idea how to do that, but Pangea showed me the way. First, let's create a trait to handle location:
import org.xml.sax.{helpers, Locator, SAXParseException}

trait WithLocation extends helpers.DefaultHandler {
  var locator: org.xml.sax.Locator = _

  def printLocation(msg: String): Unit = {
    println("%s at line %d, column %d" format (msg, locator.getLineNumber, locator.getColumnNumber))
  }

  // Get location
  abstract override def setDocumentLocator(locator: Locator): Unit = {
    this.locator = locator
    super.setDocumentLocator(locator)
  }

  // Display location messages
  abstract override def warning(e: SAXParseException): Unit = {
    printLocation("warning")
    super.warning(e)
  }

  abstract override def error(e: SAXParseException): Unit = {
    printLocation("error")
    super.error(e)
  }

  abstract override def fatalError(e: SAXParseException): Unit = {
    printLocation("fatal error")
    super.fatalError(e)
  }
}
Next, let's create our own loader, overriding XMLLoader's adapter to include our trait:
import scala.xml.{factory, parsing, Elem}

object MyLoader extends factory.XMLLoader[Elem] {
  override def adapter = new parsing.NoBindingFactoryAdapter with WithLocation
}
And that's all there is to it! The object XML adds little to XMLLoader -- basically, the save methods. You might want to look at its source code if you feel the need for a full replacement. But this is only if you want to handle all of this yourself, since Scala already has a trait to produce errors:
object MyLoader extends factory.XMLLoader[Elem] {
  override def adapter = new parsing.NoBindingFactoryAdapter with parsing.ConsoleErrorHandler
}
The ConsoleErrorHandler trait extracts its line and column information from the exception, by the way. For our purposes, we need the location outside exceptions too (I'm assuming).
Now, to modify node creation itself, look at the abstract methods of scala.xml.parsing.FactoryAdapter. I have settled on createNode, but I'm overriding at the NoBindingFactoryAdapter level, because that returns Elem instead of Node, which enables me to add attributes. So:
import org.xml.sax.Locator
import scala.xml._
import parsing.NoBindingFactoryAdapter

trait WithLocation extends NoBindingFactoryAdapter {
  var locator: org.xml.sax.Locator = _

  // Get location
  abstract override def setDocumentLocator(locator: Locator): Unit = {
    this.locator = locator
    super.setDocumentLocator(locator)
  }

  // Annotate every created element with the parser's current position
  abstract override def createNode(pre: String, label: String, attrs: MetaData, scope: NamespaceBinding, children: List[Node]): Elem = (
    super.createNode(pre, label, attrs, scope, children)
    % Attribute("line", Text(locator.getLineNumber.toString), Null)
    % Attribute("column", Text(locator.getColumnNumber.toString), Null)
  )
}

object MyLoader extends factory.XMLLoader[Elem] {
  // Keeping ConsoleErrorHandler for good measure
  override def adapter = new parsing.NoBindingFactoryAdapter with parsing.ConsoleErrorHandler with WithLocation
}
Result:
scala> MyLoader.loadString("<a><b/></a>")
res4: scala.xml.Elem = <a line="1" column="12"><b line="1" column="8"></b></a>
Note that it got the last location, the one at the closing tag. That's one thing that can be improved by overriding startElement to keep track of where each element started in a stack, and endElement to pop from this stack into a var used by createNode.
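The stack idea described above could be sketched roughly as follows. This is a minimal, untested sketch: the names WithStartLocation, StartLocationLoader, docLocator, openElements and current are made up for illustration, and it assumes that the adapter's endElement is what ends up calling createNode and that the parser always supplies a Locator. Also note that a SAX locator reports the position just after the event's text, so the recorded "start" is really the end of the opening tag.

```scala
import org.xml.sax.{Attributes, Locator}
import scala.xml.{Attribute, Elem, MetaData, NamespaceBinding, Node, Null, Text}
import scala.xml.factory.XMLLoader
import scala.xml.parsing.NoBindingFactoryAdapter

// Hypothetical trait: remembers where each element's opening tag was reported.
trait WithStartLocation extends NoBindingFactoryAdapter {
  private var docLocator: Locator = _
  // Stack of (line, column) pairs, one per currently open element.
  private var openElements: List[(Int, Int)] = Nil
  // Position of the element that createNode is about to build.
  private var current: (Int, Int) = (-1, -1)

  abstract override def setDocumentLocator(l: Locator): Unit = {
    docLocator = l
    super.setDocumentLocator(l)
  }

  abstract override def startElement(uri: String, localName: String,
                                     qname: String, attributes: Attributes): Unit = {
    // Push the position reported for the opening tag.
    openElements = (docLocator.getLineNumber, docLocator.getColumnNumber) :: openElements
    super.startElement(uri, localName, qname, attributes)
  }

  abstract override def endElement(uri: String, localName: String, qname: String): Unit = {
    // Pop the matching opening-tag position before the node gets created.
    current = openElements.head
    openElements = openElements.tail
    super.endElement(uri, localName, qname)
  }

  abstract override def createNode(pre: String, label: String, attrs: MetaData,
                                   scope: NamespaceBinding, children: List[Node]): Elem =
    super.createNode(pre, label, attrs, scope, children) %
      Attribute("line", Text(current._1.toString), Null) %
      Attribute("column", Text(current._2.toString), Null)
}

object StartLocationLoader extends XMLLoader[Elem] {
  override def adapter = new NoBindingFactoryAdapter with WithStartLocation
}
```

With a document spread over several lines, such as StartLocationLoader.loadString("<a>\n  <b/>\n</a>"), the line attribute of each element should then refer to the line of its opening tag rather than the line of its closing tag.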
Nice question. I learned a lot! :-)
I see that Scala internally uses SAX for parsing. SAX allows you to set a Locator on the ContentHandler, which can be used to retrieve the current location where the error occurred. I am not sure how you can tap into the internal workings of Scala though. Here is one article I found that might be of some help to see if this is doable.
I don't know anything about Scala, but the same issue pops up in other environments. For example, an XML transformation sends its results down a SAX pipeline to a validator, and when the validator tries to find line numbers for its validation errors, they're gone. Or the XML in question was never serialized or parsed, and therefore never had line numbers.
One way to address the problem is by generating (human-readable) XPath expressions to say where the error occurred. These are not as easy to use as line numbers but they're a lot better than nothing: they uniquely identify a node, and they're often pretty easy for humans to interpret (especially if they have an XML editor).
For example, this XSLT template by Ken Holman (I think) used by Schematron generates an XPath expression to describe the location/identity of the context node:
<xsl:template match="node() | @*" mode="schematron-get-full-path-2">
  <!--report the element hierarchy-->
  <xsl:for-each select="ancestor-or-self::*">
    <xsl:text>/</xsl:text>
    <xsl:value-of select="name(.)"/>
    <xsl:if test="preceding-sibling::*[name(.)=name(current())]">
      <xsl:text>[</xsl:text>
      <xsl:value-of
        select="count(preceding-sibling::*[name(.)=name(current())])+1"/>
      <xsl:text>]</xsl:text>
    </xsl:if>
  </xsl:for-each>
  <!--report the attribute-->
  <xsl:if test="not(self::*)">
    <xsl:text/>/@<xsl:value-of select="name(.)"/>
  </xsl:if>
</xsl:template>
I don't know if you can use XSLT in your scenario, but you could apply the same principle with whatever tools you have available.
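For instance, the same principle might look like this on a scala.xml tree. It is only a sketch: XPathLocation and pathTo are hypothetical names, not part of the library. Because scala.xml nodes have no parent pointers, it searches downward from the root and identifies the target by reference. Like the template above, it only adds an index when an element has preceding siblings of the same name.

```scala
import scala.xml.Elem

object XPathLocation {
  // Qualified name of an element, including its prefix if it has one.
  private def qname(e: Elem): String =
    if (e.prefix == null) e.label else s"${e.prefix}:${e.label}"

  // Returns an XPath-like path from `root` to `target`, or None if `target`
  // is not part of the tree. `target` must be the same instance (reference
  // equality) as a node inside `root`.
  def pathTo(root: Elem, target: Elem): Option[String] = {
    def search(e: Elem, path: String): Option[String] =
      if (e eq target) Some(path)
      else {
        val children = e.child.collect { case c: Elem => c }
        children.zipWithIndex.iterator.map { case (c, i) =>
          // Count earlier siblings with the same name, as the XSLT does.
          val preceding = children.take(i).count(s => qname(s) == qname(c))
          val index = if (preceding > 0) s"[${preceding + 1}]" else ""
          search(c, s"$path/${qname(c)}$index")
        }.collectFirst { case Some(p) => p }
      }
    search(root, "/" + qname(root))
  }
}
```

An error message can then quote the path of the offending element, for example /config/server[2]/port, even when the tree was never parsed from a file.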
Although you indicated that you would not want to use a different library or framework, it is worth noting that all good Java streaming parsers (Xerces for SAX, Woodstox and Aalto for StAX) do make location information available for all events/tokens they serve.
Although this information is not always retained by higher-level abstractions like DOM trees (because of the additional storage needed; performance isn't a big concern, since location information is always tracked for error reporting anyway), this may be easy or at least possible to fix.
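As a rough illustration of that point, here is a sketch that uses only the StAX API bundled with the JDK, called from Scala. StaxLocations and elementLines are made-up names. Whether the reported position is the beginning or the end of the start tag depends on the parser implementation, so the sketch only relies on the line number.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxLocations {
  // Lists (element name, line number) for every start tag, in document order.
  def elementLines(xml: String): List[(String, Int)] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    val found = List.newBuilder[(String, Int)]
    try {
      while (reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
          // The location is only valid until the next call to next().
          found += (reader.getLocalName -> reader.getLocation.getLineNumber)
        }
      }
    } finally reader.close()
    found.result()
  }
}
```

The same pairs could be attached to whatever tree is being built while the events are consumed.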