An error has occurred opening extern DTD (w3.org, xhtml1-transitional.dtd). 503 Server Unavailable
I'm trying to do XPath queries over an XHTML document, using .NET 3.5.
The document looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
....
</head>
<body>
...
</body>
</html>
Because the document includes various character entities (&nbsp; and so on), I need to use the DTD in order to load it with an XmlReader. So my code looks like this:
var s = File.OpenRead(fileToRead);
var reader = XmlReader.Create(s, new XmlReaderSettings{ ProhibitDtd=false });
But when I run this, it returns
An error has occurred while opening external DTD 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd': The remote server returned an error: (503) Server Unavailable.
Now, I know why I am getting the 503 error. W3C explained it very clearly.
I've seen "workarounds" where people just disable the DTD. This is what ProhibitDtd=true can do, and it eliminates the 503 error.
But in my case that leads to other problems: the app doesn't get the entity definitions, so the document isn't well-formed XML. How can I validate with the DTD, and get the entity definitions, without hitting the w3.org website?
I think .NET 4.0 has a nifty built-in capability to handle this situation: the XmlPreloadedResolver. But I need a solution for .NET 3.5.
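For comparison, here is a rough sketch of what the .NET 4.0 route might look like. It is not usable on 3.5: XmlPreloadedResolver, XmlKnownDtds and DtdProcessing only exist from .NET 4.0 onward. The class and method names (PreloadedResolverDemo, ReadText) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Resolvers;   // .NET 4.0 and later only

static class PreloadedResolverDemo
{
    // Reads an XHTML 1.0 string and returns the concatenated text nodes.
    // The XHTML 1.0 DTDs and entity files come from copies built into the
    // framework, so nothing is fetched from w3.org.
    public static string ReadText(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // .NET 4.0 replacement for ProhibitDtd = false
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };

        var text = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                // Entities such as &nbsp; arrive already expanded.
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }
}
```

With no fallback resolver supplied, anything outside the preloaded XHTML 1.0 set should fail to resolve rather than go out to the network.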
related:
- java.io.IOException: Server returned HTTP response code: 503
The answer is, I have to provide my own XmlResolver. I don't think this is built-in to .NET 3.5. That's baffling. It's also baffling that it has taken me this long to stumble onto this problem. It's also baffling that I couldn't find someone else who solved this problem already.
Ok, so.. the XmlResolver. I created a new class, derived from XmlResolver, and overrode three key things: Credentials (set), ResolveUri and GetEntity.
public sealed class XhtmlResolver : XmlResolver
{
    public override System.Net.ICredentials Credentials
    {
        set { throw new NotSupportedException(); }
    }
    public override object GetEntity(Uri absoluteUri, string role, Type t)
    {
        ...
    }
    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        ...
    }
}
The documentation on this stuff is pretty skimpy, so I'll tell you what I learned. The class operates like so: the XmlReader calls ResolveUri first and then, given a resolved Uri, calls GetEntity. That method is expected to return an object of Type t (passed as a param). I have only seen it request a System.IO.Stream.
My idea is to embed local copies of the DTD and its dependencies for XHTML 1.0 into the assembly, using the csc.exe /resource option, and then retrieve the stream for that resource.
private System.IO.Stream GetStreamForNamedResource(string resourceName)
{
    Assembly a = Assembly.GetExecutingAssembly();
    return a.GetManifestResourceStream(resourceName);
}
Pretty simple. This gets called from GetEntity().
But I can improve on that. Instead of embedding the DTDs in plaintext, I gzipped them first. Then I modified the above method like so:
private System.IO.Stream GetStreamForNamedResource(string resourceName)
{
    Assembly a = Assembly.GetExecutingAssembly();
    return new System.IO.Compression.GZipStream(a.GetManifestResourceStream(resourceName), System.IO.Compression.CompressionMode.Decompress);
}
That code opens the stream for an embedded resource, and returns a GZipStream configured for decompression. The reader gets the plaintext DTD.
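As a rough illustration of the gzip step, the sketch below compresses some DTD text in memory and reads it back through a decompressing GZipStream, which is the same thing the reader sees when it consumes the embedded resource. The names GzipRoundTrip, Compress and DecompressToString are hypothetical and not part of the library; a MemoryStream stands in for the embedded resource stream.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipRoundTrip
{
    // Gzip a byte array, as you would before embedding a DTD as a resource.
    public static byte[] Compress(byte[] plain)
    {
        using (var buffer = new MemoryStream())
        {
            // Disposing the GZipStream flushes the final compressed block.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(plain, 0, plain.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return buffer.ToArray();
        }
    }

    // Wrap a compressed stream for decompression and read it as UTF-8 text.
    public static string DecompressToString(Stream compressed)
    {
        using (var gzip = new GZipStream(compressed, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For example, passing new MemoryStream(GzipRoundTrip.Compress(bytes)) to DecompressToString should give back the original text.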
What I wanted to do is resolve only URIs for DTDs from XHTML 1.0. So I wrote ResolveUri and GetEntity to look for those specific DTDs, and respond affirmatively only for them.
For an XHTML document with the DTD statement, the flow is like this:
- XmlReader calls ResolveUri with the public URI for the XHTML DTD, which is "-//W3C//DTD XHTML 1.0 Transitional//EN". If the XmlResolver can resolve, it should return a valid URI. If it cannot resolve, it should throw. My implementation just throws for the public URI.
- XmlReader then calls ResolveUri with the System Identifier for the DTD, which in this case is "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd". In this case, the XhtmlResolver returns a valid Uri.
- XmlReader then calls GetEntity with that URI. XhtmlResolver grabs the embedded resource stream and returns it.
The same thing happens for the dependencies: xhtml-lat1.ent, and so on. In order for the resolver to work, all those things need to be embedded.
And yes, if the Resolver cannot resolve a URI, it is expected to throw an Exception. This isn't officially documented as far as I could see. It seems a bit surprising. (An egregious violation of the principle of least astonishment). If instead, ResolveUri returns null, the XmlReader will call GetEntity on the null URI, which .... ah, is hopeless.
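To make the call sequence and the throw-when-unknown convention concrete, here is a minimal sketch. It is not the XhtmlResolver from the library: it serves a single made-up DTD from memory and records every call it receives. The names SingleDtdResolver and ResolverDemo, the example.invalid URI and the public identifier are all hypothetical. DtdProcessing is the .NET 4.0+ setting; on 3.5 you would use ProhibitDtd = false instead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Xml;

// Hypothetical resolver: knows exactly one system identifier, throws otherwise.
public sealed class SingleDtdResolver : XmlResolver
{
    public const string SystemId = "http://example.invalid/tiny.dtd"; // made-up URI
    const string DtdText = "<!ENTITY nbsp \"&#160;\">";

    // Every call is recorded so the sequence can be inspected afterwards.
    public readonly List<string> Calls = new List<string>();

    public override ICredentials Credentials
    {
        set { /* nothing is fetched over the network, so nothing to store */ }
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        Calls.Add("ResolveUri " + relativeUri);
        if (relativeUri == SystemId)
        {
            return new Uri(SystemId);
        }
        // Unknown identifier (including the public one): throw, do not return null.
        throw new InvalidOperationException("Cannot resolve " + relativeUri);
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        Calls.Add("GetEntity " + absoluteUri.AbsoluteUri);
        if (absoluteUri.AbsoluteUri != SystemId)
        {
            throw new FileNotFoundException("No local copy of " + absoluteUri);
        }
        return new MemoryStream(Encoding.UTF8.GetBytes(DtdText));
    }
}

public static class ResolverDemo
{
    // Parses the XML with the given resolver and returns its text content.
    public static string ReadText(string xml, XmlResolver resolver)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // ProhibitDtd = false on .NET 3.5
            XmlResolver = resolver
        };
        var text = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text) text.Append(reader.Value);
            }
        }
        return text.ToString();
    }
}
```

Feeding ResolverDemo.ReadText a document whose DOCTYPE names both a public identifier and SingleDtdResolver.SystemId, and then printing Calls, is a cheap way to see which requests your framework version actually makes.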
This works for me. It should work for anyone who does XML processing on XHTML from .NET. If you want to use this in your own applications, grab the DLL. The zip includes full source code. Licensed under the MS Public License.
You can plug it into your XML apps that fiddle with XHTML. Use it like this:
// for an XmlDocument...
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
doc.XmlResolver = new Ionic.Xml.XhtmlResolver();
doc.Load(xhtmlFile);
// for an XmlReader...
var xmlReaderSettings = new XmlReaderSettings
{
    ProhibitDtd = false,
    XmlResolver = new XhtmlResolver()
};
using (var stream = File.OpenRead(fileToRead))
{
    XmlReader reader = XmlReader.Create(stream, xmlReaderSettings);
    while (reader.Read())
    {
        ...
    }
}
You can prevent an XmlReader from opening any external resources by setting the XmlReaderSettings.XmlResolver property to null.
System.Xml.XmlReaderSettings xmlReaderSettings = new System.Xml.XmlReaderSettings ();
xmlReaderSettings.XmlResolver = null;
System.Xml.XmlReader xmlReader = System.Xml.XmlReader.Create(myUrl, xmlReaderSettings);
When your ResolveUri method gets a request for a "public" form of the URI like -//W3C//ELEMENTS XHTML Images 1.0//EN, does your method then throw and wait for the subsequent web-like URI which begins with http://?
Instead of throwing, I resolve the public URI to the corresponding http:// URI (and then in my GetEntity method I intercept requests to the http:// URIs).
I therefore never have to throw, which I think is the right solution.
That's a smart way to do it. How big is your dictionary? The library I pointed you to handles only XHTML 1.0, and there is just one public URI base that would need to be mapped.
I'm using XHTML 1.1 which is 'modular' so I have to map about 40 files.
Beware that the Framework's behaviour may have changed! I have a library (including my XhtmlUrlResolver class) which is built with the .NET Framework 2, but it's invoked differently depending on whether the application (which uses the library) is built for .NET 2 or .NET 4.
With .NET 2, when my ResolveUri method only ever delegated transparently to an XmlUrlResolver, the reader would:
- Ask to ResolveUri the public identifier of the DTD.
- Try to GetEntity the DTD from disk (throws one DirectoryNotFoundException)
- Try to GetEntity the DTD from http (which I'd serve from local resources)
- Try to GetEntity every other file from http (which I'd serve from local resources)
With .NET 4 there was an extra call for every resource:
- Ask to ResolveUri the public identifier of the sub-resource (e.g. the *.mod file), which my implementation just delegated to XmlUrlResolver
- Ask to GetEntity the 'resolved' public identifier of the sub-resource, which wasn't really resolved at all, it just had an http-like prefix added (throws WebException)
Throwing all those WebExceptions slowed down processing a lot, which is why I revisited this to look for a fix.
Your suggestion, that I throw from ResolveUri, solved that problem, for which I thank you; but instead of throwing, returning something from ResolveUri is more elegant (and a bit faster: 40 fewer exceptions).
Here's my current source code.
using System;
using System.Collections.Generic;
using System.Text;
using System.Reflection;
using System.IO;
using System.Xml;
//don't obfuscate the file names of the embedded resources,
//which are contained in a "Files" subfolder of the project
[assembly: Obfuscation(Feature = "Apply to ModelText.ModelXml.Files.*: all", Exclude = true, ApplyToMembers = true)]
namespace ModelText.ModelXml
{
    /// <summary>
    /// This class provides local (i.e. faster) access to the XHTML DTD.
    /// </summary>
    /// <remarks>
    /// Another way to implement this class is described in MSDN "Customizing the XmlUrlResolver Class"
    /// which shows as an example a "class XmlCachingResolver"
    /// and which is implemented using WebRequest and HttpRequestCachePolicy
    /// </remarks>
    [System.Reflection.ObfuscationAttribute(Feature = "renaming", ApplyToMembers = true)]
    public class XhtmlUrlResolver : XmlResolver
    {
        XmlUrlResolver m_xmlUrlResolver = new XmlUrlResolver();
        Assembly m_assembly = Assembly.GetExecutingAssembly();
        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
        {
            string uriString = absoluteUri.ToString();
            if (s_resources.uriExists(uriString))
            {
                //Console.WriteLine("XhtmlUrlResolver Found {0} -- {1}", uriString, DateTime.Now);
                //to get the filename of the embedded resource, remove the http: directory
                //this is OK because the filenames are unique and map 1-to-1 with resource names
                string filename = uriString.Substring(uriString.LastIndexOf('/') + 1);
                Stream stream = m_assembly.GetManifestResourceStream(typeof(XhtmlUrlResolver), "Files." + filename);
                return stream;
            }
            //Console.WriteLine("XhtmlUrlResolver Throwing {0} -- {1}", uriString, DateTime.Now);
            throw new ArgumentException();
            //Console.WriteLine("XhtmlUrlResolver Getting {0} -- {1}", uriString, DateTime.Now);
            //object o = m_xmlUrlResolver.GetEntity(absoluteUri, role, ofObjectToReturn);
            //Console.WriteLine("XhtmlUrlResolver Got {0} -- {1}", uriString, DateTime.Now);
            //return o;
        }
        public override Uri ResolveUri(Uri baseUri, string relativeUri)
        {
            string resolved = s_resources.resolve(relativeUri);
            if (resolved != null)
            {
                //Console.WriteLine("ResolveUri resolving {0}, {1} -- {2}", baseUri, relativeUri, DateTime.Now);
                return new Uri(resolved);
            }
            //Console.WriteLine("ResolveUri passing {0}, {1} -- {2}", baseUri, relativeUri, DateTime.Now);
            return m_xmlUrlResolver.ResolveUri(baseUri, relativeUri);
        }
        public override System.Net.ICredentials Credentials
        {
            set { m_xmlUrlResolver.Credentials = value; }
        }
        static Resources s_resources = new Resources();
        class Resources
        {
            Dictionary<string, string> m_publicToUri = new Dictionary<string, string>();
            internal Resources()
            {
                for (int i = 0, n = array.GetLength(0); i < n; ++i)
                {
                    m_publicToUri.Add(array[i, 1], array[i, 0]);
                }
            }
            internal bool uriExists(string absoluteUri)
            {
                return m_publicToUri.ContainsValue(absoluteUri);
            }
            internal string resolve(string relativeUri)
            {
                string resolved;
                if (m_publicToUri.TryGetValue(relativeUri, out resolved))
                {
                    return resolved;
                }
                return null;
            }
            static string[,] array = {
                { "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd", "-//W3C//DTD XHTML 1.1//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod", "-//W3C//ENTITIES XHTML 1.1 Document Model 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod", "-//W3C//ENTITIES XHTML Common Attributes 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-base-1.mod", "-//W3C//ELEMENTS XHTML Base Element 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-bdo-1.mod", "-//W3C//ELEMENTS XHTML BIDI Override Element 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod", "-//W3C//ELEMENTS XHTML Block Phrasal 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-blkpres-1.mod", "-//W3C//ELEMENTS XHTML Block Presentation 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod", "-//W3C//ELEMENTS XHTML Block Structural 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod", "-//W3C//ENTITIES XHTML Character Entities 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-csismap-1.mod", "-//W3C//ELEMENTS XHTML Client-side Image Maps 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-datatypes-1.mod", "-//W3C//ENTITIES XHTML Datatypes 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-edit-1.mod", "-//W3C//ELEMENTS XHTML Editing Elements 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod", "-//W3C//ENTITIES XHTML Intrinsic Events 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-form-1.mod", "-//W3C//ELEMENTS XHTML Forms 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod", "-//W3C//ENTITIES XHTML Modular Framework 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-hypertext-1.mod", "-//W3C//ELEMENTS XHTML Hypertext 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-image-1.mod", "-//W3C//ELEMENTS XHTML Images 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod", "-//W3C//ELEMENTS XHTML Inline Phrasal 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-inlpres-1.mod", "-//W3C//ELEMENTS XHTML Inline Presentation 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod", "-//W3C//ELEMENTS XHTML Inline Structural 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod", "-//W3C//ELEMENTS XHTML Inline Style 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent", "-//W3C//ENTITIES Latin 1 for XHTML//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-link-1.mod", "-//W3C//ELEMENTS XHTML Link Element 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-list-1.mod", "-//W3C//ELEMENTS XHTML Lists 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-meta-1.mod", "-//W3C//ELEMENTS XHTML Metainformation 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-object-1.mod", "-//W3C//ELEMENTS XHTML Embedded Object 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-param-1.mod", "-//W3C//ELEMENTS XHTML Param Element 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-pres-1.mod", "-//W3C//ELEMENTS XHTML Presentation 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod", "-//W3C//ENTITIES XHTML Qualified Names 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-script-1.mod", "-//W3C//ELEMENTS XHTML Scripting 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-special.ent", "-//W3C//ENTITIES Special for XHTML//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-ssismap-1.mod", "-//W3C//ELEMENTS XHTML Server-side Image Maps 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-struct-1.mod", "-//W3C//ELEMENTS XHTML Document Structure 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-style-1.mod", "-//W3C//ELEMENTS XHTML Style Sheets 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent", "-//W3C//ENTITIES Symbols for XHTML//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-table-1.mod", "-//W3C//ELEMENTS XHTML Tables 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-target-1.mod", "-//W3C//ELEMENTS XHTML Target 1.0//EN" },
                { "http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod", "-//W3C//ELEMENTS XHTML Text 1.0//EN" },
                { "http://www.w3.org/TR/ruby/xhtml-ruby-1.mod", "-//W3C//ELEMENTS XHTML Ruby 1.0//EN" }
            };
        }
    }
}