Splitting a large XML file in two using a C# console app
I need to split an XML file (~400 MB) in two, so that a legacy app can process the file. At the moment it's throwing an exception when the file is over around 300 MB.
As I can't change the app which is doing the processing, I thought I could write a console app to split the file in two first. What's the best way of doing this? It needs to be automated so I can't use a text editor, and I'm using C#.
I suppose the considerations are:
- writing a header to the new files after the split
- finding a good place to split (not in the middle of an 'object')
- closing off tags and file correctly in first file, opening tags correctly in second file
Any suggestions?
The "best" way is likely to be based on XmlReader and XmlWriter. Using these "streaming" APIs avoids needing to load the whole XML object model in memory (and with the DOM, i.e. XmlDocument, that can need considerably more memory than the text data).
Using these APIs is harder than just loading the document: your implementation needs to track the context (e.g. the current node and its list of ancestors). In this case that wouldn't be complex, though: you only need enough state to re-open the enclosing elements at the start of each output document.
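As a rough illustration of the streaming approach, here is a minimal, untested-against-your-data sketch. The class and method names (XmlSplitter, CountRootChildren, SplitInTwo) are made up for this example. It assumes the records to be split are the direct child elements of the root, that text or comments sitting directly under the root can be dropped, and that the file has no DOCTYPE (XmlReader.Create rejects DTDs by default). It makes two passes: one to count the children, one to copy them. Both writers are opened up front so the root start tag can be copied into each while the reader is still positioned on it.

```csharp
using System;
using System.Xml;

public static class XmlSplitter
{
    // First pass: count the element children of the root without building a DOM.
    public static int CountRootChildren(string path)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();                  // positions the reader on the root element
            if (reader.IsEmptyElement) return 0;
            reader.Read();                           // step inside the root
            while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    count++;
                    reader.Skip();                   // jump past the whole child subtree
                }
                else
                {
                    reader.Read();                   // whitespace, comments, text
                }
            }
        }
        return count;
    }

    // Second pass: stream the first half of the children to one file, the rest to the other.
    public static void SplitInTwo(string inputPath, string firstPath, string secondPath)
    {
        int firstCount = (CountRootChildren(inputPath) + 1) / 2;

        using (XmlReader reader = XmlReader.Create(inputPath))
        using (XmlWriter first = XmlWriter.Create(firstPath))
        using (XmlWriter second = XmlWriter.Create(secondPath))
        {
            reader.MoveToContent();
            if (reader.NodeType != XmlNodeType.Element)
                throw new XmlException("No root element found.");

            // Write the same header and root start tag (with attributes) to both outputs.
            foreach (XmlWriter writer in new[] { first, second })
            {
                writer.WriteStartDocument();
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, false);   // leaves the reader on the element
            }

            int written = 0;
            if (!reader.IsEmptyElement)
            {
                reader.Read();
                while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        XmlWriter target = written < firstCount ? first : second;
                        target.WriteNode(reader, false); // copies the subtree and advances the reader
                        written++;
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }

            // Close the root element (and anything else still open) in both files.
            first.WriteEndDocument();
            second.WriteEndDocument();
        }
    }
}
```

A console app would then only need to call something like XmlSplitter.SplitInTwo(args[0], args[1], args[2]). Because a split can only happen between two child elements of the root, an 'object' is never cut in the middle, and memory use stays roughly proportional to the largest single child rather than to the whole file.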
You might want to consider making a full copy of the file and then deleting elements from each. You will have to decide at what level the deletions could occur.
It should then be fairly straightforward, from a count of how many elements have been deleted from FileA, to identify how many (and from what starting point) should be deleted from FileB.
Is that feasible for your circumstance?
I have put together the following to describe my thinking. It is not tested, but I would value the comments of the group. Downvote me if you want but I would prefer constructive criticism.
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // args[0] and args[1] are two full copies of the original file
            SplitXML(args[0], args[1]);
        }

        private static void SplitXML(string fileNameA, string fileNameB)
        {
            int deleteCount;
            int totalCount;
            XmlElement root;
            XmlDocument doc;

            // ------------- Process FileA: delete the first half, keep the rest
            doc = new XmlDocument();
            using (XmlReader reader = XmlReader.Create(fileNameA))
            {
                doc.Load(reader);
            }
            root = doc.DocumentElement;
            deleteCount = root.ChildNodes.Count / 2;
            for (int i = 0; i < deleteCount; i++)
            {
                root.RemoveChild(root.FirstChild);
            }
            using (XmlTextWriter writer = new XmlTextWriter("FileC", null))
            {
                doc.Save(writer);
            }

            // ------------- Process FileB: keep the first half, delete the rest
            doc = new XmlDocument();
            using (XmlReader reader = XmlReader.Create(fileNameB))
            {
                doc.Load(reader);
            }
            root = doc.DocumentElement;
            // ChildNodes is a live list, so take the count once before deleting
            totalCount = root.ChildNodes.Count;
            for (int i = deleteCount; i < totalCount; i++)
            {
                root.RemoveChild(root.LastChild);
            }
            using (XmlTextWriter writer = new XmlTextWriter("FileD", null))
            {
                doc.Save(writer);
            }
        }
    }
}
If it's pure C#, running it as a 64-bit process might solve the problem for no effort at all (assuming you have a 64-bit Windows at hand).
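If you try this, it is worth confirming that the process really does run as 64-bit, since a build targeted at x86 stays 32-bit even on 64-bit Windows. A small sketch of such a check (the BitnessCheck class and Describe method are names invented for this example; Environment.Is64BitProcess and Environment.Is64BitOperatingSystem require .NET 4.0 or later):

```csharp
using System;

public static class BitnessCheck
{
    // Returns a short description of the bitness of the current process and OS.
    public static string Describe()
    {
        string process = Environment.Is64BitProcess ? "64-bit" : "32-bit";
        string os = Environment.Is64BitOperatingSystem ? "64-bit" : "32-bit";
        // IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit one.
        return string.Format("{0} process (pointer size {1}) on a {2} OS",
                             process, IntPtr.Size, os);
    }
}
```

Calling Console.WriteLine(BitnessCheck.Describe()) from the process in question shows which mode it is actually running in.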