Parsing large XML files to add a new line between elements is really slow
I have a scenario where I need to pull data out of a DB and write it out as XML. The problem is that the users want every element (DB column) to be separated by a new line. The table I am extracting from has about 20,000 rows and a lot of ntext columns (the table is about 3 GB in size).
I am breaking the output up into files of 250 rows each, so each file comes out at around 14 MB. The problem is that the parsing is really slow. To add a new line between each element/column, I insert a unique string between each column coming out of the DB so that I can call Regex.Split and append a new line to each item in the resulting array.
I am sure the slowness is user error / ignorance on my part, as I live mostly in DBs, but I am not sure what to do to speed up the parsing. Extracting the data as XML from the DB is fast, and it writes quickly. But introducing the parsing and adding a new line between each element means each file now takes about 3 minutes to write.
Any suggestions on what I should be using in C# to parse the data and add the new line would be greatly appreciated.
As always I appreciate the input / comments I get on Stack.
Code I am using to parse the xml data:
    //parsing the xml anywhere I see the string AddNewLine
    public static void WriteFile(string xml, int fileNum)
    {
        string[] xmlArray = Regex.Split(xml, "AddNewLine");
        string newXml = "";
        //Getting filepath to write file out to
        Connection filePath = new Connection();
        string fileName = filePath.FilePath;
        //foreach item in the array append carriage and new line
        foreach (string xmlRow in xmlArray)
        {
            newXml = newXml + xmlRow + "\n\r\n";
        }
        //use StreamWriter to write file
        using (StreamWriter sw = new StreamWriter(fileName + fileNum + ".xml"))
        {
            sw.Write(newXml);
        }
        //XmlDocument doc = new XmlDocument();
        //doc.LoadXml(newXml);
        //doc.Save(@"C:\TestFileWrite\PatentSchemaNew_" + fileNum + ".xml");
    }
Example XML output where I would want a new line between each element:
<products>
<product>
<ProductID>1</ProductID>
<!--New Line-->
<Product>TestProduct1</Product>
<!--New Line-->
<ProductDescription>With the introduction of the LE820 Series, Sharp once again establishes its leadership in LCD and LED technology. In a monumental engineering breakthrough, Sharp’s proprietary QuadPixel Technology, a 4-color filter that adds yellow to the traditional RGB, enables more than a trillion colors to be displayed for the first time. A stunning new contemporary edge-light design with full-front glass proudly announces a new AQUOS direction for 2010. The proprietary AQUOS LED system comprised of the X-Gen LCD panel and UltraBrilliant LEDs enables an incredible dynamic contrast ratio of 5,000,000:1 and picture quality that is second to none. The LE820 series is very fully featured, including the addition of Netflix™ streaming video capability through the AQUOS Net™ service, along with the industry’s leading online support system, AQUOS Advantage Live. A built in media player allows for playback of music and photos via USB port.
QuadPixel Technology 4-Color Filter adds yellow to the traditional RGB sub-pixel components, enabling the display of more than a trillion colors.
Full HD 1080p (1920 x 1080) Resolution for the sharpest picture possible.
UltraBrilliant LED System includes a “double-dome” light amplifier lens and multi-fluorescents, enabling high brightness and color purity.
Full HD 1080p X-Gen LCD Panel with 10-bit processing is designed with advanced pixel control to minimize light leakage and wider aperture to let more light through.
120Hz Fine Motion Advanced for fast-motion picture quality.
Wide Viewing Angles (176°H x 176°W) Sharp's AQUOS® LCD TVs’ viewing angles are so wide, you can view the TV clearly from practically anywhere in the room.
High Brightness (450 cd/m2) AQUOS LCD TVs are very bright. You can put them virtually anywhere – even near windows, doors or other light sources – and the picture is still vivid.
AQUOS Net delivers streaming video with Netflix™, customized Internet content and live customer support via Ethernet, viewable in widget, full-screen or split-screen mode.
USB Media Player adds the convenience of viewing high-resolution photos and music on the TV.</ProductDescription>
<!--New Line-->
<ProductAccessories> What You'll Need
Add
Monster Cable MC BNDLF OL150F Bundle HDTV Performance Kit with Flat Panel Wall Bracket
Monster Cable HT700 8 Outlet Surge Protector
Monster's SurgeGuard™ protects components from harmful surges and...
$208.95
Get More Performance
Add
AudioQuest AQ Kit4 1-4ft. and 1-8ft. Black HDTV Performance Pack with HDMI Cables, Screen Cleaner & Mitt
Uncompressed digital signal for the highest quality picture and sound. One cable for video, audio and control. Two-way communication for expanded system control. Automatic display and source matching for resolution, format and aspect ratio. Computer and gaming compatibility. $79.75
Recommended Accessories
General Accessory
Add
Monster Cable ScreenClean 6oz. Ultimate Performance TV Screen Cleaner
Safe for use on your iPad, iPhone, iPod Touch, laptops, monitors, and TV screens Includes a high-tech reusable MicroFiber cloth that cleans screens without scratching Powerful cleaning solution removes dust, dirt, and oily fingerprints for ultimate clarity Advanced formula cleans without dripping, streaking, or staining like ordinary cleaners $13.94
Add
AudioQuest CleanScreen TV Screen Cleaning Kit
$19.75
Protection Plans
Add
TechShield TTL200S5 5-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service)
Parts and labor coverage with no deductibles No-lemon guarantee 50% value guarantee if you never use the warranty service $314.95
Add
TechShield TTL200S3 3-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service)
$157.95
Add
TechShield TTL200S4 4-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service)
$262.95
Add
TechShield TTL200S2 2-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service)
$104.95
Flat Panel Wall Mount - Fixed
Add
OmniMount OL150F Flat Panel Wall Bracket
Eco-friendly design and packaging Low mounting profile Includes universal rails and spacers for greater panel compatibility Small footprint provides ample room for power and A/V cutouts behind panel Lift n’ Lock™ allows you to easily attach your flat panel to the mount Sliding lateral on-wall adjustment Locking system secures panel to mount Installation template for simple and accurate mounting Includes end caps for a clean side view Includes complete hardware kit $99.95
Add
OmniMount NC200F Black Fixed Wall Mount for 37-63 inch Flat Panels
$129.95
Flat Panel Wall Mount - Tilt
Add
OmniMount NC200T Black Tilt Mount for 37-63 inch Flat Panels
Universal rails for greater panel compatibility Sliding lateral on-wall adjustment Locking bar works with padlock or screw End caps cover locking hardware and present a clean side view Installation template for simple and accurate mounting $179.95
Flat Panel Wall Mount - Cantilever/Articulating
Add
OmniMount UCL-X Platinum Wishbone Cantilever Mount Heavy Duty Dual Arm Double Stud
Tilt, pan and swivel for maximum viewing flexibility Weight capacity: 200 lbs Double-arm i-beam design for added strength Integrated cable management hides wires Lift and lock mounting system $279.88
Add
OmniMount NC125C Black Cantilever Mount for 37-52 inch Flat Panels
$299.95
Line Conditioner/Surge Protector
Add
Panamax PM8-GAV Surge Protector with Current Sense Control
8 Outlets (4 switched, 4 always on) Exclusive Protect or Disconnect circuitry Telephone line protection Cable and Satellite protection $59.89
Add
Monster Cable DL MDP 900 Monster Digital PowerCenter MDP 900 w/ Green Power and USB Charging
$74.77
HDMI Cable
Add
AudioQuest HDMI-X 2m (6.56 ft) HDMI Digital Audio Video Cable with Braided Jacket
Large 1.25% silver conductors Critical Twist Geometry Solid High-Density Polyethylene is used to minimize loss caused by insulation Uncompressed digital signal for the highest quality picture and sound $40.00
Add
Icarus ECB-HDM2 2m (6.56 ft) HDMI Digital Audio Video Cable
$16.95
Add
Monster Cable MC HDMIB 2m (6.56 ft.) HDMI Cable
$39.00
Component Video Cable
Add
Monster Cable MC 400CV-2m (6.56 ft.) Advanced Performance Component Video Cable
Get All the High Resolution Picture You Paid For
Your new DVD player, cable/satellite receiver, and TV might be more advanced... $49.00
Add
Monster Cable MC 400CV-1m (3.28 ft.) Advanced Performance Component Video Cable
$39.00
Add
AudioQuest YIQ-A 2m (6.6 ft) Component Video Cable
$44.75
General Accessory
Add
Monster Cable ScreenClean 6oz. Ultimate Performance TV Screen Cleaner
Safe for use on your iPad, iPhone, iPod Touch, laptops, monitors, and TV screens Includes a high-tech reusable MicroFiber cloth that cleans screens without scratching Powerful cleaning solution removes dust, dirt, and oily fingerprints for ultimate clarity Advanced formula cleans without dripping, streaking, or staining like ordinary cleaners $13.94
Add
AudioQuest CleanScreen TV Screen Cleaning Kit
$19.75</ProductAccessories>
<ProductFeatures>Detailed Specifications:
Basic Specifications
10-bit LCD Panel Yes
120HzFrameRate Yes
Aspect Ratio 16:09
Audio System 10W + 10W +15W (Subwoofer)
Backlight System Edge LED
Panel Type X-Gen LCD Panel
Pixel Resolution 1920 x 1080 (x4 sub-pixels) 8 million dots
Response Time 4ms
Tuning System ATSC / QAM / NTSC
Viewing Angles 176° H / 176° V Features
AQUOS Net Yes
AQUOS AdvantageSM Support Yes
AQUOS® Series Yes
Digital Still Picture Display Yes
Quattron quad pixel technology Yes
Included Accessories
Remote Control Yes
Table Stand Yes Power
Power Consumption AC (watts) 160W
Power Source 120 V, 60 Hz
Terminals
Audio Inputs (L/R) RCA X 2
Composite Video 1
Ethernet Input 1
HD Component 1
HDMI® 4
PC 1 (15-pin D-sub)
RS-232C 1
Weight & Dimensions Dimensions
Dimensions (wxhxd) (inches) 49-39/64" x 31-59/64" x 1-37/64
Dimensions with Stand(wxhxd) (inches) 49-39/64" x 33-57/64" x 13-25/64" Weight
Product Weight (lbs.) 66.1
Weight with Stand & Speakers (lbs.) 79.4</ProductFeatures>
<!--New Line-->
<CreatedDate>2011-03-13T12:59:54.627</CreatedDate>
<!--New Line-->
<LastModifiedDate>2011-03-13T12:59:54.627</LastModifiedDate>
<!--New Line-->
</product>
</products>
Thanks,
S
If I understand the question correctly and your 14 MB input XML files already contain the AddNewLine separator, you may not need to load the whole file and split it into parts at all. Just read the input file line by line, replace the AddNewLine text with a new line wherever the separator appears, and write the modified line to a new output file.
The following code replaces your AddNewLine text with \n\r\n several orders of magnitude faster than your function, in less than 1 second:
    using (var streamOut = new StreamWriter(outputFileName))
    {
        using (var streamIn = new StreamReader(inputFileName))
        {
            while (!streamIn.EndOfStream)
            {
                // Process one line at a time so the whole file never sits in memory
                string line = streamIn.ReadLine();
                line = line.Replace("AddNewLine", "\n\r\n");
                streamOut.WriteLine(line);
            }
        }
    }
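For files small enough to hold in memory (the question mentions roughly 14 MB each), the same replacement can probably also be done in a single string.Replace call. The original loop is slow mainly because each `newXml = newXml + ...` copies the entire accumulated string, so the work grows quadratically with the number of separators. Below is a minimal sketch under that assumption; the class name, the method names, and the basePath parameter are hypothetical and stand in for the question's Connection.FilePath lookup.

```csharp
using System;
using System.IO;

public static class XmlNewLineWriter
{
    // Hypothetical helper: swaps every "AddNewLine" marker for a line break
    // in one pass, without building the result piece by piece.
    public static string InsertNewLines(string xml)
    {
        return xml.Replace("AddNewLine", Environment.NewLine);
    }

    // Hypothetical wrapper mirroring WriteFile from the question;
    // basePath is an assumed stand-in for the value of Connection.FilePath.
    public static void WriteFile(string xml, string basePath, int fileNum)
    {
        File.WriteAllText(basePath + fileNum + ".xml", InsertNewLines(xml));
    }
}
```

If the XML for one file is too large to keep in memory, the line-by-line version above is the safer choice.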
I think that you should investigate vtd-xml for at least three reasons:
- Parsing performance and memory usage
- Incremental update: DOM's problem is that it constructs a tree by taking apart the input document, then writes the whole thing back out by concatenation. VTD-XML doesn't take apart the input document; the modification is made by directly inserting the whitespace character (in your situation) into the document's byte representation. SAX and Pull have a similar issue.
- Support for xpath and random access.
Based on the info given above, I fully expect each file to take less than 1 second. What does your file look like? I would be glad to provide some sample code.
OK, here is the code that does the whitespace insertion:
    using System;
    using System.Text;
    using System.Net;
    using com.ximpleware;

    public static void insertWS()
    {
        VTDGen vg = new VTDGen();
        // second argument: false turns off namespace awareness
        if (vg.parseFile("input.xml", false))
        {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            XMLModifier xm = new XMLModifier(vn);
            // select every child element of each product
            ap.selectXPath("/products/product/*");
            while (ap.evalXPath() != -1)
            {
                // insert a new line after the element the cursor is on
                xm.insertAfterElement("\n");
            }
            xm.output("output.xml");
        }
    }
If I were you, I would abandon the string replace method and approach this from a different angle. I would add the new lines as part of the XML when creating it, not after the fact. Something along the lines of:
    void WriteXml(string xmlFileName, DataRowCollection rows)
    {
        var xmlSettings = new XmlWriterSettings { Indent = true };
        using (StreamWriter stream = new StreamWriter(xmlFileName))
        using (XmlWriter writer = XmlWriter.Create(stream, xmlSettings))
        {
            writer.WriteStartElement("products");
            foreach (DataRow row in rows)
            {
                writer.WriteStartElement("product");
                writer.WriteElementString("ProductID", row["ProductID"].ToString());
                writer.Flush(); //push buffered xml to the stream before writing to it directly
                stream.WriteLine(); //insert new line
                writer.WriteElementString("Product", row["Product"].ToString());
                writer.Flush();
                stream.WriteLine(); //insert new line
                //repeat for rest of columns/elements
                //...
                writer.WriteEndElement(); //end product
            }
            writer.WriteEndElement(); //end products
        }
    }