Apache POI Java Excel Performance for Large Spreadsheets
I have a spreadsheet I'm trying to read with POI (I have both xls and xlsx formats), but in this case, the problem is with the xls file. My spreadsheet has about 10,000 rows and 75 columns, and reading it in can take several minutes (though Excel opens in a few seconds). I'm using the event based reading, rather than reading the whole file into memory. The meat of my code is below. It's a bit messy right now, but it's really just a long switch statement that was mostly copied from the POI examples.
Is it typical for POI performance using the event model to be so slow? Is there anything I an do to speed this up? I think several minutes will be unacceptable for my application.
POIFSFileSystem poifs = new POIFSFileSystem(fis);
InputStream din = poifs.createDocumentInputStream("Workbook");
try
{
HSSFRequest req = new HSSFRequest();
listener = new FormatTrackingHSSFListener(new HSSFListener() {
@Override
public void processRecord(Record rec)
{
thisString = null;
int sid = rec.getSid();
switch (sid)
{
case SSTRecord.sid:
strTable = (SSTRecord) rec;
break;
case LabelSSTRecord.sid:
LabelSSTRecord labelSstRec = (LabelSSTRecord) rec;
thisString = strTable.getString(labelSstRec
.getSSTIndex()).getString();
row = labelSstRec.getRow();
col = labelSstRec.getColumn();
break;
case RKRecord.sid:
RKRecord rrk = (RKRecord) rec;
thisString = "";
row = rrk.getRow();
col = rrk.getColumn();
break;
case LabelRecord.sid:
LabelRecord lrec = (LabelRecord) rec;
thisString = lrec.getValue();
row = lrec.getRow();
col = lrec.getColumn();
break;
case BlankRecord.sid:
BlankRecord blrec = (BlankRecord) rec;
thisString = "";
row = blrec.getRow();
col = blrec.getColumn();
break;
case BoolErrRecord.sid:
BoolErrRecord berec = (BoolErrRecord) rec;
row = berec.getRow();
col = berec.getColumn();
byte errVal = berec.getErrorValue();
thisString = errVal == 0 ? Boolean.toString(berec
.getBooleanValue()) : ErrorConstants
.getText(errVal);
break;
case FormulaRecord.sid:
FormulaRecord frec = (FormulaRecord) rec;
switch (frec.getCachedResultType())
{
case Cell.CELL_TYPE_NUMERIC:
double num = frec.getValue();
if (Double.isNaN(num))
{
// Formula result is a string
// This is stored in the next record
outputNextStringRecord = true;
}
else
{
thisString = formatNumericValue(frec, num);
}
break;
case Cell.CELL_TYPE_BOOLEAN:
thisString = Boolean.toString(frec
.getCachedBooleanValue());
bre开发者_如何学JAVAak;
case Cell.CELL_TYPE_ERROR:
thisString = HSSFErrorConstants
.getText(frec.getCachedErrorValue());
break;
case Cell.CELL_TYPE_STRING:
outputNextStringRecord = true;
break;
}
row = frec.getRow();
col = frec.getColumn();
break;
case StringRecord.sid:
if (outputNextStringRecord)
{
// String for formula
StringRecord srec = (StringRecord) rec;
thisString = srec.getString();
outputNextStringRecord = false;
}
break;
case NumberRecord.sid:
NumberRecord numRec = (NumberRecord) rec;
row = numRec.getRow();
col = numRec.getColumn();
thisString = formatNumericValue(numRec, numRec
.getValue());
break;
case NoteRecord.sid:
NoteRecord noteRec = (NoteRecord) rec;
row = noteRec.getRow();
col = noteRec.getColumn();
thisString = "";
break;
case EOFRecord.sid:
inSheet = false;
}
if (thisString != null)
{
// do something with the cell value
}
}
});
req.addListenerForAllRecords(listener);
HSSFEventFactory factory = new HSSFEventFactory();
factory.processEvents(req, din);
f you are using Apache POI to generate large excel file, please take note the following line :
sheet.autoSizeColumn((short) p);
Because this will degrade the performance.
I did also some processing with thousands of large excel files and in my opinion POI is very fast. Loading that excel files tooks also about 1 minute in Excel itself. So i would confirm that the problem lies out of POI code
I would attempt to use the streaming hssf as well introduced in poi-beta3. This helped the memory issues on large spreadsheets with 1000+ columns.
If you are using Apache POI to generate large excel file, please take note the sheet.autoSizeColumn((short) p); line because this will impact the performance.
http://stanicblog.blogspot.sg/2013/07/generate-large-excel-report-by-using.html
I did some more detailed profiling and it looks like the problem is actually in code outside of POI. I just assumed this was the bottleneck, but I believe this is incorrect.
精彩评论