Unit testing file parsing routines?

I am struggling a bit with how to unit test file parsing. Say I have a file with 25 columns that could be anywhere from 20 to 1,000 records long. How do I write a unit test against that? The function takes the file contents as a string parameter and returns a DataTable.

The best I can come up with is parsing a 4-record file and only checking the top-left and bottom-right 'corners': the first few fields of the top two records and the last few fields of the bottom two records. I can't imagine tediously hand-typing assert statements for every single field in the file. And checking every field of just one record seems just as weak, since it doesn't account for multi-record files or unexpected data.
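
For what it's worth, that 'corners' test might look something like this minimal sketch in C# with NUnit; the FileParser.ParseFile routine, the sample file path, and all expected values are hypothetical stand-ins for the real parser and data:

```csharp
using System.Data;
using System.IO;
using NUnit.Framework;

[TestFixture]
public class ParserCornerTests
{
    [Test]
    public void Parse_FourRecordFile_CornersMatch()
    {
        // Hypothetical parser under test: takes the file contents as a
        // string and returns a DataTable, per the signature described above.
        string contents = File.ReadAllText("samples/four-records.txt");
        DataTable table = FileParser.ParseFile(contents);

        // Overall shape first: 4 records by 25 columns.
        Assert.AreEqual(4, table.Rows.Count);
        Assert.AreEqual(25, table.Columns.Count);

        // Top-left corner: first fields of the first two records.
        // (Expected values are placeholders for the real sample data.)
        Assert.AreEqual("ACME", table.Rows[0][0]);
        Assert.AreEqual("GLOBEX", table.Rows[1][0]);

        // Bottom-right corner: last fields of the last two records.
        Assert.AreEqual("99.95", table.Rows[2][24]);
        Assert.AreEqual("OPEN", table.Rows[3][24]);
    }
}
```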

That seemed 'good enough' at the time. However, I'm now working on a new project that is essentially parsing various PDF files coming in from 10 different sources. Each source has 4-6 different formats for its files, so roughly 40-60 parsing routines, and we may eventually automate 25 additional sources down the road. We convert each PDF to Excel using a third-party tool, then sit and analyze the patterns in the output and write the code that calls the tool's API, takes the Excel file, and parses it: stripping out the garbage, rearranging data that ends up in different places, cleaning it, and so on.

How realistically can I unit test something like this?


I am not sure I fully understand the problem, but here is one idea. Collect a set of sample files that represent the diverse formats and edge cases. Run the conversion to your DataTables and manually inspect the DataTables the first time to ensure they are correct. Then serialize the DataTables to XML and store them in your unit test suite along with your test case PDF files.

Your automated unit tests could perform the conversion from PDF to DataTable and compare the results against the respective "approved" serialized DataTable representation.

You could build up a library of test documents over time using this method. Failures in your unit tests would indicate that changes to the parsing routines have broken a particular edge case.
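
A minimal sketch of that approved-file comparison in C#: DataTable.WriteXml is the actual .NET serialization API, but the PdfPipeline.Convert helper, the file paths, and the test framework choice (NUnit) are assumptions for illustration:

```csharp
using System.Data;
using System.IO;
using NUnit.Framework;

[TestFixture]
public class ApprovedFileTests
{
    [Test]
    public void SourceA_Invoice_MatchesApprovedDataTable()
    {
        // Hypothetical pipeline: PDF -> Excel (3rd-party tool) -> DataTable.
        DataTable actual = PdfPipeline.Convert("samples/source-a-invoice.pdf");

        // Serialize the result exactly as the approved file was produced.
        // WriteSchema includes column names and types in the comparison.
        string actualXml;
        using (var writer = new StringWriter())
        {
            actual.WriteXml(writer, XmlWriteMode.WriteSchema);
            actualXml = writer.ToString();
        }

        string approvedXml = File.ReadAllText("approved/source-a-invoice.xml");
        Assert.AreEqual(approvedXml, actualXml);
    }
}
```

When a test fails, diffing the actual XML against the approved file shows exactly which rows or columns changed, which is far more useful than a bare pass/fail.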

There's one 'catch', though. In my first example I was talking about a .NET application. However, this new project with the 40 (possibly 60) 'scrubbing scripts' is written in VBA. The input is an Excel spreadsheet and the output is an Excel spreadsheet; how could I serialize this? Maybe do a checksum on the entire file?

For the second example, if the Excel spreadsheets are not too complicated, you could try to create a cell-by-cell comparison routine like this one; perhaps you could wrap it in a custom Assert.AreExcelWorksheetsEqual(). You are right, though: a checksum might work just as well.
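
If the comparison is driven from .NET, a sketch of that cell-by-cell helper might look like the following; Microsoft.Office.Interop.Excel is the real interop namespace, but everything else here (the class name, the exception-based reporting) is just one illustrative way to structure it:

```csharp
using System;
using Excel = Microsoft.Office.Interop.Excel;

public static class ExcelAssert
{
    // Compares the used ranges of two worksheets cell by cell and fails
    // on the first mismatch, so the message pinpoints the offending cell.
    public static void AreExcelWorksheetsEqual(Excel.Worksheet expected,
                                               Excel.Worksheet actual)
    {
        Excel.Range e = expected.UsedRange;
        Excel.Range a = actual.UsedRange;

        if (e.Rows.Count != a.Rows.Count || e.Columns.Count != a.Columns.Count)
            throw new Exception(string.Format(
                "Shape mismatch: expected {0}x{1}, actual {2}x{3}",
                e.Rows.Count, e.Columns.Count, a.Rows.Count, a.Columns.Count));

        for (int r = 1; r <= e.Rows.Count; r++)
        {
            for (int c = 1; c <= e.Columns.Count; c++)
            {
                object ev = ((Excel.Range)e.Cells[r, c]).Value2;
                object av = ((Excel.Range)a.Cells[r, c]).Value2;
                if (!object.Equals(ev, av))
                    throw new Exception(string.Format(
                        "Cell ({0},{1}) differs: expected '{2}', actual '{3}'",
                        r, c, ev, av));
            }
        }
    }
}
```

One caveat on the whole-file checksum: a saved workbook embeds metadata such as the last-modified timestamp, so two files with identical cell contents can still produce different checksums. Hashing an exported plain-text form of the data (e.g. CSV) is more stable.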


When you have to build unit tests around a sample of data, use a second sample of expected output data. Whether that's 10K lines of text or a megabyte of binary does not matter.

Just prepare an expected input sample and an expected output data table, whatever the size, and store them in files or scripts next to your source code. The test then fetches the data sample, processes it, and compares the output bit for bit with the expected result, using a generic comparison tool or a SQL statement.
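
A minimal sketch of that bit-for-bit check in C# with NUnit, assuming the routine under test writes its output to a file; the Scrubber.RunScrub helper and the file paths are hypothetical:

```csharp
using System.IO;
using System.Linq;
using NUnit.Framework;

[TestFixture]
public class ExpectedOutputTests
{
    [Test]
    public void ScrubRoutine_OutputMatchesStoredExpectedFile()
    {
        // Hypothetical routine under test: reads the stored input sample
        // and writes its cleaned-up output to a file, returning the path.
        string actualPath = Scrubber.RunScrub("samples/input-sample.dat");

        byte[] expected = File.ReadAllBytes("samples/expected-output.dat");
        byte[] actual = File.ReadAllBytes(actualPath);

        // Bit-for-bit comparison: any difference at all fails the test.
        Assert.IsTrue(expected.SequenceEqual(actual),
            "Output differs from the stored expected result.");
    }
}
```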
