A test data set for auto-testing UTF-8 string validator

2023-02-21 18:25

I wrote a UTF-8 string validator function.

The function takes a buffer of bytes and its expected length in UTF-8 characters, and validates that the buffer consists of exactly the given number of valid UTF-8 characters.

If the buffer is too short or too long, or if it contains invalid UTF-8 characters, validation fails.
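
One way to get an independent oracle for such tests is to compare against a decoder that is already strict. The following is only a sketch, assuming Python 3, whose built-in strict UTF-8 decoder rejects overlong forms, surrogates, and code points above U+10FFFF. The name `reference_validate` and its signature are hypothetical stand-ins for the contract described above, not the actual validator.

```python
def reference_validate(buf: bytes, expected_chars: int) -> bool:
    """Return True if buf holds exactly expected_chars valid UTF-8 characters.

    Hypothetical reference oracle: it mirrors the contract described in the
    question, but relies on Python's strict decoder instead of custom logic.
    """
    try:
        text = buf.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # len() of a Python 3 str counts code points, not bytes.
    return len(text) == expected_chars
```

With this, a buffer such as `41 42 c3 87` is accepted for a length of 3 and rejected for a length of 2 or 4.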

Now I want to write auto-tests for my validator.

Is there a data-set that I can reuse?

I've found this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, but it does not seem to suit my purposes well; as I understand it, it is meant more for visualization tests.

Any clues?

- Valid UTF-8 data, to see that it passes
  - Strings containing characters that need 1, 2, 3, and 4 code units (don't just test "ABC" or "café")
- Clearly invalid data, say some ISO-8859-1 string (that isn't also valid UTF-8)
- A string containing overlong forms (a 1-byte character encoded as 2 bytes, for example); these should not pass as UTF-8
- A string containing code points above U+10FFFF
- Everything listed here: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
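
The categories above could be turned into concrete byte strings along these lines. This is a sketch, not an exhaustive data set: the sample characters, the dictionary keys, and the helper `is_valid_utf8` (which simply defers to Python 3's strict decoder) are all choices made for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Illustrative oracle: defer to Python 3's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# One valid sample per encoded length (1, 2, 3, and 4 bytes).
VALID = [
    "A".encode("utf-8"),           # U+0041  -> 41
    "\u00e9".encode("utf-8"),      # U+00E9  -> c3 a9
    "\u20ac".encode("utf-8"),      # U+20AC  -> e2 82 ac
    "\U0001F600".encode("utf-8"),  # U+1F600 -> f0 9f 98 80
]

# One invalid sample per category from the list above.
INVALID = {
    "latin1_text": "caf\u00e9".encode("latin-1"),       # 63 61 66 e9
    "overlong_2byte_A": bytes.fromhex("c1 81"),         # U+0041 in 2 bytes
    "overlong_3byte_slash": bytes.fromhex("e0 80 af"),  # U+002F in 3 bytes
    "above_10FFFF": bytes.fromhex("f4 90 80 80"),       # would be U+110000
    "stray_continuation": bytes.fromhex("80"),
    "truncated_3byte": bytes.fromhex("e2 82"),          # U+20AC minus last byte
    "byte_ff": bytes.fromhex("ff"),                     # never legal in UTF-8
}
```

Every entry in `VALID` should pass and every entry in `INVALID` should fail, both in the oracle and in the validator under test.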

Depending on how good your code is:

Catching a UTF-8 string that encodes anything from U+D800 to U+DFFF (the surrogate code points, which are reserved for UTF-16 and should never be present in a UTF-8 string)
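
Surrogate test bytes cannot be produced with a normal strict encoder, so they have to be built by hand. A minimal sketch, assuming Python 3: the helper `encode_3byte_unchecked` is hypothetical and deliberately skips the surrogate check, and `is_valid_utf8` defers to Python's strict decoder as the oracle.

```python
def encode_3byte_unchecked(cp: int) -> bytes:
    """Apply the 3-byte UTF-8 bit pattern to cp without rejecting surrogates."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("code point does not fit the 3-byte form")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx
        0x80 | (cp & 0x3F),         # 10xxxxxx
    ])

def is_valid_utf8(data: bytes) -> bool:
    """Illustrative oracle: defer to Python 3's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# Boundary cases around the surrogate range: (code point, should validate).
SURROGATE_CASES = [
    (0xD7FF, True),   # last code point before the range: ed 9f bf
    (0xD800, False),  # first high surrogate:             ed a0 80
    (0xDBFF, False),  # last high surrogate
    (0xDC00, False),  # first low surrogate
    (0xDFFF, False),  # last low surrogate:               ed bf bf
    (0xE000, True),   # first code point after the range: ee 80 80
]
```

Testing both edges of the range, rather than a single value in the middle, helps catch off-by-one mistakes in the range check.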

Some test cases:

Should pass: "ABC"    41 42 43
Should pass: "ABÇ"    41 42 c3 87
Should pass: "ABḈ"    41 42 e1 b8 88
Should pass: "AB
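
The complete cases listed above lend themselves to a table-driven test. The following is a sketch under stated assumptions: `validate` is a hypothetical stand-in with the buffer-plus-character-count contract from the question (implemented here with Python 3's strict decoder), and the two failing rows are added for illustration to exercise a wrong character count.

```python
def validate(buf: bytes, expected_chars: int) -> bool:
    """Hypothetical stand-in for the validator under test."""
    try:
        return len(buf.decode("utf-8", errors="strict")) == expected_chars
    except UnicodeDecodeError:
        return False

# (hex bytes, expected character count, should validate)
CASES = [
    ("41 42 43", 3, True),        # "ABC", 1-byte characters only
    ("41 42 c3 87", 3, True),     # "AB" + U+00C7, a 2-byte character
    ("41 42 e1 b8 88", 3, True),  # "AB" + U+1E08, a 3-byte character
    # Illustrative additions: valid bytes, wrong character count.
    ("41 42 c3 87", 4, False),    # byte count mistaken for character count
    ("41 42 e1 b8 88", 2, False),
]

def run_cases(validator):
    """Return the rows for which validator disagrees with the expectation."""
    failures = []
    for hex_bytes, n_chars, expected in CASES:
        actual = validator(bytes.fromhex(hex_bytes), n_chars)
        if actual != expected:
            failures.append((hex_bytes, n_chars, expected, actual))
    return failures
```

Calling `run_cases(validate)` returns an empty list when every row behaves as expected; swapping in the real validator only requires passing a different callable.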
