What is the VInt in Lucene?
I want to know what VInt is in Lucene.
I read this article, but I don't understand what it is or where Lucene uses it. Why doesn't Lucene use a simple integer or a big integer?
Thanks.
VInt is extremely space-efficient. In theory it can save up to 75% of the space, since a value that fits in a single byte replaces a four-byte integer.
In Lucene, many of the structures are lists of integers: for example, the list of documents for a given term and the positions (and offsets) of terms within documents, among others. These lists form the bulk of Lucene's data.
Think of Lucene indices for millions of documents that need tens of GBs of space. Shrinking that by more than half reduces disk space requirements. Saving disk space may not be a big win in itself, given that disk space is cheap; the real gain comes from reduced disk IO. Reading VInt data takes less disk IO than reading fixed-width integers, which translates directly into better performance.
VInt refers to Lucene's variable-width integer encoding scheme. It encodes integers in one or more bytes, using only the low seven bits of each byte for data. The high bit is set to one for all bytes except the last, where it is zero, which is how the length is encoded.
For your first question, the Lucene file formats documentation says: "A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on." Source: https://lucene.apache.org/core/3_0_3/fileformats.html
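To make the quoted description concrete, here is a minimal sketch of the scheme in Java. It is written from the description above and is not copied from Lucene's source; the class and method names (`VIntSketch`, `encode`, `decode`) are made up for illustration. In Lucene itself the corresponding operations are exposed as `DataOutput.writeVInt` and `DataInput.readVInt`.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class for illustration only; not part of Lucene.
class VIntSketch {

    // Encode an int using 7 data bits per byte, least significant group first.
    // The high bit of a byte is 1 when more bytes follow and 0 on the last byte.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {          // more than 7 bits still to write
            out.write((value & 0x7F) | 0x80);   // low 7 bits plus continuation bit
            value >>>= 7;
        }
        out.write(value);                       // last byte, high bit clear
        return out.toByteArray();
    }

    // Decode a single value from the start of the array.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;      // append 7 bits at the next position
            if ((b & 0x80) == 0) {
                break;                          // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16383, 16384}) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

With this sketch, 127 takes one byte, 128 and 16,383 take two, and 16,384 takes three, matching the ranges in the quote.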
So, to store a list of n integers as fixed-width values you would need, for example, 4*n bytes. With VInt, every number under 128 is stored in only 1 byte, every number under 16,384 in 2 bytes, and so on, which saves a lot of space when most values are small.
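The arithmetic can be checked with a short sketch that counts bytes for a list of values. The class name, method names, and sample values below are invented for illustration; they are not Lucene code or real index data.

```java
// Hypothetical size calculator for illustration only; not part of Lucene.
class VIntSizeSketch {

    // Number of bytes the 7-bits-per-byte scheme needs for one non-negative value.
    static int vIntLength(int value) {
        int length = 1;
        while ((value & ~0x7F) != 0) {  // more than 7 bits remain
            value >>>= 7;
            length++;
        }
        return length;
    }

    // Total encoded size of a list of values.
    static int totalVIntBytes(int[] values) {
        int total = 0;
        for (int v : values) {
            total += vIntLength(v);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = {3, 17, 90, 127, 128, 300, 20000};  // made-up sample data
        int fixed = 4 * values.length;                     // 4 bytes per 32-bit int
        int variable = totalVIntBytes(values);
        System.out.println("fixed-width: " + fixed + " bytes, VInt: " + variable + " bytes");
    }
}
```

For these seven sample values the fixed-width layout needs 28 bytes and the variable-width one needs 11. If every value were below 128, the list would shrink to n bytes, which is where the 75% upper bound comes from.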
VInt provides a compressed representation of integers, and Shashikant's answer already explains why Lucene needs compression and how it benefits from it.