How do I generate a unique id using Lucene?

2023-02-12 19:45 Asked by:

I am using Lucene to store (as well as index) various documents.

Each document needs a persistent unique identifier (to be used as part of a URL).

If I were using a SQL database, I could use an auto-increment integer primary key (or similar) to automatically generate a unique id for every record that was added.

Is there any way of doing this with Lucene?

I am aware that documents in Lucene are numbered, but have noted that these numbers are reallocated over time.

(I'm using the Java version of Lucene 3.0.3.)

As larsmans said, you need to store this in a separate field. I suggest that you make the field indexed as well as stored, and index it using a KeywordAnalyzer. You can keep a counter in memory and update it for each new document.

What remains is the problem of persistence - how to store the maximal id when the Lucene process stops. One possibility is to use a text file which saves the maximal id.
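The text-file idea can be sketched roughly as follows. This is only a minimal illustration, not part of Lucene: the class name FileBackedIdCounter and the file layout (a single number in a UTF-8 text file) are assumptions, and it uses java.nio.file, so it needs Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: hands out sequential ids and keeps the maximal id in a text file. */
final class FileBackedIdCounter {
    private final Path file;
    private long lastId;

    FileBackedIdCounter(Path file) throws IOException {
        this.file = file;
        // Resume from the saved maximal id if the file exists, otherwise start at 0.
        if (Files.exists(file)) {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            if (!text.isEmpty()) {
                lastId = Long.parseLong(text);
            }
        }
    }

    /** Returns the next id, saving it to disk before handing it out. */
    synchronized long nextId() throws IOException {
        long id = lastId + 1;
        // Write to a temporary sibling file first, then swap it into place.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(id).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        lastId = id;
        return id;
    }
}
```

The value returned by nextId() would then be converted to a string and placed in the id field of the new document. Writing the file on every id is simple but slow; saving only every N ids and skipping ahead by N on startup is a common trade-off.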

I believe Flexible Indexing will allow you to add the maximal id to the index as a "global" field. If you are willing to work with Lucene's trunk, you can try flexible indexing to see whether it fits the bill.

For similar situations, I use the following algorithm (it has nothing to do with Lucene, but you can use it anyway).

Create a new AtomicLong. Start with an initial value obtained from System.currentTimeMillis() or System.nanoTime().
Each subsequent ID is generated by calling .incrementAndGet() or .getAndIncrement() on that AtomicLong.
If the system is restarted, the AtomicLong is initialized to the current timestamp again during startup.

Pros: simple, effective, thread-safe, non-blocking. If you need clustered id support, just add space for hi/lo algorithm on top of existing long or sacrifice some high bytes.

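Cons: does not work if new entities are added at a sustained rate of more than one per millisecond (for System.currentTimeMillis()) or one per nanosecond (for System.nanoTime()), because the counter would then run ahead of the clock and a restart could reissue ids. Does not tolerate clock abnormalities.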
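A minimal sketch of the generator described above; the class name TimestampSeededIds is made up for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical id generator: a counter seeded with the wall-clock time at JVM startup. */
final class TimestampSeededIds {
    // Seeded once, when the class is loaded, with the current time in milliseconds.
    private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis());

    /** Thread-safe and non-blocking; each call returns the previous id plus one. */
    static long nextId() {
        return COUNTER.incrementAndGet();
    }
}
```

The sketch seeds from System.currentTimeMillis() rather than System.nanoTime(): the Javadoc for nanoTime() says its origin is arbitrary and unrelated to wall-clock time, so values from different JVM runs are not guaranteed to be comparable.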

You can also consider using a UUID as yet another alternative. The probability of a duplicate UUID is virtually nonexistent.

EDIT: Several commenters have raised possible issues with this approach and I don't have time to test it thoroughly. I'm leaving it here because Yuval F. refers to it. Please don't downvote unnecessarily.

Given an IndexWriter w, you can use w.maxDoc() + 1 as an id and store that (as a string) in a separate Field. Make sure the Field is stored.

Try to find a unique value in the data source you are indexing, and store it in the Lucene document. A data source could be a MySQL database, files from a file system, etc.

For example, if you are indexing content from a MySQL database, you can assemble a unique id from the table name and the primary key, in the form "tablename_rowID".

Let's say you are indexing from two tables, 'pages' and 'comments'. For the row with id 28 in the pages table, you can use the unique id "page_28". Similarly, if you index row 36 in the comments table, your unique id would be "comment_36".
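As a small illustration of the naming scheme, assuming a hypothetical helper called SourceKeys (the prefix is whatever label you pick per table):

```java
/** Hypothetical helper that builds ids of the form "prefix_rowId". */
final class SourceKeys {
    static String documentId(String prefix, long rowId) {
        // The prefix keeps ids from different tables from colliding.
        return prefix + "_" + rowId;
    }
}
```

With this, SourceKeys.documentId("page", 28) gives "page_28" and SourceKeys.documentId("comment", 36) gives "comment_36".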

If all options fail, then I would stick to a UUID. With some additional paranoia, this could be a UUID appended to a timestamp of now().
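Both variants can be sketched with java.util.UUID from the standard library. The class name UuidIds and the "-" separator between timestamp and UUID are assumptions made for this example.

```java
import java.util.UUID;

/** Hypothetical helper producing UUID-based ids. */
final class UuidIds {
    /** A random (version 4) UUID in its canonical 36-character form. */
    static String randomId() {
        return UUID.randomUUID().toString();
    }

    /** The "paranoid" variant: current time in milliseconds, a dash, then a random UUID. */
    static String timestampedId() {
        return System.currentTimeMillis() + "-" + UUID.randomUUID();
    }
}
```

Either string can be stored in the id field as-is. Both contain only digits, lowercase hex letters, and dashes, so they can be used in a URL without escaping.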
