Does Lucene Support Unicode?

2023-02-02 12:24

I am building a full-text search facility for my website, which is written in ASP.NET MVC with a MySQL database. The website is in a non-English language. I have started work on it using Lucene as the search engine, but I can't find any information on whether it supports Unicode.

Does anyone know whether Lucene supports Unicode? I don't want a nasty surprise.

Also, links to beginner articles on implementing Lucene.NET would be appreciated.

Yes, it fully supports Unicode.
For analysis, though, you should explicitly assign appropriate stemmers and the correct stopwords for your language. As an example, here is a copy of the indexing code from our latest project:

    // Build an in-memory index. The empty Hashtable is the stopword set, so no words are filtered out.
    directory = new RAMDirectory();
    analyzer = new StandardAnalyzer(version, new Hashtable());
    var indexWriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    using (var session = sessionFactory.OpenStatelessSession())
    {
        // Index organizations: Id and type name are stored verbatim, FullName is analyzed for searching.
        organizations = session.CreateCriteria(typeof(Organization)).List<Organization>();
        foreach (var organization in organizations)
        {
            var document = new Document();
            document.Add(new Field("Id", organization.ID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            document.Add(new Field("FullName", organization.FullName, Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
            document.Add(new Field("ObjectTypeInvariantName", typeof(Organization).FullName, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            indexWriter.AddDocument(document);
        }

        // Collect the simple (non-collection, non-entity) mapped properties of Order from the NHibernate metadata.
        var persistentType = typeof(Order);
        var classMetadata = DbContext.SessionFactory.GetClassMetadata(persistentType);

        var properties = new List<PropertyInfo>();
        for (int i = 0; i < classMetadata.PropertyTypes.Length; i++)
        {
            var propertyType = classMetadata.PropertyTypes[i];
            if (propertyType.IsCollectionType || propertyType.IsEntityType) continue;
            properties.Add(typeof(Order).GetProperty(classMetadata.PropertyNames[i]));
        }

        orders = session.CreateCriteria(typeof(Order)).List<Order>();
        var idProperty = typeof(Order).GetProperty(classMetadata.IdentifierPropertyName);

        // Index orders: every non-null simple property becomes an analyzed, unstored field.
        foreach (var order in orders)
        {
            var document = new Document();
            document.Add(new Field("Id", idProperty.GetValue(order, null).ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            document.Add(new Field("ObjectTypeInvariantName", typeof(Order).FullName, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            foreach (var property in properties)
            {
                var value = property.GetValue(order, null);
                if (value != null)
                {
                    document.Add(new Field(property.Name, value.ToString(), Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
                }
            }
            indexWriter.AddDocument(document);
        }
        indexWriter.Optimize(true);
        indexWriter.Commit();
        return indexWriter.GetReader();
    }

The code above queries Organization and Order objects through NHibernate and adds them to a Lucene.NET index.

Here is a simple search:

    var searchValue = textEdit1.Text;

    // Parse the user's input against the FullName field with the same analyzer used for indexing.
    var parser = new QueryParser(version, "FullName", analyzer);
    parser.SetLocale(new CultureInfo("ru-RU"));
    Query query = parser.Parse(searchValue);
    var indexSearcher = new IndexSearcher(directory, true);

    // Fetch the top 10 hits and show them in the result panel.
    var docs = indexSearcher.Search(query, 10);
    lblSearchTotal.Text = string.Format(totalPattern, docs.totalHits, organizations.Count() + orders.Count);
    resultPanel.Controls.Clear();
    foreach (var found in docs.scoreDocs)
    {
        var document = indexSearcher.Doc(found.doc);
        var objectId = document.Get("Id");
        var objectType = document.Get("ObjectTypeInvariantName");

        if (resultPanel.Controls.Count > 0)
        {
            var labelSeparator = CreateSeparatorLabelControl();
            resultPanel.Controls.Add(labelSeparator);
        }
        var labelCard = CreateFoundLabelControl();
        resultPanel.Controls.Add(labelCard);

        var organization = organizations.FirstOrDefault(o => o.ID.ToString() == objectId);
        if (organization != null)
        {
            labelCard.Text = string.Format("<b>{0}</b></br>{1}", organization.AccountNumber, organization.FullName);
            labelCard.Tag = organization;
        }
        else
        {
            // Russian UI text: "Found an object of type '{0}' with identifier '{1}'"
            labelCard.Text = string.Format("Найден объект типа '{0}' с идентификатором '{1}'", objectType, objectId);
            labelCard.Tag = mainForm.GetObject(objectType, objectId);
        }
        labelCard.Visible = true;
    }

Yes, Lucene supports Unicode: it stores strings as UTF-8 encoded bytes. From the file formats documentation:

http://lucene.apache.org/java/3_0_3/fileformats.html

Chars

Lucene writes unicode character sequences as UTF-8 encoded bytes.

String

Lucene writes strings as UTF-8 encoded bytes. First the length, in bytes, is written as a VInt, followed by the bytes.

String --> VInt, Chars
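
To make the String rule above concrete, here is a minimal Java sketch of that layout (Java because the linked page documents the Java implementation). It is not Lucene's own code: the class and method names (`LuceneStringSketch`, `encode`, `decode`) are hypothetical, and VInt is assumed to be the variable-length integer described on the same page, with 7 data bits per byte, the low-order group first, and the high bit marking that more bytes follow.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: mimics the documented "String --> VInt, Chars" layout.
// These names are hypothetical and are not part of Lucene or Lucene.NET.
class LuceneStringSketch {

    // Write a non-negative int as a VInt: 7 data bits per byte, low-order
    // group first, high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode a string as: VInt byte length, then the UTF-8 bytes.
    static byte[] encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Decode the layout produced by encode().
    static String decode(byte[] data) {
        int pos = 0;
        int length = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos++];
            length |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return new String(data, pos, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A Russian greeting written with Unicode escapes, followed by ASCII text.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442, Lucene";
        byte[] encoded = encode(original);
        System.out.println(encoded.length + " bytes, round trip ok: "
                + original.equals(decode(encoded)));
    }
}
```

The length prefix counts bytes, not characters: a six-letter Cyrillic word takes 12 bytes in UTF-8, so it is written as a one-byte length followed by 12 bytes of text.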

Lucene does support Unicode, but there are limitations. For example, some document readers don't support Unicode. Also, Lucene's analyzers do things like reduce plural and singular forms of a word to a common form (stemming). When you are indexing a non-English language, some of that goes away unless you configure an analyzer suited to that language.
