Does Lucene Support Unicode?
I am building a full-text search facility for my website, which is written in ASP.NET MVC with a MySQL database. The website is for a non-English language. I have started work on it using Lucene as the search engine, but I can't find any information on whether it supports Unicode.
Does anyone have any information on whether Lucene supports Unicode? I don't want a nasty surprise.
Links to beginner articles on implementing Lucene.NET would also be appreciated.
Yes, it fully supports Unicode.
For analysis, though, you should explicitly assign appropriate stemmers and the correct stop words for your language.
As a sample, here is a copy from our last project:
// Build an in-memory index. An empty stop-word set is passed, so no words are filtered out.
directory = new RAMDirectory();
analyzer = new StandardAnalyzer(version, new Hashtable());
var indexWriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
using (var session = sessionFactory.OpenStatelessSession())
{
    // Index every Organization: Id and type name are stored as-is, FullName is analyzed for searching.
    organizations = session.CreateCriteria(typeof(Organization)).List<Organization>();
    foreach (var organization in organizations)
    {
        var document = new Document();
        document.Add(new Field("Id", organization.ID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        document.Add(new Field("FullName", organization.FullName, Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
        document.Add(new Field("ObjectTypeInvariantName", typeof(Organization).FullName, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        indexWriter.AddDocument(document);
    }

    // Collect the simple (non-collection, non-entity) properties of Order from the NHibernate metadata.
    var persistentType = typeof(Order);
    var classMetadata = DbContext.SessionFactory.GetClassMetadata(persistentType);
    var properties = new List<PropertyInfo>();
    for (int i = 0; i < classMetadata.PropertyTypes.Length; i++)
    {
        var propertyType = classMetadata.PropertyTypes[i];
        if (propertyType.IsCollectionType || propertyType.IsEntityType) continue;
        properties.Add(typeof(Order).GetProperty(classMetadata.PropertyNames[i]));
    }

    // Index every Order, adding each non-null simple property as an analyzed field.
    orders = session.CreateCriteria(typeof(Order)).List<Order>();
    var idProperty = typeof(Order).GetProperty(classMetadata.IdentifierPropertyName);
    foreach (var order in orders)
    {
        var document = new Document();
        document.Add(new Field("Id", idProperty.GetValue(order, null).ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        document.Add(new Field("ObjectTypeInvariantName", typeof(Order).FullName, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        foreach (var property in properties)
        {
            var value = property.GetValue(order, null);
            if (value != null)
            {
                document.Add(new Field(property.Name, value.ToString(), Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
            }
        }
        indexWriter.AddDocument(document);
    }

    indexWriter.Optimize(true);
    indexWriter.Commit();
    return indexWriter.GetReader();
}
I'm querying Organization objects from NHibernate and putting them into Lucene.NET.
Here is a simple search:
// Parse the user's input against the "FullName" field, using the same analyzer as at index time.
var searchValue = textEdit1.Text;
var parser = new QueryParser(version, "FullName", analyzer);
parser.SetLocale(new CultureInfo("ru-RU"));
Query query = parser.Parse(searchValue);
var indexSearcher = new IndexSearcher(directory, true);
var docs = indexSearcher.Search(query, 10);
lblSearchTotal.Text = string.Format(totalPattern, docs.totalHits, organizations.Count() + orders.Count);
resultPanel.Controls.Clear();
foreach (var found in docs.scoreDocs)
{
    // Read back the stored fields that identify the matching object.
    var document = indexSearcher.Doc(found.doc);
    var objectId = document.Get("Id");
    var objectType = document.Get("ObjectTypeInvariantName");
    if (resultPanel.Controls.Count > 0)
    {
        var labelSeparator = CreateSeparatorLabelControl();
        resultPanel.Controls.Add(labelSeparator);
    }
    var labelCard = CreateFoundLabelControl();
    resultPanel.Controls.Add(labelCard);
    var organization = organizations.Where(o => o.ID.ToString() == objectId).FirstOrDefault();
    if (organization != null)
    {
        labelCard.Text = string.Format("<b>{0}</b></br>{1}", organization.AccountNumber, organization.FullName);
        labelCard.Tag = organization;
        //labels[count].Text = string.Format("<b>{0}</b></br>{1}", organization.AccountNumber, organization.FullName);
        //labels[count].Visible = true;
    }
    else
    {
        // Russian UI text: "Found an object of type '{0}' with identifier '{1}'"
        labelCard.Text = string.Format("Найден объект типа '{0}' с идентификатором '{1}'", objectType, objectId);
        labelCard.Tag = mainForm.GetObject(objectType, objectId);
    }
    labelCard.Visible = true;
    //count++;
}
Yes, Lucene supports Unicode, because it stores strings in UTF-8 format. From the file format documentation:
http://lucene.apache.org/java/3_0_3/fileformats.html
Chars
Lucene writes unicode character sequences as UTF-8 encoded bytes.
String
Lucene writes strings as UTF-8 encoded bytes. First the length, in bytes, is written as a VInt, followed by the bytes.
String --> VInt, Chars
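To make the quoted format concrete, here is a minimal C# sketch of how a string could be laid out that way: a VInt byte count followed by the UTF-8 bytes. This is only an illustration of the layout described above, not Lucene.NET's own code; the class and method names (LuceneStringSketch, EncodeVInt, EncodeString) are made up for this example, and the VInt scheme assumed is the usual one with 7 data bits per byte and the high bit marking "more bytes follow".

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of the Lucene.NET API.
static class LuceneStringSketch
{
    // Encodes a non-negative int as a VInt: 7 data bits per byte,
    // least significant group first, high bit set on all bytes but the last.
    public static byte[] EncodeVInt(int value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var bytes = new List<byte>();
        while ((value & ~0x7F) != 0)
        {
            bytes.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }

    // Lays out a string as: VInt(length in bytes), then the UTF-8 bytes.
    public static byte[] EncodeString(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] length = EncodeVInt(utf8.Length);
        var result = new byte[length.Length + utf8.Length];
        length.CopyTo(result, 0);
        utf8.CopyTo(result, length.Length);
        return result;
    }
}
```

For example, LuceneStringSketch.EncodeString("Привет") yields 13 bytes: one length byte (0x0C, i.e. 12) followed by 12 UTF-8 bytes, since each of the six Cyrillic letters takes two bytes in UTF-8. The point is that non-ASCII text is just a longer byte sequence to the index; nothing is lost or replaced.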
Lucene does support Unicode, but there are limitations. For example, some document readers don't support Unicode. Also, Lucene analyzers can do things like stemming, so that plural and singular forms of a word match each other. When you are using a non-English language, some of that goes away.