In Solr, how to query against one field for a distinct set of values in a multi-valued field
I basically want Solr to search each record of the multivalued field for my search parameters. Read on for my example:
I am using Solr to index my data. I have application data in parallel arrays (in the form of multi-valued fields) that match a given product. See the following example, where make, model, and year are multivalued fields:
<-solr record start->
sku: 1234
make: acura, acura, acura
model: integra, rsx, rsx
year: 1997, 2004, 2000
engine: 3.4, 4.5, 4.5
<-solr record end->
I am using filter queries (&fq=) to narrow my selections. The problem is, if someone looks up a 2000 Acura Integra, it will match the above record, but since the make, model, and year data is encoded in parallel, there actually is no 2000 Acura Integra for this product. Solr is matching the make in the make field, the model in the model field, and the year in the year field (as it should) and returning this result, without respecting my parallelism. My query looks like this so far:
fq=make:"acura"&fq=model:"integra"&fq=year:2000 (I would normally escape URL characters when I POST to Solr, this is just an example)
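To make the mismatch concrete, here is a minimal sketch in plain Python (not Solr code) that models the record above. The function names `matches_per_field` and `matches_same_row` are made up for illustration; the first mimics how each filter query is evaluated against its own field independently, and the second is the row-wise check I actually want.

```python
# Toy model of the record above: each multivalued field is an independent list of values.
record = {
    "make": ["acura", "acura", "acura"],
    "model": ["integra", "rsx", "rsx"],
    "year": ["1997", "2004", "2000"],
}


def matches_per_field(doc, filters):
    """Each filter only has to hit SOME value of its own field (what the fq's do)."""
    return all(value in doc[field] for field, value in filters.items())


def matches_same_row(doc, filters):
    """All filters have to hit at the SAME index of the parallel arrays."""
    rows = zip(*(doc[field] for field in filters))
    wanted = tuple(filters.values())
    return any(row == wanted for row in rows)


query = {"make": "acura", "model": "integra", "year": "2000"}
print(matches_per_field(record, query))  # True: the unwanted hit
print(matches_same_row(record, query))   # False: no such vehicle exists
```

The per-field check succeeds because "integra" and "2000" both exist somewhere in their fields, even though they never occur at the same position.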
So my solution was to create another multivalued field, called the summary field, in which I put all the make, model, year, and other data (like engine) together, separated by spaces. Each value is wrapped in quotation marks so that terms with multiple words don't inadvertently match search parameters. The above example now looks like this:
<-solr record start->
sku: 1234
make: acura, acura, acura
model: integra, rsx, rsx
year: 1997, 2004, 2000
engine: 3.4, 4.5, 4.5
summary: "acura" "integra" "1997" "3.4", "acura" "rsx" "2004" "4.5", "acura" "rsx" "2000" "4.5"
<-solr record end->
I then add to my query the following:
summary:("acura" AND "integra" AND "2000")
I would expect, if I added that to my query, that this record would no longer come up, since there is no acura integra 2000 in the summary field. However, this doesn't work. The record still comes up. I am stumped. Does anyone have a solution to this problem? It's been killing me for days.
I basically want Solr to search each record of the multivalued field for my search parameters. Is this possible? Is there a better way to do what I am trying to do?
Thanks
It seems that your schema isn't quite right. You need to completely denormalize your data and create one document per vehicle. What a "vehicle" means depends on what kind of searches you will run. For example, a possible schema would be:
sku: 1234
make: acura
model: integra
years: 1997
engines: 3.4, 4.5
sku: 1235
make: acura
model: rsx
years: 2000, 2004
engines: 4.5
The summary field would be a copyField of make+model+years+engines
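As a rough sketch of the denormalization step, the following Python splits one record with parallel arrays into one flat document per row, before anything is sent to Solr. This goes one step further than the grouped example above (one document per make/model/year/engine combination). The `denormalize` helper and the `id` format are assumptions made for illustration, not part of any Solr API.

```python
def denormalize(record, parallel_fields):
    """Split one record with parallel arrays into one flat document per row."""
    # Fields that are not part of the parallel arrays are copied to every document.
    shared = {k: v for k, v in record.items() if k not in parallel_fields}
    rows = zip(*(record[f] for f in parallel_fields))
    docs = []
    for i, row in enumerate(rows):
        doc = dict(shared)
        doc.update(zip(parallel_fields, row))
        doc["id"] = f"{record['sku']}-{i}"  # hypothetical unique key per vehicle row
        docs.append(doc)
    return docs


record = {
    "sku": "1234",
    "make": ["acura", "acura", "acura"],
    "model": ["integra", "rsx", "rsx"],
    "year": ["1997", "2004", "2000"],
    "engine": ["3.4", "4.5", "4.5"],
}

for doc in denormalize(record, ["make", "model", "year", "engine"]):
    print(doc)
```

With one document per vehicle, plain filter queries such as fq=make:acura&fq=model:integra&fq=year:2000 can no longer combine values from different rows. To get back to products, you would then collapse the hits on sku.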
I am still not sure how to maintain parallelism without a summary field, but I figured out how to do it with one. AND'ed terms are not tied to a single row of the multivalued field: each term can match a different row, not necessarily the same one. So instead of using AND, you put the exact terms you're looking for into one quoted phrase, in the same order in which you built your original summary record, and use the ~ operator.
Take a look at the following example:
The following are the contents of the summary field in one of the rows in the multivalued field, which I wish to match:
"Honda" "Accord" "2004" "3.5L"
Here is the query I will run:
summary_field:("\"Honda\" \"2004\"")
The above query alone will not work. Users in the application can enter pieces of data (a make, a model, a year) in any order, but I can have a function that puts their input into the same order that the original summary field was built with. Even so, there may be other words in between the terms I am trying to match. In the above example, I want to match Honda 2004 to that record, but Accord sits between them.
To get around this problem, simply use the ~n operator, where n is the maximum number of other terms in between the terms you are searching for. So if I instead use:
summary_field:("\"Honda\" \"2004\""~1)
I am saying that between Honda and 2004 there may be 1 other word. Therefore, the above query will match. Even if you add many terms to your summary field, as long as you query against it with the values in the same order, and the number after ~ is at least the maximum distance between 2 values, your query will match the correct summary field row. (Strictly speaking this is a proximity search rather than a fuzzy search, and it relies on the field's positionIncrementGap being larger than that number, so that a phrase cannot span two rows.) So, if you have 20 fields that you add to your summary field to maintain parallelism, you simply need to use ~18, as that is the maximum possible distance, in the worst case, between two values picked by the user.
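Here is a minimal sketch of the helper function mentioned above, in Python. `FIELD_ORDER`, `build_summary_query`, and the default field name are assumptions for illustration. It also assumes every value is indexed as a single token and contains no quote characters; multi-word values would need a larger number after ~.

```python
FIELD_ORDER = ["make", "model", "year", "engine"]  # assumed order used to build the summary


def build_summary_query(selections, field_order=FIELD_ORDER, field_name="summary_field"):
    """Build a proximity phrase query from user selections given in any order."""
    # Indexes of the selected fields, in the order the summary was built.
    chosen = [i for i, f in enumerate(field_order) if f in selections]
    if not chosen:
        raise ValueError("at least one selection is required")
    # Each value is wrapped in escaped quotes, matching the summary format above.
    terms = " ".join('\\"%s\\"' % selections[field_order[i]] for i in chosen)
    # Number of unselected fields lying between the first and last selected field.
    slop = (chosen[-1] - chosen[0] + 1) - len(chosen)
    return '%s:("%s"~%d)' % (field_name, terms, slop)


# The user picked the year first, then the make; the helper restores summary order.
print(build_summary_query({"year": "2004", "make": "Honda"}))
# summary_field:("\"Honda\" \"2004\""~1)
```

This computes the tightest distance for each query rather than always using the worst case (~18 for 20 fields). Either choice should behave the same as long as the number stays below the gap between rows.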
Can you not just do a query as follows?
make:acura AND model:integra AND year:2000
I.e., without the quotes around the make and model.