Making dtSearch highlight one hit per phrase, rather than one hit per word-in-a-phrase

2022-12-28 00:54 问答作者：

I'm using dtSearch to highlight text search matches within a document. The code to do this, minus some details and cleanup, is roughly along these lines:

SearchJob sj = new SearchJob();
sj.Request = "\"audit trail\""; // the user query
sj.FoldersToSearch.Add(path_to_src_document);
sj.Execute();
FileConverter fileConverter = new FileConverter();
fileConverter.SetInputItem(sj.Results, 0);
fileConvert.BeforeHit = "<a name=\"HH_%%ThisHit%%\"/><b>";
fileConverter.AfterHit = "</b>";
fileConverter.Execute();
string myHighlightedDoc = fileConverter.OutputString;

If I give dtSearch a quoted phrase query like

"audit trail"

then dtSearch will do hit highlighting like this:

An <a name="HH_0"/>audit <a name="HH_1"/>trail is a fun thing to have an <a name="HH_2"/>audit <a name="HH_last"/>trail about!

Note that each word of the phrase is highlighted separately. Instead, I would like phrases to get highlighted as whole units, like this:

An <a name="HH_0"/&g开发者_StackOverflowt;audit trail is a fun thing to have an <a name="HH_last"/>audit trail about!

This would A) make highlighting look better, B) improve behavior of my javascript that helps users navigate from hit to hit, and C) give more accurate counts of total # hits.

Is there good ways to make dtSearch highlight phrases this way?

Note: I think the text and code here could use some more work. If people want to help revise the answer or the code, this can probably become community wiki.

I asked dtSearch about this (4/26/2010). Their response was two-part:

First, it's not possible get the desired highlighting behavior just by, say, altering a flag.

Second, it is possible to get some lower-level hit information where phrase matches are treated as wholes. In particular if you set both the dtsSearchWantHitsByWord and the dtsSearchWantHitsArray flags in your SearchJob, then your search results will be annotated with the word offsets of where each word or phrase in your query matches. For example, if your input document is

An audit trail is a fun thing to have an audit trail about!

and your query is

"audit trail"

then (in the .NET API), sj.Results.CurrentItem.HitsByWord[0] will contain a string like this:

audit trail (2 11 )

indicating that the phrase "audit trail" is found starting at the 2nd word and the 11th word in the document.

One thing you can do with this information is to create a "skip list" indicating which of the dtSearch highlights are insignificant (i.e. which ones are phrase continuations, rather than being the start of a word or phrase). For example, if your skip list was [4, 7, 9], that might mean that the 4th, 7th and 9th hits were insignificant, whereas the other hits were legit. A "skip list" of this sort could be used in at least two ways:

You can change your code that navigates from hit to hit, so that it skips over hit number i iff skipList.contains(i).
Depending on requirements, you may also be able to rewrite the HTML generated by the dtSearch FileConverter. In my case I have dtSearch annotate hits with something like <name="HH_1"/>hitword, and use the A tags (and the fact that they're numbered sequentially -- HH_1, HH_2, HH_3, etc.) as the basis of hit navigation. So what I've tried, with some success, is walking the HTML, and stripping out all the A tags where the i in HH_i is featured in the skip list. Depending on your hit navigation code, you probably need to renumber the A tags so you don't have any gaps between, say, HH_1 and HH_3.

Supposing these "skip lists" are indeed useful, how would you generate them? Well here's some code that mostly works:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using NUnit.Framework;

public class DtSearchUtil
{
    /// <summary>
    /// Makes a "skip list" for the dtSearch result document with the specified
    /// WordArray data. The skip list indicates which hits in the dtSearch markup
    /// should be skipped during hit navigation. The reason to skip some hits
    /// is to allow navigation to be phrase aware, rather than forcing the user
    /// to visit each word in the phrase as if it were an independent hit.
    /// The skip list consists of 1-indexed hit offsets. 2, for example, would
    /// mean that the second hit should be skipped during hit navigation.
    /// </summary>
    /// <param name="dtsHitsByWordArray">dtSearch HitsByWord data. You'll get this from SearchResultItem.HitsByWord
    /// if you did your search with the dtsSearchWantHitsByWord and dtsSearchWantHitsArray
    /// SearchFlags.</param>
    /// <param name="userHitCount">How many total hits there are, if phrases are counted
    /// as one hit each.</param>
    /// <returns></returns>
    public static List<int> MakeHitSkipList(string[] dtsHitsByWordArray, out int userHitCount)
    {
        List<int> skipList = new List<int>();
        userHitCount = 0;

        int curHitNum = 0; // like the dtSearch doc-level highlights, this counts hits word-by-word, rather than phrase by phrase
        List<PhraseRecord> hitRecords = new List<PhraseRecord>();
        foreach (string dtsHitsByWordString in dtsHitsByWordArray)
        {
            hitRecords.Add(PhraseRecord.ParseHitsByWordString(dtsHitsByWordString));
        }
        int prevEndOffset = -1;

        while (true)
        {
            int nextOffset = int.MaxValue;
            foreach (PhraseRecord rec in hitRecords)
            {
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;

                nextOffset = Math.Min(nextOffset, rec.OffsetList[rec.CurOffset]);
            }
            if (nextOffset == int.MaxValue)
                break;

            userHitCount++;

            PhraseRecord longestMatch = null;
            for (int i = 0; i < hitRecords.Count; i++)
            {
                PhraseRecord rec = hitRecords[i];
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;
                if (nextOffset == rec.OffsetList[rec.CurOffset])
                {
                    if (longestMatch == null ||
                        longestMatch.LengthInWords < rec.LengthInWords)
                    {
                        longestMatch = rec;
                    }
                }
            }

            // skip subsequent words in the phrase
            for (int i = 1; i < longestMatch.LengthInWords; i++)
            {
                skipList.Add(curHitNum + i);
            }

            prevEndOffset = longestMatch.OffsetList[longestMatch.CurOffset] +
                (longestMatch.LengthInWords - 1);

            longestMatch.CurOffset++;

            curHitNum += longestMatch.LengthInWords;

            // skip over any unneeded, overlapping matches (i.e. at the same offset)
            for (int i = 0; i < hitRecords.Count; i++)
            {
                while (hitRecords[i].CurOffset < hitRecords[i].OffsetList.Count &&
                    hitRecords[i].OffsetList[hitRecords[i].CurOffset] <= prevEndOffset)
                {
                    hitRecords[i].CurOffset++;
                }
            }
        }

        return skipList;
    }

    // Parsed form of the phrase-aware hit offset stuff that dtSearch can give you 
    private class PhraseRecord
    {
        public string PhraseText;

        /// <summary>
        /// Offsets into the source text at which this phrase matches. For example,
        /// offset 300 would mean that one of the places the phrase matches is
        /// starting at the 300th word in the document. (Words are counted according
        /// to dtSearch's internal word breaking algorithm.)
        /// See also:
        /// http://support.dtsearch.com/webhelp/dtSearchNetApi2/frames.html?frmname=topic&frmfile=dtSearch__Engine__SearchFlags.html
        /// </summary>
        public List<int> OffsetList;

        // BUG: We calculate this with a whitespace tokenizer. This will probably
        // cause bad results in some places. (Better to figure out how to count
        // the way dtSearch would.)
        public int LengthInWords
        {
            get
            {
                return Regex.Matches(PhraseText, @"[^\s]+").Count;
            }
        }

        public int CurOffset = 0;

        public static PhraseRecord ParseHitsByWordString(string dtsHitsByWordString)
        {
            Match m = Regex.Match(dtsHitsByWordString, @"^([^,]*),\s*\d*\s*\(([^)]*)\).*");
            if (!m.Success)
                throw new ArgumentException("Bad dtsHitsByWordString. Did you forget to set dtsHitsByWordString in dtSearch?");

            string phraseText = m.Groups[1].Value;
            string parenStuff = m.Groups[2].Value;

            PhraseRecord hitRecord = new PhraseRecord();
            hitRecord.PhraseText = phraseText;
            hitRecord.OffsetList = GetMatchOffsetsFromParenGroupString(parenStuff);
            return hitRecord;
        }

        static List<int> GetMatchOffsetsFromParenGroupString(string parenGroupString)
        {
            List<int> res = new List<int>();
            MatchCollection matchCollection = Regex.Matches(parenGroupString, @"\d+");
            foreach (Match match in matchCollection)
            {
                string digitString = match.Groups[0].Value;
                res.Add(int.Parse(digitString));
            }
            return res;
        }
    }
}


[TestFixture]
public class DtSearchUtilTests
{
    [Test]
    public void TestMultiPhrasesWithoutFieldName()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 );",
            @"bana*, 4 (490 505 689 713 )"
            };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: banana-something@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: banana-something@505
        // 6: apple@552
        // 7: pie@553 [should skip]
        // 8: apple@578
        // 9: pie@579 [should skip]
        // 10: apple@589
        // 11: pie@590 [should skip]
        // 12: apple@683
        // 13: pie@684 [skip]
        // 14: banana-something@689
        // 15: apple@706
        // 16: pie@707 [skip]
        // 17: banana-something@713

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(11, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(7, skipList[2]);
        Assert.AreEqual(9, skipList[3]);
        Assert.AreEqual(11, skipList[4]);
        Assert.AreEqual(13, skipList[5]);
        Assert.AreEqual(16, skipList[6]);
        Assert.AreEqual(7, skipList.Count);
    }

    [Test]
    public void TestPhraseOveralap1()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 );",
            @"apple, 4 (482 490 499 552)"
            };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: apple@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    }

    [Test]
    public void TestPhraseOveralap2()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 );",
@"pie, 4 (483 490 500 553)"
    };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: pie@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    }

    // TODO: test "apple pie" and "apple", plus "apple pie" and "pie"

    // "subject" should not freak it out
    [Test]
    public void TestSinglePhraseWithFieldName()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 ), subject" };

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(7, userHitCount);

        Assert.AreEqual(7, skipList.Count);
        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(3, skipList[1]);
        Assert.AreEqual(5, skipList[2]);
        Assert.AreEqual(7, skipList[3]);
        Assert.AreEqual(9, skipList[4]);
        Assert.AreEqual(11, skipList[5]);
        Assert.AreEqual(13, skipList[6]);
    }
}

继续阅读：dtsearch hit-highlighting

Making dtSearch highlight one hit per phrase, rather than one hit per word-in-a-phrase

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？