How can I keep track of character positions after I remove elements from a string?

2022-12-20 15:55 Q&A author:

Let us say I have the following string:

 "my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog"  
  1234567890123456789012345678901234567890123456789012345678901 <-- char pos

Now, I have written a regular expression to remove certain elements from the string above, in this example, all whitespace, all periods, and all commas.

I am left with the following transformed string:

 "mydogjumpsandheisaverygooddog"

Now, I want to construct k-grams of this string. If I were to take 5-grams of the above string, they would look like:

  mydog ydogj dogju ogjum gjump jumps umpsa ...

The problem I have is that for each k-gram, I want to keep track of its original character position in the first source text I listed.

So "mydog" would have a start position of 0 and an end position of 11. However, I have no mapping between the source text and the modified text, so I have no idea where a particular k-gram starts and ends in relation to the original, unmodified text. My program needs to keep track of this.

I am creating a list of k-grams like this:

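public class Kgram
{
    public int start;    // start position in the source text
    public int end;      // end position in the source text
    public String text;  // k-gram text after the modifications
}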

Here, start and end are positions in the source text (the first string above), and text is the k-gram's text after the modifications.

Can anyone point me in the right direction for the best way to solve this problem?

Don't use a regular expression 'replace' API to do your replacing. Use regexps only to find the places you want to modify, make the modification yourself, and maintain an offset mapping. One form I've used is an array of ints as big as the original string, storing 'n chars deleted here' values, but there are a host of other possibilities.
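As a rough illustration of that advice, the Python sketch below uses the regular expression only to locate the spans to delete, copies the surviving text itself, and records where each surviving character came from. The function name `strip_with_map`, the default pattern, and the choice of a per-character index list (rather than the 'n chars deleted here' array mentioned above) are all assumptions made for this example.

```python
import re

def strip_with_map(text, pattern=r"[\s.,]+"):
    """Return (clean, origin), where origin[j] is the index in `text`
    of the j-th character of `clean`."""
    kept = []      # surviving pieces of text, in order
    origin = []    # original index of every surviving character
    cursor = 0     # first original index not yet processed
    for match in re.finditer(pattern, text):
        kept.append(text[cursor:match.start()])
        origin.extend(range(cursor, match.start()))
        cursor = match.end()           # skip over the deleted span
    kept.append(text[cursor:])
    origin.extend(range(cursor, len(text)))
    return "".join(kept), origin
```

With such a map, the 5-gram that starts at cleaned offset j runs from origin[j] to origin[j + 4] + 1 in the source. Reading the question's 0 and 11 as a zero-based start and an exclusive end (an assumption), "mydog" comes out as 0 to 11.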

The basic data structure here is an array of pairs. Each pair contains an offset and a correction. Depending on time/space tradeoffs, you may prefer to spread the information out over a data structure as large as the original string.
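One way to read "an array of pairs" is sketched below in Python, under stated assumptions: each pair holds an offset in the cleaned text and the total number of characters deleted before that offset, and a lookup finds the last pair at or before the position being asked about. The names `build_corrections` and `to_original` are hypothetical, and the answer may have had a different layout in mind.

```python
import bisect
import re

def build_corrections(text, pattern=r"[\s.,]+"):
    """Return (clean, pairs); each pair is (clean_offset, chars_deleted_so_far)."""
    pieces, pairs = [], []
    cursor = 0       # position in the original text
    clean_len = 0    # length of the cleaned text built so far
    deleted = 0      # running total of deleted characters
    for match in re.finditer(pattern, text):
        pieces.append(text[cursor:match.start()])
        clean_len += match.start() - cursor
        deleted += match.end() - match.start()
        pairs.append((clean_len, deleted))
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces), pairs

def to_original(pairs, clean_pos):
    """Map a position in the cleaned text back to the original text."""
    # Index of the last pair whose clean offset is <= clean_pos, or -1.
    i = bisect.bisect_right(pairs, (clean_pos, float("inf"))) - 1
    return clean_pos + (pairs[i][1] if i >= 0 else 0)
```

This stores one pair per deleted run instead of one entry per character, at the cost of a binary search per lookup, which is the time/space tradeoff mentioned above.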

Here's how I would solve this problem in Haskell:

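import Data.Char (isSpace)

kgramify k string =
  let charsWithPos = zip string [1..]  -- attach original (1-based) position to each char
      goodCWP      = filter (not . isWhitePeriodOrComma . fst) charsWithPos -- drop nasty chars
      groups       = takeEveryK k goodCWP -- clump remaining chars in groups of size k
      posnOfGroup g = (snd (head g), map fst g) -- position of first char with group
  in  map posnOfGroup groups
  where  -- both helpers are assumed definitions; the original post leaves them out
    isWhitePeriodOrComma c = isSpace c || c `elem` ".,"
    takeEveryK _ [] = []
    takeEveryK n xs = take n xs : takeEveryK n (drop n xs)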

In informal English:

1. Tag each character with its position.
2. Filter out uninteresting (character, position) pairs.
3. Take the remaining list of pairs and group them into a list of lists of length k.
4. For each inner list, take the position of the first character, and pair it with the list of all the characters (with the positions dropped).
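The same steps can be sketched in Python. Two things here are assumptions rather than part of the answer: the windows overlap, as in the question's 5-gram example, and each k-gram also carries an exclusive end position so that "mydog" reports 0 and 11. The name `kgrams` is hypothetical.

```python
def kgrams(text, k):
    """Return (start, end, text) triples for every overlapping k-gram."""
    # Step 1: tag each character with its (zero-based) position.
    tagged = [(ch, pos) for pos, ch in enumerate(text)]
    # Step 2: drop whitespace, periods and commas, keeping the tags.
    kept = [(ch, pos) for ch, pos in tagged
            if not (ch.isspace() or ch in ".,")]
    # Steps 3 and 4: slide a window of k pairs and strip the tags again.
    result = []
    for i in range(len(kept) - k + 1):
        window = kept[i:i + k]
        start = window[0][1]
        end = window[-1][1] + 1          # exclusive end in the source text
        result.append((start, end, "".join(ch for ch, _ in window)))
    return result
```

For non-overlapping groups, as in the Haskell version, the loop would step by k instead of by 1.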

In any functional language like Clean, Haskell, ML, or Scheme, this kind of thing is very easy. In a language with explicit memory allocation (new) or even worse, malloc and free, such a solution would be very tedious.

Here is a C solution, to show that, as Norman Ramsey says, it's pretty tedious. It takes the filter as a callback with context, just for kicks, but in your case you could pass 0 as the filterdata and not_wspc as the filter:

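#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

int not_wspc(void *filterdata, char c) {
    (void)filterdata;  /* context is unused by this filter */
    if (isspace((unsigned char)c)) return 0;
    if ((c == '.') || (c == ',')) return 0;
    return 1;
}

typedef struct {
    char c;
    int pos;
} charwithpos;

/* Assumed layout; the post never defines KGram. text holds 5 chars plus NUL. */
typedef struct { int start; int end; char text[6]; } KGram;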

KGram *foo(const char *input, int (*filter)(void *, char), void *filterdata) {
    size_t len = strlen(input);
    // +1 so that an empty input does not request a zero-sized block
    charwithpos *filtered = malloc((len + 1) * sizeof(*filtered));
    assert(filtered);

    // combine Norman's zip and filter steps
    charwithpos *current = filtered;
    for (size_t i = 0; i < len; ++i) {
        if (filter(filterdata, input[i])) {
            current->c = input[i];
            current->pos = (int)i;
            ++current;
        }
    }
    size_t shortlen = (size_t)(current - filtered);

    // wouldn't normally recommend returning malloced data, but
    // illustrates the point.
    KGram *result = malloc((shortlen / 5 + 1) * sizeof(*result));
    assert(result);

    // take each 5 step; text is assumed to hold at least 6 chars
    KGram *currentgram = result;
    current = filtered;
    for (size_t i = 0; i < shortlen; ++i) {
        currentgram->text[i % 5] = current->c;
        if ((i % 5) == 0) {
            currentgram->start = current->pos;
        } else if ((i % 5) == 4) {
            currentgram->end = current->pos;
            currentgram->text[5] = 0;  // terminate the completed group
            ++currentgram;
        }
        ++current;
    }
    if ((shortlen % 5) != 0) {
        // close off a final group of fewer than 5 chars
        currentgram->end = filtered[shortlen - 1].pos;
        currentgram->text[shortlen % 5] = 0;
    }

    free(filtered);
    return result;
}

Or something like that; I haven't actually compiled or tested it. Obviously this has the significant weakness that the filter sees the chars one at a time, which means it cannot apply backtracking algorithms. You could get around that by passing the whole string into the filter, so that if necessary it can do a lot of work on the first call and store the results to return on all the rest of the calls. But if you need to apply regular-expression-like logic to arbitrary types, then C is probably not the right language to use.
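To make the whole-string idea concrete, here is a small Python sketch (Python rather than C only to keep it short). It assumes the filter is handed the entire string once and returns a keep/drop flag per character, so it can use context such as a multi-character regular expression. Both `regex_filter` and `filter_with_positions` are hypothetical names.

```python
import re

def regex_filter(pattern):
    """Build a whole-string filter: text -> list of booleans (True = keep)."""
    compiled = re.compile(pattern)

    def keep_mask(text):
        mask = [True] * len(text)
        for match in compiled.finditer(text):
            # Mark every character inside a match as dropped.
            mask[match.start():match.end()] = [False] * (match.end() - match.start())
        return mask

    return keep_mask

def filter_with_positions(text, whole_string_filter):
    """Return (char, original_position) pairs for the characters kept."""
    mask = whole_string_filter(text)   # called once, with full context
    return [(ch, pos)
            for pos, (ch, keep) in enumerate(zip(text, mask))
            if keep]
```

For example, a filter built from the pattern `\.\.` drops only doubled periods, which a one-character-at-a-time callback cannot decide without extra state.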

Here's the beginnings of a C++ solution, without even using <functional>. Not sure what Norman is saying about languages with new: just because the language has it doesn't mean you have to use it ;-)

template <typename OutputIterator>
struct KGramOutput {
    OutputIterator dest;
    KGram kgram;
    KGramOutput(OutputIterator dest) : dest(dest) {}
    void add(char, size_t);
    void flush(void);
};

template <typename InputIterator, typename OutputIterator, typename Filter>
void foo(InputIterator first, InputIterator last, OutputIterator dest, Filter filter) {
    size_t i = 0;
    KGramOutput<OutputIterator> kgram(dest);
    while (first != last) {
        if (filter(*first)) kgram.add(*first, i);
        ++first;
        ++i;
    }
    kgram.flush();
}

The add and flush functions are a bit tedious: they have to bundle up 5 pairs into a KGram struct and then do *dest++ = kgram. The user could pass, for example, a back_insert_iterator (as returned by std::back_inserter) over a vector<KGram> as the output iterator. Btw, the '5' and the 'char' could be template parameters too.
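The bookkeeping that add and flush have to do can be sketched as follows, in Python to keep it brief. This is only a guess at the intended behaviour: it emits non-overlapping groups of k characters, stores an inclusive end position as the C version does, and appends `(start, end, text)` tuples to a plain list instead of writing through an output iterator.

```python
class KGramOutput:
    """Collects (char, position) pairs and emits one k-gram per k pairs."""

    def __init__(self, dest, k=5):
        self.dest = dest        # any object with an append() method
        self.k = k
        self.pending = []       # pairs not yet bundled into a k-gram

    def add(self, ch, pos):
        self.pending.append((ch, pos))
        if len(self.pending) == self.k:
            self.flush()

    def flush(self):
        """Emit whatever is pending, even if it is shorter than k."""
        if not self.pending:
            return
        text = "".join(ch for ch, _ in self.pending)
        self.dest.append((self.pending[0][1], self.pending[-1][1], text))
        self.pending = []

def foo(chars, dest, keep, k=5):
    """Feed every kept character and its original index to a KGramOutput."""
    out = KGramOutput(dest, k)
    for i, ch in enumerate(chars):
        if keep(ch):
            out.add(ch, i)
    out.flush()
```

Calling flush at the end mirrors the C++ outline: it emits the final, possibly shorter, group.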

This can be done in a single pass without needing to construct intermediate character-position pairs (Common Lisp):

(defclass k-gram ()
  ((start :reader start :initarg :start)
   (end :accessor end)
   (text :accessor text)))

(defmethod initialize-instance :after ((k-gram k-gram) &rest initargs &key k)
  (declare (ignorable initargs))
  (setf (slot-value k-gram 'text) (make-array k :element-type 'character)))

(defun k-gramify (string k ignore-string)
  "Builds the list of complete k-grams with positions from the original
   text, but with all characters in ignore-string ignored."
  (loop
     for character across string
     for position upfrom 0
     with k-grams = ()
     do (unless (find character ignore-string)
          (push (make-instance 'k-gram :k k :start position) k-grams)
          (loop
             for k-gram in k-grams
             for i upfrom 0 below k
             do (setf (aref (text k-gram) i) character
                      (end k-gram) (1+ position))))
     finally (return (nreverse (nthcdr (- k 1) k-grams)))))
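For readers who do not use Lisp, the following Python sketch follows the same idea as closely as I can tell: every kept character opens a new k-gram and is appended to each k-gram that is still open, and only complete k-grams are returned, with an exclusive end position. The name `k_gramify` mirrors the Lisp function, but the details are an interpretation rather than a line-by-line port.

```python
from collections import deque

def k_gramify(text, k, ignore=" .,"):
    """Return (start, end, text) for every complete k-gram, in one pass."""
    open_grams = deque()   # k-grams still being filled, oldest first
    done = []
    for position, character in enumerate(text):
        if character in ignore:
            continue
        # Each kept character starts a new k-gram at its own position...
        open_grams.append([position, []])
        # ...and extends every k-gram that is still open (at most k of them).
        for gram in open_grams:
            gram[1].append(character)
        # The oldest open k-gram is the only one that can be full.
        if len(open_grams[0][1]) == k:
            start, chars = open_grams.popleft()
            done.append((start, position + 1, "".join(chars)))
    return done
```

At most k k-grams are open at any time, so the extra memory stays proportional to k squared rather than to the length of the text.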
