How to remove repeated terms

My problem is as follows.

I have a string of comma-separated terms, sorted by importance:

text = "light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing, device light, semiconductor device, lightemitting device, device electrode, compact lightemitting, compact lightemitting device, compact lightemitting device sealing, device lightemitting diode, device photocoupler device, device sealing, emitting type, emitting type light, emitting type light emitting, light output, lightemitting device sealing, optical transmitter, package assembly, photocoupler device, photosensitive, semiconductor device electrode, semiconductor device photocoupler, transmissive, transmitter, type light, type light emitting, type light emitting diode"

The terms in the variable text can be split with strsplit, or with str_split from the stringr package.

library(stringr)
# use a name other than str_split so the stringr function is not shadowed
terms <- strsplit(text, ", ")[[1]]

As we can see, terms consists of 40 separate terms.

Now, I would like to extract the first 10 non-repeated terms.

Let pocket = {light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor}

In 1st iteration: light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor.

The term "light" is contained in "light emitting", so we remove "light" and add the 11th term of text, i.e. device light emitting.

Update: pocket = {device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting}

In 2nd iteration: device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting

The term "device" is contained in "device light emitting", so we remove "device" and add the 12th term of text, i.e. device photocoupler.

Update: pocket = {emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler}

In 3rd iteration: emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler

The term "emitting" is contained in "light emitting", so we remove "emitting" and add the 13th term of text, i.e. resin.

Update: pocket = {light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin}

In 4th iteration: light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin

The term "light emitting" is contained in "device light emitting", so we remove "light emitting" and add the 14th term of text, i.e. sealing.

Update: pocket = {optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing}

In 5th iteration: optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing

The term "photocoupler" is contained in "device photocoupler", so we remove "photocoupler" and add the 15th term of text, i.e. device light.

Update: pocket = {optical, lightemitting, diode, electrode, semiconductor, device light emitting, device photocoupler, resin, sealing, device light}

In 6th iteration: optical, lightemitting, diode, electrode, semiconductor, device light emitting, device photocoupler, resin, sealing, device light

The term "device light" is contained in "device light emitting", so we remove "device light" and add the 16th term of text, i.e. semiconductor device.

Update: pocket = {optical, lightemitting, diode, electrode, semiconductor, device light emitting, device photocoupler, resin, sealing, semiconductor device}

The rest may be deduced by analogy.

I find it difficult to implement this idea in R.

Could anyone help me?

Best


You can do this with grepl: find all terms that are not contained in any other term, then take the first ten. This little function also guards against matching inside words, so neither "light" nor "emitting" matches "lightemitting". That is what the paste0 call at the start is for: it pads every term with a space on both sides.

Remove <- function(x){
    # pad both ends so that only whole words can match
    tmp <- paste0(" ", x, " ")
    # column j counts how many terms contain term j; 1 means only itself
    id <- colSums(sapply(tmp, grepl, tmp, fixed = TRUE)) == 1
    x[id]
}

Txt <- "light, device, emitting, light emitting, optical, lightemitting, diode, 
        electrode, photocoupler, semiconductor, device light emitting, 
        device photocoupler, resin, sealing, device light, semiconductor device,
        lightemitting device, device electrode, compact lightemitting"

# split on a comma plus any whitespace, so the line breaks inside Txt are dropped too
Txt_split <- unlist(strsplit(Txt, ",\\s*"))

> Remove(Txt_split)
 [1] "optical"               "diode"                 "device light emitting" "device photocoupler"  
 [5] "resin"                 "sealing"               "semiconductor device"  "lightemitting device" 
 [9] "device electrode"      "compact lightemitting"
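
To illustrate "take the first ten" on the full 40-term list from the question, here is a minimal base-R sketch. The helper name keep_maximal is made up for this example, and it assumes that "repeated" means "occurs as a whole-word phrase inside some other term", which is why each term is padded with a space on both sides.

```r
# The 40 terms from the question, already split.
terms <- c(
  "light", "device", "emitting", "light emitting", "optical",
  "lightemitting", "diode", "electrode", "photocoupler", "semiconductor",
  "device light emitting", "device photocoupler", "resin", "sealing",
  "device light", "semiconductor device", "lightemitting device",
  "device electrode", "compact lightemitting",
  "compact lightemitting device", "compact lightemitting device sealing",
  "device lightemitting diode", "device photocoupler device",
  "device sealing", "emitting type", "emitting type light",
  "emitting type light emitting", "light output",
  "lightemitting device sealing", "optical transmitter",
  "package assembly", "photocoupler device", "photosensitive",
  "semiconductor device electrode", "semiconductor device photocoupler",
  "transmissive", "transmitter", "type light", "type light emitting",
  "type light emitting diode"
)

# Hypothetical helper: keep only the terms that do not occur, as a
# whole-word phrase, inside any other term.
keep_maximal <- function(x) {
  padded <- paste0(" ", x, " ")   # pad both ends to respect word boundaries
  # For every term, count how many terms (itself included) contain it.
  n_containing <- vapply(padded,
                         function(p) sum(grepl(p, padded, fixed = TRUE)),
                         integer(1))
  x[n_containing == 1]            # contained only in itself
}

first_ten <- head(keep_maximal(terms), 10)
print(first_ten)
```

Filtering the whole list first is not the same as the step-by-step procedure in the question: here "optical" and "diode" are dropped, because longer terms further down the list ("optical transmitter", "device lightemitting diode") contain them.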

EDIT: this one doesn't follow the algorithm you outlined. That algorithm would take ages on very large datasets, and it grows a vector step by step, which should be avoided in R because of the risk of memory issues.
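
For comparison, the step-by-step procedure from the question can be sketched literally. This is only an illustration under stated assumptions: pocket_terms and contains_phrase are hypothetical names, "subset" is taken to mean whole-word phrase containment, and the loop is assumed to stop as soon as no pocket term is contained in another one.

```r
# The 40 terms from the question, already split.
terms <- c(
  "light", "device", "emitting", "light emitting", "optical",
  "lightemitting", "diode", "electrode", "photocoupler", "semiconductor",
  "device light emitting", "device photocoupler", "resin", "sealing",
  "device light", "semiconductor device", "lightemitting device",
  "device electrode", "compact lightemitting",
  "compact lightemitting device", "compact lightemitting device sealing",
  "device lightemitting diode", "device photocoupler device",
  "device sealing", "emitting type", "emitting type light",
  "emitting type light emitting", "light output",
  "lightemitting device sealing", "optical transmitter",
  "package assembly", "photocoupler device", "photosensitive",
  "semiconductor device electrode", "semiconductor device photocoupler",
  "transmissive", "transmitter", "type light", "type light emitting",
  "type light emitting diode"
)

# Hypothetical helper: does `a` occur in each element of `b` as a
# whole-word phrase? (Assumed meaning of "subset" in the question.)
contains_phrase <- function(a, b) {
  grepl(paste0(" ", a, " "), paste0(" ", b, " "), fixed = TRUE)
}

pocket_terms <- function(terms, n = 10) {
  pocket <- head(terms, n)
  nxt <- n + 1                       # index of the next term to bring in
  repeat {
    # For each pocket term: is it contained in some *other* pocket term?
    redundant <- vapply(seq_along(pocket), function(i) {
      any(contains_phrase(pocket[i], pocket[-i]))
    }, logical(1))
    if (!any(redundant)) break       # assumed stopping rule
    pocket <- pocket[-which(redundant)[1]]   # drop the first redundant term
    if (nxt <= length(terms)) {      # refill from the remaining terms
      pocket <- c(pocket, terms[nxt])
      nxt <- nxt + 1
    }
  }
  pocket
}

result <- pocket_terms(terms)
print(result)
```

Traced by hand on the question's 40 terms, the pocket settles after nine replacements on the same ten terms that Remove(Txt_split) prints above. The loop does grow and shrink a vector on every step, so it is meant for checking the idea on small inputs rather than for large datasets.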


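The basic idea: loop over the values in the list, checking whether any previously accepted match occurs inside the current value. If none does, add the current value to the list of matches. Note that this keeps the shorter term (e.g. "light") and skips the longer terms that contain it.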

text <- "light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing, device light, semiconductor device, lightemitting device, device electrode, compact lightemitting"

library(stringr)

vars <- str_split(text, ", ")[[1]]

matches <- "__something_not_in_your_list_"
for(i in seq_along(vars))
{
  if(!any(str_detect(vars[i], matches))) matches <- c(matches, vars[i])
}
matches[-1]

Starting the list of matches with a dummy value is a little hack, needed because str_detect doesn't like it when its second argument has length zero.
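
If the placeholder feels awkward, one possible alternative is sketched below. It swaps str_detect for base R's grepl (a substitution made for this example, not the code above) and starts from an empty vector, relying on any() over an empty result being FALSE.

```r
text <- "light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing, device light, semiconductor device, lightemitting device, device electrode, compact lightemitting"

vars <- strsplit(text, ", ", fixed = TRUE)[[1]]

matches <- character(0)              # no dummy first element needed
for (v in vars) {
  # Does any previously accepted term occur inside v (plain substring match)?
  seen <- vapply(matches, function(m) grepl(m, v, fixed = TRUE), logical(1))
  if (!any(seen)) matches <- c(matches, v)
}
print(matches)
```

With fixed = TRUE the terms are treated as literal strings rather than regular expressions, which matters if a term ever contains characters such as "." or "(".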


A further thought: If you don't care about phraseology then the simplest thing to do would just be to pick out all the unique words in your list.

vars <- str_split(text, ", ")[[1]]
all_words <- unlist(str_split(vars, " "))
unique(all_words)