How to remove repeated terms

2023-02-04 03:15 Q&A author:

My problem is as below:

If I have a string of terms sorted by importance (separated by commas):

text = "light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing, device light, semiconductor device, lightemitting device, device electrode, compact lightemitting, compact lightemitting device, compact lightemitting device sealing, device lightemitting diode, device photocoupler device, device sealing, emitting type, emitting type light, emitting type light emitting, light output, lightemitting device sealing, optical transmitter, package assembly, photocoupler device, photosensitive, semiconductor device electrode, semiconductor device photocoupler, transmissive, transmitter, type light, type light emitting, type light emitting diode"

The terms in the variable text can be split with strsplit, or with str_split from the stringr package.

library(stringr)
str_split = strsplit(text[1], ", ")

As we can see, the object str_split consists of 40 separate terms.

Now, I would like to extract the first 10 non-repeated terms.

Let pocket = {light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor}

In 1st iteration: light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor.

Term "light" is a subset of "light emitting", so we remove "light" and add the 11th term in the variable text, i.e. device light emitting.

Update: pocket = {device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting}

In 2nd iteration: device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting

Term "device" is a subset of "device light emitting", so we remove "device" and add the 12th term in the variable text, i.e. device photocoupler.

Update: pocket = {emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler}

In 3rd iteration: emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler

Term "emitting" is a subset of "light emitting", so we remove "emitting" and add the 13th term in the variable text, i.e. resin.

Update: pocket = {light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin}

In 4th iteration: light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin

Term "light emitting" is a subset of "device light emitting", so we remove "light emitting" and add the 14th term in the variable text, i.e. sealing.

Update: pocket = {optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing}

In 5th iteration: optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing

Term "photocoupler" is a subset of "device photocoupler", so we remove "photocoupler" and add the 15th term in the variable text, i.e. device light.

Update: pocket = {optical, lightemitting, diode, electrode, semiconductor, device light emitting, device photocoupler, resin, sealing, device light}

In 6th iteration: optical, lightemitting, diode, electrode, semiconductor, device light emitting, device photocoupler, resin, sealing, device light

Term "device light" is a subset of "device light emitting", so we remove "device light" and add the 16th term in the variable text, i.e. semiconductor device.

Update: pocket = {optical, lightemitting, diode, electrode, semiconductor, device light emitting, device photocoupler, resin, sealing, semiconductor device}

The rest may be deduced by analogy.

It is difficult for me to implement such an idea in R.

Could anyone do me a favor?

Best

You can do this with grepl. Just get all non-repeated terms and take the first ten; it's as easy as that. This little function also controls for matching within words: here "light" does not match "lightemitting". Hence the paste call at the start, which appends a space to every term.

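Remove <- function(x){
    # append a space so that "light " cannot match inside "lightemitting "
    tmp <- paste(x,"")
    # column j counts the terms containing term j; 1 means only itself
    id <- colSums(sapply(tmp,grepl,tmp))==1
    x[id]
}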

Txt <- "light, device, emitting, light emitting, optical, lightemitting, diode, 
        electrode, photocoupler, semiconductor, device light emitting, 
        device photocoupler, resin, sealing, device light, semiconductor device,
        lightemitting device, device electrode, compact lightemitting"

# split on a comma plus any whitespace, so the line breaks and indentation
# inside the string do not end up in the terms
Txt_split <- unlist(strsplit(Txt[1], ",\\s*"))

> Remove(Txt_split)
 [1] "optical"               "diode"                
 [3] "device light emitting" "device photocoupler"  
 [5] "resin"                 "sealing"              
 [7] "semiconductor device"  "lightemitting device" 
 [9] "device electrode"      "compact lightemitting"
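
The trailing space only guards the end of each term: "light " cannot match inside "lightemitting ", but "emitting " still can. A minimal sketch, assuming whole-word matching is wanted on both sides and that the terms are unique, pads each term with a space at both ends and matches literally with fixed = TRUE. The name remove_contained is hypothetical.

```r
# Keep only the terms that do not occur, as whole words, inside another term.
# Assumption: terms are unique (duplicates would count each other and be dropped).
remove_contained <- function(x) {
  padded <- paste0(" ", x, " ")   # guard both word boundaries
  # for each term, count how many terms contain it (always at least itself)
  counts <- vapply(padded,
                   function(p) sum(grepl(p, padded, fixed = TRUE)),
                   integer(1))
  x[counts == 1]
}

remove_contained(c("light", "emitting", "lightemitting"))
# all three terms are kept, because none is a whole word of another
```

Counting with vapply rather than colSums also keeps the function usable on a single-term input, where sapply would not return a matrix.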

EDIT: this one doesn't follow your outlined algorithm, since that would take ages on very large datasets and would grow a vector, which should be avoided in R because of the risk of memory issues.
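
For comparison, the iterative "pocket" procedure from the question can be sketched literally. This is only an illustration under stated assumptions: a term is dropped when it occurs as whole words inside another term currently in the pocket, the first such term is removed on each pass, and the function name first_non_repeated and its argument n are hypothetical.

```r
# Literal sketch of the pocket procedure: start with the first n terms,
# repeatedly drop the first term contained in another pocket term,
# and refill from the remaining terms.
first_non_repeated <- function(terms, n = 10) {
  contained <- function(term, others) {
    length(others) > 0 &&
      any(grepl(paste0(" ", term, " "), paste0(" ", others, " "), fixed = TRUE))
  }
  pocket <- head(terms, n)
  nxt <- n + 1                       # index of the next term to pull in
  repeat {
    drop <- which(vapply(seq_along(pocket),
                         function(i) contained(pocket[i], pocket[-i]),
                         logical(1)))
    if (length(drop) == 0) break     # nothing left to remove
    pocket <- pocket[-drop[1]]       # remove the first contained term
    if (nxt <= length(terms)) {      # refill while terms remain
      pocket <- c(pocket, terms[nxt])
      nxt <- nxt + 1
    }
  }
  pocket
}

first_non_repeated(c("light", "light emitting", "resin"), n = 2)
# "light emitting" "resin"
```

The pocket is modified one element at a time, so this is slower than the vectorised function and is meant for checking the logic on short lists.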

The basic idea: loop over the values in the list, checking whether the current value contains any of the previous matches. If it does not, add it to the list of matches.

text <- "light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing, device light, semiconductor device, lightemitting device, device electrode, compact lightemitting"

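library(stringr)

vars <- str_split(text, ", ")[[1]]

# placeholder so that the vector of patterns is never empty
matches <- "__something_not_in_your_list_"
for(i in seq_along(vars))
{
  # keep vars[i] only if it contains none of the terms kept so far
  if(!any(str_detect(vars[i], matches))) matches <- c(matches, vars[i])
}
matches[-1]  # drop the placeholder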

Having an initial value in the vector of matches is a little hack, because str_detect doesn't like it when the pattern argument has length zero.
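
The placeholder can be avoided. A minimal sketch, assuming base R's grepl with fixed = TRUE is an acceptable stand-in for str_detect (it matches literally rather than as a regular expression), starts from an empty vector; the name keep_if_new is hypothetical.

```r
# Same loop without the placeholder: start from an empty character vector.
keep_if_new <- function(vars) {
  matches <- character(0)
  for (v in vars) {
    # TRUE when v contains any term kept so far; any(logical(0)) is FALSE,
    # so the first term is always kept
    seen <- any(vapply(matches,
                       function(m) grepl(m, v, fixed = TRUE),
                       logical(1)))
    if (!seen) matches <- c(matches, v)
  }
  matches
}

keep_if_new(c("light", "device", "light emitting", "optical"))
# "light" "device" "optical"
```

Like the loop above, this keeps the earlier, shorter terms and skips later terms that contain them.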

A further thought: If you don't care about phraseology then the simplest thing to do would just be to pick out all the unique words in your list.

vars <- str_split(text, ", ")[[1]]
all_words <- unlist(str_split(vars, " "))
unique(all_words)
