How to remove duplicate words using Java

I have a text file from which I want to remove duplicate words. The file contains words like:

    அந்தப்
    சத்தம்
    அந்த
    இந்தத்
    பாப்பா
    இந்த
    கனவுத்
    அந்த
    கனவு

I can remove exact duplicates, but words ending in 'ப்' or 'த்' are treated as separate words, so they are not removed as duplicates. If I strip 'ப்' and 'த்' everywhere, they are also removed from the middle of other words such as பாப்பா and சத்தம். Please suggest how to solve this problem in Java. Thanks in advance.


I think I would use a Set with a custom comparator, such as a TreeSet. That way you can define equality any way you like: a TreeSet treats two elements as duplicates whenever the comparator returns 0 for them.
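As a rough sketch of the comparator idea: the code below assumes that a trailing 'ப்' or 'த்' should simply be ignored when comparing words. That rule is a guess based on the question, not a statement about Tamil grammar, and the names SuffixInsensitiveSet, normalize and dedupe are made up for this example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class SuffixInsensitiveSet {
    // Assumed rule: these endings are ignored, but only at the very end of a word.
    static final String[] IGNORED_ENDINGS = {"ப்", "த்"};

    // Returns the word without one ignored ending; words without such an ending are unchanged.
    static String normalize(String word) {
        for (String ending : IGNORED_ENDINGS) {
            if (word.length() > ending.length() && word.endsWith(ending)) {
                return word.substring(0, word.length() - ending.length());
            }
        }
        return word;
    }

    // Collects the words into a TreeSet whose comparator compares normalized forms.
    static Set<String> dedupe(List<String> words) {
        Comparator<String> byNormalizedForm = Comparator.comparing(SuffixInsensitiveSet::normalize);
        Set<String> unique = new TreeSet<>(byNormalizedForm);
        unique.addAll(words);
        return unique;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "அந்தப்", "சத்தம்", "அந்த", "இந்தத்", "பாப்பா", "இந்த", "கனவுத்", "அந்த", "கனவு");
        for (String word : dedupe(words)) {
            System.out.println(word);
        }
    }
}
```

Because the ending is only checked with endsWith, words such as பாப்பா and சத்தம் are left alone. Note that a TreeSet keeps whichever spelling was inserted first, so the surviving entry may be the one with the ending; store normalize(word) instead if you want the shorter form.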


I don't understand the language (Google Translate's guess is Tamil), but from your question I gather that it has special rules for the 'equality' of words: words can be equal even if they are written differently (e.g. with different endings).

So you may want to wrap the strings containing words of that language in a special object where you can define a custom 'equals' method (and a matching 'hashCode'), like this:

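public class TamilWord {

  private final String writtenWord;

  public TamilWord(String writtenWord) {
    this.writtenWord = writtenWord;
  }

  public String getWrittenWord() {
    return writtenWord;
  }

  // Placeholder: returns the form used for comparison. Define your custom
  // rules here, so that two words that are written differently map to the
  // same key and are therefore considered equal.
  private String comparisonKey() {
    return writtenWord;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof TamilWord)) return false;
    return comparisonKey().equals(((TamilWord) other).comparisonKey());
  }

  // Must agree with equals, otherwise a HashSet will not detect duplicates.
  @Override
  public int hashCode() {
    return comparisonKey().hashCode();
  }
}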

Then you can create TamilWord objects for all parsed Strings and drop them into a Set. So if we have the words abcd and abcD, which are written differently but considered equal according to the rules, only one of them will be added to the set.
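To make the abcd / abcD example concrete, here is a minimal sketch in which "equal according to the rules" is assumed to mean "equal ignoring case". That rule is only a stand-in for the real language rules, and WrappedWordDemo, Word and dedupe are names invented for the illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class WrappedWordDemo {
    // Hypothetical wrapper: two words are equal when their lower-cased forms match.
    static final class Word {
        private final String written;

        Word(String written) {
            this.written = written;
        }

        String getWritten() {
            return written;
        }

        // The comparison key; replace this with the real language rules.
        private String key() {
            return written.toLowerCase(Locale.ROOT);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Word && key().equals(((Word) o).key());
        }

        // Derived from the same key as equals, as hash-based sets require.
        @Override
        public int hashCode() {
            return key().hashCode();
        }
    }

    // Wraps each string and adds it to a set that preserves insertion order.
    static Set<Word> dedupe(String... words) {
        Set<Word> unique = new LinkedHashSet<>();
        for (String w : words) {
            unique.add(new Word(w));
        }
        return unique;
    }

    public static void main(String[] args) {
        for (Word w : dedupe("abcd", "abcD", "xyz")) {
            System.out.println(w.getWritten());
        }
    }
}
```

With this rule only abcd and xyz remain, because abcD is rejected as a duplicate of the abcd that was added first.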


Use a Scanner to read each line as a string into a Set, then write the strings in the set to a file.
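A minimal sketch of that approach, assuming one word per line and UTF-8 files. The class name DedupeFile and its helper methods are hypothetical, and a plain set like this removes exact duplicates only; it does nothing about the 'ப்' / 'த்' endings.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

class DedupeFile {
    // Reads non-empty lines into a set; LinkedHashSet keeps first-seen order.
    static Set<String> readUniqueLines(Scanner scanner) {
        Set<String> unique = new LinkedHashSet<>();
        while (scanner.hasNextLine()) {
            String word = scanner.nextLine().trim();
            if (!word.isEmpty()) {
                unique.add(word);
            }
        }
        return unique;
    }

    // Writes one word per line.
    static void writeLines(Set<String> words, PrintWriter out) {
        for (String word : words) {
            out.println(word);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java DedupeFile <input> <output>");
            return;
        }
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        Set<String> unique;
        try (Scanner scanner = new Scanner(input, "UTF-8")) {
            unique = readUniqueLines(scanner);
        }
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(output, StandardCharsets.UTF_8))) {
            writeLines(unique, out);
        }
    }
}
```

Stating the charset explicitly matters for Tamil text, since the platform default encoding may not be UTF-8.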


First, you should explain how you parse your file, as it seems that your tokenization is not working properly. Then, to my mind, the obvious suggestion for deduplication is to use a Set (or even a TreeSet), which ensures the uniqueness of its elements according to that Set's rules for equality.


My way to solve this:

Read the file word by word and put each word into a java.util.Set<TheWord>. In the end, you will have a Set with no duplicates. You should also define the TheWord class:

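import java.util.Objects;

class TheWord {
  private String word;

  public TheWord() {}

  public String getWord() {
    return word;
  }

  public void setWord(String word) {
    this.word = word;
  }

  // Placeholder: put here your specific way to reduce a word to the form
  // used for comparison, taking into account your language rules and considerations.
  private String comparisonKey() {
    return word;
  }

  // Takes Object, not TheWord, so that it overrides Object.equals and is used by Sets.
  @Override
  public boolean equals(Object o) {
    if (!(o instanceof TheWord)) return false;
    return Objects.equals(comparisonKey(), ((TheWord) o).comparisonKey());
  }

  // Kept consistent with equals so that hash-based Sets work.
  @Override
  public int hashCode() {
    return Objects.hashCode(comparisonKey());
  }
}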