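Implementing the SimHash Algorithm in Python

1. SimHash steps

SimHash consists of five main steps: word segmentation, hashing, weighting, merging, and dimensionality reduction.

The SimHash code is as follows: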

import jieba
import jieba.analyse
import numpy as np

class SimHash(object):
  def simHash(self, content):
    seg = jieba.cut(content)
    # jieba.analyse.set_stop_words('stopword.txt')
    # Extract keywords with jieba's TF-IDF implementation
    keyWords = jieba.analyse.extract_tags("|".join(seg), topK=10, withWeight=True)

    keyList = []
    for feature, weight in keyWords:
      # print('feature:' + feature)
      print('weight: {}'.format(weight))
      # Keep the TF-IDF weight as a float: int(weight) would truncate
      # weights below 1 to 0 and cancel the feature's contribution
      binstr = self.string_hash(feature)
      print('feature: %s , string_hash %s' % (feature, binstr))
      temp = []
      for c in binstr:
        if (c == '1'):
          temp.append(weight)
        else:
          temp.append(-weight)
      keyList.append(temp)
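    # Return early when no keywords were extracted
    if (keyList == []):
      return '00'
    # Merge: sum the weighted vectors position by position
    listSum = np.sum(np.array(keyList), axis=0)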
    simhash = ''
    for i in listSum:
      if (i > 0):
        simhash = simhash + '1'
      else:
        simhash = simhash + '0'
    return simhash

  def string_hash(self, source):
    if source == "":
      # Return a bit string so that callers can iterate over it
      return '0' * 64
    else:
      x = ord(source[0]) << 7
      m = 1000003
      mask = 2 ** 128 - 1
      for c in source:
        x = ((x * m) ^ ord(c)) & mask
      x ^= len(source)
      # Keep the lowest 64 bits, left-padded with zeros
      x = bin(x).replace('0b', '').zfill(64)[-64:]

      return str(x)

  def getDistance(self, hashstr1, hashstr2):
    '''
      Compute the Hamming distance between two simhash strings
    '''
    length = 0
    for index, char in enumerate(hashstr1):
      if char == hashstr2[index]:
        continue
      else:
        length += 1

    return length

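1.1 Word segmentation

Word segmentation splits a text into individual words. For example, take text 1 as 今天星期四 ("today is Thursday") and text 2 as 今天星期五 ("today is Friday").

The segmentation results are [今天, 星期四] and [今天, 星期五].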

1.2 Hashing

Hashing computes a hash value for each word produced by segmentation:

Hash of 星期四: 0010001100100000101001101010000000101111011010010001100011011110

Hash of 今天: 0010001111010100010011110001110010100011110111111011001011110101

Hash of 星期五: 0010001100100000101001101010000000101111011010010000000010010001
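As a rough illustration of the hashing step, the sketch below lifts the string hash from the class above into a standalone function. The name `string_hash_64` is hypothetical, and returning 64 zero bits for an empty string is an assumption made here so that the result is always a bit string.

```python
def string_hash_64(source: str) -> str:
    """Return a 64-character bit string for `source` (illustrative helper)."""
    if source == "":
        # Assumption: represent the empty string as 64 zero bits.
        return "0" * 64
    x = ord(source[0]) << 7
    m = 1000003
    mask = 2 ** 128 - 1
    for c in source:
        x = ((x * m) ^ ord(c)) & mask
    x ^= len(source)
    # Keep the lowest 64 bits, left-padded with zeros.
    return bin(x)[2:].zfill(64)[-64:]


for word in ("今天", "星期四", "星期五"):
    print(word, string_hash_64(word))
```

The function is deterministic, so the same word always maps to the same 64 bits, which is what lets two texts that share words produce similar fingerprints.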

1.3 Weighting

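Weighting combines each word's hash with its weight: every 1 bit becomes +weight and every 0 bit becomes -weight, so a word with weight w yields a vector of 64 values, each either +w or -w.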
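The weighting step can be sketched as a small function that mirrors the inner loop of `simHash` above. The name `weight_bits` and the 4-bit input are hypothetical, chosen only to keep the example short.

```python
def weight_bits(bits: str, weight: float) -> list:
    """Map each '1' bit to +weight and every other bit to -weight."""
    return [weight if b == "1" else -weight for b in bits]


# Hypothetical 4-bit hash with weight 3
print(weight_bits("1011", 3))  # [3, -3, 3, 3]
```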

1.4 Merging

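Merging adds the weighted vectors of all words in a text position by position, producing a single vector for the text.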
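A minimal sketch of the merging step in plain Python, equivalent in spirit to the `np.sum(..., axis=0)` call in the class above. The function name `merge_vectors` and the sample vectors are hypothetical.

```python
def merge_vectors(vectors):
    """Sum equally long weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


word_a = [5, -5, 5, 5]     # hypothetical weighted vector, weight 5
word_b = [-3, -3, 3, -3]   # hypothetical weighted vector, weight 3
print(merge_vectors([word_a, word_b]))  # [2, -8, 8, 2]
```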

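1.5 Dimensionality reduction

Dimensionality reduction turns the merged vector into the final bit string: each value greater than 0 becomes 1, and every other value becomes 0.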
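The reduction step can be sketched as follows, using the same rule as the final loop of `simHash` above (a value of exactly 0 maps to 0). The name `reduce_to_bits` and the sample vector are hypothetical.

```python
def reduce_to_bits(merged) -> str:
    """Turn a merged vector into a bit string: positive -> '1', else '0'."""
    return "".join("1" if value > 0 else "0" for value in merged)


print(reduce_to_bits([2, -8, 8, 0]))  # 1010
```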

2. SimHash comparison

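SimHash similarity is usually measured with the Hamming distance, which is computed as follows.

For two n-dimensional binary numbers A and B

$$A = (a_1, a_2, \ldots, a_n), \quad B = (b_1, b_2, \ldots, b_n)$$

the Hamming distance between them is

$$d(A, B) = \sum_{i=1}^{n} a_i \oplus b_i$$

where

$$a_i \oplus b_i = \begin{cases} 0, & a_i = b_i \\ 1, & a_i \neq b_i \end{cases}$$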

Example:

The Hamming distance between 1000 and 1111 is 3.
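The example can be checked with a short sketch. Both function names are hypothetical: the first compares the bit strings character by character, as `getDistance` does, and the second XORs the integer values and counts the 1 bits.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equally long bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_distance_xor(a: str, b: str) -> int:
    """Same distance via XOR of the integer values, counting the 1 bits."""
    return bin(int(a, 2) ^ int(b, 2)).count("1")


print(hamming_distance("1000", "1111"))      # 3
print(hamming_distance_xor("1000", "1111"))  # 3
```

The XOR form tends to be the more convenient one when fingerprints are stored as integers rather than strings.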
