Implementing the SimHash Algorithm in Python

2022-12-10 13:30 · Development · Author: AlanDreamer

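1. SimHash steps

SimHash consists of five main steps: tokenization, hashing, weighting, merging, and dimensionality reduction.

The SimHash code is as follows: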

import math

import jieba
import jieba.analyse
import numpy as np

class SimHash(object):
  def simHash(self, content):
    seg = jieba.cut(content)
    # jieba.analyse.set_stop_words('stopword.txt')
    # jieba extracts keywords based on TF-IDF
    keyWords = jieba.analyse.extract_tags("|".join(seg), topK=10, withWeight=True)

    keyList = []
    for feature, weight in keyWords:
      # print('feature:' + feature)
      print('weight: {}'.format(weight))
      # Round up: int() would truncate TF-IDF weights below 1 to 0,
      # so those features would contribute nothing
      weight = math.ceil(weight)
      binstr = self.string_hash(feature)
      print('feature: %s , string_hash %s' % (feature, binstr))
      # Weighting: a 1 bit contributes +weight, a 0 bit contributes -weight
      temp = []
      for c in binstr:
        if (c == '1'):
          temp.append(weight)
        else:
          temp.append(-weight)
      keyList.append(temp)
    # No keywords extracted: return the '00' sentinel before summing
    if (keyList == []):
      return '00'
    # Merging: sum the weighted vectors position by position
    listSum = np.sum(np.array(keyList), axis=0)
    # Dimensionality reduction: positive sums become 1, the rest 0
    simhash = ''
    for i in listSum:
      if (i > 0):
        simhash = simhash + '1'
      else:
        simhash = simhash + '0'
    return simhash

  def string_hash(self, source):
    # Returns the hash of source as a 64-character binary string
    if source == "":
      return '0' * 64
    else:
      x = ord(source[0]) << 7
      m = 1000003
      mask = 2 ** 128 - 1
      for c in source:
        x = ((x * m) ^ ord(c)) & mask
      x ^= len(source)
      # Keep the low 64 bits, left-padded with zeros
      x = bin(x).replace('0b', '').zfill(64)[-64:]

      return str(x)

  def getDistance(self, hashstr1, hashstr2):
    '''
      Compute the Hamming distance between two simhash strings
    '''
    length = 0
    for index, char in enumerate(hashstr1):
      if char == hashstr2[index]:
        continue
      else:
        length += 1

    return length

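1.1 Tokenization

Tokenization splits a text document into separate words. For example, take text 1 as 今天星期四 ("today is Thursday") and text 2 as 今天星期五 ("today is Friday").

The tokenization results are [今天, 星期四] and [今天, 星期五].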


1.2 Hash

The hash step computes a hash value for each token:

Hash of 星期四: 0010001100100000101001101010000000101111011010010001100011011110

Hash of 今天: 0010001111010100010011110001110010100011110111111011001011110101

Hash of 星期五: 0010001100100000101001101010000000101111011010010000000010010001
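As an aside, any stable 64-bit hash can fill this step. The sketch below swaps in BLAKE2b from Python's standard `hashlib` in place of the article's `string_hash`; the function name `stable_hash64` is hypothetical, and its output will not match the values listed above.

```python
import hashlib


def stable_hash64(token):
    """Return a 64-character binary string for token.

    Assumption: uses BLAKE2b with an 8-byte digest, not the article's
    string_hash, so the bit patterns differ from the ones in the article.
    """
    digest = hashlib.blake2b(token.encode('utf-8'), digest_size=8).digest()
    # Interpret the 8 bytes as an unsigned integer and zero-pad to 64 bits.
    return format(int.from_bytes(digest, 'big'), '064b')


bits = stable_hash64('星期四')
print(len(bits))  # 64
```

Whichever hash is used, it has to give the same bits for the same token on every run, otherwise fingerprints computed at different times cannot be compared.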

1.3 Weighting

Weighting applies each token's weight to its hash: a 1 bit becomes +weight and a 0 bit becomes -weight.
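The inner loop of `simHash` in the code above performs this step. Here is a minimal standalone sketch of it; the helper name `weight_bits`, the 4-bit hash, and the weight 3 are hypothetical and chosen only to keep the example short.

```python
def weight_bits(bits, weight):
    """Map each hash bit to +weight (bit '1') or -weight (bit '0')."""
    return [weight if b == '1' else -weight for b in bits]


# Hypothetical 4-bit hash with weight 3 (real hashes here are 64 bits).
print(weight_bits('1001', 3))  # [3, -3, -3, 3]
```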

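1.4 Merging

Merging adds up the weighted vectors of all tokens, position by position, giving a single vector for the whole text.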
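In the code above this is the `np.sum(..., axis=0)` call. A minimal pure-Python sketch of the same column-wise sum follows; the helper name `merge` and the two input vectors are hypothetical.

```python
def merge(vectors):
    """Sum equally long weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


# Hypothetical weighted vectors for two tokens (weights 3 and 2).
print(merge([[3, -3, -3, 3], [-2, 2, -2, 2]]))  # [1, -1, -5, 5]
```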

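1.5 Dimensionality reduction

Dimensionality reduction turns the merged vector into the final fingerprint: each value greater than 0 is set to 1, and every other value is set to 0.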
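This corresponds to the final loop of `simHash` in the code above. A minimal sketch, with the helper name `reduce_to_bits` and the input vector being hypothetical:

```python
def reduce_to_bits(sums):
    """Turn a merged vector into a bit string: positive -> '1', else '0'."""
    return ''.join('1' if s > 0 else '0' for s in sums)


# Hypothetical merged vector of length 4.
print(reduce_to_bits([1, -1, -5, 5]))  # 1001
```

Note that a sum of exactly 0 maps to '0', matching the `i > 0` test in the code above.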

2. SimHash comparison

SimHash similarity is usually measured with the Hamming distance, which is computed as follows:

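For two n-dimensional binary numbers A and B

$$A = (a_1, a_2, \ldots, a_n), \quad B = (b_1, b_2, \ldots, b_n)$$

their Hamming distance is:

$$H(A, B) = \sum_{i=1}^{n} d(a_i, b_i)$$

where:

$$d(a_i, b_i) = \begin{cases} 0, & a_i = b_i \\ 1, & a_i \neq b_i \end{cases}$$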

Example:

The Hamming distance between 1000 and 1111 is 3.
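The same distance can also be computed by XOR-ing the two fingerprints as integers and counting the 1 bits. This is a minimal sketch as an alternative to the character loop in `getDistance`; the function name `hamming_distance` is hypothetical, and it assumes both inputs are non-empty binary strings of equal length.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    # XOR leaves a 1 exactly where the two fingerprints disagree.
    return bin(int(a, 2) ^ int(b, 2)).count('1')


print(hamming_distance('1000', '1111'))  # 3
```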
