Text Similarity

  1. Lexical Semantic Similarity
    1. WordNet-Based Semantic Similarity
  2. Surface (String) Similarity
    1. FuzzyWuzzy
      1. Installation
      2. Usage

Lexical Semantic Similarity

WordNet-Based Semantic Similarity

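Reference: "Introduction to WordNet and Similarity Computation"
Get all senses (synsets) of a word (requires NLTK and its WordNet corpus):

from nltk.corpus import wordnet as wn
print(wn.synsets("dog"))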

>>>[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'),
 Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'),
 Synset('andiron.n.01'), Synset('chase.v.01')]

Compute semantic similarity:

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
# Path similarity between the senses 'dog.n.01' and 'cat.n.01'
similar = dog.path_similarity(cat)
print(similar)

>>>0.2
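To see where a value like 0.2 comes from: path similarity is defined as 1 / (d + 1), where d is the length of the shortest path between the two senses in the hypernym (is-a) taxonomy. The sketch below reproduces that formula on a small hand-written taxonomy. The `HYPERNYMS` dictionary and both helper functions are assumptions made for illustration; nothing here is loaded from WordNet, and this is not NLTK's implementation.

```python
from collections import deque

# Hand-written toy fragment of a hypernym taxonomy (an assumption for
# illustration only; it is not loaded from WordNet).
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": [],
}


def shortest_distance(a, b):
    """Breadth-first search, treating hypernym links as undirected edges."""
    neighbors = {node: set(parents) for node, parents in HYPERNYMS.items()}
    for node, parents in HYPERNYMS.items():
        for parent in parents:
            neighbors[parent].add(node)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


def path_similarity(a, b):
    """Return 1 / (shortest path length + 1), or None if no path exists."""
    dist = shortest_distance(a, b)
    return None if dist is None else 1 / (dist + 1)


# dog -> canine -> carnivore <- feline <- cat is 4 edges, so 1 / (4 + 1)
print(path_similarity("dog", "cat"))  # 0.2
```

In this toy graph the path from dog to cat runs through carnivore and has four edges, which gives the same 0.2 as the WordNet call above.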

Surface (String) Similarity

FuzzyWuzzy

FuzzyWuzzy is a Python library for fuzzy string matching. Its scores are integers from 0 to 100.

Installation

pip install fuzzywuzzy

Usage

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Simple ratio
fuzz.ratio("this is a test", "this is a test!")
>>> 97

# Partial ratio: best match against a substring
fuzz.partial_ratio("this is a test", "this is a test!")
>>> 100

# Token sort ratio: ignores word order
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
>>> 91
fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
>>> 100

# Token set ratio: ignores duplicated words as well as word order
fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
>>> 84
fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
>>> 100

# Pick the most similar strings from a list of candidates
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2)
>>> [('New York Jets', 100), ('New York Giants', 78)]
process.extractOne("cowboys", choices)
>>> ('Dallas Cowboys', 90)
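The scorers above can be approximated with the standard library alone, which helps show what each one does. The following is a minimal sketch built on `difflib.SequenceMatcher`. The function names mirror FuzzyWuzzy's for readability, but the bodies are simplified assumptions rather than the library's code: the library's extra string preprocessing is left out, and `extract` here ranks by the plain ratio, whereas `process.extract` uses a weighted scorer by default, so its scores can differ.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_ratio(a, b):
    """Sort the lowercased tokens first, so word order no longer matters."""
    return ratio(" ".join(sorted(a.lower().split())),
                 " ".join(sorted(b.lower().split())))


def token_set_ratio(a, b):
    """Compare the shared tokens against each side's leftover tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    combined_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    combined_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))


def extract(query, choices, limit=2):
    """Rank candidates by case-insensitive ratio and keep the top `limit`."""
    scored = [(c, ratio(query.lower(), c.lower())) for c in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


print(ratio("this is a test", "this is a test!"))  # 97
print(token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
print(token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract("new york jets", choices))
```

Turning each string into a set of tokens is what makes the duplicated "fuzzy" disappear in the token set case, while sorting the tokens is what cancels the swapped word order in the token sort case.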

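Please credit the source when reposting. Readers are welcome to verify the references cited in this article and to point out any errors or unclear wording, either in the comments section below or by email to changzeyan@foxmail.com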
