Mongo

Members
  • Posts

    1

Profile Information

  • Application Season
    2013 Spring
  • Program
    PhD Information Systems

  1. I'm working on transforming one set of data into another based on a certain variable (length). Here's what the actual problem looks like:

    list1 = ['red', 'yellow', 'blue']
    doc1 = ['yellow', 'green', 'red']
    list2 = ['red', 'yellow', 'green', 'black', 'purple', 'brown']
    doc2 = ['yellow', 'red', 'blue', 'grey', 'pink', 'pale', 'colours', 'indigo']

    Jaccard similarity between list1 and doc1 gives a score of 0.667.
    Jaccard similarity between list2 and doc2 gives a score of 0.182.

    The first comparison has two overlaps (red and yellow) and gets a higher score than the second comparison, which has the same number of overlapping items. Hence the larger the compared items are, the smaller the similarity score, and vice versa.

    My goal is to determine a transformation/normalisation factor that cancels out the effect of the size difference and measures similarity based on the actual overlap. Here's my attempt: I multiplied the similarity scores by the log of the average length of the compared items.

    First comparison: average item length = 3, final score = log(3) * 0.667 = 0.73277
    Second comparison: average item length = 7, final score = log(7) * 0.182 = 0.35416

    Multiplying by the items' length favours longer items, thus reducing the difference in scores that results from the different sizes (lengths). However, my method didn't reduce the score margin enough, so I'm looking for a method that will cancel out the effect of item sizes and focus on similarity based on overlaps. Any ideas?
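
    For concreteness, here is a minimal sketch of the two steps described above. The helper names jaccard and length_adjusted are illustrative, not from the original post, and note that textbook Jaccard (intersection over union) gives 0.5 and about 0.167 for these pairs, so the quoted 0.667 and 0.182 may come from a slightly different variant of the score.

        import math

        def jaccard(a, b):
            """Textbook Jaccard similarity: |A & B| / |A | B|."""
            a, b = set(a), set(b)
            return len(a & b) / len(a | b)

        def length_adjusted(a, b):
            """Jaccard score scaled by the log of the average item length,
            i.e. the normalisation attempt described in the post."""
            avg_len = (len(a) + len(b)) / 2
            return math.log(avg_len) * jaccard(a, b)

        list1 = ['red', 'yellow', 'blue']
        doc1 = ['yellow', 'green', 'red']
        list2 = ['red', 'yellow', 'green', 'black', 'purple', 'brown']
        doc2 = ['yellow', 'red', 'blue', 'grey', 'pink', 'pale', 'colours', 'indigo']

        print(jaccard(list1, doc1), length_adjusted(list1, doc1))  # 0.5, ~0.549
        print(jaccard(list2, doc2), length_adjusted(list2, doc2))  # ~0.167, ~0.324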