Mongo
Posted November 19, 2014

I'm working on transforming one set of data to another based on a certain variable (length). Here's what the actual problem looks like:

list1 = ['red', 'yellow', 'blue']
doc1 = ['yellow', 'green', 'red']
list2 = ['red', 'yellow', 'green', 'black', 'purple', 'brown']
doc2 = ['yellow', 'red', 'blue', 'grey', 'pink', 'pale', 'colours', 'indigo']

Jaccard similarity between list1 and doc1 gives a score of 0.667.
Jaccard similarity between list2 and doc2 gives a score of 0.182.

The first comparison has two overlaps (red and yellow) and gets a higher score than the second comparison, which has the same number of overlaps. So the larger the compared items, the smaller the similarity score, and vice versa. My goal is to find a transformation/normalisation factor that cancels out the effect of the size difference and measures similarity based on the actual overlap.

Here's my attempt: I multiplied each similarity score by the natural log of the average length of the compared items.

First comparison: average item length = 3, final score = log(3) * 0.667 = 0.73277
Second comparison: average item length = 7, final score = log(7) * 0.182 = 0.35416

Multiplying by the items' length favours longer items, which reduces the difference in scores caused by the different sizes (lengths). However, my method didn't close the gap enough, so I'm looking for a method that cancels out the effect of item size entirely and measures similarity purely by overlap. Any ideas?
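For reference, here is a minimal runnable sketch of what I'm doing, assuming a standard set-based Jaccard and the natural log (the natural log is what reproduces the 0.73277 and 0.35416 arithmetic above; a strict set-based Jaccard on these inputs prints slightly different raw scores than the figures I quoted, so my scores may come from a slightly different variant):

```python
import math

def jaccard(a, b):
    """Jaccard similarity: |A n B| / |A u B| over the deduplicated items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def log_normalised(a, b):
    """My attempted correction: scale the Jaccard score by the
    natural log of the average length of the two inputs."""
    avg_len = (len(a) + len(b)) / 2
    return math.log(avg_len) * jaccard(a, b)

list1 = ['red', 'yellow', 'blue']
doc1 = ['yellow', 'green', 'red']
list2 = ['red', 'yellow', 'green', 'black', 'purple', 'brown']
doc2 = ['yellow', 'red', 'blue', 'grey', 'pink', 'pale', 'colours', 'indigo']

print(jaccard(list1, doc1), log_normalised(list1, doc1))
print(jaccard(list2, doc2), log_normalised(list2, doc2))
```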