I'm working on transforming one set of data into another based on a certain variable (length). Here's what the actual problem looks like:

list1 = ['red', 'yellow', 'blue']

doc1 = ['yellow', 'green', 'red']


list2 = ['red', 'yellow', 'green', 'black', 'purple', 'brown']

doc2 = ['yellow', 'red', 'blue', 'grey', 'pink', 'pale', 'colours', 'indigo']

Jaccard similarity between list1 and doc1 gives a score of 0.667.
Jaccard similarity between list2 and doc2 gives a score of 0.182.
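For reference, a minimal sketch of the textbook set-based Jaccard similarity (|A ∩ B| / |A ∪ B|); depending on the exact variant used (e.g. how duplicates are treated), the scores may differ somewhat from the ones quoted above:

```python
def jaccard(a, b):
    """Set-based Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)
```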

The first comparison has two overlaps (red and yellow) and gets a higher score than the second comparison, which has the same number of overlaps. Hence the larger the size of the compared items, the smaller the similarity score, and vice versa.

My goal now is to determine a transformation/normalisation factor that will cancel out the effect of size difference and measure similarity based on actual overlap.

Here's my attempt: I multiplied each similarity score by the natural log of the average length of the compared items.

First comparison: average item length = 3, final score = log(3) * 0.667 = 0.73277
Second comparison: average item length = 7, final score = log(7) * 0.182 = 0.35416

Multiplying by the log of the items' length favours longer items, thus reducing the difference in scores that results from different sizes (lengths).
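In code, this scaling step looks like the following (a minimal sketch; the function name is mine):

```python
import math

def length_adjusted(jaccard_score, a, b):
    """Scale a Jaccard score by the natural log of the average item length."""
    avg_len = (len(a) + len(b)) / 2
    return math.log(avg_len) * jaccard_score
```

For example, `length_adjusted(0.667, list1, doc1)` gives log(3) * 0.667 ≈ 0.733, matching the first comparison above.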

However, my method didn't reduce the score margin enough, so I'm looking for a method that will fully cancel out the effect of item sizes and measure similarity based on overlaps alone.

Any ideas?
