Mongo
Posted November 19, 2014

I'm working on transforming one set of data to another based on a certain variable (length). Here's what the actual problem looks like:

list1 = ['red', 'yellow', 'blue']
doc1 = ['yellow', 'green', 'red']
list2 = ['red', 'yellow', 'green', 'black', 'purple', 'brown']
doc2 = ['yellow', 'red', 'blue', 'grey', 'pink', 'pale', 'colours', 'indigo']

Jaccard similarity between list1 and doc1 gives a score of 0.667.
Jaccard similarity between list2 and doc2 gives a score of 0.182.

The first comparison has two overlaps (red and yellow) and gets a higher score than the second comparison, which has the same number of overlaps. So the larger the compared items, the smaller the similarity score, and vice versa. My goal is to find a transformation/normalisation factor that cancels out the effect of the size difference and measures similarity based on the actual overlap.

Here's my attempt: I multiplied each similarity score by the natural log of the average length of the compared items.

First comparison: average item length = 3, final score = log(3) * 0.667 = 0.73277
Second comparison: average item length = 7, final score = log(7) * 0.182 = 0.35416

Multiplying by the items' length favours longer items, which reduces the difference in scores caused by the different sizes (lengths). However, my method didn't close the gap enough, so I'm looking for a method that cancels out the effect of item size entirely and measures similarity purely by overlap. Any ideas?
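For reference, here is a minimal runnable sketch of what I'm doing, assuming a standard set-based Jaccard and the natural log (the natural log is what reproduces the 0.73277 and 0.35416 arithmetic above; a strict set-based Jaccard on these inputs prints slightly different raw scores than the figures I quoted, so my scores may come from a slightly different variant):

```python
import math

def jaccard(a, b):
    """Jaccard similarity: |A n B| / |A u B| over the deduplicated items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def log_normalised(a, b):
    """My attempted correction: scale the Jaccard score by the
    natural log of the average length of the two inputs."""
    avg_len = (len(a) + len(b)) / 2
    return math.log(avg_len) * jaccard(a, b)

list1 = ['red', 'yellow', 'blue']
doc1 = ['yellow', 'green', 'red']
list2 = ['red', 'yellow', 'green', 'black', 'purple', 'brown']
doc2 = ['yellow', 'red', 'blue', 'grey', 'pink', 'pale', 'colours', 'indigo']

print(jaccard(list1, doc1), log_normalised(list1, doc1))
print(jaccard(list2, doc2), log_normalised(list2, doc2))
```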