Distorting Probability Distributions
2016.09.04 · Data Science

This is a quick post on one way you might distort a probability distribution - in particular, the vector of probabilities associated with a multinomial distribution.

There are lots of reasons why you might want to do this. For example, say you run a website with a recommendation feature that shows a customer an item by sampling in proportion to that item's probability.

So if you have 3 items, you'd have a probability vector of size 3. Now, hopefully there's a model that, based on past interactions, can tell us the probability that a certain item is selected. For example, here our model says item A has a 50% chance of being purchased (or clicked, or whatever).

[Figure: Item Prob Plot]

But you're also worried that you're potentially overfitting and recommending item A too often, so you need to diversify the recommendations - or distort the distribution.

One way to do this is to use the softmax and borrow the temperature parameter from reinforcement learning. The one slight difference is that the incoming probabilities are first log-transformed, so that when they're run back through the exponentiation the result is the identity (if t = 1).

Here's what the function looks like...

import numpy as np

def distort(ps, t=1.):
    ps = np.log(ps) / t
    return np.exp(ps) / np.exp(ps).sum()

  • ps is the vector of probabilities
  • t is the temperature
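As a quick sanity check, here's a sketch of the function on a made-up three-item vector (the values are illustrative, not from the post's plots): t = 1 leaves the vector unchanged, t > 1 flattens it, and t < 1 sharpens it.

```python
import numpy as np

def distort(ps, t=1.):
    ps = np.log(ps) / t
    return np.exp(ps) / np.exp(ps).sum()

ps = np.array([0.5, 0.3, 0.2])

print(distort(ps, t=1.))   # unchanged: [0.5 0.3 0.2]
print(distort(ps, t=2.))   # flatter: roughly [0.42 0.32 0.26]
print(distort(ps, t=0.5))  # sharper: roughly [0.66 0.24 0.11]
```

Note that exp(log(ps) / t) is just ps ** (1 / t), so the whole thing is a power transform followed by renormalization.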

Hopefully the math is straightforward, but it is helpful to look at the various stages.
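To make the stages concrete, here's a sketch that breaks the function into its log, divide, and exponentiate steps, using an assumed three-item vector and t = 2 (both choices are just for illustration):

```python
import numpy as np

ps = np.array([0.5, 0.3, 0.2])
t = 2.0

logged = np.log(ps)        # all negative: roughly [-0.69, -1.20, -1.61]
scaled = logged / t        # pulled closer to 0, since t > 1
unnormed = np.exp(scaled)  # equivalent to ps ** (1 / t)
qs = unnormed / unnormed.sum()

print(qs)  # closer together than the original ps
```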


The key thing to notice is that when T > 1, the log-transformed numbers divided by T are pulled closer to 0, so after the exponentiation in the softmax the resulting probabilities are closer together.

When T < 1, the large probabilities get even larger relative to the small ones, and thus after the softmax, they're farther apart.

The next step is to get a sense for how the probability changes for various values of T. We'll work with two dimensions because it's easy to visualize, and since a probability distribution sums to unity, if we know one of the probabilities we can find the other.

For example, imagine the probability distribution is [0.15, 0.85].

Now, looking at the probability vs. temperature plot, there are a couple of things to notice.

  1. The vertical line is where T = 1. When T = 1, the transformation is the identity.
  2. As the temperature increases the probability tends to 0.5, or in N dimensions, 1/N. This is how the parameter gets its name, temperature: as the temperature increases, the probabilities of the different states tend to converge.
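Here's a sketch of that sweep for the [0.15, 0.85] example; the temperature values are arbitrary, chosen just to show the convergence toward 1/N = 0.5.

```python
import numpy as np

def distort(ps, t=1.):
    ps = np.log(ps) / t
    return np.exp(ps) / np.exp(ps).sum()

ps = np.array([0.15, 0.85])

# track the first probability as temperature grows:
# it starts at 0.15 when t = 1 and climbs toward 0.5
for t in [0.25, 0.5, 1.0, 2.0, 10.0, 100.0]:
    print(t, distort(ps, t)[0])
```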


That's pretty neat if I do say so myself. However, plotting it is only really useful in low dimensions, and particularly easy in two.

Also, it doesn't give a measurement of the change, so the next question is how to measure it. Since we have created two distributions, the base one from the model and our distorted version at some temperature, the KL divergence between the two is a natural measurement. Because we're distorting our vector p into some new vector q, we don't really care how far p is from q; we care how far q is from p.

To investigate the KL divergence, let's create a new probability vector in 10D space.

ps = np.array([1.40763181e-04,   7.16806295e-09,
               3.57692841e-01,   1.56331228e-06,
               8.85156502e-04,   3.13365979e-01,
               2.49412860e-36,   3.27913690e-01,
               2.00958248e-57,   5.11916402e-11])

[Figure: New Probs]

Then we can calculate the KL divergence between ps, the base distribution, and qs, the new distribution, and plot this divergence for different values of the temperature.

[Figure: New Probs Entropy]

The obvious thing to point out is that the KL divergence is minimized when the distributions are the same, which in this case is at T = 1.

In summary, we looked at how to distort a vector of probabilities, why we might want to do that, and how to quantify the amount of change due to that distortion via the KL divergence.