This is a quick post on one way you might distort a probability distribution - and in
particular the set of probabilities associated with a Multinomial distribution.
There are lots of reasons why you might want to do this. For example, say you
run a website and are making item recommendations, where the feature shows a
customer an item with probability equal to that item's predicted probability of
being selected.
So if you have 3 items, you'd have a probability vector of size 3. Now,
hopefully there's a model that, based on past interactions, can tell us the
probability that a certain item is selected. For example, here our model says
item A has a 50% chance of being purchased (or clicked, or whatever).
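As a quick sketch of that setup (the item names and probabilities here are made
up for illustration, not the output of any real model), sampling a recommendation
at the model's probabilities might look like:

import numpy as np

items = np.array(["A", "B", "C"])
probs = np.array([0.5, 0.3, 0.2])   # hypothetical model output; must sum to 1

# Show the customer one item, chosen in proportion to its probability.
recommended = np.random.choice(items, p=probs)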
But you're also worried that you're potentially overfitting and recommending
item A too often, so you need to diversify the recommendations - or, put
another way, distort the probability distribution.
One way to do this is to use the softmax and borrow the temperature parameter
from reinforcement learning. The one slight difference is that the incoming
probabilities are log-transformed first, so that when they're run back through
the exponentiation the result is the identity (if t = 1).
Here's what the function looks like...
def distort(ps, t=1.):
    # Log-transform, scale by the temperature, then softmax back to probabilities.
    ps = np.log(ps) / t
    return np.exp(ps) / np.exp(ps).sum()
where ps is the vector of probabilities and t is the temperature.
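As a quick sanity check with a made-up 3-item vector, t = 1 should hand back
the original probabilities unchanged:

ps = np.array([0.5, 0.3, 0.2])   # hypothetical probabilities
print(distort(ps, t=1.))         # [0.5 0.3 0.2] -- unchanged, the identity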
Hopefully the math is straightforward, but it is helpful to look at what
happens on either side of T = 1.
The key thing to notice is that when T > 1, the log transformation and
division pull the numbers closer to 0 (and closer to each other), so after the
exponentiation in the softmax the resulting probabilities are closer together.
When T < 1, the large probabilities get even larger relative to the small ones,
and thus after the softmax they're farther apart.
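To make that concrete, here is the same made-up 3-item vector from above run
through distort at a higher and a lower temperature (the exact values are just
illustrative):

print(distort(ps, t=2.))    # roughly [0.42 0.32 0.26] -- pulled toward uniform
print(distort(ps, t=0.5))   # roughly [0.66 0.24 0.11] -- sharpened toward item A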
The next step is to get a sense for how the probability changes for various
values of T. We'll work with two dimensions because it's easy to visualize, and
since a probability distribution sums to unity, if we know one of the
probabilities we can find the other.
For example, imagine the probability distribution is (p, 1 - p) for some fixed
p; we can then plot each component against the temperature.
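One way to produce that plot (assuming the (p, 1 - p) setup above, a
hypothetical starting value for p, and matplotlib for the plotting) is to sweep
the temperature and apply distort at each value:

import matplotlib.pyplot as plt

p = 0.8                                    # hypothetical starting probability
temps = np.linspace(0.1, 10, 200)
curves = np.array([distort(np.array([p, 1 - p]), t) for t in temps])

plt.plot(temps, curves[:, 0], label="p")
plt.plot(temps, curves[:, 1], label="1 - p")
plt.axvline(1., linestyle="--")            # T = 1: the identity transformation
plt.xlabel("Temperature")
plt.ylabel("Distorted probability")
plt.legend()
plt.show()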
Now, looking at the probability vs temperature plot, there are a couple of
things to notice:
- The vertical line is where T = 1. When T = 1, the transformation is the
identity.
- As the temperature increases, each probability tends to 0.5, or in N
dimensions, 1/N. This is how the parameter gets its name, temperature
-- as the temperature increases, the probabilities of the different
states converge.
That's pretty neat if I do say so myself. However, plotting it is only really
useful in low dimensions, and particularly easy in two.
Also, it doesn't give a measurement of the change, so the next question
is how to measure it. Because we have created two distributions, the base one
from the model and our distorted version at some temperature, taking the KL
divergence between the two is a natural measurement. Especially because we're
distorting our vector p into some new vector q: we really don't care how far p
is from q, we care how far q is from p.
To investigate the KL divergence, let's create a new vector in 10D space.
ps = array([1.40763181e-04, 7.16806295e-09, ...])
Then we can calculate the KL divergence between
ps, the base distribution, and
qs, the new distribution, and plot this divergence for different
values of temperature.
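Here's a sketch of that calculation. Since the post's full 10-dimensional
vector is truncated above, a randomly generated stand-in base distribution is
used instead; the quantity computed is KL(qs || ps) = sum(qs * log(qs / ps)),
i.e. how far the distorted qs is from the base ps:

# Stand-in base distribution (the full 10D vector above is truncated).
rng = np.random.default_rng(0)
ps = rng.dirichlet(np.ones(10))

temps = np.linspace(0.1, 10, 200)
divergences = []
for t in temps:
    qs = distort(ps, t)                                  # distorted distribution
    divergences.append(np.sum(qs * np.log(qs / ps)))     # KL(qs || ps)

plt.plot(temps, divergences)
plt.axvline(1., linestyle="--")    # KL hits zero when T = 1 and qs == ps
plt.xlabel("Temperature")
plt.ylabel("KL divergence")
plt.show()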
The obvious thing to point out is that the KL divergence is minimized when the
two distributions are the same, or in this case, when T = 1.
In summary, we looked at how to distort a vector of probabilities, why we
might want to do that, and how to quantify the amount of change due to that
distortion via the KL divergence.