Histograms are useful, but their storage requirements can be large: each bin must be wide enough to hold the largest count it might reach, so if each bin is N bytes, the total storage is (number of bins × N) bytes.
For many applications (imaging, profiling), the exact count is not important.
Here's a way of reducing the storage required for each bin. Instead of holding a simple count, the bin contains an approximation to the logarithm to base b of the count. This logarithm is a much smaller number than the count, so it can use less storage.
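As a rough illustration of the saving (the specific numbers here are my own, not from the text above): a bin that may reach a count of one million needs about 20 bits, while storing the base-2 logarithm of that count only needs enough bits for values up to about 20, i.e. around 5 bits.

```python
import math

count = 10 ** 6                               # largest count a bin might hold
bits_exact = math.ceil(math.log2(count + 1))  # bits for the raw count: 20
n = math.ceil(math.log(count, 2))             # stored value for base b = 2: 20
bits_log = math.ceil(math.log2(n + 1))        # bits for the stored log value: 5
print(bits_exact, bits_log)
```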
In this scheme, a bin holding the value n represents a count of approximately b^n. Incrementing the bin from n to n + 1 would jump the represented count from b^n to b^(n+1), a gap of b^(n+1) − b^n events, so on each incoming event the bin should be incremented with probability:

p = 1 / (b^(n+1) − b^n) = 1 / (b^n × (b − 1))

For b = 2 this simplifies to:

p = 1 / 2^n
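A quick numerical sanity check (my addition, not part of the original demonstration): with this probability, the expected growth of the represented count per incoming event is exactly 1, which is what makes the approximation unbiased on average.

```python
# For each base b and bin value n, the represented count is b**n. An
# increment happens with probability p = 1 / (b**n * (b - 1)) and moves
# the represented count to b**(n + 1). The expected increase per event
# is therefore p * (b**(n + 1) - b**n), which should always be 1.
for b in [1.1, 2.0, 10.0]:
    for n in range(8):
        p = 1.0 / (b ** n * (b - 1))
        increase = p * (b ** (n + 1) - b ** n)
        assert abs(increase - 1.0) < 1e-9
```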
A simple demonstration using bases of 1.1, 2 and 10:
import random

class LinearHistogram:
    def __init__(self):
        self.bins = [0 for i in range(10)]

    def increment(self, i):
        self.bins[i] += 1

    def counts(self):
        return self.bins

class StochasticHistogram:
    def __init__(self, base):
        self.bins = [0 for i in range(10)]
        self.base = base

    def increment(self, i):
        # Probability of moving the bin from n to n + 1, where the
        # bin value n represents a count of base ** n.
        prob = 1.0 / ((self.base ** self.bins[i]) * (self.base - 1))
        if random.random() < prob:
            self.bins[i] += 1

    def counts(self):
        return [int(self.base ** n) for n in self.bins]

data = [random.randrange(10) for i in range(1000000)]

lh = LinearHistogram()
for d in data:
    lh.increment(d)
print('linear ', lh.counts())

for base in [1.1, 2, 10]:
    sh = StochasticHistogram(base)
    for d in data:
        sh.increment(d)
    print('stochastic %g:' % base, sh.counts())
Gives:
linear          [100398, 99812, 99592, 100007, 100130, 100025, 99895, 100272, 99970, 99899]
stochastic 1.1: [101979, 76619, 84280, 84280, 84280, 101979, 135735, 84280, 92709, 123395]
stochastic 2:   [131072, 65536, 65536, 262144, 131072, 65536, 131072, 131072, 131072, 131072]
stochastic 10:  [100000, 100000, 100000, 100000, 10000, 100000, 100000, 10000, 100000, 100000]