Fitting Poisson data
====================

Data from Poisson processes, such as the number of counts per unit time
or counts per unit area, do not have the same pattern of uncertainties
as data from Gaussian processes.  Poisson data consist of non-negative
integer counts occurring at some underlying rate.  The fitting process
checks whether the number of counts observed is consistent with the
proposed rate for each point in the dataset, much as the fitting process
for Gaussian data checks whether the observed value is consistent with
the proposed value within the measurement uncertainty.

Using :class:`bumps.curve.PoissonCurve` instead of :class:`bumps.curve.Curve`,
we can fit a set of *counts* at conditions *x* using a function
*f(x, p1, p2, ...)* that proposes rates for the various *x* values given the
parameters, yielding parameter values *p1, p2, ...* that are most consistent
with the *counts* at *x*.  When measuring Poisson processes, the underlying
rate is not known, so the measurement variance, which is a property of the
rate, is associated not with the data but with the theory function that
predicts the rates.  This is the opposite of Gaussian data, where the
uncertainty is associated with the measurement device, and it explains why
the call to PoissonCurve accepts only *x* and *counts*, not *x*, *y*, and *dy*.
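
As a rough illustration of what such a fit optimizes, the Poisson
log-likelihood of observing *counts* given proposed rates can be written
down directly.  This is only a sketch of the idea (assuming scipy is
available); the actual implementation inside
:class:`bumps.curve.PoissonCurve` may normalize the likelihood differently.

::

    import numpy as np
    from scipy.special import gammaln

    def poisson_nllf(counts, rates):
        # -log P(k|rate) = rate - k log rate + log k!, summed over points
        counts, rates = np.asarray(counts, 'd'), np.asarray(rates, 'd')
        return np.sum(rates - counts*np.log(rates) + gammaln(counts + 1))

    # Counts of [0, 2, 5] are more probable at rates [0.2, 2, 5] than at a
    # constant rate of 5, so the first call returns the smaller value.
    print(poisson_nllf([0, 2, 5], [0.2, 2.0, 5.0]))
    print(poisson_nllf([0, 2, 5], [5.0, 5.0, 5.0]))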

One property of the Poisson distribution is that it is well approximated
by a Gaussian distribution for counts above about 10.  It will never be a
perfect match, since numbers drawn from a Poisson distribution can never be
negative, whereas Gaussian numbers can always be negative, albeit with
vanishingly small probability.  Below 10, there are various ways you can
approximate the Poisson distribution with a Gaussian.
This example explores some of the options.
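
As a quick check of this claim (assuming scipy is available for the
reference distributions), we can compare the Poisson probabilities with the
Gaussian density of the same mean and variance at a few rates:

::

    import numpy as np
    from scipy.stats import norm, poisson

    # Largest pointwise difference between Poisson(lam) probabilities and the
    # Gaussian density with the same mean and variance; the mismatch shrinks
    # as lam grows, and is already small above about 10.
    for lam in (2, 10, 50):
        k = np.arange(0, int(lam + 5*np.sqrt(lam)) + 1)
        gap = np.max(np.abs(poisson.pmf(k, lam) - norm.pdf(k, lam, np.sqrt(lam))))
        print(lam, gap)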

In particular, the handling of zero counts can be problematic when treating
the measurement as Gaussian.  You cannot simply drop the points with zero
counts.  Once the data have been through the various reduction steps, a
non-zero uncertainty on a zero count carries real meaning: the longer you
count, the smaller that uncertainty should become after normalizing by
counting time or monitor.  Being off by a factor of 2 in the residuals is
much better than being off by a factor of infinity from using an uncertainty
of zero, and better than dropping the point altogether.
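
For example, with a purely illustrative counting time, a zero count assigned
an uncertainty of 1 gives a normalized rate whose uncertainty keeps shrinking
as the counting time grows, whereas an uncertainty of zero would stay zero no
matter how long we counted:

::

    # rate = counts/time, with the uncertainty scaled by the same factor
    for time in (100.0, 1000.0, 10000.0):  # counting time in seconds (illustrative)
        counts, dcounts = 0.0, 1.0         # zero counts, assigned uncertainty of 1
        print(time, counts/time, dcounts/time)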

There are a few things you can do with zero counts without being
completely arbitrary:

  1) $\lambda = (k+1) \pm \sqrt{k+1}$ for all $k$
  2) $\lambda = (k+1/2) \pm \sqrt{k+1/4}$ for all $k$
  3) $\lambda = k \pm \sqrt{k+1}$ for all $k$
  4) $\lambda = k \pm \sqrt{k}$ for $k>0$, $1/2 \pm 1/2$ for $k = 0$
  5) $\lambda = k \pm \sqrt{k}$ for $k>0$, $0 \pm 1$ for $k = 0$

See the notes from the CDF Statistics Committee for details at
`<https://www-cdf.fnal.gov/physics/statistics/notes/pois_eb.txt>`_.

Of these, option 5 works slightly better for fitting, giving the best
estimate of the background.
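
For concreteness, here is a small sketch showing the Gaussian center and
uncertainty each option assigns to a few small counts.  The helper name is
made up for this illustration, but the option names match the conditions
used in the model file below.

::

    import numpy as np

    # (lambda, sigma) assigned to an observed count k under each option above
    def gaussian_approx(k):
        return {
            "expected":      (k + 1,   np.sqrt(k + 1)),     # option 1
            "pearson":       (k + 0.5, np.sqrt(k + 0.25)),  # option 2
            "expected_mle":  (k,       np.sqrt(k + 1)),     # option 3
            "pearson_zero":  (k, np.sqrt(k)) if k else (0.5, 0.5),  # option 4
            "expected_zero": (k, np.sqrt(k)) if k else (0.0, 1.0),  # option 5
        }

    for k in (0, 1, 5):
        print(k, gaussian_approx(k))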

The ideal case is to have your model produce an expected number of counts
on the detector.  It is then trivial to compute the probability of seeing
the observed counts given the expected counts, and to fit the parameters
using PoissonCurve.  Unfortunately, this means incorporating all
instrumental effects when modelling the measurement, rather than correcting
for instrumental effects in a data reduction program and fitting a common
sample model that is independent of the instrument.

Setting $\lambda = k$ is good since that is the maximum likelihood value
for $\lambda$ given observed $k$, but this breaks down at $k=0$, giving zero
uncertainty regardless of how long we measured.

Since the Poisson distribution is slightly skewed, a good estimate is
$\lambda = k+1$ (option 1 above).  This follows from the formula for the
expected value of a distribution:

.. math::

   E[x] = \int_{-\infty}^\infty x P(x) dx

For the Poisson distribution, treating the likelihood of the observed count
$k$ as a distribution over the underlying rate $\lambda$, this gives:

.. math::

   E[\lambda] = \int_0^\infty \lambda \frac{\lambda^k e^{-\lambda}}{k!} d\lambda = \frac{(k+1)!}{k!} = k + 1

Running some simulations (see `sim.py <sim.html>`_), we can see that
$\hat\lambda=(k+1)\pm\sqrt{k+1}$: here $k+1$ is the mean, and $\sqrt{k+1}$
the RMS spread, of the distribution of possible $\lambda$ values that could
give rise to the observed $k$.
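
A quick numerical check of this result, using a plain discretized version of
the integral above (independent of sim.py), gives the same answer:

::

    import numpy as np
    from math import factorial

    # Mean and RMS spread of the distribution of possible rates lambda that
    # could produce an observed count k.
    lam = np.linspace(0, 200, 200001)
    dlam = lam[1] - lam[0]
    for k in (0, 1, 5, 20):
        p = lam**k * np.exp(-lam) / factorial(k)  # P(k|lambda) over lambda
        p /= np.sum(p)*dlam                       # normalize (already ~1)
        mean = np.sum(lam*p)*dlam
        rms = np.sqrt(np.sum((lam - mean)**2 * p)*dlam)
        print(k, mean, rms, "expect", k + 1, np.sqrt(k + 1))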

The current practice is to use $\hat\lambda=k\pm\sqrt{k}$. Convincing the
world to accept $\lambda = k+1$ would be challenging since the expected
value is not the most likely value.  As a compromise, one can use $0 \pm 1$
for zero counts, and $k \pm \sqrt{k}$ for other values. This provides a
reasonable estimate for the uncertainty on zero counts, which after
normalization becomes smaller for longer counting times or higher incident
flux.

Another option is to choose the center and bounds so that the
uncertainty covers the $1-\sigma$ region of the distribution (68%).  A
simple approximation which does this is $(k+1/2) \pm \sqrt{k+1/4}$.
Again, this would be hard to convince the world to adopt, so one could
compromise and use $1/2 \pm 1/2$ for $k=0$ and $k \pm \sqrt{k}$ otherwise.
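
One way to check the coverage (a sketch, interpreting the distribution as
the spread of possible rates $\lambda$ for an observed count $k$, which for
a flat prior is a Gamma distribution, and assuming scipy is available):

::

    import numpy as np
    from scipy.stats import gamma

    # Fraction of the possible rates lambda (given observed k) falling inside
    # (k + 1/2) +/- sqrt(k + 1/4); the coverage stays in the neighbourhood of
    # the Gaussian 1-sigma value of 68%, even for small k.
    for k in (0, 1, 2, 5, 20):
        center, sigma = k + 0.5, np.sqrt(k + 0.25)
        dist = gamma(k + 1)  # lambda^k exp(-lambda)/k! is a Gamma(k+1) density
        lo, hi = max(center - sigma, 0.0), center + sigma
        print(k, dist.cdf(hi) - dist.cdf(lo))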

What follows is a model which allows us to fit a simulated peak using
these various definitions of $\lambda$ and see which version best recovers
the true parameters which generated the peak.

::

    from bumps.names import *
    
Define the peak shape.  We are using a simple Gaussian with center, width,
scale and background.

::

    def peak(x, scale, center, width, background):
        return scale*np.exp(-0.5*(x-center)**2/width**2) + background
    
Generate simulated peak data with Poisson noise.  When running the fit,
you can choose various values for the peak intensity.  We are using a
large number of points so that the peak is highly constrained by the
data, and the returned parameters are consistent from run to run.  Real
data is likely not so heavily sampled.

::

    x = np.linspace(5, 20, 345)
    #y = np.random.poisson(peak(x, 1000, 12, 1.0, 1))
    #y = np.random.poisson(peak(x, 300, 12, 1.5, 1))
    y = np.random.poisson(peak(x, 3, 12, 1.5, 1))
    
Define the various conditions.  These can be selected on the command
line by listing the condition name after the model file.  Note that
bumps will make any option not preceded by "-" available to the model
file as elements of *sys.argv*.  *sys.argv[0]* is the model file itself.

The options correspond to the five options listed above, with an additional
option "poisson" which is used to select PoissonCurve rather than Curve
in the fit.

::

    cond = sys.argv[1] if len(sys.argv) > 1 else "pearson"
    if cond == "poisson": # option 0: use PoissonCurve rather than Curve to fit
        pass
    elif cond == "expected": # option 1: L = (y+1) +/- sqrt(y+1)
        y += 1
        dy = np.sqrt(y)
    elif cond == "pearson": # option 2: L = (y + 0.5)  +/- sqrt(y + 1/4)
        dy = np.sqrt(y+0.25)
        y = y + 0.5
    elif cond == "expected_mle": # option 3: L = y +/- sqrt(y+1)
        dy = np.sqrt(y+1)
    elif cond == "pearson_zero": # option 4: L = y +/- sqrt(y); L[0] = 0.5 +/- 0.5
        dy = np.sqrt(y)
        y = np.asarray(y, 'd')
        y[y == 0] = 0.5
        dy[y == 0] = 0.5
    elif cond=="expected_zero": # option 5: L = y +/- sqrt(y);  L[0] = 0 +/- 1
        dy = np.sqrt(y)
        dy[y == 0] = 1.0
    else:
        raise RuntimeError("Need to select uncertainty: pearson, pearson_zero, expected, expected_zero, expected_mle, poisson")
    
Build the fitter, and set the range on the fit parameters.

::

    if cond == "poisson":
        M = PoissonCurve(peak, x, y, scale=1, center=2, width=2, background=0)
    else:
        M = Curve(peak, x, y, dy, scale=1, center=2, width=2, background=0)
    dx = max(x)-min(x)
    M.scale.range(0, max(y)*1.5)
    M.center.range(min(x)-0.2*dx, max(x)+0.2*dx)
    M.width.range(0, 0.7*dx)
    M.background.range(0, max(y))
    
Set the fit problem as usual.

::

    problem = FitProblem(M)
    
We can now load and run the fit.  Be sure to replace COND with one of the
conditions defined above:

.. parsed-literal::

   $ bumps.py poisson.py --fit=dream --burn=600 --store=/tmp/T1 COND

Comparing the results for the various conditions, we can see that all methods
yield a good fit to the underlying center, scale and width.  It is only the
background that causes problems.  Using Poisson statistics for the fit gives
the proper background estimate, and using $\lambda = k \pm \sqrt{k}$ for $k>0$
with $0 \pm 1$ for $k=0$ (option 5 above) gives the best Gaussian
approximation.

.. table:: Fit results

    = ============= =========================
    # method        background (true value 1)
    = ============= =========================
    0 poisson       1.0
    1 expected      1.55
    2 pearson       0.16
    3 expected_mle  0.55
    4 pearson_zero  0.34
    5 expected_zero 0.75
    = ============= =========================


.. only:: html

   Download: :download:`poisson.py <poisson.py>`.
