
This notebook aims to provide basic examples of how to run a variety of MCMC and nested sampling codes in Python. It won't go into detail about MCMC methods in general (some background knowledge is assumed), nor will it discuss all the various bells-and-whistles that each sampler can use, but it will hopefully help people get started with writing a sampler code using an understandable (albeit "toy") example. A very nice, quite similar, article on this topic by Jake VanderPlas can be found here. There are also very good online notes from the Computational Statistics and Statistical Computing course at Duke University by Cliburn Chan & Janice McCarthy, covering some of this material (including PyMC3 and PyStan examples that are very similar to what is covered here). Section 2 of Brewer & Foreman-Mackey (2016) gives a nice summary of the advantages and disadvantages of some of the algorithms shown here. A good general description of fitting a model to data in a Bayesian formalism, including various useful examples, can be found in Hogg, Bovy & Lang (2010).

The code within this notebook is designed to work with Python 3. An older version that should be compatible with Python 2 can be found here.

You can jump straight to the examples for the different samplers with the links below:

MCMC samplers :

  • emcee
  • PyMC3
  • TensorFlow Probability
  • Zeus
  • PyStan
  • PyJAGS
Nested samplers :

  • Nestle
  • CPNest
  • dynesty
  • UltraNest
  • PyMultiNest
  • DNest4
  • PyPolyChord
An example using the bilby package, which provides a unified interface to a variety of different samplers, is also provided.

    Background

    MCMC methods are most often used in the context of Bayesian parameter estimation, e.g., you have a model, $y$, defined by a set of parameters, $\vec{\theta}$, with certain prior probability distributions $p(\vec{\theta}|I)$, and you want to efficiently sample from the marginalised posterior probability distributions on those parameters given some data, $\mathbf{d}$, and a particular likelihood function, $p(\mathbf{d}|\vec{\theta},I)$ ($I$ is just a substitute for all the implicit assumptions that have gone into defining our model). From Bayes' theorem we have the joint posterior on the whole set of parameters,

    $$ p(\vec{\theta}|\mathbf{d},I) = \frac{p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I)}{p(\mathbf{d}|I)}, $$

    with marginalised posteriors on an individual parameter, say $\theta_1$, given by nested integrals over all other parameters,

    $$ p(\theta_1|\mathbf{d},I) = \frac{1}{p(\mathbf{d}|I)}\int^{\forall \theta_i \in (\vec{\theta}\neq \theta_1)} p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I) {\rm d}{\theta_i}, $$

    ($\forall \theta_i \in (\vec{\theta}\neq \theta_1)$ can be translated into "for all parameters $\theta_i$ in $\vec{\theta}$ that are not equal to $\theta_1$"). $p(\mathbf{d}|I)$ is the marginal likelihood, or evidence, for the data given the particular model used, and is given by

    $$ p(\mathbf{d}|I) = \int^{\forall \theta_i \in \vec{\theta}} p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I) {\rm d}{\theta_i}, $$

    which is a normalising constant for the posterior. It will be discussed more below in relation to nested sampling .

    This is the only context I (and probably many people in the physical sciences) have encountered using MCMC methods, and is the context I will use below.

    Toy model

    The very basic example that I will use to demonstrate coding up the samplers is that of sampling from the posterior of two parameters, $m$ and $c$, defining the model:

    $$ y(\mathbf{x};m,c) = m\mathbf{x} + c, $$

    where $\mathbf{x}$ is a vector of values of $x$. This is basically fitting the gradient and $y$-intercept of a straight line, which should be fairly familiar as linear regression, and is obviously a solved problem (assuming uniform priors!) using, e.g., least squares fitting. But, this provides a simple example that can hopefully be extended to more realistic scenarios.

    Our aim is to produce samples drawn from the posterior $p(m, c|\mathbf{d},I)$, which can be used (by, e.g., histogramming the sample values) to approximate the marginal posteriors $p(m|\mathbf{d},I)$ and $p(c|\mathbf{d},I)$.

    The vector of data samples, $\mathbf{d}$, will consist of the model defined at a set of $M$ values, $\mathbf{x}$, with additive Gaussian noise of known standard deviation, i.e., data point $i$ will be defined by

    $$ d_i = y(x_i;m,c) + n_i, $$

    with

    $$ n_i \sim N(0, \sigma_i), $$

    which means the noise is drawn from a Gaussian (or Normal) distribution of zero mean and standard deviation of $\sigma_i$.

    Note : I'll generally use the term Gaussian distribution rather than "Normal" distribution, as this is what I first heard it called. But, in many packages discussed below "Normal" is the more common name for the distribution.

    Setting the likelihood

    A sensible and fairly standard likelihood function (due to the Central Limit Theorem and maximum entropy arguments ) for a single data point given values of $m$ and $c$ and a known noise standard deviation, will be given by a Gaussian distribution of the form

    $$ p(d_i|m,c,I) = \left(2\pi\sigma_i^2\right)^{-1/2} \exp{\left(-\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2} \right)}, $$

    and the joint likelihood for the whole dataset of $M$ points (assuming independent noise) will be the product of the individual likelihoods

    $$ p(\mathbf{d}|m,c,I) = \prod_{i=1}^M p(d_i|m,c,I) = \left(2\pi\right)^{-M/2}\left(\prod_{i=1}^M \sigma_i^{-1} \right) \exp{\left(-\sum_{i=1}^M\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2} \right)}. $$

    When computing likelihoods, numerical considerations mean that one almost always works with the natural logarithm of the likelihood, so we have

    $$ \ln{p(\mathbf{d}|m,c,I)} \equiv \log{L} = -\frac{M}{2}\ln{(2\pi)} - \ln\left(\prod_{i=1}^M \sigma_i\right) - \sum_{i=1}^M\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2}. $$

    For many cases we can ignore the terms that do not include our required parameters when sampling from their marginalised likelihoods (as they will just be constants), so often the log-likelihood can be set to be

    $$ \log{L} = - \sum_{i=1}^M\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2}. $$

    Note : If wanting to evaluate the marginal likelihood for the data (e.g., using nested sampling, which is discussed later) to use it for model comparison , the constants may be required.

    Setting the prior

    In addition to the likelihood, we need to define prior distributions on the parameters. To demonstrate a couple of different, and reasonably common, priors, I'll use a Gaussian probability distribution for the prior on $m$ and a uniform probability distribution for the prior on $c$:

    $$ p(m|\mu_m,\sigma_m,I) = \frac{1}{\sqrt{2\pi\sigma_m^2}}\exp{\left(-\frac{(m-\mu_m)^2}{2\sigma_m^2}\right)}, $$

    and

    $$ p(c|c_{\rm min}, c_{\rm max}) = \Bigg\{\begin{array}{cl} \frac{1}{c_{\rm max}-c_{\rm min}} & \text{if}~c_{\rm min} < c < c_{\rm max}, \\ 0 & \text{otherwise}. \end{array} $$

    In our example we'll set:

  • $\mu_m = 0$ and $\sigma_m = 10$,
  • $c_{\rm min} = -10$ and $c_{\rm max} = 10$.

    Note : These priors are not intended to be the least informative for the parameterisation that we have, but are just chosen for demonstration purposes to show how to code up two different distributions. A better prior for this particular problem might be the one discussed here, which could be coded up for PyMC3 using the example here.
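
    Purely as an illustration, the short (self-contained) snippet below plots these two prior probability density functions using scipy.stats, with the parameter values listed above:

    from scipy.stats import norm, uniform
    from matplotlib import pyplot as pl
    import numpy as np

    mvals = np.linspace(-40., 40., 500)  # range over which to show the m prior
    cvals = np.linspace(-12., 12., 500)  # range over which to show the c prior

    fig, (ax1, ax2) = pl.subplots(1, 2, figsize=(10, 4))
    ax1.plot(mvals, norm.pdf(mvals, loc=0., scale=10.))       # Gaussian prior on m
    ax1.set_xlabel(r'$m$')
    ax2.plot(cvals, uniform.pdf(cvals, loc=-10., scale=20.))  # uniform prior on c (over [-10, 10])
    ax2.set_xlabel(r'$c$')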

    Creating the data

    So, let's get started creating our data. I'll (arbitrarily) use:

  • $m = 3.5$,
  • $c = 1.2$,
    for the model and

  • $\mathbf{x} \in [0, 10)$ in $M=50$ uniform steps ($[a,b)$ means inclusive of $a$, but exclusive of $b$ ),
  • $\sigma_i = \sigma = 2$ (i.e., each noise point is independent, but has the same standard deviation),
    for the data.

    import numpy as np   # import numpy
    from time import time # use for timing functions

    # useful modules!
    import os
    import sys

    # make the plots look a bit nicer with some defaults
    from matplotlib import pyplot as pl # import pyplot from matplotlib
    import matplotlib as mpl
    rcparams = {}
    rcparams['axes.linewidth'] = 0.5
    rcparams['font.family'] = 'serif'
    rcparams['font.size'] = 22
    rcparams['legend.fontsize'] = 16
    rcparams['mathtext.fontset'] = "stix"

    # functions for plotting posteriors
    import corner
    from scipy.stats import gaussian_kde
    # set the true values of the model parameters for creating the data
    m = 3.5 # gradient of the line
    c = 1.2 # y-intercept of the line
    # set the "predictor variable"/abscissa
    M = 50
    xmin = 0.
    xmax = 10.
    stepsize = (xmax - xmin) / M
    x = np.arange(xmin, xmax, stepsize)
    # define the model function
    def straight_line(x, m, c):
        """
        A straight line model: y = m*x + c

        Args:
            x (list): a set of abscissa points at which the model is defined
            m (float): the gradient of the line
            c (float): the y-intercept of the line
        """
        return m * x + c
    # seed our random number generator, so we have reproducible data
    np.random.seed(sum([ord(v) for v in 'samplers']))
    # create the data - the model plus Gaussian noise
    sigma = 2.0  # standard deviation of the noise
    data = straight_line(x, m, c) + np.random.normal(scale=sigma, size=M)
    # plot the data
    mpl.rcParams.update(rcparams) # update plot parameters
    fig, ax = pl.subplots(figsize=(9,6))
    ax.plot(x, data, 'bo', alpha=0.5, label='data')
    ax.plot(x, straight_line(x, m, c), 'r-', lw=2, label='model')
    ax.legend()
    ax.set_xlim([xmin, xmax])
    ax.set_xlabel(r'$x$');
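
    As a quick sanity check, the least squares solution mentioned above can be obtained directly with numpy's polyfit, and should land close to the true values of $m = 3.5$ and $c = 1.2$:

    # frequentist least squares fit of a degree-1 polynomial, purely for comparison
    mhat, chat = np.polyfit(x, data, 1)
    print("Least squares estimates: m = {:.2f}, c = {:.2f}".format(mhat, chat))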
    

    MCMC samplers

    First up I'll deal with MCMC samplers that are purely written in Python, then a couple that are wrappers to other libraries.

    emcee

    emcee (Foreman-Mackey et al, 2013) is a Python MCMC implementation that uses an affine invariant ensemble sampler (Goodman & Weare, 2010). This basically means that it doesn't just evolve a single point in the model parameter space, but evolves an ensemble of points, and steps in the evolution are tuned based on the current state of the ensemble. The steps scale naturally (the affine invariant bit) as the ensemble closes in on the posterior. This code has been well used in the astrophysics community (and as an astrophysicist myself I'm showing it as my first example!).

    emcee is available on PyPI and is installable via pip with:

    pip install emcee
    

    or via Conda with:

    conda install -c conda-forge emcee
    

    The source code is available on GitHub here. The example here is very similar to the line model example given in the emcee documentation.

    First up, we need to define the likelihood function, prior functions, and posterior probability function. These all need to be defined as the natural logarithms of the functions. All the functions take in a sample tuple or list, where a sample is a vector $\vec{\theta}$ of parameter values, that can be unpacked to give the individual parameter values. Other arguments to the posterior function can be user defined.

    Posterior

    As we do not care about the marginal likelihood, the posterior will just be the product of the likelihood and prior, and will take the following form:

    def logposterior(theta, data, sigma, x):
        """
        The natural logarithm of the joint posterior.

        Args:
            theta (tuple): a sample containing individual parameter values
            data (list): the set of data/observations
            sigma (float): the standard deviation of the data points
            x (list): the abscissa values at which the data/model is defined
        """
        lp = logprior(theta) # get the prior

        # if the prior is not finite return a probability of zero (log probability of -inf)
        if not np.isfinite(lp):
            return -np.inf

        # return the likelihood times the prior (log likelihood plus the log prior)
        return lp + loglikelihood(theta, data, sigma, x)
    
    def loglikelihood(theta, data, sigma, x):
        """
        The natural logarithm of the joint Gaussian likelihood.

        Args:
            theta (tuple): a sample containing individual parameter values
            data (list): the set of data/observations
            sigma (float): the standard deviation of the data points
            x (list): the abscissa values at which the data/model is defined

        Note:
            We do not include the normalisation constants (as discussed above).
        """
        # unpack the model parameters from the tuple
        m, c = theta

        # evaluate the model (assumes that the straight_line model is defined as above)
        md = straight_line(x, m, c)

        # return the log likelihood
        return -0.5 * np.sum(((md - data) / sigma)**2)

    def logprior(theta):
        """
        The natural logarithm of the prior probability.

        Args:
            theta (tuple): a sample containing individual parameter values

        Note:
            We can ignore the normalisations of the prior here.
        """
        # unpack the model parameters from the tuple
        m, c = theta

        # uniform prior on c
        cmin = -10. # lower range of prior
        cmax = 10.  # upper range of prior

        # set prior to 1 (log prior to 0) if in the range and zero (-inf) outside the range
        # (we don't care about it being properly normalised, but you can if you want)
        lp = 0. if cmin < c < cmax else -np.inf

        # Gaussian prior on m
        mmu = 0.     # mean of the Gaussian prior
        msigma = 10. # standard deviation of the Gaussian prior
        lp -= 0.5 * ((m - mmu) / msigma)**2

        return lp
    

    MCMC set up

    We need to decide on an initial number of ensemble points and initialise the samples, i.e., set starting points for them. The initial samples can be drawn from the prior distributions (in the emcee documentation example it starts the opposite way round, i.e., with initial samples tightly packed around the true value). We'll choose 100 ensemble points (the number of so-called walkers) and initialise them with:

    Nens = 100   # number of ensemble points (walkers)

    mmu = 0.     # mean of the Gaussian prior
    msigma = 10. # standard deviation of the Gaussian prior

    mini = np.random.normal(mmu, msigma, Nens) # initial m points

    cmin = -10.  # lower range of prior
    cmax = 10.   # upper range of prior

    cini = np.random.uniform(cmin, cmax, Nens) # initial c points

    inisamples = np.array([mini, cini]).T # initial samples

    ndims = inisamples.shape[1] # number of parameters/dimensions

    Because our initial samples are drawn from the prior it can take time for them to converge on sampling from the posterior distributions that we want. To try and get around this we can run the sampler with a burn-in period (although see, e.g., here or here for some discussions about whether burn-ins are actually necessary), and we will ignore samples during this burn-in.

    Note: Alternative options could be to use some optimisation routine to try and find the posterior mode and start samples at that point, or use all samples, work out an autocorrelation length and thin the samples accordingly so they are uncorrelated (as shown a bit later), which gets rid of correlations during the initial convergence of the samples.
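
    As a rough sketch of the first of those alternatives, the posterior mode could be located with an optimiser (here scipy's minimize, starting from an arbitrary in-bounds guess) and used as a starting point:

    from scipy.optimize import minimize

    # minimise the negative log-posterior to estimate the posterior mode
    neglogpost = lambda theta: -logposterior(theta, data, sigma, x)
    optres = minimize(neglogpost, x0=[1., 1.], method="Nelder-Mead")  # x0 is an arbitrary in-bounds guess
    print("Posterior mode estimate: m = {:.2f}, c = {:.2f}".format(*optres.x))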

    We'll choose a burn-in of 500 samples, and then run the chain for another 500 samples, which will give an output with $(500+500)\times 100 = 100000$ samples.

    import emcee # import the emcee package
    print('emcee version: {}'.format(emcee.__version__))
    # for bookkeeping set number of likelihood calls to zero
    loglikelihood.ncalls = 0

    # set additional args for the posterior (the data, the noise std. dev., and the abscissa)
    argslist = (data, sigma, x)

    Nburnin = 500   # number of burn-in samples
    Nsamples = 500  # number of final posterior samples

    # set up the sampler
    sampler = emcee.EnsembleSampler(Nens, ndims, logposterior, args=argslist)
    
    # pass the initial samples and total number of samples required
    t0 = time() # start time
    sampler.run_mcmc(inisamples, Nsamples + Nburnin);
    t1 = time()
    timeemcee = (t1-t0)
    print("Time taken to run 'emcee' is {} seconds".format(timeemcee))
    # extract the samples (removing the burn-in)
    samples_emcee = sampler.get_chain(flat=True, discard=Nburnin)
    def plotposts(samples, **kwargs):
        """
        Function to plot posteriors using corner.py and scipy's gaussian KDE function.
        """
        if "truths" not in kwargs:
            kwargs["truths"] = [m, c]

        fig = corner.corner(samples, labels=[r'$m$', r'$c$'], hist_kwargs={'density': True}, **kwargs)

        # plot KDE smoothed version of distributions
        for axidx, samps in zip([0, 3], samples.T):
            kde = gaussian_kde(samps)
            xvals = fig.axes[axidx].get_xlim()
            xvals = np.linspace(xvals[0], xvals[1], 100)
            fig.axes[axidx].plot(xvals, kde(xvals), color='firebrick')
    # make plot
    plotposts(samples_emcee)
    resdict = {}
    resdict['memcee_mu'] = np.mean(samples_emcee[:,0])     # mean of m samples
    resdict['memcee_sig'] = np.std(samples_emcee[:,0])     # standard deviation of m samples
    resdict['cemcee_mu'] = np.mean(samples_emcee[:,1])     # mean of c samples
    resdict['cemcee_sig'] = np.std(samples_emcee[:,1])     # standard deviation of c samples
    resdict['ccemcee'] = np.corrcoef(samples_emcee.T)[0,1] # correlation coefficient between parameters
    

    The time to convergence can depend greatly on how the initial points are set up, so if you have a better guess as to the location and shape of the posterior, then using those guesses is advised. But, beware overtuning! emcee has some handy utilities to draw samples from around an initial guess. In the example below I show the use of the (now deprecated) sample_ball() function, which creates a ball of samples assuming no correlations between parameters. However, if you knew that your parameter uncertainties had a particular covariance, you could use sample_ellipsoid() to generate a sample cloud using a supplied covariance matrix.

    from emcee.utils import sample_ball
    m_guess, c_guess = 1., 1. # guess at the mean of the distribution
    m_std, c_std = 1., 1.     # guess at the width of the distribution
    # generate samples from a ball based on those guesses
    inisamples = sample_ball((m_guess, c_guess), (m_std, c_std), size=Nens)
    # pass the initial samples and total number of samples required
    t0 = time() # start time
    sampler.run_mcmc(inisamples, Nsamples+Nburnin);
    t1 = time()
    timeemcee = (t1-t0)
    print("Time taken to run 'emcee' is {} seconds".format(timeemcee))
    # extract the samples (removing the burn-in)
    samples_emcee = sampler.get_chain(flat=True, discard=Nburnin)
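
    Since sample_ball() is deprecated, an equivalent ball of initial samples can instead be drawn directly with numpy (a simple sketch), which also lets you supply a covariance between the parameters if you know one:

    # draw the initial walker positions from a multivariate Gaussian ball
    guess = np.array([1., 1.])  # guesses at the means of m and c
    cov = np.diag([1., 1.])**2  # diagonal covariance matrix (no correlations assumed)
    inisamples = np.random.multivariate_normal(guess, cov, size=Nens)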
    

    Once the sampler has settled, the samples that come out of the MCMC will be drawn from the posterior. However, even in the best of situations, draws that are adjacent are often correlated with each other. The number of "independent" samples, that is, fully independent draws from the posterior, can be much smaller than the number of samples in hand.

    The emcee EnsembleSampler class has an inbuilt function, get_autocorr_time(), to calculate the auto-correlation length (ACL) (or time) - this is an estimate of the number of samples between independent samples in the chain. You can use this number to estimate the convergence of chains, and also to thin out chains to produce independent samples.

    m_acl, c_acl = sampler.get_autocorr_time(quiet=True)
    print("The autocorrelation length for m is {0} and c is {1}".format(m_acl, c_acl))
    # thin out the chain
    samples_emcee = sampler.get_chain(flat=True, discard=Nburnin, thin=int(max([m_acl, c_acl])))
    print("Number of independent samples is {}".format(len(samples_emcee)))
    resdict['esemcee'] = len(samples_emcee)
    resdict['essemcee'] = int(resdict['esemcee'] / timeemcee)
    print("Effective samples per second: {}".format(resdict['essemcee']))
    # plot the resulting posteriors
    plotposts(samples_emcee)
    
    WARNING:root:The chain is shorter than 50 times the integrated autocorrelation time for 2 parameter(s). Use this estimate with caution and run a longer chain!
    N/50 = 20;
    tau: [23.98692287 25.13038681]
    
    The autocorrelation length for m is 23.986922869990487 and c is 25.130386813726226
    Number of independent samples is 2000
    Effective samples per second: 1059
    

    In practice, if you have a minimum number of independent samples that you require then you can repeatedly run the sampler, checking the ACL each time (and maybe adjusting the c input parameter for get_autocorr_time()), until you have collected the required number.
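
    A minimal sketch of that procedure might look like the following, where the target number of independent samples is just an arbitrary illustrative value:

    Ntarget = 5000  # illustrative target number of independent samples

    while True:
        # continue the chain from its previous state
        sampler.run_mcmc(None, 500)

        # re-estimate the autocorrelation length (quiet=True avoids short-chain exceptions)
        acl = max(1, int(np.max(sampler.get_autocorr_time(quiet=True))))

        # count independent samples after discarding the burn-in and thinning by the ACL
        nindep = len(sampler.get_chain(flat=True, discard=Nburnin, thin=acl))
        if nindep >= Ntarget:
            break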

    Checking the ACL is something that can be done for any of the samplers discussed in the rest of this notebook. Different sampling methods can lead to different ACLs, and methods that may take more time per sample might end up having shorter ACLs. So, examining the various trade-offs between overall speed and the final number of independent samples can be worthwhile.

    PyMC3

    PyMC3 (Salvatier, Wiecki & Fonnesbeck, 2016) is a Python MCMC implementation that can use a variety of modern sampling methods, including "No-U-turn sampling" (NUTS) (Hoffman & Gelman, 2014) and Hamiltonian Monte Carlo (Duane et al, 1987), both of which make use of the gradient of the posterior to efficiently sample the space. An example of getting started using PyMC3 can be found here, upon which this example is based.

    This code can be installed using pip with:

    pip install pymc3
    

    or with Conda with:

    conda install -c conda-forge pymc3
    

    and the source is available on GitHub here.

    The majority of distributions that can be used for priors and likelihood functions are predefined in PyMC3, so you do not have to worry about writing them yourself (although you can create your own custom distribution).

    Note that in PyMC3 the prior and likelihood distributions have to be defined within the context of a PyMC3 Model() class, and cannot be defined outside that context (the with statement). Therefore, to set up the model we can't just use the straight_line() function defined above, but can do the following:

    import pymc3 as pm # import PyMC3
    print('PyMC3 version: {}'.format(pm.__version__))
    # set the PyMC3 model
    linear_model = pm.Model()
    with linear_model:
        # set prior parameters
        cmin = -10. # lower range of uniform distribution on c
        cmax = 10.  # upper range of uniform distribution on c
        mmu = 0.     # mean of Gaussian distribution on m
        msigma = 10. # standard deviation of Gaussian distribution on m
        # set priors for unknown parameters
        cmodel = pm.Uniform('c', lower=cmin, upper=cmax) # uniform prior on y-intercept
        mmodel = pm.Normal('m', mu=mmu, sd=msigma)       # Gaussian prior on gradient
        sigmamodel = sigma # set a single standard deviation
        # Expected value of outcome, aka "the model"
        mu = mmodel*x + cmodel
        # Gaussian likelihood (sampling distribution) of observations, "data"
        Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigmamodel, observed=data)
    
    %%capture
    Nsamples = 1000 # final number of samples
    Ntune = 1000    # number of tuning samples
    # perform sampling
    t0 = time()
    with linear_model:
        trace = pm.sample(Nsamples, tune=Ntune, discard_tuned_samples=True); # perform sampling
    t1 = time()
    timepymc3 = (t1 - t0)
    
    WARNING:pymc3:The acceptance probability does not match the target. It is 0.8881386717506918, but should be close to 0.8. Try to increase the number of tuning steps.
    WARNING:pymc3:The acceptance probability does not match the target. It is 0.9037542650276801, but should be close to 0.8. Try to increase the number of tuning steps.
    
    samples_pymc3 = np.vstack((trace['m'], trace['c'])).T
    resdict['mpymc3_mu'] = np.mean(samples_pymc3[:,0])      # mean of m samples
    resdict['mpymc3_sig'] = np.std(samples_pymc3[:,0])      # standard deviation of m samples
    resdict['cpymc3_mu'] = np.mean(samples_pymc3[:,1])      # mean of c samples
    resdict['cpymc3_sig'] = np.std(samples_pymc3[:,1])      # standard deviation of c samples
    resdict['ccpymc3'] = np.corrcoef(samples_pymc3.T)[0,1]  # correlation coefficient between parameters
    plotposts(samples_pymc3)
    acts = []
    for param in ["m", "c"]:
        acts.append(emcee.autocorr.integrated_time(trace[param]))
        print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
    print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
    resdict['espymc3'] = int(len(samples_pymc3) / np.max(acts))
    resdict['esspymc3'] = int(resdict['espymc3'] / timepymc3)
    print("Effective samples per second: {}".format(resdict['esspymc3']))
    
    Autocorrelation length for 'm' is [4.57782586]
    Autocorrelation length for 'c' is [3.95745792]
    Maximum autocorrelation length is 4.6
    Effective samples per second: 190
    

    TensorFlow Probability

    TensorFlow Probability (TFP) is a probabilistic modelling framework built upon the TensorFlow library. TFP contains some samplers for performing MCMC. At the time of writing TFP is still fairly new, and currently includes a Hamiltonian MC sampler, a Metropolis-Hastings sampler, and a slice sampler.

    TFP can be installed from PyPI via pip with:

    pip install tensorflow tensorflow-probability
    

    where tensorflow must be explicitly installed as it's not a dependency of tensorflow-probability.

    Using TFP for MCMC sampling has some similarities to PyMC3, however it can be fairly opaque, and (for me at least!) non-intuitive. The example given does not come with a detailed explanation of the constructions I've used, because I don't have any in-depth knowledge of them! It was mainly cobbled together by looking at the example given here.

    Firstly, we can set up the probabilistic model (with a JointDistributionSequential class) that contains the prior distributions on the parameters of interest and a lambda function defining the likelihood function, within which we write out the straight line model for the data. Some things to be aware of are: the arguments to the lambda function defining the likelihood must be given in the opposite order to that in which those arguments are placed within the JointDistributionSequential class; and the name of the Independent distribution that holds the likelihood function should be the same as the name of the variable that holds the data.

    # import TensorFlow probability
    import tensorflow_probability as tfp
    from tensorflow_probability import distributions as tfd
    import tensorflow as tf
    cmin = -10.  # lower range of uniform distribution on c
    cmax = 10.   # upper range of uniform distribution on c
    mmu = 0.      # mean of Gaussian distribution on m
    msigma = 10.  # standard deviation of Gaussian distribution on m
    # convert x values and data to 32 bit float
    xtfp = x.astype(np.float32)  # x is being used globally here
    datatfp = data.astype(np.float32)
    # set model - contains priors and the expected linear model
    model = tfd.JointDistributionSequential([
        tfd.Normal(loc=mmu, scale=msigma, name="m"),  # m prior
        tfd.Uniform(cmin, cmax, name="c"),  # c prior
        lambda c, m: (tfd.Independent(
            tfd.Normal(  # the Normal likelihood function
                loc=(m[..., tf.newaxis] * xtfp + c[..., tf.newaxis]),  # the straight line model
                scale=sigma,  # the data standard deviation
            ),
            name="datatfp",  # the name of the variable holding the data
            reinterpreted_batch_ndims=1,
        )),
    ])

    print('TensorFlow version: {}'.format(tf.__version__))
    print('TensorFlow Probability version: {}'.format(tfp.__version__))
    
    def target_log_prob_fn(mvalue, cvalue):
        return model.log_prob(
            (mvalue, cvalue, datatfp)
        )
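
    As a quick, purely illustrative sanity check of the model construction, drawing once from the joint distribution should return prior draws of m and c together with a simulated data vector of the same length as the real data:

    # draw a single sample from the joint model: (m, c, simulated data)
    mprior, cprior, dsim = model.sample()
    print(mprior.numpy(), cprior.numpy(), dsim.numpy().shape)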
    

    We can now set up the sampler; here we'll use the Hamiltonian Monte Carlo sampler (TFP also provides a No-U-Turn sampler). For the set up of the HamiltonianMonteCarlo kernel we have to specify a step size (i.e., the size of the proposals). Some experimentation for our particular case shows a step size of 0.05 is reasonable, but it won't work for every case and must be tuned by hand. It also means that if different parameters have quite different dynamic ranges this method will not work well (it may produce jumps that are always far too big or too small for certain sets of parameters).

    Wrapping the sampling within a function (do_sampling()) with the @tf.function decorator compiles the model and means it runs far faster than without the wrapper.

    We'll draw initial values of the m and c parameters from distributions that match their priors.

    Nburn = 4000 # number of tuning samples

    # set up Hamiltonian MC (within TensorFlow function to speed up computation)
    @tf.function(autograph=False)
    def do_sampling():
        # set initial state (random draw from prior)
        qc = tf.random.uniform([], minval=cmin, maxval=cmax, name="init_c")
        qm = tf.random.normal([], stddev=msigma, mean=mmu, name="init_m")

        hmc_kernel = tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=target_log_prob_fn,
            step_size=0.05,
            num_leapfrog_steps=5,
        )

        return tfp.mcmc.sample_chain(
            num_results=Nsamples,
            current_state=[qm, qc],
            kernel=hmc_kernel,
            num_burnin_steps=Nburn,
        )

    t0 = time()
    states, kernel_results = do_sampling()
    t1 = time()

    timetfp = (t1 - t0)

    # extract the samples
    ms, cs = states

    # convert output states to numpy arrays
    samples_tfp = np.vstack((ms.numpy(), cs.numpy())).T

    num_accepted = np.sum(kernel_results.is_accepted)
    print("Acceptance rate: {}".format(num_accepted / Nsamples)) # print out acceptance rate
    /opt/conda/envs/python37/lib/python3.7/site-packages/tensorflow_probability/python/__init__.py:75: UserWarning: TensorFloat-32 matmul/conv are enabled for NVIDIA Ampere+ GPUs. The resulting loss of precision may hinder MCMC convergence. To turn off, run `tf.config.experimental.enable_tensor_float_32_execution(False)`. For more detail, see https://github.com/tensorflow/community/pull/287.
      'TensorFloat-32 matmul/conv are enabled for NVIDIA Ampere+ GPUs. The '
    /opt/conda/envs/python37/lib/python3.7/site-packages/tensorflow_probability/python/mcmc/sample.py:342: UserWarning: Tracing all kernel results by default is deprecated. Set the `trace_fn` argument to None (the future default value) or an explicit callback that traces the values you are interested in.
      warnings.warn('Tracing all kernel results by default is deprecated. Set '
    
    resdict['mtfp_mu'] = np.mean(samples_tfp[:,0])      # mean of m samples
    resdict['mtfp_sig'] = np.std(samples_tfp[:,0])      # standard deviation of m samples
    resdict['ctfp_mu'] = np.mean(samples_tfp[:,1])      # mean of c samples
    resdict['ctfp_sig'] = np.std(samples_tfp[:,1])      # standard deviation of c samples
    resdict['cctfp'] = np.corrcoef(samples_tfp.T)[0,1]  # correlation coefficient between parameters
    plotposts(samples_tfp)
    
    # calculate the autocorrelation length
    acts = []
    for i, param in enumerate(["m", "c"]):
        acts.append(emcee.autocorr.integrated_time(samples_tfp[:,i]))
        print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
    print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
    resdict['estfp'] = int(len(samples_tfp) / np.max(acts))
    resdict['esstfp'] = int(resdict['estfp'] / timetfp)
    print("Effective samples per second: {}".format(resdict['esstfp']))
    
    Autocorrelation length for 'm' is [14.8794824]
    Autocorrelation length for 'c' is [17.3901431]
    Maximum autocorrelation length is 17.4
    Effective samples per second: 47
    

    Zeus

    Zeus (Karamanis & Beutler, 2020, Karamanis, Beutler & Peacock, 2021) is a pure Python MCMC code that uses ensemble slice sampling. It is very similar in usage to emcee.

    Zeus is available via PyPI and can be installed using pip with:

    pip install zeus-mcmc
    

    or with Conda with:

    conda install -c conda-forge zeus-mcmc
    

    The Zeus source code is available on Github.

    The Zeus documentation gives a nice example of fitting a straight line. The interface is so similar to emcee that we can reuse the logprior, loglikelihood and logposterior functions that we used in the emcee case (I won't bother redefining them, so you'll have to scroll back up!). So, that just leaves the sampling. We will just leave any tunable parameters at their default values, but you can choose between a few different proposal distributions.

    import zeus # import the zeus-mcmc package

    print('Zeus version: {}'.format(zeus.__version__))

    ndim = 2        # Number of parameters/dimensions (e.g. m and c)
    Nens = 100      # for consistency with the emcee example we'll use 100 ensemble points
    Nburn = 500     # burn-in samples
    Nsamples = 500  # number of final posterior samples

    # set the initial samples (we'll draw from the prior)
    mini = np.random.normal(mmu, msigma, Nens) # initial m points
    cini = np.random.uniform(cmin, cmax, Nens) # initial c points
    inisamples = np.array([mini, cini]).T      # initial samples

    # Initialise the sampler
    sampler = zeus.EnsembleSampler(Nens, ndim, logposterior, args=[data, sigma, x])

    t0 = time()
    # Run the sampler
    sampler.run_mcmc(inisamples, Nburn + Nsamples)
    t1 = time()

    timezeus = t1 - t0

    # print summary diagnostics
    sampler.summary

    Number of Tuning Generations: 14
    Scale Factor: 0.641642
    Mean Integrated Autocorrelation Time: 3.09
    Effective Sample Size: 32373.08
    Number of Log Probability Evaluations: 586000
    Effective Samples per Log Probability Evaluation: 0.055244
    samples_zeus = sampler.get_chain(flat=True, discard=Nburn)
    resdict['mzeus_mu'] = np.mean(samples_zeus[:,0])      # mean of m samples
    resdict['mzeus_sig'] = np.std(samples_zeus[:,0])      # standard deviation of m samples
    resdict['czeus_mu'] = np.mean(samples_zeus[:,1])      # mean of c samples
    resdict['czeus_sig'] = np.std(samples_zeus[:,1])      # standard deviation of c samples
    resdict['cczeus'] = np.corrcoef(samples_zeus.T)[0,1]  # correlation coefficient between parameters
    plotposts(samples_zeus)
    
    # calculate the autocorrelation length
    acts = []
    for i, param in enumerate(["m", "c"]):
        acts.append(emcee.autocorr.integrated_time(samples_zeus[:, i]))
        print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
    print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
    resdict['eszeus'] = int(len(samples_zeus) / np.max(acts))
    resdict['esszeus'] = int(resdict['eszeus'] / timezeus)
    print("Effective samples per second: {}".format(resdict['esszeus']))
    
    Autocorrelation length for 'm' is [3.08637982]
    Autocorrelation length for 'c' is [3.09159199]
    Maximum autocorrelation length is 3.1
    Effective samples per second: 1279
    

    MCMC wrapper samplers

    There are other MCMC samplers that are not pure Python packages, but have a Python-based wrapper to a different library. With these you can't write pure Python model and probability functions, but instead have to create a code string that is then compiled and run using the underlying library functions. We will look at PyStan and PyJAGS.

    PyStan

    PyStan is a Python interface to the popular Stan library. As with PyMC3, it uses the modern NUTS and Hamiltonian MC sampling methods. It can be installed using pip with:

    pip install pystan
    

    or via Conda with:

    conda install -c conda-forge pystan
    

    and the source code is available on GitHub here.

    To define the model and distributions you need to write a code string like this:

    # model and distributions for the straight line fit
    line_code = """
    data {{
        int N;        // number of data points
        real y[N];    // observed data points
        real x[N];    // abscissa points
        real sigma;   // standard deviation
    }}
    parameters {{
        // parameters for the fit
        real m;
        real c;
    }}
    transformed parameters {{
        real theta[N];

        for (j in 1:N)
            theta[j] = m * x[j] + c; // straight line model
    }}
    model {{
        m ~ normal({mmu}, {msigma});     // prior on m (gradient)
        c ~ uniform({clower}, {cupper}); // prior on c (y-intercept)
        y ~ normal(theta, sigma);        // likelihood of the data given the model
    }}
    """

    # set the data in a dictionary
    linear_data = {
        'N': M,         # number of data points
        'y': data,      # observed data
        'x': x,         # abscissa points
        'sigma': sigma, # standard deviation
    }

    import pystan # import PyStan

    Nsamples = 1000 # set the number of iterations of the sampler
    chains = 4      # set the number of chains to run with

    # dictionary for inputs into line_code
    linedict = {}
    linedict['mmu'] = 0.0    # mean of Gaussian prior distribution for m
    linedict['msigma'] = 10  # standard deviation of Gaussian prior distribution for m
    linedict['clower'] = -10 # lower bound on uniform prior distribution for c
    linedict['cupper'] = 10  # upper bound on uniform prior distribution for c

    t0 = time()
    sm = pystan.StanModel(model_code=line_code.format(**linedict)); # compile model
    t1 = time()

    timepystancomp = (t1 - t0)

    t0 = time()
    fit = sm.sampling(data=linear_data, iter=Nsamples, chains=chains); # perform sampling
    t1 = time()

    timepystan = (t1 - t0)
    print('PyStan version: {}'.format(pystan.__version__))
    print("Time taken to compile 'PyStan' model is {} seconds".format(timepystancomp))
    print("Time taken to run 'PyStan' is {} seconds".format(timepystan))
    
    PyStan version: 2.19.1.1
    Time taken to compile 'PyStan' model is 41.173564434051514 seconds
    Time taken to run 'PyStan' is 0.36288952827453613 seconds
    
    la = fit.extract(permuted=True)  # return a dictionary of arrays
    samples_pystan = np.vstack((la['m'], la['c'])).T
    resdict['mpystan_mu'] = np.mean(samples_pystan[:,0])      # mean of m samples
    resdict['mpystan_sig'] = np.std(samples_pystan[:,0])      # standard deviation of m samples
    resdict['cpystan_mu'] = np.mean(samples_pystan[:,1])      # mean of c samples
    resdict['cpystan_sig'] = np.std(samples_pystan[:,1])      # standard deviation of c samples
    resdict['ccpystan'] = np.corrcoef(samples_pystan.T)[0,1] # correlation coefficient between parameters
    # plot using corner.py
    plotposts(samples_pystan)
    
    # calculate the autocorrelation length
    acts = []
    for i, param in enumerate(["m", "c"]):
        acts.append(emcee.autocorr.integrated_time(samples_pystan[:, i]))
        print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
    print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
    resdict['espystan'] = int(len(samples_pystan) / np.max(acts))
    resdict['esspystan'] = int(resdict['espystan'] / timepystan)
    print("Effective samples per second: {}".format(resdict['esspystan']))
    
    Autocorrelation length for 'm' is [1.21500575]
    Autocorrelation length for 'c' is [1.11721154]
    Maximum autocorrelation length is 1.2
    Effective samples per second: 341
    

    PyJAGS

    PyJAGS is another wrapper to a non-Python library, in this case the JAGS library (Plummer, 2003). As with PyStan, you have to write a code string that is then compiled and run. To use PyJAGS you need to install JAGS, e.g. on a Debian-based system this can be done using:

    sudo apt-get install jags
    

    and then install PyJAGS using pip with:

    pip install pyjags
    

    The source code for PyJAGS is available on GitHub here.

    A linear regression fitting example using PyJAGS is given here, on which this example is based.

    We can define the model string as follows (note that in this model the Gaussian/Normal distribution dnorm takes the inverse variance, i.e., the precision, as its second argument, rather than the standard deviation):

    # create the JAGS code for the linear model
    line_code_jags = """
    model {{
        for (i in 1:N) {{
            y[i] ~ dnorm(c + m * x[i], {invvar}) # Gaussian likelihood
        }}
        m ~ dnorm({mmu}, {minvvar})   # Gaussian prior on m
        c ~ dunif({clower}, {cupper}) # Uniform prior on c
    }}
    """
    # set the data
    datadict = {
        'x': x,    # abscissa points
        'N': M,    # number of data points
        'y': data, # the observed data
    }

    import pyjags # import PyJAGS
    Nsamples = 1000 # set the number of iterations of the sampler
    chains = 4      # set the number of chains to run with
    # dictionary for inputs into line_code
    linedict = {}
    linedict['mmu'] = 0.0              # mean of Gaussian prior distribution for m
    linedict['minvvar'] = 1 / 10**2    # inverse variance of Gaussian prior distribution for m
    linedict['clower'] = -10           # lower bound on uniform prior distribution for c
    linedict['cupper'] = 10            # upper bound on uniform prior distribution for c
    linedict['invvar'] = 1 / sigma**2  # inverse variance of the data
    # compile model
    t0 = time()
    model = pyjags.Model(line_code_jags.format(**linedict), data=datadict, chains=chains)
    t1 = time()
    timepyjagscomp = (t1 - t0)
    t0 = time()
    samples = model.sample(Nsamples, vars=['m', 'c']) # perform sampling
    t1 = time()
    timepyjags = (t1 - t0)
    
    Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
    NumExpr defaulting to 8 threads.
    Using JAGS library located in /usr/lib/x86_64-linux-gnu/libjags.so.4.
    Loading module basemod from /usr/lib/x86_64-linux-gnu/JAGS/modules-4/basemod.so
    Loading module bugs from /usr/lib/x86_64-linux-gnu/JAGS/modules-4/bugs.so
    Loading module lecuyer from /usr/lib/x86_64-linux-gnu/JAGS/modules-4/lecuyer.so
    
    print('PyJAGS version: {}'.format(pyjags.__version__))
    print("Time taken to compile 'PyJAGS' model is {} seconds".format(timepyjagscomp))
    print("Time taken to run 'PyJAGS' is {} seconds".format(timepyjags))

    # extract the samples
    mchainjags = samples['m'].flatten()
    cchainjags = samples['c'].flatten()
    samples_pyjags = np.vstack((mchainjags, cchainjags)).T
    resdict['mpyjags_mu'] = np.mean(samples_pyjags[:,0])      # mean of m samples
    resdict['mpyjags_sig'] = np.std(samples_pyjags[:,0])      # standard deviation of m samples
    resdict['cpyjags_mu'] = np.mean(samples_pyjags[:,1])      # mean of c samples
    resdict['cpyjags_sig'] = np.std(samples_pyjags[:,1])      # standard deviation of c samples
    resdict['ccpyjags'] = np.corrcoef(samples_pyjags.T)[0,1] # correlation coefficient between parameters
    # plot using corner.py
    plotposts(samples_pyjags)
    
    # calculate the autocorrelation length
    acts = []
    for i, param in enumerate(["m", "c"]):
        acts.append(emcee.autocorr.integrated_time(samples_pyjags[:, i]))
        print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
    print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
    resdict['espyjags'] = int(len(samples_pyjags) / np.max(acts))
    resdict['esspyjags'] = int(resdict['espyjags'] / timepyjags)
    print("Effective samples per second: {}".format(resdict['esspyjags']))
    
    Autocorrelation length for 'm' is [1.21500575]
    Autocorrelation length for 'c' is [1.11721154]
    Maximum autocorrelation length is 1.2
    Effective samples per second: 683
    

    Nested Sampling

    Nested sampling (Skilling, 2006) is a method to numerically perform the integral required to evaluate the marginal likelihood of the data given a particular model. This is not a value that is produced by most standard MCMC methods, but is a value that is very useful if wanting to do Bayesian model comparison. Above, we defined the marginal likelihood, or evidence, as

    $$ p(\mathbf{d}|I) = \int^{\forall \theta_i \in \vec{\theta}} p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I) {\rm d}{\theta_i}, $$

    but, we can make this more explicit for a given hypothesis, or model, $H_j$, by stating:

    $$ p(\mathbf{d}|H_j, I) \equiv Z = \int^{\forall \theta_i \in \vec{\theta}} p(\mathbf{d}|\vec{\theta}, H_j, I) p(\vec{\theta}|H_j,I) {\rm d}{\theta_i}, $$

    where previously the $H_j$ had been subsumed inside our implicit assumptions, $I$.

    Nested sampling algorithms generally start by drawing a set of points (sometimes called "live points", or "active points") from the prior distribution. It relies on these points being independent draws from the prior, otherwise the resulting marginal likelihood value can be biased. The likelihood is evaluated at each point, and the point with the smallest likelihood is found. This point is then removed from the set of live points (and goes into the sum calculating the marginal likelihood integral) and a new point is drawn from the prior with the constraint that it has a likelihood larger than that of the just-removed point. Drawing truly independent points from this constrained prior is the main tricky part of nested sampling (particularly when the number of parameters is not small), and a variety of methods (including MCMC methods) can be used.
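
    To make that loop a bit more concrete, here is a deliberately simplified sketch of the core algorithm; it uses brute-force rejection sampling for the constrained prior draws and ignores the final live-point contribution, so it is nothing like the far cleverer implementations in the packages below:

    def toy_nested_sampling(loglike, prior_draw, nlive=500, niter=5000):
        """
        A very naive nested sampling loop, purely for illustration.

        Args:
            loglike (callable): returns the log-likelihood of a parameter draw
            prior_draw (callable): returns a single independent draw from the prior
        """
        live = [prior_draw() for _ in range(nlive)]   # initial live points from the prior
        logls = np.array([loglike(p) for p in live])  # their log-likelihood values

        logz = -np.inf                                # running estimate of ln(Z)
        logwidth = np.log(1.0 - np.exp(-1.0 / nlive)) # ln of the first prior-volume shell

        for i in range(niter):
            worst = np.argmin(logls)                  # lowest-likelihood live point
            logz = np.logaddexp(logz, logwidth + logls[worst]) # add its evidence contribution

            # replace it with a new prior draw subject to L > L_worst (brute-force rejection)
            logl_min = logls[worst]
            while True:
                newp = prior_draw()
                newl = loglike(newp)
                if newl > logl_min:
                    break
            live[worst], logls[worst] = newp, newl

            logwidth -= 1.0 / nlive                   # shrink the expected prior volume

        return logz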

    The uncertainty on the estimate of the natural logarithm of the marginal likelihood, $\ln{Z}$, can be found if the information gain (or Kullback-Leibler divergence), $h$, in going from the prior to the posterior is calculated. If $h$ (in nats) is available then the uncertainty is (Skilling, 2006):

    $$ \sqrt{\frac{h}{N_{\rm live}}}, $$

    where $N_{\rm live}$ is the number of live points used.

    Note: This is the statistical uncertainty on the value, but if there are any biases due to the sampling not being independent draws from the prior then there can be systematic uncertainties too.

    As well as producing a value for the marginal likelihood, a by-product of the nested sampling algorithm is the production of a set of samples that can be re-sampled from to provide samples drawn from the posterior distribution.

    Note: MCMC methods can be used to estimate the marginal likelihood, e.g., emcee can be run with a parallel tempering sampling method and used to compute the evidence using thermodynamic integration (e.g., Goggans & Chi, 2004), or potentially using Slice sampling with PyMC3.

    As with MCMC, there are various codes available to perform the nested sampling algorithm. We will again look at pure Python implementations and some that are wrappers to other libraries.

    Nestle

    Nestle is a pure Python nested sampling algorithm. It has three options for the method with which it draws new samples: "classic", "single ellipsoid", and "multiple ellipsoids". In this example we'll make use of the last of these. This uses the MultiNest method (also see PyMultiNest below) (Feroz et al, 2009) in which the current live points are clustered using a recursive k-means method (however, clustering is not re-done at every iteration of the algorithm), ellipsoids completely bounding each cluster are found, an ellipsoid is randomly chosen with a probability proportional to its volume, and a new sample is drawn uniformly from it.

    Nestle is available on PyPI and installable using pip with:

    pip install nestle
    

    or via Conda with:

    conda install -c conda-forge nestle
    

    The source code is available on GitHub here.

    The set up of likelihood functions is quite similar to that for emcee, but the prior that is used is a unit hypercube (an $n$-dimensional cube with all sides equal to one), so parameters must be mapped into this space. This is simple for uniform priors, but requires a re-parameterisation using a Jacobian for more complex priors (if possible).

    For our Gaussian prior on $m$ we can't analytically re-parameterise it for a unit hypercube, however we can use inverse transform sampling to take samples from a uniform range between 0 and 1, and convert them back to a Gaussian, e.g.

    $$ m = \mu_m + \sqrt{2}\sigma_m {\rm erf}^{-1}\left(2c - 1\right), $$

    where $c \sim {\rm Uniform}(0,1)$ and ${\rm erf}^{-1}$ is the inverse error function. As shown in the Nestle documentation we can use the scipy.special.ndtri function for this.

    An example of using Nestle for the linear regression problem, but with uniform priors on both parameters, is given here.

    Setting the prior transform

    Given the above description, rather than setting a prior function, we set a prior transformation function that maps from parameters drawn from the unit hypercube into the true parameters.

    from scipy.special import ndtri # the inverse of the Normal CDF

    def prior_transform(theta):
        """
        A function defining the transform between the parameterisation in the unit hypercube
        to the true parameters.

        Args:
            theta (tuple): a tuple containing the parameters.

        Returns:
            tuple: a new tuple or array with the transformed parameters.
        """
        mprime, cprime = theta # unpack the parameters (in their unit hypercube form)

        cmin = -10. # lower bound on uniform prior on c
        cmax = 10.  # upper bound on uniform prior on c

        mmu = 0.     # mean of Gaussian prior on m
        msigma = 10. # standard deviation of Gaussian prior on m

        m = mmu + msigma * ndtri(mprime)  # convert back to m
        c = cprime * (cmax - cmin) + cmin # convert back to c

        return (m, c)

    Setting the likelihood function

    This is similar to setting the likelihood function for emcee except, as we're calculating the marginal likelihood, we will keep the normalisation constants. As before, we work with the natural logarithm of the likelihood function. The function can only take in a tuple variable containing the parameter values, so any additional required values must be hard coded, or set as global variables.

    # set the natural logarithm of 2pi, so that it doesn't have to be recalculated
    LN2PI = np.log(2.*np.pi)
    LNSIGMA = np.log(sigma) # natural log of the data noise standard deviation
    def loglikelihood_nestle(theta):
        """
        The log-likelihood function.
        """
        m, c = theta # unpack the parameters

        # normalisation
        norm = -0.5 * M * LN2PI - M * LNSIGMA

        # chi-squared (data, sigma and x are global variables defined early on in this notebook)
        chisq = np.sum(((data - straight_line(x, m, c)) / sigma)**2)

        return norm - 0.5 * chisq
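
    Since this toy problem only has two parameters, the evidence integral can also be brute-forced on a grid, which gives a rough number to compare against the nested sampling results below (this is only feasible because the parameter space is so small):

    from scipy.stats import norm as normdist, uniform

    mgrid = np.linspace(-10., 15., 500) # grid over m (covers the bulk of the integrand)
    cgrid = np.linspace(-10., 10., 500) # grid over c (the full uniform prior range)

    # log of (normalised likelihood x prior) on the grid
    logpost = np.empty((len(mgrid), len(cgrid)))
    for i, mval in enumerate(mgrid):
        for j, cval in enumerate(cgrid):
            logpost[i, j] = (
                loglikelihood_nestle((mval, cval))
                + normdist.logpdf(mval, loc=0., scale=10.)  # Gaussian prior on m
                + uniform.logpdf(cval, loc=-10., scale=20.) # uniform prior on c
            )

    # integrate exp(logpost) over the grid (subtracting the maximum for numerical safety)
    maxlp = logpost.max()
    Zgrid = np.trapz(np.trapz(np.exp(logpost - maxlp), cgrid, axis=1), mgrid)
    print("Grid-based estimate of ln(Z) = {}".format(maxlp + np.log(Zgrid)))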
    

    Now you can run the sampler. We'll use 1024 live/active points (there's nothing special about this number, and powers of two are not required). The more live points you use the longer the code will take to run, but the uncertainty on the final marginal likelihood value will decrease with greater numbers of live points.

    Unlike with MCMCs, where you give it a number of iterations to run and after that point it stops, nested sampling needs to be supplied with a stopping criterion at which point it will terminate. One common criterion is based on the ratio between an estimate of the total evidence and the current calculated evidence value. Once this value gets below some set value the algorithm will terminate.

    import nestle # import the nestle package

    print('Nestle version: {}'.format(nestle.__version__))

    nlive = 1024     # number of live points
    method = 'multi' # use the MultiNest algorithm
    ndims = 2        # two parameters
    tol = 0.1        # the stopping criterion

    t0 = time()
    res = nestle.sample(loglikelihood_nestle, prior_transform, ndims,
                        method=method, npoints=nlive, dlogz=tol)
    t1 = time()

    timenestle = (t1 - t0)

    print("Time taken to run 'Nestle' is {} seconds".format(timenestle))
    logZnestle = res.logz                         # value of logZ
    infogainnestle = res.h                        # value of the information gain in nats
    logZerrnestle = np.sqrt(infogainnestle/nlive) # estimate of the statistical uncertainty on logZ
    print("log(Z) = {} ± {}".format(logZnestle, logZerrnestle))
    
    # re-scale weights to have a maximum of one
    nweights = res.weights/np.max(res.weights)
    # get the probability of keeping a sample from the weights
    keepidx = np.where(np.random.rand(len(nweights)) < nweights)[0]
    # get the posterior samples
    samples_nestle = res.samples[keepidx,:]
    resdict['mnestle_mu'] = np.mean(samples_nestle[:,0])      # mean of m samples
    resdict['mnestle_sig'] = np.std(samples_nestle[:,0])      # standard deviation of m samples
    resdict['cnestle_mu'] = np.mean(samples_nestle[:,1])      # mean of c samples
    resdict['cnestle_sig'] = np.std(samples_nestle[:,1])      # standard deviation of c samples
    resdict['ccnestle'] = np.corrcoef(samples_nestle.T)[0,1]  # correlation coefficient between parameters
    resdict['nestle_npos'] = len(samples_nestle)              # number of posterior samples
    resdict['nestle_time'] = timenestle                       # run time
    resdict['nestle_logZ'] = logZnestle                       # log marginalised likelihood
    resdict['nestle_logZerr'] = logZerrnestle                 # uncertainty on log(Z)
    print('Number of posterior samples is {}'.format(len(samples_nestle)))
    

    With nested sampling the samples should be uncorrelated by construction (in reality, for some complex problems, some methods used to draw new samples within the nested sampling algorithm can leave correlations). During nested sampling, you can estimate the number of effective posterior samples from the full chain of samples using, for example, (an unnormalised version of) the effective sample size calculation given here.

    def ess(weights):
        """
        Estimate the effective sample size from an array of posterior sample weights.
        """
        N = len(weights)
        w = weights / weights.sum()
        ess = N / (1.0 + ((N * w - 1)**2).sum() / N)

        return int(ess)
    resdict['esnestle'] = ess(res.weights)
    resdict['essnestle'] = int(resdict['esnestle'] / timenestle)
    print("Effective samples per second: {}".format(resdict['essnestle']))
    

    CPNest

    CPNest (Veitch & del Pozzo, 2017) is another pure Python nested sampling algorithm. To perform the sampling from the restricted prior it uses a Metropolis-Hastings based MCMC, with sample proposal distributions that use a combination of the affine invariant ensemble methods described for emcee (Goodman & Weare, 2010), differential evolution, or jumps based on the eigenvectors of the covariance matrix of live points.

    CPNest is available on PyPI and can be installed using pip with:

    pip install cpnest
    

    or via Conda with:

    conda install -c conda-forge cpnest

    The source code is available on GitHub here.

    To run CPNest we have to define a class containing the likelihood and prior functions. CPNest also requires explicit bounds on all parameters from which it will uniformly draw an initial set of live points. Prior to running the nested sampling algorithm, it will run an MCMC using the prior as the target distribution to make sure the samples are drawn from the required prior, so it does not require any mapping between the prior and a unit-hypercube.

    from cpnest.model import Model # the CPNest model base class

    class StraightLineModel(Model):
        """
        A simple straight line model, with a Gaussian likelihood.

        Args:
            data (:class:`numpy.ndarray`): an array containing the observed data
            abscissa (:class:`numpy.ndarray`): an array containing the points at which the data were taken
            modelfunc (function): a function defining the model
            sigma (float): the standard deviation of the noise in the data
        """

        names = ['m', 'c'] # parameter names (this is a required variable for the class)

        # define the boundaries used for initial sampling of the live points
        cmin = -10. # lower range on c (the same as the uniform c prior lower bound)
        cmax = 10.  # upper range on c (the same as the uniform c prior upper bound)

        mmu = 0.     # mean of the Gaussian prior on m
        msigma = 10. # standard deviation of the Gaussian prior on m

        cbounds = [cmin, cmax]
        # set the m bounds to be +/- 10sigma about the mean
        mbounds = [mmu - 10. * msigma, mmu + 10. * msigma]

        # NOTE: the bounds could instead be set through arguments passed to the __init__
        # function for the class if wanted
        bounds = [mbounds, cbounds] # upper and lower bounds on each parameter (required for the class)

        def __init__(self, data, abscissa, modelfunc, sigma):
            # set the data
            self._data = data              # observed data
            self._abscissa = abscissa      # points at which the observed data are taken
            self._sigma = sigma            # standard deviation(s) of the data
            self._logsigma = np.log(sigma) # log sigma here to save computations in the likelihood
            self._ndata = len(data)        # number of data points
            self._model = modelfunc        # model function

        def log_likelihood(self, params):
            """
            The natural logarithm of the likelihood function.

            Args:
                params (dict): a dictionary keyed to the parameter names.

            Returns:
                float: the log likelihood value.
            """
            # extract the parameters
            m = params['m']
            c = params['c']

            # calculate the model
            model = self._model(x, m, c)

            # normalisation
            norm = -0.5 * self._ndata * LN2PI - self._ndata * self._logsigma

            # chi-squared
            chisq = np.sum(((self._data - model) / (self._sigma))**2)

            return norm - 0.5 * chisq

        def log_prior(self, p):
            """
            The natural logarithm of the prior function.

            Args:
                p (dict): a dictionary keyed to parameter names.

            Returns:
                float: The log prior value.
            """
            # the uniform priors are dealt with by just checking that we're within the bounds
            if not self.in_bounds(p):
                return -np.inf # check parameters are all in bounds

            # Gaussian prior on m
            lp = 0.
            m = p['m']
            lp -= 0.5 * ((m - self.mmu) / self.msigma)**2 # no need for normalisation constant on the prior

            return lp

    We can now run CPNest. As with Nestle we will use 1024 live points. CPNest can use multiple CPU cores, but we'll set it to use one to be comparable to the other tests. The number of iterations used for the MCMC sampling is automatically worked out within the code by calculating autocorrelation lengths to see how many iterations are needed to produce independent samples, but a maximum chain length can also be set. We'll set this to 1024. As with Nestle a stopping criterion is required, which for CPNest is defined in the same way as for Nestle. The current CPNest default is 0.1 and this cannot be altered by the user.

    print('CPNest version: {}'.format(cpnest.__version__))

    nlive = 1024   # number of live points
    maxmcmc = 1024 # maximum MCMC chain length
    nthreads = 1   # use one CPU core

    # set up the algorithm
    work = cpnest.CPNest(
        StraightLineModel(data, x, straight_line, sigma),
        verbose=0,
        nthreads=nthreads,
        nlive=nlive,
        maxmcmc=maxmcmc,
    )

    # run the algorithm
    t0 = time()
    work.run();
    t1 = time()

    timecpnest = (t1 - t0)
    Setting verbosity to 0
    2021-05-20, 09:32:04 - cpnest.cpnest.CPNest                  : Running with 1 parallel threads
    Running with 1 parallel threads
    Setting verbosity to 0
    2021-05-20, 09:32:21 - cpnest.sampler.Sampler                : Sampler process 970: MCMC samples accumulated = 461865
    Sampler process 970: MCMC samples accumulated = 461865
    2021-05-20, 09:32:21 - cpnest.sampler.Sampler                : Sampler process 970 - mean acceptance 0.496: exiting
    Sampler process 970 - mean acceptance 0.496: exiting
    2021-05-20, 09:32:21 - cpnest.NestedSampling.NestedSampler   : Final evidence: -116.80
    Final evidence: -116.80
    2021-05-20, 09:32:21 - cpnest.NestedSampling.NestedSampler   : Information: 7.16
    Information: 7.16
    2021-05-20, 09:32:22 - cpnest.nest2pos                       : Computed log_evidences: (-116.80638721422837,)
    Computed log_evidences: (-116.80638721422837,)
    2021-05-20, 09:32:22 - cpnest.nest2pos                       : Relative weights of input files: [1.0]
    Relative weights of input files: [1.0]
    2021-05-20, 09:32:22 - cpnest.nest2pos                       : Relative weights of input files taking into account their length: [1.0]
    Relative weights of input files taking into account their length: [1.0]
    2021-05-20, 09:32:22 - cpnest.nest2pos                       : Number of input samples: [11764]
    Number of input samples: [11764]
    2021-05-20, 09:32:22 - cpnest.nest2pos                       : Expected number of samples from each input file [2754]
    Expected number of samples from each input file [2754]
    2021-05-20, 09:32:22 - cpnest.nest2pos                       : Samples produced: 2754
    Samples produced: 2754
    

    We can again extract the value of $\ln{Z}$ and its uncertainty. One simple bit of model comparison that we may want to do is to compare the evidence for the model of the data containing a straight line (with the given priors on the parameters) with the evidence for a model in which the data just contains independent Gaussian noise with the given standard deviation. We can calculate the latter evidence easily by just setting the model to zero in the log-likelihood function, i.e.:

    \ln{Z}_{\rm noise} = -\frac{M}{2}\ln{(2\pi)} - \ln\left(\prod_{i=1}^M \sigma_i\right) - \sum_{i=1}^M\frac{d_i^2}{2\sigma_i^2}. $$

    The natural logarithm of the Bayes factor comparing the two models is then:

    \ln{\mathcal{B}} = \ln{Z} - \ln{Z}_{\rm noise}. $$

    We will also extract this value.
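    Because the noise-only model has no free parameters, its evidence is just this single likelihood value, so as a quick sanity check it can also be computed directly from the expression above. Below is a minimal sketch of that check (it uses the global data, sigma and M variables defined earlier in the notebook, and the logZnoise_direct name is just for illustration):

    # directly evaluate the noise-only evidence from the expression above
    # (sigma is a single value here, so the product over sigma_i becomes M*log(sigma))
    logZnoise_direct = -0.5*M*np.log(2.*np.pi) - M*np.log(sigma) - 0.5*np.sum((data/sigma)**2)
    print('Noise-only evidence computed directly: {}'.format(logZnoise_direct))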

    logZcpnest = work.NS.logZ                     # value of log Z
    infogaincpnest = work.NS.state.info           # value of the information gain
    logZerrcpnest = np.sqrt(infogaincpnest/nlive) # estimate of the statistical uncertainty on logZ
    # get the null log likelihood (evidence that the data is Gaussian noise with zero mean, and given standard deviation)
    logZnull = work.user.log_likelihood({'m': 0., 'c': 0.})
    print("log(Z) = {} ± {}".format(logZcpnest, logZerrcpnest))
    # output the log Bayes factor
    print('log Bayes factor is {}'.format(logZcpnest - logZnull))
    
    mchain_cpnest = work.posterior_samples['m']  # extract chain of m values
    cchain_cpnest = work.posterior_samples['c']  # extract chain of c values
    samples_cpnest = np.vstack((mchain_cpnest, cchain_cpnest)).T
    # print out the number of posterior samples
    print('Number of posterior samples is {}'.format(samples_cpnest.shape[0]))
    resdict['mcpnest_mu'] = np.mean(samples_cpnest[:,0])      # mean of m samples
    resdict['mcpnest_sig'] = np.std(samples_cpnest[:,0])      # standard deviation of m samples
    resdict['ccpnest_mu'] = np.mean(samples_cpnest[:,1])      # mean of c samples
    resdict['ccpnest_sig'] = np.std(samples_cpnest[:,1])      # standard deviation of c samples
    resdict['cccpnest'] = np.corrcoef(samples_cpnest.T)[0,1]  # correlation coefficient between parameters
    resdict['cpnest_npos'] = len(samples_cpnest)              # number of posterior samples
    resdict['cpnest_time'] = timecpnest                       # run time
    resdict['cpnest_logZ'] = logZcpnest                       # log marginalised likelihood
    resdict['cpnest_logZerr'] = logZerrcpnest                 # uncertainty on log(Z)
    # plot using corner.py
    plotposts(samples_cpnest)
    
    from cpnest.nest2pos import compute_weights
    _, weights = compute_weights(work.get_nested_samples()["logL"], work.nlive)
    weights = np.exp(weights)
    resdict['escpnest'] = ess(weights)
    resdict['esscpnest'] = int(resdict['escpnest'] / timecpnest)
    print("Effective samples per second: {}".format(resdict['esscpnest']))
    

    dynesty

    dynesty (Speagle, 2019) is a pure Python-based implementation of the dynamic nested sampling algorithm (Higson et al, 2017). It is very similar in usage to Nestle. Dynamic nested sampling is a way of dynamically adapting the number of live points used in order to provide more accurate evidence estimates and/or larger numbers of independent posterior samples. The code can run using dynamic adjustment of the live points or just in a "static" mode with a fixed number of live points. There are many options for how the new live points are drawn, both in terms of the bounds from which they are drawn and the sampling within those bounds. The bounding options include bounding ellipsoids in the style of MultiNest, or the RadFriends algorithm of UltraNest, among others. The sampling options include uniform draws from the bounds, random walks, or slice sampling in the style of PolyChord. If not explicitly specified, dynesty will select a sampling algorithm based on the number of dimensions of the problem being tackled.

    dynesty is available on PyPI and can be installed with pip using:

    pip install dynesty
    

    or using Conda with:

    conda install -c conda-forge dynesty
    

    Note: Other dynamic nested sampling codes exist such as dyPolyChord and perfectns.

    The likelihood function and prior transform can be set up in a way that is identical to Nestle (we'll define both again below for completeness), with the differences being in how the sampler is called. Here we will show the sampling being used for both "static" and "dynamic" nested sampling.

    def prior_transform(theta):
        """
        A function defining the transform between the parameterisation in the unit hypercube
        to the true parameters.

        Args:
            theta (tuple): a tuple containing the parameters.

        Returns:
            tuple: a new tuple or array with the transformed parameters.
        """
        mprime, cprime = theta  # unpack the parameters (in their unit hypercube form)

        cmin = -10.   # lower bound on uniform prior on c
        cmax = 10.    # upper bound on uniform prior on c
        mmu = 0.      # mean of Gaussian prior on m
        msigma = 10.  # standard deviation of Gaussian prior on m

        m = mmu + msigma*ndtri(mprime)   # convert back to m
        c = cprime*(cmax - cmin) + cmin  # convert back to c

        return (m, c)

    # set the natural logarithm of 2pi, so that it doesn't have to be recalculated
    LN2PI = np.log(2. * np.pi)
    LNSIGMA = np.log(sigma)  # natural log of the data noise standard deviation

    def loglikelihood_dynesty(theta):
        """
        The log-likelihood function.
        """
        m, c = theta  # unpack the parameters

        # normalisation
        norm = -0.5 * M * LN2PI - M * LNSIGMA

        # chi-squared (data, sigma and x are global variables defined early on in this notebook)
        chisq = np.sum(((data - straight_line(x, m, c)) / sigma)**2)

        return norm - 0.5 * chisq

    print('dynesty version: {}'.format(dynesty.__version__))

    nlive = 1024     # number of live points
    bound = 'multi'  # use MultiNest-style bounding ellipsoids
    ndims = 2        # two parameters
    sample = 'unif'  # uniform sampling
    tol = 0.1        # the stopping criterion

    from dynesty import NestedSampler

    sampler = NestedSampler(loglikelihood_dynesty, prior_transform, ndims,
                            bound=bound, sample=sample, nlive=nlive)

    t0 = time()
    sampler.run_nested(dlogz=tol, print_progress=False)  # don't output the progress bar
    t1 = time()

    timedynesty = (t1 - t0)

    print("Time taken to run 'dynesty' (in static mode) is {} seconds".format(timedynesty))
    res = sampler.results # get results dictionary from sampler
    logZdynesty = res.logz[-1]        # value of logZ
    logZerrdynesty = res.logzerr[-1]  # estimate of the statistical uncertainty on logZ
    print("log(Z) = {} ± {}".format(logZdynesty, logZerrdynesty))
    
    # get the function that resamples from the nested samples to give samples with equal weight
    from dynesty.utils import resample_equal
    # draw posterior samples
    weights = np.exp(res['logwt'] - res['logz'][-1])
    samples_dynesty = resample_equal(res.samples, weights)
    resdict['mdynesty_mu'] = np.mean(samples_dynesty[:,0])      # mean of m samples
    resdict['mdynesty_sig'] = np.std(samples_dynesty[:,0])      # standard deviation of m samples
    resdict['cdynesty_mu'] = np.mean(samples_dynesty[:,1])      # mean of c samples
    resdict['cdynesty_sig'] = np.std(samples_dynesty[:,1])      # standard deviation of c samples
    resdict['ccdynesty'] = np.corrcoef(samples_dynesty.T)[0,1]  # correlation coefficient between parameters
    resdict['dynesty_npos'] = len(samples_dynesty)              # number of posterior samples
    resdict['dynesty_time'] = timedynesty                       # run time
    resdict['dynesty_logZ'] = logZdynesty                       # log marginalised likelihood
    resdict['dynesty_logZerr'] = logZerrdynesty                 # uncertainty on log(Z)
    print('Number of posterior samples is {}'.format(len(samples_dynesty)))
    # plot using corner.py
    plotposts(samples_dynesty)
    
    resdict['esdynesty'] = ess(np.exp(res["logwt"] - logZdynesty))
    resdict['essdynesty'] = int(resdict['esdynesty'] / timedynesty)
    print("Effective samples per second: {}".format(resdict['essdynesty']))
    

    We can now do the same thing, but with the "dynamic" sampler. For this we can specify an initial number of live points (we will stick with 1024), but the actual number used will evolve. The stopping criterion for the dynamic method is more complex than for the static case, and we will just leave it at the default values.

    from dynesty import DynamicNestedSampler

    # set up the sampler
    dsampler = DynamicNestedSampler(loglikelihood_dynesty, prior_transform, ndims,
                                    bound=bound, sample=sample)

    # run the sampler (note that we have not included the dlogz tolerance)
    t0 = time()
    dsampler.run_nested(nlive_init=nlive, print_progress=False)
    t1 = time()

    timedynestydynamic = (t1 - t0)

    print("Time taken to run 'dynesty' (in dynamic mode) is {} seconds".format(timedynestydynamic))
    dres = dsampler.results # get results dictionary from sampler
    logZdynestydynamic = dres.logz[-1]        # value of logZ
    logZerrdynestydynamic = dres.logzerr[-1]  # estimate of the statistical uncertainty on logZ
    print("Static: log(Z) = {} ± {}".format(logZdynesty, logZerrdynesty))
    print("Dynamic: log(Z) = {} ± {}".format(logZdynestydynamic, logZerrdynestydynamic))
    # draw posterior samples
    dweights = np.exp(dres['logwt'] - dres['logz'][-1])
    samples_dynestydynamic = resample_equal(dres.samples, dweights)
    essdynestydynamic = ess(np.exp(dres["logwt"] - logZdynestydynamic))
    print('Static: num. posterior samples: {}'.format(len(samples_dynesty)))
    print('Dynamic: num. posterior samples: {}'.format(len(samples_dynestydynamic)))
    print('Static: effective sample size: {}'.format(resdict['esdynesty']))
    print('Dynamic: effective sample size: {}'.format(essdynestydynamic))
    
    Static: log(Z) = -116.66207844899319 ± 0.11680906514881793
    Dynamic: log(Z) = -116.70461530428305 ± 0.11701110275760397
    Static: num. posterior samples: 11624
    Dynamic: num. posterior samples: 25519
    Static: effective sample size: 4074
    Dynamic: effective sample size: 16617
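    If the aim is mainly to obtain posterior samples rather than a refined evidence estimate, the dynamic sampler's allocation of additional live points can be weighted towards the posterior. A minimal sketch of this is shown below (it reuses the likelihood and prior transform defined above; the wt_kwargs dictionary with a 'pfrac' entry is passed to dynesty's default weight function, and pfrac=1 is just an illustrative choice that places all of the weight on posterior accuracy):

    # a dynamic run weighted entirely towards posterior estimation
    dsampler_post = DynamicNestedSampler(loglikelihood_dynesty, prior_transform, ndims,
                                         bound=bound, sample=sample)
    # pfrac=1 targets the posterior; pfrac=0 would instead target the evidence
    dsampler_post.run_nested(nlive_init=nlive, wt_kwargs={'pfrac': 1.0}, print_progress=False)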
    

    UltraNest

    UltraNest is a Python-based implementation of nested sampling that contains several sampling algorithms designed for correctness of sampling. One particular method it provides, called RadFriends (Buchner, 2014), is designed to be robust against inaccuracies in drawing new samples from the constrained prior.

    UltraNest is available from its GitHub repository, but can be installed from PyPI using

    pip install ultranest
    

    or using Conda with:

    conda install -c conda-forge ultranest
    

    Similarly to Nestle and dynesty, UltraNest samples from a unit hypercube, and therefore you have to define a prior transformation function, so we can set up these functions in a very similar way to what we have seen above.

    def prior_transform_ultranest(theta):
        """
        A function defining the transform between the parameterisation in the unit hypercube
        to the true parameters.

        Args:
            theta (list): a list/array containing the parameters.

        Returns:
            list: a new list/array with the transformed parameters.
        """
        # unpack the parameters (in their unit hypercube form)
        mprime = theta[0]
        cprime = theta[1]

        cmin = -10.   # lower bound on uniform prior on c
        cmax = 10.    # upper bound on uniform prior on c
        mmu = 0.      # mean of Gaussian prior on m
        msigma = 10.  # standard deviation of Gaussian prior on m

        m = mmu + msigma*ndtri(mprime)   # convert back to m
        c = cprime*(cmax - cmin) + cmin  # convert back to c

        return np.array([m, c])

    def loglikelihood_ultranest(theta):
        """
        The log-likelihood function (the same form as used for Nestle and dynesty).
        """
        m, c = theta  # unpack the parameters

        # normalisation
        norm = -0.5*M*LN2PI - M*LNSIGMA

        # chi-squared (data, sigma and x are global variables defined early on in this notebook)
        chisq = np.sum(((data - straight_line(x, m, c))/sigma)**2)

        return norm - 0.5*chisq

    print('UltraNest version: {}'.format(ultranest.__version__))

    # set the ReactiveNestedSampler method
    sampler = ultranest.ReactiveNestedSampler(["m", "c"], loglikelihood_ultranest, prior_transform_ultranest)

    ndims = 2  # two parameters
    tol = 0.1  # the stopping criterion

    # run the nested sampling algorithm
    t0 = time()
    result = sampler.run(dlogz=tol)
    t1 = time()

    timeultranest = (t1 - t0)
    Sampling 400 live points from prior ...
    run_iter dlogz=0.1, dKL=0.5, frac_remain=0.01, Lepsilon=0.0010, min_ess=400
    max_iters=-1, max_ncalls=-1, max_num_improvement_loops=-1, min_num_live_points=400, cluster_num_live_points=40
    minimal_widths_sequence: [(-inf, 400.0), (inf, 400.0)]
    
    logZultranest = result['logz']        # value of logZ
    logZerrultranest = result['logzerr']  # estimate of the statistical uncertainty on logZ
    # output marginal likelihood
    print("Marginalised evidence is {} ± {}".format(logZultranest, logZerrultranest))
    
    points = np.array(result["weighted_samples"]["points"])
    weights = np.array(result["weighted_samples"]["weights"])
    scaledweights = weights / weights.max()
    mask = np.random.rand(len(scaledweights)) < scaledweights
    samples_ultranest = points[mask, :]
    resdict['multranest_mu'] = np.mean(samples_ultranest[:,0])      # mean of m samples
    resdict['multranest_sig'] = np.std(samples_ultranest[:,0])      # standard deviation of m samples
    resdict['cultranest_mu'] = np.mean(samples_ultranest[:,1])      # mean of c samples
    resdict['cultranest_sig'] = np.std(samples_ultranest[:,1])      # standard deviation of c samples
    resdict['ccultranest'] = np.corrcoef(samples_ultranest.T)[0,1]  # correlation coefficient between parameters
    resdict['ultranest_npos'] = len(samples_ultranest)              # number of posterior samples
    resdict['ultranest_time'] = timeultranest                       # run time
    resdict['ultranest_logZ'] = logZultranest                       # log marginalised likelihood
    resdict['ultranest_logZerr'] = logZerrultranest                 # uncertainty on log(Z)
    # plot using corner.py
    plotposts(samples_ultranest)
    print('Number of posterior samples is {}'.format(samples_ultranest.shape[0]))
    
    resdict['esultranest'] = ess(weights)
    resdict['essultranest'] = int(resdict['esultranest'] / timeultranest)
    print("Effective samples per second: {}".format(resdict['essultranest']))
    

    We will now look at some nested sampling algorithms that are not pure Python, but do have Python wrappers.

    PyMultiNest

    PyMultiNest (Buchner et al, 2014) is a wrapper to the popular MultiNest code (Feroz & Hobson, 2008; Feroz et al, 2009), which has been particularly taken up by the astrophysics and cosmology community. The basics underlying MultiNest were described in relation to Nestle, although MultiNest can use more sophisticated sampling methods, such as importance nested sampling (Feroz et al, 2014).

    To use PyMultiNest, the MultiNest library itself must be installed (MultiNest can be downloaded from GitHub here). Installation of PyMultiNest and its requirements are described here; once MultiNest is installed you can then install PyMultiNest using pip with:

    pip install pymultinest
    

    A simpler solution, that comes bundled with MultiNest, is to install using Conda with:

    conda install -c conda-forge pymultinest
    

    Note: If using MultiNest, and software wrapping it, for your work please note the software license restrictions here.

    Some examples of running PyMultiNest can be found here, although interfaces through a solver function and a Solver class are also available. We'll use the Solver class method, which can take in the same initialisation keyword arguments as the run() method.

    We use the Solver class to create a new class for our specific example. As with CPNest we use this class to define our log likelihood function, and like Nestle we define a prior transform to go from a unit hypercube to the true parameterisation. These functions can only take in arrays of the parameter values that we are exploring and cannot take in additional arguments. Therefore any additional values required must either be global variables, or a new __init__ method for the class must be defined. We'll show the latter of these.

    Note: PyMultiNest is also available in Python through the Monte Python package.

    import pymultinest
    from pymultinest.solve import Solver
    from scipy.special import ndtri

    LN2PI = np.log(2.*np.pi)

    class StraightLineModelPyMultiNest(Solver):
        """
        A simple straight line model, with a Gaussian likelihood.

        Args:
            data (:class:`numpy.ndarray`): an array containing the observed data
            abscissa (:class:`numpy.ndarray`): an array containing the points at which the data were taken
            modelfunc (function): a function defining the model
            sigma (float): the standard deviation of the noise in the data
            **kwargs: keyword arguments for the run method
        """

        # define the prior parameters
        cmin = -10.  # lower range on c (the same as the uniform c prior lower bound)
        cmax = 10.   # upper range on c (the same as the uniform c prior upper bound)

        mmu = 0.     # mean of the Gaussian prior on m
        msigma = 10. # standard deviation of the Gaussian prior on m

        def __init__(self, data, abscissa, modelfunc, sigma, **kwargs):
            # set the data
            self._data = data               # observed data
            self._abscissa = abscissa       # points at which the observed data are taken
            self._sigma = sigma             # standard deviation(s) of the data
            self._logsigma = np.log(sigma)  # log sigma here to save computations in the likelihood
            self._ndata = len(data)         # number of data points
            self._model = modelfunc         # model function

            Solver.__init__(self, **kwargs)

        def Prior(self, cube):
            """
            The prior transform going from the unit hypercube to the true parameters. This function
            has to be called "Prior".

            Args:
                cube (:class:`numpy.ndarray`): an array of values drawn from the unit hypercube

            Returns:
                :class:`numpy.ndarray`: an array of the transformed parameters
            """
            # extract values
            mprime = cube[0]
            cprime = cube[1]

            m = self.mmu + self.msigma*ndtri(mprime)        # convert back to m
            c = cprime*(self.cmax - self.cmin) + self.cmin  # convert back to c

            return np.array([m, c])

        def LogLikelihood(self, cube):
            """
            The log likelihood function. This function has to be called "LogLikelihood".

            Args:
                cube (:class:`numpy.ndarray`): an array of parameter values.

            Returns:
                float: the log likelihood value.
            """
            # extract parameters
            m = cube[0]
            c = cube[1]

            # calculate the model
            model = self._model(x, m, c)

            # normalisation
            norm = -0.5*self._ndata*LN2PI - self._ndata*self._logsigma

            # chi-squared
            chisq = np.sum(((self._data - model)/(self._sigma))**2)

            return norm - 0.5*chisq
    nlive = 1024 # number of live points
    ndim = 2     # number of parameters
    tol = 0.1    # stopping criterion
    # run the algorithm
    t0 = time()
    solution = StraightLineModelPyMultiNest(data, x, straight_line, sigma, n_dims=ndim,
                                            n_live_points=nlive, evidence_tolerance=tol);
    t1 = time()
    timepymultinest = (t1-t0)
    
    logZpymnest = solution.logZ        # value of log Z
    logZerrpymnest = solution.logZerr  # estimate of the statistical uncertainty on logZ
    print("log(Z) = {} ± {}".format(logZpymnest, logZerrpymnest))
    
    Model in "(temporary directory)" (2 dimensions)
    Evidence ln Z = -116.7 +- 0.1
    Parameter values:
       Parameter  1 : 3.464 +- 0.100
       Parameter  2 : 0.992 +- 0.566
    
    mchain_pymnest = solution.samples[:,0] # extract chain of m values
    cchain_pymnest = solution.samples[:,1] # extract chain of c values
    samples_pymnest = np.vstack((mchain_pymnest, cchain_pymnest)).T
    resdict['mpymnest_mu'] = np.mean(samples_pymnest[:,0])      # mean of m samples
    resdict['mpymnest_sig'] = np.std(samples_pymnest[:,0])      # standard deviation of m samples
    resdict['cpymnest_mu'] = np.mean(samples_pymnest[:,1])      # mean of c samples
    resdict['cpymnest_sig'] = np.std(samples_pymnest[:,1])      # standard deviation of c samples
    resdict['ccpymnest'] = np.corrcoef(samples_pymnest.T)[0,1]  # correlation coefficient between parameters
    resdict['pymnest_npos'] = len(samples_pymnest)              # number of posterior samples
    resdict['pymnest_time'] = timepymultinest                   # run time
    resdict['pymnest_logZ'] = logZpymnest                       # log marginalised likelihood
    resdict['pymnest_logZerr'] = logZerrpymnest                 # uncertainty on log(Z)
    # print out the number of posterior samples
    print('Number of posterior samples is {}'.format(samples_pymnest.shape[0]))
    # plot using corner.py
    plotposts(samples_pymnest)
    
    # note: I'm not calculating the correct ESS here as I can't access the weights
    resdict['espymnest'] = len(samples_pymnest)
    resdict['esspymnest'] = int(resdict['espymnest'] / timepymultinest)
    print("Effective samples per second: {}".format(resdict['esspymnest']))
    

    DNest4

    DNest4 (Brewer & Foreman-Mackey, 2016) uses a nested sampling algorithm called Diffusive Nested Sampling (Brewer et al, 2011). The main difference between this method and others is that it can at any point still propose samples throughout the whole prior, rather than being constrained to draw points within the current minimum likelihood bound. This allows it to find volumes of high likelihood that might be missed by other methods because they lie in an area where no live points remain, i.e. it is good for sampling narrow multi-model posteriors. It can also be used in cases where the number of parameters in the model that is being fit is itself a variable, e.g., if the line model could either be a straight line, or a quadratic, or a cubic. The method used is called a Reversible Jump MCMC (Green, 1995), although this is not currently available in the Python interface for DNest4.

    An example of using the Python wrapper for DNest4 for a straight line model is given here, but we will reproduce it with minor modifications (in that example $\sigma$ is also included in the fitting).

    You can install the development version of DNest4 by cloning it from its GitHub repository and following the instructions here. However, the Python interface can be installed from PyPI via pip with

    pip install dnest4
    

    or through Conda with

    conda install -c conda-forge dnest4
    

    For DNest4 we need to create a Model class that contains the following methods:

  • from_prior(): a function that draws a sample from the prior distributions of the parameters
  • perturb(): a function that proposes new sample values
  • log_likelihood(): the natural logarithm of the likelihood function
    The perturb() method also needs to calculate the natural logarithm of the Metropolis-Hastings ratio required to give the acceptance probability:

    H = \frac{p(\theta'|I)}{p(\theta|I)} \frac{g(\theta|\theta')}{g(\theta'|\theta)}, $$

    where the first ratio on the right-hand-side is the ratio of priors at the proposed point $\theta'$ and current point $\theta$, and the second is the ratio of the proposal distributions. In our case the proposal distribution is symmetric, so its ratio is unity, and the ratio of the priors depends on the Gaussian prior on $m$, but not on the uniform prior for $c$.
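    For the specific priors used here this means that, when $m$ is perturbed from a current value $m$ to a proposed value $m'$, the ratio reduces to

    \ln{H} = \frac{(m - \mu_m)^2 - (m' - \mu_m)^2}{2\sigma_m^2}, $$

    where $\mu_m$ and $\sigma_m$ are the mean and standard deviation of the Gaussian prior on $m$; when $c$ is perturbed (and wrapped back into its uniform prior range) $\ln{H} = 0$. This is exactly what the perturb() method in the class below accumulates.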

    Parameters cannot be passed to the class on initialisation, so things such as the data must be global variables.

    # import the randh function, which draws points from a heavy-tailed distribution, and wrap, which wraps a point within a given range
    from dnest4 import randh, wrap
    LN2PI = np.log(2.*np.pi)
    LNSIGMA = np.log(sigma)
    # define prior range values as global variables
    cmin = -10.  # lower range on c (the same as the uniform c prior lower bound)
    cmax = 10.   # upper range on c (the same as the uniform c prior upper bound)
    mmu = 0.     # mean of the Gaussian prior on m
    msigma = 10. # standard deviation of the Gaussian prior on m
    class DNest4Model(object):
        """Specify the model in Python."""
        def __init__(self):
            """Parameter values *are not* stored inside the class."""
        def from_prior(self):
            """
            Draw points from the prior distribution.

            Returns:
                :class:`np.ndarray`: a numpy array containing the parameter values.
            """
            m = np.random.normal(mmu, msigma)
            c = np.random.uniform(cmin, cmax)
            return np.array([m, c])
        def perturb(self, params):
            """
            Perturb the current parameters by proposing a new position. This takes a numpy array of
            parameters as input, and modifies it in-place. It just perturbs one parameter at a time.

            Args:
                params (:class:`np.ndarray`): the current parameters (this is modified by the function)

            Returns:
                float: the natural logarithm of the Metropolis-Hastings proposal ratio.
            """
            logH = 0.0 # log of the Metropolis-Hastings prior x proposal ratio
            # randomly choose which parameter to perturb
            which = np.random.randint(len(params))
            mmu = 0.
            msigma = 10.
            if which == 0:
                # remove the Gaussian prior contribution at the current value of m
                logH -= -0.5*((params[which]-mmu)/msigma)**2
            params[which] += 1.*randh() # we could scale this or use different distributions for each parameter, but in our case we know our distribution is quite compact and unimodal
            if which == 0:
                # add the Gaussian prior contribution at the proposed value of m
                logH += -0.5*((params[which]-mmu)/msigma)**2
            else:
                # wrap c value so that it stays within the prior range
                params[which] = wrap(params[which], cmin, cmax)
            return logH
        def log_likelihood(self, params):
            """Gaussian sampling distribution."""
            m, c = params # unpack parameters
            norm = -0.5*M*LN2PI - M*LNSIGMA
            chisq = np.sum(((data - straight_line(x, m, c))/sigma)**2)
            return norm - 0.5*chisq
    

    We can now run the sampler. The values that are passed to the sampler are described in Section 7 of Brewer & Foreman-Mackey, 2016, but we'll stick with the defaults used in the example code. Unlike with the other nested sampling algorithms we do not pass a tolerance for the stopping criterion, but just specify the number of levels and steps to perform. The num_steps value gives the total number of MCMC iterations; if it is not set the code will by default run forever, until the process is killed by the user.

    print('DNest4 version: {}'.format(dnest4.__version__))

    # Create a model object and a sampler
    model = DNest4Model()
    sampler = dnest4.DNest4Sampler(model, backend=dnest4.backends.CSVBackend(".", sep=" "))

    # Set up the sampler. The first argument is max_num_levels
    gen = sampler.sample(max_num_levels=30, num_steps=1000, new_level_interval=10000,
                         num_per_step=10000, thread_steps=100, num_particles=5,
                         lam=10, beta=100)

    # Do the sampling (one iteration here = one particle save)
    t0 = time()
    for sample in enumerate(gen):
        pass  # nothing needs to be done with the individual samples here
    t1 = time()

    timednest4 = (t1 - t0)

    print("Time taken to run 'DNest4' is {} seconds".format(timednest4))

    To create the posterior samples the output of DNest4 has to be processed, which can be performed using the postprocess() function. Once this is done the posterior samples will be held in a posterior_sample.txt file containing a column for each parameter. As stated in Brewer & Foreman-Mackey, 2016:

    "it is harder to compute justified error bars on $\ln{(Z)}$ in DNS than it is in standard NS."

    so we will not quote a value here.

    # Run the postprocessing to get marginal likelihood and generate posterior samples
    logZdnest4, infogaindnest4, _ = dnest4.postprocess(plot=False);
    samples_dnest4 = np.loadtxt('posterior_sample.txt')
    
    # print out the number of posterior samples
    print('Number of posterior samples is {}'.format(samples_dnest4.shape[0]))
    resdict['mdnest4_mu'] = np.mean(samples_dnest4[:,0])      # mean of m samples
    resdict['mdnest4_sig'] = np.std(samples_dnest4[:,0])      # standard deviation of m samples
    resdict['cdnest4_mu'] = np.mean(samples_dnest4[:,1])      # mean of c samples
    resdict['cdnest4_sig'] = np.std(samples_dnest4[:,1])      # standard deviation of c samples
    resdict['ccdnest4'] = np.corrcoef(samples_dnest4.T)[0,1]  # correlation coefficient between parameters
    resdict['dnest4_npos'] = len(samples_dnest4)              # number of posterior samples
    resdict['dnest4_time'] = timednest4                       # run time
    resdict['dnest4_logZ'] = logZdnest4                       # log marginalised likelihood
    # plot using corner.py
    plotposts(samples_dnest4)
    
    # note: I'm not calculating the correct ESS here as I can't access the weights
    resdict['esdnest4'] = len(samples_dnest4)
    resdict['essdnest4'] = len(samples_dnest4) / timednest4
    print("Effective samples per second: {0:.2f}".format(resdict['essdnest4']))
    

    PyPolyChord

    PolyChord (Handley et al, 2015a, Handley et al, 2015b) is a nested sampling implementation that uses slice sampling when drawing new points from the constrained prior.

    The PolyChord package itself provides a Python interface called PyPolyChord for using the algorithm. PolyChord can be downloaded from Github here with:

    git clone https://github.com/PolyChord/PolyChordLite.git
    

    PyPolyChord can then be installed with

    cd PolyChordLite
    python setup.py install
    

    The installation will attempt to find a copy of the libmpi.so library. If you have not got MPI (or are having further problems with MPI) you can disable it by instead installing with:

    python setup.py --no-mpi install
    

    As with Nestle and PyMultiNest, PolyChord samples from a unit hypercube, so you need to define a prior transform function. We also need a log likelihood function. These can be very similar to those used for Nestle, except that the prior transform should return a list of parameter values and the likelihood function should return both the log-likelihood value and a list of any derived parameters (an empty list in our case). Values such as the data array cannot be passed to the functions explicitly, so must be set as global variables.

    Note: If using PolyChord for your work, please note the license restrictions described here.

    def prior_transform_polychord(cube):
        """
        A function defining the transform between the parameterisation in the unit hypercube
        to the true parameters.

        Args:
            cube (array, list): a list containing the parameters as drawn from a unit hypercube.

        Returns:
            list: the transformed parameters.
        """
        # unpack the parameters (in their unit hypercube form)
        mprime = cube[0]
        cprime = cube[1]

        cmin = -10.   # lower bound on uniform prior on c
        cmax = 10.    # upper bound on uniform prior on c
        mmu = 0.      # mean of Gaussian prior on m
        msigma = 10.  # standard deviation of Gaussian prior on m

        m = mmu + msigma * ndtri(mprime)   # convert back to m
        c = cprime * (cmax - cmin) + cmin  # convert back to c

        theta = [m, c]

        return theta

    def loglikelihood_polychord(theta):
        """
        The log-likelihood function.

        Args:
            theta (array, list): the set of parameter values.

        Returns:
            float: the log-likelihood value.
            list: A list of any derived parameters (an empty list if there are none).
        """
        # unpack the parameters
        m = theta[0]
        c = theta[1]

        # normalisation
        norm = -0.5 * M * LN2PI - M * LNSIGMA

        # chi-squared (data, sigma and x are global variables defined early on in this notebook)
        chisq = np.sum(((data - straight_line(x, m, c)) / sigma)**2)

        return norm - 0.5 * chisq, []

    As before we'll use 1024 live points (the default is to use 25 times the dimensionality of the problem) and a stopping criterion of 0.1 (the default value is 0.001). The current version of PyPolyChord will return values of the marginal likelihood and its uncertainty from the PyPolyChord object that runs the algorithm, but the class does not currently contain the posterior samples. It instead outputs these to files. The location of these files is set by setting the keyword arguments base_dir (defaults to "chains") and file_root (defaults to "test") in a PolyChordSettings object. This sets {root} = {base_dir}/{file_root}, and within this the posterior samples can be found in {root}.txt and the other statistics found in {root}.stats.

    You may find that codes produce a "Segmentation Fault" soon after they start running. The PolyChord documentation says:

    "The slice sampling & clustering steps use a recursive procedure. The default memory allocated to recursive procedures is embarassingly small (to guard against memory leaks)."

    and the solution is to:

    Try increasing the stack size:
    Linux: ulimit -s unlimited
    OSX: ulimit -s hard

    Alternatively, within a Python script you can achieve this using the resource module, as shown below.
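    For example, a minimal sketch of raising the stack limit with the standard library resource module is given below (the saved curlimit value is what gets restored after the run in the cell further down):

    import resource
    # store the current stack limits so that they can be restored after the run
    curlimit = resource.getrlimit(resource.RLIMIT_STACK)
    # raise the stack limit as far as the system allows
    resource.setrlimit(resource.RLIMIT_STACK, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))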

    nlive = 1024   # number of live points
    ndims = 2      # number of parameters
    nderived = 0   # number of derived parameters (this is zero)
    tol = 0.1      # stopping criterion
    basedir = os.path.join(os.getcwd(), 'polychord')  # output base directory
    if not os.path.isdir(basedir):
        os.makedirs(basedir)                          # create base directory
        os.makedirs(os.path.join(basedir, 'clusters'))
    fileroot = 'straightline'                         # output file name
    broot = os.path.join(basedir, fileroot)
    # import PolyChord
    import pypolychord
    from pypolychord.settings import PolyChordSettings  # class for passing setup information
    from pypolychord.priors import UniformPrior         # pre-defined class for a uniform prior
    # setup run settings using the PolyChordSetting class
    pargs = {'nlive': nlive,
             'precision_criterion': tol,
             'base_dir': basedir,
             'file_root': fileroot,
             'write_resume': False, # don't output a resume file
             'read_resume': False}  # don't read a resume file
    settings = PolyChordSettings(ndims, nderived, **pargs)
    t0 = time()
    output = pypolychord.run_polychord(loglikelihood_polychord, ndims, nderived, settings, prior_transform_polychord)
    t1 = time()
    timepolychord = (t1-t0)
    print("Time taken to run 'PyPolyChord' is {} seconds".format(timepolychord))
    # reset stack resource limit
    resource.setrlimit(resource.RLIMIT_STACK, curlimit)
    
    samplefile = broot+'_equal_weights.txt'
    samples_polychord = np.loadtxt(samplefile)
    samples_polychord = samples_polychord[:,-ndims:] # extract the last 'ndims' columns
    # print out the number of posterior samples
    print('Number of posterior samples is {}'.format(samples_polychord.shape[0]))
    resdict['mpolychord_mu'] = np.mean(samples_polychord[:,0])      # mean of m samples
    resdict['mpolychord_sig'] = np.std(samples_polychord[:,0])      # standard deviation of m samples
    resdict['cpolychord_mu'] = np.mean(samples_polychord[:,1])      # mean of c samples
    resdict['cpolychord_sig'] = np.std(samples_polychord[:,1])      # standard deviation of c samples
    resdict['ccpolychord'] = np.corrcoef(samples_polychord.T)[0,1]  # correlation coefficient between parameters
    resdict['polychord_npos'] = len(samples_polychord)              # number of posterior samples
    resdict['polychord_time'] = timepolychord                       # run time
    resdict['polychord_logZ'] = output.logZ                         # log marginalised likelihood
    resdict['polychord_logZerr'] = output.logZerr                   # uncertainty on log(Z)
    # plot using corner.py
    plotposts(samples_polychord)
    
    weights = np.loadtxt(broot+'.txt')[:,0]  # get weights
    resdict['espolychord'] = ess(weights)
    resdict['esspolychord'] = int(resdict['espolychord'] / timepolychord)
    print("Effective samples per second: {}".format(resdict['esspolychord']))
    

    One code to rule them all!

    There are a variety of different styles, syntaxes and formats required to use the various packages described above, so wouldn't it be nice to have a unified interface to access a range of different samplers!

    Well, there are in fact a few different packages that allow you to use more than one of the above samplers, e.g., Monte Python, cronus, Gleipnir and bilby (others may, and probably do, exist).

    bilby

    Of these, bilby (Ashton et al, 2019) provides access to the widest range of samplers (I may be biased as one of the developers), including both MCMC and nested sampler packages. An example of using bilby for a linear regression model can be found here, but I'll reproduce something equivalent to the example in this notebook below.

    Bilby has a selection of built-in likelihood functions and prior distributions. You can define your own likelihood, but here we'll use the built-in Gaussian likelihood function (the source code for which can be found here). By default, bilby will use the dynesty sampler, but we'll explicitly set it in this example.

    # import the likelihood function and prior function
    from bilby.core.likelihood import GaussianLikelihood
    from bilby.core.prior import Gaussian, Uniform

    # define the likelihood function
    likelihood = bilby.core.likelihood.GaussianLikelihood(x, data, straight_line, sigma=sigma)

    cmin = -10.  # lower range of uniform distribution on c
    cmax = 10.   # upper range of uniform distribution on c
    mmu = 0.     # mean of Gaussian distribution on m
    msigma = 10. # standard deviation of Gaussian distribution on m

    # define the priors in a dictionary
    priors = {
        "m": Gaussian(mmu, msigma, name="m"),
        "c": Uniform(cmin, cmax, name="c"),
    }

    # run the sampler (using dynesty)
    bilbyresult = bilby.run_sampler(
        likelihood=likelihood,
        priors=priors,
        outdir="bilby",     # the directory to output the results to
        label="bilby",      # a label to apply to the output file
        plot=False,         # by default this is True and a corner plot will be produced
        injection_parameters={"m": m, "c": c},  # (optional) true values for adding to plots
        sampler='dynesty',  # set the name of the sampler
        nlive=1024,         # add in any keyword arguments used by that sampler
        sample='unif',
    )
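    The returned bilbyresult is a bilby Result object from which the same quantities used for the other samplers can be pulled out; a brief sketch is shown below (the posterior attribute is a pandas DataFrame, and log_evidence and log_evidence_err hold the nested sampling evidence estimate):

    # extract the posterior samples for m and c as a numpy array
    samples_bilby = bilbyresult.posterior[["m", "c"]].values
    print('Number of posterior samples is {}'.format(samples_bilby.shape[0]))

    # the log evidence and its uncertainty from the dynesty run
    print("log(Z) = {} ± {}".format(bilbyresult.log_evidence, bilbyresult.log_evidence_err))

    # plot using corner.py
    plotposts(samples_bilby)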

    Summary

    The timings given in this notebook are not directly comparable as the various codes have not been run with identical settings, and the number of likelihood evaluations performed can be quite different. But, they may be helpful in providing a rough ball-park of expected run times. As mentioned in relation to emcee above, it is important to note that different samplers will produce different numbers of truly independent samples, so some methods that take longer to run may actually give more independent samples. Therefore comparisons of the number of effective samples per second may be more useful, although these will still be quite problem dependent. It is also worth noting that some of the codes can make use of multiple CPU cores, or GPUs, but these have not been used here. So, speed-ups could be achieved by using these in-built parallelisations.

    The number of posterior samples produced by the nested sampling codes can depend strongly on the number of initial live points used. However, for the codes where we have specified a number of live points the final posterior sizes are very comparable. It should be noted that comparing posterior sample sizes and run times for DNest4 in particular is not entirely fair, as it is designed with more complex problems in mind and is not created to be efficient at simple examples where other methods work easily.

    Something that is also worth noting, in particular with regard to nested sampling algorithms, is that they will generally take longer to run when there is a larger degree of difference between the prior and the posterior (i.e., the data contains a lot of new information, which, in technical terms, is called having a large Kullback-Leibler (KL) divergence). So, estimating the parameters of a model when the data contains a high signal-to-noise ratio signal will often take longer than the same model but with a weaker signal in the data.
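    For nested sampling this KL divergence is just the "information" reported by several of the codes above,

    \mathcal{H} = \int p(\vec{\theta}|\mathbf{d},I) \ln{\left[\frac{p(\vec{\theta}|\mathbf{d},I)}{p(\vec{\theta}|I)}\right]} \,{\rm d}\vec{\theta}, $$

    which is also what sets the rough statistical uncertainty on the evidence, $\sigma_{\ln{Z}} \approx \sqrt{\mathcal{H}/N_{\rm live}}$, used for the $\ln{Z}$ error estimates earlier in the notebook.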

    Now, let's overplot everything (it will look a bit horrific)!

    # get a list of all the samples
    chains = [samples_emcee, samples_pymc3, samples_zeus, samples_tfp, samples_pystan,
              samples_pyjags, samples_nestle, samples_cpnest, samples_dynesty,
              samples_ultranest, samples_pymnest, samples_dnest4, samples_polychord]

    color = cm.gnuplot(np.linspace(0, 1, len(chains)))

    hist2dkwargs = {'plot_datapoints': False,
                    'plot_density': False,
                    'levels': 1.0 - np.exp(-0.5 * np.arange(1.5, 2.1, 0.5) ** 2)}  # roughly 1 and 2 sigma

    for i, chain in enumerate(chains):
        cv = to_hex(color[i])
        if i == 0:
            fig = corner.corner(
                chain,
                labels=[r"$m$", r"$c$"],
                color=cv,
                hist_kwargs={'density': True},
                **hist2dkwargs,
            )
        else:
            if i < len(chains) - 1:
                corner.corner(
                    chain,
                    labels=[r"$m$", r"$c$"],
                    fig=fig,
                    color=cv,
                    hist_kwargs={'density': True},
                    **hist2dkwargs,
                )
            else:  # plot true values on last plot
                corner.corner(
                    chain,
                    labels=[r"$m$", r"$c$"],
                    truths=[m, c],
                    fig=fig,
                    color=cv,
                    hist_kwargs={'density': True},
                    **hist2dkwargs,
                )

    fig.set_size_inches(10.0, 9.0)

    outputtable = """
    | Sampler | $m$ | $c$ | corr. coeff | num. effective samples | effective samples / s |
    |--- | :---: | :---: | :---: | :---: | :---: |
    | emcee | {memcee_mu:.3f} ± {memcee_sig:.3f} | {cemcee_mu:.3f} ± {cemcee_sig:.3f} | {ccemcee:.3f} | {esemcee:d} | {essemcee:d} |
    | PyMC3 | {mpymc3_mu:.3f} ± {mpymc3_sig:.3f} | {cpymc3_mu:.3f} ± {cpymc3_sig:.3f} | {ccpymc3:.3f} | {espymc3:d} | {esspymc3:d} |
    | TFP | {mtfp_mu:.3f} ± {mtfp_sig:.3f} | {ctfp_mu:.3f} ± {ctfp_sig:.3f} | {cctfp:.3f} | {estfp:d} | {esstfp:d} |
    | zeus | {mzeus_mu:.3f} ± {mzeus_sig:.3f} | {czeus_mu:.3f} ± {czeus_sig:.3f} | {cczeus:.3f} | {eszeus:d} | {esszeus:d} |
    | PyStan | {mpystan_mu:.3f} ± {mpystan_sig:.3f} | {cpystan_mu:.3f} ± {cpystan_sig:.3f} | {ccpystan:.3f} | {espystan:d} | {esspystan:d} |
    | PyJAGS | {mpyjags_mu:.3f} ± {mpyjags_sig:.3f} | {cpyjags_mu:.3f} ± {cpyjags_sig:.3f} | {ccpyjags:.3f} | {espyjags:d} | {esspyjags:d} |
    | Nestle | {mnestle_mu:.3f} ± {mnestle_sig:.3f} | {cnestle_mu:.3f} ± {cnestle_sig:.3f} | {ccnestle:.3f} | {esnestle:d} | {essnestle:d} |
    | CPNest | {mcpnest_mu:.3f} ± {mcpnest_sig:.3f} | {ccpnest_mu:.3f} ± {ccpnest_sig:.3f} | {cccpnest:.3f} | {escpnest:d} | {esscpnest:d} |
    | dynesty | {mdynesty_mu:.3f} ± {mdynesty_sig:.3f} | {cdynesty_mu:.3f} ± {cdynesty_sig:.3f} | {ccdynesty:.3f} | {esdynesty:d} | {essdynesty:d} |
    | UltraNest | {multranest_mu:.3f} ± {multranest_sig:.3f} | {cultranest_mu:.3f} ± {cultranest_sig:.3f} | {ccultranest:.3f} | {esultranest:d} | {essultranest:d} |
    | PyMultiNest | {mpymnest_mu:.3f} ± {mpymnest_sig:.3f} | {cpymnest_mu:.3f} ± {cpymnest_sig:.3f} | {ccpymnest:.3f} | {espymnest:d} | {esspymnest:d} |
    | DNest4 | {mdnest4_mu:.3f} ± {mdnest4_sig:.3f} | {cdnest4_mu:.3f} ± {cdnest4_sig:.3f} | {ccdnest4:.3f} | {esdnest4:.2f} | {essdnest4:.2f} |
    | PyPolyChord | {mpolychord_mu:.3f} ± {mpolychord_sig:.3f} | {cpolychord_mu:.3f} ± {cpolychord_sig:.3f} | {ccpolychord:.3f} | {espolychord:d} | {esspolychord:d} |
    """

    Markdown(outputtable.format(**resdict))

    outputnest = """
    | Sampler | $\ln{{Z}}$ | num. posterior samples | run time (s) |
    | --- | --- | :---: | --- |
    | Nestle | {nestle_logZ:.3f} ± {nestle_logZerr:.3f} | {nestle_npos} | {nestle_time:.3f} |
    | CPNest | {cpnest_logZ:.3f} ± {cpnest_logZerr:.3f} | {cpnest_npos} | {cpnest_time:.3f} |
    | dynesty | {dynesty_logZ:.3f} ± {dynesty_logZerr:.3f} | {dynesty_npos} | {dynesty_time:.3f} |
    | UltraNest | {ultranest_logZ:.3f} ± {ultranest_logZerr:.3f} | {ultranest_npos} | {ultranest_time:.3f} |
    | PyMultiNest | {pymnest_logZ:.3f} ± {pymnest_logZerr:.3f} | {pymnest_npos} | {pymnest_time:.3f} |
    | DNest4 | {dnest4_logZ:.3f} | {dnest4_npos} | {dnest4_time:.3f} |
    | PyPolyChord | {polychord_logZ:.3f} ± {polychord_logZerr:.3f} | {polychord_npos} | {polychord_time:.3f} |
    """

    Markdown(outputnest.format(**resdict))

    Running this notebook

    I have created a Docker image, available on Quay.io, in which Python 3.7 and all the above samplers are installed (this is roughly 5 GB in size, so may take some time to download). This can be used to run this notebook in Jupyter. You can download this notebook from here.

    If you have Docker installed then under Linux, Mac OS or Windows you can run the notebook by opening a terminal (e.g., the Powershell in Windows) and running:

    sudo docker run -i -t -p 8888:8888 -v ${notebook_location}:/samplers quay.io/mattpitkin/samplers-demo:latest /bin/bash -c "jupyter notebook --notebook-dir=/samplers --ip='*' --port=8888 --no-browser --allow-root --MultiKernelManager.default_kernel_name=Samplers"
    

    where {notebook_location} is replaced with the directory path containing the downloaded notebook. This will provide a link that can be pasted into your browser to show the notebook, and the Samplers.ipynb file can be selected. Under Windows remove the sudo from the above command.