Skip to content

Gaussian process cross correlation

License

Notifications You must be signed in to change notification settings

HITS-AIN/GPCC.jl

Repository files navigation

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public. GitHub ascl:2303.006

GPCC.jl

ℹ What is this?

This is a Julia implementation of the Gaussian Process Cross Correlation (GPCC) method introduced in

A Gaussian process cross-correlation approach to time delay estimation for reverberation mapping of active galactic nuclei.

GPCC is a probabilistic alternative to the Interpolated Cross Correlation function (ICCF). Advantages over the ICCF include:

  • Outputs a probability distribution for the delay.
  • It can incorporate a prior on the delay.
  • Delivers predictions for out-of-sample data.

💾 Installation

Apart from cloning, an easy way of using the package is the following:

1 - Add the registry AINJuliaRegistry

2 - Switch into "package mode" with ] and add the package with

add GPCC

The package exposes the following functions of interest to the user:

  • posteriordelay,
  • simulatetwolightcurves and simulatethreelightcurves,
  • infercommonlengthscale,
  • gpcc,
  • uniformpriordelay.

These functions can be queried in help mode in the Julia REPL.

🚀 An important note about performance

(This note is not specific to the GPCC package; it applies in general whenever BLAS threads run concurrently to julia threads.)

The package supports the parallel evaluation of candidate delays. To that end, start julia with multiple threads. For instance, you can start julia with 8 threads using julia -t8. We recommend to use as many threads as physical cores.

To get the most performance, please read this note here concerning issues when running multithreaded code that makes use of BLAS calls. In most cases, the following instructions suffice:

using LinearAlgebra
BLAS.set_num_threads(1) # Set the number of BLAS threads to 1

using ThreadPinning # must be indepedently installed
pinthreads(:cores) # allows you to pin Julia threads to specific CPU-threads 

Unless you are using the Intel MKL, we recommend to always use the above code before estimating delays.

▶ How to simulate data

Method simulatetwolightcurves can be used to simulate data in 2 arbitrary (non-physical) bands:

using GPCC
tobs, yobs, σobs, truedelays = simulatetwolightcurves() # output omitted

A figure like the one above should show up displaying simulated light curves.

It is important to note how the simulated data are organised because function gpcc expects the data passed to it to be organised in the exact same way. First of all, we note that all three returned outputs are vectors whose elements are vectors (i.e. arrays of arrays) and that they share the same size:

# try this out in repl
typeof(tobs), typeof(yobs), typeof(σobs) 
size(tobs), size(yobs), size(σobs)

Each output contains data for 2 bands. tobs contains the observed times. tobs[1] contains the observed times for the 1st band, tobs[2] for the 2nd band. Similarly yobs[1] contains the flux measurements for the 1st band and σobs[1] the error measurements for the 1st band and so on. We can plot the data pertaining to the 2nd band as an example:

using PyPlot # must be indepedently installed, other plotting packages can be used instead
figure()
errorbar(tobs[2], yobs[2], yerr=σobs[2], fmt="o", label="2nd band")

▶ How to estimate delays

Two-lightcurves example

Start Julia with multiple threads. We simulate some data:

using GPCC, LinearAlgebra, ThreadPinning
BLAS.set_num_threads(1)
pinthreads(:cores) 

tobs, yobs, σobs, truedelays = simulatetwolightcurves()

We define a set of candidate delays that we would like to test:

candidatedelays = LinRange(0.0, 10.0, 100)

Having generated the simulated data, we will now estimate the delays. To that end we use the function posteriordelay:

 P = posteriordelay(tobs, yobs, σobs, candidatedelays; kernel = GPCC.rbf, iterations = 1000)

The returned P contains the probability of each candidate delay. We can plot the result with:

using PyPlot # must be independently installed
figure("Delay for two simulated lightcurves")
plot(candidatedelays, P)

Three-lightcurves example

We show how the above estimation of the posterior delay can be performed for three lightcurves:

using GPCC, LinearAlgebra, ThreadPinning
BLAS.set_num_threads(1)
pinthreads(:cores) 

tobs, yobs, σobs, truedelays = simulatethreelightcurves()

candidatedelays = LinRange(0.0, 6.0, 100)
P = posteriordelay(tobs, yobs, σobs, candidatedelays; kernel = GPCC.rbf, iterations = 1000)

size(P) # P is now a matrix, above it was a vector

using PyPlot # must be indepedently installed, other plotting packages can be used instead
figure();title("marginals")
plot(candidatedelays, vec(sum(P,dims=[2;3])))
plot(candidatedelays, vec(sum(P,dims=[1;3])))

figure(); title("joint distribution")
pcolor(candidatedelays, candidatedelays, P)

The above examples can be extended to more than three lightcurves.

▶ Estimate delays for real datasets

In the following script, we estimate the delays for a number of objects where two light curves are available. The real data are provided in the package GPCCData.jl. After stating Julia with multiple threads, we execute the following script:

using GPCC, LinearAlgebra, ThreadPinning
BLAS.set_num_threads(1)
pinthreads(:cores)

using GPCCData # needs to be indepedently installed, provides access to real data
using PyPlot # needs to be indepedently installed

let # WARMUP - Julia precompiles code

  tobs, yobs, σobs, truedelays = simulatetwolightcurves()
  candidatedelays = LinRange(0.0,4.0,3)
  posteriordelay(tobs, yobs, σobs, candidatedelays; kernel = GPCC.rbf);

end

candidatedelays = collect(0.0:0.1:60.0)

for i in 1:5
       tobs, yobs, σobs, lambda, = readdataset(source = listdatasets()[i])
       P = posteriordelay(tobs, yobs, σobs, candidatedelays; kernel = GPCC.OU)
       figure(); title(listdatasets()[i])
       plot(candidatedelays, P)
end

▶ How to fit a dataset with gpcc

We show how to fit the GPCC model and make predictions with it. To that end we use the function gpcc. Options for gpcc can be queried in help mode.

using GPCC

tobs, yobs, σobs, truedelays = simulatetwolightcurves();

# We first determine the lengthscale for the GPCC with the following call.
# We choose the rbf kernel. Other choices are GPCC.OU, GPCC.matern32, GPCC.matern52

ρ = infercommonlengthscale(tobs, yobs, σobs; kernel = GPCC.rbf, iterations = 1000)


# We choose the same kernel as the one used for inferring the length scale.
# Choosing a different kernel may lead to non-sensical results.
# We fit the model for the given the true delays above. 
# Note that without loss of generality we can always set the delay of the 1st band equal to zero
# The optimisation of the model runs for a maximum of 1000 iterations.

loglikel, α, postb, pred = gpcc(tobs, yobs, σobs; kernel = GPCC.rbf, delays = truedelays, iterations = 1000, ρfixed = ρ)

The call returns three outputs:

  • the marginal log likelihood loglikel reached by the optimiser.
  • a vector of scaling coefficients $\alpha$.
  • posterior distribution postb (of type MvNormal) for shift $b$.
  • a function pred for making predictions.

We show below how function pred can be used both for making predictions and calculating the predictive likelihood.

▶ How to make predictions

Having fitted the model to the data, we can now make predictions. We first define the interval over which we want to predict and use pred:

t_test = collect(0:0.2:62);
μpred, σpred = pred(t_test);

Both μpred and σpred are arrays of arrays. The $l$-th inner array refers to predictions for the $l$-th band, e.g. μpred[2] and σpred[2] hold respectively the mean prediction and standard deviation of the $2$-band. We plot the predictions for all bands:

using PyPlot # must be independently installed, other plotting packages can be used instead

colours = ["blue", "orange"] # define colours

figure()
for i in 1:2
    plot(tobs[i], yobs[i], "o", color=colours[i])
    plot(t_test, μpred[i], "-", color=colours[i])
    fill_between(t_test, μpred[i] + σpred[i], μpred[i] - σpred[i], color=colours[i], alpha=0.2) # plot uncertainty tube
end

▶ How to calculate log-likelihood on test data

Suppose we want to calculate the log-likelihood on some new data (test data perhaps):

ttest = [[9.0; 10.0; 11.0], [9.0; 10.0; 11.0]]
ytest = [ [6.34, 5.49, 5.38], [13.08, 12.37, 15.69]]
σtest = [[0.34, 0.42, 0.2], [0.87, 0.8, 0.66]]

pred(ttest, ytest, σtest)

❗ As a general note, running GPCC on more than four light curves and for a large number of candidate delays can be a very lengthy computation constrained by the available CPU and memory resources! This is because GPCC will try out in a brute force manner all possible delay combinations. We may address the efficiency of this computation in the future.

About

Gaussian process cross correlation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages