Frequentist Parameter Estimation with Supervised Learning

Recently there has been a great deal of interest surrounding the calibration of quantum sensors using machine learning techniques. In this work, we explore the use of regression to infer a machine-learned point estimate of an unknown parameter. Although the analysis is necessarily frequentist - relying on repeated estimates to build up statistics - we clarify that this machine-learned estimator converges to the Bayesian maximum a-posteriori estimator (subject to some regularity conditions). When the number of training measurements is large, this is identical to the well-known maximum-likelihood estimator (MLE), and using this fact, we argue that the Cramér-Rao sensitivity bound applies to the mean-square error cost function and can therefore be used to select optimal model and training parameters. We show that the machine-learned estimator inherits the desirable asymptotic properties of the MLE, up to a limit imposed by the resolution of the training grid. Furthermore, we investigate the role of quantum noise in the training process, and show that this noise imposes a fundamental limit on the number of grid points. This manuscript paves the way for machine-learning to assist the calibration of quantum sensors, thereby allowing maximum-likelihood inference to play a more prominent role in the design and operation of the next generation of ultra-precise sensors.

Supervised learning is any kind of machine learning technique that makes use of pre-labelled training data. The objective is to use these data to infer a labelling rule (or model) Θ_W(µ) that maps an input µ (known as a feature vector; in practice given by experimental data) to the desired output θ. Throughout this manuscript, we use the notation that W is a set of model parameters, to be adjusted during the training process. We argue that the calibration of parameter estimation experiments is itself a supervised learning task: typically one collects sequences of calibration measurements µ at many (known) parameter values θ_j (called the labels), with j running over a set of grid points. This data set is then used (e.g. via a trained neural network) to infer a function that accepts an arbitrary set of measurement results not used during calibration, and returns an estimate of the (unknown) parameter θ that has produced them.

Figure 1: The goal of supervised learning is to map a feature vector µ onto a label θ by training a model Θ_W(µ). These labels can be either smooth (regression, (a)) or discrete (classification, (b)). As no information about the underlying probability distribution is available, regression necessitates frequentist inference: a histogram is obtained from many estimates Θ (bottom left), the uncertainty is the statistical fluctuation in Θ (shaded gray), and for an unbiased estimator the mean converges to the true value (black dashed line). Another approach is probabilistic classification (right), whereby the model provides a conditional probability that a particular category is correct, given µ. These coarse-grained probabilities can be used to perform Bayesian inference (bottom right). The black dashed line is the true value, and the shaded gray region is a Bayesian confidence interval.
Broadly speaking, there are two kinds of supervised learning tasks: classification and regression [3,4,29]. Regression aims to make predictions on smooth or continuous quantities, and is the focus of this manuscript. In regression problems the labels can take on continuous values, and the labelling rule Θ_W(µ) is a smooth curve that should provide accurate predictions on θ values that do not necessarily coincide with one of the grid values. As the model can only return a point estimate of the parameter, this approach necessitates a frequentist analysis, whereby repeated estimates must be performed and analysed to obtain statistics. The mean of the resultant histogram is taken as the final estimate, and its width the uncertainty - the result of inevitable statistical fluctuations [Fig. 1(a)]. Although not the focus of this work, we provide a brief discussion of classification for context. Classification is well suited to naturally discrete or coarse-grained data, such as the classification of handwritten images [30]. The calibration of quantum sensors has also been formulated as a classification problem, whereby a grid of parameter values θ_j are treated as categories [22,25] [Fig. 1(b)]. In Ref. [25], the machine-learned confidence intervals assigned to each θ_j were shown to approximate the Bayesian posterior distribution P_W(θ_j|µ) present in the training set. Given P_W(θ_j|µ), one can extract a Bayesian estimate Θ_W(µ) (for instance given by the maximum or the mean value of the posterior distribution) and assign a parameter uncertainty to a confidence interval around this value [e.g. the shaded region in Fig. 1(b)]. In this sense, classification naturally performs Bayesian inference, and regression naturally performs frequentist inference [31].
In this manuscript we explore regression as a means for calibration of quantum sensors. We address some interesting and important elements of a regression-based calibration scheme that were not analyzed in previous studies [22-24]: we clarify the properties of the estimator Θ(µ) and its relation to the distribution of the training data. It was previously shown that a subjective prior distribution over the label set plays a key role in Bayesian estimation based on classification [25]. Is the same true in a frequentist scheme based on regression? In particular, we use concrete examples to demonstrate that the optimal estimator is typically the Bayesian maximum a-posteriori estimator when the number of training measurements is relatively small. When the number of training measurements is large, we demonstrate that the machine-learned estimator converges to the maximum-likelihood estimator. We thus explore conditions under which the machine-learned estimator is unbiased and saturates the ultimate Cramér-Rao sensitivity bound. Furthermore, we argue that this bound applies also to the cost function used during training, and if it can be computed a priori, could be used to aid in the selection of optimal model and training parameters.
Aside from focusing on the rather pragmatic problem of calibrating ultra-sensitive devices, our work also raises fascinating questions about the nature of supervised learning itself. When a "labelling rule" Θ_W(µ) (or model) is trained, its free parameters {W} are adjusted, for example using stochastic gradient descent [29,30], the aim of which is to minimise a cost or loss function. A common choice is the mean-square error,

  C(W) = ⟨(Θ_W(µ) − θ)²⟩_train,  (1)

where ⟨·⟩_train denotes an average over the whole training set. The question central to this manuscript is the following: when performing regression, which labelling rule (or estimator) Θ_W(µ) would we expect the machine to learn, and why? In this section we provide a brief discussion of frequentist inference, and then provide a simple argument intended to highlight why a particular labelling rule may be learned over another, based on the distribution of the training data.
Frequentist inference. The goal of parameter estimation is to infer a value of the parameter θ given a set of observations µ. A central concept is that of an estimator Θ(µ), a monotonic function which maps µ onto the parameter space. We wish to clarify that the subject of this manuscript is frequentist inference, although this does not preclude the use of Bayesian estimators, which can depend on any prior knowledge P(θ) available about θ in the absence of any information µ.
However, because any regression model learns the Bayesian point estimate Θ_W directly, we are forced to rely on frequentist inference strategies.
The frequentist paradigm relies on repeated observations, which are used to gain information about the distribution P(Θ|θ) [as in Fig. 1(a)]. The estimate is taken as the mean over all possible measurement sequences of fixed length m,

  θ_est = Σ_µ Θ(µ) P(µ|θ) ≡ ⟨Θ(µ)⟩_(θ).  (2)

In practice, this average would typically be approximated using a re-sampling technique such as bootstrapping. It is useful to define the bias of an estimator, b(θ) = ⟨Θ(µ)⟩_(θ) − θ. An estimator is said to be unbiased if its bias vanishes, ⟨Θ(µ)⟩_(θ) = θ and ∂_θ⟨Θ(µ)⟩_(θ) = 1 for all θ; in general, however, there is no guarantee that an estimator is unbiased. The uncertainty in the estimate is merely the result of inevitable statistical fluctuations; a common choice of risk function is the variance ∆²θ = ⟨(Θ(µ) − ⟨Θ(µ)⟩_(θ))²⟩_(θ). The variance of any unbiased estimator is bounded asymptotically by the Cramér-Rao bound (CRB), ∆²θ ≥ ∆²θ_CRB = 1/(mF), where F is the Fisher information,

  F(θ) = Σ_µ P(µ|θ) [∂_θ ln P(µ|θ)]².

Model training. Now we return to the question posed at the start of this section: which estimator might we expect a model to learn? Consider a pair of random variables (µ, θ), with some unknown relationship defined by the conditional probability P(µ|θ) (called the likelihood function). In Fig. 2 we provide a simple illustration of a set of points (µ_j, θ_j) sampled from P(µ|θ), for the special case that the feature vectors µ are one-dimensional. These points may lie along a curve f(µ), which must be encoded somehow in P(µ|θ). The goal of a regression algorithm is to use the training points to infer a model Θ_W(µ) that does a good job of reproducing f(µ). As an example, the likelihood function may be modelled as a Gaussian,

  P(µ|θ) ∝ exp[−(θ − f(µ))²/(2σ²)].  (5)

Such a likelihood function would produce training points that lie along the curve defined by f(µ) with some scatter given by σ, as illustrated in Fig. 2 (top panel). At least for this example, one would expect any reasonable fit to learn to reproduce the maximum, Θ_W(µ) ≈ f(µ). In a parameter estimation problem, the function f(µ) should be monotonic. However, the likelihood function is not the whole story: we argue that the prior knowledge P(θ) over the label set is also relevant. Suppose the labels are first sampled from P(θ). If the same µ value is sampled at multiple θ values (Fig. 2, bottom panels), then the relative frequency of each observation may also influence any possible fit to the training data. This argument is based only on the distribution of the training data, but should hold for any reasonable training and model choices. Specifically, we assume (1) a sufficiently large training set, (2) that the model Θ_W has enough flexibility to learn the "target" function f with reasonable fidelity, and (3) that the training algorithm has sensible hyperparameters. Because the model Θ_W can depend on a prior distribution P(θ) over the label set, in this sense we argue that regression is naturally Bayesian.

Figure 2: Training data sampled from the likelihood Eq. (5). (top) The training set (black circles) lies along the curve f(µ) (blue line) which maximises the likelihood function, with some scatter σ. Naively, given enough points one would expect a fit to reproduce this curve. (bottom) The prior distribution P(θ) can also influence the fit (orange line), if for example the same µ is observed at multiple θ. Here, the size of each circle is proportional to the number of observations.
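The Fisher information and CRB defined above can be evaluated numerically for any likelihood with a discrete outcome set. The sketch below does so with a simple central difference; the two-outcome likelihood is a hypothetical illustration chosen only so that the result is easy to check.

```python
import numpy as np

def fisher_information(p_of_theta, theta, dtheta=1e-5):
    """Numerically evaluate F(theta) = sum_mu (d_theta P(mu|theta))^2 / P(mu|theta)
    for a likelihood p_of_theta returning the outcome probabilities at theta."""
    p_plus = np.asarray(p_of_theta(theta + dtheta))
    p_minus = np.asarray(p_of_theta(theta - dtheta))
    p = np.asarray(p_of_theta(theta))
    dp = (p_plus - p_minus) / (2 * dtheta)   # central difference of P(mu|theta)
    return np.sum(dp**2 / p)

# Toy binary-outcome likelihood (an assumption for illustration):
# P(0|theta) = (1 + cos theta)/2, P(1|theta) = (1 - cos theta)/2.
lik = lambda th: np.array([(1 + np.cos(th)) / 2, (1 - np.cos(th)) / 2])

m = 200                         # number of independent measurements
F = fisher_information(lik, 1.0)
crb = 1.0 / (m * F)             # Cramer-Rao bound on the variance
```

For this particular likelihood F(θ) = 1 for all θ, so the bound reduces to 1/m.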
In summary, any machine-learning algorithm is trained on input-output pairs sampled from the joint distribution [3,29]

  P(µ, θ) = P(µ|θ) P(θ),  (6)

and it is the properties of this distribution that determine the optimal Θ_W(µ). Although P(θ) is a probability distribution over the label set, this does not necessarily imply that θ is a random variable: within the Bayesian interpretation of probability, P(θ) is a statement of knowledge or certainty, and is chosen subjectively by the user to include any knowledge about θ prior to the collection of any evidence µ.
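The argument of this section can be made concrete with a small simulation: sample training pairs whose labels scatter about a curve f(µ), then fit a simple least-squares model. The curve f(µ) = sin(µ), the noise level, and the polynomial model standing in for a trained network are all illustrative assumptions; for simplicity the features are drawn flat and Gaussian noise is added to the labels, which reproduces the scatter about f(µ) described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda mu: np.sin(mu)     # hypothetical "target" curve f(mu)
sigma = 0.05                  # scatter of the labels about the curve

# Sample (mu, theta) pairs: labels lie on f(mu) up to Gaussian scatter,
# mimicking points drawn from a Gaussian likelihood peaked on the curve.
mu = rng.uniform(0.2, 1.4, size=2000)
theta = f(mu) + sigma * rng.normal(size=mu.size)

# A least-squares polynomial fit plays the role of the trained model.
model = np.poly1d(np.polyfit(mu, theta, deg=5))

# With enough points and small scatter, the fit reproduces f(mu).
test_mu = np.linspace(0.3, 1.3, 50)
max_err = np.max(np.abs(model(test_mu) - f(test_mu)))
```

With 2000 points the maximum deviation of the fit from f(µ) is far below the single-point scatter σ, which is the sense in which "any reasonable fit" recovers the curve.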
Throughout this manuscript, we assume that the feature vectors µ = {µ_1, µ_2, ..., µ_m} contain m independent elements, which are denoted without boldface. In many supervised learning applications, these elements are not independent (think of an image converted into an array of pixel data; the ordering is crucial). We find it convenient to instead work with the relative frequencies f_µ of individual measurement outcomes µ.

Familiar estimators. A common Bayesian estimator is the value of the parameter θ that maximises the Bayesian posterior distribution for a given µ, called the maximum a-posteriori (MAP) estimator,

  Θ_MAP(µ) = argmax_θ P(θ|µ),  (9)

with ln P(θ|µ) = m Σ_µ f_µ ln P(µ|θ) + ln P(θ) − ln P(µ), where the sum runs over each unique measurement outcome µ, not each possible sequence µ. Here, P(θ|µ) = P(µ|θ)P(θ)/P(µ) is the Bayesian posterior, which follows from the symmetry of the joint distribution P(µ, θ) = P(θ, µ) [27,28].
We have used the fact that the likelihood function P(µ|θ) factorises for m independent measurements. The conditional probability P(θ|µ) is the posterior distribution, and can be thought of as an update of the prior knowledge, given new evidence µ. The denominator P(µ) is fixed by normalisation. When working exclusively in the frequentist formalism, the notion of prior knowledge is not meaningful, and thus another common estimator is the maximum-likelihood estimator (MLE),

  Θ_MLE(µ) = argmax_θ P(µ|θ) = argmax_θ Σ_µ f_µ ln P(µ|θ).

Up to θ-independent terms, the MAP estimator Eq. (9) can equivalently be written as

  Θ_MAP(µ) = argmax_θ [ Σ_µ f_µ ln P(µ|θ) + (1/m) ln P(θ) ],  (12)

so Θ_MAP and Θ_MLE coincide for flat priors P(θ), and for sufficiently well-behaved priors when m → ∞.
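The 1/m suppression of the prior term is easy to see numerically by maximising both objectives on a grid. The two-outcome likelihood P(↑|θ) = cos²(θ/2) and the Gaussian prior below are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

theta_grid = np.linspace(0.01, np.pi - 0.01, 1000)

# Toy two-outcome likelihood and an (assumed) Gaussian prior over the grid.
p_up = np.cos(theta_grid / 2) ** 2
log_lik = lambda f_up: f_up * np.log(p_up) + (1 - f_up) * np.log(1 - p_up)
log_prior = -(theta_grid - np.pi / 2) ** 2     # unnormalised ln P(theta)

def mle(f_up):
    """MLE: maximise sum_mu f_mu ln P(mu|theta) alone."""
    return theta_grid[np.argmax(log_lik(f_up))]

def map_est(f_up, m):
    """MAP, Eq. (12): the prior enters suppressed by 1/m."""
    return theta_grid[np.argmax(log_lik(f_up) + log_prior / m)]

# For small m the prior pulls the MAP toward pi/2; for large m, MAP -> MLE.
shift = map_est(0.9, 5) - mle(0.9)
```

For this likelihood the MLE has the closed form 2 arccos(√f_↑), which the grid search reproduces to within the grid spacing.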
Within the Bayesian formalism, the MLE can be understood as the asymptotic limit of the MAP estimator. Indeed, in the limit that m → ∞, some fairly general analytic results are available regarding the Bayesian posterior distribution. Using the fact that for independent measurement results the likelihood function factorises, it is possible to show that the Bayesian posterior becomes a Gaussian [27,28],

  P(θ|µ) ≈ sqrt(mF(Θ)/2π) exp[−m F(Θ) (θ − Θ)²/2],  (13)

where the maximum Θ of the Bayesian posterior is the MAP estimator Eq. (9), which converges to the MLE as m → ∞.
Of course, neither the MLE nor the MAP estimator can be computed unless the likelihood function P(µ|θ) is known. For this reason, it is always necessary to calibrate the estimation experiment, which typically requires a large set of training measurements. This is precisely an act of supervised learning, which brings us back to the question posed at the start of this section. If an estimator Θ_W(µ) is trained (or calibrated) on a large set of training points sampled from the joint distribution Eq. (6), which estimator would we expect it to learn, and why? When the number of training measurements is large, following the argument in Fig. 2, the Gaussian form of Eq. (13) would imply that the MAP should be the expected estimator. When the number of training measurements is small, no such general argument exists. However, in the following section we shall demonstrate that the answer is still very often the MAP estimator.

NUMERICAL RESULTS
Artificial neural network. In this section we explore frequentist parameter estimation using artificial neural networks (ANNs). A trained ANN is a function Θ_W(µ) of measurement data µ, depending on many free parameters {W} that are adjusted during training and characterize the ANN. The power of ANN models lies in their universality: ANNs can approximate any function with arbitrary accuracy, limited only by the size of the network [29,30]. As such, they are well suited to problems where the form of Θ_W(µ) is unknown. We do not discuss the details of ANNs here; for a pedagogical overview the reader is referred to Ref. [30]. The network design is depicted in Fig. 3. For a D-dimensional feature space (namely, D indicates the number of possible measurement outcomes), measurement results are tallied at a fixed (but unknown) θ. The resulting array of relative frequencies µ = {f_1, ..., f_D} is fed into the network. It is important to use relative frequencies (and not the raw tallies) because the ANN is then better able to generalise to any number of measurements. The output is a point estimate Θ_W(µ) of the parameter θ.
For the training set we define a uniform parameter grid consisting of d points θ_j over the relevant domain. The training measurements are performed as follows. First, a label θ_j is sampled randomly from the prior P(θ_j). A feature vector (measurement sequence) µ of size m is then sampled at this θ_j from the likelihood function P(µ|θ_j) = Π_{k=1}^m P(µ_k|θ_j). In our numerical study P(µ|θ_j) is known to us but never seen by the model. This process is repeated until M_j feature vectors have been collected, on average, at each grid point.

Figure 3: The network design. The number of hidden neurons n_h is related to the complexity of the model, introducing more free parameters W. The output neuron is able to take on any continuous value, hence this realizes a regression model.

If the number of measurements is sufficiently large, a re-sampling technique such as bootstrapping could be used (instead of considering several M_j feature vectors) if the total number of measurements is an important resource [23]. However, we do not employ such a bootstrapping strategy here. Once the training data has been collected, a random set of initial weights W is chosen for the model. For reasons of efficiency, it is common to divide the data into randomly chosen mini-batches, and compute the cost Eq. (1) only over the mini-batch (rather than the whole training set). Using the cost, the parameters W are updated, e.g. using gradient descent [30]. Once the entire training set has been used to compute an update of the model parameters W, an epoch is said to have passed. Typically a model is trained for many epochs. Throughout this manuscript we use the python-based package Keras [40] to build and train our ANN models. All networks have D input neurons (for a D-dimensional Hilbert space), n_h hidden neurons with the "ReLU" activation function, and a single output with linear activation, as in Fig. 3. Training is done using the ADAM algorithm for stochastic gradient descent, which aims to minimise the mean-square error cost Eq. (1).
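The architecture described above (D inputs, one ReLU hidden layer, one linear output, mean-square error cost) can be sketched without Keras in a few lines of numpy. The sketch below is a stand-in, not the paper's implementation: it uses plain full-batch gradient descent rather than ADAM and mini-batches, and the toy training set (features (f, 1−f) with labels on a smooth curve of f) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal regression network: D inputs -> n_h ReLU units -> 1 linear output.
D, n_h, lr = 2, 32, 0.01

W1 = rng.normal(0, 1 / np.sqrt(D), (D, n_h)); b1 = np.zeros(n_h)
W2 = rng.normal(0, 1 / np.sqrt(n_h), (n_h, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)        # ReLU hidden layer
    return h, (h @ W2 + b2).ravel()         # linear output neuron

# Toy training set: features are relative frequencies (f, 1-f); the labels
# follow a smooth monotonic curve of f (an illustrative assumption).
f = rng.uniform(0.05, 0.95, 500)
X = np.stack([f, 1 - f], axis=1)
y = 2 * np.arccos(np.sqrt(f))

_, pred0 = forward(X)
cost0 = np.mean((pred0 - y) ** 2)           # cost before training

for epoch in range(5000):                   # full-batch gradient descent
    h, pred = forward(X)
    err = pred - y                          # d(cost)/d(pred), up to 2/N
    gW2 = h.T @ err[:, None] / len(y); gb2 = err.mean(keepdims=True)
    dh = (err[:, None] * W2.T) * (h > 0)    # back-propagate through ReLU
    gW1 = X.T @ dh / len(y); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

cost = np.mean((forward(X)[1] - y) ** 2)    # mean-square error, Eq. (1)
```

After training, the mean-square error is far below its initial value, which is the basic behaviour the Keras models in the text exhibit during their first epochs.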
Single-qubit benchmark example. The likelihood function fully characterises the sensor, and is central to parameter estimation. Below, as a pedagogical example, we consider the case of a single qubit prepared in the state |↑⟩. The qubit is rotated by an angle θ in the Bloch sphere, such that |ψ_θ⟩ = exp(−iσ_y θ/2)|↑⟩ is the state after the rotation. Finally we project onto the σ_z basis, resulting in two possible outcomes, ↑ or ↓, with θ-dependent probabilities P(↑|θ) = cos²(θ/2) and P(↓|θ) = 1 − P(↑|θ), respectively. These probabilities are analogous to a classical biased coin with (generally) unbalanced head and tail events. For this likelihood function, the CRB and MLE can be computed analytically.

Figure 4: The cost Eq. (1) as a function of the number of training data m for the single-qubit example with a non-flat prior (explicitly shown in the inset). Here we manually compute the cost function for both the MLE (dashed orange line) and MAP (solid blue line) estimators on a training set with d = 10 grid points and M_j = 5000 feature vectors at each (on average). When m is small, the cost function will favour maximising the Bayesian posterior distribution, rather than the likelihood function. Interestingly, the cost function favours the Bayesian MAP estimator over the MLE even though the training data contain no information about the total number of measurements m; only the frequencies are used. As expected, when m becomes large the two estimators become indistinguishable to the cost function.
The CRB is ∆²θ = 1/m for all θ, which is also known as the standard quantum limit (SQL). The MLE is

  Θ_MLE = 2 arccos(sqrt(f_↑)).  (14)

This estimator is monotonic on the phase interval θ ∈ [0, π]. This system is particularly valuable for building intuition due to its simplicity. In particular, it is one-dimensional: as revealed by Eq. (14), the estimation problem is determined by a single number, the relative frequency of a single measurement result, here "↑". Aside from being a purely pedagogical example, NV-centre magnetometry is a well-known application of a single-qubit quantum sensor [2]. However, in these devices calibration is not the key challenge; typically the likelihood function can be modelled explicitly [? ].
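This benchmark is simple enough to simulate end to end. The sketch below draws repeated frequentist estimates using the analytic MLE of Eq. (14) and compares their spread to the CRB ∆²θ = 1/m; the particular values of θ, m, and the number of repetitions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

theta_true = np.pi / 2          # rotation angle to be estimated
m, trials = 100, 20000          # measurements per estimate, repeated estimates

# Each trial: m projective measurements, then the relative frequency of "up".
p_up = np.cos(theta_true / 2) ** 2
f_up = rng.binomial(m, p_up, size=trials) / m
estimates = 2 * np.arccos(np.sqrt(f_up))    # MLE, Eq. (14)

bias = estimates.mean() - theta_true        # vanishes asymptotically
variance = estimates.var()                  # should approach the CRB
crb = 1.0 / m                               # F(theta) = 1 for this qubit
```

The histogram of `estimates` is exactly the frequentist picture of Fig. 1(a): an essentially unbiased mean with variance saturating 1/m.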
In the following, we first present numerical results for this example, obtained in the non-asymptotic regime of a relatively small number of training data m. We then discuss the opposite limit of m ≫ 1, which is the relevant regime for calibrating quantum sensors.

Non-asymptotic results
First we train networks in the regime where each feature vector µ = {f_↑, 1 − f_↑} contains only a small number of measurements m for each θ_j. The number of feature vectors M_j sampled at each θ_j may still be large. In principle, the network only needs a single real number f_↑ as input; however, we have found that better results are obtained using two inputs, f_↑ and 1 − f_↑. When m is small, the effect of a non-flat prior knowledge P(θ_j) over the label set can be significant: notice that the MLE and MAP estimators differ [Eq. (12)] only by a prior-dependent term that scales like 1/m. To illustrate this difference, in Fig. 4 we manually evaluate the cost function over a training set sampled from a non-flat prior (i.e. no networks are trained here). This confirms that the mean-square error cost Eq. (1) does indeed favour the MAP estimator. Interestingly, the training set contains only frequencies, which lack any explicit information about m, whereas the MAP estimator cannot be computed without knowledge of m [see Eq. (9)]. Despite this, Fig. 4 demonstrates that the cost function favours the MAP estimator over the MLE when the prior P(θ) is non-flat. As expected, the difference in cost between the MAP and MLE estimators vanishes as m → ∞, confirming that to understand the role of P(θ_j), it is necessary to study the small-m limit. Of course, the case of small m is not relevant from an estimation point of view, since the estimator is generally strongly biased in this regime.

Figure 5: When m is small, ANNs can learn the MAP estimator, but when d is small they tend to over-fit. (top row) We compare the machine-learned estimator Θ_W (solid blue) to the Bayesian MAP estimator Θ_MAP (computed numerically) and the MLE Θ_MLE, Eq. (14). Data were sampled from a Gaussian prior (inset) with m = 10, on a training grid with d points in θ ∈ [0, π]. We visualise the training data as hollow circles, whose relative sizes are proportional to the number of times a particular measurement result f_↑ occurred at that θ_j relative to the most common result. When d is small, the ANN tends to overfit, learning an estimator that performs well on the training set but will obviously generalise poorly (such as the d = 3 results here). As d increases, the model appears to learn a good approximation to the MAP estimator. An indication that overfitting has occurred is that the model's cost function is far below its value when evaluated for the MLE and MAP estimators (horizontal lines). The ANN models use a single input neuron taking the value f_↑ and one hidden layer with n_h = 1024 neurons. The training set contains M_j = 10³ samples on average with m = 10 at each grid point. During training a mini-batch size of 32 was used.
In Fig. 5 we show the machine-learned estimator Θ_W as a function of f_↑, compared to the MLE and MAP estimators for a Gaussian prior P(θ) ∝ exp[−(θ − π/2)²] (Fig. 5, inset). We train the ANN model using three different training sets with d = 3, 4, 6 grid points, using this broad Gaussian prior at m = 10.
The hollow circles are a visualisation of the training set. The diameter of each circle is proportional to the number of times that f_↑ was observed at a particular label θ_j, relative to the most frequent outcome in the entire training set; in other words, they are proportional to P(θ|f_↑). Along the bottom row we plot the cost during training, compared to the cost computed manually for the MLE and MAP estimators over the training set (as in Fig. 4).
With d = 3 grid points, the ANN learns a highly biased estimator that performs well over the training set, but generalises poorly to unseen values of θ. This is indicated by a cost that is much lower than the expected cost, achieved by learning an estimator that is mostly flat. The point is that although Fig. 4 indicates that the MAP estimator is favoured over the MLE, at least for this training set the highly biased estimator depicted in Fig. 5 (d = 3) has a far lower cost than either. In other words, this model performs well on the training set, but generalises poorly. In the language of machine learning, this is called overfitting [29].
Increasing the number of grid points combats overfitting to some extent. For example, even including a single additional grid point (d = 4) brings the model much closer to the MAP estimator (as indicated by comparing the costs). This is caused by a kind of competition between neighbouring points: due to intrinsic noise ∆θ in the training data, increasing d while holding m fixed inevitably leads to the same input f_↑ being observed at different θ_j. This spread ∆θ is visible in the training set in Fig. 5 (hollow circles); notice that both θ = 0, π represent eigenstates of the measurement observable and therefore have ∆θ = 0. When neighbouring points "overlap" to a large degree, in order to minimise the cost the model is forced to interpolate between these points, and this is precisely when the prior has a significant effect (see Fig. 2). As a result, a mostly flat estimator, such as that learned for d = 3, is no longer optimal. Finally, with only d = 6 grid points the model appears to learn a good approximation to the MAP estimator, which is clearly distinct from the MLE at m = 10. As discussed in Fig. 4, notice that the cost function evaluated for the MAP estimator is smaller than that of the MLE due to the non-flat prior.
Even aside from overfitting, the MAP estimator is not always optimal. In fact, the model only appears to learn an approximation to the MAP estimator when the Bayesian posterior distribution P(θ_j|f_↑) is sufficiently "regular", namely smooth and single-peaked. In Fig. 6 we explore the machine-learned estimator Θ_W for several priors, again for d = 6 grid points with m = 10. With m = 10, the priors chosen in Fig. 6(a-c) yield single-peaked P(θ_j|f_↑), and we confirm that in these cases the machine-learned estimator learns a good approximation to the maximum Bayesian estimator, Θ_W ≈ Θ_MAP. Fig. 6(d) provides a counter-example: the machine-learned estimator does not converge to the MAP estimator. In this case, the simple picture presented in Fig. 2 does not hold; there is not a well-defined curve f(f_↑) present in the distribution of training points.

Asymptotic Results
In this section we turn our attention to the training of ANN models in the asymptotic limit m → ∞, and also study the usefulness of such models for performing frequentist inference after they have been trained. To avoid confusion, we clarify here that the models are trained on sequences of m measurements during training, and after training, the model is used to perform inference on a sequence of ν measurements at a fixed but unknown phase θ, which is to be estimated. Crucially, m and ν need not be equal, and in fact any useful model should generalise well to values of ν not seen during training.
In the previous section we found that so long as the Bayesian posterior distribution has a single, well-resolved peak, an ANN estimator will learn a good approximation to the MAP estimator. When the number of training measurements is large, the Bayesian posterior converges to a Gaussian distribution and the MAP estimator converges to the MLE. Thus, we can conclude that when m → ∞, under ideal training conditions an ANN estimator should learn a good approximation to the MLE. This result is demonstrated in Fig. 7 for the single-qubit example introduced in the previous section. For a Gaussian prior, we plot the machine-learned estimator for increasing m and observe that it converges to the MLE. More precisely, in Fig. 6(a-c) we found that for a Gaussian posterior distribution that is not overfit, the machine-learned estimator is close to the MAP estimator Θ_W ≈ Θ_MAP(m), which depends on the number of training measurements m. As m → ∞ we have Θ_MAP → Θ_MLE, and this is precisely what is observed in Fig. 7. In particular, in a "well-trained" network (see discussion below) the parameter sensitivity achieved by the estimator should saturate the CRB.

Figure 6: We demonstrate that the machine-learned estimator Θ_W learns a good approximation to the Bayesian MAP estimator for several other priors (a-c), and also provide a counter-example (d). All models were trained with d = 6 grid points and m = 10, and otherwise the same training set and model architecture as Fig. 5. The points are a visualisation of the training set; the relative diameters are proportional to the number of observations. Notice that when an observation f_↑ appears at multiple phases, the relative number of observations (proportional to the prior) determines the fit.
The asymptotic role of prior knowledge.
Here we investigate the impact of the prior distribution in the case of a large number of measurements. Crucially, the asymptotic result Eq. (13) is prior-independent, but this is not to say that prior knowledge can never play a role asymptotically. For instance, Eq. (13) only holds if the prior is non-zero and non-singular around Θ_MLE, which need not always be the case.
First, we consider a step-function prior. During training, the model is never shown data from half the grid, θ ∈ [π/2, π] [Fig. 8(a), inset]. The resultant model Θ_W is a good approximation to the MLE over the grid-half that it was shown (θ ∈ [0, π/2]), but for θ > π/2 the model attempts to generalise [Fig. 8(a)], and the result is a highly biased estimator that is neither the MLE nor the MAP. After training, if the model is used to estimate a phase θ ∈ [π/2, π], the resultant estimate is biased [Fig. 8(b)] but with reduced variance [Fig. 8(c)], even as ν → ∞. Compare this to the MLE, which is asymptotically unbiased and saturates the CRB.
Although step-function priors may be a little contrived, we include the example to clearly make the following point: even asymptotically, prior knowledge plays an inescapable role in the supervised training of an estimator Θ_W. To be more quantitative, we now consider a narrow Gaussian prior, P(θ) ∝ exp[−(θ − θ_0)²/(2σ²)]. If the prior is sufficiently narrow, it will produce similar results to the step-function prior, because there will be many θ-values rarely (or never) shown to the model during training. In Fig. 9 we verify that even when this is the case, the machine-learned estimator has the correct asymptotic properties (unbiased and saturates the CRB as ν → ∞). We train several models with different σ, and use them to estimate θ_0. The variance and bias are plotted as a function of ν, revealing that so long as ν ∼ 1/σ² or greater, the estimate is roughly unbiased and saturates the CRB. When ν ≪ 1/σ² the estimates are inconsistent: large fluctuations are observed in the trained models, which have high bias and reduced variance. Note that in Fig. 9 we use m = 10⁴, for which the MLE and MAP estimators are extremely close; however, due to the narrow prior the model is never shown some θ-values, and similarly to the step-function prior (Fig. 8), it learns a distinct estimator. As a general rule, we might say that for well-behaved priors, the Fisher information of the prior provides a threshold for the optimal use of the machine-learned estimator.
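The mechanism behind this breakdown is simply grid coverage, which can be illustrated directly. The sketch below samples training labels from a narrow Gaussian prior and snaps each one to its nearest grid point; the snapping step and the specific numbers (d = 10, σ = 0.1) are illustrative assumptions, not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

d = 10
grid = np.linspace(0, np.pi, d)              # uniform training grid
theta0, sigma = np.pi / 2, 0.1               # narrow Gaussian prior

# Sample training labels from the prior; snap each to the nearest grid
# point to see which grid points ever receive training data.
samples = rng.normal(theta0, sigma, size=1000)
nearest = np.abs(samples[:, None] - grid[None, :]).argmin(axis=1)
visited = np.unique(nearest)                 # grid points actually seen

# Rule of thumb from the text: the Fisher information of the prior,
# 1/sigma^2 for a Gaussian, sets the threshold nu above which the
# machine-learned estimator behaves like the MLE.
nu_threshold = 1.0 / sigma**2
```

With these numbers only a couple of the ten grid points are ever visited, so most of the grid plays the same role as the "hidden" half-grid in the step-function example.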

OPTIMAL TRAINING PARAMETERS
The discussion presented thus far has been concerned with the properties of the machine-learned estimator Θ_W, based on the statistics of the training set when the number of training data m is large. From the Gaussian form of the Bayesian posterior distribution Eq. (13) in the limit where m → ∞, we argue that one would generally expect to learn the MLE, as demonstrated in Fig. 7. In Fig. 10 we show that when m → ∞, the convergence of the cost to the phase-averaged CRB can be an indication that the model is well-trained. After the model is trained, the goal is of course to use it to estimate an unknown value of the parameter θ based on a sequence of ν independent measurements. Crucially, ν and m need not be the same. In this section, we discuss some conditions for the optimal performance of the machine-learned model Θ_W. Specifically, up to some ν it should be (1) unbiased, within the inevitable statistical fluctuations ∆²θ, and (2) these fluctuations should be close to the CRB.
No model can truly be expected to perform well for arbitrary ν; at some point the model must break down. Here we show that the density of the training grid is the key factor limiting the model's ability to generalise to large ν. However, the minimum useful training-grid density is itself limited by quantum noise in the training data.
When is the network "well-trained"?
Of course, ideal training conditions can be difficult to achieve in practice. However, one can exploit the fact that, asymptotically in the number of training measurements, the MLE ought to be the optimal model, and use this to choose good training hyperparameters. The key observation is that the mean-square error cost function, Eq. (1), should also be bounded by the CRB. If the number of feature vectors M_j sampled at a particular θ_j is large, the sample average used to compute the cost should converge to the expected value. Here the overline denotes an average over all samples µ collected at a single θ_j. This quantity is the mean-square error of the estimator Θ_W, which is also bounded by the Cramér-Rao bound [41]. We introduce this notation for the phase-averaged CRB. If one expects the optimal model to be close to the MLE, then the cost should saturate the phase-averaged CRB. Unlike the mean-square error (as in Fig. 5, which provides a tighter bound), the phase-averaged CRB has the advantage of being estimator-independent. If the CRB can be computed, the phase-averaged CRB can be used to check that a particular choice of model and training algorithm is optimal given the training data, so long as (1) m is sufficiently large that the asymptotic result Eq. (13) applies, and (2) the training set is sufficiently large that sample averages approximate expectation values (i.e. large M_j).
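As a concrete illustration, the phase-averaged CRB for the single-qubit example can be evaluated numerically. The sketch assumes the likelihood P(↑|θ) = cos²(θ/2), for which F(θ) = 1 and the flat-prior average reduces to 1/m; the helper names are ours.

```python
import numpy as np

# Numerical check of the phase-averaged CRB as a training target.
# Assumed single-qubit likelihood: P(up|theta) = cos^2(theta/2).

def fisher(theta):
    p = np.cos(theta / 2) ** 2
    dp = -0.5 * np.sin(theta)           # dp/dtheta
    return dp ** 2 / (p * (1 - p))      # binomial Fisher information

m = 1000
thetas = np.linspace(0.05, np.pi - 0.05, 500)  # avoid p in {0, 1}
crb = 1.0 / (m * fisher(thetas))               # CRB(theta) = 1/(m F(theta))
crb_phase_avg = float(np.mean(crb))            # flat-prior average
print(crb_phase_avg)                           # ~ 1/m = 1e-3
```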
Of course, the phase-averaged CRB is not a strict inequality: aside from the obvious possibility of overfitting (as in Fig. 5), the CRB is an asymptotic bound that can be violated from below at finite m. Furthermore, the CRB bounds an expectation value, whereas the cost is computed as a sample mean, which is prone to additional statistical fluctuations. Despite this, if the CRB can be computed a priori (for instance, any separable quantum system is bounded by the standard quantum limit (SQL), ∆²θ_CRB ≥ ∆²θ_SQL = 1/mN, where N is the number of probes), the phase-averaged CRB can be a useful heuristic for designing models and choosing training hyperparameters. For instance, in Fig. 10 we demonstrate that the phase-averaged CRB can be used to select the number of neurons in the network's hidden layer n_h, and the number of training epochs. Roughly speaking, increasing n_h increases the number of free parameters {W} to be trained, but also provides the model with additional flexibility. In Fig. 10, if n_h is too small, the cost plateaus early during training, indicating that the model cannot learn the MLE as well as possible given the training set. However, saturation of the phase-averaged CRB does not imply that a given model will necessarily generalise well to unseen inputs: a training set that contains only a single θ point will always generalise poorly.
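One possible way to operationalise this heuristic is an early-stopping rule that halts training once the empirical cost is within some tolerance of the a-priori bound. The sketch below uses the SQL target 1/(mN) and is purely illustrative; the function and its arguments are not from the original.

```python
# Illustrative early-stopping rule (not the authors' implementation):
# halt training once the empirical cost is within a tolerance of the
# a-priori SQL bound 1/(m N).

def train_until_saturated(cost_per_epoch, m, N, tol=0.05):
    """Return the first epoch whose cost is within tol of the SQL."""
    target = 1.0 / (m * N)
    for epoch, cost in enumerate(cost_per_epoch):
        if cost <= target * (1.0 + tol):
            return epoch       # cost has (approximately) saturated the bound
    return None                # never saturated: model likely under-trained

# Example: with m = 1000, N = 1 the target is 1e-3, so the rule fires
# at the third epoch (index 2) below.
epoch = train_until_saturated([1e-2, 2e-3, 1.02e-3], m=1000, N=1)
```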
The role of grid resolution relative to ν.
In Fig. 11, again for the single-qubit system, we plot the mean bias and variance of the machine-learned estimator Θ_W, both as a function of ν for fixed θ (Fig. 11, a-d) and as a function of θ for fixed ν (Fig. 11, e-h), for two different grid resolutions δθ = L/(d − 1). Training is performed until the cost function comes as close as possible to the phase-averaged CRB (as discussed in Fig. 10; here CRB_θav = 1/m), indicating that Θ_W ≈ Θ_MLE as well as possible given the training set. In Fig. 11 (a-d), the grid resolution is indicated as a vertical dashed green line, and is an excellent predictor of the point beyond which training starts to produce inconsistent results. Specifically, we train 10 models and indicate the average with a solid blue line; the shaded gray regions are one standard deviation. Beyond the resolution limit, the machine-learned model does not converge to a consistent estimator, and does not reliably saturate the CRB. Far enough beyond the resolution limit, the model breaks down entirely. In the inset we show that the bias eventually becomes non-zero by a margin not accounted for by the variance ∆²θ = 1/ν (dashed red lines), indicating that the model can no longer be trusted. In the bottom panels, Fig. 11 (e-h), we verify that Θ_W reproduces the MLE across a range of θ values not present in the training set. As expected, the model performs better when trained on a finer grid resolution. To conclude: a well-trained model can reliably replicate the MLE and, importantly, inherit its desirable asymptotic properties, i.e. it is unbiased and saturates the CRB.
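The resolution limit just described admits a simple back-of-envelope estimate: the model can be trusted roughly while the statistical fluctuation 1/√(νF) exceeds the grid spacing δθ = L/(d − 1), giving ν* ≈ 1/(F δθ²). A hedged sketch of this estimate (F = 1 assumed for the single qubit; the helper name is ours):

```python
import numpy as np

# Back-of-envelope estimate of the resolution limit: the model is
# trusted roughly while the CRB fluctuation 1/sqrt(nu F) exceeds the
# grid spacing delta_theta = L/(d - 1). F = 1 is assumed (single qubit).

def max_reliable_nu(L, d, F=1.0):
    dtheta = L / (d - 1)
    return 1.0 / (F * dtheta ** 2)

# A d = 50 grid supports a far larger nu than a d = 10 grid:
print(max_reliable_nu(np.pi, 10), max_reliable_nu(np.pi, 50))
```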
The role of m relative to the grid resolution.
Finally, we explore the role of quantum noise in the training data. The number of training measurements m performed at each θ results in a variance bounded by ∆²θ_CRB(θ) = 1/mF(θ) when m is large. This effectively sets a resolution limit for the training grid, beyond which neighbouring grid points cannot be distinguished. In Fig. 12, assuming the MLE is the expected result, we plot the distance between the machine-learned estimator Θ_W and the MLE. As the resolution increases, Θ_W comes closer to learning the MLE, but only up to a point; beyond it, a finer training grid offers no additional advantage. We train 10 models for m = 10^2 and m = 10^4, demonstrating that reducing the noise in the training set (which allows a finer resolvable grid) produces an estimator that comes closer to the MLE. In other words, the number of training measurements sets a noise limit on the minimum useful grid resolution, which in turn sets the maximum possible ν for which the model is reliably unbiased and saturates the CRB.
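The noise limit on the grid can likewise be estimated: a finer grid stops helping once the spacing L/(d − 1) drops below the statistical width ∼ 1/√(mF) of each training point. The sketch below (F = 1 assumed; the helper name is ours) returns the largest useful number of grid points d:

```python
import numpy as np

# Sketch of the quantum-noise limit on the training grid: spacing
# L/(d - 1) below the statistical width ~ 1/sqrt(m F) of each training
# point buys nothing. F = 1 assumed; the helper name is illustrative.

def max_useful_gridpoints(L, m, F=1.0):
    width = 1.0 / np.sqrt(m * F)
    return int(np.floor(L / width)) + 1   # largest d with L/(d-1) >= width

print(max_useful_gridpoints(np.pi, 100))    # m = 10^2
print(max_useful_gridpoints(np.pi, 10000))  # m = 10^4
```

This illustrates why the m = 10² and m = 10⁴ training sets in Fig. 12 support different grid densities.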

NON-CLASSICAL STATES OF MANY QUBITS
Thus far we have made an extensive study of a single qubit, mainly to build understanding. However, it is important to study more complex states. In particular, there is currently a great effort to engineer many-qubit states with non-classical correlations, capable of providing sensitivities beyond the SQL [1]. The limitations explored in the previous section must be revisited with these more complex states in mind. Specifically, the criterion Eq. (19) is not a strict cut-off; rather, it should be interpreted as a regime where the model can no longer reliably reproduce the MLE. Eventually, this manifests as large fluctuations in the point estimates Θ_W provided by different models with identical training. Combining the regimes Eq. (19) and Eq. (20), one can deduce that ν < m, which limits the usefulness of the model. If we must collect many more training measurements m than we could ever actually use in a real application, the model is in some sense wasteful. For this reason it is important that the model performs well (i.e. it has sub-SQL variance and is unbiased) for some ν beyond the resolution limit. In this section we verify that machine-learned estimators can yield unbiased sub-shot-noise sensitivities for complex non-classical states even beyond the resolution limit, effectively demonstrating that these models can usefully generalise from the training set.
In Fig. 13 we compute the variance and bias for a twin-Fock state (TFS) as a function of the number of measurements m at a fixed phase. An N-qubit TFS is defined by a symmetrized combination of N/2 spin-up and N/2 spin-down particles, |TFS⟩ = Symm{|↑⟩^⊗N/2, |↓⟩^⊗N/2}. We consider a rotation of this state by the unitary Û = exp(−iĴ_y θ), where the collective spin operator is Ĵ_k = (1/2) Σ_i σ̂_k^(i), with σ̂_k^(i) the kth Pauli matrix for the ith qubit. The operation Û is equivalent to a Mach-Zehnder interferometer [1]. The CRB for this system is ∆²θ_CRB = 1/[ν N/2(N/2 + 1)], which is below the SQL ∆²θ_SQL = 1/νN. In Fig. 13 we train 10 models Θ_W until the phase-averaged CRB is saturated, and compare the mean (solid blue) and standard deviation (shaded gray region) to the MLE (dashed orange). Above the resolution of the training grid, the models exhibit large variance, similarly to the single-qubit example in Fig. 11. However, the variance remains comfortably below the SQL, and the bias remains zero within the variance ∆²θ (computed exactly for the TFS and plotted in dashed red; see Fig. 11 inset). This important benchmark indicates that useful sub-SQL parameter estimation is possible using an ANN estimator, and that the estimator generalises well beyond the ν used for training (m = 10^3 in this case). Similar results were reported in Ref. [23], but not at the CRB, presumably due to the resolution of the training grid.
One final observation is that for the TFS, unlike the single-qubit example, the model generalises poorly to ν ≪ m. See for instance in Fig. 13 that between ν ∼ 1-10 (note m = 10^3) the machine-learned estimators differ significantly from the MLE. The reason is simply that the TFS likelihood function is extremely non-Gaussian for small ν, whereas the model was trained in the asymptotic limit (such that Eq. (13) applies). As a result, feature vectors µ comprising relatively few measurements differ significantly from the training examples and are therefore difficult for the model to label correctly. This is not a major concern, as experiments are typically not interested in the low-ν regime, which is generally highly biased with sensitivity far from the CRB (see the dashed-orange curve).
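The sensitivity bounds quoted above for the twin-Fock state are easy to check numerically. This sketch (helper names are ours) confirms that the TFS Cramér-Rao bound lies below the SQL for N = 4:

```python
# Numerical check of the sensitivity bounds quoted for the twin-Fock
# state: the TFS Cramer-Rao bound 1/[nu (N/2)(N/2 + 1)] lies below
# the SQL 1/(nu N). Helper names are illustrative.

def tfs_crb(nu, N):
    return 1.0 / (nu * (N / 2) * (N / 2 + 1))

def sql(nu, N):
    return 1.0 / (nu * N)

N, nu = 4, 500
print(tfs_crb(nu, N), sql(nu, N))  # the TFS bound is the smaller number
```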

CONCLUSION
In this manuscript we have explored the training of a machine-learned point estimator using artificial neural networks. Using a simple single-qubit example, we show that prior knowledge plays an important role in determining which estimator the machine will "decide" to learn, especially when the number of training measurements is small. So long as the Bayesian posterior distribution over the training set is single-peaked, smooth and non-zero around θ, this estimator is a good approximation to the Bayesian maximum a-posteriori estimator. We regard this result (that prior knowledge is an inescapable part of the training process) as evidence that supervised learning is itself naturally Bayesian.
When the number of training measurements is large, this Bayesian estimator coincides with the well-known maximum-likelihood estimator. Based on this, we argue that during training the cost function should converge to the Cramér-Rao bound. If this bound is known a priori, it can be used to select optimal training parameters (such as the number of hidden neurons in the network, the batch size, the number of training epochs, etc.). We then argue that so long as the Fisher information of the prior is large relative to the CRB, the resolution of the training grid is key to building a model that generalises well to large ν. Furthermore, we show that quantum noise in the training data imposes a fundamental limit on the minimum useful resolution of the training grid. Finally, using a specific example, we show that ANN estimators can provide unbiased estimates with sensitivities below the classical limit up to a number of measurements larger than that present in the training set, indicating that the model can generalise well. Our results pave the way for maximum-likelihood inference to play a more significant role in the operation of quantum sensors. Currently, MLE is relegated to a handful of proof-of-principle experiments [35,36,42,43] due to challenges associated with device calibration.

Figure 1 :
Figure 1: Schematic of supervised learning in relation to parameter estimation. The goal of supervised learning is to map a feature vector µ onto a label θ by training a model Θ_W(µ). These labels can be either smooth (regression, (a)) or discrete (classification, (b)). As no information about the underlying probability distribution is available, regression necessitates frequentist inference. A histogram is obtained from many estimates Θ (bottom left). The uncertainty is given by the statistical fluctuations in Θ (shaded gray), and for an unbiased estimator the mean converges to the true value (black dashed line). Another approach is probabilistic classification (right), whereby the model provides a conditional probability that a particular category is correct, given µ. These coarse-grained probabilities can be used to perform Bayesian inference (bottom right). The black dashed line is the true value, and the shaded gray region is a Bayesian confidence interval.

Figure 2 :
Figure 2: Simple illustration of regression. (top) For a one-dimensional input µ, we illustrate a set of training examples sampled from a Gaussian likelihood function Eq. (5). The training examples (black circles) lie along the curve f(µ) (blue line) which maximises the likelihood function, with some scatter σ. Naively, given enough points one would expect a fit to reproduce this curve. (bottom) The prior distribution P(θ) can also influence the fit (orange line), if for example the same µ is observed at multiple θ. Here, the size of each circle is proportional to the number of observations.

Figure 3 :
Figure 3: Artificial neural network architecture. Throughout this manuscript we use this architecture to train a model able to map measurement results, converted to relative frequencies µ = {f_1, f_2, ..., f_D}, to a phase estimate Θ. The number of hidden neurons n_h is related to the complexity of the model, introducing more free parameters W. The output neuron can take on any continuous value, hence this realizes a regression model.
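For concreteness, the architecture of Figure 3 can be sketched in a few lines of numpy. The random weight initialisation and the sigmoid hidden activation are our assumptions; they are not specified in the text.

```python
import numpy as np

# Minimal numpy sketch of the Figure 3 architecture: D input neurons
# (relative frequencies), one hidden layer of n_h neurons, and a single
# linear output neuron. Initialisation and activation are assumptions.

rng = np.random.default_rng(2)
D, n_h = 2, 1024
W1, b1 = rng.normal(0.0, 0.1, (n_h, D)), np.zeros(n_h)
W2, b2 = rng.normal(0.0, 0.1, (1, n_h)), np.zeros(1)

def theta_W(mu):
    """Map a feature vector mu (length D) to a continuous phase estimate."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ mu + b1)))   # sigmoid hidden layer
    return float((W2 @ h + b2)[0])              # unrestricted linear output

estimate = theta_W(np.array([0.7, 0.3]))
```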

Figure 4 :
Figure 4: Impact of the prior on the cost function. Mean-square error cost function Eq. (1) as a function of the number of training data m for the single-qubit example with a non-flat prior (shown explicitly in the inset). Here we manually compute the cost function for both the MLE (dashed orange line) and MAP (solid blue line) estimators on a training set with d = 10 grid points and M_j = 5000 feature vectors at each (on average). When m is small, the cost function favours maximising the Bayesian posterior distribution rather than the likelihood function. Interestingly, the cost function favours the Bayesian MAP estimator over the MLE even though the training data contain no information about the total number of measurements m; only the frequencies are used. As expected, when m becomes large the two estimators become indistinguishable to the cost function.

Figure 6 :
Figure 6: Comparing models with different priors. We demonstrate that the machine-learned estimator Θ_W learns a good approximation to the Bayesian MAP estimator for some other priors (a-c), and also provide a counter-example (d). All models were trained with d = 6 grid points and m = 10, and otherwise the same training set and model architecture as Fig. 5. The points are a visualisation of the training set; the relative diameters are proportional to the number of observations. Notice that when an observation f↑ appears at multiple phases, the relative number of observations (proportional to the prior) determines the fit.

Figure 7 :
Figure 7: Asymptotically in m, the machine-learned estimator converges to the MLE. For a single qubit with a Gaussian prior (inset), we train three models with m = 5, 10 and 500. As m increases, Θ_W appears to converge to the MLE. Models were trained on d = 10 grid points with M_j = 10^3 feature vectors each. Gradient descent was run for 50 training epochs with a mini-batch size of 32. Models have a single hidden layer with n_h = 1024 neurons.

Figure 8 :
Figure 8: Step-function prior. For a single qubit we train a model using a step-function prior, resulting in a training set containing no examples from half the grid, θ ∈ [π/2, π]. Unlike Fig. 6, the training set is firmly in the asymptotic limit with m = 10^4. (a) The machine-learned estimator Θ_W is compared to the MLE and MAP estimators, and in (b,c) we plot the bias and variance, normalised to the CRB, as a function of m at θ = 0.8π, deliberately chosen to test the model's performance in the grid-half it was never shown. The result is a highly biased estimator, even as m → ∞. We use a single hidden layer with n_h = 1024 neurons. The training set has d = 5 grid points in θ ∈ [0, π/2], with M_j = 10^3 feature vectors at each grid point, on average. During training a mini-batch size of 8 was used for a total of 50 training epochs.

Figure 9 :
Figure 9: Saturation of the CRB with non-flat priors. We plot the mean bias (top row) and variance (bottom row) of an estimate at θ = 0.4π as a function of ν, for a single qubit. The goal is to demonstrate that the machine-learned estimator has the correct asymptotic properties (unbiased and saturating the CRB) for narrow Gaussian priors with variance σ^2 and mean θ_0 = 0.4π. We train 10 models with a single hidden layer of n_h = 1024 neurons, and show the mean (solid blue line) and standard deviation (shaded gray regions) compared to the MLE (orange dashed line). Vertical dashed green lines indicate the inverse prior variance 1/σ^2. The training set contained d = 50 uniformly spaced grid points between θ ∈ [0, π] with Gaussian prior P(θ) ∝ exp[−(θ − θ_0)^2/2σ^2]. The mean number of feature vectors used at each phase was M_j = 10^3, with m = 10^4 measurements in each. During training we use a mini-batch size of 128 for 10 training epochs.

Figure 10 :
Figure 10: The cost function converges to the phase-averaged CRB. For a single qubit in the asymptotic regime (m = 10^3), we train three different ANN models with the same training set. As the number of hidden neurons n_h increases (all models have only a single hidden layer), the cost function approaches the phase-averaged CRB as the training set is iterated over. An epoch is a single training iteration over the entire training set. We use a mini-batch size of 8 on a training set with d = 10 uniformly spaced grid points between θ ∈ [0, π] sampled from a flat prior. The mean number of feature vectors used at each phase was M_j = 10^2.

Figure 11 :
Figure 11: The grid resolution limits generalisability. For a single qubit we train 10 models with m = 10^3 until the cost saturates the phase-averaged CRB C ∼ 1/m, and show here the mean result (solid blue) and standard deviation (shaded gray regions). We test the models and compute the bias (a,b,e,f) and mean-square error normalised to the CRB (c,d,g,h), both as a function of ν at θ = 0.2π (top) and as a function of θ at ν = 500 (bottom). The θ values used to generate this plot are deliberately distinct from the training grid. To explore the role of the grid resolution, we compare a grid with d = 10 grid points (left) to d = 50 (right), both between θ ∈ [0, π] and sampled from a flat prior. In (a-d) we indicate the grid-resolution limit δθ = L/(d−1) with a vertical dashed green line. The models consistently agree well with the MLE, up to a point determined by the grid resolution, beyond which the results have large fluctuations relative to the bias. In the inset of (a), we show that for ν large enough, the model fails totally, as indicated by a non-zero bias by a margin not accounted for by the variance ∆²θ = 1/ν (dashed red lines). All models have a single hidden layer of n_h = 1024 neurons. The training set contains M_j = 2 × 10^3 feature vectors at each grid point (on average), and during training we use a mini-batch size of 256 with 75 training epochs, which ensures the cost saturates the phase-averaged CRB Eq. (18).

Figure 12 :
Figure 12: Quantum noise in the training data limits the grid resolution. For ν = 10 shots at θ = 0.2π, we explore how close the machine-learned estimator can come to the true MLE as a function of the number of grid points d, for a single qubit. To account for variance in the randomly generated initial networks, we randomly generate 10 initial models and then train each on a single training set for each d (hollow points); the mean is the solid line. Beyond a certain d, a finer grid yields no additional advantage, and can even result in worse performance on average. The reason is quantum noise in the training data: each point has a width ∼ 1/√(Fm), which can exceed the grid resolution δθ = L/(d−1). This noise limit is indicated by the dashed vertical lines. For each d, the training set and model architecture are identical to those used in Fig. 11.

Figure 13 :
Figure 13: Non-classical N-qubit state. We compute the bias (top) and variance (bottom) for an N = 4 twin-Fock state (TFS), compared to the SQL and CRB (black dashed lines). We train 10 models and compare the mean (solid blue) to the MLE (dashed orange). Shaded gray regions represent the standard deviation of the models. Models are trained in the regime where the conditions Eq. (17), Eq. (19) and Eq. (20) are all met, namely m = 10^3 training measurements with d = 50 grid points on [0, π/2]. For training we use M_j = 10^3 feature vectors per grid point, and terminate training once the phase-averaged CRB is saturated, here roughly after 400 epochs with a mini-batch size of 128. The ANNs have N + 1 = 5 input neurons, and 2 hidden layers with 128 neurons each. For the plot, we fix θ = 0.2π and approximate expectation values by averaging over 10^4 randomly chosen measurement sequences µ at each m. The grid resolution δθ = (π/2)/(d − 1) is shown as a green vertical dashed line. (inset) Zoom-in of the bias, showing that at large m the fluctuations in the trained models are large relative to the root-variance ∆θ for the TFS, computed exactly (dashed red curves).

Table I :
Summary of frequently used notation. The calibration task considered here involves fairly small feature spaces relative to the size of the training set. It thus operates in a regime very different from the usual applications of machine learning, such as image recognition, where the size of the feature space is extremely large in comparison to the number of training examples: in that case the likelihood function P(µ|θ) is unknowable for all practical purposes, making maximum-likelihood or Bayesian inference impossible. To assist the reader, here we provide a table of commonly used notation.
BACKGROUND : REGRESSION AND PARAMETER ESTIMATION