I have just uploaded a paper to arXiv.

**Title:** Modelling the probability density of Markov sources

**Abstract:** This paper introduces an objective function that seeks to minimise the average total number of bits required to encode the joint state of all of the layers of a Markov source. This type of encoder may be applied to the problem of optimising the bottom-up (recognition model) and top-down (generative model) connections in a multilayer neural network, and it unifies several previous results on the optimisation of multilayer neural networks.

I wrote this paper about a decade ago and submitted it to Neural Computation on 2 February 1998, but in effect it was not accepted for publication, because the referees asked for changes that I thought (and still think) were unreasonable. Apart from minor reformatting, this arXiv version of the paper is *identical* to the version that I submitted to Neural Computation.

The paper contains material that is based on a preprint that I wrote at the Neural Networks and Machine Learning Scientific Programme at the Isaac Newton Institute in 1997. The Newton Institute preprint number is NI97039, and it is available online on the Institute's page of 1997 preprints.

The main idea in the paper is to optimise a multi-layer density-modelling network so that the *joint* probability of the state of *all* of its layers has desirable properties, rather than for the *marginal* probability of the state of *only* its input layer to have desirable properties. In effect, all of the layers of the network are put on an equal footing, rather than selecting the input layer to be treated differently from the rest of the layers. The approach to density modelling described in this paper turns out to unify various results that I had published in several of my earlier papers.
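In symbols (my notation here, not necessarily the paper's): writing $x$ for the input-layer state, $h$ for the hidden-layer states, and $Q(h \mid x)$ for the recognition distribution, the contrast is between optimising

$$
L_{\text{marginal}} = -\log \Pr(x) = -\log \sum_{h} \Pr(x, h)
\qquad\text{versus}\qquad
L_{\text{joint}} = \mathbb{E}_{Q(h \mid x)}\!\left[ -\log \Pr(x, h) \right],
$$

so that the code length charged to the hidden layers counts on the same footing as that charged to the input layer.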

The objective function for optimising the joint probability of the state of all the layers of a network is *almost* the same as the one that is used to optimise a Helmholtz machine, but it *omits* the so-called "bits-back" term that Helmholtz machines use, so it penalises the use of *distributed* codes whilst encouraging the use of *sparse* codes in the hidden layers of the network.

At the Neural Networks and Machine Learning Scientific Programme in 1997, I was told by Geoffrey Hinton (one of the originators of the Helmholtz machine) that they had looked at what happened if they omitted the bits-back term, and had concluded that it didn't lead anywhere interesting! I found this remark quite amusing, because much of my useful research output had derived from omitting exactly that term. Also, with the decade of additional insight that I have accumulated since 1997, I can now see that *joint* (rather than *marginal*) probability optimisation is the *key* to obtaining useful results in multi-layer density-modelling networks. Watch this space, as they say!
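As a toy numerical illustration (my own numbers and notation, not taken from the paper): for a single observed input and three candidate hidden codes that explain it equally well, the coding cost *without* bits-back is minimised by a deterministic (sparse) recognition distribution, while the full Helmholtz objective (with bits-back) is minimised by a distributed recognition distribution that matches the posterior:

```python
import numpy as np

# Toy model: one fixed observed input x, hidden code h in {0, 1, 2}.
p_h = np.array([0.5, 0.3, 0.2])          # generative prior P(h)
p_x_given_h = np.array([0.6, 0.6, 0.6])  # P(x|h): all codes equally good here

def joint_bits(q):
    # Expected bits to encode the joint state (h, x): no bits-back term.
    return -np.sum(q * np.log2(p_h * p_x_given_h))

def helmholtz_bits(q):
    # Helmholtz objective: joint bits minus bits-back (entropy of Q).
    q_safe = np.where(q > 0, q, 1.0)     # avoid log2(0); 0*log2(0) -> 0
    return joint_bits(q) + np.sum(q * np.log2(q_safe))

q_peaked = np.array([1.0, 0.0, 0.0])     # deterministic (sparse) code
q_spread = np.array([0.5, 0.3, 0.2])     # distributed code matching the prior

# Without bits-back, the peaked code costs fewer bits;
# with bits-back, the spread (posterior-matching) code wins.
print(joint_bits(q_peaked), joint_bits(q_spread))
print(helmholtz_bits(q_peaked), helmholtz_bits(q_spread))
```

Because the likelihood is flat over the three codes, the posterior equals the prior, so `q_spread` drives the Helmholtz objective down to its minimum of exactly `-log2(0.6)` bits, whereas the no-bits-back objective pushes the recognition distribution towards a single code.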