I have just uploaded a paper to arXiv here:
Modelling the probability density of Markov sources
Abstract:
This paper introduces an objective function that seeks to minimise the average total number of bits required to encode the joint state of all of the layers of a Markov source. This type of encoder may be applied to the problem of optimising the bottom-up (recognition model) and top-down (generative model) connections in a multilayer neural network, and it unifies several previous results on the optimisation of multilayer neural networks.
I wrote this paper about a decade ago and submitted it to Neural Computation on 2 February 1998. In effect it was not accepted for publication, because the referees asked for changes that I thought (and still think) were unreasonable. Apart from minor reformatting, this arXiv version of the paper is identical to the version that I submitted to Neural Computation.
The paper contains material that is based on a preprint that I wrote at the Neural Networks and Machine Learning Scientific Programme at the Isaac Newton Institute in 1997. The Newton Institute preprint number is NI97039, and it is available online on this page of 1997 preprints, or can be accessed directly here.
The main idea in the paper is to optimise a multi-layer density-modelling network so that the joint probability of the state of all of its layers has desirable properties, rather than optimising it so that the marginal probability of the state of only its input layer has desirable properties. In effect, all of the layers of the network are put on an equal footing, rather than the input layer being singled out for different treatment. The approach to density modelling described in this paper turns out to unify various results that I had published in several of my earlier papers.
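To make the joint versus marginal distinction concrete, here is a minimal sketch for a toy two-layer network with discrete states. It is not taken from the paper: the names `P0` (source distribution over input states), `Q` (bottom-up recognition model) and `P` (top-down generative joint model), and all of the numbers, are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical toy setup: x is the input-layer state, y the hidden-layer state.
P0 = np.array([0.5, 0.5])             # source distribution over x
P = np.array([[0.4, 0.1],             # generative joint model P[x, y]
              [0.1, 0.4]])

def marginal_cost(P0, P):
    """Average bits to encode x alone, using the model marginal P(x) = sum_y P(x, y)."""
    Px = P.sum(axis=1)
    return float(-(P0 * np.log2(Px)).sum())

def joint_cost(P0, Q, P):
    """Average bits to encode (x, y) jointly, with x drawn from P0 and y from Q(y | x)."""
    source_joint = P0[:, None] * Q    # probability of each (x, y) pair actually occurring
    mask = source_joint > 0           # skip pairs that never occur (avoids 0 * log 0)
    return float(-(source_joint[mask] * np.log2(P[mask])).sum())

# Two candidate recognition models Q[x, y] = Q(y | x):
Q_hard = np.array([[1.0, 0.0],        # deterministic code: one hidden state per input
                   [0.0, 1.0]])
Q_soft = P / P.sum(axis=1, keepdims=True)   # spread-out code: model posterior P(y | x)

cost_marginal = marginal_cost(P0, P)
cost_joint_hard = joint_cost(P0, Q_hard, P)
cost_joint_soft = joint_cost(P0, Q_soft, P)
print(cost_marginal, cost_joint_hard, cost_joint_soft)
```

In this toy case the marginal cost does not depend on the recognition model at all, whereas the joint cost does, and it is lower for the deterministic code than for the spread-out one.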
The objective function for optimising the joint probability of the state of all the layers of a network is almost the same as the one that is used to optimise a Helmholtz machine, but it omits the so-called "bits-back" term that is used by Helmholtz machines, so it penalises the use of distributed codes whilst encouraging the use of sparse codes in the hidden layers of the network. At the Neural Networks and Machine Learning Scientific Programme in 1997, I was told by Geoffrey Hinton (who is one of the originators of the Helmholtz machine) that they had looked at what happened if they omitted the bits-back term, and they had concluded that it didn't lead anywhere interesting! I found this remark quite amusing because much of my useful research output had derived from omitting the bits-back term. Also, with the decade of additional insight that I have accumulated since 1997, I can now see that joint (rather than marginal) probability optimisation is the key to obtaining useful results in multi-layer density-modelling networks. Watch this space, as they say!
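As a rough guide to the relationship with the Helmholtz machine objective described above, the two costs can be written side by side for a two-layer network. The notation is assumed here for illustration and is not necessarily the paper's: $x$ is the input state, $y$ the hidden state, $Q(y|x)$ the recognition model and $P(x,y)$ the generative model.

```latex
\begin{align}
F_{\text{Helmholtz}}(x) &= \sum_y Q(y|x)\,\bigl[-\log P(x,y)\bigr]
  \;-\; \underbrace{\Bigl(-\sum_y Q(y|x)\,\log Q(y|x)\Bigr)}_{\text{bits-back term}} \\
D_{\text{joint}}(x) &= \sum_y Q(y|x)\,\bigl[-\log P(x,y)\bigr]
\end{align}
```

The two differ by the entropy of $Q(y|x)$, which is zero for a deterministic code and grows as the code becomes more spread out. Dropping that term removes the credit that a spread-out code would otherwise receive, which is consistent with the penalty on distributed codes mentioned above.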