I have just uploaded a paper to arXiv here.
Title: Modelling the probability density of Markov sources
Abstract: This paper introduces an objective function that seeks to minimise the average total number of bits required to encode the joint state of all of the layers of a Markov source. This type of encoder may be applied to the problem of optimising the bottom-up (recognition model) and top-down (generative model) connections in a multilayer neural network, and it unifies several previous results on the optimisation of multilayer neural networks.
I wrote this paper about a decade ago and submitted it to Neural Computation on 2 February 1998. In effect it was not accepted for publication, because the referees asked for changes that I thought (and still think) were unreasonable. Apart from minor reformatting changes, this arXiv version of the paper is identical to the version that I submitted to Neural Computation.
The paper contains material that is based on a preprint that I wrote at the Neural Networks and Machine Learning Scientific Programme at the Isaac Newton Institute in 1997. The Newton Institute preprint number is NI97039, and it is available online on this page of 1997 preprints, or can be accessed directly here.
The main idea in the paper is to optimise a multi-layer density-modelling network so that the joint probability of the state of all of its layers has desirable properties, rather than so that the marginal probability of the state of only its input layer has desirable properties. In effect, all of the layers of the network are put on an equal footing, rather than the input layer being singled out for different treatment from the rest of the layers. The approach to density modelling described in this paper turns out to unify various results that I had published in several of my earlier papers.
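As a rough illustration of the distinction, consider a two-layer network with input state x and hidden state y. The notation below is an assumption made for this sketch and is not taken from the paper: P_0(x) is the source distribution over inputs, Q(y|x) is the bottom-up recognition model, and P(x|y)P(y) is the top-down generative model. Under those assumptions the two kinds of objective could be written as:

```latex
% Marginal objective: average bits to encode the input layer only
D_{\mathrm{marginal}} = -\sum_{x} P_0(x) \, \log_2 \sum_{y} P(x \mid y) \, P(y)

% Joint objective: average bits to encode the state of both layers together
D_{\mathrm{joint}} = -\sum_{x} P_0(x) \sum_{y} Q(y \mid x) \, \log_2 \bigl[ P(x \mid y) \, P(y) \bigr]
```

In the first expression the hidden layer is summed out before the cost is evaluated, whereas in the second the hidden state is costed on the same footing as the input state.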
The objective function for optimising the joint probability of the state of all the layers of a network is almost the same as the one that is used to optimise a Helmholtz machine, but it omits the so-called "bits-back" term that is used by Helmholtz machines, so it penalises the use of distributed codes whilst encouraging the use of sparse codes in the hidden layers of the network. At the Neural Networks and Machine Learning Scientific Programme in 1997, I was told by Geoffrey Hinton (who is one of the originators of the Helmholtz machine) that they had looked at what happened if they omitted the bits-back term, and they had concluded that it didn't lead anywhere interesting! I found this remark quite amusing because much of my useful research output had derived from omitting the bits-back term. Also, with the decade of additional insight that I have accumulated since 1997, I can now see that joint (rather than marginal) probability optimisation is the key to obtaining useful results in multi-layer density-modelling networks. Watch this space, as they say!
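The effect of dropping the bits-back term can be sketched numerically for a single input and a hidden layer with four states. This is a minimal illustration under assumptions that are not from the paper: the function names, the toy joint probabilities, and the reading of the bits-back term as the entropy of the recognition distribution (which is how it is usually described for Helmholtz machines) are all choices made for this example.

```python
import math

def joint_code_length(q, log2_p_joint):
    """Average bits to encode (x, y) when y is drawn from q(y|x)."""
    return -sum(qy * lp for qy, lp in zip(q, log2_p_joint) if qy > 0)

def entropy_bits(q):
    """Entropy of q(y|x) in bits; assumed here to play the role of the bits-back term."""
    return -sum(qy * math.log2(qy) for qy in q if qy > 0)

def helmholtz_free_energy(q, log2_p_joint):
    """Joint code length minus the bits-back (entropy) refund."""
    return joint_code_length(q, log2_p_joint) - entropy_bits(q)

# Hypothetical generative joint p(x, y) for one fixed input x and four hidden states.
p_joint = [1 / 8, 1 / 16, 1 / 16, 1 / 16]
log2_p_joint = [math.log2(p) for p in p_joint]

# A sparse code: all recognition probability on the single best hidden state.
sparse_q = [1.0, 0.0, 0.0, 0.0]

# A distributed code: the exact posterior q(y|x) proportional to p(x, y).
p_x = sum(p_joint)
distributed_q = [p / p_x for p in p_joint]

for name, q in [("sparse", sparse_q), ("distributed", distributed_q)]:
    print(name,
          "joint bits:", round(joint_code_length(q, log2_p_joint), 4),
          "free energy:", round(helmholtz_free_energy(q, log2_p_joint), 4))
```

In this toy case the joint code length is linear in q, so it is smallest when q puts all of its weight on one hidden state, whereas the free energy is smallest at the spread-out posterior. That is one way to see why omitting the bits-back term favours sparse codes over distributed ones.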