\documentclass{article}
\begin{document}

\newcommand{\beqn}{\begin{equation}}
\newcommand{\eeqn}{\end{equation}}
\newcommand{\bqray}{\begin{eqnarray}}
\newcommand{\eqray}{\end{eqnarray}}

\newcommand{\kS}{{\bf \Sigma}}
\newcommand{\kC}{{\bf C}}
\newcommand{\kA}{{\bf \Phi}}
\newcommand{\kHS}{{\bf \cal S}}

\newcommand{\kSt}{{\bf \tilde \Sigma}}
\newcommand{\kStt}{{\bf \tilde {\tilde \Sigma}}}
\newcommand{\kH}{{\cal H}}
\newcommand{\kK}{{\cal K}}
\newcommand{\kHd}{{\Bbb H}^{d}}
\newcommand{\kRd}{{\Bbb R}^{d}}
\newcommand{\R}{{\Bbb R}}
\newcommand{\ev}{{\cal E}}
\newcommand{\proj}{{\cal \pi}}
\newcommand{\ring}{{\bf \cal R}}

\newcommand{\kyh}{{\bf \hat y}}
\newcommand{\kyt}{{\bf \tilde y}}
\newcommand{\kxh}{{ \hat x}}
\newcommand{\kxhh}{\bf \hat {\hat x}}
\newcommand{\kxt}{{\bf \tilde x}}
\newcommand{\kxtt}{\bf \tilde {\tilde x}}
\newcommand{\ky}{{\bf y}}
\newcommand{\kx}{{\bf x}}
\newcommand{\kv}{{\bf v}}
\newcommand{\kw}{{\bf w}}

\newcommand{\yonen}{{ \{ \ky_1,\dots,\ky_n \} } }
\newcommand{\yonenm}{{ \{ \ky_1,\dots,\ky_{n-1} \} } }
\newcommand{\ytonen}{{ \{ \kyt_1,\dots, \kyt_n \} } }
\newcommand{\ytonenm}{{ \{\kyt_1,\dots, \kyt_{n-1} \} } }
\newcommand{\lspan}{{\cal L}}

\newcommand{\xn}{\kx_n}
\newcommand{\yn}{\ky_n}
\newcommand{\wn}{\kw_n}
\newcommand{\vn}{\kv_n}

\newcommand{\ym}{{\bf y _{m}}}

\newcommand{\ytnm}{\snm {\tilde y}}
\newcommand{\xnm}{\snm x}
\newcommand{\wnm}{\snm w}
\newcommand{\rem}{\it {\em Remark: }}

\newcommand{\sh}{{\nonumber}}

\newtheorem{defn}{\bf Definition}
\newtheorem{theo}{\bf Theorem}
\newtheorem{lemm}{\bf Lemma}%[section]
\newtheorem{prop}[lemm]{\bf Proposition}
\newtheorem{coro}[theo]{\bf Corollary}
\newtheorem{conj}{\sc Conjecture}
\newtheorem{hypo}[conj]{\sc Hypothesis}
\newtheorem{gess}[conj]{\sc Guess}

\newcommand{\QED}{\nolinebreak\hfill\hfill\qed}



\title { A Projections Approach to Kalman Filtering}

\author{Scot Free Kennedy}
\maketitle
\abstract {
We will be filtering linear, discrete time control systems disturbed
by Gaussian noise. The systems will be represented as difference equations
involving random variables, and their output as a Gauss-Markov sequence.
``Gauss'' because we will generally assume the noise to be normally distributed,
although the assumption is not necessary and we will discuss its absence below.
``Markov'' because the systems are described by ``one-step'' difference
equations perturbed by white noise. Note that we assume some {\em a priori} knowledge
of the system: the transition function, the statistical qualities of the involved
random variables, etc.
   We will build our model in the Hilbert Space of second order Random Variables
defined over the state space of our control system. We will characterize our task as the
attempt to find the projection of the state onto the subspace generated by the 
observations. In brief, it's all about recursive decomposition.
}





({\em {\em Disclaimer:} all integrals are to be assumed Lebesgue, and
all inverses, generalized})

\bigskip

\pagebreak
\tableofcontents
\pagebreak
\section{Introduction}
If we are to study filtering in stochastic control systems, a natural
first question is, ``What are Stochastic Control Systems?''

A Control System is, in barest generality, a mathematical model
in which the observer of the system has some input back into the behavior of the system. The
Dynamics of the system are then determined by the interactions between
the observer's input, called the {\em control}, and the inherent behavior of
the system, labelled somewhat obliquely the {\em transfer function}. The observer may also
 not have direct access to the state
of the system itself, but only to some image or echo thereof. That is: In control theory,
we often have access not directly to the variables we wish to model or control, but to some
intermediary variables which are related thereto. These intermediaries, known as the
{\em measurements} or {\em observations}, allow us to analyze systems with more state
variables than observation variables.   

This 
incestuous little epistemological knot is made still more tortuous
when we consider the possibility that not only our measurement of
the system, but our knowledge of the system itself, is imperfect.
We can model these imperfections in measurement and model structure by
introducing random variables to our
mathematical description of the system.

A natural second question is of course, ``What is Filtering?''

To filter is to refine.

What we're trying to refine in this context is the ``quality'' of our
information about the current state of the system. That is, we'll be given a time-ordered,
discrete
sequence of observations of the system, which are known to be in error. Our job will be to build
another sequence which gives a better idea what the system is actually doing, gives a
``more accurate'' description of the system's behavior. Exactly
how this is done, and how we define ``better'' will have to wait for the moment, but it is
important to understand the goal which motivates the structures we will be developing in the
meantime. This attempt to sift out the imperfections in our perceptions, to see through the
unavoidable errors in any attempt to describe or understand, through to some clean truth lying
beneath the confusing surface; this will be the break in the stark horizon, stained glass glowing
in evening gloom, toward which we point our steps. I hope the landscape we traverse en route may
have its own rewards as well.



As an example to carry throughout this paper, we may hypothesize that we are some futuristic
nanotechnology behavioral engineer, studying the migratory patterns
of micro machines which graze oil spills. Given the thin, laminar
nature of an oil slick on water, we may model the motion of a molecular machine inside it
as a trajectory
in two-dimensional Euclidean space, $\R^2$. In this context, we might
define a Stochastic Control System where the transfer function
represents some linear flux in the environment (currents due to thermal
gradients, waves, whatever); the control will model the behavior of the
nano-beastie (note that this may be quite sophisticated and non-linear since we assume it known);
``Plant Noise'' can be considered effects of unknown fluid dynamics; and
Observation Noise can be the model of refraction involving the different
optical qualities of water, air, and oil in combination with 
sunspots and UFOs.

   We may extrapolate further upon our little science-fiction plot.

There were always concerns that the self-reproducing machines might, when released into the environment,
spread beyond their intended domain and impact the surrounding ecostructures. Proponents of the scheme argued
that the machines, designed to survive only in their intended niche, would be unable to spread, and that the
benefits, in terms of oil-spill damage prevention, far outweighed the risks. While the verdict on that balance has
yet to be known, escape the nano-beasties surely did. This presents a new challenge for our protagonists the
intrepid Stochastic Control Theorists. Their entire observation and tracking system was set up for a two-dimensional
state space. Now that the meta-nisms have escaped into the three-dimensional undersea world, are they beat?
Never! With a speed born of desperation, the resourceful Graduate Student intern replaces all the matrix algebra
routines in the system with newer algorithms which use pseudo-inverses, and so can cope with a situation such as
this where the observations are of lesser dimension than the state. With a sigh of relief, they all update their
latex files, and settle in to see what further crises the day may bring.

\section{Spaces}

As is so often the case in mathematics (and elsewhere) the real key to finding our answer
lies in elegantly describing the problem. As anyone who has used polar coordinates in a problem
involving circular motion knows, the right choice of coordinate systems can make all the
difference. So our first task will be the laying of mathematical foundations; we'll define
the spaces and tools which make up the context we'll work in.


\subsection{Hilbert Spaces}

I'll assume the reader has a basic knowledge of the theory of linear spaces. Elementary
concepts such as vector, basis, and span will be used without definition, unless a new
meaning or interpretation is being invoked.

{\defn A {\em Hilbert Space} is a complete inner product space.}

In other words, a Hilbert Space is a Vector Space, with an inner product, which is complete
under the distance function induced by that inner product. That is: the inner product $(x,y)$
induces a norm, $||x||=\sqrt{(x,x)}$, which in turn induces a distance function, $d(x,y)=||x-y||$.
Any sequence which is Cauchy under this distance function must converge to an element of our
space.

This is essentially the usual vector space construction used in most descriptive mathematical
endeavor. It provides a system of coordinates to describe the condition or ``State'' of the
system. Accordingly, it will be termed the ``State Space'', the set of all possible values
 of the variables
comprising our model. In general, it will be the Euclidean Space $\kRd$, with the standard
inner product (ie, the dot product...).

In our first Example, the state space is two-dimensional; the position (the system's state) is
restricted to the thin plane of oil lying at the interface of sea and sky. Later, the state space
is extended to three dimensions, but the observations remain two dimensional.


\subsubsection{State Space Decomposition}

Of great importance in all that follows are the fundamental concepts
of orthogonality and orthogonal projection. We briefly characterize
them here, in the familiar Euclidean case, in order to more fully
appreciate the brain twisting beauty of their natural extension into
infinite dimensional probability spaces a little later.

{\defn Two vectors (in the same vector space) are {\em orthogonal} if
 their inner product is zero. This is written $x \perp y$}

In a Euclidean space, this simply means the (smallest) angle between
them is a right angle. We will also want to talk about orthogonality
relations between vectors and entire spaces. Following the above definition naturally,

{\defn Given a vector $x$ and a closed subspace $S$, both in some larger Hilbert Space,
we say {\em $x$ is orthogonal to $S$}, written $x \perp S$, if \[ x \perp y,\hspace{10 pt} \forall y \in S\].}

Thus, as a quick example, in $\R^3$, the x-axis is orthogonal to the yz-plane, but not
to the plane $x=y$.


{\theo Given a Hilbert Space ${\kH}$, a closed subspace $\bf S \subset \kH$,
and a vector $x \in \kH$, we may uniquely decompose $x$ into orthogonal components
 in $\bf S$ and its orthogonal
complement. That is,

\beqn
x=w + v
\eeqn
where $w \in \bf S$ and $v \in \bf S ^{\perp}$ are uniquely determined by $\bf S$,
 and are orthogonal. }

Equivalently, we might say
\beqn
\kH = \bf S \oplus \bf S ^{\perp}
\eeqn



{\defn By $\lspan \{ x_1, x_2,\dots,x_n \}$ we will mean the Linear Span of
 $\{ x_1, x_2,\dots,x_n \}$.}


The ability to decompose any vector into orthogonal components leads
naturally to the definition of a special linear transformation: the
orthogonal projection onto a subspace. By the above theorem, we know
that given any vector $x$ and subspace $S$ there exists a unique 
element, say $y$, in $S$ such that $x-y$ is orthogonal to $S$. We call
it the orthogonal projection of $x$ on $S$, written $\proj (x|S)$.

This should seem familiar. We perform these decompositions under various rubrics and
aliases all the time. In elementary mechanics, we decompose a particle's momentum and position
into independent components along the appropriate axes. Taking the real part of a complex
valued function would seem to be no more than the projection of the plane onto the line.


Thomas Pynchon supplies us with a further example of this in his book ``V.''. If we imagine that
we are looking at the rotation of the Earth about the Sun from a point in the plane
of the ecliptic, from far enough away that depth information is lost, we would observe
the rotation as motion in a one dimensional space. We have projected the two-dimensional system of
the Earth's rotation \footnote{which of course actually lives in three-dimensional space but let's not
get carried away...} onto a one-dimensional subspace - the line crossing through the center of the sun
perpendicular to the line of our observation. Note that when we talk about these projections, we're generally
talking about the relationship with the observer - we're not changing the system at all, just how we're looking at it.

The following facts about projections will be used quite often:
\begin{enumerate}
\item Linearity \beqn\proj (x+y | S)=\proj (x|S) + \proj (y|S) \eeqn
\item Minimizes Induced Norm of ``Error''.
\beqn ||x-\proj(x|S)|| \leq ||x-s|| \eeqn $ \forall s \in S$, where the norm is induced by the
inner product, ie, $||x|| := \sqrt{(x,x)}$.
\item Orthogonal ``Error''.
\beqn (x-\proj (x|S)) \perp s \eeqn $\forall s \in S$.
\item Inner Product Equality
\beqn (x,s)=(\proj (x|S),s) \eeqn $\forall s \in S$.
\end{enumerate}
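
To make these properties concrete, here is a small worked check in $\R^2$ with the dot product. The particular
vector and subspace are our own illustrative choices. Take $x=(3,4)$ and let $S$ be the x-axis,
$S=\lspan \{ (1,0) \}$, so that a general element of $S$ is $s=(a,0)$. Then
\[ \proj (x|S)=(3,0), \hspace{10 pt} x-\proj (x|S)=(0,4). \]
The error is orthogonal to every $s \in S$, since $((0,4),(a,0))=0$; it has minimal norm, since
\[ ||x-s||^2=(3-a)^2+16 \geq 16=||x-\proj (x|S)||^2; \]
and the inner products agree, since $(x,s)=3a=(\proj (x|S),s)$.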

These results find direct application in so-called Gram-Schmidt
Orthogonalization. Often, the space generated by a set of vectors will 
be of greater interest to us than the vectors themselves. This being
the case, we may seek out another set of vectors generating the same
space which is easier to work with. Generally, we will wish to reduce
a given basis down to an orthogonal basis with identical span. We do 
this as follows, given $\yonen$:
\bqray
\kyt_1 &:=&\ky_1\\
\kyt_n &:=& \ky_n - \proj (\ky_n | \lspan \ytonenm)
\eqray
By simple application of the above properties of projections, we see that this definition 
will indeed yield the fact that
\[ \kyt_n \perp \lspan \ytonenm \]
as desired, so that the set is pairwise orthogonal.
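
As a small worked instance (the vectors are chosen purely for illustration), take $\ky_1=(1,1)$ and
$\ky_2=(1,0)$ in $\R^2$. Using the standard formula for projection onto the span of a single nonzero
vector, $\proj (y | \lspan \{ z \}) = \frac{(y,z)}{(z,z)} z$, we get
\begin{eqnarray*}
\kyt_1 &=& (1,1)\\
\proj (\ky_2 | \lspan \{ \kyt_1 \}) &=& \frac{1}{2}(1,1)\\
\kyt_2 &=& (1,0)-\left( \frac{1}{2},\frac{1}{2} \right) = \left( \frac{1}{2},-\frac{1}{2} \right)
\end{eqnarray*}
and indeed $(\kyt_1,\kyt_2)=\frac{1}{2}-\frac{1}{2}=0$.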



\bigskip


\subsection{Spaces of Random Variables}


After picking an appropriate state-space to work in, we'll need to construct 
another vector space ``above'' it. This will be the space of all
random variables taking on values in our state space. We will
denote this space $\kHd$ and prove it a Hilbert Space. In other words, we're building the
space which contains the probabilistic elements we're constructing our model system with. In
this space we will be able to describe the statistical, causal relationships between random
variables in terms of geometry, which is really neat.

\subsubsection{Random Variables}

There seem to be two general approaches to the development of Random Variables, enforcing
a choice between ``feast'' or ``famine'' of rigor.
The first is to simply
say a random variable is a variable which takes on values according to some probability
distribution, and leave it at that. The second is to delve into a subtle realm of measure
 and probability theory, rich and involving fields unto themselves. We will attempt to steer a middle course
between this Scylla and Charybdis, using what we need of the indubitably elegant foundation
laid by Kolmogorov and others without sinking too far into chthonian abstraction.

Let us begin with a state space. This is a finite-dimensional Hilbert Space, generally $\kRd$.
We will assume the concept of a Random Variable, with finite variance, taking on values in this
space (see Appendix). Consider further the infinitude of possible such Random Variables (RV's). It
is the relationships between {\em these} that determine the stochastic behavior of our system, and
it is on the set of all such RV's that we shall build our edifice. For this set may be shown
to be a Hilbert Space itself.
 This will be of great
interest to us, because it allows us to treat probabilistic relationships as geometric ones.
This is of great mathematical, computational, and aesthetic import.

We may develop this vector space informally by merely noting that the operations of vector addition and scalar
multiplication defined on the original state space naturally induce definitions for the space of Random
Variables. Linear combinations of RV's may be evaluated in the obvious way, yielding still more elements of $\kHd$,
our space of Random variables.
The closure of the original set under these operations ensures the closure of the 
space of random variables, and similarly with the rest of the Vector Space Axioms.

To proceed slightly more formally, we begin by developing the one dimensional case. Consider the set
of all finite variance real-valued random variables. We may associate this with the set of all
square-integrable real-valued functions, $L_2$. We will consider $\kHd$ to be simply the 
Cartesian Product of this infinite
dimensional space of scalar random variables with itself $d$ times.

It is the similarity between this construction and that of $\kRd$ which provides its
utility and intuitive advantages, and it is the differences which make for a rich
landscape of mathematical structures and challenges.



 

\subsubsection{Expectations and Products}
Like any other vector space, $\kHd$ has a dual space of functionals (ie, mappings from a vector
space to its set of scalars) \footnote{actually, what is going on here? The vector space
$\kHd$ is defined over isn't a field, so can we really call its elements {\em scalars}? If not,
what are the scalars in $\kHd$?}. In fact, since it is a Hilbert Space, it
is self-dual (by the Riesz representation theorem). One very important functional is the expectation. This can be thought of as the
center of mass (or {\em moment}) of a function on the state space. That is,
\[ \ev (f(x)):=\int f(x) dP(x) \]
where $P$ is the probability measure (distribution) used in the definition of $x$ as a Random Variable. When $P$ has a
density $p$, called the Probability Density Function, we have $dP(x)=p(x)\,dx$. We write $\mu_x := \ev (x)$ for the mean.

We can define an inner product in $\kHd$ by taking the expected value of the standard inner
product in $\kRd$,
\[ (X,Y):=\ev \left( (X-\mu_x)^T (Y-\mu_y) \right) \]
 Note that this is called the {\em Covariance} by statisticians, and that the Norm in $\kHd$ induced by
 this inner product is simply the {\em standard
deviation}.
\[ ||x||:=\sqrt{(x,x)}=\sqrt{\ev \left( (x-\mu_x)^T (x-\mu_x) \right)} \]
We will generally just work with the variance, $||x||^2$. Also of great interest is the induced distance
function,

\[ d(x,y) := ||x-y|| \]
though again, we generally encounter $d^2(x,y)$ in this context. This is because, if we consider $x$ to
be in some sense an estimate of $y$, then a natural measure of error is the distance function. We
use its square for convenience, since it works just as well. That is, minimizing $d^2(x,y)$\footnote{
That is, the component-wise sum of the squared error.} (called the ``Sum Of
Squares'' error) minimizes $d(x,y)$ as well.
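
It may help to expand the squared distance using bilinearity and symmetry of the inner product; this is a
routine identity, written out here only for orientation:
\[ d^2(x,y)=(x-y,x-y)=(x,x)-2(x,y)+(y,y)=||x||^2-2(x,y)+||y||^2 \]
In statistical terms, the squared error of $x$ as an estimate of $y$ is the sum of the two variances less
twice the covariance. For scalar random variables with (purely illustrative) variances $||x||^2=4$,
$||y||^2=5$ and covariance $(x,y)=4$, this gives $d^2(x,y)=4-8+5=1$.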



Another construction of great utility here is the Outer Product, which is a function of two
random variables yielding their covariance matrix. This is defined as
\[ [x,y]:=\ev \left( (x-\mu_x) (y-\mu_y)^T \right) \]
Note that this is simply the covariance matrix of $x$ with $y$,
though we often also write, for notational convenience,

\[ [x]:=[x,x]=\ev \left( (x-\mu_x) (x-\mu_x)^T \right) \]
which is, of course, the variance matrix as it is defined in standard statistics texts \footnote{
Actually these should perhaps be called correlation matrices, translated as they are to the origin.
But they are called covariance matrices almost exclusively, and they are not really correlations,
not being scaled to yield correlation coefficients in the unit interval}.

It is worth making explicit some properties of this operation which follow directly from
the definition. Given $X,Y,Z \in \kHd$ and matrices $A,B$ of appropriate dimension to pre-multiply
$X$ and $Y$, respectively:

\bqray
\sh [ X+Y,Z ] &=& [X,Z] + [Y,Z]\\
\sh [ A X,B Y ] &=& A [X,Y] B^T\\
X,Y \mbox{ independent} &\Rightarrow& [X,Y]=0
\eqray
 These properties \footnote{What should these be called? They seem very much like bilinearity, but
I'm not certain that term can be used here, where everything is matrix algebra...} will be used again
and again in what follows.
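
For instance, the second property can be checked directly from the definition, using only linearity of the
expectation (so that $\ev (AX)=A\mu_x$ and $\ev (BY)=B\mu_y$). A sketch:
\begin{eqnarray*}
{[AX,BY]} &=& \ev \left( (AX-A\mu_x)(BY-B\mu_y)^T \right)\\
&=& \ev \left( A(X-\mu_x)(Y-\mu_y)^T B^T \right)\\
&=& A \, \ev \left( (X-\mu_x)(Y-\mu_y)^T \right) B^T\\
&=& A[X,Y]B^T
\end{eqnarray*}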



In fact, it is this outer product we will use here to define orthogonality.
{\defn We call two vectors $x,y \in \kHd$ {\em orthogonal} if
$[x,y]=0$}

Why are we revising our definition? Because it covers more general situations, and frees us from
worrying unnecessarily about dimensionality. We really aren't changing anything, just generalizing.
In the case where the two vectors correspond to vectors of equal dimension in the state space,
we note that the Trace of the Outer Product equals the inner product,
 \[ (x,y)=\mbox{trace}([x,y]) \]
 so that if the Outer Product is the Zero Matrix, the Inner Product is the Zero Scalar.
 Orthogonality in this new sense therefore implies orthogonality in the usual sense
 of the inner product.
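
The trace identity holds because $a^T b=\mbox{trace}(a b^T)$ for any two column vectors of equal dimension,
and the trace commutes with the expectation. Note, however, that the implication runs only one way. As an
illustrative example of our own, let $u$ be a zero-mean scalar random variable with unit variance and set
$x=(u,0)^T$, $y=(0,u)^T$. Then
\[ [x,y]=\left( \begin{array}{cc} 0 & 1\\ 0 & 0 \end{array} \right), \hspace{10 pt} (x,y)=\mbox{trace}([x,y])=0, \]
so $x$ and $y$ are orthogonal in the inner product sense but not in the outer product sense: the first
component of $x$ is perfectly correlated with the second component of $y$.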

If we cannot directly calculate the inner product \footnote{Though I really think we can...The
situation is really nothing new. It's as though we had a vector in $x \in \R^2$, and needed to
take its inner product with a vector $ y \in \R^3$ not lying in the same plane. Though we
can't calculate $(x,y)$ directly, $x$ of course has a representation in $\R^3$ which allows us
to do so.} due to seemingly conflicting dimensionality, this more general definition still
makes sense. Consider our original construction
of $\kHd$ as the Cartesian Product of $L_2 (\Omega, \R)$, or $\kH$. The $\sigma_{ij}$ element
of $[x,y]$ gives the (scalar) covariance between the $i^{th}$ component of $x$ and the $j^{th}$
of $y$. So if $[x,y] = 0$ it means that every component is probabilistically orthogonal to each
component of the other vector. That is, looking at everything as happening in $L_2$, with one-dimensional
random variables, the vectors are component-wise orthogonal.
Note that independence implies orthogonality. If both RV's are Gaussian, the converse can be shown
to be true as well.
\bigskip
 
{\bf exercise} Characterize the unit hypersphere in $\kHd$, $\kHS$.

\[ \kHS = \{ x \in \kHd : ||x||=1 \} \]
That is, the set of Random Variables with total variance one, i.e., $\mbox{trace}([x])=1$.

Of slightly more interest is a Sphere around a specific point in $\kHd$,
\begin{eqnarray}
\sh \kHS(x,r) &=&\{ y \in \kHd : d(x,y)=r\}\\
\sh &=&\{ y \in \kHd : (x-y,x-y)=r^2 \}
\end{eqnarray}

So we have some (infinite dimensional) manifold in this wild space of random variables. What dynamics
can we imagine sliding across the surface of this bubble of causality, shimmering in the piercing light
of reason? What atlas maps its surface?

{\theo
The orthogonal projection of $x \in \kHd$ on $y \in \kHd$ (assuming, for simplicity, both are
zero-mean) is the random variable
\bqray
\proj (x | y) &=& [x,y][y,y]^{-1} y
\eqray}
{\em proof} 
By the uniqueness of the orthogonal projection on a subspace, we need only demonstrate an
element of $\lspan \{ y \}$ such that
\[ [x-\proj(x|y),y]=0 \]
which requires that
\[ [x,y]=[\proj(x|y),y]\]
a simple case of the celebrated {\em Wiener-Hopf} equation. Now, since $\proj (x|y) \in \lspan \{ y\}$, we can
express it as $Ay$ where $A$ is some $\dim(x) \times \dim(y)$ matrix.
\bqray \sh [x,y]&=&[Ay,y]\\
	&=&A[y,y]
\eqray
So we can solve for $A$, possibly using pseudo-inverses, which yields:
\[A=[x,y][y,y]^{-1}\]
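
As a sanity check, consider a scalar example of our own devising: let $x$ and $v$ be independent, zero-mean,
scalar random variables with variances $[x]=4$ and $[v]=1$, and let $y=x+v$. Then
\begin{eqnarray*}
{[x,y]} &=& [x,x]+[x,v]=4\\
{[y,y]} &=& [x,x]+2[x,v]+[v,v]=5\\
\proj (x|y) &=& \frac{4}{5} y
\end{eqnarray*}
and the error is indeed orthogonal to $y$, since $[x-\frac{4}{5}y,y]=4-\frac{4}{5} \cdot 5=0$. Note how the
projection shrinks the noisy observation toward the mean, by a factor that depends on the relative sizes of
the signal and noise variances.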

 



\section{Filtering Stochastic Control Systems}
\subsection{System Representation}
\begin{quotation}
{\em ``You Have No Control...''}

 -Bad Religion song lyric
\end{quotation}

We'll represent our system as the following Gauss-Markov sequence.

\begin{eqnarray}
\kx_n &=& \Phi_{n-1}  \kx_{n-1} +\kw_{n-1} \\
\ky_n &=& \kC \kx_n + \kv_n
\end{eqnarray}
where $\kx_n$ and $\ky_n$ are, respectively, $d_{x} \times 1$ and $d_{y} \times 1$
vectors and $\kw_n$ and $\kv_n$ are $d_{x}$ and $d_{y}$ dimensional Gaussian
White Noise sequences. We may trade a great deal of hassle for a little bit of
generality by assuming them to be mutually independent random variables, so that $[\kw,\kv]=0$. $\Phi$
is our Transfer Function and $\kC$ is our measurement function, and both are assumed to be known.
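
To fix ideas, here is one hypothetical instance of this system; the particular matrices are illustrative
assumptions only, not part of the development. Consider a particle moving along a line, with state
(position, velocity), a unit time step, and only the position observed, so that $d_x=2$ and $d_y=1$:
\[ \Phi_{n-1}=\left( \begin{array}{cc} 1&1\\ 0&1 \end{array} \right), \hspace{10 pt} \kC=\left( \begin{array}{cc} 1&0 \end{array} \right) \]
Here $\kw_n$ is a two-dimensional and $\kv_n$ a scalar white noise sequence. Although the velocity is never
measured directly, it leaves its trace in successive positions, which is what will allow us to estimate it.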

 Note that this is
a single sequence in ${\kHd}$ corresponding to an (uncountably) infinite
number of sample sequences in $\kRd$. That is, each variable at each time step is a single vector in $\kHd$,
but may represent any number of possible values in the Sample (State) Space. Note also that we have no control
in this model. That
is, there is no control function, only a transfer function, a measurement function, and some noise
terms. This is because the control has no probabilistic importance. It is entirely deterministic,
so we can simply subtract out its effects to make the development simpler. Of course, in any real
world application, we'd need it in our system, and it is easily incorporated. For now, we are only
concerned with the stochastic behavior of the system, 
and the control only affects the mean.

For some examples of the type of system and behavior we're discussing, please refer to the ``Implementations
and Examples'' section below.

It is important to make explicit the following simple property of the above general system: the dimension
of $\ky_n$ need not equal that of $\kx_n$. If it is greater, of course, this is trivial. If smaller, however,
this is one of the most profound advantages and fascinating aspects of the Kalman Filter \footnote{Also, there's lots
of terrible puns to be made: we must emphasize the importance of remaining Kalman Cool. Be Kalman. Kalman get it. Sorry.}.
This is what sets the Kalman Filter apart from existing filtering techniques. Under suitable conditions
(think observability from basic linear control theory...) we may estimate and control many more
state variables than we have observations.


We can imagine some Dark Cabal of Control Theorists in Pynchon's ``Gravity's Rainbow'' designing the circuitry which will
be the Nervous System, the very Pavlovian Complex of Responses, for their mythic V2 rocket. How can they deal with the
fact that their lovely Rocket has over 29 state Variables : Pitch, Yaw, Velocity, Temperature, Weight, a
veritable litany of Relevant Factors. But they must, back on Earth, be content with the two variables they are passed
back along some static-churned tenuous link of electro-magnetic spectra. How can they, given only, say, Yaw and Velocity,
know when Brennschluss has been reached?

That is, given only a noisy measurement of two quantities, the complex, dynamic state of the Rocket must
be estimated to exacting precision, so that the Engine may be turned off at that Threshold, Aeolian Moment at
the Apex of Arc in order to turn Ascent to Descent, to bring the (from that moment on) mute, dumb Rocket
crashing in Super-Sonic Fire into huddled, blacked-out Wartime London.  Fortunately for England, this all
took place before the advent of Kalman Filtering, and the Germans were forced to use sub-optimal methods. Had they
had access to this Grail of Stochastic Control Systems, we might all be learning German in school.

Well, Hyperbole is all well and good, but it cannot be emphasized enough how powerful this property of the filter
truly is. This is what makes it truly a step forward, and has made its influence felt far beyond the 
confines of Linear Quadratic Gaussian Control theory. 

\subsection{Filtering Problem as Projection in $\kHd$}
We may characterize the filtering problem as the attempt to calculate the projection (in
the space of R.V.'s) of the state onto the subspace generated by the observations. That is,
we will be looking for an estimate $\kxhh_n$ of the state $\kx_n$ which is a linear
function of the observations up to (and including) time $n$. The space of all such functions
is $\lspan \yonen = \lspan \ytonen$. Why is the projection the random variable we seek?
There are two possible ways to view this:

\subsubsection{$\kw_n$ and $\kv_n$ are Gaussian}
In this case, we can show that
\[\proj (\kx_n | \lspan \yonen ) = \ev \{\kx_n | \yonen\} \]
That is, the projection is actually the expected value of $\kx_n$ conditioned on
the observations up to time $n$. Since this is the orthogonal projection, by definition
of our induced norm, this is in fact the ``Least Squares Estimate'' referred to in more
applied math/engineering texts.

\subsubsection{$\kw_n$ and $\kv_n$ are not necessarily Gaussian}
In this case, the Linear-Gaussian model is still useful. By assuming the system
to be of the above form (that is, Linear, Gaussian, and Markov) we gain a great deal
of computational simplicity. For many systems, over suitable time-intervals, this
sub-optimal model may be quite sufficient. Of course, for moderately non-linear systems
this may be completely useless. Just as we approximate Dynamical Systems with Linear Approximations,
we may consider this model the ``linearized'', ``first-order'' approximation.


We begin by finding an orthogonal basis for $\lspan \yonen$. We may implement standard Gram-Schmidt
orthogonalization in $\kHd$ as described in the next subsection.


\subsection{Shadows in Infinite Dimensional Space}
I imagine the system's behavior in time as an unfolding in the dark recesses of $\kHd$.
Each time step brings a new element to the basis, a new axis. The State Sequence goes
twisting and writhing along, while we, in some hyperdimensional Allegory of the Cave,
peer at its flickering images in the space of our observations. While the Probability
Space spanned by the system builds over time into some baroque geometry of causality,
we poke and peer at its shadow, trying to glean some hint of the Truth which underlies
surface appearances, rife as they are with noise and error.

We begin this process by performing a Gram-Schmidt orthogonalization of both our State and
Observation sequences. This is a recursive decomposition of the Random Variables. By
{\bf Theorem 1} we may decompose $\yn$, the $n^{th}$ observation, into  its projections
on $\lspan \yonenm$ and its complement $\kHd \ominus \lspan \yonenm$ for any $n$.

\bqray
\yn &=& \proj (\yn | \lspan \yonenm)  + \proj ( \yn | (\lspan \yonenm )^{\perp})\\
\nonumber  &&\forall n
\eqray
{\defn for notational convenience, and more or less in keeping with Kalman's
original notation, $\forall n$
\bqray
\nonumber \kyh_n  &:=& \proj (\yn | \lspan \yonenm)\\
\nonumber \kyt_n  &:=& \proj ( \yn | (\lspan \yonenm)^{\perp})\\
	 &\Rightarrow& \kyh_n \perp \kyt_n\\
	 \ky_n&=&\kyh_n+\kyt_n
\eqray}

To generate an orthogonal basis for $\lspan \yonen$ we merely note that if we assume
some knowledge of initial conditions we may set

\bqray
\nonumber \kyh_0 &=& \ky_0\\
\kyt_n &=& \ky_n - \kyh_n
\eqray

The $\{ \kyt_n \}$ are pairwise orthogonal (and independent in the Gaussian case), and by construction $\lspan \yonen = \lspan \ytonen$.
We will thus use the $\kyt$'s for our ``reference basis''. Our next step is to perform
a similar decomposition on the state variable sequence.

{\defn $\forall m \leq n$
\bqray
\kxh_n &:=&\proj (\kx_n | \lspan \ytonenm)\\
\kxhh_n &:=&\proj (\kx_n | \lspan \ytonen)\\
\kxt_n &:=&\proj (\kx_n | (\lspan \ytonenm)^{\perp})\\
\kxtt_n &:=&\proj (\kx_n | (\lspan \ytonen)^{\perp})\\
&\Rightarrow& \kx_n = \kxh_n + \kxt_n\\
&\Rightarrow& \kxh_n \perp \kxt_n
\eqray
Since $\kxt_n = \kx_n - \kxh_n$ we call $\kxh_n$ the {\em estimate} of $\kx_n$
based on the (orthogonalized) observations $\ytonenm$, and $\kxt_n$ the {\em error}.
}

Now, by orthogonality of the $\kyt_n$ we have
\begin{equation}
\lspan \ytonen=\lspan \ytonenm \oplus \lspan \{ \kyt_n \}
\end{equation}
so we may decompose the projections:
\bqray
 \proj (\kx_n | \lspan \ytonen)&=& \proj (\kx_n | \lspan \ytonenm) + \proj (\kx_n |\lspan\{\kyt_n\})\\
 \kxhh_n &=& \kxh_n + \proj (\kx_n |\kyt_n)
\eqray
 That is, we may split the process of calculating {\em this} turn's estimate into
a sum of an estimate based on {\em last} turn's estimate and one based entirely on the
current observation. This has obvious computational as well as aesthetic merit, and the
fact that the two spaces are orthogonal complements allows us to treat them separately; the two
estimates are independent random variables \footnote {Actually, we can only guarantee their
independence if they are Gaussian - else we really should say only orthogonal. Note that
this is entirely sufficient here.} . If we consider $\ky_n$ the ``current'' observation, we may
term the projection of $\kx_n$ on $\lspan
\ytonenm$ the {\em a priori} estimate, that on $\lspan \{ \kyt_n \}$ the {\em update},
and on $\lspan \{ \ytonenm \}$the {\em posteriori} estimate, because it is our best estimate
based on all the information availible at that time.




\subsubsection{Calculating the {\em a priori} estimate, $\kxh_n$}

To calculate the {\em a priori} estimate, we simply use our defined system structure,
and the fact that $\kw_{n-1}$ is time orthogonal to $\ytonenm$, to obtain:
\begin{eqnarray}
\sh \kxh_n&=&\proj (\kx_n | \lspan \ytonenm)\\
\sh &=& \proj (\kA \kx_{n-1} + \kw_{n-1} | \lspan \ytonenm)\\
\sh &=& \proj (\kA \kx_{n-1} | \lspan \ytonenm) + \proj (\kw_{n-1} | \lspan \ytonenm)\\
\sh &=& \kA \proj (\kx_{n-1} | \lspan \ytonenm) + 0 \\
&=& \kA \kxhh_{n-1}
\end{eqnarray}

\subsubsection{Conditioning on the observation error, $\kyt_n$}
We begin by noting that $\kxh_n \perp \kyt_n$, and that
$\kxt_n$ is a zero-mean RV. We use these facts thusly:
\bqray
\sh \proj (\kx_n | \kyt_n) &=& \proj (\kxh_n + \kxt_n | \kyt_n)\\
\sh &=&\proj (\kxt_n | \kyt_n)\\
&=& [{\kxt_n} , \kyt_n][\kyt_n , \kyt_n]^{-1} \kyt_n\\
&=& \kK_n \kyt_n
\eqray
where we're defining $\kK_n$ as the Projection Operator from $\lspan \{ \kyt_n \}$ to $\lspan \{ \kx_n \}$\footnote{
What does this really mean? These spaces are not the same, but they {\em are} both subspaces of the same larger space.
See Appendix C.}: the
 {\em Kalman Gain Matrix} of renown. Note that we have now reduced the filtering
update to calculating the two covariance matrices, $[{\kxt_n} , \kyt_n]$ and
$[\kyt_n , \kyt_n]$. We will now proceed to derive a recursive expression for these matrices.
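
As a quick sanity check on the gain formula (a sketch that assumes only the bilinearity of the bracket
$[\cdot , \cdot]$ and the invertibility of $[\kyt_n , \kyt_n]$), the residual left after removing
$\kK_n \kyt_n$ from $\kxt_n$ is orthogonal to $\kyt_n$, which is exactly the property that characterizes a projection:
\begin{eqnarray*}
[\kxt_n - \kK_n \kyt_n , \kyt_n] &=& [\kxt_n , \kyt_n] - \kK_n [\kyt_n , \kyt_n]\\
&=& [\kxt_n , \kyt_n] - [\kxt_n , \kyt_n][\kyt_n , \kyt_n]^{-1}[\kyt_n , \kyt_n]\\
&=& 0
\end{eqnarray*}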

\subsubsection{Covariance Decomposition}

 It is worth reviewing some of the tricks which will be used in the following development.
 Most of these are used so repetitiously that it would be extremely laborious and aesthetically
 abhorrent to document each individual application. Firstly, the properties of the Outer Product
 will be used almost continuously (see section 2.1.1). Secondly, we will be attempting to simplify things by decomposing
 mathematical objects into components, some of which will go to zero. Generally, things will go
 to zero because of their independence, and the fact that the expected Outer Product of independent zero-mean RV's is
 the Zero Matrix. There are two general ways we will see independence arise: the first, which I call
 {\em Structural Orthogonality}, is the standard independence of completely unrelated variables. For
 instance, we are assuming the plant and measurement noise to be structurally independent. Other times,
 we will claim orthogonality between objects which are clearly related. This is possible if the objects
 are made independent by a difference in time index. For instance, the $(n-1)^{th}$ observation is independent
 of the $n^{th}$ measurement noise RV. I call this {\em Time Orthogonality}. Most of what transpires below
 can be explained in terms of the properties reviewed here. So let us begin this final campaign,
 and strike out for that tauntingly closer goal toward which we strive. We're almost there.
 
We begin by focusing our attention on $\{ \kxt_n \}$. This sequence, while unavailable
for direct observation, contains all the information about the behavior of the system in the
sense that it yields a deterministic knowledge of the system's path through state space. We
will base our model of the system's behavior on the statistical behavior of this sequence.

To begin, we will express the covariance matrices used in terms of the covariance of
$\kxt_n $, denoted $\kSt_n :=[\kxt_n]:=\ev (\kxt_n \kxt^T_n)$.

\begin{eqnarray}
\sh \kyt_n &=& \ky_n - \proj (\ky_n | \lspan \ytonenm)\\
\sh &=& \ky_n - \proj (\kC \kx_n + \kv_n | \lspan \ytonenm)\\
\sh &=& \ky_n - \kC \proj (\kx_n | \lspan \ytonenm) - \proj (\kv_n | \lspan \ytonenm)\\
&=& \ky_n - \kC \kxh_n - 0\\
&=& \kC \kxt_n + \kv_n
\end{eqnarray}
This decomposition of the vector allows us to analyse its covariance as follows (the cross terms
vanish because $\kxt_n$ and $\kv_n$ are orthogonal):
\bqray
[\kyt_n]&=&[\kC \kxt_n +\kv_n]\\
\sh &=&[\kC \kxt_n] + [\kv_n]\\
\sh &=&\kC [\kxt_n] \kC ^{T} +[\kv_n]\\
 &=& \kC \kSt_n  \kC^T + \kS_v
\eqray

Now, in order to similarly express the second covariance matrix needed in terms
of $\kSt_n$, we again begin with a familiar decomposition:

\bqray
[\kx_n, \kyt_n] &=& [\kxh_n + \kxt_n, \kyt_n]\\
\sh &=&[\kxh_n, \kyt_n] + [\kxt_n, \kyt_n]\\
\sh &=&0 + [\kxt_n, \kC \kxt_n + \kv_n]\\
\sh &=&[\kxt_n, \kC \kxt_n] + [\kxt_n, \kv_n]\\
\sh &=&\kSt_n \kC^T + [\kx_n - \kxh_n, \kv_n]\\
\sh &=&\kSt_n \kC^T + [\kx_n, \kv_n] - [\kxh_n,\kv_n]\\
\sh &=&\kSt_n \kC^T + [\kA \kx_{n-1} + \kw_{n-1}, \kv_n] - 0\\
\sh &=&\kSt_n \kC^T + [\kw_{n-1}, \kv_n] + \kA [\kx_{n-1},\kv_n]\\
&=&\kSt_n \kC^T
\eqray


\subsubsection{Formulation of Discrete Time Kalman Filter}
So we have that
\bqray
\kK_n &=&  [{\kxt_n} , \kyt_n][\kyt_n , \kyt_n]^{-1}\\
&=& [\kSt_n \kC^T][ \kC \kSt_n  \kC^T + \kS_v]^{-1}
\eqray

We can now, given the value of $\kSt_n$, propagate our state estimate from
$\kxhh_{n-1}$ to $\kxhh_n$. We are not yet able, however, to propagate
our error covariance matrix. We need to look at
\bqray
\kxtt_n &=& \kx_n - \kxhh_n\\
&=&\kx_n - \kxh_n - \kK_n \kyt_n\\
&=&\kxt_n - \kK_n \kyt_n\\
&=&\kxt_n - \kK_n (\kC \kxt_n + \kv_n)\\
&=&({\bf 1} - \kK_n \kC) \kxt_n - \kK_n \kv_n
\eqray
so that
\bqray
\kStt_n &=& [\kxtt_n]\\
&=&[({\bf 1} - \kK_n \kC) \kxt_n - \kK_n \kv_n]\\
&=&({\bf 1} - \kK_n \kC) \kSt_n ({\bf 1} - \kK_n \kC)^T + \kK_n \kS_v \kK_n^T\\
&=&({\bf 1} - \kK_n \kC) \kSt_n
\eqray

This last step is obtained by substituting in our expression for $\kK_n$, hacking away with
matrix algebra, then reexpressing in terms of $\kK_n$.
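
For readers who would like that matrix algebra spelled out, here is one way it can go. This is a sketch
that assumes $\kSt_n$ is symmetric and uses the identity $\kK_n (\kC \kSt_n \kC^T + \kS_v) = \kSt_n \kC^T$,
which is just the definition of $\kK_n$ multiplied through by the inverted factor:
\begin{eqnarray*}
&&({\bf 1} - \kK_n \kC) \kSt_n ({\bf 1} - \kK_n \kC)^T + \kK_n \kS_v \kK_n^T\\
&=& ({\bf 1} - \kK_n \kC) \kSt_n - \kSt_n \kC^T \kK_n^T + \kK_n \kC \kSt_n \kC^T \kK_n^T + \kK_n \kS_v \kK_n^T\\
&=& ({\bf 1} - \kK_n \kC) \kSt_n - \kSt_n \kC^T \kK_n^T + \kK_n (\kC \kSt_n \kC^T + \kS_v) \kK_n^T\\
&=& ({\bf 1} - \kK_n \kC) \kSt_n - \kSt_n \kC^T \kK_n^T + \kSt_n \kC^T \kK_n^T\\
&=& ({\bf 1} - \kK_n \kC) \kSt_n
\end{eqnarray*}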
Now, all we need is the propagation from {\em a posteriori} error last turn
to {\em a priori} error this turn.
\bqray
\kSt_n  &=& [ \kxt_n ] \\
\sh &=& [ \kx_n - \kxh_n]\\
\sh &=& [ (\kA \kx_{n-1}+\kw_{n-1}) - \kA \kxhh_{n-1}]\\
\sh &=& [ \kA \kxtt_{n-1} + \kw_{n-1}]\\
&=& \kA \kStt_{n-1} \kA^T + \kS_w
\eqray
Combining these last two re-expressions, we see that
\beqn
\kStt_n = ({\bf 1} - \kK_n \kC) (\kA \kStt_{n-1} \kA^T + \kS_w)
\eeqn
which allows us to directly propagate our error covariance from $n-1$ to $n$. If
we express $\kxhh_n$ as a function of $\kStt_{n-1}$ and $\kxhh_{n-1}$ we'll
have our recursive estimation sequence. Combining equations $\{17,26,29\}$ and $\{37,46,48\}$, we have:
\beqn
\kxhh_n = \kA \kxhh_{n-1} + \kK_n \kyt_n
\eeqn
where
\beqn
\kK_n=(\kA\kStt_{n-1}\kA^T +\kS_w) \kC^T (\kC(\kA\kStt_{n-1}\kA^T +\kS_w)\kC^T+\kS_v)^{-1}
\eeqn
This, albeit in an aesthetically treasonous form, is our Discrete Time Kalman filter.
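
To see the recursion in action, consider a hypothetical scalar example that is not one of the simulations
in this paper: a constant state observed directly, with $\kA = 1$, $\kC = 1$, $\kS_w = 0$, $\kS_v = 1$ and
initial error variance $\kStt_0 = 1$ (all of these values are assumptions, chosen only to make the arithmetic easy).
Since $\kA = 1$ and $\kS_w = 0$, the {\em a priori} variance is simply $\kSt_n = \kStt_{n-1}$, and the gain and
covariance updates reduce to
\[ \kK_n = \frac{\kStt_{n-1}}{\kStt_{n-1} + 1}, \qquad \kStt_n = (1 - \kK_n) \kStt_{n-1} \]
Starting from $\kStt_0 = 1$ this gives $\kK_1 = 1/2$ and $\kStt_1 = 1/2$, then $\kK_2 = 1/3$ and $\kStt_2 = 1/3$,
and in general $\kK_n = \kStt_n = 1/(n+1)$. The estimate update then reads
\[ {\kxhh}_n = {\kxhh}_{n-1} + \frac{1}{n+1} ( \ky_n - {\kxhh}_{n-1} ) \]
which is a running average of the initial guess and the observations so far: each new observation is
weighted less as the filter grows more confident.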

\section{Implementations and Examples}
\subsection{Mathematica Code}
I chose to use Stephen Wolfram's ``Mathematica'' environment as my tool to implement the numerical
simulations. While using Matlab or programming directly would be much faster in terms of the time
it takes to run a simulation, the development time would be orders of magnitude greater. Mathematica allows
one to code at a high level, focussing more on the Mathematics than on some arcane syntax - though of course,
Mathematica has its share of silly tricks and frustrations. I am including my source code (temporarily uncommented,
this will soon be rectified) along with the pictures it has generated and some examples of how this was done so
that interested readers with access to Mathematica can play around on their own. The code is also available
as a Latex file on my Thesis Web Site at {\bf http://www2.ucsc.edu/people/scotfree/}.
\subsection{Illustrative Simulations}
\subsubsection{System Examples}
A picture is worth 1K words, or something like that. Let's examine some
sample trajectories of representative systems to develop a sense of the
context we'll be operating in. Please note that the illustrations are separate, located
at the end of the paper.
 A basic example is the following
discretization of two-dimensional harmonic motion.
\[ \Phi = \left[ \begin{array}{cc}
	1 & -1.1\\
	1.1 & 1
	\end{array} \right] \]

We'll start the system at $\kx_0 = \{1,1 \}$, and assume that the various noise inputs have diagonal
covariance matrices (i.e., have independent components). In fact, we'll go so far as to assume
that the components are of equal variance as well, so that we may describe the Random Vectors
with scalar variances, say $\sigma_W$ and $\sigma_V$.
\begin{enumerate}
\item We begin with the system unperturbed by noise of any kind (both variances zero). Note that the natural behavior
is a gentle outward spiral (not a circle, since this is only an approximation of harmonic
motion).
\item  Next let's introduce some error into our observations ($\sigma_V=2$),
\item  and into the state sequence itself ($\sigma_W=1$, $\sigma_V=5$).

\end{enumerate}
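
The outward drift in the first example can be read off from the eigenvalues of $\Phi$. As a sketch, write the
matrix with a generic off-diagonal entry $a$ (a symbol introduced here only for illustration; the matrix
printed above has $a = 1.1$):
\[ \det \left[ \begin{array}{cc}
	1-\lambda & -a\\
	a & 1-\lambda
	\end{array} \right] = (1-\lambda)^2 + a^2 = 0
	\quad \Rightarrow \quad \lambda = 1 \pm ia, \qquad |\lambda| = \sqrt{1+a^2} > 1 \]
Each step therefore rotates the state through the angle $\arctan a$ and stretches it by the factor
$\sqrt{1+a^2}$, so the noise-free trajectory can never close into a circle; the smaller $a$ is, the gentler
the spiral. With the entries as printed, $|\lambda| = \sqrt{2.21} \approx 1.49$.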

In our science-fiction example, what could this system represent? As the oil slick spreads
outward from the foundering supertanker (inebriated pilot cursing, waving his fist from
the railing) we might expect to see some outward radial motion from its denizens. It is
not much of a stretch to imagine temperature differences and atmospheric conditions creating a
rotational flow about the ship. These two together make the outward-spiralling transfer function
behavior we see here. The various forms of noise have the usual interpretations mentioned at the outset.



\subsection{Filtration Examples}

We'll begin by filtering the examples above.
\begin{enumerate}
\item Example 2 above, filtered (100 steps).
\item Example 3 above, filtered (100 steps).
\item last 25 steps.
\item last 5 steps.
\end{enumerate}
 Observe the way the filter responds to major deviations in the observations, echoing
the behavior, but in a much reduced way.

\subsubsection{The Three Dimensional Case}
The relevant matrices here are:

\[ \Phi = \left[ \begin{array}{ccc}
1&.1&0\\
-.1&1&0\\
0&.1&1.1
\end{array} \right] \]
\[ \kC = {\bf 1} = [\kv_n] \]

There is no State Noise.

These are, I think, my favorite pictures, and they will be much more thoroughly annotated later. For now,
the main things to note are these. In the first picture, I
have started the filter with a terrible initial guess and a very small amount of uncertainty, i.e., the guess
is considered almost exact. If the guess is started with greater uncertainty, it is corrected to the real
value almost immediately, and if it is started with no uncertainty, it is a fixed point and the filter has
no effect.

 In the second picture, observations are no longer direct. We have two observation variables:
$y_1=x_1+x_2+\mbox{noise}$ and $y_2=x_3+\mbox{noise}$.

\[ \kC = \left[ \begin{array}{ccc}
1&1&0\\
0&0&1
\end{array} \right] \]

 I need to realign the axes, etc., to make the picture better, but you can
see the filter is actually working! Very cool; I was delighted to see my algorithms work perfectly
in this more complicated case involving pseudo-inverses and weird dimensionality issues.


\appendix
\section{Probability Theory}

Probability Theory, at least in the standard approach due mostly to Kolmogorov, is essentially the study
of certain real-valued set functions, enriched by the near mystic complexity of Independence/Dependence
relationships. I will give a brief synopsis of this development.



We begin with a {\em Ring} of sets, say $\ring$. This is a set of sets which is closed under the symmetric
difference\footnote{The symmetric difference is defined as $(A-B) \cup (B-A)$, which is
often also called ``Exclusive Or''.} and the intersection. Note that closure under these two operations
guarantees closure
under union and set difference as well, and hence under all finite combinations of
these operations\footnote{This is really a ring in the abstract algebraic sense if we consider
set intersection and symmetric difference to be our ``multiplication'' and ``addition'', respectively.}. An {\em Algebra} is
simply a Ring with a {\em Unit} element. This is simply a set (remember, elements of the Ring are sets) which
acts as the identity on any other element of the Ring under intersection. Note that every ring contains
the empty set, which is our identity under symmetric difference (and under union). If these closure properties hold not only when combining pairs (and hence,
any finite number) of sets, but also under countably infinite combinations, the Ring (or Algebra) is called
a {\em $\sigma$-Ring} (or {\em $\sigma$-Algebra}). $\sigma$-algebras are also often called {\em Borel Algebras}. The
classic example of a $\sigma$-algebra is the set of all subsets (of a given set).
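
The closure claim for rings can be checked directly. As a brief sketch, writing $\triangle$ for the symmetric
difference (notation introduced here only for convenience), union and difference are both expressible through
the two ring operations:
\[ A \cup B = (A \triangle B) \triangle (A \cap B), \qquad A - B = A \triangle (A \cap B) \]
Taking the symmetric difference of a set with itself gives $A \triangle A = \emptyset$, which is why every
non-empty ring contains the empty set.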

Next, we define a {\em measure} as a mapping from a $\sigma$-algebra to the non-negative real line. Such a function is
called {\em Additive} if the measure of the union of any two disjoint ``$\sigma$-subsets'' is the sum of their measures.
If this property holds under countable (but still pair-wise disjoint) unions, the measure is said to be {\em $\sigma$
-additive}.

A set endowed with a $\sigma$-additive measure is a {\em Measure Space}.

Now we are ready to build our probability theory.

A {\em Probability Space} is simply a Measure Space such that the measure of the entire space is 1. This
is generally notated as the triplet $(\Omega,{\cal A},P)$ representing the set $\Omega$ over which the $\sigma$-algebra
${\cal A}$ is defined as the domain of $P$, the measure.

This is enough to develop classical probability theory. We can call the elements of $\Omega$ the {\em Elementary
or Simple Events}, the elements of ${\cal A}$ the {\em Compound or Complex Events}, and $P$ the {\em Probability
 Distribution.}

This only allows us abstract, set-theoretic events and outcomes. We need to develop Random Variables
 to do quantitative modelling. We define a {\em Random Variable (``RV'')} as a mapping from $\Omega$ to $\kRd$.
This is written, for an RV $X$,
\bqray
\sh X&:& \Omega  \to  \kRd \\
\sh X&:& \omega \mapsto x 
\eqray


Note that these can, and almost always will, be vector-valued Random Variables, though I will
refer to them as Random Variables for generality.
Linear Combinations of Random Variables can be defined as follows, given two RV's $X$ and $Y$:
\bqray
\sh aX+bY &:& \Omega \to \kRd \\
\sh aX+bY &:& \omega \mapsto aX(\omega)+bY(\omega)
\eqray

Lastly, we note that

\[ P_X(\beta) := P(X^{-1}(\beta)), \hspace{10pt} \forall \beta \in {\cal B} \]
defines a probability measure over $\Bbb R$. Thus, a Random Variable can be thought of as defining a Probability Space
over $\Bbb R$. In fact, it is possible to go back the other way. It can be shown that, given such a Probability Measure
over $\Bbb R$, there must exist an $(\Omega,{\cal A},P)$ and $X$ which induce it. So it would seem that we can
associate the measure over $\Bbb R$, rather than the mapping from $(\Omega,{\cal A},P)$ to $\cal B$, with the RV. See
[Eaton] for a more in-depth development.













\section{Pseudo-Inverses}
Pseudo-Inverses seem to be rarely used by pure mathematicians, but they are of great practical use. They allow us to solve
matrix equations where true inverses don't exist. In other words, they allow us to generate a pre-image of a vector
under a mapping, even if that mapping isn't one-to-one. The seminal work, at least to the extent that I can discover,
seems to have been done (separately) by Kalman and Penrose.

While the details can be somewhat grueling, the basic idea is quite elegant and simple. Suppose we have a non-square
matrix $A$, say of dimension $m$ by $n$. Now, let us suppose that either the columns or the rows (whichever are fewer
in number, of course) are linearly independent. Then consider: either $AA^T$ ($m$ by $m$) or $A^TA$ ($n$ by $n$) will be a square
matrix with full rank, and thus invertible. In the latter case, we say that
$(A^TA)^{-1}A^T$ is the pseudo-inverse, since

\bqray
\sh ((A^TA)^{-1}A^T)A &=& (A^TA)^{-1}(A^TA)\\
\sh                   &=& {\bf 1}
\eqray

Similarly if $m < n$, though we will need to post- rather than pre-multiply. For this reason, these two are often
called, respectively, the Left and Right pseudo-inverses, corresponding to what's known in Linear Algebra circles
as the Over- and Under-Determined Cases.
 It can be shown that these pseudo-inverses always exist, and, with some additional conditions, are unique. Also,
 if the true inverse exists, the pseudo-inverse simplifies to it.
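
For the under-determined case the corresponding right pseudo-inverse is $A^T(AA^T)^{-1}$, since
$A (A^T(AA^T)^{-1}) = (AA^T)(AA^T)^{-1} = {\bf 1}$. As a small worked illustration (hand arithmetic, not output
from the simulations), take the $2$ by $3$ observation matrix $\kC$ with rows $(1,1,0)$ and $(0,0,1)$ from
the three dimensional example:
\[ \kC \kC^T = \left[ \begin{array}{cc}
2&0\\
0&1
\end{array} \right], \qquad
\kC^T (\kC \kC^T)^{-1} = \left[ \begin{array}{cc}
1/2&0\\
1/2&0\\
0&1
\end{array} \right] \]
Multiplying $\kC$ on the right by this $3$ by $2$ matrix does indeed return the $2$ by $2$ identity.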


\section{Subjective Nature of Perception}
This has been the most involved intellectual project of my life thus far, and has been fraught
with supreme enjoyment and abject frustration - generally both at the same time. While the main body
of the paper is certainly not free of my personality, I think it is important to make explicit
here at the end some of the issues that have been raised by this. In other words, I want to locate
myself in relation to the paper, and to my further goals.

I began this paper with no knowledge of Measure Theory (and hence, of rigorous probability theory),
Hilbert Space Theory (at least of the infinite dimensional variety), and no real understanding
of the importance and nature of orthogonality and projection. While this ignorance is sure to be
obvious to any more knowledgeable reader, I do feel that my knowledge of these subjects has grown
considerably in the course of writing this beast.

I would also like to express my feeling about the strengths and weaknesses of the paper as I see
them. I think the projections/Hilbert Space approach is a good one, in that it yields a simple, clean 
derivation which is comprehensible intuitively in familiar geometric terms. While I can hardly
claim that my work here is truly original, I do feel a great deal of satisfaction in knowing that
no book I found presented exactly this approach. Most make some offhand comments about geometric
interpretations, but end up deriving it in some ugly fashion which hides the essential simplicity
and beauty of the theory. I am very proud of the fact that I knew how {\em I} wanted to understand
it, and despite huge gaps and mistakes, I feel that I have achieved this.

My main disappointment is in not having been able to incorporate Information Theory.

This was my original intention, but in the end it was enough to just sketch what I have here. Between
comments in Kalman's paper, and my own research, it is clear that there are deep, important connections to
be made here. Fisher Information, the Cram\'er-Rao lower bound, the Gaussian Channel, and Mutual Information
all have a great deal of relevance here, and I dearly hope to continue my work in this direction.

In the end, as the culmination of a Bachelor's Degree in Mathematics and a big step down a beautiful path,
all I can say is Thank You to all the people who have helped me achieve this. I believe I've actually
done it. I cannot believe how much I've learned, and how much more I've learned there is to learn.

\subsection{The biggest flaw/strength of the paper}
The ambiguity between $\kHd$ as a vector space in its own right and as mere shufflings about of scalar random
variables is very difficult for me to settle. It was clear early on that the latter approach would be the easiest
to treat rigorously. I really wanted to attempt the former, and think of everything as vectors and matrices. Of course,
one builds the former out of the latter, but one can't know everything when one begins. I have kept
things as they stand as a challenge to myself, because I know the development can proceed this way, and because
I think it makes the topic more approachable to non-specialists.

\section{Bibliography}
Balakrishnan, A. V.:``Kalman filtering theory''
	University series in modern engineering.
	New York : Optimization Software, Inc., Publications Division, 1984.

Bjerhammar, Arne:``Theory of errors and generalized matrix inverses''
	Amsterdam, New York, Elsevier Wetenschappelijke Uitgeverij, 1973

Chen, G.:``Linear stochastic control systems''
	Boca Raton, FL : CRC Press, [1995]

Grewal, Mohinder S.; Andrews, A.P.:``Kalman filtering : theory and practice''
	Prentice-Hall information and system sciences series.
	Englewood Cliffs, N.J. : Prentice-Hall, c1993.

Eaton, Morris L.:``Multivariate statistics : a vector space approach''
	Wiley series in probability and mathematical statistics.
	New York : Wiley, c1983.

Kalman, R.E. : ``New Methods in Wiener Filtering'' Proceeding of First Conf. on Applications
	of Random Function Theory in Engineering (?)

Kolmogorov, A. N. (Andrei Nikolaevich):``Foundations of the theory of probability''
	New York, Chelsea Pub. Co., 1956.

Kolmogorov, A.N.; Fomin, S.V. :``Introductory Real Analysis''
	Translated and Edited by Richard Silverman
	New York: Dover, 1975.

Pitt, H. R.:``Measure and integration for use''
	Oxford science publications.
        IMA monograph series.
	Oxford : Clarendon, 1985.

Ruymgaart, P. A.,Soong, T. T.:``Mathematics of Kalman-Bucy filtering''
	Springer series in information sciences ; v. 14.
	Berlin ; New York : Springer-Verlag, 1985.

Schwartz, Leonard S.:``Principles of coding, filtering, and information theory''
	Baltimore, Spartan Books, 1963

Sontag, Eduardo D.:``Mathematical control theory : deterministic finite dimensional systems''
	Texts in applied mathematics ; 6
	New York : Springer-Verlag, c1990.

Small, H.;McLeish, D.L.:``Hilbert Space Methods in Probability and Statistical Inference''
	Wiley Series in Probability and Statistical Inference
	New York: Wiley, 1994.

Stengel, Robert F.:``Stochastic optimal control : theory and application''
	New York : Wiley, c1986





Suggestions for Further Reading:




\end{document}



