Learning Algorithm

  1. Let \(D\) be a domain of possible experiences with an associated probability distribution.
    • If \(D\) is countable, then each \(s \in D\) has an associated probability \(p(s)\). If \(D\) is uncountable, then one also needs a σ-algebra \(\mathbf{A}\) over \(D\) and a probability measure that assigns to each \(A \in \mathbf{A}\) the probability \(P[s \in A]\). Here I will assume that \(D\) is finite or countable.
  2. Consider an agent \(\pi_\theta : D \mapsto D\) that goes from experience to experience and is parametrized by \(\theta \in \Theta\), which characterizes the space of possible agents (of a certain class).
  3. Let \(T \in \mathcal{P}(D)\) be the available training experience to learn from where \(\mathcal{P}(D)\) is the power set of \(D\). Sometimes one may work with a more refined notion of the domain of possible sets of training experiences, namely, \(T \in D_T \subset \mathcal{P}(D)\).
  4. A task is defined by a performance measure that associates a score with an agent’s behavior in an experiential context: \(\phi^\pi : (\Theta, D) \mapsto \mathbb{R}\).
    • One can take a behaviorist reinforcement learning perspective and use reward instead: \(\mathit{R} : (D, D) \mapsto \mathbb{R}\), the reward associated with experiencing an agent’s action, such as going from one state to the next. This view is agnostic to the agent’s implementation details.
  5. \(\Phi\) denotes the expected performance: \(\Phi^\pi(\theta, D) = \mathbb{E}_{s \in D}[\phi^\pi(\theta, s)]\).
    • Or in the case of reward, \(\Phi^\pi(\theta, D) = \mathbb{E}_{s \in D}[\mathit{R}(s, s')]\) where \(s' = \pi_\theta(s)\).
  6. A learning algorithm adjusts the agent’s parameters \(\theta\) based on training experience \(T \in D_T\) so that its expected performance increases: \(L_\mu^{\phi^\pi} : (\Theta, D_T) \mapsto \Theta\). The learning algorithm has its own parameter space that may be distinct from the agent’s, \(\mu \in \mathcal{M}\).
    • Note that in principle the learning algorithm only knows the adjustable parameters of the agent and not necessarily any details about the agent’s architecture.
    • Some learning algorithms use feedback from querying the agent during learning.
    • \(\mu\) are called meta-parameters because the agent already has its own 🤓😛
  7. The expected performance gain \(\delta\), assuming a probability distribution over possible agent parametrizations, should be greater than zero (a minimal code sketch of these definitions follows the list):

    \[\delta(L_\mu^{\phi^\pi}) = \mathbb{E}_{\theta \in \Theta,\, T \in D_T}[\Phi^\pi(L_\mu^{\phi^\pi}(\theta, T), D) - \Phi^\pi(\theta, D)]\]
  8. A meta-learning algorithm learns to learn and is defined analogously to adjust the learning algorithm’s parameters \(\mu\) so that its expected performance increases: \(ML^{\delta}_{\mu'} : (\mathcal{M}, D_T) \mapsto \mathcal{M}\), with its own parameters \(\mu' \in \mathcal{M}'\). The expected performance gain \(\delta'\) is defined similarly:

    \[\delta'(ML^{\delta}_{\mu'}) = \mathbb{E}_{\mu \in \mathcal{M},\, T \in D_T}[\delta(L^{\phi^\pi}_{ML(\mu, T)}) - \delta(L^{\phi^\pi}_\mu)]\]
  9. One can have a learning algorithm that does meta-learning.

    Something like \(L_{\mu'}^{\phi^\pi} : (\mathcal{M}, D_T) \mapsto ((\Theta, D_T) \mapsto \Theta)\)
    • An algorithm could do as many levels of meta-learning as desired.
    • Moreover, if \(\mathcal{M} \subseteq \Theta\), then one has a meta-learning agent that can do the learning on its own as part of the process of taking actions and experiencing.
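
    What follows is a minimal Python sketch of the definitions above, assuming a finite domain \(D\) represented as a list with an explicit distribution \(p\); the agent, performance measure, and learning algorithm are all plain functions, and the hill-climbing learner is only an illustrative placeholder, not part of the formalism.

```python
# Toy rendering of items 2-7 for a finite domain D (illustrative only).
# theta parametrizes an agent pi_theta; phi(theta, s) scores an experience;
# Phi is the expected performance; L maps (theta, T) to new parameters.

def Phi(phi, theta, D, p):
    """Expected performance E_{s in D}[phi(theta, s)] under the distribution p."""
    return sum(p[s] * phi(theta, s) for s in D)

def delta(L, phi, D, p, sampled_thetas, sampled_training_sets):
    """Monte Carlo estimate of the expected performance gain of a learner L."""
    gains = [
        Phi(phi, L(theta, T), D, p) - Phi(phi, theta, D, p)
        for theta, T in zip(sampled_thetas, sampled_training_sets)
    ]
    return sum(gains) / len(gains)

def make_hill_climber(phi, perturb, steps=100):
    """A placeholder learning algorithm L_mu: local search on theta using a
    caller-supplied perturbation, scored only on the training experiences T
    (here the meta-parameter mu is just `steps`)."""
    def L(theta, T):
        def score(th):
            return sum(phi(th, s) for s in T) / len(T)
        best, best_score = theta, score(theta)
        for _ in range(steps):
            candidate = perturb(best)
            if score(candidate) >= best_score:
                best, best_score = candidate, score(candidate)
        return best
    return L
```

    A meta-learning algorithm would follow the same pattern one level up: measure \(\delta\) for different values of the meta-parameter (here `steps`) and adjust it accordingly.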

Positives

  1. Learning to identify even and odd numbers.
    • \(D = \mathbb{N}\), or say, \(D = \{1, \dots, n\} \subset \mathbb{N}\) for large \(n\) so that we can set \(p(m) = \frac{1}{n}\) for each \(m \in D\).
    • \(\phi^\pi(\theta,m) = \begin{cases}
      1 & \text{\(\exists x \in \mathbb{N}, m = 2x \) and \(\pi_\theta(m) = \) 'even'} \\
      1 & \text{\(\exists x \in \mathbb{N}, m = 2x + 1 \) and \(\pi_\theta(m) = \) 'odd'} \\
      \epsilon & \text{otherwise}
      \end{cases}\)
    • Perhaps the agent could be a symbolic regression system?
    • Apparently genetic programming has traditionally been used! (A toy sketch of this performance measure follows the list.)
  2. Supervised learning is the classic scenario: the experiential domain consists of (observation, action) pairs, and the correct actions for many observations have already been curated into a training dataset.
    • \(D = \mathcal{O} \times \mathcal{A}\)
    • \(\phi^\pi (\theta, (x,y)) = \begin{cases}
      1 & \text{\(\pi_\theta(x) = y\)} \\
      \epsilon & \text{\(\pi_\theta(x) \neq y\)}
      \end{cases}\)

      Or in the case that \(\mathcal{A}\) is a metric space, distance from the correct answer can be measured directly: \(\phi^\pi(\theta, (x,y)) = e^{-|\pi_\theta(x) - y|}\)
    • The agent could be a feedforward neural network.
    • The learning algorithm could be a form of gradient descent (a toy sketch follows this list).
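
    A concrete, hypothetical rendering of the even/odd performance measure from item 1 above; `EPSILON` and the `agent` callable are placeholders standing in for \(\epsilon\) and \(\pi_\theta\).

```python
EPSILON = 0.01  # placeholder value for the small score epsilon

def phi_parity(theta, m, agent):
    """Score 1 for a correct 'even'/'odd' label on m, epsilon otherwise."""
    answer = agent(theta, m)
    if m % 2 == 0 and answer == 'even':
        return 1.0
    if m % 2 == 1 and answer == 'odd':
        return 1.0
    return EPSILON
```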
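
    A toy rendering of the supervised-learning scenario of item 2, using the metric-space performance measure and assuming a one-parameter linear agent with plain gradient descent on squared error; this is a sketch of the setup, not a claim about any particular architecture.

```python
import math

def agent(theta, x):
    """A one-parameter stand-in for a feedforward network: pi_theta(x) = theta * x."""
    return theta * x

def phi_metric(theta, example):
    """Performance e^{-|pi_theta(x) - y|} when the action space is a metric space."""
    x, y = example
    return math.exp(-abs(agent(theta, x) - y))

def gradient_descent(theta, T, lr=0.01, epochs=200):
    """Learning algorithm L_mu(theta, T): gradient steps on squared error over T.
    The step size lr and epoch count play the role of the meta-parameters mu."""
    for _ in range(epochs):
        for x, y in T:
            grad = 2 * (agent(theta, x) - y) * x  # d/dtheta of (theta*x - y)^2
            theta -= lr * grad
    return theta

# Training experience generated by the target relation y = 3x.
T = [(x, 3 * x) for x in range(1, 6)]
theta_before, theta_after = 0.0, gradient_descent(0.0, T)
print(phi_metric(theta_before, (7, 21)), phi_metric(theta_after, (7, 21)))
# Performance on an experience outside T improves, which is the point of item 7.
```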

Negatives

  1. An associative array (map, symbol table, or dictionary) is perhaps the best negative example.
    • \(D = \mathcal{O} \times \mathcal{A}\)

      \(\phi^\pi (\theta, (x,y)) = \begin{cases}
      1 & \text{\(\pi_\theta(x) = y\)} \\
      \epsilon & \text{\(\pi_\theta(x) \neq y\)}
      \end{cases}\)
  2. Build a search tree or hash table to store the (key, value) pairs in the training dataset, \(T\).

    The performance will be perfect on \(T\), but the expected performance on \(n\) experiences not in \(T\) will be \(\epsilon n\), as bad as possible (see the sketch below).
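
    A sketch of why this fails as learning, using the same placeholder \(\epsilon\): the lookup-table “learner” memorizes \(T\) perfectly, but its performance on anything outside \(T\) stays at \(\epsilon\), so nothing carries over.

```python
EPSILON = 0.01  # placeholder score for wrong or unknown answers

def learn_lookup_table(theta, T):
    """The 'learning algorithm': copy the (key, value) pairs of T into the table."""
    table = dict(theta)  # theta here is just the current table contents
    table.update(T)
    return table

def agent_lookup(table, x):
    """The 'agent': exact retrieval; anything not stored comes back wrong (None)."""
    return table.get(x)

def phi(table, example):
    x, y = example
    return 1.0 if agent_lookup(table, x) == y else EPSILON

T = {1: 'odd', 2: 'even', 3: 'odd'}
table = learn_lookup_table({}, T)
print(sum(phi(table, kv) for kv in T.items()))                # perfect on T: 3.0
print(sum(phi(table, e) for e in [(4, 'even'), (5, 'odd')]))  # epsilon per unseen item
```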

Notes

– Appropriated from Metalearning on Scholarpedia by the infamous Dr. Juergen Schmidhuber and Tom Schaul.

– If \(D_T = \mathcal{P}(D)\) is known and finite, then the ‘learning problem’ is an ‘optimization problem’ and can be solved in principle.

– The expected performance, \(\Phi^\pi(\theta, D) = \mathbb{E}_{s \in D}[\phi^\pi(\theta, s)]\), depends on the distribution over \(D\), which may not be entirely known. Nevertheless, one wants to estimate performance over the whole distribution so that learning carries over to new experiences beyond the training set. Without this criterion, effective indexing algorithms would count as “learning”!