- Let \(D\) be a domain of **possible experiences** with an associated probability distribution.
  - If \(D\) is countable, then each \(s \in D\) has a probability \(p(s)\) associated with it. If \(D\) is uncountable, then one also needs a σ-algebra \(\mathbf{A}\) over \(D\), and then for any \(A \in \mathbf{A}\) there is an associated probability measure \(P[s \in A]\). Here I will assume that \(D\) is finite or countable.
- Consider an **agent** \(\pi_\theta : D \mapsto D\) that goes from experience to experience and is parametrized by \(\theta \in \Theta\), which characterizes the space of possible agents (of a certain class).
- Let \(T \in \mathcal{P}(D)\) be the available **training experience** to learn from, where \(\mathcal{P}(D)\) is the power set of \(D\). Sometimes one may work with a more refined notion of the domain of possible sets of training experiences, namely, \(T \in D_T \subset \mathcal{P}(D)\).
- A **task** is defined by a **performance measure** that associates a score with an agent's behavior in an experiential context: \(\phi^\pi : (\Theta, D) \mapsto \mathbb{R}\).

- One can take a behaviorist reinforcement learning perspective and use **reward** instead: \(R : (D, D) \mapsto \mathbb{R}\), the reward associated with experiencing an agent's action, such as going from one state to the next. This view is agnostic to the agent's implementation details.
- \(\Phi\) denotes the **expected performance**: \(\Phi^\pi(\theta, D) = \mathbb{E}_{s \in D}[\phi^\pi(\theta, s)]\).

  - Or in the case of reward, \(\Phi^\pi(\theta, D) = \mathbb{E}_{s \in D}[R(s, s')]\) where \(s' = \pi_\theta(s)\).
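For a finite or countable domain, the expected performance is just a \(p\)-weighted sum. A minimal sketch (the function and variable names here are illustrative, not from the source):

```python
from typing import Callable, Dict, Hashable

def expected_performance(
    phi: Callable[[object, Hashable], float],  # performance measure phi(theta, s)
    theta: object,                             # agent parameters
    p: Dict[Hashable, float],                  # probability p(s) for each s in D
) -> float:
    """Phi(theta, D) = E_{s in D}[phi(theta, s)] over a countable domain."""
    return sum(p_s * phi(theta, s) for s, p_s in p.items())
```

For example, with a uniform distribution over four experiences and a measure that scores 1 on even inputs and 0 otherwise, the expected performance is 0.5.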
- A **learning algorithm** adjusts the agent's parameters \(\theta\) based on training experience \(T \in D_T\) so that its expected performance increases: \(L_\mu^{\phi^\pi} : (\Theta, D_T) \mapsto \Theta\). The learning algorithm has its own parameter space, \(\mu \in \mathcal{M}\), which may be distinct from the agent's.

  - Note that in principle the learning algorithm only knows the adjustable parameters of the agent and not necessarily any details about the agent's architecture.
  - Some learning algorithms use feedback from querying the agent during learning.
  - \(\mu\) are called meta-parameters because the agent already has its own 🤓😛
- The **expected performance gain** \(\delta\), assuming a probability distribution over possible agent parametrizations, should be greater than zero:

\[\delta(L_\mu^{\phi^\pi}) = \mathbb{E}_{\theta \in \Theta, T \in D_T}[\Phi^\pi(L_\mu^{\phi^\pi}(\theta, T)) - \Phi^\pi(\theta)]\]

- A **meta-learning** algorithm *learns to learn* and is defined analogously to adjust the learning algorithm's parameters \(\mu\) such that its expected performance increases: \(ML^{\delta^L}_{\mu'} : (\mathcal{M}, D_T) \mapsto \mathcal{M}\), with its own parameters \(\mu' \in \mathcal{M}'\). The expected performance gain \(\delta'\) is similar:

\[\delta'(ML^{\delta^L}_{\mu'}) = \mathbb{E}_{\mu \in \mathcal{M}, T \in D_T}[\delta(L^{\phi^\pi}_{ML(\mu, T)}) - \delta(L^{\phi^\pi}_\mu)]\]

- One can have a learning algorithm that does meta-learning.

Something like, \(L_{\mu'}^{\phi^\pi} : (\mathcal{M}, D_T) \mapsto ((\Theta, D_T) \mapsto \Theta)\)

- An algorithm could do as many levels of meta-learning as desired.
- Moreover, if \(\mathcal{M} \subseteq \Theta\), then one has a *meta-learning agent* that can do the learning on its own as part of the process of taking actions and experiencing.
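If one can sample agent parametrizations \(\theta\) and training sets \(T\), the expected performance gain \(\delta\) can be estimated by Monte Carlo. This is a hypothetical sketch under those sampling assumptions; all names are illustrative:

```python
import random
from typing import Callable, List

def estimate_gain(
    learn: Callable[[float, List[int]], float],  # L(theta, T) -> adjusted theta
    Phi: Callable[[float], float],               # expected performance of theta
    sample_theta: Callable[[], float],           # draw theta from its distribution
    sample_T: Callable[[], List[int]],           # draw a training set T
    n_samples: int = 1000,
) -> float:
    """Monte Carlo estimate of delta(L) = E_{theta, T}[Phi(L(theta, T)) - Phi(theta)]."""
    total = 0.0
    for _ in range(n_samples):
        theta, T = sample_theta(), sample_T()
        total += Phi(learn(theta, T)) - Phi(theta)
    return total / n_samples

# Toy check: performance peaks at theta = 0 and the learner halves theta,
# so the estimated gain should come out positive.
random.seed(0)
delta = estimate_gain(
    learn=lambda theta, T: theta / 2,
    Phi=lambda theta: -theta ** 2,
    sample_theta=lambda: random.uniform(-1.0, 1.0),
    sample_T=lambda: [],
)
```

In the toy check the true gain is \(\mathbb{E}[\tfrac{3}{4}\theta^2] = 0.25\) for \(\theta\) uniform on \([-1, 1]\), so the estimate hovers near 0.25.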

## Positives

- Learning to identify even and odd numbers.
  - \(D = \mathbb{N}\), or say, \(D = [1,n] \subset \mathbb{N}\) for large \(n\) so that we can set \(p(m) = \frac{1}{n}\) for each \(m \in D\).
  - \(\phi^\pi(\theta,m) = \begin{cases} 1 & \text{\(\exists x \in \mathbb{N}, m = 2x\) and \(\pi_\theta(m) = \) 'even'} \\ 1 & \text{\(\exists x \in \mathbb{N}, m = 2x + 1\) and \(\pi_\theta(m) = \) 'odd'} \\ \epsilon & \text{otherwise} \end{cases}\)
  - Perhaps the agent could be a symbolic regression system?
  - Apparently genetic programming has traditionally been used!
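The parity performance measure transcribes directly into code; the value of \(\epsilon\) and the toy agent below are illustrative choices:

```python
EPSILON = 0.01  # illustrative small score for wrong answers

def phi(agent, m: int) -> float:
    """Score 1 for the correct parity label, epsilon otherwise."""
    label = agent(m)
    if m % 2 == 0 and label == "even":
        return 1.0
    if m % 2 == 1 and label == "odd":
        return 1.0
    return EPSILON

# A perfect agent, and its expected performance under p(m) = 1/n on [1, n]:
perfect = lambda m: "even" if m % 2 == 0 else "odd"
n = 100
Phi = sum((1.0 / n) * phi(perfect, m) for m in range(1, n + 1))
```

A perfect agent attains \(\Phi = 1\); an agent that always answers 'even' attains \(\frac{1}{2}(1 + \epsilon)\).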
- **Supervised learning** is a classic scenario where the experiential domain is taken to consist of (observation, action) pairs where the correct actions for many observations have already been curated into a training dataset.

  - \(D = \mathcal{O} \times \mathcal{A}\)
  - \(\phi^\pi (\theta, (x,y)) = \begin{cases} 1 & \text{\(\pi_\theta(x) = y\)} \\ \epsilon & \text{\(\pi_\theta(x) \neq y\)} \end{cases}\)
  - Or in the case that \(\mathcal{A}\) is a metric space, distance from the correct answer can be measured directly: \(\phi^\pi(\theta, (x,y)) = e^{-|\pi_\theta(x) - y|}\)
  - The agent could be a feedforward neural network.
  - The learning algorithm could be a form of gradient descent.
  - Or a k-nearest neighbors system may fit this performance measure better. k-NN is both the agent and the learning algorithm.
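The remark that k-NN is both the agent and the learning algorithm can be made concrete: "learning" is just retaining \(T\), and the stored pairs are the parameters \(\theta\). A minimal one-dimensional sketch (illustrative, not from the source):

```python
from collections import Counter
from typing import List, Tuple

def knn_predict(T: List[Tuple[float, str]], x: float, k: int = 3) -> str:
    """pi_theta(x), where theta is simply the stored training set T:
    return the majority label among the k nearest training observations."""
    neighbors = sorted(T, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

For instance, with `T = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (1.0, "b"), (1.1, "b")]`, a query near 1.0 returns "b" and a query near 0.0 returns "a".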

## Negatives

- An **associative array** (map, symbol table, or dictionary) is perhaps the best negative example.
  - \(D = \mathcal{O} \times \mathcal{A}\)
  - \(\phi^\pi (\theta, (x,y)) = \begin{cases} 1 & \text{\(\pi_\theta(x) = y\)} \\ \epsilon & \text{\(\pi_\theta(x) \neq y\)} \end{cases}\)
  - Build a *search tree* or *hash table* to store the (key, value) pairs in the training dataset, \(T\). The performance will be perfect on \(T\), and the expected performance on \(n\) experiences not in \(T\) will be \(\epsilon n\), as bad as possible.
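This failure mode is easy to exhibit: a hash-table "learner" scores perfectly on \(T\) and only \(\epsilon\) per unseen experience. A sketch with illustrative names and \(\epsilon\):

```python
EPSILON = 0.01  # illustrative small score for wrong answers

def learn(T):
    """'Learning' is just storing the (key, value) pairs."""
    return dict(T)

def phi(theta: dict, x, y) -> float:
    """Score 1 if the stored table maps x to y, epsilon otherwise."""
    return 1.0 if theta.get(x) == y else EPSILON

T = [(1, "a"), (2, "b")]
theta = learn(T)
train_score = sum(phi(theta, x, y) for x, y in T)      # perfect on T
unseen = [(3, "c"), (4, "d"), (5, "e")]
test_score = sum(phi(theta, x, y) for x, y in unseen)  # epsilon * n
```

The memorizer never transfers anything from \(T\) to new experiences, which is exactly why generalization beyond the training set belongs in the definition of learning.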

### Notes

- Appropriated from *Metalearning* on Scholarpedia by the infamous Dr. Juergen Schmidhuber and Tom Schaul.

- If \(D_T = \mathcal{P}(D)\) is known and finite, then the '*learning problem*' is an '*optimization problem*' and can be solved in principle.

- The expected performance, \(\Phi^\pi(\theta, D) = \mathbb{E}_{s \in D}[\phi^\pi(\theta, s)]\), depends on the distribution of \(D\), which may not be entirely known. However, one then wants to estimate performance over the whole distribution so that learning carries over to new experiences beyond the training set. Without this criterion, one would simply count effective indexing algorithms as "learning"!