Sunday, May 27, 2012

Pigeon Navigation (4): Identifying Landmarks

In this last post on pigeon navigation we'll see how we can identify the most important or "information rich" parts of a pigeon's flight paths, and then equate these to the landmarks the pigeon uses.

Using the idea that a pigeon learns, and then attempts to follow a memorised `habitual route', we saw in the last post that we could use previously recorded flight paths to predict what future flights by the same pigeon, from the same release site would look like. We could assign a probability to any future path, thus deciding whether it was predictable or not after considering the past flights. The fact that paths typically became more predictable over successive flights was evidence that the pigeons were learning routes home and then sticking to them.

But how does a pigeon learn a route home? It is unlikely that it imagines a perfect line on the ground below it, representing some kind of idealised route it wants to follow. Memorising a complete continuous path, which has an infinite number of locations along it, is hard. Instead, the generally accepted hypothesis is that a pigeon learns its route by memorising a small number of landmarks which act as waypoints. This idea, known as `pilotage', supposes that the bird reaches one landmark, then reorients itself to head for the next until it reaches home.

Can we detect where these landmarks are, using the methodology we've developed so far? Of course we can!

Recall that we previously assumed that a flight path always consisted of 100 recorded positions, starting at the release point and ending at the home loft. We predicted future flights by using these 100 points on each flight path to estimate a habitual route that the bird was trying to follow. We predict that future flights will also look like this habitual route, plus some variation that changes from flight to flight.

In principle we can choose to ignore some of this data. We can, if we want, choose to estimate the habitual path using only a subset of the data we have. For example, we might choose 10 random points of the 100 we have of each flight, then try to estimate the habitual route from these.

The first important point to understand for identifying landmarks is that such an approach will have varying degrees of success, depending on which points are selected. Consider the figure below.

In each case the faint black line is the same simple bell curve. The black dots indicate 3 points on this curve that we are "allowed" to know in order to make a guess at what the whole curve looks like. If we draw a smooth line through these three points we get the two red lines. Hopefully it should be clear that the red line on the left is a much better estimate of the bell curve than the very low red line on the right. Therefore, if I wanted to remember 3 points to try and remember the whole of the faint black line, I would be better off choosing those on the left, rather than those on the right.
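To make the bell curve example concrete, here is a minimal sketch in Python. It is not the code behind the figure: the "smooth line" is taken to be a zero-mean GP interpolant with a squared-exponential kernel, and the function names, the length scale and the two sets of three points are all assumptions made for illustration.

```python
import numpy as np

def smooth_estimate(x_known, y_known, x_grid, length=1.5):
    """Smooth curve through the known points (GP posterior mean, zero prior mean)."""
    def kern(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length ** 2))
    # Tiny diagonal term keeps the linear solve numerically stable.
    K = kern(x_known, x_known) + 1e-10 * np.eye(len(x_known))
    weights = np.linalg.solve(K, y_known)
    return kern(x_grid, x_known) @ weights

def bell(x):
    """The 'true' curve we are trying to remember."""
    return np.exp(-0.5 * x ** 2)

x_grid = np.linspace(-4.0, 4.0, 201)
truth = bell(x_grid)

good_pts = np.array([-1.5, 0.0, 1.5])   # assumed: spread across the peak
poor_pts = np.array([-3.5, 3.0, 3.8])   # assumed: all out in the flat tails

def rms_error(points):
    """Root-mean-square gap between the 3-point estimate and the true curve."""
    estimate = smooth_estimate(points, bell(points), x_grid)
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))

err_good = rms_error(good_pts)
err_poor = rms_error(poor_pts)
print(f"RMS error, points near the peak: {err_good:.3f}")
print(f"RMS error, points in the tails:  {err_poor:.3f}")
```

The three points in the tails carry almost no information about the bump, so the estimate built from them stays close to zero everywhere and misses the peak.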

But this is exactly what the pigeon has to do! It needs to remember a few landmarks so it can remember the whole of its route home. This suggests the second important point for identifying landmarks: the points that allow best estimation of the habitual route are the same points as the pigeon's landmarks. That means that we assume the pigeon does a good job of choosing the most efficient way to compress its habitual route into a few key points.

Since we can never measure exactly how well we have estimated the habitual route, we do the next best thing and test how well any set of possible landmarks allows us to predict future flights. If we call the subset of times that correspond to landmark locations t_lm, and the full set of times t_full, then our task is to choose t_lm to maximise

p(x_{n+1}(t_full) | x_1(t_lm), x_2(t_lm), ..., x_n(t_lm))
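A rough sketch of how such a search could be set up is below. It assumes the model from the habitual routes post (each flight is the habitual route h plus independent variation, both with the same covariance S), picks landmark times greedily one at a time, and scores each candidate set against a single held-out flight. The post does not say how the maximisation was actually done, so the greedy search, the toy data and every function name here are illustrative assumptions.

```python
import numpy as np

def cov_matrix(T, k0, L, jitter=1e-6):
    """S_ij = k0 exp(-(i-j)^2 / L), plus a small jitter for stability."""
    steps = np.arange(T)
    diff = steps[:, None] - steps[None, :]
    return k0 * np.exp(-diff ** 2 / L) + jitter * np.eye(T)

def gauss_logpdf(x, mean, cov):
    """Log density of a multivariate Normal, via a Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, x - mean)
    return float(-0.5 * z @ z - np.log(np.diag(chol)).sum()
                 - 0.5 * len(x) * np.log(2.0 * np.pi))

def predictive_logp(x_next, past, t_lm, m, S):
    """log p(x_next(t_full) | past flights seen only at the times t_lm)."""
    n = past.shape[0]
    if len(t_lm) == 0:
        return gauss_logpdf(x_next, m, 2.0 * S)   # no data: h and variation both unknown
    t_lm = np.asarray(t_lm)
    # With equal covariances, the average of the past flights at the
    # landmark times carries all the information about h.
    mean_at_lm = past[:, t_lm].mean(axis=0)
    S_fl = S[:, t_lm]
    A = (1.0 + 1.0 / n) * S[np.ix_(t_lm, t_lm)]
    gain = np.linalg.solve(A, S_fl.T).T           # S_fl A^{-1}
    h_mean = m + gain @ (mean_at_lm - m[t_lm])
    h_cov = S - gain @ S_fl.T
    return gauss_logpdf(x_next, h_mean, h_cov + S)

def greedy_landmarks(past, x_next, m, S, n_landmarks):
    """Add, one at a time, the time index that most improves the prediction."""
    chosen = []
    for _ in range(n_landmarks):
        candidates = [t for t in range(len(m)) if t not in chosen]
        scores = [predictive_logp(x_next, past, chosen + [t], m, S)
                  for t in candidates]
        chosen.append(candidates[int(np.argmax(scores))])
    return sorted(chosen)

# Toy data (assumed): 5 past flights and 1 held-out flight around a random h.
rng = np.random.default_rng(0)
T = 40
m = np.linspace(0.0, 1.0, T)
S = cov_matrix(T, k0=0.05, L=50.0)
h = rng.multivariate_normal(m, S)
paths = h + rng.multivariate_normal(np.zeros(T), S, size=6)
past, x_next = paths[:5], paths[5]
landmarks = greedy_landmarks(past, x_next, m, S, n_landmarks=4)
print("landmark time indices:", landmarks)
```

A greedy search is only one way to explore the subsets, and it is not guaranteed to find the best one. Scoring against several held-out flights rather than one would also make the choice less dependent on a single path.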

And when we do this, we find landmarks that correspond to recognisable features of both the paths and the landscape beneath, such as below (remember I promised to explain what those red dots were...?)

The landmarks (the red dots) tend to fall in three kinds of place: where the paths are very similar, since here the paths are a very good predictor of the habitual route; where the pigeon flies somewhere unexpected, such as the apex of the `C' shape; and where the path curves sharply. They also tend to be on the edges of forests and villages, above major roads and at obvious features such as a church spire.

[NB: Those with a machine learning background may see that this process is largely analogous to two other ideas: active sampling, where we take data in an intelligent way to maximise our predictive power while minimising collection costs, and reduced rank Gaussian process approximations, where we use a subset of data points as `inducing points' to create a lower rank covariance matrix and speed up calculations.]

Monday, May 14, 2012

Pigeon Navigation (3): Habitual Routes

[NB: This post, and the rest of the pigeon posts will be quite mathsy. I've done my best to keep the maths as simple as possible - it should be possible to follow the argument without understanding all the working! On the other hand, if you do want to see the maths done properly, please read it properly formatted!]

Way back when this century was young, the navigation group in Oxford published a series of papers demonstrating that pigeons, when repeatedly released from the same site, would learn to follow the same route back to the home loft each time.

If a pigeon is learning and following a route this ought to make its flight patterns predictable. If those flights are getting more and more predictable we should be able to observe that by using a model to predict the flights with increasing accuracy. In other words, we should have a model which gives the probability of a flight path, and that probability should get higher as our predictions get better.

In the last post we saw how to assign probabilities to individual flight paths using a Gaussian process (GP). The precise probability of a given flight path depended on the mean, m, and covariance, S, of that GP. I told you that the covariance dictated how likely the flight path was to be either smooth or wiggly, and we used the straight line between release point and home loft to create the mean. For convenience I'll write down the resulting probability as:

p(x| m, S) = GP(x; m, S)

Now, the reason we chose the straight line path to be the mean was that if we only look at a single path, and we have never seen this particular bird fly before, there is no reason to assume it will fly either one side or the other from this most efficient route. We don't expect the flight path to be perfectly straight, but we don't know beforehand in which direction it will go.

Imagine instead that we had already seen the flight paths below.

Now we should have a very good idea where the next path is going to be: somewhere close to the paths we have already seen. It looks like the pigeon is following a particular route home every time, so it's unlikely to suddenly fly directly south from the release point next time. Obviously it doesn't fly exactly the same path every time, but each new flight path is like an imperfect attempt to fly some memorised route.

Let's imagine that we could look into the mind of the pigeon and retrieve exactly what its memorised route looks like. We can call this route h (for 'habitual'). Then we might replace the earlier straight line mean path with the one we now know the bird is trying to fly:

p(x | h, S) = GP(x; h, S)

(I'm going to assume for simplicity that we know what S is, but in practice we would infer it from the data.)

What's more, if we want to find the probability of several flight paths by the same bird, each an attempt to replicate h, we can simply multiply the probability of each path together, because each one is independent if we know h.

p(x1, x2, ...,xn | h, S) = GP(x1; h, S) x GP(x2; h, S) x ... x GP(xn; h, S)

Hang on! Surely those flight paths aren't really independent?! After all, they all look the same. Yes! But the reason they look the same is that they are all attempts to replicate h. The way each path varies around h is independent. All the shared structure in the paths is located in h.

Ok, that's nice, but the problem is that we don't know what h is. All we can see are a few paths that look a bit like h. But never fear - Bayes is here...we can use those flight paths we have actually seen to infer what h is. Recall Bayes' rule, which allows us to reverse the order of the conditional probability:

p(h | x1, x2, ...,xn, S) = p(x1, x2, ...,xn | h, S) x p(h | S) / p(x1, x2, ...,xn | S)

But we seem to be creating more trouble for ourselves. Now we need to know two more things, p(h | S) and p(x1, x2, ...,xn | S). Are we digging a hole for ourselves?

No! The first of these terms is a prior distribution. It's how likely we think any particular habitual route would be before we see any real paths. So we need to place a probability distribution over a path that could lie anywhere between the release point and the home loft. That's exactly what we learned how to do in the last post! Before we see any real paths there's no reason to expect the habitual path to be on either side of the straight line, so the probability of h is exactly like a single path on its own, with the straight line as a mean.

p(h | S) = GP(h; m, S)

The second term is the joint probability of the real paths, if we don't know what h is. This can be calculated by integrating over all possible values of h.

∫ p(x1, x2, ...,xn | h, S) x p(h | S) dh

and this is where the theory of Gaussian processes really helps us. Integrals like this are really easy to do (using a few matrix rules...easy is a relative term!) when everything is Gaussian...

∫ p(x1, x2, ...,xn | h, S) p(h | S) dh = ∫ GP(x1; h, S) GP(x2; h, S) ... GP(xn; h, S) GP(h; m, S) dh

= GP([x1, x2, ...,xn]; [m,m,...,m], Σ)

where those square brackets indicate that we're concatenating the n paths and n copies of the vector m. We have a big new covariance matrix, Σ, which is generated from S. If you want the mathematical details of how we do that I would suggest reading them in this paper (Open access), where it's all properly formatted without the restrictions of html. Here we'll just assume we know the matrix rules for multiplying Gaussian distributions together - check out Appendix A of my thesis if you're interested.
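For the model exactly as written in this post (h has covariance S around m, and each path has covariance S around h), the block structure of Σ can be written down directly: two different paths share only h, so their cross-covariance is S, while a path's covariance with itself is S + S. The sketch below builds that matrix. The function name and the toy S are assumptions, and the paper's own construction may be more general.

```python
import numpy as np

def joint_covariance(S, n):
    """Covariance of n concatenated paths that share one habitual route."""
    shared = np.kron(np.ones((n, n)), S)   # every pair of paths shares h
    own = np.kron(np.eye(n), S)            # each path's own variation about h
    return shared + own

# Toy example (assumed sizes): 3 paths of 5 points each.
T, n = 5, 3
steps = np.arange(T)
S = np.exp(-(steps[:, None] - steps[None, :]) ** 2 / 4.0)
Sigma = joint_covariance(S, n)
print(Sigma.shape)   # (15, 15)
```

The diagonal blocks of Sigma are 2S and every off-diagonal block is S, which is what makes paths from the same bird look alike without being identical.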

The upshot of all this is that we can calculate a probability distribution, p(h | x1, x2, ...,xn, S), which tells us how likely any given habitual route h is, based on the flight paths we've already seen. Does it work? Well, look at the picture below, showing a set of flight paths from two birds, and the distribution (mean + variance) of the inferred habitual routes. The faint black lines are the flight paths, recorded from GPS. The thick black lines are the 'best guess' of the habitual routes, and the dashed red lines indicate how uncertain these are. The dashed black lines indicate where most future flight paths are expected to lie.

If we can infer what the habitual route is, we should then be able to do exactly what I suggested at the top of this post, and make some predictions about where future flight paths will be, and see if these become more accurate as the birds learn their routes. In fact, we have already done everything we need. We calculated the joint probability of n paths, assuming that we didn't know the habitual route.

p(x1, x2, ...,xn | S) = GP([x1, x2, ...,xn]; [m,m,...,m], Σ)

If we want to calculate how probable path xn is, based on the previous n-1 paths, we simply calculate the joint probability of x1, x2, ...,xn and of x1, x2, ...,xn-1, and divide one by the other:

p(xn | x1, x2, ...,xn-1, S) = p(x1, x2, ...,xn | S) / p(x1, x2, ...,xn-1 | S)
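In code it is easier to work with logs, so the ratio of joint probabilities becomes a difference of log joint probabilities. The sketch below assumes the block form of Σ implied by this post's model (2S on the diagonal blocks, S elsewhere). The toy paths, the jitter term and the function names are illustrative assumptions, not the analysis code used on the real data.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log density of a multivariate Normal, via a Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, x - mean)
    return float(-0.5 * z @ z - np.log(np.diag(chol)).sum()
                 - 0.5 * len(x) * np.log(2.0 * np.pi))

def log_joint(paths, m, S):
    """log p(x1, ..., xn | S) for paths stacked as rows of an (n, T) array."""
    n = paths.shape[0]
    if n == 0:
        return 0.0                                  # empty product
    Sigma = np.kron(np.ones((n, n)), S) + np.kron(np.eye(n), S)
    return gauss_logpdf(paths.ravel(), np.tile(m, n), Sigma)

def log_predictive(paths, m, S):
    """log p(xn | x1, ..., xn-1, S), with the rows of paths in flight order."""
    return log_joint(paths, m, S) - log_joint(paths[:-1], m, S)

# Toy flights (assumed): 5 noisy attempts at one randomly drawn route h.
rng = np.random.default_rng(1)
T = 20
idx = np.arange(T)
S = np.exp(-(idx[:, None] - idx[None, :]) ** 2 / 20.0) + 1e-4 * np.eye(T)
m = np.linspace(0.0, 5.0, T)
h = rng.multivariate_normal(m, S)
paths = h + rng.multivariate_normal(np.zeros(T), S, size=5)

for k in range(1, len(paths) + 1):
    print(k, round(log_predictive(paths[:k], m, S), 2))
```

Passing only the most recent few rows of paths gives the "based on the previous flights" version of the same calculation.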

So let's test it out. In the experiments done in Oxford the typical procedure was to release the same bird 20 times from the same spot. What happens if we calculate how likely each of these flight paths is, based on the 2 flights immediately before?

That graph shows the (log) probability of the next path becoming higher over time - the pigeons are becoming more predictable, just as we hoped! The point where the y-axis equals zero is where the paths become more predictable than if we just guessed wildly without seeing any previous flights. Therefore we can say that after ~10 flights the birds are more predictable than random - they have learnt their routes.

This demonstration of increasing predictability is a nice alternative way of seeing the route learning that was previously shown by measuring the average distance between successive paths, but it's not immediately clear why it should be any more useful. In the next post we'll see how we can see not only that the route is being learnt, but where it is being learnt, to identify where the landmarks the pigeons use to navigate are and what they might be.

Saturday, May 5, 2012

Pigeon Navigation (2): GPs and GPS

In the last post I introduced the idea of using Gaussian processes (GPs) as a tool for modeling homing pigeon flight paths. In this post I'll give a few more details of exactly what this entails.

For our purposes a pigeon flight path consists of a number of recorded 'x' and 'y' co-ordinates from a Global Positioning System (GPS - don't confuse GPs and GPS!) recorder, each with a time stamp 't'.

For the sake of simplicity, let's imagine that any such path begins at time t=1, and ends at time t=100, with 100 recorded points equally spaced in time between (this isn't strictly true, but it won't make any real difference in understanding this). How can we assign a probability to this path?

What we do is claim that the 100 recorded 'x' co-ordinates are a sample from a 100-dimensional multivariate Normal distribution, N, with some mean vector, m and covariance matrix S.

p([x1, x2, ..., x100]) =  N([x1, x2, ..., x100]; m, S)

(NB: the 'y' co-ordinates will have their own distribution, but we can get away with just considering the 'x's for now, we'll worry about the 'y's a bit later )

Now, a 100-dimensional distribution sounds a lot scarier than it actually is. All this is telling us is that these 100 recorded locations are connected, e.g. x6 is likely to be very close to x5, since the pigeon does not have time to move very far between t=5 and t=6. Conversely, the connection between x5 and x90 will be much weaker, since the bird is free to move a large distance during that time. The Normal distribution provides a convenient tool for assigning probabilities to large numbers of correlated variables, and it's mathematically easy to deal with (as we'll see as we go further).

So what are m and S? The mean vector, m, is quite simple. It is where we "expect" the bird to be at a given time. Since we know where the bird starts and finishes, we can expect that x1 will be at the release point and x100 will be at the home loft. Without any other information it is reasonable to assume that the other 98 points should be spaced equally along the straight line between the release point and home. Of course, they almost certainly won't actually be exactly on this line, but there is no reason for us to believe the bird will show a preference to fly one way or another before we see any data. In the picture below the thick black line indicates the locations of m.

The covariance matrix, S, specifies two things. Firstly, the diagonal entries, such as Sii, specify how much the values of xi are likely to differ from the expected values of the mean, mi. The other entries, Sij, indicate how strongly connected the values of xi and xj are. High values of Sij mean that xi and xj will be strongly correlated. If Sij is zero then there is no correlation between xi and xj.

We don't want to have to specify a correlation between every pair of points individually. Instead we construct the matrix S using a covariance function k(i, j), which depends on the difference between i and j, e.g.

Sij = k(i, j) = k0 exp(-(i-j)^2/L)

With this function the correlation between xi and xj gets weaker as the difference |i-j| gets larger. The parameter L determines how quickly this happens. If L is large then correlations will persist over longer separations between points. If L is very small then correlations will almost disappear after just a few time steps. If the correlations between points persist for long periods of time then the path will be very smooth, since any points close to each other in time must also be close in space. Equally, if L is small then the path can be much more 'wiggly' and the bird can change its position quickly. k0 tells us how uncertain the path is. If k0 were to be zero then all of the entries of S would be zero and the path would be forced to lie along the mean - there would be no uncertainty. Large values of k0 mean that any path can be quite far from the straight line. The plot below shows k(i, j) as a function of dt = |i-j|, using different values of L (the Input Scale), with k0 set to 1.

By applying the function k(i, j) to every pair of points we can construct the full matrix S, which will typically look like the example below:

The values of S peak along the main diagonal and decay as you move away from this. The width of the central red band shows how strongly correlations persist over time. Here points are correlated when they are within about 20-30 time steps of each other. 
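As a small sketch, the matrix S can be built by applying k(i, j) to every pair of time steps at once. The function name and the values k0 = 1 and L = 200 are assumptions chosen for illustration, not the values behind the plot.

```python
import numpy as np

def covariance_matrix(n_points, k0, L):
    """S_ij = k0 * exp(-(i - j)^2 / L) for every pair of time steps."""
    idx = np.arange(n_points)
    diff = idx[:, None] - idx[None, :]      # table of all differences i - j
    return k0 * np.exp(-diff ** 2 / L)

S = covariance_matrix(100, k0=1.0, L=200.0)
print(S.shape)                              # (100, 100)
print(S[4, 5], S[4, 89])                    # neighbours vs. distant points
```

With these assumed values, neighbouring points are almost perfectly correlated and the correlation has mostly died away by a separation of 25 to 30 time steps.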

So, we can get the probability of any path of 100 points, given only a mean and a covariance matrix. The mean, as we saw, is specified simply by knowing where the bird starts and finishes. The covariance matrix is specified by only 2 parameters, k0 and L. So, the probability of the x co-ordinates depends only on these two parameters (as well as knowing the start and finish, which we'll assume are always known).

p(x | k0, L) = N(x; m, S(k0, L) )

We can take this further and either find the optimal values of k0 and L, or even better, sum over our uncertainty by using an appropriate prior distribution that expresses how likely we think different values of these parameters are (see the post on Bayesianism for more details). This gives us a probability for the path, independent of any particular choice of parameters.

p(x) = ∫ ∫ p(x | k0, L) p(k0) p(L) dk0 dL = ∫ ∫ N(x; m, S(k0, L)) p(k0) p(L) dk0 dL
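One crude way to approximate this double integral is to replace it with an average over a grid of parameter values. The sketch below gives equal prior weight to every (k0, L) pair on the grid, which is an assumption made purely for illustration (the post does not say which priors were used), as are the grid values, the toy path and the function names.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Normal, via a Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, x - mean)
    return float(-0.5 * z @ z - np.log(np.diag(chol)).sum()
                 - 0.5 * len(x) * np.log(2.0 * np.pi))

def log_prob_given(x, m, k0, L, jitter=1e-6):
    """log p(x | k0, L) = log N(x; m, S(k0, L)), with jitter for stability."""
    idx = np.arange(len(x))
    S = k0 * np.exp(-(idx[:, None] - idx[None, :]) ** 2 / L)
    return log_gauss(x, m, S + jitter * np.eye(len(x)))

def log_marginal(x, m, k0_grid, L_grid):
    """log of the average of p(x | k0, L) over all grid pairs."""
    logs = np.array([[log_prob_given(x, m, k0, L) for L in L_grid]
                     for k0 in k0_grid])
    peak = logs.max()                        # subtract the peak to avoid underflow
    return float(peak + np.log(np.mean(np.exp(logs - peak))))

# Toy path (assumed): a gentle bulge away from the straight line.
T = 50
m = np.linspace(0.0, 10.0, T)
x = m + 0.5 * np.sin(np.linspace(0.0, np.pi, T))
k0_grid = np.array([0.1, 0.3, 1.0, 3.0])
L_grid = np.array([50.0, 100.0, 200.0, 400.0])
log_p = log_marginal(x, m, k0_grid, L_grid)
print(log_p)
```

A finer grid, or priors that are not flat over the grid values, would change the number but not the idea.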

Now, remember those y co-ordinates we removed? We can apply exactly the same analysis as we've done here for the x co-ordinates, but for the y co-ordinates instead, with their own mean (derived again from the straight line path) and covariance (the bird may vary more along x or y axes). Not knowing anything in advance about how the bird's path will vary around the straight line we can treat the x and y co-ordinates as independent (once the mean path is accounted for). Therefore we can get the probability of the whole path simply by multiplying the two probabilities for both sets of co-ordinates.

p(path) = p(x)p(y)

So that's how we go about assigning a probability to a path. This probability will reflect our instincts about how 'likely' a path is: paths that lie close to the straight line will be more likely than ones that go off in some bizarre direction, and paths that are excessively 'wiggly' will have a low probability. Nice smooth flight paths in the vague vicinity of the straight line are what we expect a flying animal that cares about energy efficiency to produce.
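Here is a minimal sketch of the whole recipe: a straight-line mean, a covariance matrix built from k0 and L, and log p(path) = log p(x) + log p(y). For simplicity it uses the same S for both co-ordinates and fixed values of k0 and L. Those choices, the two toy paths and the function names are all assumptions.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Normal, via a Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, x - mean)
    return float(-0.5 * z @ z - np.log(np.diag(chol)).sum()
                 - 0.5 * len(x) * np.log(2.0 * np.pi))

def log_path_prob(xy, start, end, k0=1.0, L=200.0, jitter=1e-6):
    """log p(path) = log p(x) + log p(y) about the straight line start -> end."""
    T = xy.shape[0]
    idx = np.arange(T)
    S = k0 * np.exp(-(idx[:, None] - idx[None, :]) ** 2 / L) + jitter * np.eye(T)
    line = np.linspace(start, end, T)        # (T, 2) points along the straight line
    return (log_gauss(xy[:, 0], line[:, 0], S)
            + log_gauss(xy[:, 1], line[:, 1], S))

# Two toy paths (assumed) with the same maximum distance from the line.
T = 100
start, end = np.array([0.0, 0.0]), np.array([10.0, 0.0])
s = np.linspace(0.0, 1.0, T)
line = np.linspace(start, end, T)
smooth = line + np.column_stack([np.zeros(T), np.sin(np.pi * s)])
wiggly = line + np.column_stack([np.zeros(T), np.sin(40.0 * np.pi * s)])

lp_smooth = log_path_prob(smooth, start, end)
lp_wiggly = log_path_prob(wiggly, start, end)
print(lp_smooth, lp_wiggly)
```

Both paths stray equally far from the straight line, but the rapidly oscillating one is far less probable under this covariance, which matches the intuition in the paragraph above.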

This might all seem a little dry and you may be wondering exactly what we gain by doing this. For now, I'm going to have to ask you to trust me. In the next few posts we'll see how the simple act of matching paths to probabilities gives us some exciting analytical power.

Saturday, April 21, 2012

Pigeon Navigation (1): Paths and Probability

How do birds navigate successfully over huge distances from temperate to tropical regions and back every year? How do homing pigeons know how to get back to their owner's loft quickly enough to win a race? Is there some way to control the number of pigeons in Trafalgar Square [or insert your country's pigeon hotspot]?

All good questions. None of which really interest me.

How can we mash up the science of pigeon navigation and a bit of probability theory and come up with something fun and faintly ridiculous? Now you're talking...

For a bit over 10 years now researchers have been attaching GPS devices to the backs of domestic homing pigeons (Columba livia to our classicist friends) before releasing them in more or less odd places. If and when these pigeons make it home, the devices can be removed and we can see exactly where the pigeon has been in the interim (typically at a resolution of a couple of metres, once every second).

This is what a pigeon looks like. That's a GPS tracker on its back.

A few pigeon paths recorded in the Oxford area. Those red dots sure look exciting don't they? We'll be getting to them eventually...

With such data, our intrepid scientists have shown that pigeons probably use landmarks, learn routes home, seem to follow roads and often co-operate in getting home. Sadly, while these findings have revolutionised a popular field of study, been hugely cited and generally proved more than averagely seminal, they didn't include very much probability theory, so I'm going to go ahead and pretty much ignore them from here on.

But where there's data, there's a chance to get some machine learning going. So let's get to it...

Paths and Probability

There are many things we might want to learn from the recorded data from the GPS devices. In my research I try to frame learning as a test of various hypotheses using data to adjudicate between them. For example, if we want to learn whether pigeons genuinely follow idiosyncratic routes (which we will) we need to know if the data is more or less likely given this hypothesis than the alternative. If we want to know if the pigeon uses landmarks, we need to find a way to say if the GPS data is more or less likely based on some hypothetical set of landmarks the bird might be using. We need to use probability theory as a link between our data and our theories.

The many recorded locations that a pigeon visits constitute elements of a path that the pigeon actually flies. As with anything probabilistic, we need to start off by finding a way to ask how likely the data (the recorded positions) are. How probable is it that the pigeon flew this path, rather than some alternative route? How can we place probabilities on observations of flight paths?

Well, let's try and get there one step at a time. First I'll just try to give you some idea of the approach we're going to take. In subsequent posts I'll flesh this out with some actual maths.

If I asked you to place a probability on where the middle of the path (say, the 50th of 100 locations) would be, how would you do it? A reasonable guess would be that on average it would be half way between the release point and the loft. But as the picture above shows, it's likely to vary around that point quite a bit. Wherever you think it's going to be, you can specify this as a probability distribution, a Gaussian (Normal) distribution, centred on where you think it will be and with a standard deviation that represents your uncertainty.

Now imagine I ask you to put a similar probability on the locations 1/3rd and 2/3rds of the way along the path. We could just as easily make a guess and place Gaussian distributions at both of the points to represent where we think the bird will be. Likely these will be directly 1/3rd and 2/3rds of the way between release and loft. But look at that picture above. If the pigeon starts out to the left of the straight line, it's likely to stay out to the left later. So our two locations are going to be correlated: if one is left of centre, the other is likely to be too. They have a joint probability distribution.

The pictures below give some indication how this joint distribution works. We have two correlated variables. Initially we are quite uncertain about both (A). Then we measure one, reducing its uncertainty to zero (B). In addition, the uncertainty in the second variable is reduced, and the expected value moves closer to the first measured value.

(A) Two correlated, unmeasured variables

(B) Variable 1 is measured, variable two is less uncertain
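The step from (A) to (B) is the standard rule for conditioning one Gaussian variable on another. A small sketch, with made-up numbers (unit variances, covariance 0.8, and a measured value of 1.5 for variable 1) and a hypothetical function name:

```python
def condition_on_first(mean1, mean2, var1, var2, cov12, x1_measured):
    """Mean and variance of variable 2 once variable 1 is measured exactly."""
    gain = cov12 / var1
    post_mean = mean2 + gain * (x1_measured - mean1)   # pulled towards the measurement
    post_var = var2 - gain * cov12                     # never larger than var2
    return post_mean, post_var

# Assumed numbers: both variables start with mean 0 and variance 1.
post_mean, post_var = condition_on_first(
    mean1=0.0, mean2=0.0, var1=1.0, var2=1.0, cov12=0.8, x1_measured=1.5)
print(post_mean, post_var)
```

With these numbers the expected value of variable 2 moves from 0 to about 1.2 and its variance drops from 1 to about 0.36. If the covariance were zero, measuring variable 1 would tell us nothing about variable 2.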

Now, we can extend this to lots of different locations along the path. It is reasonable to imagine that locations will be more correlated the closer they lie along the path. Let's assume we can state a function which we call the covariance function, k(t1, t2), which states how strongly two values (t1, x1) and (t2, x2) should be correlated, and that this gets weaker as the separation of t1 and t2, dt = |t1-t2|, becomes bigger, such as the functions in the figure below.

Correlations get weaker as the difference in t values increases. How fast the correlations decrease depends on the covariance function, k(dt).

Making that assumption, and looking at 10 points, all jointly distributed, we might get figures like those below.

(A) 10 unmeasured variables, correlated according to separation

(B) Measure some variables, others become less uncertain in response.

Going one step further, we might take the number of points we are interested in to infinity, for a continuous path, and then measure just a few of those points

A continuous range of variables, measured in 3 places
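The same conditioning rule, applied to a fine grid of points standing in for the continuous path, gives pictures like this one. The sketch below works with deviations from the mean path (so the prior mean is zero) and uses a squared-exponential covariance function. The three measurement times and values, the parameter values and the function name are all assumptions.

```python
import numpy as np

def gp_posterior(t_grid, t_obs, x_obs, k0=1.0, L=0.05):
    """Posterior mean and variance on t_grid after measuring x_obs at t_obs."""
    def k(a, b):
        return k0 * np.exp(-(a[:, None] - b[None, :]) ** 2 / L)
    K_oo = k(t_obs, t_obs) + 1e-9 * np.eye(len(t_obs))   # measured vs measured
    K_go = k(t_grid, t_obs)                              # grid vs measured
    gain = np.linalg.solve(K_oo, K_go.T).T               # K_go K_oo^{-1}
    mean = gain @ x_obs
    var = k0 - np.sum(gain * K_go, axis=1)               # diagonal of the posterior covariance
    return mean, var

t_grid = np.linspace(0.0, 1.0, 101)
t_obs = np.array([0.2, 0.5, 0.8])      # assumed measurement times
x_obs = np.array([0.5, -0.3, 0.4])     # assumed measured deviations
mean, var = gp_posterior(t_grid, t_obs, x_obs)
print(var[20], var[35], var[0])        # at a measurement, between two, at the start
```

The variance collapses to essentially zero at the three measured times and grows again in between and towards the ends of the path.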

What we're getting to, through this exercise, is the concept of a Gaussian process, which is a probability distribution over continuous paths or functions. Much like the Gaussian distribution gives a probability of seeing any number, or set of numbers, a Gaussian process (GP) gives the probability of seeing any path, or any set of points measured on that path. The standard Gaussian distribution can describe any finite number of jointly distributed variables; the GP is simply a Gaussian distribution with an infinite number of variables, representing every possible point on the path.

Gaussian: P(x) = N(x; mean, variance)

Gaussian process: P(path) = GP(path; mean path, covariance function)

The most important property of a GP is that any finite subset of points on the path (such as the recorded positions from the GPS device - don't confuse GPs and GPS!) follows a multivariate Gaussian distribution,

P(recorded positions) = N(recorded positions; mean positions, covariance matrix)

We'll discuss more about exactly what the covariance matrix and mean positions represent in the next post.

Great! We're on our way. If we can assign probabilities to paths in a consistent manner we can ask if observed paths are more or less likely based on different hypotheses, which allows us to use data to select between those hypotheses. In the next post I'll give a rundown of the properties of GPs and how they work.

[In a switch of textbook, for these pigeon navigation posts I'll be advising you to look at the definitive guide to GPs, Gaussian Processes for Machine Learning, by Rasmussen and Williams, and what I have to assume is the definitive work on using GPs to analyse pigeon flight paths, Prediction of Homing Pigeon Flight Paths using Gaussian Processes, by one R. P. Mann]

Sunday, April 15, 2012

Why Model Selection?: Bayes and Biased Coins

A key tool in the work I do is Bayesian model selection. These days 'Bayesian' is a buzzword that is often used colloquially as a synonym for all relatively sophisticated inference methods, and frequently deployed in papers to great effect, making them seem more exciting. However, model selection, and Bayesian model selection in particular, do have concrete meanings beyond the hype, and I think it is important to see, in a few simple examples, why model selection is often (and, I believe, usually) a better alternative to standard significance tests when trying to test scientific hypotheses.

While I refuse to be dogmatic in my use of inference methods, I do subscribe to what is known as the 'Bayesian interpretation' of probability theory, which views probabilities as degrees of belief in things, much as the odds at a horse race represent how much the bookmaker believes a given horse is going to win. Again, I hope it is possible to show, with a simple example, that this interpretation has important and useful implications for real inference.

A standard tool in teaching probability is the biased coin, a coin which is more likely to land on one of either heads or tails. Let's look at a couple of examples of biased coin problems to illustrate why model selection can be more sensible than significance testing.

Disclaimer: Examples like this are common in textbooks, including my favourite. I make no claim to originality. Also please let me know if you find any errors in the maths, there's nothing worse than having your argument die by virtue of a missing factor 2 somewhere! (which did eventually happen!)

The Setup

Imagine I offer you two coins, telling you that one is biased towards tails, the other is fair. The biased coin gives tails on average 4 times in 5, the fair coin gives tails half the time. You choose one and I ask you to try and guess if it is fair or not. I tell you that you can toss the coin a maximum of 50 times. You toss the coin the allotted 50 times and you get 31 tails, 19 heads. Do you believe the coin is biased?

The Significance Test

The classic way to answer this question is a significance test. We ask 'are the data [the coin tosses] significantly different from what we expect by chance?' To do this we construct two hypotheses:

The test hypothesis (H1): The coin is biased.

The null hypothesis (H0): The coin is fair.

We then look to see whether our data are sufficiently unlikely under the null hypothesis that we might reject it. We do not consider the test hypothesis any further.

The question we ask, mathematically, is 'how likely was I to see 31 or more tails from 50 tosses, if the coin was fair?' The answer is:

P(31 or more tails | fair coin) = ∑_{i=31}^{50} binopdf(i, 50, 0.5) = 0.0325

Where binopdf is the binomial probability distribution. Traditionally we reject the null hypothesis if the probability is less than 0.05, so here we can claim a significant result. We would then say we believe the coin to be biased, and roughly speaking we would assume we had about a 3-4% chance of being wrong.


Woah there! Do you have any objections to what we just did? I hope so....

  • Why do we care about 31 or more tails? We didn't get more than (or less than) 31. Why are we basing our conclusions on things that never happened?
  • What the hell happened to H1? Surely we could/should have tested that hypothesis too. Nope, in significance tests we always choose the most 'random' possible hypothesis, label it the null and ignore the rest.
  • The setup was specified in some detail. There were exactly 2 coins, only one was biased, and the bias was known. You were asked to choose one at random. How come none of these details appear in our analysis? Are they really unimportant?

Model Selection

Let's deal with the first two of those objections first. Why do we care about 31 or more tails? One can make arguments in terms of deciding 'what would convince me?' before doing the test, but the simplest answer is: because this test is ridiculous. Our conclusions should only ever be based on what actually happened and how likely that was, a position known as the likelihood principle.

Ok, so what about H1? Surely we could go back, label our hypotheses the other way round and test the biased hypothesis instead? Exactly! It's a quirk of the kind of effects scientists test and how they test them that we have become used to 'rejecting the null hypothesis'. Let's see how easy it is to test H1 directly against H0.

We'll calculate just the probability of what actually happened, i.e. 31 tails and 19 heads. First, if the coin is fair:

P(31 tails, 19 heads | fair coin) =  binopdf(31, 50, 0.5) = 0.0270

and if the coin is biased (remember, we know the bias):

P(31 tails, 19 heads | biased coin) =  binopdf(31, 50, 0.8) = 0.0016

And we can now simply state that the data is 17 (yes, 17!) times more likely if the coin is fair than if it is biased.

P(31 tails, 19 heads | fair coin)/P(31 tails, 19 heads | biased coin) = 17
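The comparison above can be reproduced in a few lines. This is only a sketch: to avoid needing the Statistics Toolbox it defines its own helper, binom_pmf (a name invented here), which should agree with binopdf; the counts and the 0.8 bias are the ones quoted in the post.

```matlab
% Binomial probability of k tails in n tosses when P(tails) = p.
% binom_pmf is a hypothetical helper written with gammaln so that no
% toolbox is needed; binopdf(k, n, p) should return the same values.
binom_pmf = @(k, n, p) exp(gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1) ...
    + k .* log(p) + (n - k) .* log(1 - p));

n_tosses = 50;   % number of tosses
n_tails  = 31;   % number of tails actually observed

like_fair   = binom_pmf(n_tails, n_tosses, 0.5);  % likelihood if the coin is fair
like_biased = binom_pmf(n_tails, n_tosses, 0.8);  % likelihood if P(tails) = 0.8

likelihood_ratio = like_fair / like_biased;

fprintf('P(data | fair)   = %.4f\n', like_fair);
fprintf('P(data | biased) = %.4f\n', like_biased);
fprintf('likelihood ratio = %.1f\n', likelihood_ratio);
```

Only the observed count enters the calculation; no tail of unobserved outcomes is needed.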

While the significance test suggested the coin was biased, our direct model comparison shows this to be an absurd conclusion. Of course, I have chosen these numbers to illustrate how badly the significance test can go wrong, and it will sometimes give the right answer. But why jump through the perverse hoops required for the significance test when the direct comparison is both more accurate and simpler?

If we don't happen to know the exact bias of the biased coin in advance, we can still do a comparison by integrating over the possible biases the coin could have, from 0.5 to 1, with a uniform prior distribution on p of pr(p) = 2 for 0.5 < p < 1 (thanks to a commenter for the correction):

P(31 tails, 19 heads | biased coin) = ∫_{p=0.5}^{1} binopdf(31, 50, p) pr(p) dp = 0.0372

P(31 tails, 19 heads | fair coin)/P(31 tails, 19 heads | biased coin) = 0.7

So now the biased coin is somewhat more likely because the bias could be nearer the 0.6 necessary to give these results (though the probability of a fair coin is still ~ 40%, not the 3-4% we thought earlier).
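A rough numerical check of this integral, assuming MATLAB's built-in integral function and the same hypothetical binom_pmf helper in place of binopdf. The exact digits depend on the integration method, so expect agreement with the figures above only to about two significant figures.

```matlab
% Hypothetical stand-in for binopdf that needs no toolbox.
binom_pmf = @(k, n, p) exp(gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1) ...
    + k .* log(p) + (n - k) .* log(1 - p));

prior_density = 2;   % uniform prior on 0.5 < p < 1, so that it integrates to 1

% Average the likelihood of 31 tails in 50 tosses over all allowed biases.
integrand = @(p) binom_pmf(31, 50, p) .* prior_density;
like_biased_unknown = integral(integrand, 0.5, 1);

like_fair = binom_pmf(31, 50, 0.5);
ratio = like_fair / like_biased_unknown;

% With equal prior belief in the two hypotheses, the ratio converts
% directly into a probability that the coin is fair.
prob_fair = ratio / (1 + ratio);

fprintf('P(data | biased, unknown bias) = %.4f\n', like_biased_unknown);
fprintf('ratio fair/biased = %.2f, P(fair | data) = %.2f\n', ratio, prob_fair);
```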


So direct comparison of two hypotheses is (I hope you'll agree) better than doing a significance test, if we have two or more clear hypotheses to test. And if we have no clear alternative to the null hypothesis, what are we even doing? Boldly rejecting the null when we have nothing to replace it with? Doesn't that sound like we're getting ahead of ourselves?

Now to deal with the third objection above. We saw in the last section that we can make use of the exact bias of the coin if it is known, and adjust if it is not. What about all the other info we have in the problem? And to introduce a further connected problem, why are we judging these models on how likely they make the data? I don't know about you, but I want to know how likely the data makes the models....they are not the same thing!

Looking at our original problem setup, a key aspect we haven't considered is how many coins you were offered. It was exactly two, one of which was biased. What if I had offered you a bucket of 1000 coins, only one of which was biased? Even if I didn't tell you the exact bias, could we really repeat the analysis above and conclude that a coin giving 31 tails was likely to be biased?

Without further ado, let me introduce Bayes' rule, which allows us to go from the probability of the data, D, to the probability of a hypothesis, H:

P(H | D) = P(D | H) x P(H)/P(D)

The first term on the right is known as the likelihood, and is what we have previously been using to judge our hypotheses. But, ignoring P(D), which will cancel out when comparing two hypotheses, that second term on the right, P(H), shows us that something else is going on. This is what we call the prior probability of H, i.e. how likely H was before we saw any data (tossed the coin). In our first example we could easily argue that for both H0 and H1, P(H0)=P(H1)=1/2, since there were two equally choosable coins. With 1000 coins on offer, with a random choice, we should assume that there is only a 1/1000 chance that we picked the biased coin, before we start testing it. Bayes' rule says that this prior probability doesn't just disappear once we start doing tests. So if we calculate the ratio between the probability of H0 and H1 we need to include the ratio of prior probabilities:

P(fair | 31 tails) / P(biased | 31 tails)
= [P(31 tails | fair) x P(fair)] / [P(31 tails | biased) x P(biased)]
= (0.0270 x 0.999) / (0.0372 x 0.001)
≈ 725

So yes, the data is somewhat more probable from a biased coin, but the sheer unlikeliness of us having picked that coin out of the bucket to start with massively outweighs this slim evidence. If you don't believe this, I invite you to fly over to Sweden and we'll play some betting games with you picking coins out of a bucket and I'll finally have some expendable income!
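To see how strongly the prior drives the answer, here is a small sketch that repeats the odds calculation for a choice of 2 coins and for a bucket of 1000. It takes the two likelihoods quoted in the post as given; the variable names are mine.

```matlab
like_fair   = 0.0270;   % P(31 tails | fair), as quoted above
like_biased = 0.0372;   % P(31 tails | biased, unknown bias), as quoted above

n_coins = [2, 1000];    % number of coins on offer, exactly one of them biased
posterior_odds_fair = zeros(size(n_coins));

for i = 1:numel(n_coins)
    prior_biased = 1 / n_coins(i);    % chance of having picked the biased coin
    prior_fair   = 1 - prior_biased;

    % Bayes' rule for the ratio of two hypotheses: P(D) cancels out.
    posterior_odds_fair(i) = (like_fair * prior_fair) / (like_biased * prior_biased);

    fprintf('%4d coins: odds fair:biased = %.2f, P(biased | data) = %.4f\n', ...
        n_coins(i), posterior_odds_fair(i), 1 / (1 + posterior_odds_fair(i)));
end
```

The data are identical in both cases; only the number of coins in the bucket changes.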

If you still have doubts about the power of prior beliefs, watch the video below. Although the data your eyes receive is slightly more likely from a concave face, you see a convex face because your brain knows convex faces are hugely more likely.

Monday, April 9, 2012

How Fish Shoal (5): Separating Different Effects

In this final post on the analysis in our paper on fish shoaling, I'll show how we can adapt the technique of fitting a function using a neural network, using this to isolate the various different cues that the fish respond to simultaneously.

In general we might imagine that a fish moving in a shoal is presented with an array of potential stimuli at any moment. Just for starters, it has the positions, movements and behaviours of its many neighbouring fish to consider. In addition to this there are other environmental cues, such as the positions of the walls of the fish tank in a laboratory experiment, or the possible locations of predators in the wild. We can expect that the behaviour of the focal fish at any moment will be due to a combination of all these effects (I'm intentionally avoiding the words 'sum' or 'product' for reasons that we'll soon see).

Now, in the last post we learned how we can fit the behaviour of a focal fish as a function of any input stimuli that we choose. We saw an example using just the position of the nearest neighbour to predict the acceleration response of the focal fish. There is nothing in that approach to stop us instead using far more input stimuli. We could, for example, construct a neural network to predict the acceleration from the positions of the nearest 3 neighbours.

This would, in principle, allow us to learn how the acceleration of the focal fish depends, in potentially very complex ways, on the positions of its 3 nearest neighbours (NNs - don't get confused with neural networks!). However, such an approach has a few significant drawbacks. Firstly, from a technical viewpoint, the larger we make the space of possible inputs (i.e. the more stimuli we use), the harder it becomes to train the neural network. The more inputs we have the more combinations of those inputs that are possible. It becomes less and less likely that our data will cover a large enough proportion of those possible combinations to allow us to learn the connections between inputs and outputs.

Secondly, even if we could learn a function of the 3 positions (6 variables in total, each position being an angle and distance), how are we to 'see' what we have learnt? We may be able to try new combinations of the inputs and find the predicted acceleration, but it is going to be almost impossible for us to visualise the function.

Finally, how can we relate this back to previously established theories of collective motion? If we learn some highly complex function of the positions of all the neighbouring fish, what does that tell us about the simple rules of interaction that have previously been the standard way of understanding these phenomena?

We can get around these difficulties by considering the sort of interaction rules that have been suggested before. These have almost exclusively considered additive responses - the response of the fish to 2 neighbours is simply the sum of the response to each neighbour individually. This is akin to most effects in physics - the force on a spaceship is the sum of the gravitational force from the Earth and the gravitational force from the Moon, and the gravitational force from the Sun...etc. The existence of the Moon doesn't change the force exerted on the spaceship by the Earth (at a specific moment in time).

So we can propose a model where the acceleration of the focal fish is a function, f, of the positions of 3 neighbours (p1, p2, p3), this function itself being the sum (see why we avoided that word earlier) of 3 simpler functions, g1, g2, g3.

Acceleration = f(p1, p2, p3) = g1(p1) + g2(p2) + g3(p3) + residual

Now, just as we can model a function using a neural network, we can model 3 functions using 3 neural networks. The eventual acceleration is now the sum of the outputs from those 3 networks. I'm going to stop drawing each network properly and just treat them as black boxes, as I encouraged you to do in the last post.

So the task of estimating one complicated function by learning one big neural network is now changed to the task of estimating 3 hopefully simpler functions using 3 smaller networks. The question is how to learn all 3. There is a choice to be made: do you try to learn all 3 simultaneously, or do you prioritise some over others?

In our paper we took both approaches for different subsets of the stimuli. We considered the position of the 3 nearest neighbours, but also the past behaviour of the focal fish and the position of the wall, giving us the schematic below.

Or, in equation form:

Acceleration = g_past(past) + g_wall(wall) + g1(p1) + g2(p2) + g3(p3) + residual

We used two biologically motivated arguments to decide how to approach learning this combination:

1. There is no reason to suspect, a priori, that the past behaviour of the fish, the position of the tank wall or the positions of the neighbouring fish are more or less important than each other.

2. It is implausible to imagine that the focal fish interacts with its 2nd or 3rd nearest neighbours but not with the first nearest. 

Therefore we will learn the first three networks (past, wall, NN1) simultaneously, since each of these should have the chance to be the primary factor in predicting acceleration. Networks 4 and 5 (NN2 and NN3) will be learned subsequently, using whatever part of the fish's acceleration has not been accurately predicted by the first 3 networks. This means, for instance, that the interaction between a fish and its second nearest neighbour will only be allowed to account for what cannot be predicted by its interaction with its nearest neighbour.

The process of learning multiple networks, either simultaneously or in succession, is relatively similar. We iteratively learn each network, while assuming that the others are already known. Let's look at how we learn the networks associated with the past, the wall and the nearest neighbour:

1. Start each network in a random configuration, just like when we learn a single network.

2. Now, we assume that the networks associated with the past and the wall are known and correct. We pass our measured values for these stimuli into the networks and record the output, getting a predicted acceleration_past+wall (recall the USE function on the black box from the last post)

3. We now learn the nearest neighbour network using our measured values of the nearest neighbour position as inputs, and the difference between the actual acceleration and the predicted acceleration as outputs

4. Once the NN1 network is learnt, we fix its state and then apply the same technique to learn the past network. Fixing the wall and NN1 networks in their current states we predict what the acceleration should be from the wall and nearest neighbour alone

5. Then, like in step 3, we now learn the past network, using the difference between the predicted and observed accelerations as the output

6. Having learnt the past network, we fix its new state, and we predict the acceleration from the past and the nearest neighbour networks

7. And now we learn the wall network, using the measured values of the wall position and the difference between the predicted acceleration and the observed acceleration

8. We can now test how good our 3 networks combined are at predicting the real acceleration by feeding the measured stimuli through all 3 in their current states.

9. Now we have learnt each of the first 3 networks once. However, we are not finished. From this point we go back to stage 2 and go through stages 2 to 8 again, repeating the whole process until the error between predicted acceleration and real acceleration stops improving.

So by this process we have learnt, simultaneously, the functions associated with the past, the wall and the nearest neighbour. The process of learning each network in this iterative fashion divides the observed accelerations into components associated with the different stimuli in a way that creates the best match between the predicted and measured accelerations, assuming that the additive assumption is correct. It is of course possible that there are more complicated interactions that depend on, for example, the position of the wall and the nearest neighbour in some complex non-additive fashion. We have found the best additive function that estimates the true process.
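The loop in steps 1 to 9 can be sketched on made-up data. This is not the code from our paper: the data are synthetic, each stimulus is reduced to a single number, and each 'network' is replaced by a cubic polynomial (polyfit/polyval) purely so that the sketch runs without Netlab. The models also start at zero rather than in a random state.

```matlab
rng(1);                                   % fixed seed so the sketch is repeatable
n = 500;
past = randn(n, 1);  wall = randn(n, 1);  nn1 = randn(n, 1);   % synthetic stimuli
acc  = 0.5 * past + wall.^2 - sin(nn1) + 0.1 * randn(n, 1);    % made-up additive truth

% Stand-ins for the LEARN and USE operations on one black box.
learn = @(x, target) polyfit(x, target, 3);
use   = @(model, x) polyval(model, x);

stimuli = {past, wall, nn1};
models  = {zeros(1, 4), zeros(1, 4), zeros(1, 4)};   % all models start at zero

n_sweeps = 20;
err = zeros(1, n_sweeps);
for sweep = 1:n_sweeps
    for k = [3 1 2]                       % NN1, then past, then wall (steps 3, 5, 7)
        % Hold the other two models fixed and record what they predict.
        predicted_by_others = zeros(n, 1);
        for j = setdiff(1:3, k)
            predicted_by_others = predicted_by_others + use(models{j}, stimuli{j});
        end
        % Learn model k from whatever the others fail to explain.
        models{k} = learn(stimuli{k}, acc - predicted_by_others);
    end
    % Step 8: test all three models together.
    prediction = zeros(n, 1);
    for j = 1:3
        prediction = prediction + use(models{j}, stimuli{j});
    end
    err(sweep) = mean((acc - prediction).^2);
end
fprintf('mean squared error: first sweep %.4f, last sweep %.4f\n', err(1), err(end));
```

In practice you would stop sweeping once err stops improving rather than running a fixed number of sweeps.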

Now, having found these three networks, we are in a position to successively learn the interactions with the second and third neighbours. Remember that we do these after the first 3 since it is hard to imagine a fish consistently ignoring its nearest neighbour while attending to its second nearest neighbour. As such we only learn each of these once, rather than iterating.

10. Group all 3 of the networks we learned before and predict the acceleration

11. Now learn the second nearest neighbour network (NN2) based on the difference between this prediction and the observed acceleration

12. Now group all 4 of the networks we have learnt so far and predict the acceleration

13. Learn the 3rd nearest neighbour network based on the difference between these predictions and the observed accelerations

14. Phew! Now we have finally learnt every network. We can make predictions based on all 5 networks to test how well the combination predicts the real accelerations

15. But nicely, since each network has only 1 or 2 inputs, we can also input the range of possible values for each input into its respective network and plot the output, enabling us to visualise each component and solving our earlier problem with visualising a high dimensional function. When we do so, we get something that looks a lot like the figure below. The results from the past network are not shown, only those from the wall and the 3 neighbours (from top to bottom respectively). On the left hand side we have the predicted accelerations from each of these networks. On the right hand side we also have the predicted turning angle, using the same process but with measured turning angles instead of accelerations. Recall that each plot is a semicircle because the functions are assumed to be symmetric (acceleration) or anti-symmetric (turning angle).
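Steps 10 to 15 can be sketched the same way. Again this is an illustration only: the data are synthetic, built so that the second neighbour has no real effect, and cubic polynomials stand in for the networks.

```matlab
rng(2);                                   % fixed seed so the sketch is repeatable
n = 400;
nn1 = randn(n, 1);                        % synthetic nearest-neighbour stimulus
nn2 = randn(n, 1);                        % synthetic second-neighbour stimulus
acc = tanh(nn1) + 0.05 * randn(n, 1);     % made-up data: no true response to nn2

% Model learnt first (here just NN1), then frozen.
model_nn1 = polyfit(nn1, acc, 3);

% Steps 10-11: predict with what we already have, then learn NN2 once
% from the part of the acceleration that is still unexplained.
residual  = acc - polyval(model_nn1, nn1);
model_nn2 = polyfit(nn2, residual, 3);

% Step 15: feed a range of input values through each component on its own.
x_grid = linspace(-2, 2, 50);
g1 = polyval(model_nn1, x_grid);
g2 = polyval(model_nn2, x_grid);

range_g1 = max(g1) - min(g1);
range_g2 = max(g2) - min(g2);
fprintf('output range of NN1 component: %.3f\n', range_g1);
fprintf('output range of NN2 component: %.3f\n', range_g2);
```

Calling plot(x_grid, g1, x_grid, g2) displays the two components; a nearly flat second curve is what 'little structure' looks like in this toy setting.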

As we saw in the last post, the interaction with the nearest neighbour replicates many of the features we expect, including distance dependent repulsion and attraction between the two fish. The other interesting result here is how little structure there is for the functions associated with the second and third nearest neighbours. In our paper we interpret this as evidence that the fish primarily interact with their first neighbour only. Such a strong biological interpretation comes with necessary mathematical caveats. It is important to be clear that we have only learnt the best function for mapping stimuli to behaviour out of those which fit the additive structure we proposed. As shown in another paper published alongside ours, interactions may not always be additive. Also, technically what we have shown is that the positions of the second and third neighbours do not help us to predict the behaviour once we know the position of the first nearest neighbour. While this suggests that no interaction takes place, these are subtly different statements.

Despite those final caveats, learning combined functions in this manner is a powerful tool for separating effects in the data, and for learning patterns that may be obscured by other unknown factors (such as the response to the walls of the tank here). I hope you find this useful when considering your own data!

If you want to know about this sort of technique in more detail I can suggest starting on the Wikipedia page for Expectation-Maximisation, or let me plug again (unpaid and unrequested!) my suggested text: David MacKay's textbook, Information Theory, Inference and Learning Algorithms (free online).

Monday, April 2, 2012

How Fish Shoal (4): Using a Neural Network to Learn From Data

To recap, in the first post on this topic, I started by asking how we can use recorded data of fish movements in groups to learn how they interact. I stated that we can see this as inferring a function between the environment and the fish's behaviour, and in the subsequent posts we looked at how we might estimate functions using regression, arriving at the idea of a neural network as a highly flexible tool for performing non-linear regression. In this post we'll see how we can practically use neural networks (as one possible tool among many alternatives) to learn from data, and how this is actually coded in Matlab to show how few of the details we need to concern ourselves with to start doing useful inference.

I'll be referring to code that utilises the Netlab toolbox in Matlab, which you can download for free, and which you can install simply by unzipping the downloaded file and adding the directory to your Matlab path. The code I will use is specific to Netlab, but the basic method applies to using any similar toolbox.

Inference always begins by deciding which outputs (behaviours) we want to predict from which inputs (stimuli, environment). The recorded positions of the various fish over time taken from video tracking are only useful once we make this assignment. In the case of our research we used the relative positions and directions of each fish's neighbours in the group as the inputs, and the fish's acceleration (or deceleration) and turning angle as the outputs, as shown in the figure below taken from our paper.

The relative position and direction of a neighbour (yellow) from the focal fish (red)

So first we take all the recorded positions of the fish, and for every fish at every time step we calculate the following quantities:

1. The angle (theta) and distance (r) to the nearest neighbour, the second nearest neighbour, third nearest etc.

2. The direction (phi) of each neighbour relative to the focal fish

3. How much the fish accelerated (a) and turned (alpha) on the next time step

We also measure quantities associated with where the wall of the tank is relative to the fish, but I'll ignore these for now. 1 and 2 here are our inputs, the stimuli. 3 is the behaviours - what the fish did next in response to the stimuli.
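As a rough illustration of quantities 1 and 2, here is how they might be computed for one focal fish at one time step. The positions and headings are invented, and the sign convention (angles measured anticlockwise from the focal fish's heading) is an assumption of this sketch rather than something taken from our paper.

```matlab
pos     = [0 0; 3 4; -1 2; 6 0];      % hypothetical x,y positions (cm), one row per fish
heading = [0; pi/2; pi; -pi/4];       % hypothetical heading of each fish (radians)
focal   = 1;                          % index of the focal fish

others = setdiff(1:size(pos, 1), focal);
offset = pos(others, :) - repmat(pos(focal, :), numel(others), 1);

wrap_angle = @(a) atan2(sin(a), cos(a));          % wrap an angle into (-pi, pi]

dist  = sqrt(sum(offset.^2, 2));                  % distance to each neighbour
theta = wrap_angle(atan2(offset(:, 2), offset(:, 1)) - heading(focal));
phi   = wrap_angle(heading(others) - heading(focal));

% Sort so that row 1 is the nearest neighbour, row 2 the second nearest, etc.
[r, order] = sort(dist);
theta = theta(order);
phi   = phi(order);
neighbour_id = others(order);

fprintf('nearest neighbour is fish %d: r = %.2f cm, theta = %.2f rad, phi = %.2f rad\n', ...
    neighbour_id(1), r(1), theta(1), phi(1));
```

Repeating this for every fish at every time step, and stacking the results, gives the column vectors of inputs used below.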

So, assuming that we've tracked our fish and measured the above, let's get inferring....

Let's try seeing how the acceleration of the focal fish is related to the position of the nearest neighbour. Once you've got Netlab installed, you can build a neural network in just one line:

my_nn = mlp(2, 10, 1, 'linear');

my_nn: is your neural network (remember from the last post, it can also be called a Multi-Layer Perceptron - mlp)

2: is the number of inputs we want to use. We will be using the angle and distance to the neighbour.

10: is the number of 'hidden nodes' - that's the number of nodes in the middle layer of the diagram we saw in the last post. We can change this number - more nodes make the network more flexible but harder to learn well. I find 20 tends to work ok, but always experiment! Each node will be a sigmoidal function of the inputs, but we're not going to worry about these details here.

1: is the number of outputs. Here there is just one, the acceleration.

'linear': means that the output will be a weighted sum of all the hidden nodes. The only real reason to change this is if the outputs are binary rather than continuous.

Now you have a neural network! But at the moment it doesn't do very much. It's been configured in a random state. You can try putting some numbers in and seeing what comes out using mlpfwd:

y = mlpfwd(my_nn, [x1, x2]);

where x1 and x2 are any possible values of the angle and distance to the nearest fish you want to try, and y is the predicted acceleration. At the moment those predictions will be meaningless, as the network hasn't learnt anything.

Now comes the useful bit. Assume we have three vectors containing the data: theta is a vector of angles to the nearest fish, r is a vector of the distances to the nearest fish and a is a vector of how much the fish accelerated. Make sure these are column vectors. Then we can train the network using just a few more lines of code.

options = zeros(1, 18); options(1)=1; options(14) = 100;
my_nn = netopt(my_nn, options, [theta, r], a, 'scg');

netopt is a function that trains the network, based on the data it's given. It tries to find the values for all the parameters (like the 'slopes' in the last post) which will produce the best match between what actually comes out of the network when we put the inputs (position of the nearest neighbour) in, and the behaviours we tell it should come out (i.e. the measured accelerations). options is, as the name suggests, a number of possible options. Here we only use 2. The first tells Matlab to show the error values as the algorithm learns, the 14th tells netopt to run 100 iterations of the learning algorithm. The learning algorithm is something called 'scaled conjugate gradients', which is the 'scg' at the end.

Now we can input any values of theta and r to the network and it should output a value of the expected acceleration that fits with the data it has already seen. That is about 90% of everything you need to know to start doing inference with a neural network today. All the diagrams and equations in the last post are nice to have in the back of your head while doing this, but essentially you can treat the neural network as a black box. You put data in, in the form of known inputs and outputs. You press a button to make the network 'learn', and then the box will tell you what output you should expect for any input you offer it.

First we show the network some known examples
..then we ask it to predict the output for other inputs

This is in fact the basis of pretty much all of machine-learning. Take a number of known examples of something, such as images of handwritten letters. Plug them into a learning algorithm (of which a neural network is but one among many) to train it. Then use the same algorithm to predict what some unknown examples are.

Now all that remains is to try inputting all the possible values of theta and r that we might be interested in. In our paper we made the further simplification that the function would be symmetric around the axis of the fish - i.e. if the fish will accelerate when a neighbour is ahead on the left, it will also do so if the neighbour is ahead on the right. So we test values of r between 0 and some maximum (like 40cm), and angles between 0 and pi (everywhere on the left of the fish). In Matlab we can make vectors of these test inputs like this:

r_test = linspace(0, 40, 100);
theta_test = linspace(0, pi, 100); %this gives us 100 values of each input

[r_grid, theta_grid] = ndgrid(r_test, theta_test);
test_input = [theta_grid(:), r_grid(:)];
%this matches every value of r to every value of theta so we can test all pairs
%(the columns are in the same order as the training inputs: [theta, r])

test_acc = mlpfwd(my_nn, test_input); %this puts our test inputs through the network we learned

test_acc = reshape(test_acc, size(r_grid));
%and this makes the output accelerations into a matrix so we can visualise it

X = r_grid .* cos(theta_grid);
Y = r_grid .* sin(theta_grid);
%X and Y are the same size as test_acc, one position for every (r, theta) pair
pcolor(X, Y, test_acc);
%This visualises the output on a nice semi-circle

And so finally we get a plot showing what the network thinks the fish will do for any given position of the nearest neighbouring fish.

That B is because this comes from a multipanel image, as we'll see soon

So we confirm some previously held beliefs about how interactions like this work. The focal fish accelerates to catch up with a neighbour in front. It slows down to rejoin a neighbour behind. And if a neighbour is too close (near the centre), this is reversed to move the focal fish to a more comfortable distance.

So in a few lines of code by us, and a lot of preprogrammed code by the makers of Netlab, we have done some quite sophisticated inference with a minimum of real maths. Of course, there are some complications in now getting from the 90% you already know to the 100% you need to get publication ready. You'll need to concern yourself with things like multiple local minima of the squared error, cross-validation and other such things. But these are things to worry about once you've got your hands a little dirty and started actually doing some inference...none of them mean you can't start applying these techniques to your data TODAY!

In the next and probably last post on this topic I'll show how we go from learning this relatively simple function with just 2 inputs, to a more complex function accounting for the positions of many neighbours, and we'll investigate the perils of correlation and confounding.

[Again, if you want to read more about the details of any of these techniques, I recommend David MacKay's textbook, Information Theory, Inference and Learning Algorithms (free online). Netlab also contains a large number of demo scripts, of which demmlp1.m is similar to this post]

Monday, March 26, 2012

How Fish Shoal (3): From Linear Fits to Neural Networks

In the last post we looked at how we could adjust the slope of a simple straight line fit to minimise the residual error between the data and the approximating function

Behaviour = Slope x Stimulus + residual

Although linear regression like this can be very powerful, and is often used to determine if two variables are linked (via correlation analysis), in general we cannot expect that animal behaviour is always so straightforwardly related to the stimuli or environmental cues we are interested in.

Consider for example how good you might feel over the course of an evening of heavy drinking. Here we can quantify the 'stimulus' as the number of drinks you consume. While some people are lucky enough to always drink in moderation, I would wager you have encountered evenings where your general state of enjoyment follows a trend like the one below.

Enjoy inference responsibly...

How can we fit a function to a more complicated looking curve like this? One solution would be to try and guess the shape of the curve....but what function does it look like? If we choose badly then we'll never get a good fit to the data.

We'd like a more flexible alternative that doesn't require so much guesswork. To see how we can get there, first we need to understand how we might apply the previous straight line fitting techniques when we have multiple stimuli. Imagine we think that an animal's behaviour is a function of two stimuli:

Behaviour = f(Stimulus1, Stimulus2)

Well, how about starting by extending the equation we had for a straight line? The original equation for just one stimulus was:

Behaviour = Slope x Stimulus + residual

For reasons that will become clear, we can draw this relationship using the following schematic

Extending this to two stimuli we have two slopes:

Behaviour = Slope1 x Stimulus1 + Slope2 x Stimulus2 + residual

and similarly a new picture, showing that the behaviour comes from adding the two different factors:

Now the behaviour varies in response to both stimuli. Each slope tells us how strong the relationship between stimulus and behaviour is, and we can find out what these slopes are in much the same way as before, either by trying different values until we get the lowest squared-sum-of-errors (see the last post), or by using the special formulae that exist for telling us the best solution. I don't intend to worry about these formulae here, but if you want to find them or see where they come from then the Wikipedia page on Ordinary Least Squares is a good place to look.
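For the curious, here is what the 'special formulae' route looks like in Matlab, where the backslash operator solves the least squares problem directly. The data and the true slopes (2 and -0.5) are invented for the illustration.

```matlab
rng(3);                                   % fixed seed so the sketch is repeatable
n = 200;
stimulus1 = randn(n, 1);                  % synthetic measurements of stimulus 1
stimulus2 = randn(n, 1);                  % synthetic measurements of stimulus 2

% Made-up 'true' relationship plus some noise playing the role of the residual.
behaviour = 2 * stimulus1 - 0.5 * stimulus2 + 0.1 * randn(n, 1);

% One column per stimulus; backslash returns the slopes that minimise
% the sum of squared residuals.
design = [stimulus1, stimulus2];
slopes = design \ behaviour;

residual = behaviour - design * slopes;
fprintf('Slope1 = %.3f, Slope2 = %.3f, sum of squared errors = %.3f\n', ...
    slopes(1), slopes(2), sum(residual.^2));
```

Adding a third stimulus is just a matter of adding a third column to design.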

We can extend this basic approach further, including more and more stimuli. But, at this point you may be wondering "How does this help us fit functions like the one about drinking enjoyment above?" At this point we employ a wonderful little trick. Remember we talked about guessing the form of that function? I said that this would be too restrictive. But what if we guess lots of functions? 

The trick here can be seen simply by taking the equation for two stimuli we had above, but now imagine that stimulus 2 is related to stimulus 1. Instead of two different stimuli, imagine instead that we replace stimulus 2 with a function of stimulus 1. Let's try something simple, like Stimulus1^2.

Behaviour = Slope1 x Stimulus1 + Slope2 x Stimulus1^2 + residual

Now, if we find the best fit values for these slopes, based on the observed values of Stimulus1 and Stimulus1^2, we are actually fitting a non-linear, quadratic function to the data. If we know what Stimulus1 is then we can easily calculate Stimulus1^2, and then we can treat them as if they were different stimuli. Fitting this function is still the same process as before - we choose different values for the slopes and find those that produce the lowest residual errors. See that our schematic of the model now has an additional layer between the stimulus and the behaviour.

We can take this simple trick further. Instead of many stimuli, we can use many different functions of the same stimulus. Consider a set of different functions of the stimulus, g_i(Stimulus), where each value of i specifies a different function. We can model the behaviour as being a sum of these functions, each with its own slope:

Behaviour = Slope1 x g1 (Stimulus) + Slope2 x g2 (Stimulus) + Slope3 x g3 (Stimulus)... etc

These functions could each be different powers of Stimulus, giving us a polynomial fit, or they could be any other set of functions. The key point here is that instead of guessing one kind of function to fit, we can try lots of different functions and weight them according to how important they are. Just as in every case before, when we try different values for the slopes, we get different residual errors, and we aim to find the best values that minimise those errors. Of course, as the number of slopes increases it becomes harder to find the best values easily, but in principle the task is the same.
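A small sketch of this trick, using powers of the stimulus as the functions g1, g2 and g3. The 'enjoyment' curve is invented to rise and then fall; it is not real data.

```matlab
drinks    = (0:0.5:10)';                  % the stimulus: number of drinks
enjoyment = drinks .* exp(-drinks / 3);   % made-up rise-and-fall curve

% Treat each function of the stimulus as if it were a separate stimulus.
basis  = [drinks, drinks.^2, drinks.^3];  % g1, g2, g3
slopes = basis \ enjoyment;               % best slopes in the least squares sense
fitted = basis * slopes;

% For comparison: the single straight line through the origin.
line_slope = drinks \ enjoyment;
line_fit   = drinks * line_slope;

sse_cubic = sum((enjoyment - fitted).^2);
sse_line  = sum((enjoyment - line_fit).^2);
fprintf('sum of squared errors: straight line %.3f, three functions %.3f\n', ...
    sse_line, sse_cubic);
```

Swapping the columns of basis for any other set of functions (sines, exponentials, sigmoids) leaves the rest of the code unchanged.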

The schematic of this kind of model (below) shows that we now have an expanding middle layer, where each circle represents a different function of the stimulus. The value of Stimulus is passed to each of these functions. The output of each function is then weighted according to the value of its Slope and passed to the Behaviour, which is made from the sum of all these bits.

The next and final stage is to consider developing this middle layer one step further, so we can once again consider multiple different stimuli. Instead of just one stimulus at the start which leads to our middle layer, imagine we have many stimuli to consider. Each function in the middle layer is now dependent on all the different stimuli, so e.g. g1 = g1 (S1 , S2 , S3 ...). Connect each stimulus to every function in the middle layer...

Don't worry about the switch from circles to squares or from black arrows to blue, I'm just reusing an old image!

Now we have a model which is going to look very complicated if we write it down as an equation. In essence though the idea remains the same as our first linear model. Along each line we simply multiply the input by some number, just like the slope of the linear plot. The behaviour predicted by the model is given by adding up all the different bits coming out of the middle layer into the final node, and we can adjust the multiplying numbers on each line to get the lowest error possible. When we find the best values for these numbers we have something which acts as a function taking stimuli to behaviour. We can then put in any value of stimuli we are interested in and see what the function predicts the animal will do.

You may have seen something like the above picture before. It is an example of what is known as an artificial neural network and is of a special type called a multi-layer perceptron. In such models the functions in the middle layer are usually sigmoidal functions that aim to mimic the highly thresholded response of real neurons to stimuli in the brain. Each function takes in a weighted sum of the stimuli, S, and sends out an output according to a profile like that below.

These neural networks provide us with a tool which can fit extremely variable non-linear functions of many stimuli. Neural networks have a lot of parameters that can be varied, each one essentially like the slopes we learnt for the linear model. Just as before, when these parameters are changed they alter the residual error between the behaviour predicted by the model and the observed behaviour. Unfortunately, unlike for the linear model, there is no general solution that quickly tells us what these parameters should be, but there are lots of clever ways to find good values for these parameters by iteratively changing them, making sure the error keeps going down. But this is a long way beyond the scope of today's post.
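To make the picture concrete, here is a hand-written forward pass through a tiny network with 3 stimuli and 4 middle-layer functions. All the numbers are arbitrary, chosen only for illustration, and the logistic sigmoid is one common choice of middle-layer function (a toolbox may use a different sigmoidal shape).

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));        % thresholded response profile

stimuli = [0.2, -1.0, 0.5];               % one observation of S1, S2, S3 (made up)

% Numbers on the lines from each stimulus (rows) to each middle node (columns).
W1 = [ 1.0, -0.5,  0.3,  2.0;
       0.7,  1.2, -1.0,  0.1;
      -0.4,  0.8,  0.5, -1.5];
b1 = [0.1, -0.2, 0.0, 0.3];               % offset for each middle node

% Numbers on the lines from the middle nodes to the behaviour (the 'slopes').
w2 = [0.5; -1.0; 0.8; 0.2];
b2 = 0.1;

hidden    = sigmoid(stimuli * W1 + b1);   % each middle node sees a weighted sum of stimuli
behaviour = hidden * w2 + b2;             % linear output: weighted sum of the middle layer

fprintf('middle layer outputs: %s\n', mat2str(hidden, 3));
fprintf('predicted behaviour:  %.3f\n', behaviour);
```

Learning the network means adjusting W1, b1, w2 and b2 until the predicted behaviour matches the observed behaviour as closely as possible.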

Obviously a short blog post like this leaves many aspects of this kind of model fitting unaddressed, such as how we learn the parameters, or which functions we use for the middle layers. Although I have tried to give you some idea how such a model works by relating it back to the linear straight line fitting, the important thing to remember is what we shall see in the next post: you can use models like this while understanding almost nothing of what is going on inside. Lots of computer scientists have generously studied these sorts of models for decades, creating neat little toolboxes like Netlab for Matlab that allow us to fit complicated functions without getting our hands dirty with the modelling machinery! The important thing is to be secure enough in knowing what is going on in principle that you are happy to look away and let the toolbox do its work. In the next post I will try and show, with a bit of Matlab code, how we actually apply a neural network from this toolbox to (finally!) learn how fish shoal...

[If you are interested in more of the details surrounding this topic, I can highly recommend David MacKay's textbook, Information Theory, Inference and Learning Algorithms (free online) - try Section V]