Back in 2014 I was trying to make some progress towards my *docent* (Swedish habilitation) by fulfilling the requirement to undertake formal pedagogic training. As it happens, I left Sweden before either could be completed, but I recently went back through my materials and found this essay I had written as part of that course. In the absence of anything else to do with it, here it now lies...

**Introduction**

Over time people have developed increasingly sophisticated theories of learning and education, and correspondingly teaching methods have changed and adapted. As a result, much is now known about what activities most promote student learning, and the differences between individuals in their learning techniques and strategies.

At the same time, computer scientists have developed increasingly powerful artificial intelligences. The creation of powerful computational methods for learning patterns, making predictions and understanding signals has drawn attention to a more mathematical understanding of how learning happens and can be facilitated.

Some of the parallels between these fields are obvious. For example, the development of artificial neural networks was driven by the analogy between these mathematical structures and the neuronal structure of the brain, and encouraged scientists to describe the brain from a computational perspective (e.g. in [Kovács, 1995]). However, the analogies between theories of learning in education and computer science are deeper than these surface resemblances, and go to the heart of what we consider useful information and knowledge, and what we mean by understanding.

In this report I will review elements of both the pedagogical and machine learning literature to draw attention to specific examples of what I consider to be direct analogues in these two fields, and how these analogies help organise our knowledge of the learning process and motivate approaches to student learning.

**Learning to learn**

When computer scientists first began creating artificial intelligence, their first approach was to try to encode useful knowledge about the world directly in the machine, by explicit inclusion in the computer’s programming. For example, in attempting to create a computer vision system that could recognise handwritten letters, the programmer would try to describe in computer code what an ‘A’ or a ‘B’ looked like in terms that the computer could recognise in the images it received. However, this procedure generally proved dramatically ineffective. The sheer range of ways in which an ‘A’ can be written, the possible permutations on the basic design and the different angles and lighting that the computer could receive defeated the attempt to systematically describe the pattern in this top-down fashion.

Instead, success was first achieved in these tasks when researchers tried the radically different approach not of teaching the computer each concept individually, but instead teaching the computer how to learn for itself. In 1959 Arthur Samuel defined machine learning as a ‘Field of study that gives computers the ability to learn without being explicitly programmed’ [Simon, 2013]. By providing the computer with algorithms that allowed it to observe examples of different letters, and learn to distinguish these itself from the examples, much greater success was possible in identifying the letters. In essence, by teaching the computer good methods for learning, the computer could gain much greater understanding itself, and with less input from the programmer.
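To make the idea of learning from examples concrete, here is a minimal sketch of a nearest-centroid classifier. It is not how any historical handwriting system worked: the two-number feature vectors and the function names `fit_centroids` and `predict` are assumptions invented purely for illustration. The point is only that the programmer supplies a learning procedure and labelled examples, not a description of what an ‘A’ looks like.

```python
from collections import defaultdict
from math import dist


def fit_centroids(examples):
    """Average the feature vectors observed for each label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid lies closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical two-number summaries of handwritten letters (illustrative only).
training = [
    ((0.9, 0.2), "A"), ((0.8, 0.3), "A"), ((1.0, 0.1), "A"),
    ((0.2, 0.9), "B"), ((0.3, 0.8), "B"), ((0.1, 1.0), "B"),
]

centroids = fit_centroids(training)
print(predict(centroids, (0.7, 0.4)))  # an unseen example, nearer the "A" examples
```

Nothing in the code says what distinguishes the two letters; that knowledge comes entirely from the examples.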

The parallel here with the teacher-student relationship is very direct. A teacher is responsible, of course, for providing a great deal of information to a student. But the best teachers are more successful because they teach the students how to learn for themselves, how to fit new examples into their existing understanding and how to seek the new information and examples they need. At the higher levels of tuition, encouraging and enabling this self-directed learning is essential. Anne Davis Toppins argues that within 30 minutes ‘I can convince most graduate students that they are self-directed learners’ [Toppins, 1987]. However, much as programmers initially tried to directly tell computers what they needed to know, before realising the greater efficiency of teaching them to learn for themselves, so has the pedagogical approach taken a similar path [Gustafsson et al., 2011]:

'For some lecturers, thinking in terms of empathising with and supporting the students’ learning and “teaching them to learn”, i.e. supporting them in their development of study skills, can constitute a new or different perspective. [...] Some teachers claim that since the students have studied for such a long time in other school situations, the higher education institution should not have to devote time to the learning procedure.'

In other words, there have been, and indeed still are, many lecturers who view their role primarily in terms of transmitting information, rather than in developing the students’ abilities to think and learn for themselves.

**Conceptual understanding**

In the modern teaching literature, much importance is placed on aiming for, and testing, students’ conceptual knowledge. That is, students are expected to learn not simply a series of factual statements or isolated results, but instead to incorporate their knowledge into higher-level abstract concepts that they can use to understand unfamiliar situations, solve unseen problems and extrapolate their knowledge to new domains. The prevailing doctrine of constructive alignment [Biggs, 1999], which forms the basis for recommended teaching approaches in European countries under the Bologna process, is designed to make sure that teaching methods, student activities and assessment assignments all align towards this goal of promoting and testing whether students understand the ‘big picture’.

According to a computer scientist’s view of knowledge and information, there is a very good reason why we should aim to promote such a concept-centred approach for students. Identifying unifying principles that tie knowledge together, and understanding how apparently different fields may link together, reduces the amount and the complexity of the information that a student or computer must store, access and process, and maximises the effectiveness of extrapolating to new domains.

Consider as a simple example the data shown in figure 1. How can these data be stored efficiently? The simplest method would be to record each pair of (x, y) co-ordinates. Assuming we use 4 bytes per number (single-precision floating point accuracy), this will take us 80 bytes for 20 numbers (10 x’s, 10 y’s). But visually we can immediately recognise an important pattern: the data clearly lie along a straight line. If we know the gradient and intercept of this line we can immediately translate any value of x into a value of y. Therefore we can reproduce the whole data set by specifying just 12 numbers – the 10 values of x, one value for the intercept and one value for the gradient. Thus by understanding one big idea, one concept about the data, that they lie along a line, we have reduced the effort of learning and storing that information by 40%. Furthermore, we can now extrapolate to any new value of x, immediately knowing the correct corresponding value of y. If we had simply memorised the 10 pairs of co-ordinates we would have no way to do this.

In the field of machine learning this line of reasoning has been formalised into the principles of Minimum Message Length and Minimum Description Length, first proposed by Chris Wallace [Wallace and Boulton, 1968] and Jorma Rissanen [Rissanen, 1978] respectively. These state that the best model, or description, of a data set is the one that requires the least information to store. Modern texts on machine learning theory focus heavily on the superiority of the simplest possible models that enable reconstruction of the necessary information, and stress the connection to the well-established principle of Occam’s Razor (e.g. [MacKay, 2003]). Applications of machine learning theory to animal behaviour have further suggested that animals apply the same principles to maximise the value of their limited processing and storage capabilities [Mann et al., 2011], so it is likely that humans also apply similar methods.

**Figure 1: By observing conceptual patterns in the data we can reduce the amount of memory needed to store it, whether on a machine or in a human mind. In this simple example, having identified the linear relation between the X and Y co-ordinates (Y = 2X), we need to store only the X values, the intercept and the gradient, reducing the number of stored numbers from 20 to 12.**
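The counting argument behind Figure 1 can be checked with a short sketch. The actual X values in the figure are not given here, so the values 1 to 10 below are an assumption; only the relation Y = 2X comes from the caption. The helper `fit_line` is a plain least-squares fit written for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight-line fit; returns (gradient, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


xs = list(range(1, 11))    # assumed X values; the figure's own values are unknown
ys = [2 * x for x in xs]   # Y = 2X, as stated in the caption

raw = xs + ys                             # rote storage: every co-ordinate
gradient, intercept = fit_line(xs, ys)
compressed = xs + [gradient, intercept]   # concept storage: X values plus the line

# The full data set can be rebuilt from the compressed form.
reconstructed = [gradient * x + intercept for x in xs]
assert reconstructed == ys

print(len(raw), len(compressed))   # 20 12
print(gradient * 25 + intercept)   # extrapolation to an unseen x = 25
```

The rote list offers no way to answer the final question, whereas the stored line answers it for any x.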

**Strategic learning**

A common characteristic of high-achieving students is a strategic approach to learning. They have a good overview of what they need to learn to achieve their life goals. They set realistic but challenging learning goals for themselves to the end of learning this material. And they actively seek out information from teachers, reading materials and other sources to aid their learning. Whether their goals are intrinsic (interest in the subject, desire for knowledge) or extrinsic (obtaining a degree, getting a job), this strategic approach to learning systematically produces better outcomes than passively receiving whatever information is offered.

Analogously, in the field of machine learning, recent developments have tended more and more towards ideas termed ‘active learning’ [Settles, 2010]. The previous paradigm of simply offering many examples to the computer to learn from and then assessing or using the results of that process has been overturned. Instead, the programmer/mathematician devises a strategy for the computer to seek out new examples, based on what it wants to achieve (e.g. identifying written letters successfully) and what it currently knows. For example, if the computer has a good idea how to recognise an ‘A’, but frequently confuses a ‘U’ and a ‘V’, it will seek out or request more examples of these letters so that it can improve its knowledge. This way it does not waste time learning redundant material, but maximises the result of its effort by focusing on the most rewarding areas.
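As a rough illustration of the difference, the sketch below compares an active learner with a passive one on a toy problem: finding where a label switches from False to True along a line of 101 candidate examples. The pool, the boundary at 37.5 and both function names are assumptions made up for this example, and the querying rule (always ask about the point the learner is least sure of) is one simple strategy of the kind surveyed in [Settles, 2010], not a description of any particular system.

```python
def active_learn_threshold(pool, oracle, budget):
    """Locate a decision boundary by always querying the most uncertain point.

    Assumes min(pool) lies below the boundary and max(pool) on or above it.
    Returns (lo, hi, queries), with the boundary somewhere in (lo, hi].
    """
    lo, hi = min(pool), max(pool)
    queries = 0
    while queries < budget:
        candidates = [x for x in pool if lo < x < hi]
        if not candidates:
            break  # nothing left that could change what the learner knows
        midpoint = (lo + hi) / 2
        # The point nearest the middle of the unknown region is the least certain.
        x = min(candidates, key=lambda v: abs(v - midpoint))
        queries += 1
        if oracle(x):
            hi = x
        else:
            lo = x
    return lo, hi, queries


def passive_learn_threshold(pool, oracle, budget):
    """Baseline: label examples in the order they are presented."""
    lo, hi = min(pool), max(pool)
    queries = 0
    for x in sorted(pool)[1:-1]:
        if queries >= budget:
            break
        queries += 1
        if oracle(x):
            hi = x
            break
        lo = x
    return lo, hi, queries


pool = list(range(101))              # hypothetical pool of unlabelled examples
oracle = lambda x: x >= 37.5         # hypothetical teacher who answers label requests

print(active_learn_threshold(pool, oracle, budget=100))   # (37, 38, 6)
print(passive_learn_threshold(pool, oracle, budget=100))  # (37, 38, 38)
```

Both learners end with the same knowledge, but the active learner reaches it with 6 questions instead of 38 because it never asks about examples whose answer it can already infer.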

Likewise a high-performing student will focus their attention on areas where they are weak and/or on particularly crucial concepts that provide a pivot for understanding. They will ask their teachers for more feedback on their efforts in these areas, spend more time on mastering them and prioritise them ahead of areas that are less important or already understood. McKeachie’s Teaching Tips [McKeachie and Svinicki, 2013] devotes a chapter to the importance of encouraging strategic and self-regulated learning. One of their descriptions of a strategic learner states:

‘Strategic learners know when they understand new information and, perhaps more important, when they do not. When they encounter problems studying or learning, they use help-seeking strategies’.

This emphasis on the importance of knowing where understanding is lacking, and the resultant help-seeking strategy, aligns perfectly with what information theory tells us is the optimal way to gain useful knowledge.

McKeachie’s Teaching Tips [McKeachie and Svinicki, 2013] also focuses on the importance of student learning goals. My own research in the field of active learning corroborates this view, demonstrating that even when a learner has a good learning strategy, the success of that strategy depends intimately on the goals that the learner sets themselves. Indeed, without a suitable goal the learner is unable to define a useful strategy [Garnett et al., 2012]. Thus, in order to develop students’ strategic learning skills, it is essential first to help them define and identify what their individual goals are. A student for whom this is an essential course, but who is otherwise uninterested, may be best helped by clarifying what they wish to achieve (a certain final grade, for instance), and then working with them to establish what strategy will most likely allow them to reach that outcome. A student with greater intrinsic motivation for the course may need help setting specific staged learning goals that enable a learning strategy. The teacher’s experience in understanding the most effective path through the material would therefore be essential in establishing effective goals that the student can then apply a strategy to achieve.

**Discussion**

While student and machine learning are clearly not direct parallels of each other (could one imagine a machine equivalent for tiredness, or skipping class to watch TV?), the analogies that do exist between the two help us to understand why certain approaches to student learning are more successful than others, via the large body of technical knowledge that exists regarding how machines can be taught. In this report I have analysed a selection of those analogies, aiming to draw conclusions about how students should be taught.

In particular, a common theme of modern pedagogical approaches is to move from information transfer to a student-directed learning approach. In a sense, computer scientists have been down this path already, switching from a programmer-led to a computer-led learning approach that has resulted in far superior learning outcomes. This should motivate and support the equivalent transition in student learning.

In teaching computers how to think and learn, we have also needed to help them establish goals and strategies for learning, and this is now the forefront of machine learning research. The dramatic improvement in computer learning outcomes when well-developed strategies are employed should remind us that the manner in which the student approaches new information and requests help and feedback matters at least as much as the amount of information they are presented with. Such knowledge demands that we devote time to monitoring and developing students’ learning strategies and discussing what they hope to achieve via our courses.

Students, like all of us, are presented with a great deal more information than they can easily process and digest. If computer science in the 21st century has taught us anything, it is the importance of identifying general patterns in the vast body of information we are now exposed to via the media, the Internet and other sources. Without relatively simple general principles, information can easily become overwhelming. That the same principle applies in student learning should not surprise us. How is a student to retain all the information we attempt to transfer to them without organising it into general principles rather than a huge array of specific cases? The content of any course should therefore revolve as much around this organisational structure as around the raw information itself, demanding generalised understanding rather than specific regurgitation. Thankfully this is the direction modern pedagogy is taking, with such concepts as constructive alignment and the SOLO taxonomy.

**References**

[Biggs, 1999] Biggs, J. (1999). What the student does: teaching for enhanced learning. Higher Education Research & Development, 18(1):57–75.

[Garnett et al., 2012] Garnett, R., Krishnamurthy, Y., Xiong, X., Schneider, J., and Mann, R. (2012). Bayesian optimal active search and surveying. In Proceedings of the International Conference on Machine Learning.

[Gustafsson et al., 2011] Gustafsson, C., Fransson, G., Morberg, Å., and Nordqvist, I. (2011). Teaching and learning in higher education: challenges and possibilities.

[Kovács, 1995] Kovács, I. (1995). Maturational windows and adult cortical plasticity, volume 24. Westview Press.

[MacKay, 2003] MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge: Cambridge University Press.

[Mann et al., 2011] Mann, R., Freeman, R., Osborne, M., Garnett, R., Armstrong, C., Meade, J., Biro, D., Guilford, T., and Roberts, S. (2011). Objectively identifying landmark use and predicting flight trajectories of the homing pigeon using gaussian processes. Journal of The Royal Society Interface, 8(55):210–219.

[McKeachie and Svinicki, 2013] McKeachie, W. and Svinicki, M. (2013). McKeachie’s teaching tips. Cengage Learning.

[Rissanen, 1978] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5):465–471.

[Settles, 2010] Settles, B. (2010). Active learning literature survey. University of Wisconsin, Madison, 52:55–66.

[Simon, 2013] Simon, P. (2013). Too Big to Ignore: The Business Case for Big Data. John Wiley & Sons.

[Toppins, 1987] Toppins, A. D. (1987). Teaching students to teach themselves. College Teaching, 35(3):95–99.

[Wallace and Boulton, 1968] Wallace, C. S. and Boulton, D. M. (1968). An information measure for classification. The Computer Journal, 11(2):185–194.
