Learning by Imitating

An essay by Latanya Sweeney

"Why can't children simply copy what they see?" Marvin Minsky posed this question in a 1996 discussion, as well as in his book, Society of Mind (1985), on page 138. But the inability to copy what one sees is found throughout the teaching-learning process and is not limited to child learners or to the visual system. Instead it reflects the human ability to extract features from the vast detail found in human sensory data. Extract what? Focus where? All the sounds and all the images we may hear or see in a short interval contain far more information than we can store verbatim, so we process the material by focusing our attention on particular details and performing automatic abstractions. In this way we not only recall experiences but also learn from them as examples.
In this presentation I first consider empirical and theoretical evidence regarding the nature of imitation as found in human behavior. Then I model imitation in the symbolic computation mechanics provided by Minsky in Society of Mind (1985). Following that, I discuss how internal representations and scripts used in imitating actually evolve from personal experience.

Empirical and Theoretical Evidence

Music teachers often try to help their students develop an "ear" for music. They may try echo playing, which requires students to directly imitate the teacher's sound, rhythm and pitch. Essentially, students must undergo aural skill enhancement training before these techniques can work; otherwise there are too many details for the student to capture (Winkler, 1995). In a two-second sound bite, a person can isolate a phrase spoken by a particular speaker in a room while a band plays and other people conduct separate conversations nearby. A sound recording of this time interval contains all the acoustic information present, but the listener can recall only some of the details, not all.
Educators are aware that students learn new knowledge by building on previously acquired knowledge and can't simply copy what's done. In fact, the teaching-learning problem concerns itself with ensuring that each student evolves an adequate and correct understanding of the subject matter. Consider, for example, someone (Bob) teaching you a new dance step. Bob instructs you to follow after him. For a few seconds he gyrates through a myriad of moves with his legs kicking and his body turning. With no framework of dance steps, you can't just imitate his moves. There are far too many details and you have no way of recounting or "chunking" their sequence. On the other hand, if you have a model of dance steps and an accompanying mental vocabulary, then when you see known moves you can abstract them from the details and make a mental sequence of the higher-level steps involved.
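The chunking just described can be sketched in code. This is only a minimal illustration, not a model of how people actually perceive dance. It assumes the observed movement arrives as a list of primitive move labels and that the learner's vocabulary maps known runs of primitives to step names. The names VOCABULARY and chunk, and all the move labels, are hypothetical.

```python
# Hypothetical vocabulary: each known dance step is a run of primitive moves.
VOCABULARY = {
    ("lift left leg", "extend left leg", "lower left leg"): "kick left",
    ("lift right leg", "extend right leg", "lower right leg"): "kick right",
    ("raise hands", "clap"): "clap",
}


def chunk(observed, vocabulary):
    """Greedily replace known runs of primitive moves with step names.

    Moves that match nothing in the vocabulary are kept as raw details.
    """
    result = []
    i = 0
    # Try longer patterns first so the largest known chunk wins.
    patterns = sorted(vocabulary, key=len, reverse=True)
    while i < len(observed):
        for pattern in patterns:
            if tuple(observed[i:i + len(pattern)]) == pattern:
                result.append(vocabulary[pattern])
                i += len(pattern)
                break
        else:
            # No known pattern starts here: store the detail verbatim.
            result.append(observed[i])
            i += 1
    return result


observed = [
    "lift left leg", "extend left leg", "lower left leg",
    "raise hands", "clap",
    "lift right leg", "extend right leg", "lower right leg",
]

with_vocabulary = chunk(observed, VOCABULARY)
without_vocabulary = chunk(observed, {})
```

With the vocabulary, the eight observed details collapse into three higher-level steps; with an empty vocabulary, all eight must be carried along as they are.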
A behaviorist approach (Skinner, 1987) to the teaching-learning problem involves creating a tightly structured environment in which each step is so incremental that it plants the same unambiguous understanding in the mind of every student; the environment conditions the students into providing proper responses. A constructivist approach (Papert, 1980), in contrast, evolves the student's internal model of the subject by providing an environment in which students can discover things for themselves by doing and exploring. In both cases teaching-learning is based not on verbatim recollection but on the student's internal understanding as abstracted from their experiences.
Of course, learning and imitation extend far beyond the classroom. Baldwin and Piaget described imitation as having self-teaching functions (Parker, 1993). Around one year of age, children's speech patterns begin to develop as they imitate and practice sounds (Gibson, 1992; Poulson et al., 1991). In one study, fifty-six children, each 19 months of age and unfamiliar to each other, were given identical toys and paired off. Researchers recorded long phases of synchronic imitation occurring in nearly all subjects (Asendorpf and Baudonniere, 1993). In another, preschool children of ages 3 and 5 years were asked to imitate the intonation contours of declarative, interrogative and monotone utterances. Older children imitated contours more frequently than their younger counterparts, due largely to their already having a perceptual analysis of the interrogative contour (Loeb and Allen, 1993). Clearly, understanding the function and nature of imitation gives us a glimpse into the thinking-learning process.
The internal representation must be an abstraction from the sensory data. If we measure complexity by the number of variables the learner needs in order to master a skill, then without abstraction the complexity of the task depends on every "pixel" of each visual image, every acoustic noise present in each sound "sample", and so on. If humans employed verbatim processing, even the simple task of identifying bicycles of different colors and shapes would become intractable. To reduce the complexity, the learner must generalize and abstract from the details. The act of imitating requires the learner not only to build an internal representation but also to use that knowledge to demonstrate acquisition of the skill.
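A rough count makes the size of the problem concrete. The figures below are illustrative assumptions rather than measurements: a 640 by 480 image stream at 30 frames per second, audio at 44,100 samples per second, and an abstraction of five recognized steps, all over the two-second interval used in the earlier sound example.

```python
# All figures here are illustrative assumptions, not measured values.
SECONDS = 2                 # the two-second interval from the earlier example
IMAGE_WIDTH = 640           # assumed pixels per row
IMAGE_HEIGHT = 480          # assumed rows per image
FRAMES_PER_SECOND = 30      # assumed visual frame rate
AUDIO_SAMPLE_RATE = 44_100  # assumed audio samples per second

# Verbatim processing: every pixel and every sound sample is a variable.
visual_variables = IMAGE_WIDTH * IMAGE_HEIGHT * FRAMES_PER_SECOND * SECONDS
audio_variables = AUDIO_SAMPLE_RATE * SECONDS
verbatim_variables = visual_variables + audio_variables

# Abstracted processing: one variable per recognized high-level step,
# assuming a short sequence of five known steps.
abstracted_variables = 5

print(f"verbatim:   {verbatim_variables:,} variables")
print(f"abstracted: {abstracted_variables} variables")
```

Under these assumptions the verbatim count comes to roughly 18.5 million variables, against five for the abstracted sequence. The exact numbers matter far less than the gap between them.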

A Symbolic Computational Model

How might this all work in symbolic computation terms, and in the particular symbolic computation concepts of Minsky? It is his question after all. Minsky in Society of Mind (1985) provides mechanics to describe this behavior using picture frames, Trans-frames, and scripts. Consider the dance example described earlier. Assume that a repertoire of dance steps is already known, one that includes kicking with the beat, clapping hands on the beat, and jumping while turning 90 degrees and then landing on the beat. From these actions, a person could construct a high-level script that describes the sequence of dance steps witnessed; such a script appears in Table 1 below.

Script for Dance Steps:

1. kick left leg with beat
2. stand straight and clap hands on beat
3. kick right leg with beat
4. stand straight and clap hands on beat
5. jump 90 degrees left and land on floor with beat

Table 1. High-level script for dancing.
Each step in the script is a Trans-frame, which is a trajectory from a "before" situation to an "after" situation. In this case the Trans-frame for the first step in Table 1 contains a "before" picture with the person standing straight, and the "after" picture differs from the "before" picture in that the left leg is extended. See Diagram 1 below.

Diagram 1. A Trans-frame with its associated before and after picture frames that describe a dance step where the left leg is kicked out on the beat.
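One way to picture a Trans-frame in code is given below. It is a minimal sketch, assuming a picture frame can be reduced to a small dictionary of body-part states. The class name TransFrame, its fields, and the state labels are hypothetical choices for illustration and are not taken from Society of Mind.

```python
from dataclasses import dataclass


@dataclass
class TransFrame:
    """A trajectory from a 'before' picture frame to an 'after' picture frame."""
    action: str
    before: dict  # picture frame: body part -> state (hypothetical encoding)
    after: dict
    beats: int = 1  # temporal requirement: the change completes within this many beats

    def changes(self):
        """Return only the parts of the picture that differ between before and after."""
        return {
            part: (self.before.get(part), state)
            for part, state in self.after.items()
            if self.before.get(part) != state
        }


standing = {"left leg": "down", "right leg": "down", "hands": "at sides"}

kick_left = TransFrame(
    action="kick left leg with beat",
    before=standing,
    after={**standing, "left leg": "extended"},
)

print(kick_left.changes())
```

The changes method reports only the left leg, which mirrors the point that the "after" picture differs from the "before" picture in a single respect.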

The coordination of the movement of the leg to travel smoothly to its destination by the end of the beat reflects a temporal requirement on the before and after picture frames. The steps in a higher-level script typically expand to reveal a series of lower-level steps necessary for execution. Since each lower-level step also contains a Trans-frame, the overall result is a series of Trans-frames that exemplify the continuous movement of the leg from the standing position to the full leg extension. The resulting sequence of Trans-frames is akin to the still frames that make up a movie.
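The expansion of one high-level step into a run of lower-level Trans-frames can be sketched as follows. The sketch rests on strong simplifying assumptions: the leg position is reduced to a single hypothetical angle in degrees, and the movement is divided into equal linear increments. Neither assumption comes from Minsky's account; they only make the movie analogy concrete.

```python
def expand_trajectory(before_angle, after_angle, substeps):
    """Split one Trans-frame into `substeps` lower-level Trans-frames.

    Each lower-level Trans-frame is a (before, after) pair, and the 'after'
    of one is the 'before' of the next, like consecutive stills in a movie.
    """
    if substeps < 1:
        raise ValueError("substeps must be at least 1")
    step = (after_angle - before_angle) / substeps
    positions = [before_angle + step * i for i in range(substeps + 1)]
    return list(zip(positions[:-1], positions[1:]))


# Hypothetical encoding: 0 degrees = standing, 90 degrees = full leg extension.
frames = expand_trajectory(before_angle=0, after_angle=90, substeps=3)
print(frames)
```

Raising substeps yields more, smaller Trans-frames between the same two end pictures, much as a higher frame rate yields more stills for the same motion.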

Evolving Representations and Scripts

A script is a form of procedural abstraction, or stepwise refinement. Any step can be expanded into a sequence of lower-level steps that accomplish the goal stated by the higher-level step. A higher-level script can be performed only if the actions in the accompanying lower-level scripts can be executed. Without the details provided in the lower-level scripts, we might know that the leg is to be extended in our dance example, but we may not know how to accomplish this feat. The need to have already mastered underlying skills before executing a high-level script was found throughout the empirical evidence presented earlier. Music teachers had to first develop the aural skills of their students before the students could echo play. This case is analogous to the dance example, but the picture frames might show finger positions or sheet music on which notes are written mentally. The underlying ability to move the fingers or play the music must already exist in lower-level scripts if the higher-level script is to be executed.
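The dependence of a high-level script on its lower-level scripts can be sketched as a recursive expansion. This is an illustration only. The SCRIPTS table, the step names, and the functions refine and can_perform are hypothetical, and a step with no lower-level steps is assumed to be a primitive the performer can execute directly.

```python
# Hypothetical script library: each known step maps to its lower-level steps.
# An empty list marks a primitive the performer can execute directly.
SCRIPTS = {
    "dance": ["kick left leg", "clap hands"],
    "kick left leg": ["lift left leg", "extend left leg"],
    "clap hands": ["raise hands", "bring hands together"],
    "lift left leg": [],
    "extend left leg": [],
    "raise hands": [],
    "bring hands together": [],
}


def refine(step, scripts):
    """Expand a step into primitive actions by stepwise refinement.

    Raises KeyError if any step along the way has no entry in `scripts`,
    that is, if an underlying skill has not been mastered.
    """
    if step not in scripts:
        raise KeyError(f"no script for step: {step!r}")
    substeps = scripts[step]
    if not substeps:
        return [step]  # primitive: executable as is
    primitives = []
    for sub in substeps:
        primitives.extend(refine(sub, scripts))
    return primitives


def can_perform(step, scripts):
    """A high-level step is performable only if every lower-level script exists."""
    try:
        refine(step, scripts)
        return True
    except KeyError:
        return False


# A learner who never mastered extending the leg knows what to do, but not how.
novice = {name: subs for name, subs in SCRIPTS.items() if name != "extend left leg"}
```

Here can_perform("dance", SCRIPTS) holds, while can_perform("dance", novice) does not: the novice still has the high-level script for the kick but lacks one of the skills it expands into.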
Recall the earlier study where children had to already have a perceptual understanding of the interrogative contour before they could imitate the intonation contour of interrogative utterances. This too can be implemented in Minsky's mechanics. The high-level script involves speaking the words that were heard and adding the appropriate intonation. If a child doesn't know how to perform a particular intonation contour, the script will not be executed properly.
Further, if a person does not have the lower-level skill to execute the higher-level script, then the person may also lack the ability to detect important features present in the sensory data -- leaving the person's higher-level script devoid of important information. Consider the example mentioned earlier of one-year-old children who imitate and practice sounds. Let's assume for the sake of our example that a person is situated in front of the baby speaking English words. The baby must first decide how to break the stream of utterances into a few chunks. One strategy might be to record the first three to five syllables. The recording will be faulty since the baby has not yet developed a representation for storing sound patterns. This marks one of the differences between the approach taken in Society of Mind and the one taken here: the representation used to store information in a particular context evolves from learning and experience and is not just given innately. Let the initial or universal default representation be a synchronic composite of data from each sense.
The baby's nose may be smelling baby powder and lotion -- all smells to which the baby is accustomed, so no new information arrives from that sense. Since the baby's only activity is watching the adult, there are no interesting results from the touch and taste senses either. But both the visual and aural senses have activity. The adult is moving their lips and speaking words while looking at the baby. This gives us a series of groupings of sound and visual movement. The baby may choose to focus on either; choosing the visual movement, the baby will try to start moving its facial muscles. Since the baby has not yet put any significance on particular visual movements, only gross changes may have been recorded, such as moving the head while moving the lips. The baby practices what the baby perceived. This process continues through numerous iterations, refining the representation by focusing attention on particular details and improving the resulting behavior. In this way, both representations, which are data abstractions, and scripts, which are procedural abstractions, evolve from experience and practice. They are the result of the thinking-learning process and not the process itself.

References

Asendorpf, J. and Baudonniere, P. Self-awareness and other-awareness: mirror self-recognition and synchronic imitation among unfamiliar peers. Developmental Psychology Jan 1993, v29, n1, p88-95.
Gibson, J. Parents' Magazine Feb 1992, v67, n2, p158.
Loeb, D. and Allen, G. Preschoolers' imitation of intonation contours. Journal of Speech and Hearing Research Feb 1993, v36, n1, p4-13.
Minsky, M. The society of mind. New York: Simon and Schuster. 1985.
Papert, S. Mindstorms: children, computers, and powerful ideas. New York: Basic Books. 1980.
Parker, S. Imitation and circular reactions as evolved mechanisms for cognitive construction. Human Development Nov-Dec 1993, v36, n6, p309-323.
Poulson, C., Kymissis, E., Reeve, K., Andreatos, M. and Reeve, L. Generalized vocal imitation in infants. Journal of Experimental Child Psychology April 1991, v51, n2, p267-279.
Skinner, B. Programmed Instruction Revisited. The Education Digest, 1987, v52, p12-16.
Winkler, J. How aurally competent are your students? American Music Teacher Dec-Jan 1995, v45, n3, p10-13.

Written May 1996 in response to Marvin Minsky's question.

Latanya Sweeney's Home Page, Last modified Fall 2004 by latanya@dataprivacylab.org