Common ground has been claimed to be a necessary ingredient for many aspects of collaboration and communication. For instance, to understand a spoken sentence, the meaning of the words that the speaker uses must be known to both speaker and hearer. Grounding is the process of augmenting and maintaining this common ground. In addition to the mentioning of facts and proposals in the presence of another, this process involves diagnosis (monitoring the state of the other collaborator) and feedback. When things are going smoothly, feedback is just simple acknowledgement (perhaps implicit); when understanding seems to deviate from commonality, however, feedback takes the form of repairs.

There have been several proposals for formally modelling this kind of mutuality. When common ground concerns simple beliefs, the most common representation of commonality is iterated belief (A believes X, and A believes B believes X, and A believes B believes A believes X, ...), or access to a shared situation, formulated by [Lewis69] as:

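Let us say that it is common knowledge in a population P that X if and only if some state of affairs A holds such that:

(1) Everyone in P has reason to believe that A holds.

(2) A indicates to everyone in P that everyone in P has reason to believe that A holds.

(3) A indicates to everyone in P that X.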

Clark and Marshall (1981) pointed out that using such a schema requires a number of assumptions in addition to the mere accessibility or presentation of information. Clark and Schaefer (1989) went beyond this, claiming that feedback of some sort is needed to actually ground material in conversation, and that this grounding process is collaborative, requiring effort by both partners to achieve common ground. They point out that it is not necessary to fully ground every aspect of the interaction, merely that the conversants reach the grounding criterion: "The contributor and the partners mutually believe that the partners have understood what the contributor meant to a criterion sufficient for the current purpose." What this criterion may be depends, of course, on the reasons for needing this information in common ground, and can vary with the type of information and the collaborators' local and overall goals. They also point out that the conversants have different ways of providing evidence of understanding, which vary in strength. These include display of what has been understood, acknowledgments, and continuing with the next expected step, as well as continued attention.
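The relation between evidence and the grounding criterion can be made concrete with a minimal sketch. It is an illustration only: the names `Evidence` and `is_grounded` are hypothetical, and the numeric ranks are an assumption, since the text above says only that the kinds of evidence vary in strength, not how they are ordered.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Kinds of positive evidence named in the text.

    ASSUMPTION: the numeric ranks are illustrative. The text states that
    evidence varies in strength but does not give this ordering.
    """
    CONTINUED_ATTENTION = 1
    NEXT_EXPECTED_STEP = 2
    ACKNOWLEDGMENT = 3
    DISPLAY = 4


def is_grounded(evidence_seen, criterion):
    """Return True if the strongest evidence observed meets the criterion.

    `criterion` stands for the grounding criterion: the strength of evidence
    judged sufficient for the current purpose.
    """
    if not evidence_seen:
        return False
    return max(evidence_seen) >= criterion


seen = [Evidence.CONTINUED_ATTENTION, Evidence.ACKNOWLEDGMENT]
print(is_grounded(seen, Evidence.ACKNOWLEDGMENT))  # True
print(is_grounded(seen, Evidence.DISPLAY))         # False
```

Treating the criterion as a parameter rather than a constant reflects the point that the required strength varies with the type of information and with the collaborators' goals.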

Clark and Brennan (1991) discuss grounding in different media. They point out that different media bring different resources and constraints to grounding, as well as different associated costs. They describe several media (including face-to-face, telephone, video-teleconference, terminal teleconference, and email) according to whether they have the following properties: copresence (the participants can see the same things), visibility (they can see each other), audibility (they can hear each other), cotemporality (messages are received at the same time as they are sent), simultaneity (whether both parties can send messages at the same time or have to take turns), sequentiality (whether the turns can get out of sequence), reviewability (messages can be reviewed after they have first been received), and revisability (the producer can edit the message privately before sending). The following costs are also considered for these media: formulation costs (how easy it is to decide exactly what to say), production costs (articulating or typing the message), reception costs (listening to or reading the message, including attention and waiting time), understanding costs (interpreting the message in context), start-up costs (initiating a conversation, including summoning the other partner's attention), delay costs (making the receiver wait during formulation), asynchrony costs (not being able to tell what is being responded to), speaker change costs, fault costs, and repair costs. Since different media have different combinations of these constraints and costs, one would expect the principle of least collaborative effort to predict different styles of grounding in different media.
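One way to make such comparisons between media explicit is to represent each medium as a set of constraint flags. The sketch below is illustrative: the names `Constraint`, `MEDIA` and `missing` are hypothetical, and the assignment of properties to the three media shown follows our reading of Clark and Brennan (1991), which should be checked against that source before being relied upon.

```python
from enum import Flag, auto


class Constraint(Flag):
    """The eight properties listed in the text, one bit each."""
    COPRESENCE = auto()
    VISIBILITY = auto()
    AUDIBILITY = auto()
    COTEMPORALITY = auto()
    SIMULTANEITY = auto()
    SEQUENTIALITY = auto()
    REVIEWABILITY = auto()
    REVISABILITY = auto()


C = Constraint

# ASSUMPTION: property assignments recalled from Clark and Brennan (1991);
# they are not given in the text above and are for illustration only.
MEDIA = {
    "face-to-face": (C.COPRESENCE | C.VISIBILITY | C.AUDIBILITY
                     | C.COTEMPORALITY | C.SIMULTANEITY | C.SEQUENTIALITY),
    "telephone": (C.AUDIBILITY | C.COTEMPORALITY
                  | C.SIMULTANEITY | C.SEQUENTIALITY),
    "email": C.REVIEWABILITY | C.REVISABILITY,
}


def missing(medium, reference="face-to-face"):
    """Names of the constraints that `reference` has and `medium` lacks."""
    diff = MEDIA[reference] & ~MEDIA[medium]
    return sorted(c.name for c in Constraint if c in diff)


print(missing("telephone"))  # ['COPRESENCE', 'VISIBILITY']
```

A comparison of this kind shows only which grounding resources a medium lacks; the costs listed above would need to be modelled separately to predict which grounding techniques are actually preferred.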

In the human-computer collaborative systems that we previously designed, communication was mainly text-based for the machine agent and based on direct manipulation for the human agent. Direct manipulation includes pointing gestures, which are important in grounding, especially for resolving referential ambiguities [Frohlich93]. Since our research goal is to design computational agents capable of grounding, we wanted to reduce the cost of grounding by providing agents with multi-modal communication. In a communicative setting, collaborators take advantage of all the media available to help them in their task. In a face-to-face setting, this includes eye-gaze and gesture as well as speech, but can also include writing notes and drawing schemata. This type of interaction is also becoming increasingly important for computer-mediated and human-computer collaboration. As technologies for communicating with and through computers by modes other than typing and displaying text become more widely available, it becomes more important to study how these technologies can facilitate various aspects of collaboration, including grounding.

Grounding is not a monolithic process. Many aspects of communicating involve grounding, and properly communicating and grounding content requires action at multiple levels of interaction. Clark [Clark94] identifies four different levels of conversation at which problems for maintaining common ground may arise. These are:

Level 1: Vocalization and Attention - is the receiver attending to the speaker, and can the producer successfully articulate the message?

Level 2: Presentation and Identification - can the message be successfully presented so that the receiver can identify, e.g., the words and structure of a sentence?

Level 3: Meaning and Understanding - can the receiver understand what was meant by the message?

Level 4: Proposal and Uptake - will the receiver commit to the proposal made by the producer?

While actual vocalization really only applies to spoken conversation, we can generalize this level to the notion of access to information. Similarly, level 2 can be generalized to the concept of whether an agent has noticed the information (as a message from the sender, and not, e.g., as part of the environment). In addition, Clark distinguishes three ways of dealing with (potential) problems at any of these levels: preventatives, which prevent a foreseeable problem; warnings, which signal a problem that cannot be avoided; and repairs, which fix a problem once it has occurred. We similarly distinguish three categories of grounding acts: monitoring, diagnosis and repair. Table 1 illustrates how these two dimensions can be used to characterize how the actions of both participants relate to the grounding process.
Table 1: Grounding acts and conversational level

| Grounding act | Level | From A's viewpoint | From B's viewpoint |
|---|---|---|---|
| Monitoring | | Passive/inferential (how A reasons about B's knowledge) | Pro-active (how B can help A to know about B) |
| | 1 | A infers whether B can access X | B tells A what B can access |
| | 2 | A infers whether B has noticed X | B tells (or shows) A that B perceived X |
| | 3 | A infers whether B understood X | B tells A how B understands X |
| | 4 | A infers whether B (dis)agrees | B tells A that B (dis)agrees about X |
| Diagnosis | | Active (how A tries to know that B knows X) | Reactive (how B participates in A's grounding) |
| | 1 | A joins B to initiate copresence | B joins A |
| | 2 | A asks B to acknowledge X | B acknowledges X |
| | 3 | A asks B a question about X | B displays understanding or requests repair of X |
| | 4 | A persuades B to agree about X | B (dis)agrees on X |
| Repair | | How A repairs B's ignorance of X | How B repairs the fact that A is unaware that B knows X |
| | 1 | A makes X accessible to B | B mentions or manipulates X |
| | 2 | A communicates X to B | B communicates X to A |
| | 3 | A repeats / rephrases / explains X | B repeats / rephrases / explains X |
| | 4 | A argues about X | B argues about X |

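For an agent that has to select or recognize grounding acts, the two dimensions of Table 1 can be encoded as a lookup keyed by act and level. The following is a minimal sketch under stated assumptions: the names `Act`, `Level`, `TABLE_1` and `describe` are hypothetical, and the short level labels (access, perception, understanding, agreement) are our own shorthand for the four levels rather than terms taken from [Clark94]. The cell texts follow Table 1.

```python
from enum import Enum, IntEnum


class Level(IntEnum):
    """Conversational levels; the label names are illustrative shorthand."""
    ACCESS = 1
    PERCEPTION = 2
    UNDERSTANDING = 3
    AGREEMENT = 4


class Act(Enum):
    """The three categories of grounding acts."""
    MONITORING = "monitoring"
    DIAGNOSIS = "diagnosis"
    REPAIR = "repair"


# (act, level) -> (action from A's viewpoint, action from B's viewpoint)
TABLE_1 = {
    (Act.MONITORING, Level.ACCESS): (
        "A infers whether B can access X", "B tells A what B can access"),
    (Act.MONITORING, Level.PERCEPTION): (
        "A infers whether B has noticed X",
        "B tells (or shows) A that B perceived X"),
    (Act.MONITORING, Level.UNDERSTANDING): (
        "A infers whether B understood X", "B tells A how B understands X"),
    (Act.MONITORING, Level.AGREEMENT): (
        "A infers whether B (dis)agrees",
        "B tells A that B (dis)agrees about X"),
    (Act.DIAGNOSIS, Level.ACCESS): (
        "A joins B to initiate copresence", "B joins A"),
    (Act.DIAGNOSIS, Level.PERCEPTION): (
        "A asks B to acknowledge X", "B acknowledges X"),
    (Act.DIAGNOSIS, Level.UNDERSTANDING): (
        "A asks B a question about X",
        "B displays understanding or requests repair of X"),
    (Act.DIAGNOSIS, Level.AGREEMENT): (
        "A persuades B to agree about X", "B (dis)agrees on X"),
    (Act.REPAIR, Level.ACCESS): (
        "A makes X accessible to B", "B mentions or manipulates X"),
    (Act.REPAIR, Level.PERCEPTION): (
        "A communicates X to B", "B communicates X to A"),
    (Act.REPAIR, Level.UNDERSTANDING): (
        "A repeats / rephrases / explains X",
        "B repeats / rephrases / explains X"),
    (Act.REPAIR, Level.AGREEMENT): (
        "A argues about X", "B argues about X"),
}


def describe(act, level, viewpoint):
    """Return the Table 1 cell for an act, a level, and viewpoint 'A' or 'B'."""
    if viewpoint not in ("A", "B"):
        raise ValueError("viewpoint must be 'A' or 'B'")
    from_a, from_b = TABLE_1[(act, level)]
    return from_a if viewpoint == "A" else from_b


print(describe(Act.DIAGNOSIS, Level.UNDERSTANDING, "B"))
# B displays understanding or requests repair of X
```

Keying the table on both dimensions keeps the act category and the conversational level independent, so an observed action can be classified on each dimension separately.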
In section 4, we use aspects of this taxonomy to analyse the grounding behaviour in collaborative problem solving. First, we describe the setting for our observations.

