In a broad sense, prosody can be viewed as the whole gamut of features that determine not what people are saying, but how they are saying it. Originally, the term referred strictly to verbal prosody, i.e., the set of suprasegmental features encoded in the speech signal itself, such as intonation (speech melody), rhythm, tempo, loudness, voice quality and pausing. More recently, various researchers have broadened the definition to also include visual prosody, i.e., specific forms of body language that communication partners send to each other during an interaction, such as facial expressions, arm and body gestures, and pointing. Both verbal and visual prosody are omnipresent in natural conversations: it would be extremely unnatural to produce utterances without variations in intonation, tempo, loudness, etc.; similarly, since conversants can see each other in many forms of spoken communication, it would be odd if they stayed completely immobile during their interactions. The FOAP project is concerned with a functional approach to both verbal and visual prosody (together called audiovisual prosody) in spoken conversations.

It is intuitively clear that prosody plays an important role in everyday spoken interactions. In general, it provides utterances with ‘extra’ information that is often not explicitly contained in the lexical and syntactic make-up of a sentence. Indeed, various studies of verbal prosody have shown that it can be used to mark information structure and turn-taking, or to add expressive power to the propositional content of an utterance. But while we have learned a lot about the pragmatics of verbal prosody, we still know relatively little about how auditory cues combine with visual ones.