The words modal and mode both derive from the Latin word modus, which means either a way of doing things or a measure. Several other common words come from modus; model, for example, originally means a small measure. The meaning of both modal and mode is typically a way of doing things, although there are exceptions. Multimodal would then mean many ways of doing things.
There are two more precise meanings of the term multimodal: communication can be multimodal on the production side, when we use several means of expression at once, and on the reception side, when we take it in through several senses.
These two meanings can be actualized at the same time in a handshake. A handshake is a way to communicate and, more specifically, a way to greet. When we shake hands we use at least three means to appear friendly to the other person: we say "hi" and present ourselves, we smile or produce some other facial expression, and we grip the other person's hand. The act activates three senses in both the receiver and the producer: we see each other, we hear each other and we touch each other.
Speech is an example of communication that is produced in a single way: we produce sounds with our speech organs. If we consider this to be one means of production, then speech is unimodal on the production side. On the reception side, on the other hand, it is bimodal: we both see and hear what a person is saying. We can actually deduce a lot just from watching the lip movements, and when we speak in a noisy environment we rely more on seeing the words than on hearing them. On the receiver side, speech is therefore multimodal.
The combination of text and image is multimodal on the production side but unimodal on the reception side, since it only involves vision. Even so, text and image are processed differently in the receiver; we use different parts of the brain to interpret text and image respectively. This is still not called multimodal perception. A suggestion would be to call it sub-modal perception and describe it as two sub-modal processes going on simultaneously.
Several fields use the term multimodal to explain what is going on. Within the field of Human-Computer Interaction (HCI), the interface can be multimodal. We use different means to create input: a touchscreen, a mouse, a keyboard or voice control. As output we usually get auditory and visual information. Tactile information is immediate when we touch things to produce input.
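As a rough illustration, the sketch below wires up such a multimodal interface in a browser: pointer and keyboard events as input, voice input where the browser supports it, and both visual and spoken feedback as output. It is only a minimal sketch under assumptions of my own; the element id "status" and the feedback messages are invented for the example, and the Web Speech API used for voice input is not available in every browser.

```ts
// Minimal sketch of a multimodal web interface: pointer (touchscreen/mouse),
// keyboard and voice as input; visual and auditory feedback as output.
// The element id "status" is an assumption made for this example.

const status = document.getElementById("status") as HTMLElement;

// Visual output: update text on screen.
function showVisual(message: string): void {
  status.textContent = message;
}

// Auditory output: speak the same message aloud.
function speak(message: string): void {
  speechSynthesis.speak(new SpeechSynthesisUtterance(message));
}

// Pointer input covers both touchscreen taps and mouse clicks.
document.addEventListener("pointerdown", (e: PointerEvent) => {
  showVisual(`Pointer input at (${e.clientX}, ${e.clientY})`);
});

// Keyboard input.
document.addEventListener("keydown", (e: KeyboardEvent) => {
  showVisual(`Keyboard input: ${e.key}`);
});

// Voice input, where supported (prefixed in some browsers).
const Recognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
if (Recognition) {
  const recognizer = new Recognition();
  recognizer.onresult = (event: any) => {
    const phrase = event.results[0][0].transcript;
    showVisual(`Voice input: ${phrase}`);
    speak(`You said ${phrase}`); // respond in a second output modality
  };
  recognizer.start();
}
```

The point of the sketch is simply that one interaction can flow through several input modalities and be answered in several output modalities at once, which is what makes the interface multimodal in the HCI sense.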
Within the field of web design the output is usually visual, but in some cases there is also sound. On the screen we experience a combination of text and image. These modalities can be broken down into sub-modalities such as color, size, font and movement.
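A small sketch of how such sub-modalities might be set from code follows; the element id "headline" and the specific values are assumptions made for the example.

```ts
// Minimal sketch of visual sub-modalities on a web page: color, size, font
// and movement applied to one text element. The id "headline" is assumed.

const headline = document.getElementById("headline") as HTMLElement;

headline.style.color = "darkred";             // sub-modality: color
headline.style.fontSize = "2rem";             // sub-modality: size
headline.style.fontFamily = "Georgia, serif"; // sub-modality: font

// Sub-modality: movement, expressed as a simple animation.
headline.animate(
  [{ transform: "translateX(0)" }, { transform: "translateX(20px)" }],
  { duration: 500, iterations: 2, direction: "alternate" }
);
```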
The field of learning discusses multimodality a great deal these days. One part of the discussion is cognitive, focusing on how we can process all the information we receive at the same time and which combinations are fruitful. The other part is about the design of learning environments. What modalities do we use? What modalities should we use? Is there a best practice?
Human communication is originally multimodal, since we are born with several means of production and several senses. The standard way to communicate is therefore multimodal and usually face to face. We start to react when the communication loses one or several modalities. That is when we realize that multimodal communication helps us avoid miscommunication and misunderstanding, which are more likely when we only have one modality to rely on.
Multimodality is studied within several scientific disciplines, which in turn are divided into different traditions. One tradition can focus on the production side (output) while another focuses on the reception side (input). One tradition can focus on interpretation while another focuses on objective measurements.
The sociosemiotic tradition focuses on the interpretation of what we produce. Here the modes are bound to the cultural and social field of signs used in a specific culture. This tradition is present within the design of learning environments, the design of web pages and visual human communication.
The psycholinguistic tradition is mostly interested in how we receive and process human speech. Its studies rely on traditional psychological experiments, often involving video recordings. From this point of view it is obvious that perception can be both unimodal and multimodal. Multimodal communication can help us achieve clarity, complexity and robustness if the production is well executed; in the worst case, the simultaneous input can be confusing and distracting. It is also known that processing in the brain is distributed, with different areas handling specific kinds of sensory information, and that the challenge is to integrate all of it into one whole.
The HCI tradition and the human communication tradition look at production/output and reception/input at the same time. This is how we actually use several modalities, and this is how multimodality should be understood.