In recent weeks, the Craiyon (formerly DALL-E mini) multimodal artificial intelligence (AI) model has soared in popularity on the internet. In this model, users input text prompts, which are paired with image inputs to generate newly created images. The images aim to depict text prompts and, as the model learns and responds to more requests, its self-generated images gradually match more closely with the requests of its users. As users have pushed to come up with increasingly bizarre and humorous image results, and these images have been widely reshared, the Craiyon has given the public a playful introduction to multimodal AI technology. However, the potential of multimodal AI deep learning goes far beyond creating shareable joke images—multimodal AIs have a wide range of use cases and will likely come to transform human lifeways and business organization.
Multimodal AI Technology's Influence – From Learning to Shopping & Beyond
One such example of how multimodal AI technology might heavily influence our lives is Facial Emotion Recognition (FER). FER AI models attempt to infer emotional states through analyzing facial expressions present in still images. Facial expressions are coded within the models to align with certain emotional states such as sadness, happiness, or anger. These models could be used to evaluate consumer behavior and help companies understand the popularity of certain products, packaging, or store designs by tracking changing emotional states of shoppers, or they might be used to evaluate learning in real time. By tracking changing emotional states of students, FER technology could help teachers better understand what teaching styles, lesson plans, and methodologies resonate most with their students.
These are just a couple of potential applications of this technology, but its utility is far broader. As multimodal AI technology becomes more refined, FER models will likely become far more ubiquitous in public life. However, FER technology has limitations. RTI’s Lab 58 is working to overcome these limitations by developing a multimodal, deep learning AI model, one that processes audio along with visual data.
What is Multimodal Learning?
Human beings process the world multimodally. We rely on a wide range of sensory inputs to arrive at conclusions or grasp concepts—audio, visual, text, tactile, etc. Humans are also frequently called to produce forms of feedback—text, for example—based on separate varieties of modal inputs, such as image or audio. For example, in our daily lives we might describe the content of images with speech or written text.
Similarly, AI models can be trained to produce outputs corresponding to data inputs of other modalities. Furthermore, they can be trained to produce outputs derived from multiple kinds of data inputs simultaneously. Importantly, output modalities of multimodal learning AI models tend to hold greater accuracy when multiple sources of modal inputs are present as the models learn. But importantly, human beings do not process information and arrive at conclusions with equal weighting given to the different modal inputs—they often privilege some modalities of data over others depending on situational contexts. AI models can also be designed and trained using weighted algorithms such that certain inputs count more heavily when producing outputs. And over time, as they are utilized more extensively, the processing abilities of multimodal deep learning AI models progressively advance, allowing them to produce more relevant, useful outputs.
Multimodal AI Shortcomings – Looks Can Be Deceiving
Much of the skepticism presently directed toward FER technologies concerns the accuracy of pairing facial expressions with certain emotional states. Although it is true that people do tend to produce certain facial expressions or movements when feeling certain emotions (e.g., smiling when happy) at a level significantly greater than chance, facial expressions and emotional states do not always correlate with one another. Modes of emotional expression and communication are not universal and vary significantly across different cultures as well as within specific situational contexts.
Contemporary FER may be unable to follow through on its promise of delivering concrete determinations of emotional states. This being said, human beings themselves are not infallible when interpreting others’ feelings, especially based on visual cues or facial expressions alone. Almost everyone can recall a situation in which they misread someone’s emotional state. As such, the present limitations of FER AI models do not necessarily represent a flaw with the technology itself but more so a difficulty produced by the non-universality of human social relations and emotional expression writ large. Just as human beings depend on further context clues and different sensory modes for detecting emotions (vocal tone and timbre, speech content, etc.), emotion recognition AI models might be improved by utilizing multimodal learning rather than relying solely on visual processing.