
An Affect-Based Multimodal
Video Recommendation System

Vilnius Gediminas Technical University,

Sauletekio al. 11, Vilnius, LT-10223, Lithuania (Corresponding author)

Abstract: People watching a video can almost always suppress their speech, but they cannot suppress their body language or control their physiological and behavioral parameters. Affects/emotions, sensory processing, actions/motor behavior and motivation are linked to the limbic system, which is responsible for instinctive and instantaneous human reactions to the environment or to other people. Limbic reactions are immediate, sure, time-tested and occur in all people. Such reactions are highly spontaneous and reflect the video viewer’s real feelings and desires, rather than deliberately calculated ones. The limbic system is also linked to emotions, usually conveyed by facial expressions and movements of the legs, arms and/or other body parts. All physiological and behavioral parameters require consideration to determine a video viewer’s emotions and wishes. This is why the Affect-based multimodal video recommendation system (ARTIST), developed by the authors of this article, is well suited to the task. ARTIST was developed and fine-tuned during the TEMPUS project “Reformation of the Curricula on Built Environment in the Eastern Neighbouring Area”. ARTIST can analyze the facial expressions and physiological parameters of a viewer while he or she watches a video. This analysis allows better control over alternative sequences of film clips for a video and can even prompt ending the video if nothing suitable for the viewer is available in the database. The system can consider a viewer’s emotions (happy, sad, angry, surprised, scared, disgusted and neutral) and choose rational video clips in real time. The analysis of a viewer’s facial expressions and physiological parameters can indicate which video clips to offer viewers at a given moment.

Keywords: facial expressions; physiological video retrieval; affect-based multimodal video recommendation system; TEMPUS CENEAST project.

CITE THIS PAPER AS: Arturas KAKLAUSKAS, Renaldas GUDAUSKAS, Matas KOZLOVAS, Lina PECIURE, Natalija LEPKOVA, Justas CERKAUSKAS, Audrius BANAITIS, An Affect-Based Multimodal Video Recommendation System, Studies in Informatics and Control, ISSN 1220-1766, vol. 25(1), pp. 5-14, 2016.

  1. Introduction

Those watching a film rarely speak, but they cannot suppress their body language. Body language is linked to the limbic system responsible for instinctive and instantaneous human reactions to their environment or other people. Such reactions are highly spontaneous and reflect the person’s real feelings and desires, rather than calculated ones. The limbic system is also linked to emotions, usually conveyed through facial expressions and movements of legs, arms or other body parts. All this should be considered in determining the viewer’s emotions and wishes. The deliberate control of one’s body tends to look unnatural: movements fall behind utterances and hardly look genuine.

As someone watches a film, affective systems can analyse the viewer’s gestures, movements, touches, posture, and facial and eye expressions. Such observations offer extra information about the person’s character, emotions and reactions. Monitoring a film viewer’s facial expressions leads to better control over the sequence of the film’s alternative video clips, or can even prompt ending the film if nothing that might suit the viewer is available in the database.

The system can consider the viewer’s emotions (happy, sad, angry, surprised, scared, disgusted or a neutral state) and choose rational video clips in real time. By analysing body language, the system can offer viewers the video clips they prefer at the moment. Such systems are akin to people with high emotional intelligence: people who are well aware and perceptive of their own feelings are better at analysing and spotting those of others, and can take a deeper, more meaningful approach to the world and the people around them.

Videos are now regarded as truly big data. Venter & Stein (2012), for instance, believe that video images and image sequences today comprise about 80 percent of all corporate and public unstructured big data. Recent advances in multimedia technology have led to tremendous increases in the available volume of video data, creating a major requirement for efficient systems to manage such huge data volumes (Mehmood et al. 2015). With the fast proliferation of multimedia and video display devices, searching for and watching videos on the Internet has become an indispensable part of our daily lives. Many video-sharing websites offer services for searching and recommending videos from an exponentially growing repository of videos uploaded by individual users (Niu et al. 2015).

The term big data, when it refers to videos, often denotes the exponential growth and availability of videos. The enormous supply of videos, growing daily, and users’ ability to choose any video they need make video content and prescriptive analytics a necessity. Different methods and technologies have been proposed worldwide to handle this task. Venter & Stein (2012), for instance, believe that, as the growth of unstructured data accelerates, analytical systems must assimilate and interpret images and videos as well as they interpret structured data, such as texts and numbers. Prescriptive analytics leverages the emergence of big data and computational and scientific advances in the fields of statistics, mathematics, operations research, business rules and machine learning (Venter & Stein 2012). The explosion of user-generated, untagged multimedia data in recent years generates a strong need for efficient search and retrieval of this data. The predominant method for content-based tagging is slow, labor-intensive manual annotation. Consequently, automatic tagging is currently a subject of intensive research, although it is clear that the process will not be fully automated in the foreseeable future (Koelstra and Patras 2013).

Recently, annotation according to affective or emotional video categories has been gaining ground (Joho et al. 2011, Hanjalic et al. 2008, Moncrieff et al. 2001, Calvo & D’Mello 2010, Wang & Cheong 2006).

The main objective is to make recommendations personalized and situation-sensitive. If the affective content of a video is detected, it becomes easy to build an intelligent video recommendation system that recommends videos based on the user’s current emotion and interest. For example, when the user is sad, the system can automatically recommend happy movies; when the user is tired, the system may suggest a relaxing movie (Joho et al. 2011). In general, there are three popular kinds of affective analysis methods. Categorical affective content analysis methods usually define a few basic affective groups and discrete emotions, for example, “happiness”, “sadness” and “fear”; video/audio content is then classified into these predefined groups. The second type, dimensional affective content analysis (for example, the psychological Arousal-Valence (A-V) affective model), commonly employs a dimensional affective model to compute the affective state. The third type is personalized affective content analysis (Lu et al. 2011).
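As an illustration of how a dimensional A-V model can feed a categorical method, the sketch below quantizes a point in the Arousal-Valence plane into one of a few discrete emotion labels. The thresholds and the quadrant-to-label mapping are illustrative assumptions, not the method of any system cited here.

```python
# Illustrative quantization of the Arousal-Valence (A-V) plane into the
# discrete emotion labels of a categorical method. Thresholds are assumptions.

def av_to_category(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point in [-1, 1] x [-1, 1] to a coarse label."""
    if abs(valence) < 0.2 and abs(arousal) < 0.2:
        return "neutral"  # near the origin: affect too weak to classify
    if valence >= 0:
        # Positive valence: high arousal reads as joy, low arousal as calm.
        return "happy" if arousal >= 0 else "relaxed"
    # Negative valence: high arousal suggests anger/fear, low arousal sadness.
    return "angry" if arousal >= 0 else "sad"
```

For example, a viewer measured at valence 0.7 and arousal 0.5 would fall into the “happy” group.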

Video classification and recommendation based on affective analysis of viewers aim at finding interesting and suitable videos for users by using different metadata. Video metadata are of two types: (i) non-affective (such as genre, director, actors, etc.) and (ii) affective (the expected feeling or emotion). Methods focused on affective video analysis can be divided into two categories according to the way Affective Metadata (AM) are generated: (i) explicit (asking the user to assign an affective label to the observed video clip) and (ii) implicit (detecting the user’s affective response or analyzing the affective components of the video automatically) (Niu et al. 2015).
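The two metadata types can be pictured with a small data structure. The field names and the cosine similarity over affective vectors below are assumptions chosen for illustration; they are not the similarity measure of Niu et al. (2015).

```python
# Sketch of a video record carrying both non-affective and affective metadata,
# with a simple cosine similarity between affective profiles. All names and
# the choice of similarity are illustrative assumptions.
from dataclasses import dataclass, field
from math import sqrt

EMOTIONS = ("happy", "sad", "angry", "surprised", "scared", "disgusted", "neutral")

@dataclass
class VideoMetadata:
    # Non-affective metadata
    genre: str
    director: str
    # Affective metadata: expected intensity per emotion in [0, 1], obtained
    # either explicitly (user labels) or implicitly (measured responses).
    affect: dict = field(default_factory=dict)

def affective_similarity(a: VideoMetadata, b: VideoMetadata) -> float:
    """Cosine similarity between two affective metadata vectors."""
    va = [a.affect.get(e, 0.0) for e in EMOTIONS]
    vb = [b.affect.get(e, 0.0) for e in EMOTIONS]
    dot = sum(x * y for x, y in zip(va, vb))
    na, nb = sqrt(sum(x * x for x in va)), sqrt(sum(x * x for x in vb))
    return dot / (na * nb) if na and nb else 0.0
```

Two videos with identical affective profiles score 1.0; videos whose profiles share no emotions score 0.0, regardless of their non-affective metadata.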

A personalized search is the fundamental goal of video content and prescriptive analytics, which aim to tailor the integration of data, information and knowledge about a user, beyond the explicit query, precisely to that person’s tasks.

Niu et al. (2015) point out that the issue of finding videos suited to a user’s personal preferences or measuring the similarity between videos poses various challenges.

Video content and prescriptive analytics are also gaining ground in the Internet of Things. The Affective Tutoring System for Built Environment Management (Kaklauskas et al. 2015), for instance, can track on a student’s computer when and where the student was most productive and share that information with the lecturer’s computer. How might different stakeholders benefit from this concept as well? Could, for instance, a student video analytics system interact with the computers of lecturers in a university? Would a lecturer wish to know whether a student taking an examination in the classroom, in close proximity, is cheating? Obviously, a lecturer might want such information, but a student would want to conceal such a fact. Privacy issues will continue to be a major concern in the future. Partial distribution of biometrics could be valuable, and perhaps even essential, to enter a venue.

Researchers worldwide are working on video retrieval and recommendation systems that employ only unimodal affective analysis. However, multimodal video retrieval and recommendation systems are also under development, aiming to overcome the limitations of unimodal systems. Many researchers and practitioners, for instance, combine affective video analysis with physiological information and data: Soleymani et al. (2011) use galvanic skin response (GSR), electromyography (EMG), blood pressure, breathing rate and skin temperature; Money & Agius (2008) use GSR, breathing rate, blood volume pulse feedback and heart rate; and Koelstra & Patras (2013) use electroencephalogram (EEG) and peripheral physiological signals. A brief review of the above-mentioned systems follows.
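A minimal sketch of how such multimodal physiological signals might be fused at the decision level follows. The normalization ranges, channel weights and the choice of a weighted average are illustrative assumptions, not parameters taken from the cited systems.

```python
# Decision-level fusion of per-modality arousal estimates (GSR, heart rate,
# facial expression score) into one value in [0, 1]. Ranges and weights are
# illustrative assumptions only.

def normalize(value: float, lo: float, hi: float) -> float:
    """Clamp and scale a raw sensor reading to [0, 1]."""
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

def fused_arousal(gsr_microsiemens: float, heart_rate_bpm: float,
                  face_arousal: float) -> float:
    """Weighted average of per-modality arousal estimates."""
    channels = [
        (normalize(gsr_microsiemens, 1.0, 20.0), 0.4),   # GSR channel
        (normalize(heart_rate_bpm, 50.0, 120.0), 0.3),   # heart-rate channel
        (face_arousal, 0.3),                             # facial expression
    ]
    return sum(v * w for v, w in channels) / sum(w for _, w in channels)
```

A real system would calibrate the ranges per viewer and could learn the weights from labeled data; the fixed values here only show the shape of the fusion step.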

A viewer’s attention is based on multiple sensory perceptions, i.e., aural and visual, as well as the viewer’s neuronal signals (Mehmood et al. 2015). Facial expression is one of several modes of nonverbal communication. The message value of the various modes may differ depending on context and may be congruent or discrepant with one another. An interesting research topic is the integration of facial expression analysis with that of gesture, prosody and speech. Combining facial features with acoustic features would help separate the effects of facial actions due to facial expression from those due to speech-related movements (Fox et al. 2003).

Lately, interdisciplinary studies (Ringeval et al. 2015, Grafsgaard et al. 2014) have aimed to develop methods, tools, devices and analytical techniques for reliable real-time analysis of emotions from different modalities (physiological signals, audio and video) and for decision making (Filip 2008, Filip & Leiviskä 2009, Filip et al. 2014). Achieving this aim involves the use of physiological sensors, physiological measures, facial expression analysis systems, self-report measures and other tools.


References

  1. CALVO, R., S. D’MELLO, Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications, IEEE Transactions on Affective Computing, vol. 1(1), 2010, pp. 18-37.
  2. FACEREADER, Reference Manual Version 6. Tool for Automatic Analysis of Facial Expressions. Noldus Information Technology, 2014, p. 183.
  3. FILIP, F. G., Decision Support and Control for Large-scale Complex Systems, Annual Reviews in Control, vol. 32(1), 2008, pp. 61-70.
  4. FILIP, F. G., A. SUDUC, M. BÎZOI, DSS in Numbers, Technological and Economic Development of Economy, vol. 20(1), 2014, pp. 154-164.
  5. FILIP, F. G., K. LEIVISKÄ, Large-Scale Complex Systems. Springer Handbook of Automation. 2009, pp. 619-638.
  6. FOX, N., R. GROSS, P. DE CHAZAL, J. COHN, R. REILLY, Person Identification using Multi-modal Features: Speech, Lip, and Face. in Proc. of ACM Multimedia Workshop in Biometrics Methods and Applications (WBMA 2003), CA, 2003.
  7. GRAFSGAARD, J. F., J. B. WIGGINS, K. E. BOYER, E. N. WIEBE, J. C. LESTER, Predicting Learning and Affect from Multimodal Data Streams in Task-oriented Tutorial Dialogue. In: Stamper, J., Pardos, Z., Mavrikis, M., McLaren, B. M. (Eds.), Proceedings of the 7th International Conference on Educational Data Mining, London, England: International Educational Data Mining Society, 2014, pp. 122-129.
  8. HANJALIC, A., R. LIENHART, W. Y. MA, J. R. SMITH, The Holy Grail of Multimedia Information Retrieval: So Close or Yet So Far Away? Proceedings of the IEEE, vol. 96(4), 2008, pp. 541-547.
  9. JOHO, H., J. STAIANO, N. SEBE, J. M. JOSE, Looking at the Viewer: Analysing Facial Activity to Detect Personal Highlights of Multimedia Contents. Multimedia Tools and Applications, vol. 51(2), 2011, pp. 505-523.
  10. KAKLAUSKAS, A., A. KUZMINSKE, E. K. ZAVADSKAS, A. DANIUNAS, G. KAKLAUSKAS, M. SENIUT, J. RAISTENSKIS, A. SAFONOV, R. KLIUKAS, A. JUOZAPAITIS, A. RADZEVICIENE, R. CERKAUSKIENE, Affective Tutoring System for Built Environment Management. Computers & Education, vol. 82, 2015, pp. 202-216.
  11. KAKLAUSKAS, A., E. K. ZAVADSKAS, V. PRUSKUS, A. VLASENKO, M. SENIUT, G. KAKLAUSKAS, A. MATULIAUSKAITE, V. GRIBNIAK, Biometric and Intelligent Self-assessment of Student Progress System, Computers & Education, 2010, pp. 821-833.
  12. KAKLAUSKAS, A., E. K. ZAVADSKAS, V. PRUSKUS, A. VLASENKO, L. BARTKIENE, R. PALISKIENE, L. ZEMECKYTE, V. GERSTEIN, G. DZEMYDA, G., TAMULEVICIUS, Recommended Biometric Stress Management System. Expert Systems with Applications, vol. 38(11), 2011, pp. 14011-14025.
  13. KENSINGER, E. A. Remembering Emotional Experiences: The Contribution of Valence and Arousal. Reviews in the Neurosciences, vol. 15(4), 2004, pp. 241-252.
  14. KOELSTRA, S., I. PATRAS, Fusion of Facial Expressions and EEG for Implicit Affective Tagging. Image and Vision Computing, vol. 31(2), 2013, pp. 164-174.
  15. LANG, P. J., M. K. GREENWALD, M. M. BRADLEY, A. O. HAMM, Looking at Pictures: Affective, Facial, Visceral, and Behavioural Reactions, Psychophysiology, vol. 30, 1993, pp. 261-273.
  16. LU, Y., N. SEBE, R. HYTNEN, Q. TIAN, Personalization in Multimedia Retrieval: A Survey. Multimedia Tools and Applications, vol. 51, 2011, pp. 247-277.
  17. MEHMOOD, I., M. SAJJAD, S. RHO, S. W. BAIK, Divide-and-Conquer based Summarization Framework for Extracting Affective Video Content. Neurocomputing, in Press, Corrected Proof.
  18. MONCRIEFF, S., C. DORAI, S. VENKATESH, Affect Computing in Film through Sound Energy Dynamics. In: ACM International Conference on Multimedia, 2001.
  19. MONEY, A. G., H. AGIUS, Video Summarisation: A Conceptual Framework and Survey of the State of the Art. Journal of Visual Communication and Image Representation, vol. 19(2), 2008, pp. 121-143.
  20. NIU, J., X. ZHAO, M. A. ABDUL AZIZ, A Novel Affect-based Model of Similarity Measure of Videos. Neurocomputing, In Press, Corrected Proof, 2015.
  21. RINGEVAL, F., F. EYBEN, E. KROUPI, A. YUCE, J.-P. THIRAN, T. EBRAHIMI, D. LALANNE, B. SCHULLER, Prediction of Asynchronous Dimensional Emotion Ratings from Audiovisual and Physiological Data. Pattern Recognition Letters, vol. 66, 2015, pp. 22-30.
  22. SOLEYMANI, M., M. PANTIC, T. PUN, Multi-modal Emotion Recognition in Response to Videos. IEEE Transactions on Affective Computing, vol. 3(2), 2011, pp. 211-223.
  23. SONG, M., M. YOU, N. LI, C. CHEN, A Robust Multimodal Approach for Emotion Recognition. Neurocomputing, vol. 71(10-12), 2008, pp. 1913-1920.
  24. VENTER, F., A. STEIN, Images & Videos: Really Big Data. The Institute for Operations Research and the Management Sciences (INFORMS), 2012.
  25. WANG, H., L. CHEONG, Affective Understanding in Film. IEEE Transactions on Circuits and Systems for Video Technology, vol. 16(6), 2006, pp. 689-704.