Authors:
(1) Pinelopi Papalampidi, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(3) Mirella Lapata, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
Table of Links
- Abstract and Intro
- Related Work
- Problem Formulation
- Experimental Setup
- Results and Analysis
- Conclusions and References
- A. Model Details
- B. Implementation Details
- C. Results: Ablation Studies
6. Conclusions
In this work, we proposed a trailer generation approach that adopts a graph-based representation of movies and uses interpretable criteria for selecting shots. We also showed how privileged information from screenplays can be leveraged via contrastive learning, resulting in a model that can be used for both turning point identification and trailer generation. Trailers generated by our model were judged favorably in terms of their content and attractiveness.
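To illustrate the contrastive objective mentioned above, the sketch below implements a generic InfoNCE loss [38] that pulls each video-shot embedding toward its aligned screenplay-scene embedding, with the other scenes in the batch serving as negatives. The function and variable names are illustrative, not the paper's actual implementation.

```python
import numpy as np

def info_nce_loss(shot_emb, scene_emb, temperature=0.1):
    """InfoNCE loss: each shot's positive is the scene with the same
    batch index; all other scenes in the batch act as negatives."""
    v = shot_emb / np.linalg.norm(shot_emb, axis=1, keepdims=True)
    t = scene_emb / np.linalg.norm(scene_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature             # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives lie on the diagonal

rng = np.random.default_rng(0)
shots = rng.standard_normal((4, 16))
loss_random = info_nce_loss(shots, rng.standard_normal((4, 16)))
loss_aligned = info_nce_loss(shots, shots)  # perfectly aligned pairs score lower
```

Minimizing this loss drives matching shot/scene pairs together and mismatched pairs apart, which is the general mechanism by which screenplay knowledge can be distilled into the video model.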
In the future, we would like to focus on methods for predicting fine-grained emotions (e.g., grief, loathing, terror, joy) in movies. In this work, we used positive/negative sentiment as a stand-in for emotions, due to the absence of in-domain labeled datasets. Previous efforts have focused on tweets [1], YouTube opinion videos [4], talk shows [20], and recordings of human interactions [8]. Preliminary experiments revealed that transferring fine-grained emotion knowledge from other domains to ours leads to unreliable predictions, whereas sentiment is more stable and improves trailer generation performance. Avenues for future work include new emotion datasets for movies, as well as emotion detection models based on textual and audiovisual cues.
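To make the role of sentiment concrete, here is a toy shot-ranking criterion in the spirit of the interpretable selection rules discussed above: it combines a shot's predicted turning-point probability with the absolute intensity of its signed sentiment score. The names, weights, and scoring rule are hypothetical, chosen only to show how sentiment can stand in for emotion when ranking candidate trailer shots.

```python
def rank_shots(tp_scores, sentiment_scores, alpha=0.5):
    """Rank shots by a weighted mix of narrative importance (predicted
    turning-point probability) and emotional intensity (|sentiment|).
    `alpha` trades one criterion against the other; this rule is
    illustrative, not the paper's actual selection method."""
    combined = [alpha * tp + (1 - alpha) * abs(s)
                for tp, s in zip(tp_scores, sentiment_scores)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

# Three candidate shots: turning-point probabilities and signed sentiment
order = rank_shots([0.1, 0.9, 0.3], [0.95, 0.0, -0.9])
print(order)  # -> [2, 0, 1]
```

Note that a strongly negative shot (index 2) can outrank a shot with a high turning-point probability (index 1), which is why sentiment intensity rather than polarity matters for attractiveness.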
References
[1] Muhammad Abdul-Mageed and Lyle Ungar. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728, Vancouver, Canada, July 2017. Association for Computational Linguistics. 8
[2] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2020. 12
[3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the Advances in Neural Information Processing Systems, pages 2654–2662, Montreal, Quebec, Canada, 2014. 2, 4
[4] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia, July 2018. Association for Computational Linguistics. 8
[5] Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. Condensed movies: Story based retrieval with contextual embeddings. In Proceedings of the Asian Conference on Computer Vision, 2020. 2
[6] Pablo Barceló, Egor V. Kostylev, Mikaël Monet, Jorge Pérez, Juan Reutter, and Juan Pablo Silva. The logical expressiveness of graph neural networks. In International Conference on Learning Representations, 2019. 12
[7] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013. 11
[8] Sanjay Bilakhia, Stavros Petridis, Anton Nijholt, and Maja Pantic. The MAHNOB mimicry database: A database of naturalistic human interactions. Pattern Recognition Letters, 66:52–61, 2015. Pattern Recognition in Human Computer Interaction. 8
[9] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335, 2008. 6
[10] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE Computer Society, 2017. 6
[11] Paola Cascante-Bonilla, Kalpathy Sitaraman, Mengjia Luo, and Vicente Ordonez. Moviescope: Large-scale analysis of movies using multiple modalities. arXiv preprint arXiv:1908.03180, 2019. 5
[12] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018. 6
[13] James E. Cutting. Narrative theory and the dynamics of popular movies. Psychonomic Bulletin & Review, 23(6):1713–1743, 2016. 1
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 6
[15] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. Advances in Neural Information Processing Systems, 28:2224–2232, 2015. 3
[16] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017. 6
[17] Deepanway Ghosal, Navonil Majumder, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. COSMIC: Commonsense knowledge for emotion identification in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 2470–2481, 2020. 6
[18] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015. 6
[19] Philip John Gorinski and Mirella Lapata. Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1066–1076, Denver, Colorado, May–June 2015. Association for Computational Linguistics. 5, 12
[20] Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. The Vera am Mittag German audio-visual emotional speech database. In ICME, pages 865–868. IEEE, 2008. 8
[21] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010. 4
[22] Michael Hauge. Storytelling Made Easy: Persuade and Transform Your Audiences, Buyers, and Clients – Simply, Quickly, and Profitably. Indie Books International, 2017. 1, 3, 13
[23] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 2, 4
[24] Go Irie, Takashi Satou, Akira Kojima, Toshihiko Yamasaki, and Kiyoharu Aizawa. Automatic trailer generation. In Proceedings of the 18th ACM International Conference on Multimedia, pages 839–842, 2010. 1, 2
[25] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR 2017), 2017. 11
[26] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016. 3
[27] Hyounghun Kim, Zineng Tang, and Mohit Bansal. Dense-caption matching and frame-selection gating for temporal localization in VideoQA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4812–4822, 2020. 3
[28] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. 3
[29] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, 2017. 6
[30] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015. 2
[31] Jordan Louviere, T. N. Flynn, and A. A. J. Marley. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press, 2015. 8
[32] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, 2017. 11
[33] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889, 2020. 2
[34] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019. 2
[35] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, 2004. 7
[36] Cory S. Myers and Lawrence R. Rabiner. A comparative study of several dynamic time-warping algorithms for connected-word recognition. Bell System Technical Journal, 60(7):1389–1409, 1981. 5
[37] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2019. 12
[38] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 4, 5, 11
[39] Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879, 2020. 4
[40] Pinelopi Papalampidi, Frank Keller, Lea Frermann, and Mirella Lapata. Screenplay summarization using latent narrative structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1920–1933, 2020. 2
[41] Pinelopi Papalampidi, Frank Keller, and Mirella Lapata. Movie plot analysis via turning point identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1707–1717, 2019. 2, 3, 5, 6, 11, 12
[42] Pinelopi Papalampidi, Frank Keller, and Mirella Lapata. Movie summarization via sparse graph construction. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021. 2, 3, 5, 6, 12
[43] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, 2019. 6
[44] Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3202–3212, 2015. 2
[45] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations, 2017. 3
[46] Alan F. Smeaton, Bart Lehane, Noel E. O'Connor, Conor Brady, and Gary Craig. Automatically selecting shots for action movie trailers. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 231–238, 2006. 1, 2
[47] John R. Smith, Dhiraj Joshi, Benoit Huet, Winston Hsu, and Jozef Cota. Harnessing AI for augmenting creativity: Application to movie trailer creation. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1799–1808, 2017. 2, 7
[48] Siqi Sun, Zhe Gan, Yuwei Fang, Yu Cheng, Shuohang Wang, and Jingjing Liu. Contrastive distillation on intermediate representations for language model compression. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 498–508, 2020. 4
[49] Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen. Book2Movie: Aligning video scenes with book chapters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1827–1835, 2015. 2
[50] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4631–4640, 2016. 2
[51] Kristin Thompson. Storytelling in the new Hollywood: Understanding classical narrative technique. Harvard University Press, 1999. 1
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. 3
[53] Lezi Wang, Dong Liu, Rohit Puri, and Dimitris N. Metaxas. Learning trailer moments in full-length movies with co-contrastive attention. In European Conference on Computer Vision, pages 300–316. Springer, 2020. 1, 2, 7
[54] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. 6
[55] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018. 4
[56] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017. 6
[57] Hongteng Xu, Yi Zhen, and Hongyuan Zha. Trailer generation via a point process-based visual attractiveness model. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 2198–2204, 2015. 2, 7
This paper is available on arxiv under CC BY-SA 4.0 DEED license.