Latent Memory-augmented Graph Transformer for Visual Storytelling

Mengshi Qi*, Jie Qin, Di Huang, Zhiqiang Shen, Yi Yang, Jiebo Luo

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report, Conference Paper published in a book, peer-review

21 Citations (Scopus)

Abstract

Visual storytelling aims to automatically generate a human-like short story given an image stream. Most existing works utilize either scene-level or object-level representations, neglecting the interaction among objects in each image and the sequential dependency between consecutive images. In this paper, we present a novel Latent Memory-augmented Graph Transformer (LMGT), a Transformer-based framework for visual story generation. LMGT directly inherits the merits of the Transformer and is further enhanced with two carefully designed components: a graph encoding module and a latent memory unit. Specifically, the graph encoding module exploits the semantic relationships among image regions and attentively aggregates critical visual features based on the parsed scene graphs. Furthermore, to better preserve inter-sentence coherence and topic consistency, we introduce an augmented latent memory unit that learns and records highly summarized latent information as the story line from the image stream and the sentence history. Experimental results on three widely used datasets demonstrate the superior performance of LMGT over state-of-the-art methods.

Original language: English
Title of host publication: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 4892-4901
Number of pages: 10
ISBN (Electronic): 9781450386517
Publication status: Published - 17 Oct 2021
Externally published: Yes
Event: 29th ACM International Conference on Multimedia, MM 2021 - Virtual, Online, China
Duration: 20 Oct 2021 to 24 Oct 2021

Publication series

Name: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia

Conference

Conference: 29th ACM International Conference on Multimedia, MM 2021
Country/Territory: China
City: Virtual, Online
Period: 20/10/21 to 24/10/21

Bibliographical note

Publisher Copyright:
© 2021 ACM.

Keywords

  • memory network
  • scene graph
  • transformer
  • visual storytelling
