Caption alignment for low resource audio-visual data

Vighnesh Reddy Konda, Mayur Warialani, Rakesh Prasanth Achari, Varad Bhatnagar, Jayaprakash Akula, Preethi Jyothi, Ganesh Ramakrishnan, Gholamreza Haffari, Pankaj Singh

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

Understanding videos via captioning has gained a lot of traction recently. While captions are provided alongside videos, information about where a caption aligns within a video is missing, even though it could be particularly useful for indexing and retrieval. Existing work on learning to infer alignments has mostly exploited visual features and ignored the audio signal; video understanding applications often underestimate the importance of the audio modality. We focus on how to make effective use of the audio modality for temporal localization of captions within videos. We release a new audio-visual dataset whose captions are time-aligned by (i) carefully listening to the audio and watching the video, and (ii) watching only the video. Our dataset is audio-rich and contains captions in two languages, English and Marathi (a low-resource language). We further propose an attention-driven multimodal model for effective utilization of both audio and video for temporal localization. We then investigate (i) the effects of audio in both data preparation and model design, and (ii) effective pretraining strategies (AudioSet, ASR-bottleneck features, PASE, etc.) for handling the low-resource setting and extracting rich audio representations.
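The page does not reproduce the model architecture, so as a rough illustration only, the sketch below shows one generic way an attention-driven audio-visual model for temporal localization could be wired up in PyTorch: per-frame video features attend to time-aligned audio features, and a small head scores each frame as a caption start or end. All module choices, dimensions, and names here are assumptions for illustration, not the authors' design.

# Illustrative sketch (NOT the authors' model): cross-modal attention over
# per-frame video and audio features, followed by a start/end span predictor
# for temporally localizing a caption within a video.
import torch
import torch.nn as nn

class AudioVisualLocalizer(nn.Module):
    def __init__(self, d_audio=128, d_video=512, d_model=256, n_heads=4):
        super().__init__()
        # Project both modalities into a shared space (dimensions assumed).
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        # Video frames attend to temporally aligned audio features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-frame scores for caption start and end positions.
        self.span_head = nn.Linear(d_model, 2)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, T, d_audio); video_feats: (batch, T, d_video)
        a = self.audio_proj(audio_feats)
        v = self.video_proj(video_feats)
        fused, _ = self.cross_attn(query=v, key=a, value=a)
        logits = self.span_head(fused)               # (batch, T, 2)
        start_logits, end_logits = logits.unbind(dim=-1)
        return start_logits, end_logits

# Usage: per-frame scores over T frames; argmax yields a predicted caption span.
model = AudioVisualLocalizer()
start, end = model(torch.randn(1, 100, 128), torch.randn(1, 100, 512))

In a setup like this, the start/end logits would typically be trained with cross-entropy against annotated caption boundaries; the paper's actual objective and feature extractors (e.g., the AudioSet and PASE pretraining it mentions) are not described on this page.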

Original language English
Title of host publication Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Editors Helen Meng
Place of Publication Baixas, France
Publisher International Speech Communication Association (ISCA)
Pages 3525-3529
Number of pages 5
Volume 2020-October
DOIs https://doi.org/10.21437/Interspeech.2020-3157
Publication status Published - 2020
Event Annual Conference of the International Speech Communication Association (was Eurospeech) 2020 - Shanghai, China
Duration: 25 Oct 2020 – 29 Oct 2020
Conference number: 21st
https://www.isca-speech.org/archive/Interspeech_2020/ (Proceedings)
https://www.isca-speech.org/archive/Interspeech_2020/index.html (Website)

Conference

Conference Annual Conference of the International Speech Communication Association (was Eurospeech) 2020
Abbreviated title Interspeech 2020
Country/Territory China
City Shanghai
Period 25/10/20 – 29/10/20
Internet address https://www.isca-speech.org/archive/Interspeech_2020/index.html

Keywords

  • Caption alignment for videos
  • Low-resource audio-visual corpus
  • Multimodal models
Konda, V. R., Warialani, M., Achari, R. P., Bhatnagar, V., Akula, J., Jyothi, P., Ramakrishnan, G., Haffari, G., & Singh, P. (2020). Caption alignment for low resource audio-visual data. In H. Meng (Ed.), Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2020-October, pp. 3525-3529). International Speech Communication Association (ISCA). https://doi.org/10.21437/Interspeech.2020-3157