November 1, 09:00 am to 12:30 pm, Melbourne time (UTC+11)
Location: Meeting Room 208, MCEC
Time | Event | Speaker | Institution |
---|---|---|---|
09:00 am | Opening | ||
09:10 am | Keynote: Hallucination in MLLMs | Jinda Lu | University of Science and Technology of China |
10:00 am | Challenge Summary | Zhiqi Ge | Zhejiang University |
10:30 am | Coffee Break | ||
11:00 am | Challenge Solution Report | Xian Fu | Tianjin University |
11:30 am | Challenge Solution Report | Jingyu Wei | National University of Defense Technology |
12:00 pm | Open Discussion and Future Directions ||
Rank | Team | Score |
---|---|---|
1 | Synergix | 60.7 |
2 | Cyan | 59.4 |
3 | Temp | 58.5 |
To facilitate research in interleaved vision-language instruction following, we build DEMON, a comprehensive challenge of 29 tasks with varied, demonstrative instructions in a unified instruction-response format, covering 20 diverse scenarios. The tasks fall into seven categories:
Visual Storytelling. This task involves creating a coherent narrative based on a series of images presented in sequence. It tests the model's ability to understand context, sequence, and the progression of events from visual inputs, requiring the construction or prediction of a story continuation.
Multi-Modal Dialogue. In this task, the model must engage in a dialogue that necessitates the interpretation of both visual content and text. It assesses the model's ability to integrate multimodal information and apply it within a conversational context to make decisions or respond to queries.
Visual Relation Inference. The objective is to identify and articulate the relationships or changes between two pictures. This task challenges the model to observe details and analyze the visual information to determine counts, positions, or interactions.
Text-Rich Images QA. This task requires the model to extract and comprehend information from text-heavy images, such as slides or documents. It demands capabilities in text recognition within images and the subsequent application of this information to answer related questions.
Multi-Image Reasoning. The model is tasked with analyzing multiple images to make judgments about them, such as assessing whether they share a similar style or theme. It tests the model's ability to process and compare visual elements, and to make inferences based on visual data.
Multi-Modal Cloze. This task presents the model with a sequence of images or text with a missing element, requiring the correct identification of the subsequent part from the provided options. It evaluates the model's understanding of narrative structure and context comprehension to logically fill in the missing pieces.
Knowledge Grounded QA. The model is challenged to answer questions based on complex diagrams or authoritative text sources. It necessitates a comprehensive understanding of the material and the extraction of relevant information to provide accurate responses.
Task | Scenario | Dataset | Metric |
---|---|---|---|
Multi-Modal Dialogue | |||
Conversational Embodied Dialogue | Embodied | ALFRED | ROUGE-L |
Multi-Modal Dialogue | Conversation | MMCoQA | ROUGE-L |
Visual Relation Inference | |||
Visual Change Captioning | Surveillance | Spot-the-Diff | ROUGE-L |
Visual Change Captioning | Synthetic | CLEVR-Change | ROUGE-L |
Visual Relationship Expressing | General | IEdit | ROUGE-L |
Subtle Difference Expressing | Fine-Grained | Birds-to-Words | ROUGE-L |
Visual Storytelling | |||
Animated Story Completion | Cartoon | AESOP | ROUGE-L |
Animated Story Completion | Cartoon | PororoSV | ROUGE-L |
Animated Story Completion | Cartoon | FlintstonesSV | ROUGE-L |
Sequential Photo Storytelling | Album | VIST | ROUGE-L |
Sequential Photo Storytelling | Cartoon | DiDeMoSV | ROUGE-L |
Multi-Modal Cloze | |||
Comic Dialogue Identification | Cartoon | COMICS-Dialogue | Accuracy |
Comic Panel Identification | Cartoon | COMICS-Panel | Accuracy |
Recipe Completion | Recipe | RecipeQA-TextCloze | Accuracy |
Visual Step Cloze | Recipe | RecipeQA-VisualCloze | Accuracy |
Knowledge Grounded QA | |||
Webpage QA | Webpage | WebQA | Accuracy |
Textbook QA | Textbook | TQA | Accuracy |
Complex Multimodal QA | Wikipedia | MMQA | Accuracy |
Text-Rich Images QA | |||
Slide QA | Slide | SlideVQA | Accuracy |
OCR QA | Book Cover | OCR-VQA | Accuracy |
Document QA | Document Image | DocVQA | Accuracy |
Multi-Image Reasoning | |||
Image-Set QA | Driving Recording | nuScenes | Accuracy |
Industrial Inspection | Industrial | VISION | Accuracy |
Fashion QA | Fashion | Fashion200K | Accuracy |
Property Coherence | General | MIT-States-PropertyCoherence | Accuracy |
State Transformation Coherence | General | MIT-States-StateCoherence | Accuracy |
Visual Step Matching | Recipe | RecipeQA-ImageCoherence | Accuracy |
Multi-Image Visual Entailment | General | NLVR2 | Accuracy |
Ambiguity Analysis | Mobile Photo | VizWiz | Accuracy |
For each task, we provide a dataset with a training set and a test set. The annotations are provided as JSON files. An example of the task metadata is shown below.
{
  "dataset": "MMQA",
  "split": "test",
  "num_sample": "500",
  "task_instruction": [
    "Given a collection of relevant data, which includes images, text, and tables, your task is to respond accurately to the ensuing question. You must choose your answer from the Choice List. ",
    "Utilizing the information, including images, text, and tables that I provide, could you provide a correct answer to the following question? You must choose your answer from the Choice List. ",
    "With the aid of provided information like images, text, and tables, your assignment is to respond correctly to the question. You must choose your answer from the Choice List. ",
    "Based on the data provided, which includes various forms like images, text, and tables, please respond to the following query. You must choose your answer from the Choice List. ",
    "Using the relevant information I provide, which encompasses images, text, and tables, please accurately answer the question. You must choose your answer from the Choice List. ",
    "Considering the information in different formats I've provided you, could you formulate a correct response to the ensuing query? You must choose your answer from the Choice List. ",
    "Given a set of information including images, text, and tables, your task is to use this data to answer the subsequent question correctly. You must choose your answer from the Choice List. ",
    "Relying on the furnished information, which includes graphics, text, and tabular data, could you provide an accurate response to the following question? You must choose your answer from the Choice List. ",
    "With the amalgam of information provided, encompassing images, texts, and tables, please construct a correct answer to the following question. You must choose your answer from the Choice List. ",
    "Using the array of information, including imagery, textual data, and tables, could you provide an accurate answer to the posed question? You must choose your answer from the Choice List. "
  ],
  "question_type": "multi-choice"
}
An example of the instance annotation is shown below.
{
  "sample_id": "0",
  "task_instruction_id": "7",
  "task_instance": {
    "context": "Global Table: {table#1} Context: {image#2} Question: What sports is the Ben Piazza 1976 movie title? Choice List:['soccer', 'baseball', 'basketball', 'football'] Your answer is:",
    "images_path": [
      "0.png",
      "1.jpg"
    ]
  },
  "response": "baseball"
}
The "response" field is contained only in the training set.
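As a rough illustration of how the task metadata and the instance annotations fit together, the sketch below assembles a prompt for one sample. The directory layout and file names (metadata.json, annotations.json) are assumptions made for illustration; consult the dataset release for the actual structure.

import json
from pathlib import Path

# Hypothetical layout: one folder per task containing a metadata file and an annotation file.
task_dir = Path("MMQA/test")
metadata = json.loads((task_dir / "metadata.json").read_text())
instances = json.loads((task_dir / "annotations.json").read_text())

sample = instances[0]
# Each sample selects one of the paraphrased task instructions by index.
instruction = metadata["task_instruction"][int(sample["task_instruction_id"])]

# The context interleaves text with {image#k} / {table#k} placeholders;
# images_path lists the associated image files.
prompt = instruction + sample["task_instance"]["context"]
image_files = [task_dir / p for p in sample["task_instance"]["images_path"]]

print(prompt)
print(image_files)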
The dataset can be downloaded from Google Drive.
Split | Tasks | Scenarios | Images | Instructions | Avg. Images / Instruction | Avg. Words / Instruction |
---|---|---|---|---|---|---|
DEMON-Test | 29 | 19 | 62.81K | 18.18K | 3.46 | 92.69 |
DEMON-Train | 29 | 19 | 1.51M | 430.72K | 3.70 | 98.58 |
For open-ended generation tasks, the evaluation will be conducted using ROUGE-L (F1), which measures the longest-common-subsequence overlap between the generated text and the reference text.
For multi-choice tasks, Accuracy will serve as the evaluation metric, measuring the correctness of selected options.
The overall score for each team will be defined as the mean of these scores across all tasks, reflecting a comprehensive measure of performance akin to the scoring of human examinations.
The evaluation code can be found in the GitHub repository.
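For reference, the snippet below is a minimal, unofficial sketch of the scoring scheme described above: ROUGE-L (F1) for open-ended tasks, accuracy for multi-choice tasks, and an unweighted mean across tasks as the overall score. The whitespace tokenization and the 0-100 scaling are assumptions; the official implementation in the repository is authoritative.

from statistics import mean

def rouge_l_f1(reference: str, prediction: str) -> float:
    """ROUGE-L F1 via longest common subsequence over whitespace tokens."""
    ref, pred = reference.split(), prediction.split()
    if not ref or not pred:
        return 0.0
    # Dynamic-programming LCS length.
    dp = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, p in enumerate(pred, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == p else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(pred)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def task_score(pairs, question_type: str) -> float:
    """pairs: list of (reference, prediction) strings for one task; returns a 0-100 score."""
    if question_type == "multi-choice":
        return 100 * mean(ref.strip() == pred.strip() for ref, pred in pairs)
    return 100 * mean(rouge_l_f1(ref, pred) for ref, pred in pairs)

def overall_score(per_task_scores):
    """Overall score: unweighted mean of the per-task scores."""
    return mean(per_task_scores)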
To participate in the DEMON challenge, please first register by submitting the form.
Participants can send the submission files to dcd-mllm@outlook.com with the subject "DEMON Challenge Submission". We will evaluate the submissions and announce the results on the website later.
The submission file should keep the same structure as the DEMON-Test dataset, with the response field filled with the predicted answer. Image folders are not required for submission.
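For illustration, a minimal sketch of building a submission file from the test annotations is shown below. The paths and the predict() stub are placeholders for your own inference pipeline; only the response field is added, and the rest of the annotation structure is left untouched.

import json
from pathlib import Path

def predict(sample: dict) -> str:
    """Placeholder: run your model on one test instance and return its answer."""
    return ""  # replace with actual inference

src = Path("DEMON-Test/MMQA/annotations.json")   # hypothetical input path
out = Path("submission/MMQA/annotations.json")   # mirror the DEMON-Test layout

samples = json.loads(src.read_text())
for sample in samples:
    sample["response"] = predict(sample)         # fill in the predicted answer

out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(samples, indent=2, ensure_ascii=False))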
We evaluate several state-of-the-art models on the DEMON challenge. The results are shown below.
Model | Version | Multi Modal Dialogue | Visual Story Telling | Visual Relation Inference | Multi Modal Cloze | Knowledge Grounded QA | Text Rich Images QA | Multi Image Reasoning |
---|---|---|---|---|---|---|---|---|
BLIP-2 | vicuna-7b | 11.96 | 20.10 | 3.67 | 18.25 | 39.73 | 30.53 | 39.53 |
InstructBLIP | vicuna-7b | 33.58 | 24.41 | 11.49 | 21.20 | 47.40 | 44.40 | 48.55 |
LLaMA-Adapter V2 | llama-7b | 14.22 | 17.57 | 13.51 | 18.00 | 44.80 | 32.00 | 44.03 |
LLaVA | vicuna-7b | 7.79 | 10.70 | 8.27 | 15.85 | 36.20 | 28.33 | 41.53 |
MiniGPT-4 | vicuna-7b | 13.70 | 17.07 | 7.95 | 16.60 | 30.27 | 26.40 | 43.50 |
mPLUG-Owl | llama-7b | 12.67 | 19.33 | 5.40 | 16.25 | 33.27 | 32.47 | 42.50 |
OpenFlamingo | llama-7b | 16.88 | 24.22 | 13.85 | 21.65 | 32.00 | 30.60 | 41.63 |
Otter | llama-7b | 15.37 | 15.57 | 11.39 | 16.00 | 41.67 | 27.73 | 43.85 |
VPG-C | llama-2-7b-chat | 42.70 | 24.76 | 25.50 | 22.95 | 51.00 | 44.93 | 48.68 |
VPG-C | vicuna-7b | 37.50 | 25.20 | 25.90 | 22.15 | 48.60 | 44.93 | 50.28 |
Registration Open: 2024-5-1
Challenge Result Submission Deadline: 2024-8-1 (extended from 2024-7-1)
Challenge Technical Paper Submission Deadline: 2024-8-10 (extended from 2024-7-10)
Zhiqi Ge, Zhejiang University, China
Juncheng Li, National University of Singapore, Singapore
Qifan Yu, Zhejiang University, China
Wei Zhou, Zhejiang University, China
Siliang Tang, Zhejiang University, China
Yueting Zhuang, Zhejiang University, China
For any questions, please contact us at dcd-mllm@outlook.com.