MM24 Grand Challenge DEMON: Demonstrative Instruction Following Challenge

Program

November 1, 09:00 am to 12:30 pm, Melbourne time (UTC+11)

Location: Meeting Room 208, MCEC

Program Schedule
Time | Event | Speaker | Institution
09:00 am | Opening
09:10 am | Keynote: Hallucination in MLLMs | Jinda Lu | University of Science and Technology of China
10:00 am | Challenge Summary | Zhiqi Ge | Zhejiang University
10:30 am | Coffee Break
11:00 am | Challenge Solution Report | Xian Fu | Tianjin University
11:30 am | Challenge Solution Report | Jingyu Wei | National University of Defense Technology
12:00 pm | Open Discussion and Future Directions
Figure: Demonstrations and task taxonomy of the proposed DEMON challenge.

Leaderboard

Rank | Team | Score
1 | Synergix | 60.7
2 | Cyan | 59.4
3 | Temp | 58.5

Overview

To facilitate research on interleaved vision-language instruction following, we build DEMON, a comprehensive challenge of 29 tasks with varied, demonstrative instructions in a unified instruction-response format, covering 19 diverse scenarios. DEMON has three important properties:

  • Interleaved vision-language context: every instruction comprises a sequence of interconnected images and texts, such as storyboards with scripts and textbooks with diagrams.
  • Diverse forms of complex instructions: the tasks range from generating dialogue for comics and identifying disparities in surveillance images to engaging in conversational embodied tasks.
  • Vast range of instruction-following scenarios: the benchmark encompasses multiple real-world scenarios, including cartoons, industrial images, driving recordings, recipes, etc.

Task Details

Visual Storytelling. This task involves creating a coherent narrative based on a series of images presented in sequence. It tests the model's ability to understand context, sequence, and the progression of events from visual inputs, requiring the construction or prediction of a story continuation.

Multi-Modal Dialogue. In this task, the model must engage in a dialogue that necessitates the interpretation of both visual content and text. It assesses the model's ability to integrate multimodal information and apply it within a conversational context to make decisions or respond to queries.

Visual Relation Inference. The objective is to identify and articulate the relationships or changes between two pictures. This task challenges the model to observe fine details and analyze the visual information to determine changes in counts, positions, or interactions.

Text-Rich Images QA. This task requires the model to extract and comprehend information from text-heavy images, such as slides or documents. It demands capabilities in text recognition within images and the subsequent application of this information to answer related questions.

Multi-Image Reasoning. The model is tasked with analyzing multiple images to make judgments about them, such as assessing whether they share a similar style or theme. It tests the model's ability to process and compare visual elements, and to make inferences based on visual data.

Multi-Modal Cloze. This task presents the model with a sequence of images or text with a missing element, requiring the correct identification of the subsequent part from the provided options. It evaluates the model's understanding of narrative structure and context comprehension to logically fill in the missing pieces.

Knowledge Grounded QA. The model is challenged to answer questions based on complex diagrams or authoritative text sources. It necessitates a comprehensive understanding of the material and the extraction of relevant information to provide accurate responses.

DEMON Challenge Task Details
Task | Scenario | Dataset | Metric

Multi-Modal Dialogue
Conversational Embodied Dialogue | Embodied | ALFRED | ROUGE-L
Multi-Modal Dialogue | Conversation | MMCoQA | ROUGE-L

Visual Relation Inference
Visual Change Captioning | Surveillance | Spot-the-Diff | ROUGE-L
Visual Change Captioning | Synthetic | CLEVR-Change | ROUGE-L
Visual Relationship Expressing | General | IEdit | ROUGE-L
Subtle Difference Expressing | Fine-Grained | Birds-to-Words | ROUGE-L

Visual Storytelling
Animated Story Completion | Cartoon | AESOP | ROUGE-L
Animated Story Completion | Cartoon | PororoSV | ROUGE-L
Animated Story Completion | Cartoon | FlintstonesSV | ROUGE-L
Sequential Photo Storytelling | Album | VIST | ROUGE-L
Sequential Photo Storytelling | Cartoon | DiDeMoSV | ROUGE-L

Multi-Modal Cloze
Comic Dialogue Identification | Cartoon | COMICS-Dialogue | Accuracy
Comic Panel Identification | Cartoon | COMICS-Panel | Accuracy
Recipe Completion | Recipe | RecipeQA-TextCloze | Accuracy
Visual Step Cloze | Recipe | RecipeQA-VisualCloze | Accuracy

Knowledge Grounded QA
Webpage QA | Webpage | WebQA | Accuracy
Textbook QA | Textbook | TQA | Accuracy
Complex Multimodal QA | Wikipedia | MMQA | Accuracy

Text-Rich Images QA
Slide QA | Slide | SlideVQA | Accuracy
OCR QA | Book Cover | OCR-VQA | Accuracy
Document QA | Document Image | DocVQA | Accuracy

Multi-Image Reasoning
Image-Set QA | Driving Recording | nuScenes | Accuracy
Industrial Inspection | Industrial | VISION | Accuracy
Fashion QA | Fashion | Fashion200K | Accuracy
Property Coherence | General | MIT-States-PropertyCoherence | Accuracy
State Transformation Coherence | General | MIT-States-StateCoherence | Accuracy
Visual Step Matching | Recipe | RecipeQA-ImageCoherence | Accuracy
Multi-Image Visual Entailment | General | NLVR2 | Accuracy
Ambiguity Analysis | Mobile Photo | VizWiz | Accuracy

Dataset

For each task, we provide a dataset with a training set and a test set. The annotations are in the form of a JSON file. An example of the task metadata is shown below.

{
    "dataset": "MMQA",
    "split": "test",
    "num_sample": "500",
    "task_instruction": [
        "Given a collection of relevant data, which includes images, text, and tables, your task is to respond accurately to the ensuing question. You must choose your answer from the Choice List. ",
        "Utilizing the information, including images, text, and tables that I provide, could you provide a correct answer to the following question? You must choose your answer from the Choice List. ",
        "With the aid of provided information like images, text, and tables, your assignment is to respond correctly to the question. You must choose your answer from the Choice List. ",
        "Based on the data provided, which includes various forms like images, text, and tables, please respond to the following query. You must choose your answer from the Choice List. ",
        "Using the relevant information I provide, which encompasses images, text, and tables, please accurately answer the question. You must choose your answer from the Choice List. ",
        "Considering the information in different formats I've provided you, could you formulate a correct response to the ensuing query? You must choose your answer from the Choice List. ",
        "Given a set of information including images, text, and tables, your task is to use this data to answer the subsequent question correctly. You must choose your answer from the Choice List. ",
        "Relying on the furnished information, which includes graphics, text, and tabular data, could you provide an accurate response to the following question? You must choose your answer from the Choice List. ",
        "With the amalgam of information provided, encompassing images, texts, and tables, please construct a correct answer to the following question. You must choose your answer from the Choice List. ",
        "Using the array of information, including imagery, textual data, and tables, could you provide an accurate answer to the posed question? You must choose your answer from the Choice List. "
    ],
    "question_type": "multi-choice"
}

An example of the instance annotation is shown below.

{
    "sample_id": "0",
    "task_instruction_id": "7",
    "task_instance": {
        "context": "Global Table: {table#1} Context: {image#2} Question: What sports is the Ben Piazza 1976 movie title? Choice List:['soccer', 'baseball', 'basketball', 'football'] Your answer is:",
        "images_path": [
            "0.png",
            "1.jpg"
        ]
    },
    "response": "baseball" #contained only in the train set
}

The dataset can be downloaded from Google Drive.
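For illustration, the following is a minimal Python sketch of how one might iterate over such an annotation file and assemble the full prompt for each sample. It assumes the metadata and instance annotations live in a single file under a hypothetical "data" key, and that the referenced images sit in an images/ folder next to it; neither assumption is part of the official release, so adapt it to the actual layout.

import json
from pathlib import Path

def iter_samples(annotation_file: str):
    """Yield (prompt, image_paths, response) triples from a DEMON-style
    annotation file. The top-level "data" key and the images/ folder
    are assumptions for illustration, not the official layout."""
    root = Path(annotation_file).parent
    meta = json.loads(Path(annotation_file).read_text(encoding="utf-8"))
    instructions = meta["task_instruction"]
    for sample in meta.get("data", []):  # hypothetical key
        instance = sample["task_instance"]
        # Pick the instruction variant this sample refers to.
        instruction = instructions[int(sample["task_instruction_id"])]
        # The context interleaves text with {image#k} / {table#k}
        # placeholders; here we simply prepend the instruction.
        prompt = instruction + instance["context"]
        images = [root / "images" / p for p in instance["images_path"]]
        yield prompt, images, sample.get("response")  # None on the test set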

DEMON Dataset Statistics
Split | Tasks | Scenarios | Images | Instructions | Avg. Images / Instruction | Avg. Words / Instruction
DEMON-Test | 29 | 19 | 62.81K | 18.18K | 3.46 | 92.69
DEMON-Train | 29 | 19 | 1.51M | 430.72K | 3.70 | 98.58

Evaluation

For open-ended generation tasks, the evaluation will be conducted using ROUGE-L (F1), which measures the longest-common-subsequence overlap between the generated text and the reference text.

For multi-choice tasks, Accuracy will serve as the evaluation metric, measuring the correctness of selected options.

The overall score for each team will be defined as the mean of these scores across all tasks, reflecting a comprehensive measure of performance akin to the scoring of human examinations.

The evaluation code can be found in the GitHub repository.
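For reference, here is a minimal sketch of the scoring scheme, using the rouge-score package for ROUGE-L F1. The answer normalization for accuracy (exact match after lowercasing and stripping) is an assumption of this sketch; the official repository is authoritative.

from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 between one prediction and its reference."""
    return _scorer.score(reference, prediction)["rougeL"].fmeasure

def task_score(predictions, references, metric: str) -> float:
    """Mean per-task score (0-100): ROUGE-L F1 for generation tasks,
    exact-match accuracy (an assumed normalization) for multi-choice."""
    if metric == "ROUGE-L":
        vals = [rouge_l_f1(p, r) for p, r in zip(predictions, references)]
    else:  # "Accuracy"
        vals = [float(p.strip().lower() == r.strip().lower())
                for p, r in zip(predictions, references)]
    return 100.0 * sum(vals) / len(vals)

def overall_score(task_scores):
    """Final team score: the unweighted mean over all 29 task scores."""
    return sum(task_scores.values()) / len(task_scores)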

Submission

To participate in the DEMON challenge, please first register by submitting the form.

Participants can send the submission files to dcd-mllm@outlook.com with the subject "DEMON Challenge Submission". We will evaluate the submissions and announce the results on the website later.

The submission file should keep the same structure as the DEMON-Test dataset, with the response field filled with the predicted answer. Image folders are not required for submission.
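As a sketch of the expected shape, the snippet below copies a DEMON-Test annotation file and fills the "response" field of every sample with a model prediction before writing it back out. The predict callable and the single-file layout (the same hypothetical "data" key as in the loader sketch above) are placeholders for illustration.

import json
from pathlib import Path

def build_submission(test_file: str, out_file: str, predict) -> None:
    """Fill "response" for every test sample and write the submission file.
    `predict` is any callable mapping a sample dict to an answer string."""
    data = json.loads(Path(test_file).read_text(encoding="utf-8"))
    for sample in data.get("data", []):  # hypothetical key, as above
        sample["response"] = predict(sample)
    Path(out_file).write_text(json.dumps(data, indent=4), encoding="utf-8")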

Baseline Results

We evaluate several state-of-the-art models on the DEMON challenge. The results are shown below.

Baseline Evaluation Results
Model | Version | Multi-Modal Dialogue | Visual Storytelling | Visual Relation Inference | Multi-Modal Cloze | Knowledge Grounded QA | Text-Rich Images QA | Multi-Image Reasoning
BLIP-2 | vicuna-7b | 11.96 | 20.10 | 3.67 | 18.25 | 39.73 | 30.53 | 39.53
InstructBLIP | vicuna-7b | 33.58 | 24.41 | 11.49 | 21.20 | 47.40 | 44.40 | 48.55
LLaMA-Adapter V2 | llama-7b | 14.22 | 17.57 | 13.51 | 18.00 | 44.80 | 32.00 | 44.03
LLaVA | vicuna-7b | 7.79 | 10.70 | 8.27 | 15.85 | 36.20 | 28.33 | 41.53
MiniGPT-4 | vicuna-7b | 13.70 | 17.07 | 7.95 | 16.60 | 30.27 | 26.40 | 43.50
mPLUG-Owl | llama-7b | 12.67 | 19.33 | 5.40 | 16.25 | 33.27 | 32.47 | 42.50
OpenFlamingo | llama-7b | 16.88 | 24.22 | 13.85 | 21.65 | 32.00 | 30.60 | 41.63
Otter | llama-7b | 15.37 | 15.57 | 11.39 | 16.00 | 41.67 | 27.73 | 43.85
VPG-C | llama-2-7b-chat | 42.70 | 24.76 | 25.50 | 22.95 | 51.00 | 44.93 | 48.68
VPG-C | vicuna-7b | 37.50 | 25.20 | 25.90 | 22.15 | 48.60 | 44.93 | 50.28

Important Dates

Registration Open: 2024-5-1

Challenge Result Submission Deadline: 2024-8-1 (extended from 2024-7-1)

Challenge Technical Paper Submission Deadline: 2024-8-10 (extended from 2024-7-10)

Organizers

Zhiqi Ge, Zhejiang University, China

Juncheng Li, National University of Singapore, Singapore

Qifan Yu, Zhejiang University, China

Wei Zhou, Zhejiang University, China

Siliang Tang, Zhejiang University, China

Yueting Zhuang, Zhejiang University, China

Contact

For any questions, please contact us at dcd-mllm@outlook.com.