November 1, 09:00 am to 12:30 pm, Melbourne time (UTC+11)
Location: Meeting Room 208, MCEC
Time | Event | Speaker | Institution |
---|---|---|---|
09:00 am | Opening | ||
09:10 am | Keynote: Hallucination in MLLMs | Jinda Lu | University of Science and Technology of China |
10:00 am | Challenge Summary | Zhiqi Ge | Zhejiang University |
10:30 am | Coffee Break | ||
11:00 am | Challenge Solution Report | Xian Fu | Tianjin University |
11:30 am | Challenge Solution Report | Jingyu Wei | National University of Defense Technology |
12:00 pm | Open Discussion and Future Directions ||
Rank | Team | Score |
---|---|---|
1 | Synergix | 60.7 |
2 | Cyan | 59.4 |
3 | Temp | 58.5 |
To facilitate research in interleaved vision-language instruction following, we build DEMON, a comprehensive challenge of 29 tasks with varied, demonstrative instructions in a unified instruction-response format, covering 20 diverse scenarios. The tasks fall into seven categories:
Visual Storytelling. This task involves creating a coherent narrative based on a series of images presented in sequence. It tests the model's ability to understand context, sequence, and the progression of events from visual inputs, requiring the construction or prediction of a story continuation.
Multi-Modal Dialogue. In this task, the model must engage in a dialogue that necessitates the interpretation of both visual content and text. It assesses the model's ability to integrate multimodal information and apply it within a conversational context to make decisions or respond to queries.
Visual Relation Inference. The objective is to identify and articulate the relationships or changes between two pictures. This task challenges the model to observe details and analyze the visual information to determine counts, positions, or interactions.
Text-Rich Images QA. This task requires the model to extract and comprehend information from text-heavy images, such as slides or documents. It demands capabilities in text recognition within images and the subsequent application of this information to answer related questions.
Multi-Image Reasoning. The model is tasked with analyzing multiple images to make judgments about them, such as assessing whether they share a similar style or theme. It tests the model's ability to process and compare visual elements, and to make inferences based on visual data.
Multi-Modal Cloze. This task presents the model with a sequence of images or text with a missing element, requiring the correct identification of the subsequent part from the provided options. It evaluates the model's understanding of narrative structure and context comprehension to logically fill in the missing pieces.
Knowledge Grounded QA. The model is challenged to answer questions based on complex diagrams or authoritative text sources. It necessitates a comprehensive understanding of the material and the extraction of relevant information to provide accurate responses.
Task | Scenario | Dataset | Metric |
---|---|---|---|
Multi-Modal Dialogue | |||
Conversational Embodied Dialogue | Embodied | ALFRED | ROUGE-L |
Multi-Modal Dialogue | Conversation | MMCoQA | ROUGE-L |
Visual Relation Inference | |||
Visual Change Captioning | Surveillance | Spot-the-Diff | ROUGE-L |
Visual Change Captioning | Synthetic | CLEVR-Change | ROUGE-L |
Visual Relationship Expressing | General | IEdit | ROUGE-L |
Subtle Difference Expressing | Fine-Grained | Birds-to-Words | ROUGE-L |
Visual Storytelling | |||
Animated Story Completion | Cartoon | AESOP | ROUGE-L |
Animated Story Completion | Cartoon | PororoSV | ROUGE-L |
Animated Story Completion | Cartoon | FlintstonesSV | ROUGE-L |
Sequential Photo Storytelling | Album | VIST | ROUGE-L |
Sequential Photo Storytelling | Cartoon | DiDeMoSV | ROUGE-L |
Multi-Modal Cloze | |||
Comic Dialogue Identification | Cartoon | COMICS-Dialogue | Accuracy |
Comic Panel Identification | Cartoon | COMICS-Panel | Accuracy |
Recipe Completion | Recipe | RecipeQA-TextCloze | Accuracy |
Visual Step Cloze | Recipe | RecipeQA-VisualCloze | Accuracy |
Knowledge Grounded QA | |||
Webpage QA | Webpage | WebQA | Accuracy |
Textbook QA | Textbook | TQA | Accuracy |
Complex Multimodal QA | Wikipedia | MMQA | Accuracy |
Text-Rich Images QA | |||
Slide QA | Slide | SlideVQA | Accuracy |
OCR QA | Book Cover | OCR-VQA | Accuracy |
Document QA | Document Image | DocVQA | Accuracy |
Multi-Image Reasoning | |||
Image-Set QA | Driving Recording | nuScenes | Accuracy |
Industrial Inspection | Industrial | VISION | Accuracy |
Fashion QA | Fashion | Fashion200K | Accuracy |
Property Coherence | General | MIT-States-PropertyCoherence | Accuracy |
State Transformation Coherence | General | MIT-States-StateCoherence | Accuracy |
Visual Step Matching | Recipe | RecipeQA-ImageCoherence | Accuracy |
Multi-Image Visual Entailment | General | NLVR2 | Accuracy |
Ambiguity Analysis | Mobile Photo | VizWiz | Accuracy |
For each task, we provide a dataset with a training set and a test set. The annotations are provided as JSON files. An example of the task metadata is shown below.
{
  "dataset": "MMQA",
  "split": "test",
  "num_sample": "500",
  "task_instruction": [
    "Given a collection of relevant data, which includes images, text, and tables, your task is to respond accurately to the ensuing question. You must choose your answer from the Choice List. ",
    "Utilizing the information, including images, text, and tables that I provide, could you provide a correct answer to the following question? You must choose your answer from the Choice List. ",
    "With the aid of provided information like images, text, and tables, your assignment is to respond correctly to the question. You must choose your answer from the Choice List. ",
    "Based on the data provided, which includes various forms like images, text, and tables, please respond to the following query. You must choose your answer from the Choice List. ",
    "Using the relevant information I provide, which encompasses images, text, and tables, please accurately answer the question. You must choose your answer from the Choice List. ",
    "Considering the information in different formats I've provided you, could you formulate a correct response to the ensuing query? You must choose your answer from the Choice List. ",
    "Given a set of information including images, text, and tables, your task is to use this data to answer the subsequent question correctly. You must choose your answer from the Choice List. ",
    "Relying on the furnished information, which includes graphics, text, and tabular data, could you provide an accurate response to the following question? You must choose your answer from the Choice List. ",
    "With the amalgam of information provided, encompassing images, texts, and tables, please construct a correct answer to the following question. You must choose your answer from the Choice List. ",
    "Using the array of information, including imagery, textual data, and tables, could you provide an accurate answer to the posed question? You must choose your answer from the Choice List. "
  ],
  "question_type": "multi-choice"
}
An example of the instance annotation is shown below.
{
  "sample_id": "0",
  "task_instruction_id": "7",
  "task_instance": {
    "context": "Global Table: {table#1} Context: {image#2} Question: What sports is the Ben Piazza 1976 movie title? Choice List:['soccer', 'baseball', 'basketball', 'football'] Your answer is:",
    "images_path": [
      "0.png",
      "1.jpg"
    ]
  },
  "response": "baseball"
}
The "response" field is contained only in the training set.
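As a rough illustration of how the task metadata and the instance annotations fit together, the sketch below assembles a prompt for one sample. The directory layout and file names (metadata.json, annotations.json) are assumptions made for illustration; consult the dataset release for the actual structure.

import json
from pathlib import Path

# Hypothetical layout: one folder per task containing a metadata file and an annotation file.
task_dir = Path("MMQA/test")
metadata = json.loads((task_dir / "metadata.json").read_text())
instances = json.loads((task_dir / "annotations.json").read_text())

sample = instances[0]
# Each sample selects one of the paraphrased task instructions by index.
instruction = metadata["task_instruction"][int(sample["task_instruction_id"])]

# The context interleaves text with {image#k} / {table#k} placeholders;
# images_path lists the associated image files.
prompt = instruction + sample["task_instance"]["context"]
image_files = [task_dir / p for p in sample["task_instance"]["images_path"]]

print(prompt)
print(image_files)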
The dataset can be downloaded from Google Drive.
Split | Tasks | Scenarios | Images | Instructions | Avg. Images / Instruction | Avg. Words / Instruction |
---|---|---|---|---|---|---|
DEMON-Test | 29 | 19 | 62.81K | 18.18K | 3.46 | 92.69 |
DEMON-Train | 29 | 19 | 1.51M | 430.72K | 3.70 | 98.58 |
For open-ended generation tasks, the evaluation will be conducted using ROUGE-L (F1), which measures the longest-common-subsequence overlap between the generated text and the reference text.
For multi-choice tasks, Accuracy will serve as the evaluation metric, measuring the correctness of selected options.
The overall score for each team will be defined as the mean of these scores across all tasks, reflecting a comprehensive measure of performance akin to the scoring of human examinations.
The evaluation code can be found in the GitHub repository.
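For reference, the snippet below is a minimal, unofficial sketch of the scoring scheme described above: ROUGE-L (F1) for open-ended tasks, accuracy for multi-choice tasks, and an unweighted mean across tasks as the overall score. The whitespace tokenization and the 0-100 scaling are assumptions; the official implementation in the repository is authoritative.

from statistics import mean

def rouge_l_f1(reference: str, prediction: str) -> float:
    """ROUGE-L F1 via longest common subsequence over whitespace tokens."""
    ref, pred = reference.split(), prediction.split()
    if not ref or not pred:
        return 0.0
    # Dynamic-programming LCS length.
    dp = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, p in enumerate(pred, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == p else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(pred)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def task_score(pairs, question_type: str) -> float:
    """pairs: list of (reference, prediction) strings for one task; returns a 0-100 score."""
    if question_type == "multi-choice":
        return 100 * mean(ref.strip() == pred.strip() for ref, pred in pairs)
    return 100 * mean(rouge_l_f1(ref, pred) for ref, pred in pairs)

def overall_score(per_task_scores):
    """Overall score: unweighted mean of the per-task scores."""
    return mean(per_task_scores)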
To participate in the DEMON challenge, please first register by submitting the form.
Participants can send the submission files to dcd-mllm@outlook.com with the subject "DEMON Challenge Submission". We will evaluate the submissions and announce the results on the website later.
The submission file should keep the same structure as the DEMON-Test dataset, with the response field filled with the predicted answer. Image folders are not required for submission.
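For illustration, a minimal sketch of building a submission file from the test annotations is shown below. The paths and the predict() stub are placeholders for your own inference pipeline; only the response field is added, and the rest of the annotation structure is left untouched.

import json
from pathlib import Path

def predict(sample: dict) -> str:
    """Placeholder: run your model on one test instance and return its answer."""
    return ""  # replace with actual inference

src = Path("DEMON-Test/MMQA/annotations.json")   # hypothetical input path
out = Path("submission/MMQA/annotations.json")   # mirror the DEMON-Test layout

samples = json.loads(src.read_text())
for sample in samples:
    sample["response"] = predict(sample)         # fill in the predicted answer

out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(samples, indent=2, ensure_ascii=False))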
We evaluate several state-of-the-art models on the DEMON challenge. The results are shown below.
Model | Version | Multi Modal Dialogue | Visual Story Telling | Visual Relation Inference | Multi Modal Cloze | Knowledge Grounded QA | Text Rich Images QA | Multi Image Reasoning |
---|---|---|---|---|---|---|---|---|
BLIP-2 | vicuna-7b | 11.96 | 20.10 | 3.67 | 18.25 | 39.73 | 30.53 | 39.53 |
InstructBLIP | vicuna-7b | 33.58 | 24.41 | 11.49 | 21.20 | 47.40 | 44.40 | 48.55 |
LLaMA-Adapter V2 | llama-7b | 14.22 | 17.57 | 13.51 | 18.00 | 44.80 | 32.00 | 44.03 |
LLaVA | vicuna-7b | 7.79 | 10.70 | 8.27 | 15.85 | 36.20 | 28.33 | 41.53 |
MiniGPT-4 | vicuna-7b | 13.70 | 17.07 | 7.95 | 16.60 | 30.27 | 26.40 | 43.50 |
mPLUG-Owl | llama-7b | 12.67 | 19.33 | 5.40 | 16.25 | 33.27 | 32.47 | 42.50 |
OpenFlamingo | llama-7b | 16.88 | 24.22 | 13.85 | 21.65 | 32.00 | 30.60 | 41.63 |
Otter | llama-7b | 15.37 | 15.57 | 11.39 | 16.00 | 41.67 | 27.73 | 43.85 |
VPG-C | llama-2-7b-chat | 42.70 | 24.76 | 25.50 | 22.95 | 51.00 | 44.93 | 48.68 |
VPG-C | vicuna-7b | 37.50 | 25.20 | 25.90 | 22.15 | 48.60 | 44.93 | 50.28 |
Registration Open: 2024-5-1
Challenge Result Submission Deadline: 2024-8-1 (extended from 2024-7-1)
Challenge Technical Paper Submission Deadline: 2024-8-10 (extended from 2024-7-10)
Zhiqi Ge, Zhejiang University, China
Juncheng Li, National University of Singapore, Singapore
Qifan Yu, Zhejiang University, China
Wei Zhou, Zhejiang University, China
Siliang Tang, Zhejiang University, China
Yueting Zhuang, Zhejiang University, China
For any questions, please contact us at dcd-mllm@outlook.com.