oai:arXiv.org:2408.06631
Computer Science
2024
21/8/2024
End-to-end interpretation is currently the prevailing paradigm for the remote sensing fine-grained ship classification (RS-FGSC) task.
However, its inference process is uninterpretable, so such models are often criticized as black boxes.
To address this issue, we propose a large vision-language model (LVLM) named IFShip for interpretable fine-grained ship classification.
Unlike traditional methods, IFShip excels in interpretability by accurately conveying the reasoning process of FGSC in natural language.
Specifically, we first design a domain knowledge-enhanced Chain-of-Thought (CoT) prompt generation mechanism.
This mechanism is used to semi-automatically construct a task-specific instruction-following dataset named TITANIC-FGS, which emulates human-like logical decision-making.
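As a hedged illustration only (the paper's actual implementation is not reproduced here; DOMAIN_KNOWLEDGE, COT_STEPS, and build_cot_prompt are hypothetical names), a CoT prompt seeded with class-discriminative domain knowledge might be assembled roughly as follows:

# Hypothetical sketch of a domain knowledge-enhanced CoT prompt builder.
# The cue lists and step wording are illustrative, not from the paper.
DOMAIN_KNOWLEDGE = {
    "aircraft carrier": ["flat flight deck", "island superstructure", "deck markings"],
    "destroyer": ["slender hull", "forward gun mount", "vertical launch cells"],
}

COT_STEPS = [
    "Step 1: Is the image a ship? Describe the overall scene.",
    "Step 2: Describe the hull shape, deck layout, and superstructure.",
    "Step 3: Compare the observed attributes with the candidate classes: {attributes}",
    "Step 4: State the most likely class and justify the decision.",
]

def build_cot_prompt(candidate_classes):
    """Assemble a step-by-step prompt seeded with class-discriminative cues."""
    attributes = "; ".join(
        f"{cls}: {', '.join(DOMAIN_KNOWLEDGE[cls])}" for cls in candidate_classes
    )
    return "\n".join(step.format(attributes=attributes) for step in COT_STEPS)

print(build_cot_prompt(["aircraft carrier", "destroyer"]))

Prompts like these, paired with images and reference answers, would then be reviewed semi-automatically to yield the instruction-following examples in TITANIC-FGS.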
We then train IFShip through task-specific instruction tuning on the TITANIC-FGS dataset.
Building on IFShip, we develop an FGSC visual chatbot that redefines the FGSC problem as a step-by-step reasoning task and conveys the reasoning process in natural language.
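The paper does not specify the chatbot's inference interface, so the following is only a minimal sketch under assumptions: StubLVLM, classify_with_reasoning, and the chat signature are invented for illustration.

# Hypothetical inference loop: pose FGSC as a sequence of reasoning turns,
# where the final turn carries the class label and its natural-language rationale.
REASONING_STEPS = [
    "Is this a ship? Describe the overall scene.",
    "Describe the hull, deck layout, and superstructure.",
    "Which candidate class best matches these attributes, and why?",
]

class StubLVLM:
    """Stand-in for an IFShip-style model; replace with the real chat interface."""
    def chat(self, image, prompt, history):
        return f"(model reply to: {prompt})"

def classify_with_reasoning(model, image):
    """Run the FGSC task as a multi-turn reasoning dialogue."""
    history = []
    for prompt in REASONING_STEPS:
        reply = model.chat(image=image, prompt=prompt, history=history)
        history.append((prompt, reply))
    return history[-1][1]  # final reply: class decision plus rationale

print(classify_with_reasoning(StubLVLM(), image=None))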
Experimental results reveal that the proposed method surpasses state-of-the-art FGSC algorithms in both classification interpretability and accuracy.
Moreover, compared to LVLMs like LLaVA and MiniGPT-4, our approach demonstrates superior expertise in the FGSC task.
It provides an accurate chain of reasoning when fine-grained ship types are recognizable to the human eye and offers interpretable explanations when they are not.
Guo, Mingning, Wu, Mengwei, Shen, Yuxiang, Li, Haifeng, Tao, Chao, 2024, IFShip: A Large Vision-Language Model for Interpretable Fine-grained Ship Classification via Domain Knowledge-Enhanced Instruction Tuning