- Introduction
- Model Family
- Leaderboard
- Training Framework
- Evaluation
- Datasets
- Model Checkpoints
- Usage
- Citation
Effective and efficient access to relevant information is essential for disaster management. However, no existing retrieval model is specialized for this domain, and general-domain models fail to handle the varied search intents inherent to disaster management scenarios, resulting in inconsistent and unreliable performance. To this end, we introduce DMRetriever, the first series of dense retrieval models (33M to 7.6B parameters) tailored for disaster management. It is trained through a novel three-stage framework of bidirectional attention adaptation, unsupervised contrastive pre-training, and difficulty-aware progressive instruction fine-tuning, using high-quality data generated through an advanced data refinement pipeline. Comprehensive experiments demonstrate that DMRetriever achieves state-of-the-art (SOTA) performance across all six search intents at every model scale. Moreover, DMRetriever is highly parameter-efficient: the 596M model outperforms baselines over 13.3× larger, and the 33M model exceeds baselines while using only 7.6% of their parameters.
DMRetriever is released in multiple scales to support a range of deployment environments, from lightweight on-device use to high-capacity research applications:
- Small (33M / 109M): Lightweight models for resource-constrained or edge environments
- Medium (335M): Balanced accuracy and computational efficiency
- Large (596M / 1.5B): Strong performance with enhanced representational capacity
- XL (4B / 7.6B): The best-performing IR models
Results across six search intents at each model scale on DisastIR-Test.
Notation:
- Bold = best in size group
- Underline = second best
- † = statistically significant improvement (p < 0.05, one-tailed Wilcoxon signed-rank test)
| Model | Scale | QA | QAdoc | TW | FC | NLI | STS | Avg. |
|---|---|---|---|---|---|---|---|---|
| thenlper-gte-small | 33M | 18.04 | 9.13 | 10.95 | 49.63 | 37.51 | 55.55 | 30.14 |
| arctic-embed-m | 109M | 33.15 | 14.04 | 8.48 | 35.07 | 38.67 | 56.20 | 30.94 |
| thenlper-gte-base | 109M | 9.18 | 5.42 | 37.91 | 60.45 | 42.52 | 46.07 | 33.59 |
| arctic-embed-m-v1.5 | 109M | 25.76 | 30.41 | 17.95 | 47.97 | 42.88 | 64.16 | 38.19 |
| arctic-embed-s | 33M | 38.58 | 28.81 | 21.33 | 47.21 | 39.85 | 66.96 | 40.46 |
| bge-small-en-v1.5 | 33M | 56.91 | 51.19 | 25.15 | 55.17 | 32.87 | 64.54 | 47.64 |
| bge-base-en-v1.5 | 109M | 51.50 | 52.78 | 46.72 | 59.93 | 41.16 | 68.63 | 53.45 |
| DMRetriever-33M (ours) | 33M | 62.47† | 57.03† | 57.22† | 60.81† | 46.56† | 67.57 | 58.61† |
| DMRetriever-109M (ours) | 109M | 63.19† | 59.55† | 58.88† | 62.48† | 46.93† | 68.79† | 59.97† |
| Model | Scale | QA | QAdoc | TW | FC | NLI | STS | Avg. |
|---|---|---|---|---|---|---|---|---|
| arctic-embed-m-long | 137M | 21.51 | 10.86 | 19.24 | 36.13 | 41.67 | 54.94 | 30.73 |
| arctic-embed-l | 335M | 40.56 | 30.19 | 14.98 | 32.64 | 34.20 | 56.10 | 34.78 |
| bge-large-en-v1.5 | 335M | 56.76 | 54.45 | 32.20 | 54.90 | 35.11 | 64.47 | 49.65 |
| gte-base-en-v1.5 | 137M | 60.51 | 55.62 | 46.26 | 52.24 | 39.59 | 70.40 | 54.10 |
| mxbai-embed-large-v1 | 335M | 64.24 | 62.63 | 39.94 | 58.12 | 40.18 | 68.01 | 55.52 |
| arctic-embed-m-v2.0 | 305M | 61.22 | 62.20 | 47.01 | 57.79 | 42.29 | 64.51 | 55.84 |
| DMRetriever-335M (ours) | 335M | 67.44† | 62.69† | 62.16† | 64.42† | 49.69† | 70.71† | 62.85† |
| Model | Scale | QA | QAdoc | TW | FC | NLI | STS | Avg. |
|---|---|---|---|---|---|---|---|---|
| arctic-embed-l-v2.0 | 568M | 55.23 | 59.11 | 38.11 | 60.10 | 41.07 | 62.61 | 52.70 |
| gte-large-en-v1.5 | 434M | 67.37 | 58.18 | 39.43 | 52.66 | 34.45 | 66.47 | 53.09 |
| Qwen3-Embedding-0.6B | 596M | 66.10 | 52.31 | 62.38 | 64.89 | 50.30 | 67.39 | 60.56 |
| multilingual-e5-large-instruct | 560M | 67.97 | 64.64 | 62.25 | 66.78 | 48.51 | 63.42 | 62.26 |
| multilingual-e5-large | 560M | 66.99 | 64.01 | 62.81 | 59.87 | 50.93 | 74.12 | 63.12 |
| gte-Qwen2-1.5B-instruct | 1.5B | 69.85 | 59.17 | 65.09 | 62.73 | 55.51 | 73.58 | 64.32 |
| inf-retriever-v1-1.5b | 1.5B | 69.41 | 64.29 | 62.99 | 65.39 | 54.03 | 73.92 | 65.01 |
| DMRetriever-596M (ours) | 596M | 72.44† | 67.50† | 65.79† | 69.15† | 55.71† | 74.73† | 67.55† |
| Model | Scale | QA | QAdoc | TW | FC | NLI | STS | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 7.6B | 44.21 | 34.38 | 41.56 | 42.04 | 32.53 | 42.95 | 39.61 |
| gte-Qwen2-7B-instruct | 7.6B | 70.24 | 47.41 | 63.08 | 31.62 | 53.71 | 74.88 | 56.82 |
| NV-Embed-v1 | 7.9B | 68.06 | 62.70 | 56.02 | 59.64 | 48.05 | 67.06 | 60.26 |
| Qwen3-Embedding-4B | 4B | 67.20 | 59.14 | 65.28 | 67.16 | 53.61 | 58.51 | 61.82 |
| e5-mistral-7b-instruct | 7.1B | 65.57 | 64.97 | 63.31 | 67.86 | 47.55 | 66.48 | 62.58 |
| NV-Embed-v2 | 7.9B | 74.47 | 69.37 | 42.40 | 68.32 | 58.20 | 76.07 | 64.80 |
| inf-retriever-v1 | 7.1B | 72.84 | 66.74 | 66.23 | 65.53 | 51.86 | 75.98 | 66.53 |
| SFR-Embedding-Mistral | 7.1B | 71.41 | 67.14 | 69.45 | 70.31 | 50.93 | 72.67 | 66.99 |
| Linq-Embed-Mistral | 7.1B | 74.40 | 70.31 | 64.11 | 70.64 | 52.46 | 71.25 | 67.19 |
| DMRetriever-4B (ours) | 4B | 75.32† | 70.23† | 70.55† | 71.44† | 57.63 | 77.38† | 70.42† |
| DMRetriever-7.6B (ours) | 7.6B | 76.19† | 71.27† | 71.11† | 72.47† | 58.81† | 78.36† | 71.37† |
DMRetriever is trained via a three-stage pipeline designed to adapt both encoder-only and decoder-only architectures effectively to the disaster management domain:

1. Bidirectional Attention Adaptation
   - Script: `sh/PT_BiAttn_Ada.sh`
2. Unsupervised Contrastive Pre-training
   - Scripts:
     - Decoder-only: `sh/PT_S2_decoder.sh`
     - Encoder-only: `sh/PT_S2_encoder.sh`
3. Difficulty-aware Progressive Instruction Fine-tuning
   - Scripts:
     - Decoder-only: `sh/FT_decoder.sh`
     - Encoder-only: `sh/FT_encoder.sh`
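For intuition about the contrastive pre-training stage, the sketch below shows a generic InfoNCE-style loss over a query, one positive passage, and sampled negatives. This is an illustrative reference only, not DMRetriever's exact objective; the temperature value and the cosine-similarity scoring are assumptions, and the real loss lives in the training scripts.

```python
import numpy as np

def info_nce_loss(q, p_pos, p_negs, tau=0.05):
    """Illustrative InfoNCE-style contrastive loss for a single query.

    q:      query embedding, shape (d,)
    p_pos:  positive passage embedding, shape (d,)
    p_negs: negative passage embeddings, shape (n, d)
    tau:    temperature (0.05 is a common choice, assumed here)
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, p_pos, p_negs = l2_normalize(q), l2_normalize(p_pos), l2_normalize(p_negs)
    # Temperature-scaled cosine similarities, positive first
    sims = np.concatenate([[q @ p_pos], p_negs @ q]) / tau
    # Numerically stable softmax cross-entropy with the positive at index 0
    m = sims.max()
    logsumexp = np.log(np.sum(np.exp(sims - m))) + m
    return logsumexp - sims[0]
```

The loss is minimized when the query is close to its positive and far from all negatives; it is always non-negative, since the log-sum-exp over all candidates upper-bounds the positive's score.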
DMRetriever provides flexible evaluation tools for both checkpoint-level and folder-level assessment:

- Evaluate multiple checkpoints during training: `sh/Eva_ckpt_folder_level.sh`
- Evaluate a single checkpoint on the test set: `sh/Eva_single_ckpt_level.sh`
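The leaderboard scores above are retrieval-quality numbers; a standard metric for graded-relevance retrieval benchmarks is nDCG@k. As a reference point, here is the textbook formula in code (not necessarily the benchmark's exact evaluation script):

```python
import math

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k from graded relevance judgments.

    ranked_rels: relevance grades in the order the system returned documents
    all_rels:    all relevance grades for the query (any order), used
                 to build the ideal ranking
    """
    def dcg(rels):
        # Discounted cumulative gain over the top-k positions
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; pushing relevant documents down the list discounts their gain logarithmically.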
The DisastIR benchmark provides separate datasets for training, validation, and testing, designed to support both efficient model development and comprehensive evaluation.
- Training Set: available at DMRetriever_MTT. This dataset is used for training retrieval and ranking models in the disaster domain.
- Validation & Test Sets: available at DisastIR-Devlite. It includes:
  - DisastIR-Devlite: a lightweight validation subset for rapid and reliable model development.
  - DisastIR-Test: the official test set for final model evaluation on the full DisastIR corpus.
Figure 1: Efficiency comparison between DisastIR-Devlite and DisastIR-DevFull.
| Model | QA | QAdoc | TW | FC | NLI | STS | Overall |
|---|---|---|---|---|---|---|---|
| DMRetriever-33M | 0.8467 | 0.7504 | 0.7326 | 0.8859 | 0.8610 | 0.7255 | 0.9008 |
| DMRetriever-596M | 0.8995 | 0.7937 | 0.8042 | 0.9101 | 0.8730 | 0.7725 | 0.9033 |
Table 1. Effectiveness of DisastIR-Devlite: Kendall's τ correlation between DisastIR-Devlite and DisastIR-DevFull across multiple subtasks.
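Table 1's Kendall's τ measures how consistently the lite and full dev sets rank models. For intuition, a minimal tie-free (tau-a) implementation of the statistic looks like this (the paper's analysis may use a tie-corrected variant):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two paired score lists (no tie correction)."""
    assert len(x) == len(y) and len(x) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1   # pair ordered the same way in both lists
        elif s < 0:
            discordant += 1   # pair ordered oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

A τ of 1.0 means the two benchmarks agree on every pairwise model ordering, so the 0.90+ overall values in Table 1 indicate Devlite rankings transfer well to DevFull.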
Together, these datasets form a complete pipeline for developing and benchmarking disaster-domain information retrieval systems, from training and validation to rigorous final testing.
All models are publicly available on Hugging Face.
| Model | Description | Backbone | Backbone Type | Hidden Size | #Layers |
|---|---|---|---|---|---|
| DMRetriever-33M | Base 33M variant | MiniLM | Encoder-only | 384 | 12 |
| DMRetriever-33M-PT | Pre-trained version of 33M | MiniLM | Encoder-only | 384 | 12 |
| DMRetriever-109M | Base 109M variant | BERT-base-uncased | Encoder-only | 768 | 12 |
| DMRetriever-109M-PT | Pre-trained version of 109M | BERT-base-uncased | Encoder-only | 768 | 12 |
| DMRetriever-335M | Base 335M variant | BERT-large-uncased-WWM | Encoder-only | 1024 | 24 |
| DMRetriever-335M-PT | Pre-trained version of 335M | BERT-large-uncased-WWM | Encoder-only | 1024 | 24 |
| DMRetriever-596M | Base 596M variant | Qwen3-0.6B | Decoder-only | 1024 | 28 |
| DMRetriever-596M-PT | Pre-trained version of 596M | Qwen3-0.6B | Decoder-only | 1024 | 28 |
| DMRetriever-4B | Base 4B variant | Qwen3-4B | Decoder-only | 2560 | 36 |
| DMRetriever-4B-PT | Pre-trained version of 4B | Qwen3-4B | Decoder-only | 2560 | 36 |
| DMRetriever-7.6B | Base 7.6B variant | Qwen3-8B | Decoder-only | 4096 | 36 |
| DMRetriever-7.6B-PT | Pre-trained version of 7.6B | Qwen3-8B | Decoder-only | 4096 | 36 |
Please refer to each model's Hugging Face page for specific usage instructions, including input format, embedding extraction, and evaluation examples.
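Whatever the extraction recipe, retrieval itself reduces to ranking documents by similarity between embeddings. The sketch below illustrates that step with placeholder numpy arrays standing in for real query/document embeddings; it assumes cosine similarity, which is typical for dense retrievers but should be confirmed on each model's page.

```python
import numpy as np

def retrieve(query_emb, doc_embs, top_k=3):
    """Rank documents by cosine similarity to a query embedding.

    query_emb: (d,) query embedding
    doc_embs:  (n, d) document embeddings
    Returns the indices of the top_k most similar documents.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity per document
    return np.argsort(-scores)[:top_k]  # highest similarity first
```

For large corpora the same dot-product scoring is usually delegated to an approximate nearest-neighbor index rather than a dense matrix multiply.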
If you find this repository helpful, please consider citing the corresponding paper:
@article{yin2025dmretriever,
title={DMRETRIEVER: A Family of Models for Improved Text Retrieval in Disaster Management},
  author={Yin, Kai and Dong, Xiangjue and Liu, Chengkai and Lin, Allen and Shi, Lingfeng and Mostafavi, Ali and Caverlee, James},
journal={arXiv preprint arXiv:2510.15087},
year={2025}
}