DISRPT/sharedtask2025

Repository for DISRPT2025 Shared Task on Discourse Unit Segmentation, Connective Detection, and Discourse Relation Classification.

Please check the FAQ page on our main website for more information about the shared task, participation, evaluation, and more!

Update (17/06/2025): Full training data has been released! (Note: some surprise languages/datasets may be added during the testing phase)

Update (29/05/2025): Parameter count limitation: closed track participants must ensure that the total number of parameters in their system is below 4 billion.

Update (16/05/2025): Sample data has been released!

Test data as well as surprise datasets will be released in July 2025!

Shared task participants are encouraged to follow this repository in case bugs are found and need to be fixed.

Introduction

The DISRPT 2025 shared task, to be held in conjunction with CODI 2025 and EMNLP 2025, introduces the fourth iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the second iteration of a cross-formalism discourse relation classification task.

We will provide training, development, and test datasets from all available languages and treebanks in the RST, eRST, SDRT, PDTB, dependency, and ISO formalisms, using a uniform format. Because different corpora, languages, and frameworks use different guidelines, the shared task is meant to promote the design of flexible methods for dealing with various guidelines, and to help push forward the discussion of standards for computational approaches to discourse relations. We include data for evaluation both with gold syntax and with provided automatic parses, allowing comparison to the gold-syntax setting.

This year, we provide a unified set of labels for the discourse relation prediction task, to make evaluation across datasets easier. This label set contains only 17 labels, compared to the 353 distinct labels we counted in the original data.

Types of Data

The tasks are oriented towards finding the locus and type of discourse relations in texts, rather than predicting complete trees or graphs. For frameworks that segment text into non-overlapping spans covering each entire document (RST and SDRT), the segmentation task corresponds to finding the starting point of each discourse unit. For PDTB-style datasets, the unit-identification task is to identify the spans of discourse connectives that explicitly signal the existence of a discourse relation. These tasks use the files ending in .tok and .conllu for the plain-text and parsed scenarios respectively.
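Viewed this way, segmentation reduces to per-token "beginning of unit" tagging. A minimal sketch of the round trip between unit-start indices and token labels (the label string "BeginSeg=Yes" is an assumption about the task format; check the actual .tok/.conllu files for the exact label inventory, and note that connective detection uses span labels rather than single start markers):

```python
# Hedged sketch: segmentation as per-token "beginning of unit" labels.
# "BeginSeg=Yes" is an assumed label string for illustration only.

def units_to_token_labels(n_tokens, unit_starts):
    """Map a set of unit-start token indices to per-token labels."""
    starts = set(unit_starts)
    return ["BeginSeg=Yes" if i in starts else "_" for i in range(n_tokens)]

def token_labels_to_units(labels):
    """Invert: recover unit-start indices from per-token labels."""
    return [i for i, lab in enumerate(labels) if lab == "BeginSeg=Yes"]

labels = units_to_token_labels(6, [0, 3])
assert token_labels_to_units(labels) == [0, 3]
```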

For relation classification, the two discourse unit spans are given in text order, together with the direction of the relation and its context, using both plain-text data and stand-off token-index pointers to the treebanked files. Information for each corpus is included in the .rels file, with token indices pointing to the .tok file, though parse information may also be used for the task. The column to be predicted is the final label column; the penultimate orig_label column gives the original label from the source corpus (which may differ) for reference purposes only, and may not be used as system input. The relation direction column may be used as input and does not need to be predicted by systems (essentially, systems are labeling a ready-made, unlabeled but directed dependency graph).

Note that some datasets contain discontinuous discourse units, which sometimes nest the second unit of a discourse relation. In such cases, the unit beginning first in the text is considered unit1, and gaps in the discourse unit are marked with <*> in the inline text representation. Token-index spans point to the exact coverage of the unit either way; for discontinuous units this will comprise multiple token spans.
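Since discontinuous units are encoded as multiple token spans, a small helper for expanding such index spans may be useful. A sketch, assuming spans are written as comma-separated ranges like "5-8,12-14" (verify the exact .rels column conventions against the data):

```python
# Hedged sketch: expanding DISRPT-style token-index spans, including
# discontinuous units assumed to be written as comma-separated ranges.

def expand_span(span):
    """'5-8,12-14' -> [5, 6, 7, 8, 12, 13, 14]; bare indices like '9' pass through."""
    out = []
    for part in span.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            out.extend(range(int(lo), int(hi) + 1))
        else:
            out.append(int(part))
    return out

assert expand_span("5-8,12-14") == [5, 6, 7, 8, 12, 13, 14]
```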

Notes on Discourse Relations

Compared to the data of the 2023 shared task, we provide a unified set of 17 labels across all datasets. Note that some datasets will not include all 17 labels (e.g. attribution is only annotated in RST/eRST/dependency datasets).

The full mapping is given as a json file in utils/mapping_disrpt25.json.
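Collapsing original labels to the unified set is then a dictionary lookup. A minimal sketch, assuming the json maps original corpus labels to unified labels (the key structure of utils/mapping_disrpt25.json should be verified; a toy dict with hypothetical entries stands in for the file here):

```python
import json

# Hedged sketch: collapsing original corpus labels to the unified label set.
# In practice the mapping would come from utils/mapping_disrpt25.json, e.g.
#   mapping = json.load(open("utils/mapping_disrpt25.json", encoding="utf-8"))
# The entries below are hypothetical, for illustration only.
toy_mapping = {
    "antithesis": "adversative",
    "concession": "adversative",
}

def unify(orig_label, mapping):
    # Unseen labels fall through unchanged so gaps surface during evaluation.
    return mapping.get(orig_label, orig_label)

assert unify("antithesis", toy_mapping) == "adversative"
```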

Rules

External resources are allowed, including NLP tools, word embeddings/pre-trained language models, and other gold datasets for multi-task learning (MTL), etc. However, no further gold annotations of the datasets included in the task may be used (example: you may not use OntoNotes coreference to pretrain a system that will be tested on WSJ data from RST-DT or PDTB, since this could contaminate the evaluation; exception: you may do this if you exclude WSJ data from OntoNotes during training). For more details on external resources, please see the FAQ on the shared task website.

Training on the dev sets is not allowed. You may do so as an experiment and report the resulting scores in your paper, but such results will not be considered or reported as the official scores of the system in the overall ranking.

This year we introduce a new constraint and two tracks: only one multilingual model may be submitted per task, and the closed track limits the number of parameters:

  • Closed track: Parameter-count limited (<=4B), openly reproducible models will be evaluated by the DISRPT team and ranked.
  • Open track: We also welcome descriptions of systems based on large / closed models, but these will not participate in the final rankings as we cannot evaluate them.

Please also make sure to set random seeds so that results are as reproducible as possible!
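A minimal seeding helper along these lines covers the standard-library side; extend it with numpy.random.seed / torch.manual_seed (and cuDNN determinism flags) if those libraries are used. The function name is ours:

```python
import os
import random

# Hedged sketch: fixing seeds for reproducibility. Extend with
# numpy/torch seeding if those libraries are part of your system.

def set_seed(seed: int = 42):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
first = random.random()
set_seed(42)
assert first == random.random()  # identical draws after re-seeding
```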

Evaluation

Evaluation scripts are provided for all tasks under utils. In general, final results for each dataset will be reported on the corresponding test partition.

For datasets without a corresponding training set (e.g. eng.dep.covdtb, tur.pdtb.tedm):

  • Scores will be reported on the test partition like any other dataset, using each dataset's own relation inventory.
    • You may collapse relations in any way you like during training, but final results will be reported on each dataset's own relation labels, as given in the last column (label) of the corresponding test .rels file.
  • Systems can be trained on either a corpus with the same language or any other combination of the datasets available in DISRPT 2025.
  • For better interpretation of the results, we kindly ask you to
    • document the composition of the training data in your README.md file as well as in the paper describing the system.
    • also report model performance on the dev sets (wherever applicable) in the paper describing the system (this can go in an appendix).

Directories

The shared task repository currently comprises the following directories:

  • data - individual corpora from various languages and frameworks.
    • Folders are given names in the scheme LANG.FRAMEWORK.CORPUS, e.g. eng.rst.gum is the directory for the GUM corpus, which is in English and annotated in the framework of Rhetorical Structure Theory (RST).
    • Note that some corpora (eng.rst.rstdt, eng.pdtb.pdtb, tur.pdtb.tdb, zho.pdtb.cdtb) do not contain text, and some (eng.rst.gum) have documents without text; in these cases the text must be reconstructed using utils/process_underscores.py.
  • utils - scripts for validating, evaluating and generating data formats. The official scorer for segmentation and connective detection is seg_eval.py, and the official scorer for relation classification is rel_eval.py.

See the README files in individual data directories for more details on each dataset.
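The LANG.FRAMEWORK.CORPUS naming scheme can be split programmatically when iterating over data/; a small sketch (the helper name is ours):

```python
# Hedged sketch: splitting the LANG.FRAMEWORK.CORPUS directory names
# used under data/ into their three components.

def parse_corpus_name(name):
    lang, framework, corpus = name.split(".", 2)
    return {"lang": lang, "framework": framework, "corpus": corpus}

info = parse_corpus_name("eng.rst.gum")
assert info == {"lang": "eng", "framework": "rst", "corpus": "gum"}
```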

Surprise Language(s) and Dataset(s)

The DISRPT25 surprise languages are Polish and Nigerian Pidgin. DISRPT25 also includes a surprise framework, with one dataset annotated in the ISO framework.

The surprise datasets are:

  • deu.pdtb.pcc
  • eng.rst.umuc
  • pcm.pdtb.disconaija
  • pol.iso.pdc
  • zho.pdtb.ted
  • fra.sdrt.summre: note that this dataset only contains data for the segmentation task

Submitting a System

Systems should be accompanied by a regular workshop paper in the EMNLP format, as described on the CODI workshop website. During submission, you will be asked to supply a URL from which your system can be downloaded. If your system does not download necessary resources by itself (e.g. word embeddings), these resources should be included at the download URL. The system download should include a README file describing exactly how paper results can be reproduced. Please do not supply pre-trained models, but rather instructions on how to train the system using the downloaded resources and make sure to seed your model to rule out random variation in results. For any questions regarding system submissions, please contact the organizers.

Important Dates

  • May 16th, 2025: sample release
  • June 17th, 2025: train/dev release
  • July 16th, 2025: test release
  • August 4th, 2025: system submissions
  • September 19th, 2025: camera ready
  • November 8th or 9th, 2025: CODI-CRAC Workshop at EMNLP, China

Statistics

corpus lang framework rel_types discont underscored rels train_toks train_sents train_docs train_segs dev_toks dev_sents dev_docs dev_segs test_toks test_sents test_docs test_segs total_sents total_toks total_docs total_segs seg_style syntax MWTs ellip corpus
ces.rst.crdt ces rst 17 yes no 1,249 11,766 663 48 1,152 1,346 81 3 140 1,552 91 3 161 835 14,664 54 1,453 EDU UD yes no ces.rst.crdt
deu.pdtb.pcc deu pdtb 11 yes no 2,109 26,831 1,773 142 934 3,152 207 17 88 3,239 213 17 94 2,193 33,222 176 1,116 Conn UD no no deu.pdtb.pcc
deu.rst.pcc deu rst 16 yes no 2,882 26,517 1,572 142 2,534 3,117 184 17 282 3,202 188 17 295 1,944 32,836 176 3,111 EDU UD no no deu.rst.pcc
eng.dep.covdtb eng dep 11 yes no 4,985 0 0 0 0 29,405 1,162 150 2,754 31,502 1,181 150 2,951 2,343 60,907 300 5,705 EDU UD yes no eng.dep.covdtb
eng.dep.scidtb eng dep 14 yes no 9,903 62,488 2,570 492 6,740 20,299 815 154 2,130 19,747 817 152 2,116 4,202 102,534 798 10,986 EDU UD yes no eng.dep.scidtb
eng.erst.gentle eng erst 17 yes no 2,552 0 0 0 0 0 0 0 0 17,979 1,334 26 2,716 1,334 17,979 26 2,716 EDU UD (gold) yes no eng.erst.gentle
eng.erst.gum eng erst 17 yes part 30,747 193,740 10,910 191 24,756 30,435 1,679 32 3,897 30,715 1,569 32 3,775 14,158 254,890 255 32,428 EDU UD (gold) yes yes eng.erst.gum
eng.pdtb.gentle eng pdtb 12 yes no 786 0 0 0 0 0 0 0 0 17,979 1,334 26 466 1,334 17,979 26 466 Conn UD (gold) yes no eng.pdtb.gentle
eng.pdtb.gum eng pdtb 13 yes part 13,879 193,740 10,910 191 6,240 30,435 1,679 32 972 30,715 1,569 32 979 14,158 254,890 255 8,191 Conn UD (gold) yes yes eng.pdtb.gum
eng.pdtb.pdtb eng pdtb 13 yes yes 47,792 975,544 40,395 1,805 21,484 97,449 3,983 177 2,178 100,386 4,252 180 2,386 48,630 1,173,379 2,162 26,048 Conn UD (gold) yes no eng.pdtb.pdtb
eng.pdtb.tedm eng pdtb 13 yes no 529 0 0 0 0 2,616 143 2 110 5,569 238 4 231 381 8,185 6 341 Conn UD yes no eng.pdtb.tedm
eng.rst.oll eng rst 17 no no 2,751 37,265 1,770 293 2,511 4,601 209 17 280 4,605 177 17 288 2,156 46,471 327 3,079 EDU UD yes no eng.rst.oll
eng.rst.rstdt eng rst 17 yes yes 19,778 169,321 6,672 309 17,646 17,574 717 38 1,797 22,017 929 38 2,346 8,318 208,912 385 21,789 EDU UD (gold) yes no eng.rst.rstdt
eng.rst.sts eng rst 17 no no 3,058 57,203 2,084 135 2,581 7,129 264 7 291 6,874 243 8 336 2,591 71,206 150 3,208 EDU UD yes no eng.rst.sts
eng.rst.umuc eng rst 15 yes no 4,997 49,727 1,950 77 4,333 6,005 236 4 565 5,858 238 6 523 2,424 61,590 87 5,421 EDU UD yes no eng.rst.umuc
eng.sdrt.msdc eng sdrt 10 no no 27,848 166,719 10,494 307 16,285 17,926 1,151 32 1,860 46,707 3,099 101 5,015 14,744 231,352 440 23,160 EDU UD no no eng.sdrt.msdc
eng.sdrt.stac eng sdrt 11 no no 12,271 42,582 5,946 887 10,159 5,149 717 105 1,239 4,540 731 109 1,154 7,394 52,271 1,101 12,552 EDU UD no no eng.sdrt.stac
eus.rst.ert eus rst 16 yes no 3,632 30,690 1,599 116 2,785 7,219 366 24 677 7,871 415 24 740 2,380 45,780 164 4,202 EDU UD no no eus.rst.ert
fas.rst.prstc fas rst 14 yes no 5,191 52,497 1,713 120 4,607 7,033 202 15 576 7,396 264 15 670 2,179 66,926 150 5,853 EDU UD yes no fas.rst.prstc
fra.sdrt.annodis fra sdrt 12 yes no 3,321 22,515 1,020 64 2,255 5,013 245 11 556 5,171 242 11 618 1,507 32,699 86 3,429 EDU UD no no fra.sdrt.annodis
fra.sdrt.summre fra sdrt 0 0 0 0 210,398 15,582 47 25,532 28,176 2,055 7 3,515 56,818 4,058 13 6,860 21,695 295,392 67 35,907 EDU UD no no fra.sdrt.summre
ita.pdtb.luna ita pdtb 11 yes no 1,525 16,209 2,423 42 671 2,983 453 6 139 6,050 874 12 261 3,750 25,242 60 1,071 Conn UD no no ita.pdtb.luna
nld.rst.nldt nld rst 16 no no 2,264 17,562 1,156 56 1,662 3,783 255 12 343 3,553 240 12 338 1,651 24,898 80 2,343 EDU UD no no nld.rst.nldt
pcm.pdtb.disconaija pcm pdtb 13 yes no 9,903 111,843 7,279 138 3,268 14,561 991 18 369 14,325 972 20 388 9,242 140,729 176 4,025 Conn UD no no pcm.pdtb.disconaija
pol.iso.pdc pol iso 12 yes no 8,543 129,689 7,518 459 4,226 13,923 790 49 463 13,368 834 48 426 9,142 156,980 556 5,115 EDU UD no no pol.iso.pdc
por.pdtb.crpc por pdtb 12 yes no 11,327 147,594 4,078 243 3,994 20,102 581 28 621 19,153 535 31 544 5,194 186,849 302 5,159 Conn UD no no por.pdtb.crpc
por.pdtb.tedm por pdtb 13 yes no 554 0 0 0 0 2,785 148 2 102 5,405 246 4 203 394 8,190 6 305 Conn UD no no por.pdtb.tedm
por.rst.cstn por rst 15 yes no 4,993 52,177 1,825 114 4,601 7,023 257 14 630 4,132 139 12 306 2,221 63,332 140 5,537 EDU UD yes no por.rst.cstn
rus.rst.rrt rus rst 15 yes no 25,095 208,982 10,530 188 22,839 24,490 1,107 19 2,555 29,023 1,494 27 3,240 13,131 262,495 234 28,634 EDU UD no no rus.rst.rrt
spa.rst.rststb spa rst 16 yes no 3,049 43,055 1,548 203 2,472 7,551 254 32 419 8,111 287 32 460 2,089 58,717 267 3,351 EDU UD no no spa.rst.rststb
spa.rst.sctb spa rst 16 yes no 692 10,253 326 32 473 2,448 76 9 103 3,814 114 9 168 516 16,515 50 744 EDU UD no no spa.rst.sctb
tha.pdtb.tdtb tha pdtb 12 yes no 10,861 199,135 5,076 139 8,277 27,326 633 19 1,243 30,062 825 22 1,344 6,534 256,523 180 10,864 Conn UD no no tha.pdtb.tdtb
tur.pdtb.tdb tur pdtb 13 yes yes 3,176 398,515 24,960 159 7,063 49,952 2,948 19 831 47,891 3,289 19 854 31,197 496,358 197 8,748 Conn UD yes no tur.pdtb.tdb
tur.pdtb.tedm tur pdtb 13 yes no 574 0 0 0 0 2,159 141 2 135 4,127 269 4 247 410 6,286 6 382 Conn UD yes no tur.pdtb.tedm
zho.dep.scidtb zho dep 14 no no 1,297 11,288 308 69 871 3,852 103 20 301 3,621 89 20 235 500 18,761 109 1,407 EDU UD no no zho.dep.scidtb
zho.pdtb.cdtb zho pdtb 9 yes yes 5,270 52,061 2,049 125 1,034 11,178 438 21 314 10,075 404 18 312 2,891 73,314 164 1,660 Conn UD no no zho.pdtb.cdtb
zho.pdtb.ted zho pdtb 15 yes no 13,308 144,581 6,882 56 4,701 17,809 880 8 589 19,520 909 8 668 8,671 181,910 72 5,958 Conn UD no no zho.pdtb.ted
zho.rst.gcdt zho rst 17 yes no 8,413 47,639 2,026 40 7,470 7,619 331 5 1,144 7,647 335 5 1,092 2,692 62,905 50 9,706 EDU UD no no zho.rst.gcdt
zho.rst.sctb zho rst 17 yes no 692 9,655 361 32 473 2,264 86 9 103 3,577 133 9 168 580 15,496 50 744 EDU UD no no zho.rst.sctb
Total 16 6 17 33 3 311,796 3,929,781 195,968 7,461 226,629 545,887 26,567 1,136 226,629 663,896 35,170 1,293 226,629 257,705 5,139,564 9,890 306,914 --- --- 17 2 Total
