BRONCO150 is a corpus containing selected sentences from 150 German discharge summaries of cancer patients (hepatocellular carcinoma or melanoma) treated at Charite Universitaetsmedizin Berlin or Universitaetsklinikum Tuebingen. All discharge summaries were manually anonymized, and the original documents were scrambled at the sentence level to make reconstruction of individual reports impossible.
The corpus is annotated with the following entity types, normalized to the terminologies given in brackets: diagnosis (ICD-10), treatment (OPS), and medication (ATC).
BRONCO150 is provided in five splits (randomSentSet1-5) in XML and CoNLL format. Results of baseline state-of-the-art NER methods on all entities, obtained in a cross-validation setting, can be found in the BRONCO paper (see below).
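The following minimal sketch shows one way to read the CoNLL-formatted splits and iterate the five-fold cross-validation setting in Python. The file names and the exact column layout (token in the first column, BIO tag in the last, blank lines separating sentences) are assumptions, not the official specification.

```python
# Minimal sketch: load the five BRONCO150 CoNLL splits and run a
# leave-one-split-out cross-validation loop. File names and the column
# layout are assumed, not taken from the official distribution.
from pathlib import Path

def read_conll(path):
    """Read a CoNLL-style file into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():              # blank line closes a sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])            # assumed: token in the first column
        tags.append(cols[-1])             # assumed: BIO tag in the last column
    if tokens:
        sentences.append((tokens, tags))
    return sentences

splits = [read_conll(f"randomSentSet{i}.conll") for i in range(1, 6)]

# Five folds: hold one split out for testing, train on the other four.
for test_idx in range(5):
    test_data = splits[test_idx]
    train_data = [s for i in range(5) if i != test_idx for s in splits[i]]
    # ... train an NER model on train_data, evaluate on test_data ...
```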
Madeleine Kittner, Mario Lamping, Damian T. Rieke, Julian Götze, Bariya Bajwa, Ivan Jelas, Gina Rüter, Hanjo Hautow, Mario Sänger, Maryam Habibi, Marit Zettwitz, Till de Bortoli, Leonie Ostermann, Jurica Ševa, Johannes Starlinger, Oliver Kohlbacher, Nisar P. Malek, Ulrich Keilholz, Ulf Leser (2021). Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA Open, Volume 4, Issue 2, ooab025.
| # | Contact author | E-mail | Precision (%) | Recall (%) | F1 (%) | Method | Publication |
|---|---|---|---|---|---|---|---|
| 1 | Johanna Bohn | johanna.e.bohn@gmail.com | 82.08 | 79.46 | 80.75 | We fine-tuned a set of transformer-based language models on the BRONCO150 dataset. Hyperparameter optimisation was conducted with the Bayesian optimisation algorithm TPE (Tree-Structured Parzen Estimator); a minimal sketch of this setup follows the table. The best-performing models are the monolingual GELECTRA large and the multilingual XLM-RoBERTa large. | |
| 2 | Aleksander Salek | salekale@hu-berlin.de | 81.18 | 79.79 | 80.48 | xmlroberta_861 | |
| 3 | Henning Schäfer | Henning.Schaefer@uk-essen.de | 79.24 | 77.17 | 78.19 | This approach is based on a pre-trained German Transformer language model (deepsetai/gbert), which was subsequently fine-tuned on the BRONCO150 annotated entities. The model was pre-trained on the German datasets OSCAR, OPUS, Wikipedia, and OpenLegalData. | doi |
| 4 | BRONCO Paper | | 79.75 | 68.33 | 73.6 | CRF | doi |
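A TPE-based hyperparameter search as described in the top-ranked entries could be set up roughly as below with Optuna's TPESampler. The search space, the model identifiers, and the train_and_evaluate stand-in are illustrative assumptions, not the submitters' actual configuration.

```python
# Sketch of a TPE (Tree-Structured Parzen Estimator) hyperparameter search
# with Optuna. The search space and model names are illustrative assumptions.
import optuna

def train_and_evaluate(model_name, learning_rate, batch_size, epochs):
    """Stand-in: fine-tune a token-classification model on four splits,
    evaluate on the held-out split, and return the entity-level F1."""
    # Replace with a real fine-tuning run, e.g. via transformers.Trainer.
    return 0.0  # placeholder value so the sketch executes

def objective(trial):
    model_name = trial.suggest_categorical(
        "model", ["deepset/gelectra-large", "xlm-roberta-large"])
    learning_rate = trial.suggest_float("lr", 1e-6, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    epochs = trial.suggest_int("epochs", 2, 10)
    return train_and_evaluate(model_name, learning_rate, batch_size, epochs)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```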
| # | Contact author | E-mail | Precision (%) | Recall (%) | F1 (%) | Method | Publication |
|---|---|---|---|---|---|---|---|
| 1 | Johanna Bohn | johanna.e.bohn@gmail.com | 94.17 | 95.22 | 94.69 | We fine-tuned a set of transformer-based language models on the BRONCO150 dataset. Hyperparameter optimisation was conducted with the Bayesian optimisation algorithm TPE (Tree-Structured Parzen Estimator). The best-performing models are the monolingual GELECTRA large and the multilingual XLM-RoBERTa large. | |
| 2 | Aleksander Salek | salekale@hu-berlin.de | 94.41 | 94.94 | 94.68 | xmlroberta_861 | |
| 3 | Henning Schäfer | Henning.Schaefer@uk-essen.de | 92.92 | 95.79 | 94.33 | This approach is based on a pre-trained German Transformer language model (deepsetai/gbert), which was subsequently fine-tuned on the BRONCO150 annotated entities; a model-loading sketch follows the table. The model was pre-trained on the German datasets OSCAR, OPUS, Wikipedia, and OpenLegalData. | doi |
| 4 | BRONCO Paper | | 94.85 | 87.92 | 91.25 | CRF | doi |
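The gbert-based entries start from a pre-trained German checkpoint with a token-classification head that is then fine-tuned on the annotated entities. The sketch below only shows loading such a model and running a forward pass; the Hugging Face model ID and the BIO label set are assumptions, and before fine-tuning the predictions are of course untrained.

```python
# Sketch: load a pre-trained German BERT with a token-classification head,
# as a starting point for fine-tuning on the annotated entities.
# Model ID and label set are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DIAGNOSIS", "I-DIAGNOSIS", "B-TREATMENT", "I-TREATMENT",
          "B-MEDICATION", "I-MEDICATION"]   # assumed BIO tag set
tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "deepset/gbert-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

sentence = "Der Patient erhielt Sorafenib bei hepatozellulärem Karzinom."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits         # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)[0].tolist()
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
               [labels[i] for i in pred_ids])))
```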
| # | Contact author | E-mail | Precision (%) | Recall (%) | F1 (%) | Method | Publication |
|---|---|---|---|---|---|---|---|
| 1 | Johanna Bohn | johanna.e.bohn@gmail.com | 79.51 | 83.98 | 81.68 | We fine-tuned a set of transformer-based language models on the BRONCO150 dataset. Hyperparameter optimisation was conducted with the Bayesian optimisation algorithm TPE (Tree-Structured Parzen Estimator). The best-performing models are the monolingual GELECTRA large and the multilingual XLM-RoBERTa large. | |
| 2 | Aleksander Salek | salekale@hu-berlin.de | 77.96 | 82.68 | 80.25 | xmlroberta_861 | |
| 3 | Henning Schäfer | Henning.Schaefer@uk-essen.de | 78.22 | 82.4 | 80.25 | This approach is based on a pre-trained German Transformer language model (deepsetai/gbert), which was subsequently fine-tuned on the BRONCO150 annotated entities. The model was pre-trained on the German datasets OSCAR, OPUS, Wikipedia, and OpenLegalData. | doi |
| 4 | BRONCO Paper | | 84.11 | 73.3 | 78.33 | CRF | doi |
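For the CRF baseline rows, a minimal sketch with sklearn-crfsuite is given below. The feature set and hyperparameters are illustrative assumptions rather than the configuration used in the BRONCO paper, and train_data/test_data are assumed to come from the cross-validation loop in the first sketch above.

```python
# Sketch of a linear-chain CRF baseline with sklearn-crfsuite.
# Features and hyperparameters are illustrative assumptions.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def featurize(sentences):
    X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in sentences]
    y = [tags for _, tags in sentences]
    return X, y

# train_data / test_data as produced by the cross-validation loop above.
X_train, y_train = featurize(train_data)
X_test, y_test = featurize(test_data)

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)
entity_labels = [l for l in crf.classes_ if l != "O"]
print(metrics.flat_f1_score(y_test, crf.predict(X_test),
                            average="weighted", labels=entity_labels))
```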