Supervised training of machine translation models usually benefits from large amounts of parallel corpora, and this effect is most evident for neural machine translation models. However, the collection and alignment of parallel corpora require significant time and labor, and such datasets are not available for all domains or language pairs. Moreover, machine translation in the biomedical domain is characterized by a long tail of medical terminology. For example, the Unified Medical Language System (UMLS), developed by the National Institutes of Health, contains over 2 million names for over 900,000 concepts, far more than the set of common English words. Two prevailing challenges therefore impact biomedical translation quality when training is done on general-domain data. First, domain adaptation (training on out-of-domain data and testing on in-domain data) from the general domain to the biomedical domain is challenging. Second, biomedical concepts unseen in the general-domain training set (covariate shift) are difficult to translate accurately.
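The long-tail claim can be illustrated with a simple frequency count: general-domain function words dominate, while most domain-specific terms occur only once. The mini-corpus below is a hypothetical fragment for demonstration only, not data from the paper.

```python
from collections import Counter

# Hypothetical mini-corpus of biomedical-flavored text (illustrative only).
corpus = (
    "the patient received the drug and the patient improved "
    "pembrolizumab nivolumab atorvastatin the trial measured the outcome"
).split()

counts = Counter(corpus)
# Singletons form the "long tail": terms seen exactly once.
singletons = [w for w, c in counts.items() if c == 1]
print(counts.most_common(3))   # frequent general-domain words dominate
print(sorted(singletons))      # rare, domain-specific terminology
```

Even in this tiny sample, most of the vocabulary sits in the tail; scaled to UMLS-sized terminologies, a general-domain training set will simply never observe most of these terms.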
Biomedical translation requires up-to-date domain knowledge and fluency in both the source and target languages. Such requirements make it challenging to train qualified translators and costly to generate high-quality translations. Recent advances in machine translation have demonstrated translation quality arguably on par with professional human translators in select domains.
Conclusions: Biomedical translation is used across various life science disciplines. Example applications include translation of clinical trial consent forms, regulatory documents, and interpretation within point-of-care facilities.
Biomedical language translation requires multi-lingual fluency as well as relevant domain knowledge. Machine translation represents an effective alternative, but accurate machine translation requires large amounts of in-domain data. While such datasets are abundant in general domains, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge, a parallel corpus does not exist for this language pair in the biomedical domain. We developed an effective pipeline to acquire and process an English-Chinese parallel corpus from the New England Journal of Medicine (NEJM). This corpus consists of about 100,000 sentence pairs and 3,000,000 tokens on each side. We showed that training on out-of-domain data and fine-tuning with as few as 4,000 NEJM sentence pairs improve translation quality by 25.3 (13.4) BLEU for the en \(\rightarrow\) zh (zh \(\rightarrow\) en) direction. Translation quality continues to improve at a slower pace on larger in-domain data subsets, with a total increase of 33.0 (24.3) BLEU for en \(\rightarrow\) zh (zh \(\rightarrow\) en) on the full dataset.
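For readers unfamiliar with the metric reported above, BLEU is the geometric mean of clipped n-gram precisions between a hypothesis and a reference translation, scaled by a brevity penalty. The following is a minimal single-reference, unsmoothed sketch for intuition only; it is not the evaluation tooling used in the paper, which would normally rely on a standard implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (single reference, no smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # penalize short outputs
    return 100 * bp * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))
```

A gain of 25.3 BLEU on this 0–100 scale, as reported for fine-tuning with 4,000 NEJM pairs, is therefore a very large improvement.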