History and Current Status of Disease Ontology

The Disease Ontology has been developed by the Center for Genetic Medicine of Northwestern University since 2003, driven to a large extent by the data aggregation and analysis needs of the NUgene Project at the Center. Disease Ontology was conceived as an open source ontology based on the principles of Gene Ontology, with disease concepts organized into a directed acyclic graph (DAG). The initial builds of Disease Ontology in 2003 and 2004 used ICD-9 as the foundational vocabulary, and were extensively reorganized by process, system affected, and cause (genetic disorders, infectious diseases, metabolic disorders). ICD-9 was a very pragmatic choice, as much of the structured Electronic Medical Records (EMR) data flowing into NUgene consisted of ICD-9 codes. It nevertheless proved to be a poor basis for DO, even after a substantial investment of time to create a more general framework for disease. Because ICD-9 is a set of billing codes, not a disease terminology, its term granularity is highly variable, its term composition is mixed, and its coverage of disease space is uneven.

We revisited our initial assumptions and built a new version of DO based on disease concepts in UMLS. We then used the UMLS concepts to build mappings to SNOMED and ICD-9. We spent some time refining the first few levels of the DO to optimize the conceptual coverage of the higher level terms and the distribution of terms among their branches. As with the previous version of DO, we followed the Open Biomedical Ontology (OBO) and OBO Foundry principles: the ontology is open; it is maintained in a well-defined exchange format (OBO or OWL); it has a unique ID space (DOID:); curation of the ontology is versioned; its content is clearly specified, with well specified definitions; it uses relations according to the OBO Relations Ontology standards (is_a, part_of); it has a well-documented public mission and scope (http://diseaseontology.sourceforge.net/); and it has a plurality of users and collaborative development. It also makes extensive use of external references to map DO terms to UMLS CUIs, SNOMED, ICD-9, ICD-10, and MeSH terms.

In the current release of DO, there are 341,850 external references to 12,564 DO terms. The table below shows the number of mappings between the current version 3 release (3.002, also called revision 21) and these vocabularies.

External Reference              Unique xref:DOID Mappings   Unique xrefs
ICD-9                           186278                      10109
UMLS_SNOMEDCT_2005_01_31_AUI    38912                       38912
UMLS_NCI2004_11_17_AUI          24049                       24049
UMLS_MSH2005_2005_01_17_AUI     21377                       21377
UMLS_CUI                        17023                       17023
UMLS_ST                         14674                       14674
SNOMEDCT_2005_01_31             13116                       13116
UMLS_ICD-9                      10048                       10048
NCI2004_11_17                   6991                        6991
UMLS_MTHICD-9_2005_AUI          3611                        3611
MSH2005_2005_01_17              3502                        3502
UMLS_CSP2004_AUI                2269                        2269

Table 2. Unique mappings in DO to ICD-9, SNOMED CT, NCI Metathesaurus (EVS), MeSH, and UMLS terms. The compound reference names, such as UMLS_SNOMEDCT_2005_01_31_AUI, show the release of UMLS used to perform the mappings. For example, the January 31, 2005 release of the UMLS atom unique identifiers (AUIs) was used for the SNOMED CT mappings. CUIs are the UMLS Concept Unique Identifiers, and STs are the UMLS string unique identifiers. The number of ICD-9 mappings is far higher than the number of unique ICD-9 references, as we have associated each ICD-9 term with all parent DOID terms to facilitate graph traversal. We have not dealt with partial overlap of external references with DO terms; all existing mappings are to the exact or closest concept match between a DO term and the external reference source.
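The effect of associating each ICD-9 term with all parent DOID terms can be illustrated with a minimal sketch. It is not the procedure actually used to build DO: the toy graph, the identifiers, and the function names `ancestors` and `propagate` are all hypothetical, chosen only to show why the count of xref:DOID pairs grows well beyond the count of unique codes once references are propagated up a DAG.

```python
from collections import defaultdict

# Hypothetical toy DAG: child term -> list of is_a parents (IDs are invented).
IS_A = {
    "DOID:3": ["DOID:2"],
    "DOID:2": ["DOID:1"],
    "DOID:4": ["DOID:2", "DOID:1"],
    "DOID:1": [],
}

# Hypothetical direct mappings: external code -> closest-match DO term.
DIRECT_XREFS = {"ICD9:250.0": "DOID:3", "ICD9:401.9": "DOID:4"}


def ancestors(term, is_a):
    """Return every term reachable from `term` via is_a edges (term excluded)."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def propagate(direct_xrefs, is_a):
    """Attach each external code to its matched term and to all of its ancestors."""
    by_term = defaultdict(set)
    for code, term in direct_xrefs.items():
        for t in {term} | ancestors(term, is_a):
            by_term[t].add(code)
    return by_term


mappings = propagate(DIRECT_XREFS, IS_A)
total_pairs = sum(len(codes) for codes in mappings.values())   # xref:DOID pairs
unique_codes = len(set().union(*mappings.values()))             # distinct codes
print(total_pairs, unique_codes)
```

In this toy case two distinct codes yield six xref:DOID pairs, because each code is also attached to the two ancestors of its matched term. A query on a high-level term then retrieves every code beneath it without walking the graph.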

Since DO was designed to facilitate the mapping of diseases and associated conditions to particular medical codes such as ICD-9-CM, SNOMED and MeSH, we view these mappings as an important part of the value of DO. DO is distributed in OBO format and can be viewed, together with its external references, using the latest version of OBO-Edit.
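For readers who prefer to inspect the external references programmatically, the following is a minimal sketch of tallying them per source vocabulary. It assumes the references appear as `xref:` tag lines inside `[Term]` stanzas, as in the OBO 1.2 flat-file format; the tag name used in a given DO release may differ. The sample stanzas, identifiers, and the helper `parse_terms` are hypothetical.

```python
from collections import Counter

# Hypothetical OBO fragment; IDs, names, and xref values are invented.
SAMPLE_OBO = """\
[Term]
id: DOID:0000001
name: example disease A
xref: ICD9:001
xref: UMLS_CUI:C0000001

[Term]
id: DOID:0000002
name: example disease B
is_a: DOID:0000001
xref: ICD9:002
"""


def parse_terms(text):
    """Yield one dict per [Term] stanza holding its id and its list of xrefs."""
    term = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza header closes the previous [Term], if any.
            if term is not None:
                yield term
            term = {"id": None, "xrefs": []} if line == "[Term]" else None
        elif term is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                term["id"] = value
            elif tag == "xref":
                term["xrefs"].append(value)
    if term is not None:
        yield term


terms = list(parse_terms(SAMPLE_OBO))
# Count xrefs per source vocabulary (the prefix before the first colon).
per_source = Counter(x.split(":", 1)[0] for t in terms for x in t["xrefs"])
print(per_source)
```

The sketch ignores OBO features such as trailing comments, escaped characters, and non-Term stanzas beyond skipping them, so a dedicated OBO parser would be the better choice for real files.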

The Disease Ontology site (http://diseaseontology.sourceforge.net) provides an overview of the project, related links and contacts.

At the time of writing, the current version 3 release of DO covers a total of 12,564 terms and contains 21,024 branches, organized by anatomical location, environment, infectious agent and aberrant process. Incremental updates to branches, tags and trunks of the ontology are submitted to SourceForge (http://diseaseontology.svn.sourceforge.net/viewvc/diseaseontology/) in between major version releases. As part of our outreach into the community, the latest version of DO has been included in the BIRN Ontology Task Force site (http://xwiki.nbirn.net:8080/xwiki/bin/view/BIRN-OTF/+Disease+Ontology).

Process and tools used in building DO version 3

The creation of DO version 3 was an expansion of version 2. We took the version 2 graph, which was based on ICD-9 terms from UMLS with the UMLS CUIs preserved, and cleaned and enhanced that graph based on our previous experience with version 1. We then expanded version 2 by building cross products with the NCI Metathesaurus, MeSH, and Complications Screening Program (CSP) terms. Graphically, this process is shown below:

Due to license restrictions, SNOMED was not directly used in this process. We have been working with the SNOMED team either to obtain permission to release SNOMED IDs in DO or to incorporate DO directly into SNOMED. The latter would require that the open source approach DO currently uses also apply to DO inside SNOMED, something that the SNOMED team has been at least willing to discuss.

The Version 3 Disease Ontology graph is thus built from the following vocabularies, all of which are ‘Level 0’ (freely available) under the UMLS terms:

• NCI Metathesaurus (Enterprise Vocabulary Service)
• MeSH (Medical Subject Headings)
• ICD-9 & ICD-10

