’13 ShARe/CLEF eHealth’s Top System: UTHealthCCB (Tang et al.)

The First Team’s Paper: Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model

First of all, I am happy to see that I had captured most of the related work accurately in my 2012 HIMSS Poster: i2b2 NLP challenges, MedLEE, SymText/MPlus, MetaMap, KnowledgeMap, cTAKES and HiTEX.

Second, the authors clarify the similarities and differences between this task and the 2010 i2b2 challenge on clinical problem extraction. There are two major differences: 1) the ShARe/CLEF task allowed disjoint entities, while the 2010 i2b2 challenge only dealt with entities made of consecutive words; and 2) the ShARe/CLEF task required mapping disorder entities to SNOMED-CT (using UMLS CUIs), which was not required in the 2010 i2b2 challenge.

The overall architecture of the disorder concept extraction system:

  1. Preprocessing: Sentence boundary detection and tokenization
  2. Entity Representation: Representation of disorder mentions
  3. Machine Learning: CRF or SSVM
  4. Entity Parsing: Parse results of disorder mentions
  5. Post-processing: Alignment of sentences and tokens
  6. [For Task 1b] CUI Mapping: Vector Space Model (VSM)

Disorder entity recognition : (i) For consecutive disorder entities, the authors used the traditional BIO approach (standard for ML-based NER), where each word is labeled as B (beginning of an entity), I (inside an entity), or O (outside of an entity). The NER problem thus becomes a sequence labeling problem with three classes. (ii) For disjoint disorder entities, the authors introduced two additional sets of tags, D{B,I} and H{B,I}. Words labeled DB or DI belong to a disjoint entity without being shared, while words labeled HB or HI are shared by two or more disjoint concepts. E.g.:

Sentence: “The aortic root and ascending aorta are moderately dilated .”
Encoding: “The/O aortic/DB root/DI and/O ascending/DB aorta/DI are/O moderately/O dilated/HB ./O”
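
To make the tagging scheme concrete, here is a minimal Python sketch that turns a tagged sentence back into disorder mentions. The pairing rule is my own simplifying assumption, not necessarily the authors' entity parsing step: every D span in the sentence is combined with every H span, and the helper names (`spans`, `decode`) are hypothetical.

```python
from typing import List


def spans(tags: List[str], prefix: str) -> List[List[int]]:
    """Collect index runs that start with <prefix>B and continue with <prefix>I."""
    runs: List[List[int]] = []
    current = None
    for i, tag in enumerate(tags):
        if tag == prefix + "B":
            current = [i]
            runs.append(current)
        elif tag == prefix + "I" and current is not None:
            current.append(i)
        else:
            current = None
    return runs


def decode(tokens: List[str], tags: List[str]) -> List[str]:
    """Rebuild entity strings from B/I, DB/DI and HB/HI tags."""
    def text(indices: List[int]) -> str:
        return " ".join(tokens[i] for i in sorted(indices))

    # Consecutive entities come straight from the B/I runs.
    entities = [text(run) for run in spans(tags, "")]
    parts = spans(tags, "D")   # non-shared pieces of disjoint entities
    heads = spans(tags, "H")   # pieces shared by several disjoint entities
    if parts and not heads:
        # Assumption: without a shared head, all D pieces form one entity.
        entities.append(text([i for part in parts for i in part]))
    # Assumption: each D piece pairs with each H piece in the sentence.
    for part in parts:
        for head in heads:
            entities.append(text(part + head))
    return entities


tokens = "The aortic root and ascending aorta are moderately dilated .".split()
tags = "O DB DI O DB DI O O HB O".split()
print(decode(tokens, tags))
# ['aortic root dilated', 'ascending aorta dilated']
```

On the example sentence this recovers the two disjoint disorders that share the head word "dilated". A real system would need a more careful rule for sentences containing several unrelated disjoint entities.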

ML algorithms employed :

  1. Conditional Random Fields (CRFs) [CRFsuite]
  2. Structural Support Vector Machines (SSVMs) [SVM-hmm]

Features used : Bag-of-words, part-of-speech (POS) tags from the Stanford tagger, note type, section information, word representations from Brown clustering and random indexing, and semantic categories of words based on UMLS lookup, MetaMap, or cTAKES outputs.

[Task 1b] Disorder entity encoding : The authors approach this as a ranking problem in which the query is an entity identified in Task 1a and the documents are candidate SNOMED-CT terms. For a given disorder entity, the terms of CUIs that contain all of the entity's words (except stop words) are selected as candidates. Tf-idf vectors are built using all SNOMED-CT terms, and the candidates are ranked by the cosine similarity between each candidate and the disorder entity.
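
The ranking step can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the stop-word list, the smoothed idf formula, the function name `rank_candidates`, and the toy terms with placeholder identifiers (`CUI-1`, ...) are all made up, and the paper may weight terms differently.

```python
import math
from collections import Counter

STOP_WORDS = {"of", "the", "and", "in", "with"}  # hypothetical stop-word list


def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def rank_candidates(mention, terms):
    """Rank (cui, term) pairs against a mention by tf-idf cosine similarity."""
    docs = [tokenize(term) for _, term in terms]
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))

    def idf(word):
        # Smoothed idf computed over all terms (an assumed variant).
        return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

    def vector(words):
        return {w: count * idf(w) for w, count in Counter(words).items()}

    def cosine(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = tokenize(mention)
    if not query:
        return []
    query_vec = vector(query)
    scored = []
    for (cui, term), doc in zip(terms, docs):
        # Candidate filter: the term must contain every non-stop word of the mention.
        if set(query) <= set(doc):
            scored.append((cosine(query_vec, vector(doc)), cui, term))
    return sorted(scored, reverse=True)


# Toy terms with placeholder identifiers, not real UMLS CUIs.
terms = [
    ("CUI-1", "dilated aortic root"),
    ("CUI-2", "dilated aortic root with aneurysm"),
    ("CUI-3", "dilated cardiomyopathy"),
]
for score, cui, term in rank_candidates("aortic root dilated", terms):
    print(f"{score:.3f}  {cui}  {term}")
```

In this toy run the third term is filtered out because it lacks "aortic" and "root", and the exact bag-of-words match ranks above the longer term, whose extra word lowers its cosine similarity.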

CLEF e-Health 2013

Here is the proceedings link where you can find the task reviews as well as the articles for free.

Task #1 : Annotation of disorder mentions in clinical reports by (1a) identifying a span of text as a disorder mention, and (1b) [optional] mapping the span to a UMLS CUI (i.e., a SNOMED-CT concept).

Dataset : The corpus covers different kinds of clinical encounters, including radiology reports, discharge summaries, and ECG/ECHO reports, with about 181K words annotated by the organizers. A total of 200 documents were provided for training (5811 disorder entities, mapped to 1007 unique CUIs or marked CUI-less) and 100 documents were set aside for testing (5340 disorder entities, mapped to 795 CUIs or marked CUI-less).

Competition Results : The best systems achieved an F1 score of 0.75 (0.80 precision, 0.71 recall) in Task 1a and an accuracy of 0.59 in Task 1b. The top three teams in Task 1a were UTHealthCCB.A, NCBI, and CLEAR.
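
As a quick sanity check on the Task 1a numbers, F1 is the harmonic mean of precision and recall. The snippet below simply recomputes it from the rounded figures quoted above; the function name is my own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.80, 0.71), 2))  # 0.75
```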