’13 ShARe/CLEF eHealth’s Top System: UTHealthCCB (Tang et al.)

The First Team’s Paper: Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model

First of all, I am happy to see that I had captured most of the related work accurately in my 2012 HIMSS Poster: i2b2 NLP challenges, MedLEE, SymText/MPlus, MetaMap, KnowledgeMap, cTAKES and HiTEX.

Second, authors clarify the similarity and difference between this task and 2010 i2b2 challenge on clinical problem extraction. The two major differences between these two tasks: 1) ShARe/CLEF task allowed disjoint entities, while 2010 i2b2 clinical problem extraction only dealt with entities of consecutive words; and 2) ShARe/CLEF task required mapping disorder entities to SNOMED-CT (using UMLS CUIs), which was not required in the 2010 i2b2 challenge.

The overview architecture of the disorder concept extraction systems:

  1. Preprocessing: Sentence boundary detection and tokenization
  2. Entity Representation: Representation of disorder mentions
  3. Machine Learning: CRF or SSVM
  4. Entity Parsing: Parse results of disorder mentions
  5. Post-processing: Alignment of sentences and tokens
  6. [For Task 1b] CUI Mapping: Vector Space Model (VSM)

Disorder entity recognition : (i) For consecutive disorder entities, authors used traditional BIO approach (for NER in ML) where each word is labeled as B (beginning of an entity), I (inside an entity), or O(outside of an entity). Thus NER problem turns into a trinary classification problem. (ii) For disjoint disorder entities authors introduced two additional sets of tags D{B,I} and H{B,I}. Words labeled as HB or HI belonged to two or more disjoint concepts. E.g.:

Sentence: “The aortic root and ascending aorta are moderately dilated .”
Encoding: “The/O aortic/DB root/DI and/O ascending/DB aorta/DI are/O moderately/O dilated/HB ./O”

ML algorithms employed :

  1. Conditional Random Fields (CRFs) [CRFsuite]
  2. Structural Support Vector Machines (SSVMs) [SVM-hmm]

Features used : Bag-of-words, part-of-speech (POS) from Stanford tagger, type of notes, section information, word representation from Brown clustering and random indexing, semantic categories of words based on UMLS lookup, MetaMap, or cTAKEs outputs.

[Task 1b] Disorder entity encoding : Authors approach it as a ranking problem where a query is an identified entity (in Task 1a) and the documents are candidate SNOMED-CT terms. For a given disorder entity, corresponding terms of CUIs containing all the words (except stop words) are selected as candidates, their tf-idf vectors are created using all SNOMED-CT terms, and cosine similarities are calculated between pairs of a candidate and the disorder to rank the candidates.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: