The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge
Abstract Clinical narratives (the text notes found in patients’ medical records) are important information sources for secondary use in research. However, in order to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered to be the gold standard approach but is tedious, expensive, slow, and impractical for use with large-scale clinical data. Automated or semi-automated de-identification using computer algorithms is a potentially promising alternative. The Informatics Institute of the University of Alabama at Birmingham is applying de-identification to clinical data drawn from the UAB hospital’s electronic medical records system before releasing them for research. We participated in a shared task challenge by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) at the de-identification regular track to gain experience developing our own automatic de-identification tool. We focused on the popular and successful methods from previous challenges: rule-based, dictionary-matching, and machine-learning approaches. We also explored new techniques such as disambiguation rules, term ambiguity measurement, and used multi-pass sieve framework at a micro level. For the challenge’s primary measure (strict entity), our submissions achieved competitive results (f-measures: 87.3%, 87.1%, and 86.7%). For our preferred measure (binary token HIPAA), our submissions achieved superior results (f-measures: 93.7%, 93.6%, and 93%). With those encouraging results, we gain the confidence to improve and use the tool for the real de-identification task at the UAB Informatics Institute. Highlights We described an automatic de-identification (de-id) system for clinical texts. We used three de-id methods: pattern-matching, dictionary-matching, and machine-learning. Dictionary-matching with disambiguation remained a useful de-id approach. We also explored multi-pass sieve framework, term ambiguity measurement and disambiguation rule. The system achieved competitive results in CEGS N-GRID 2016 challenge, de-id regular track. Graphical abstract [DISPLAY OMISSION]
유료 다운로드의 경우 해당 사이트의 정책에 따라 신규 회원가입, 로그인, 유료 구매 등이 필요할 수 있습니다. 해당 사이트에서 발생하는 귀하의 모든 정보활동은 NDSL의 서비스 정책과 무관합니다.
원문복사신청을 하시면, 일부 해외 인쇄학술지의 경우 외국학술지지원센터(FRIC)에서
무료 원문복사 서비스를 제공합니다.
NDSL에서는 해당 원문을 복사서비스하고 있습니다. 위의 원문복사신청 또는 장바구니 담기를 통하여 원문복사서비스 이용이 가능합니다.
- 이 논문과 함께 출판된 논문 + 더보기