Machine Learning Under Distributional Shifts and Data Scarcity for Biomedical Applications


Tuesday, October 22, 2019
3:30PM – 5PM
POB 6.304

Hien Van Nguyen

Although deep neural networks have emerged as state-of-the-art approaches for many medical image analysis tasks, several fundamental challenges prevent them from achieving their full potential. The first problem is the lack of training data. Deep neural networks often require a large number of labeled training examples to achieve superior accuracy over traditional machine learning algorithms or to match human performance. For example, it takes more than 100,000 clinically labeled images obtained from multiple medical institutions for deep networks to match the diagnostic accuracy of human dermatologists. While crowdsourcing services provide an efficient way to create labels for natural images or text, they are usually not appropriate for medical data, which are subject to high privacy standards and require significant medical or biological knowledge that most crowdsourcing workers do not possess. Asking domain experts to annotate the data is expensive and inefficient, and therefore often fails to produce enough labels for deep networks to flourish. How to deal effectively with data scarcity remains an open research question. The second challenge is domain shift, caused by the difference between the conditions, or domains, under which the systems were developed and those in which they are used. A broad range of important medical and healthcare applications must constantly cope with changing distributions of the input data. Examples include classifying lung nodules in normal Computed Tomography (CT) scans when the algorithms were trained on low-dose CT scans, detecting skin lesions in Asian patients when the available algorithms are optimized for Caucasian patients, and segmenting organs of interest in magnetic resonance imaging (MRI) scans when the segmentation algorithms were trained on CT scans and radiographs.
Although retraining machine learning systems can significantly improve their accuracy in these situations, it is not always possible to obtain enough labeled training data due to practical constraints. In particular, doctors cannot delay treatment for a month to collect data and retrain tumor detection and segmentation algorithms. Domain shift was one of the main reasons why, in 2017, the multimillion-dollar project between MD Anderson Cancer Center and IBM Watson failed to deliver on its promise of revolutionizing cancer care using machine learning. In this talk, I will discuss several approaches for dealing with these two challenges when developing machine learning and artificial intelligence models. They include hierarchical sparse modeling, data-efficient deep network architectures, cross-modality image synthesis, and domain-adaptive learning algorithms. Experimental results on important biomedical applications such as lung nodule detection, brain cell segmentation, and cellular apoptotic classification will be presented to validate the effectiveness of our approaches.

Dr. Hien Van Nguyen is an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Houston (UH). He received his Ph.D. from the University of Maryland at College Park (2013) and his B.S. degree from the National University of Singapore (2007) under a Singaporean Ministry of Foreign Affairs Scholarship. Prior to UH, he was a Research Scientist at Siemens Corporate Research and a Senior Research Scientist at Uber's Self-Driving Car Division. He has co-authored 40+ peer-reviewed articles and 10+ U.S. patents. His research has been funded by NSF, NIH, and startup companies. His work on machine learning for medical diagnosis was featured as a Great Innovative Idea by the Computing Research Association.

Hosted by Tan Bui-Thanh