Recent technological advances in genome sequencing and whole-genome profiling have facilitated the generation of massive amounts of multi-layer omic data. Sequencing a human genome will soon become a common task. A new challenge is how to make sense of such a huge amount of data, especially when thousands (or even millions) of individual genomes are being explored. This is a bottleneck in understanding our genomes (e.g. the molecular mechanisms of genetic diseases) and in using this knowledge to improve our health and wellness. Deriving biological insights from high-throughput data sets requires data-intensive, sophisticated analyses in which data science methods, driven by biological domain expertise, are the key tools.

I am broadly interested in understanding biological systems using high-throughput data. My current research agenda focuses on understanding human cancer genomes through the analysis of large-scale omic datasets, including:
  • Tumor heterogeneity and evolution: Using whole genome, whole exome, and targeted DNA sequencing of temporal, multi-regional, intra- and inter-patient, and xenograft tumors, we seek to understand how cancer evolves, metastasizes, and resists therapies [1-3].

  • The role of long non-coding RNAs (lncRNAs) in cancers: Utilizing gene expression profiles (e.g. RNA-Seq, microarray), epigenetic signatures (e.g. ChIP-Seq), and RNA-protein interaction data (e.g. RIP-Seq), we discover lncRNAs that are potential cancer drivers or biomarkers of tumor progression, metastasis, and patient outcome, and subsequently characterize their function and mechanism in cell line and xenograft models [4-7].

Occasionally, I develop methods and tools to address challenges that are beyond the capability of existing analytical approaches. Here are the tools and databases that I have developed:
  • ClonEvol: inferring and visualizing clonal evolution in multi-sample cancer sequencing
  • Allerdictor: a fast and accurate sequence-based allergen prediction tool employing a text classification approach with a support vector machine
  • The Alternaria Genomes Database: a comprehensive resource for a fungal genus comprised of saprophytes, plant pathogens, and allergenic species

The most up-to-date list of my scientific publications can be found on my Google Scholar Profile or my PubMed collection.


  1. Dang, Ha X., et al. "Clonal evolution of metastatic colorectal cancer." Cancer Research 75.15 Supplement (2015): 4109-4109.
  2. Griffith, Malachi, et al. "Optimizing cancer genome sequencing and analysis." Cell Systems 1.3 (2015): 210-223.
  3. Dang, H. X., White, B. S., Foltz, S. M., Miller, C. A., Luo, J., Fields, R. C., Maher, C. A. "ClonEvol: clonal ordering and visualization in cancer sequencing." Annals of Oncology, mdx517.
  4. White, Nicole M., et al. "Transcriptome sequencing reveals altered long intergenic non-coding RNAs in lung cancer." Genome Biology 15.8 (2014): 429.
  5. Cabanski, Christopher R., et al. "Pan-cancer transcriptome analysis reveals long noncoding RNAs with conserved function." RNA Biology 12.6 (2015): 628-642.
  6. Silva-Fisher, Jessica M., et al. "Metastatic colorectal cancer associated long non-coding RNAs identified by transcriptome sequencing of matched primary and metastatic patient tissues." Cancer Research 75.15 Supplement (2015): 169-169.
  7. White, Nicole M., et al. "Multi-institutional analysis shows that low PCAT-14 expression associates with poor outcomes in prostate cancer." European Urology (2016).