Carlo A Trugenberger and David Peregrim
Data are paramount to modern targeted drug design. Data mining and computational chemistry on large
molecular databases, once innovative sources of insight, are now everyday procedures for
therapy identification. However, an even larger source of valuable information can potentially
be tapped for discoveries: repositories of research documents.
While numerical methods for the analysis of structured data, such as those in genomics and proteomics databases,
are well developed and standard toolboxes are readily available, knowledge discovery from unstructured data in text
documents is still considered the “Holy Grail” of text mining, and no stable methodology has yet emerged from the
few known attempts.
Here we review a recent pilot experiment to discover novel biomarkers and phenotypes for diabetes and obesity
through self-organized text mining of about 120,000 PubMed abstracts, public clinical trial summaries, and internal Merck
research documents with the InfoCodex semantic engine. The engine retrieved known entities missed by traditional
approaches and discovered new diabetes and obesity biomarkers and phenotypes, although it also generated
noticeable noise (uninteresting or obvious terms).
The reported text mining approach to biomarker discovery shows promise and could be
developed into a new avenue for pharmaceutical research, especially to shorten the time-to-market of novel drugs or
to speed up early recognition of dead ends and adverse reactions.