The UPR High Performance Computing facility (HPCf) was created in 1999 with the broad mission of supporting computational research in Puerto Rico. Bioinformatics has always played a central role in the HPCf plans, and in 2001 the Bioinformatics Resource Core (BiRC) was established with support from the NCRR BRIN (now PR-INBRE) program. Since then, we have been fortunate to leverage NIH support with additional funds from the UPR and NSF-EPSCoR to create a research resource that is useful to a broad community of computational scientists and biomedical investigators. The Co-Directors of this Core are Dr. Humberto Ortiz Zuazaga (UPR-Rio Piedras) and Dr. Jaime Seguel (UPR-Mayaguez).
At both locations, the PR-INBRE Bioinformatics Resource Core (BiRC) offers a wide spectrum of activities, capabilities and expertise for supporting biomedical research, education, and services. In general terms, these may be classified as follows:
- Support in the establishment of local and international collaborations with bioinformatics experts and resources
- Provisioning of data processing infrastructure (software and hardware)
- Consulting in all aspects of Bioinformatics data processing, analysis and modeling
- Assistance in planning, designing and analyzing biological laboratory experiments.
- Assistance in planning, statistical design, and analysis for a wide array of experimenters in Engineering and the Sciences to (1) characterize a process, (2) build statistical models and machine learning algorithms for prediction, and (3) make optimal decisions.
- Identification of Bioinformatics problems and solutions in biomedical research endeavors
- Organization of seminars and workshops with local and international experts
- Organization of activities and educational experiences that expose students and faculty to the current developments in the field, as well as engaging them in research and collaboration.
- Training of scientists and professionals in the knowledge and skills necessary to use bioinformatics tools effectively
- Training on database fundamentals (e.g., relational design, SQL, NoSQL)
- Training on big data tools, such as Hadoop, Spark, RHadoop, SparkR, and PySpark
- Training of scientists and professionals in the knowledge and skills necessary to manage big data systems
- Training of computing systems administration personnel in the implementation and maintenance of Bioinformatics software systems
Consulting and Services
- Consulting in all aspects of Bioinformatics data analysis, particularly in the design and implementation of algorithmic pipelines
- Consulting in all aspects of Big Data management systems
- Assistance in the writing and evaluation of proposals concerning or involving Bioinformatics
- Advising the university, government and business community on how to integrate bioinformatics services
- Production of short, self-contained instructional units, each on a particular Bioinformatics topic.
- Assistance in the design of Bioinformatics curricula.
- Training of potential Bioinformatics instructors in delivering Bioinformatics lectures, assessing teaching, and updating materials and contents.
In addition, BiRC offers scientists and students an opportunity to interact and collaborate in projects being conducted by associated Bioinformatics research groups. Among them are:
Optimization-driven biological analysis (Dr. Mauricio Cabrera-Ríos)
The Applied Optimization Group (AOG) leads a novel nonparametric approach to biological data analysis. In a typical collaborative research project, AOG can contribute:
- A search for pertinent microarray experiments in NCBI,
- An objective detection of biological entities with significant changes in a meta-analysis fashion,
- The determination of the most correlated cyclic or tree-like path among the important biological entities, and
- A literature review to provide biological leads and supporting evidence for the biological entities and their structure.
AOG has ample experience in nonparametric and traditional statistics, as well as in empirical modeling.
Bioinformatics algorithms (Dr. Jaime Seguel)
Bioinformatics algorithms are created at an exponential pace. By way of illustration, the site HTS Mappers (http://bit.ly/hts-mappers) currently holds 83 different short-read mapping methods that have been created since 2011. Despite the proliferation of methods, the challenges of extracting useful information from biological data sets have increased. Chief among these challenges are:
- Software development. Most bioinformatics algorithms are published as prototypes or proofs of concept. Turning these prototypes into functional software systems requires reviewing their design principles, programming and testing. A fully developed software system must run correctly and efficiently on different computer platforms and interface easily with a number of other systems in a data processing pipeline.
- Algorithm design. According to Moore’s law, hardware power doubles approximately every two years, but the volume of biological data also grows at an exponential pace year over year. Most real-life applications need smarter, principled algorithms that use structured searches over structured data sets and take advantage of parallel processing and data compression techniques to speed up execution and lower storage requirements.
- Benchmarking. In the absence of mathematical models of the cellular mechanisms, most bioinformatics methods are heuristic. Advances in modeling, in turn, depend on the accuracy of the data analysis methods. A way out of this chicken-and-egg problem is extensive benchmarking to gain deeper insight into the behavior of the algorithms in different situations. Currently, Bioinformatics algorithms are rarely benchmarked, and when they are, they are seldom benchmarked on more than one organism.
Genomics (Dr. Taras Oleksyk)
Genomics research focuses on two main topics:
(1) Human population genomics and genetic epidemiology: describing human genomic admixture and population history in an evolutionary context, with implications for human health.
(2) Conservation genomics: genome analysis of island speciation and adaptation in vertebrate species in Puerto Rico and the Caribbean.
Both approaches involve data acquired by next-generation sequencing technology, and provide data for subsequent bioinformatics analysis and training. Specifically, Dr. Oleksyk is interested in next-generation sequencing and in the bioinformatics of genome assembly and annotation, as well as in the identification of genome regions under recent selection.
Modeling (Dr. Saylisse Dávila)
Bioinformatics modeling comes with many challenges, and data pre-processing is among the hardest. Pre-processing tasks include fusing multiple sources of data at the correct level of aggregation, replacing missing values with plausible values, and selecting or extracting the relevant predictors. Also, since no single technique performs best under all scenarios, bioinformatics models require an extensive iterative process to specify the correct form of the model. In a collaborative project, the following support can be provided:
- Data fusion
- Feature extraction and selection
- Missing value imputation
- Model fitting using traditional statistical and machine learning tools
- Model selection and validation
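As a rough illustration of two of the steps listed above (missing value imputation and feature selection), the following Python sketch uses only the standard library. The function names `impute_mean` and `select_by_correlation`, the toy data, and the choice of column-mean imputation and Pearson-correlation ranking are assumptions made for this example; they are not a description of the methods actually used in any particular project.

```python
import math
from statistics import mean


def impute_mean(rows):
    """Replace None entries in each column with that column's observed mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_by_correlation(rows, target, k):
    """Return the indices of the k columns most correlated (in absolute value) with target."""
    n_cols = len(rows[0])
    scores = [
        (abs(pearson([r[j] for r in rows], target)), j) for j in range(n_cols)
    ]
    # Highest score first; ties broken by column index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Hypothetical toy data: 4 samples, 3 predictors, two missing values.
X = [
    [1.0, 5.0, 0.3],
    [2.0, None, 0.1],
    [3.0, 5.0, 0.4],
    [4.0, 7.0, None],
]
y = [1.1, 2.0, 2.9, 4.2]

X_filled = impute_mean(X)
top_features = select_by_correlation(X_filled, y, k=1)
print(top_features)
```

In practice, simple mean imputation and univariate ranking are only a baseline; the iterative model specification described above would typically compare several imputation and selection strategies under cross-validation.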
Data Management (Dr. Manuel Rodriguez-Martínez)
Manipulation of Bioinformatics collections with single-computer data management tools (e.g., relational databases) is impractical due to the sheer volume of the data. Big data software tools and systems (e.g., Hadoop, Spark) become necessary to sift through the data and run machine learning algorithms in a scalable fashion. The following support can be provided to research collaborators:
- Design of big data infrastructure
- Training on database fundamentals (e.g., relational design, SQL, NoSQL)
- Training on big data tools, such as Hadoop and Spark
- Integration of machine learning tools into the big data infrastructure
- Application development and integration through web services
Data integration and Functional Analysis (Dr. Wandaliz Torres-Garcia)
OMICS datasets are being generated at an accelerating rate and are becoming more reliable with advances in nanotechnology. They offer an immense amount of data that must be studied efficiently in order to answer many open questions in fundamental biology research and health informatics. It is becoming clear that these datasets are highly heterogeneous, and that useful data mining techniques together with statistically sound procedures are needed to reveal and clarify molecular drivers. Our research lab can offer its expertise in:
- Data preparation: Inspect data, especially from microarray, RNA sequencing, and protein expression (Reverse Phase Protein Array, RPPA) experiments; review missing data; check for compatible matching IDs; evaluate sample size; and construct the feature database.
- Feature reduction: Evaluate a series of feature selection methods to extract important predictors, taking into consideration main and interaction effects.
- Data integration: Implement data mining techniques that can extract patterns from a plethora of OMICS datasets, including microarray, RNA sequencing, RPPA, copy number, and mutation calls, with a high degree of interaction among these types of predictors.
- Model biological interpretation: Examine the biological relevance of extracted features and patterns through the use of in silico databases that provide information such as functional data and known pathway information (e.g., KEGG, cBioPortal).
- Preliminary hypothesis results: Extract and model data from The Cancer Genome Atlas (TCGA) to test researchers' hypotheses in cancer-related projects and provide preliminary results for further biological experimentation.
Computational and Statistical Tools for Big Data Analytics in Bioinformatics (Dr. Edgar Acuña)
Our research group in Computational and Statistical Learning for Knowledge Discovery (CASTLE) has ten years of experience developing data pre-processing methods for massive data, in particular for data coming from biology. We also have three years of experience working with high performance computing environments such as Hadoop and MapReduce for Big Data analytics, using supercomputers available at the San Diego Supercomputer Center and the TACC of the University of Texas, with the support of the XSEDE project. Our research group (CASTLE) can provide support in the following areas:
- Data preprocessing techniques for omics data, in particular: data cleaning, data reduction, feature selection and outlier detection.
- Application of state-of-the-art Machine Learning and statistical methods for Bioinformatics and HealthCare.
- Training in the use of specialized libraries for Bioinformatics data analysis available in the R/RStudio and IPython Notebook environments.
- Training in high performance computing tools for Bioinformatics such as RHadoop, SparkR and PySpark.
- Application of the two most powerful libraries currently available (Apache Mahout and Spark ML) for Big Data Analytics in Bioinformatics.
Bioinformatics and Genomics (Dr. Steven E Massey)
Dr Massey’s expertise is diverse and includes work in molecular phylogenetics, coarse grained protein structural modelling, simulation of genetic code evolution, protein DNA binding motif prediction, comparative evolutionary genomics of codon reassignments and DNA repair, population genomics and genome wide SNP analysis, ancient DNA and ancient genomics, community profile metagenomics, positive selection genome scans of gut pathogens, grid computing simulation of protein mutational robustness evolution, error correction in nanopore based next generation sequencing technology, oncogene copy number read depth analysis, genetic algorithms for cryptanalysis, and high throughput shotgun metagenomics data mining, specifically comparative meta-metabolomics and novel biochemical pathway discovery. Presently, Dr Massey is interested in high throughput genomics, biomedical genomics, and astro-biogeography.
Bioinformatics and Computational Biology in Health Disparities (Dr. Abiel Roche-Lima)
Research in health disparities has focused on how ethnic backgrounds can affect disease risk and how people respond to particular treatments or drugs. In the United States, minority groups have traditionally been left out of mainstream medical testing. Bioinformatics and computational biology techniques applied to health disparities face problems regarding the lack of public data in research for minority health. The Center for Collaborative Research in Health Disparities (CCRHD) has developed a preliminary plan to provide a single point of contact to assist researchers at UPR-MSC with the general aspects of bioinformatics and computational biology as they apply to biomedical science research in health disparities. This initiative will have a double function: 1) bioinformatics services: assist in biomedical studies of health disparities using computational techniques, and 2) bioinformatics research: develop new bioinformatics tools and computational biology techniques focused on the specificity of health disparity data and analysis.
Current bioinformatics research areas include:
- Biological network comparisons and predictions: considering the lack of public data in health disparity projects, we use machine learning methods based on kernels and automata to represent and predict biological networks. Because kernel methods are used, different types of data (e.g., from projects other than health disparities) can be combined to find general relations. Using automata, large amounts of raw sequence data from other sources can be efficiently represented, processed and analyzed, improving the performance of the algorithms.
- Computational genomics and proteomics: analysis, visualization and interpretation of data from genomics and proteomics projects related to health disparities, e.g., assistance with experimental design, bioinformatics support for extracting quantitative results from genomics or proteomics data, and other related applications.
- Functional Analysis: using known correlated genes and significant genomic regions, analyze how genes work together by testing the functional information through different gene set enrichment approaches (e.g., Gene Ontology, pathway databases and curated and generic gene sets) and exploring biological interaction space using network analysis techniques.
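To make the idea of a gene set enrichment test concrete, the sketch below computes a simple over-representation p-value with the hypergeometric distribution, using only the Python standard library. This is one common approach among several; the function name `enrichment_p_value` and the gene identifiers are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p_value(universe, gene_set, hits):
    """One-sided hypergeometric p-value P(X >= observed overlap).

    universe: all genes that could have been selected
    gene_set: genes annotated to the pathway or ontology term
    hits:     genes of interest (e.g., correlated or significant genes)
    """
    gene_set = gene_set & universe
    hits = hits & universe
    n_universe = len(universe)
    n_set = len(gene_set)
    n_hits = len(hits)
    overlap = len(gene_set & hits)

    total = comb(n_universe, n_hits)
    # Sum the probabilities of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_hits - k)
        for k in range(overlap, min(n_set, n_hits) + 1)
    )
    return tail / total


# Hypothetical example: 20 genes in total, a 5-gene pathway, 4 genes of interest.
universe = {f"gene{i}" for i in range(20)}
pathway = {"gene0", "gene1", "gene2", "gene3", "gene4"}
hits = {"gene0", "gene1", "gene2", "gene10"}

p = enrichment_p_value(universe, pathway, hits)
print(f"overlap = {len(pathway & hits)}, p = {p:.4f}")
```

When many gene sets are tested at once, the resulting p-values would normally be corrected for multiple testing before interpretation.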
Bioinformatics of Gene Expression (Dr. Humberto Ortiz-Zuazaga)
Dr. Ortiz-Zuazaga has developed novel methods of measuring gene expression from microarray and second-generation sequencing data, and of determining regulatory gene networks from this data. He has already established successful collaborations with scientists in biomedical research using Big Data. In this award, he will continue to grow these research collaborations, bringing his quantitative and algorithmic skills to bear on novel biomedical problems. Due to his experience in multiple fields, Dr. Ortiz-Zuazaga is uniquely qualified to abstract the basic algorithmic challenges in many biological problems, and can help translate biological questions into data analysis algorithms. Students in his lab will adapt probabilistic data structures to the task of detecting differential gene expression in de novo RNA-seq experiments, and use these and other data sets to model gene regulatory networks using bioinformatic and statistical methods.
Dr. Ortiz-Zuazaga brings to the project extensive experience in computational biology, ranging from data analysis to modelling and simulation and visualization.
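The text above does not say which probabilistic data structures are used. As one well-known example of the family, the following Python sketch implements a Count-Min Sketch for approximate k-mer counting in sequencing reads. The class name, the parameters `width` and `depth`, and the toy reads are assumptions made for illustration only.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row gives `depth` different hash functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(2, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum over rows is the least-contaminated counter.
        return min(self.table[row][col] for row, col in self._cells(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# Hypothetical toy reads; real RNA-seq data would contain millions of reads.
reads = ["ACGTACGT", "CGTACGTA"]
sketch = CountMinSketch()
for read in reads:
    for kmer in kmers(read, 4):
        sketch.add(kmer)

print(sketch.estimate("ACGT"))
```

Because the table has a fixed size, memory use does not grow with the number of distinct k-mers, at the cost of possible overestimates when different k-mers collide in every row. Comparing such approximate counts between conditions is one way a structure of this kind could feed into a differential expression analysis.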