The INtelligent Data Understanding System (INDUS)

Project Summary

Advances in networks, sensors, storage, computing, and high-throughput data acquisition have led to a proliferation of autonomous, distributed data sources in many areas of human activity. New discoveries in the biological, physical, and social sciences and in engineering are being driven by our ability to discover, share, integrate, and analyze disparate types of data. Statistically based machine learning algorithms offer some of the most cost-effective approaches to discovering experimentally testable predictive models and hypotheses from data. However, the large size, distributed nature, and autonomy of the data sources (and the attendant differences in access, queries allowed, processing capabilities, structure, organization, and underlying data models and data semantics) present hurdles to the effective use of machine learning. This research aims to overcome these hurdles by developing efficient, resource-aware distributed algorithms and software services to support collaborative, integrative knowledge acquisition in such a setting. The research team will implement, deploy, and evaluate the resulting algorithms using benchmark data sets, associated data models and ontologies, and user-specified inter-ontology mappings on a distributed test-bed of networked databases and services at Iowa State University and Kansas State University. The resulting open-source software can potentially transform collaborative e-science in the same way that the Web has transformed information sharing. Broader impacts of this research include enhanced opportunities for research-based training of graduate and undergraduate students, interdisciplinary collaborations, participation of under-represented groups, and development of increasingly sophisticated software to support collaborative, integrative e-science. The ISU project web site, together with the KSU web site, provides access to information about the project, benchmark data, publications, software, and documentation.

Project Funding

Research Grant #0711356 - Collaborative Research: Learning Classifiers from Autonomous, Semantically Heterogeneous, Distributed Data, National Science Foundation (2007-2010). Vasant Honavar (PI-ISU) and Doina Caragea (PI-KSU).

Project Publications

  • Parimi, R. and Caragea, D. (2011). Predicting Friendship Links in Social Networks Using a Topic Modeling Approach. In: Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011), Shenzhen, China.
  • Caragea, C., Caragea, D., Silvescu, A., and Honavar, V. (2010). Semi-Supervised Prediction of Protein Subcellular Localization Using Abstraction Augmented Markov Models. In: Special Issue on Machine Learning in Computational Biology (MLCB), BMC Bioinformatics. 2010 Oct 26;11 Suppl 8:S6.
  • Xia, J., Caragea, D. and Brown, S.J. (2010). Prediction of alternatively spliced exons using support vector machines. In: International Journal on Data Mining and Bioinformatics (IJDMB). Vol. 4, No. 4, 411-430.
  • Caragea, C., Silvescu, A., Caragea, D., and Honavar, V. (2010). Semi-Supervised Sequence Classification Using Abstraction Augmented Markov Models. In: Proceedings of the ACM Conference on Bioinformatics and Computational Biology. Niagara Falls, NY.
  • Caragea, C., Silvescu, A., Caragea, D. and Honavar, V. (2010). Abstraction-Augmented Markov Models. In: Proceedings of the IEEE Conference on Data Mining (ICDM 2010). Sydney, Australia.
  • Volkova, S., Caragea, D., Hsu, W.H., Drouhard, J. and Fowles, L. (2010). Boosting Biomedical Entity Extraction by using Syntactic Patterns for Semantic Relation Discovery. In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence (WI'10), Toronto, Canada.
  • Caragea, C., Caragea, D. and Honavar, V. (2009). Learning Link-Based Classifiers from Ontology-Extended Textual Data. In: Proceedings of the 21st International Conference on Tools with Artificial Intelligence (ICTAI 2009), New Jersey.
  • Xia, J., Caragea, D. and Hsu, W. (2009). Multi Relational Network Analysis Using a Fast Random Walk with Restart. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2009), Miami, FL.
  • Kulkarni, S. and Caragea, D. (2009). Computation of the Semantic Relatedness between Words Using Concept Clouds. In: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR), part of the International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K). Madeira, Portugal.
  • Kulkarni, S. and Caragea, D. (2009). Towards Bridging the Web and the Semantic Web. In: Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence (WI'09), Milan, Italy.
  • Haridas, M. and Caragea, D. (2009). Exploring Wikipedia and DMoz as Knowledge Bases for Engineering a User Interests Hierarchy for Social Network Applications. In: Proceedings of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 2009), Algarve, Portugal.
  • Honavar, V. and Caragea, D. (2009). Towards Semantics-Enabled Infrastructure for Knowledge Acquisition from Distributed Data. In: Next Generation of Data Mining. Eds.: Kargupta, H., Han, J., Yu, P., Motwani, R., and Kumar, V. CRC Press. Ch. 16, pp. 317-337. Invited Chapter.
  • Caragea, D. and Honavar, V. (2009). Learning Classifiers from Distributed Data Sources. In: Encyclopedia of Database Technologies and Applications, 2nd Ed. Ferraggine, V.E., Doorn, J.H., and Rivero, L.C. (Eds.), pp. 589-596.
  • Caragea, D. and Honavar, V. (2009). Knowledge Acquisition from Semantically Heterogeneous Data. In: Encyclopedia of Data Warehousing and Mining, Second Edition, Wang, J. (Ed.). IGI Publishers, pp. 1110-1116.
  • Xia, J., Caragea, D. and Brown, S.J. (2008). Exploring Alternative Splicing Features using Support Vector Machines. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM’08), Philadelphia, PA.
  • Koul, N., Bahirwani, V., Caragea, C., Caragea, D., and Honavar, V. (2008). Learning from Large Autonomous Data Sources using Sufficient Statistics. Short Paper. In: Proceedings of the International Conference on Web Intelligence (WI 2008), Sydney, Australia.
  • Harmon, S., DeLoach, S., Robby, and Caragea, D. (2008). Leveraging Organizational Guidance Policies with Learning to Self-Tune Multiagent Systems. In: Proceedings of the Second IEEE International Conference on Self-Adaption and Self-Organization (SASO’08). Venice, Italy.
  • Bahirwani, V., Caragea, D., Aljandal, W. and Hsu, W.H. (2008). Ontology Engineering and Feature Construction for Predicting Friendship Links in the Live Journal Social Network. In: Proceedings of the KDD 2008 Second Workshop on Social Network Mining and Analysis (SNA-KDD). Las Vegas, NV, August 2008. ACM Digital Library.
  • Paradesi, M.S.R., Caragea, D., and Hsu, W.H. (2007). Structural Prediction of Protein-Protein Interactions in Saccharomyces cerevisiae. In: Proceedings of the 2007 IEEE 7th International Symposium on BioInformatics and BioEngineering (BIBE'07). Boston, MA.


This project is supported by the National Science Foundation under Grant No. 0711356. Any opinions, findings, and conclusions or recommendations expressed on this website are those of the authors and do not necessarily reflect the views of the National Science Foundation.

indus.txt · Last modified: 2017/03/29 21:38 by dcaragea