Finding Relevant Attributes in High Dimensional Data: A Distributed Computing Hybrid Data Mining Strategy

Valdés, Julio J.; Barton, Alan J.

doi:10.1007/978-3-540-71200-8_20


Part of the book series: Lecture Notes in Computer Science (TRS, volume 4374)


Abstract

In many domains, data objects are described by a large number of features (e.g., microarray experiments, or spectral characterizations of organic and inorganic samples). A pipelined approach using two clustering algorithms in combination with Rough Sets is investigated for the purpose of discovering important combinations of attributes in high dimensional data. The Leader algorithm and several k-means algorithms are used as fast procedures for simplifying the attribute sets of the information systems presented to the rough set algorithms. The data, described in terms of these fewer features, are then discretized with respect to the decision attribute according to different rough set based schemes. From the discretized data, the reducts and their derived rules are extracted, and the rules are applied to test data in order to evaluate the resulting classification accuracy in cross-validation experiments. The data mining process is implemented within a high throughput distributed computing environment. Nonlinear transformations of attribute subsets that preserve the similarity structure of the data were also investigated. Their classification ability, and that of the subsets of attributes obtained after the mining process, was described in terms of analytic functions obtained by genetic programming (gene expression programming) and simplified using computer algebra systems. Visual data mining techniques using virtual reality were used to inspect the results. The approach was explored in a series of experiments using leukemia, colon cancer and breast cancer gene expression data, which led to small subsets of genes with high discrimination power.
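
As an illustration of the attribute simplification step, the following is a minimal sketch of single-pass Leader clustering applied to attribute columns. It is not the chapter's implementation: the dissimilarity measure (one minus the absolute Pearson correlation), the threshold value, and the names `correlation_distance` and `leader_clustering` are assumptions made for this example. Each attribute joins the first existing leader that lies within the threshold; otherwise it becomes a new leader, and the leaders form the reduced attribute set.

```python
import math


def correlation_distance(a, b):
    """Assumed dissimilarity between two attributes: 1 - |Pearson correlation|."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A constant attribute has no defined correlation; treat it as dissimilar.
        return 1.0
    return 1.0 - abs(cov / math.sqrt(var_a * var_b))


def leader_clustering(attributes, threshold, distance=correlation_distance):
    """Single pass over the attributes (each given as a list of values).

    Returns (leaders, assignment): the indices of the representative
    attributes, and for every attribute the index of its leader.
    """
    leaders = []
    assignment = []
    for i, column in enumerate(attributes):
        for leader in leaders:
            if distance(column, attributes[leader]) <= threshold:
                assignment.append(leader)  # join the first close-enough leader
                break
        else:
            leaders.append(i)  # no leader is close enough: start a new cluster
            assignment.append(i)
    return leaders, assignment


# Hypothetical toy data: four attributes measured on four objects.
toy_attributes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with attribute 0
    [4.0, 1.0, 3.0, 2.0],
    [8.0, 2.0, 6.0, 4.0],  # perfectly correlated with attribute 2
]
toy_leaders, toy_assignment = leader_clustering(toy_attributes, threshold=0.1)
```

In this toy example the four attributes collapse to two leaders, so only those two columns would be passed on to the later discretization and reduct computation. The result of a Leader pass depends on the order in which the attributes are presented and on the chosen threshold.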



Author information

Authors and Affiliations

National Research Council Canada, M50, 1200 Montreal Rd., Ottawa, ON K1A 0R6, Canada
Julio J. Valdés & Alan J. Barton


Editor information

James F. Peters, Andrzej Skowron, Ivo Düntsch, Jerzy Grzymała-Busse, Ewa Orłowska, Lech Polkowski


About this chapter

Cite this chapter

Valdés, J.J., Barton, A.J. (2007). Finding Relevant Attributes in High Dimensional Data: A Distributed Computing Hybrid Data Mining Strategy. In: Peters, J.F., Skowron, A., Düntsch, I., Grzymała-Busse, J., Orłowska, E., Polkowski, L. (eds) Transactions on Rough Sets VI. Lecture Notes in Computer Science, vol 4374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71200-8_20


DOI: https://doi.org/10.1007/978-3-540-71200-8_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71198-8
Online ISBN: 978-3-540-71200-8
eBook Packages: Computer Science, Computer Science (R0)
