Finding Relevant Attributes in High Dimensional Data: A Distributed Computing Hybrid Data Mining Strategy

  • Chapter
Transactions on Rough Sets VI

Part of the book series: Lecture Notes in Computer Science ((TRS,volume 4374))

Abstract

In many domains, data objects are described by a large number of features (e.g., microarray experiments, or spectral characterizations of organic and inorganic samples). A pipelined approach that combines two clustering algorithms with Rough Sets is investigated for discovering important combinations of attributes in high dimensional data. The Leader algorithm and several k-means algorithms are used as fast procedures for simplifying the attribute set of the information systems presented to the rough sets algorithms. The data, described in terms of these fewer features, are then discretized with respect to the decision attribute according to different rough set based schemes. From the discretized data, the reducts and their derived rules are extracted and applied to test data in order to evaluate the resulting classification accuracy in cross-validation experiments. The data mining process is implemented within a high-throughput distributed computing environment. Nonlinear transformations of attribute subsets that preserve the similarity structure of the data were also investigated. Their classification ability, and that of the attribute subsets obtained after the mining process, was described in terms of analytic functions obtained by genetic programming (gene expression programming) and simplified using computer algebra systems. Visual data mining techniques using virtual reality were used for inspecting the results. This approach was explored in a series of experiments on Leukemia, Colon cancer and Breast cancer gene expression data. The experiments led to small subsets of genes with high discrimination power.
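The attribute simplification step can be illustrated with a minimal sketch of the classical Leader algorithm applied to attributes (columns) rather than objects. This is not the chapter's implementation: the function names `pearson` and `leader_clustering`, the use of Pearson correlation as the similarity measure, the threshold value, and the toy data are all assumptions made for illustration. Each attribute is compared with the current leaders in order; it joins the first leader whose similarity reaches the threshold, and otherwise becomes a new leader. The leaders then serve as the reduced attribute set.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cov / (norm_u * norm_v)


def leader_clustering(columns, threshold):
    """One-pass Leader clustering of attributes.

    columns   -- list of attributes, each a list of values over the same objects
    threshold -- minimum similarity needed to join an existing leader (assumed value)
    Returns (leaders, assignment): indices of the leader attributes, and for each
    attribute the position of its leader within `leaders`.
    """
    leaders = []
    assignment = []
    for j, col in enumerate(columns):
        for k, lead in enumerate(leaders):
            if pearson(col, columns[lead]) >= threshold:
                assignment.append(k)  # join the first sufficiently similar leader
                break
        else:
            leaders.append(j)  # no leader is similar enough: start a new cluster
            assignment.append(len(leaders) - 1)
    return leaders, assignment


# Toy data (hypothetical): four attributes measured on four objects.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],  # nearly proportional to the first attribute
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first
    [1.0, 3.0, 2.0, 4.0],  # correlation 0.8 with the first
]
leaders, assignment = leader_clustering(columns, threshold=0.95)
print(leaders)     # [0, 2, 3]
print(assignment)  # [0, 0, 1, 2]
```

Because the algorithm makes a single pass, its result depends on the order in which attributes are presented and on the chosen threshold, which is the price paid for its speed on data with thousands of attributes.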

Editor information

James F. Peters Andrzej Skowron Ivo Düntsch Jerzy Grzymała-Busse Ewa Orłowska Lech Polkowski

Copyright information

© 2007 Springer Berlin Heidelberg

About this chapter

Cite this chapter

Valdés, J.J., Barton, A.J. (2007). Finding Relevant Attributes in High Dimensional Data: A Distributed Computing Hybrid Data Mining Strategy. In: Peters, J.F., Skowron, A., Düntsch, I., Grzymała-Busse, J., Orłowska, E., Polkowski, L. (eds) Transactions on Rough Sets VI. Lecture Notes in Computer Science, vol 4374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71200-8_20

  • DOI: https://doi.org/10.1007/978-3-540-71200-8_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71198-8

  • Online ISBN: 978-3-540-71200-8

  • eBook Packages: Computer Science (R0)
