TY - JOUR
T1 - A Generalized Methodology for Data Analysis
AU - Angelov, Plamen Parvanov
AU - Gu, Xiaowei
AU - Principe, Jose
N1 - Funding Information:
Manuscript received July 13, 2017; accepted September 7, 2017. Date of publication October 12, 2017; date of current version September 14, 2018. This work was supported by The Royal Society “Novel Machine Learning Paradigms to address Big Data Streams,” under Grant IE141329/2014. This paper was recommended by Associate Editor Y. Zhang. (Corresponding author: Xiaowei Gu.) P. P. Angelov is with the School of Computing and Communications, Lancaster University, Lancaster LA1 4WA, U.K., and also holds an Honorary Professor title with Technical University, Sofia, Bulgaria.
Publisher Copyright:
© 2017 IEEE.
PY - 2018/10/12
Y1 - 2018/10/12
N2 - Based on a critical analysis of data analytics and its foundations, we propose a functional approach to estimate data ensemble properties, which is based entirely on the empirical observations of discrete data samples and the relative proximity of these points in the data space and hence named empirical data analysis (EDA). The ensemble functions include the nonparametric square centrality (a measure of closeness used in graph theory) and typicality (an empirically derived quantity which resembles probability). A distinctive feature of the proposed new functional approach to data analysis is that it does not assume randomness or determinism of the empirically observed data, nor independence. The typicality is derived from the discrete data directly in contrast to the traditional approach, where a continuous probability density function is assumed a priori. The typicality is expressed in a closed analytical form that can be calculated recursively and, thus, is computationally very efficient. The proposed nonparametric estimators of the ensemble properties of the data can also be interpreted as a discrete form of the information potential (known from the information theoretic learning theory as well as the Parzen windows). Therefore, EDA is very suitable for the current move to a data-rich environment, where the understanding of the underlying phenomena behind the available vast amounts of data is often not clear. We also present an extension of EDA for inference. The areas of applications of the new methodology of the EDA are wide because it concerns the very foundation of data analysis. Preliminary tests show its good performance in comparison to traditional techniques.
KW - Data mining and analysis
KW - machine learning
KW - pattern recognition
KW - probability
KW - statistics
UR - http://www.research.lancs.ac.uk/portal/en/publications/a-generalized-methodology-for-data-analysis(a9798f79-604f-4d40-a13a-8b214e68133c).html
UR - http://www.scopus.com/inward/record.url?scp=85029703502&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2017.2753880
DO - 10.1109/TCYB.2017.2753880
M3 - Article
SN - 2168-2267
VL - 48
SP - 2981
EP - 2993
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 10
M1 - 8066441
ER -