Microarray Data Classification using K-Nearest
Neighbor based on Map-Reduce
Amitav Swain (Altisource)
Abstract—Microarray based gene expression profiling has
emerged as an efficient technique for the classification, diagnosis,
prognosis, and treatment of cancer. A disease such as cancer
changes form frequently (veracity) and rapidly (velocity),
generating a huge volume of data. Analyzing microarray datasets
within a bounded time is therefore essential, and this can be
achieved with Big Data concepts on the Hadoop framework. The
major drawback of microarray data is the ‘curse of dimensionality’
problem, which obscures the useful information in the data set
and leads to computational instability. Selecting relevant genes
is therefore a challenging task in microarray data analysis. Most
existing schemes employ a two-stage process: feature
selection/extraction followed by classification. In this paper,
principal component analysis (PCA) and K-Nearest Neighbor (K-NN)
are applied to select the relevant features and to classify the
microarray data, respectively. Both algorithms are implemented
with the Map-Reduce paradigm on the Hadoop framework.
Keywords—Classification, Gene extraction, Hadoop, K-Nearest
Neighbor, Map-Reduce, Microarray, PCA.
I. INTRODUCTION
Accurate diagnosis of disease, particularly cancer, is
vital for the successful application of any specific therapy.
Even though classification related to cancer diagnosis has
improved significantly over the last decade, there is still
scope for less subjective diagnostic methods. Recent
developments indicate that DNA microarrays provide insight into
cancer classification at the gene level, owing to their
capability to measure messenger ribonucleic acid (mRNA)
transcripts for thousands of genes concurrently.
Microarray based gene expression profiling has emerged
as an efficient technique for cancer classification as well as
for diagnosis, prognosis, and treatment purposes [1]. In recent
years, the DNA microarray technique has shown a great impact in
determining the informative genes that cause cancer [2], [3].
The major drawback of microarray data is the curse of
dimensionality problem, i.e., the number of genes N far
exceeds the number of samples M (N >> M); this obscures
the useful information in the data set and leads to computational
instability. Therefore, the selection of relevant genes remains
a challenge in the analysis of microarray data [4].
Feature (gene) selection has motivated many scientists in
functional genomics, and as a result numerous algorithms have
been developed. The aim of gene selection is to select a small
subset of genes from a larger pool, yielding not only good
classification performance but also biologically meaningful
insights. The main objectives of feature selection (FS) are
(a) to avoid over-fitting and improve model (classifier) performance,
(b) to provide faster and more cost-effective models, and
(c) to gain deeper insight into the underlying processes that
generated the data. Filter approaches such as the t-statistic are
widely used to select high-relevance genes; the t-statistic assumes
independence among genes while determining the rankings,
and is computationally efficient [5], [6]. In this paper, however,
Map-Reduce based PCA is used to extract the high-relevance features.
The K-NN algorithm is used for classification under
many circumstances in pattern recognition systems [7], [8],
[9], [10]. It provides a simple non-parametric
procedure for assigning a class label to an input
pattern based on the class labels of its K closest
neighbors. One problem encountered with the
K-NN classifier is that each sample vector is normally
considered equally important in the assignment of a class label
to the input vector. This causes difficulty where
the sample sets overlap, since atypical vectors are given
as much weight as vectors that are truly representative of the
classes (clusters). Another problem is that the classifier gives
no indication of the strength of membership in the assigned
class. To alleviate these problems, the fuzzy set concept [11]
has been incorporated into K-NN, yielding the Fuzzy K-Nearest
Neighbor (Fuzzy K-NN) classifier, which uses fuzzy class
memberships of the sample set and thus produces a fuzzy
classification rule.
Along with Map-Reduce based PCA for feature extraction,
Map-Reduce based K-NN has been employed as the classifier,
and its performance is assessed using 10-fold cross validation.
The results obtained from the experimental work carried out on
the leukemia and ovarian cancer datasets show that the proposed
approach performs well and scales to larger datasets than
classifiers implemented on a conventional framework.
The rest of the paper is organized as follows: Section
II highlights the related work in the field of microarray
classification. Section III presents the proposed approach for
classifying microarray data using Map-Reduce based PCA and
K-NN. Section IV presents the implementation details of the
proposed approach. Section V presents the results obtained,
the interpretations drawn from them, and a comparative analysis
with existing classifiers in the literature. Section VI
concludes the paper with the scope for future work.
II. RELATED WORK
This section gives a brief overview of the feature selection
methods and classifiers used by various researchers; the
respective accuracies achieved in gene classification are
tabulated in Table I.
III. PROPOSED WORK
The presence of a huge number of insignificant and irrelevant
features degrades the quality of analysis of a disease
TABLE I: Results obtained by various researchers and practitioners for classification using the microarray (leukemia) data set. The
table gives the feature selection and classification methodologies and their corresponding accuracies.
Author | Feature selection/extraction method | Classifier used | Accuracy (%)
Lee et al. [7] (2003) | Bayesian model | Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), Support Vector Machine (SVM) | 97.05
Ye et al. [12] (2004) | Uncorrelated Linear Discriminant Analysis (ULDA) | KNN (k=1) | 97.5
Peng et al. [13] (2006) | Fisher ratio | Naive Bayes (NB), Decision Tree J4.8, SVM | 100, 95.83, 98.6
Lipo et al. [5] (2007) | t-test | SVM | 100
Zhang and Deng [8] (2007) | Bayes error based filter (BBF) | SVM, KNN | 100, 98.61
Mundra et al. [6] (2010) | t-test, SVM based t-statistic, SVM with recursive feature elimination (RFE), SVM based t-statistic with RFE | SVM | 96.88, 98.12, 97.88, 98.41
Lee et al. [10] (2011) | χ2-test hybrid with GA+KNN | SVM | 100
Dina et al. [9] (2011) | Multiple scoring gene selection technique (MGS-CM) | SVM, KNN, Linear Discriminant Analysis (LDA) | 90.97
like ‘cancer’. To enhance the quality, it is essential to
analyze the dataset in the proper perspective. This section presents
the proposed approach for the classification of microarray data,
which consists of two phases:
Phase-1: Emphasizes preprocessing the input data using
various methods such as missing data imputation,
normalization, and feature extraction using Map-Reduce based PCA.
Phase-2: Map-Reduce based K-NN has been used as the classifier.
Fig. 1 shows the graphical representation of the proposed
approach, and a brief description of each step is given as
follows.
[Fig. 1 flowchart: Data -> missing value imputation and data normalization -> split into training and test data -> feature extraction using PCA -> classify data based on K-NN (with parameter adjustment) -> evaluate validity of classification; if valid, output classification results, otherwise adjust parameters and reclassify.]
Fig. 1: Proposed work for microarray classification.
1) Data collection:
The data set for classification analysis, which acts as
the requisite input to the models, is obtained from the Kent
Ridge Bio-medical Data Set Repository [1].
2) Missing data imputation and normalization of dataset:
Missing data of a feature (gene) of the microarray data
are imputed using the mean value of the respective
feature. Input feature values are normalized over the
range [0, 1] using the Min-Max normalization technique
[14], as discussed in Algorithms 1 and 2, respectively.
3) Division of Dataset:
The data set is divided into two categories: a
training set and a testing set.
4) Feature extraction of dataset:
Map-Reduce based PCA has been applied to select
the features (new components) having high relevance,
thereby mitigating the curse of dimensionality.
5) Build classifier:
Map-Reduce based K-NN has been built to classify
the microarray dataset.
6) Test the model:
Models are tested using the testing data set, and the
performance of the classifier is then compared.
IV. IMPLEMENTATION
A. Missing value imputation
This step is completed using a Map-Reduce job. First the
data is loaded into the Hadoop Distributed File System
(HDFS). The mapper then reads the input line by line, where each
line contains one full feature set; it calculates the sum of
all the feature values present and then finds the mean. It emits
the feature set and the mean value into the file system. The
reducer then fills in the missing values with the mean of the
feature set and emits the new feature set corresponding
to the feature set Id, as shown in Algorithm 1.
B. Normalization of the Data Set
The output obtained from the previous step is first loaded
into the HDFS. The mapper then reads it line by line and
calculates the minimum and maximum values present in the feature
set. It then emits the feature set, the minimum value, and the
maximum value. The reducer, after reading the output of the
mapper, calculates the normalized values using the formula
given in Algorithm 2 and forms a new feature set. It then
emits the normalized feature set corresponding to the feature
Id.
Algorithm 1 Map-Reduce based Missing Value Imputation
Input: N x M matrix, where N is the number of features and M
is the number of samples.
Output: N x M matrix with no missing values.
1: Begin MR Job
2: MAP:
3: Read a feature set Fi.
4: for each feature value fj do
5: if value present then
6: Sum = Sum + value
7: end if
8: end for
9: Calculate the mean: mean = Sum/M
10: Emit ⟨Fi, mean⟩
11: REDUCE:
12: for each feature set Fi do
13: for each feature value fj do
14: if value not present then
15: fj = mean
16: end if
17: end for
18: end for
19: Emit ⟨i, (f1, f2, ..., fM)⟩
20: End
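As a concrete illustration of Algorithm 1, the following is a minimal Hadoop Streaming style sketch in Python (not the authors' code). It assumes tab-separated input lines carrying a feature Id followed by a comma-separated list of the M sample values, with a missing value encoded as an empty field; unlike step 9 of Algorithm 1, the mean here is taken over the values actually present rather than over all M samples.

Listing 1: Mean-value imputation sketch (Hadoop Streaming style).

#!/usr/bin/env python3
# Sketch of Algorithm 1. Input lines: "<feature_id>\t<v1,v2,...,vM>",
# where an empty field marks a missing value (format assumed, not from the paper).
import sys

def map_phase(lines):
    # MAP: emit the feature id, the raw values, and the mean of the present values.
    for line in lines:
        fid, raw = line.rstrip("\n").split("\t", 1)
        present = [float(v) for v in raw.split(",") if v != ""]
        mean = sum(present) / len(present) if present else 0.0
        yield f"{fid}\t{raw}\t{mean}"

def reduce_phase(lines):
    # REDUCE: replace each missing field with the feature's mean.
    for line in lines:
        fid, raw, mean = line.rstrip("\n").split("\t")
        filled = [v if v != "" else mean for v in raw.split(",")]
        yield f"{fid}\t{','.join(filled)}"

if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = map_phase if phase == "map" else reduce_phase
    for out in step(sys.stdin):
        print(out)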
Algorithm 2 Map-Reduce based Min-Max Normalization
Input: N x M matrix, where N is the number of features and M
is the number of samples.
Output: N x M matrix with values normalized to [0, 1].
1: Begin MR Job
2: MAP:
3: Read a feature set Fi.
4: Calculate the maximum value of Fi, say Max
5: Calculate the minimum value of Fi, say Min
6: Emit ⟨Fi, (Min, Max)⟩
7: REDUCE:
8: for each feature set Fi do
9: for each feature value fj do
10: Calculate nj = (fj − Min)/(Max − Min)
11: end for
12: end for
13: Emit ⟨i, (n1, n2, ..., nM)⟩
14: End
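A corresponding sketch for Algorithm 2 follows (same assumed input format as Listing 1). Since the minimum and maximum of a feature are available within the feature's own input line, the whole step can be done map-side; a guard is added for the Max = Min case, which Algorithm 2 leaves undefined.

Listing 2: Min-Max normalization sketch.

#!/usr/bin/env python3
# Sketch of Algorithm 2. Normalizes each feature row to [0, 1] via
# n_j = (f_j - Min) / (Max - Min).
import sys

for line in sys.stdin:
    fid, raw = line.rstrip("\n").split("\t", 1)
    vals = [float(v) for v in raw.split(",")]
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1.0   # constant feature: avoid division by zero
    normed = ((v - lo) / span for v in vals)
    print(fid + "\t" + ",".join(f"{n:.6f}" for n in normed))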
C. Feature Extraction Using PCA
From the normalized data set, a training set is obtained. The
mapper first fetches a training data sample, calculates its mean
mi and the outer product Pi = xi xi^T, and emits ⟨Si, (Pi, mi)⟩.
The reducer then aggregates all the Pi and finds their mean (B).
The reducer also accumulates the mean values of all samples into
a matrix A. It then calculates C = A A^T and the covariance
matrix D = B − C. It computes the eigenvalues and eigenvectors of
D, sorts them with respect to the eigenvalues, generates the
matrix of eigenvectors corresponding to the top k eigenvalues,
and emits it.
After obtaining the eigenvectors corresponding to the top k
components, the original training and testing data are multiplied
by this matrix to obtain the feature-reduced training and testing
data sets, respectively.
Algorithm 3 Map-Reduce based PCA
Input: M x N matrix, where M is the number of samples and N
is the number of features.
Output: N x K eigenvector matrix for the top K components.
1: Begin MR Job
2: MAP:
3: Fetch a data sample Si.
4: for each training sample tsi do
5: Calculate the mean (mi)
6: Calculate Pi = xi xi^T
7: end for
8: Emit ⟨Si, (Pi, mi)⟩
9: REDUCE:
10: for each sample Si do
11: Calculate the mean of the Pi, say B
12: Form A = [m1, m2, ..., mM]
13: end for
14: Calculate C = A A^T
15: Calculate D = B − C
16: Calculate [eigenvalue, eigenvector] = eig(D)
17: Sort the eigenvalues.
18: Generate the matrix of eigenvectors E corresponding to the top
K eigenvalues.
19: Emit ⟨E⟩
20: End
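The linear algebra of Algorithm 3 can be checked with a single-node NumPy sketch (an illustration, not the distributed implementation; the Map-Reduce version distributes the per-sample outer products Pi = xi xi^T across mappers). Here D = B − C is read as the standard covariance identity Cov = E[x x^T] − m m^T, with m the overall mean vector, which appears to be what steps 11–15 intend.

Listing 3: PCA sketch (single-node check of Algorithm 3).

import numpy as np

def pca_top_k(X, k):
    # X: (M samples x N features), already imputed and normalized.
    m = X.mean(axis=0)                   # overall mean vector
    B = (X.T @ X) / X.shape[0]           # mean of the outer products x_i x_i^T
    D = B - np.outer(m, m)               # covariance matrix (B - C in Algorithm 3)
    eigval, eigvec = np.linalg.eigh(D)   # eigh, since D is symmetric
    order = np.argsort(eigval)[::-1][:k] # indices of the top-k eigenvalues
    return eigvec[:, order]              # N x k eigenvector matrix E

# Projection to the reduced space, as described above:
# X_train_red = X_train @ E;  X_test_red = X_test @ E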
D. Map-Reduce based K-NN classification
The training data set is first uploaded into the HDFS and
loaded into all the mappers. Each mapper then reads a sample
from the testing set and calculates the Euclidean distance from
it to every training sample. It accumulates the distance value
and class label for each training sample, and emits the Id of
the testing sample together with the accumulated distances into
the file system. The reducer then sorts the distances in
ascending order and selects the K nearest neighbors (training
samples). The testing instance is assigned to the most frequent
class among them, as shown in Algorithm 4. The reducer then
emits the instance Id and the assigned class.
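The decision rule of Algorithm 4, including its distance-sum tie-break, can be summarized in the following single-node sketch (an illustration, not the distributed code): the distance computation corresponds to the map phase, and the sorting and voting to the reduce phase.

Listing 4: K-NN classification sketch with tie-breaking.

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    dists = np.linalg.norm(X_train - x, axis=1)    # MAP: Euclidean distances
    nearest = np.argsort(dists)[:k]                # REDUCE: k nearest neighbors
    votes = Counter(y_train[i] for i in nearest).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        # Tie: sum the neighbor distances per tied class; smallest sum wins.
        tied = [c for c, n in votes if n == votes[0][1]]
        sums = {c: sum(dists[i] for i in nearest if y_train[i] == c) for c in tied}
        return min(sums, key=sums.get)
    return votes[0][0]                             # majority class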
V. RESULTS AND INTERPRETATION
In this section, the results obtained with the proposed
algorithm (Section III) are discussed for two case studies,
viz., the leukemia microarray dataset [1] and the ovarian cancer
dataset [15].
The classification performance is assessed using the 10-fold
cross validation (CV) technique for the leukemia data set.
10-fold cross validation provides a more realistic assessment
of a classifier and of how well it generalizes to unseen data.
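To make the evaluation protocol concrete, a minimal 10-fold cross validation loop is sketched below (it assumes the knn_classify function from Listing 4; the fold assignment by index striding is an illustrative choice, not necessarily the paper's).

Listing 5: 10-fold cross validation sketch.

import numpy as np

def cross_validate(X, y, k, folds=10):
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    errors = 0
    for f in range(folds):
        test = idx[f::folds]                       # fold f: every folds-th index
        train = np.setdiff1d(idx, test)
        errors += sum(knn_classify(X[t], X[train], y[train], k) != y[t]
                      for t in test)
    return errors / len(X)                         # misclassification rate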
A. Case study: Leukemia
The leukemia dataset consists of expression profiles of
7129 features (genes), categorized into the acute lymphoblastic
leukemia (ALL) and acute myeloid leukemia (AML) classes,
with 38 (27 ALL, 11 AML) training and 34 (20 ALL, 14
AML) testing samples [1].
B. Case study: Ovarian cancer
The ovarian cancer dataset consists of 15154 features
(genes), categorized as ‘cancer’ and ‘normal’ classes, having
Algorithm 4 Map-Reduce based K-Nearest Neighbor
Input: A set S = {x1, x2, ..., xn} of testing samples to
be classified.
Output: Classification result for each testing instance.
1: Begin MR Job
2: MAP:
3: Extract a data sample Si.
4: for each testing sample tsi do
5: for each training sample tsj do
6: Calculate the Euclidean distance (dist) between the test
sample and the training sample.
7: Accumulate the dist value for each training sample
8: end for
9: Emit ⟨Si, (dist1, dist2, ..., distj)⟩
10: end for
11: REDUCE:
12: for each testing sample x in S do
13: Sort the dist values in ascending order
14: Select the K nearest neighbors (training samples) of the
testing sample.
15: Assign the testing sample x to the most frequent class (say
c) among the selected training samples.
16: if a tie occurs then
17: Compute the sum of the distances of the neighbors in
each tied class.
18: if the sums are not tied then
19: Classify x into the class with the minimum sum.
20: else
21: Classify x into the last class found with the minimum sum.
22: end if
23: else
24: Classify x into the majority class.
25: end if
26: end for
27: Emit ⟨xi, ci⟩
28: End
253 samples: 162 cancer and 91 normal [15]. Out of the 253
samples, 131 (31 cancerous, 100 non-cancerous) are considered
as training samples and 122 (60 cancerous, 62 non-cancerous)
as testing samples.
Since the data set contains a very large number of features
with irrelevant information, the feature extraction method (PCA)
has been applied to retain the features (components) with
high relevance scores, while those with low relevance scores
are stripped off.
After applying PCA, the proposed classification algorithm
K-NN has been used to classify the reduced dataset.
The algorithm is run by varying the values of k (number of
nearest neighbors) and f (number of features); k is varied
for each given f. The number of misclassified data samples is
recorded in Table II and Table III.
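The sweep that produces Tables II and III can be expressed as follows (a sketch: X_train, y_train, X_test, y_test are placeholders for the loaded dataset, and pca_top_k and knn_classify come from Listings 3 and 4).

Listing 6: Sweep over f (features) and k (neighbors).

import numpy as np

for f in (2, 5, 8, 10, 15):               # number of retained components
    E = pca_top_k(X_train, f)
    Xtr, Xte = X_train @ E, X_test @ E
    for k in (1, 3, 5, 7, 9):             # number of nearest neighbors
        miss = sum(knn_classify(x, Xtr, y_train, k) != y
                   for x, y in zip(Xte, y_test))
        print(f"f={f}, k={k}: {miss} misclassified")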
From Table II, it is observed that for k=3 with f=10 or f=15,
the number of misclassified samples for the leukemia data set
is 3.
From Table III, it is observed that for k=1 and f=10, the
number of misclassified samples for the ovarian cancer data
set is 14.
TABLE II: Number of misclassified samples for the leukemia data set.
k \ No. of features | 2 | 5 | 8 | 10 | 15
1 | 9 | 6 | 6 | 6 | 5
3 | 6 | 7 | 4 | 3 | 3
5 | 7 | 8 | 5 | 6 | 5
7 | 7 | 6 | 5 | 5 | 5
9 | 8 | 6 | 7 | 9 | 8
TABLE III: Number of misclassified samples for the ovarian cancer data set.
k \ No. of features | 2 | 5 | 8 | 10 | 15
1 | 46 | 20 | 22 | 14 | 18
3 | 46 | 26 | 24 | 17 | 19
5 | 47 | 29 | 26 | 18 | 21
7 | 48 | 29 | 30 | 28 | 28
9 | 48 | 33 | 29 | 30 | 33
VI. CONCLUSION
In this paper, an attempt has been made to design a
classification model for classifying the samples of various
datasets into their class labels. A Hadoop based framework has
been designed for the construction of K-NN, and feature
extraction has been carried out using Map-Reduce based PCA.
The performance of the classifier on the various data sets was
evaluated by varying the value of K and the number of features
(f). From the computed results, it is observed that Map-Reduce
based K-NN yields faster results on larger datasets than K-NN
and the existing classifiers in the literature implemented on a
conventional single-node framework.
Further, machine learning techniques such as the support
vector machine (SVM) and logistic regression (LR) can also be
implemented using the Hadoop framework.
REFERENCES
[1] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P.
Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri et al.,
“Molecular classification of cancer: class discovery and class prediction
by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537,
October 1999.
[2] Y. F. Leung and D. Cavalieri, “Fundamentals of cDNA microarray data
analysis,” Trends in Genetics, vol. 19, no. 11, pp. 649–659, November
2003.
[3] M. Flores, T. Hsiao, Y. Chiu, E. Chuang, Y. Huang, and Y. Chen, “Gene
regulation, modulation, and their applications in gene expression data
analysis,” Advances in Bioinformatics, vol. 2013, Article ID 360678,
January 2013.
[4] G. Lee, C. Rodriguez, and A. Madabhushi, “Investigating the efficacy
of nonlinear dimensionality reduction schemes in classifying gene and
protein expression studies,” IEEE/ACM Transactions on Computational
Biology and Bioinformatics, vol. 5, no. 3, pp. 368–384, July–September
2008.
[5] L. Wang, F. Chu, and W. Xie, “Accurate cancer classification using
expressions of very few genes,” IEEE/ACM Transactions on Computational
Biology and Bioinformatics (TCBB), vol. 4, no. 1, pp. 40–53,
January–March 2007.
[6] P. A. Mundra and J. C. Rajapakse, “Gene and sample selection for
cancer classification with support vectors based t-statistic,” Neurocomputing,
vol. 73, no. 13, pp. 2353–2362, May 2010.
[7] K. E. Lee, N. Sha, E. R. Dougherty, M. Vannucci, and B. K. Mallick,
“Gene selection: a Bayesian variable selection approach,” Bioinformatics,
vol. 19, no. 1, pp. 90–97, June 2003.
[8] J.-G. Zhang and H.-W. Deng, “Gene selection for classification of
microarray data based on the Bayes error,” BMC Bioinformatics, vol. 8,
no. 1, pp. 370–378, October 2007.
[9] D. A. Salem, A. Seoud, R. Ahmed, and H. A. Ali, “MGS-CM: A
multiple scoring gene selection technique for cancer classification using
microarrays,” International Journal of Computer Applications, vol. 36,
no. 6, December 2011.
[10] C.-P. Lee and Y. Leu, “A novel hybrid feature selection method for
microarray data analysis,” Applied Soft Computing, vol. 11, no. 1, pp.
208–213, January 2011.
[11] L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, no. 3, pp.
338–353, June 1965.
[12] J. Ye, T. Li, T. Xiong, and R. Janardan, “Using uncorrelated discriminant
analysis for tissue classification with gene expression data,” IEEE/ACM
Transactions on Computational Biology and Bioinformatics (TCBB),
vol. 1, no. 4, pp. 181–190, October 2004.
[13] Y. Peng, W. Li, and Y. Liu, “A hybrid approach for biomarker discovery
from microarray gene expression data for cancer classification,” Cancer
Informatics, vol. 2, p. 301, February 2006.
[14] Y. K. Jain and S. K. Bhandare, “Min max normalization based data
perturbation method for privacy protection,” International Journal of
Computer & Communication Technology (IJCCT), vol. 2, no. 8, pp.
45–50, October 2011.
[15] E. F. Petricoin III, A. M. Ardekani, B. A. Hitt, P. J. Levine, V. A.
Fusaro, S. M. Steinberg, G. B. Mills, C. Simone, D. A. Fishman, E. C.
Kohn et al., “Use of proteomic patterns in serum to identify ovarian
cancer,” The Lancet, vol. 359, no. 9306, pp. 572–577, February 2002.