-
The tables summarize the classification performance (accuracy in TABLE 1, precision in TABLE 2, and recall in TABLE 3)
of the two classification algorithms. Remember that, in fact, the KNN classification algorithm makes no distinction
between training and testing data. For comparison purposes, however, I divided the data set into three categories:
training, testing, and both (= training + testing) data.
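
As a reminder of what the three measures mean, here is a minimal sketch of how accuracy, precision, and recall can be computed from true and predicted labels. It assumes a binary labeling in which one class is treated as the positive class; how the measures were actually computed or averaged for the TABLEs is not stated here, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(y_true, y_pred, positive=1):
    tp, fp, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred, positive=1):
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels, invented for illustration only (not data from the experiments).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

Accuracy counts every correct decision, while precision and recall look only at the positive class, which is one reason why an algorithm can lead in one measure and trail in another.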
-
From TABLE 1, we can say that the KNN classification algorithm outperforms the CB classification algorithm in
classification accuracy. The accuracy also improves as the number of nearest neighbors (i.e., k) increases, but
beyond some k it no longer improves. Indeed, finding the optimal k is one of the issues in the KNN classification
algorithm. Among the three classification performance measures (i.e., accuracy, precision, and recall), accuracy
shows the largest difference between the two algorithms.
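
To make the role of k concrete, the following is a minimal sketch of KNN classification with cosine similarity and a majority vote. It is not the code used for the experiments: the function names, the toy vectors, and the tie-breaking rule (the label counted first wins) are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(train_vectors, train_labels, query, k):
    """Label the query by majority vote among its k most similar training vectors."""
    scored = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,  # highest cosine = nearest neighbor
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.6, 0.5]]
train_labels = ["A", "A", "B", "B", "B"]
query = [0.8, 0.4]

# The predicted label can change with k, which is why choosing k matters.
predictions = {k: knn_predict(train_vectors, train_labels, query, k) for k in (1, 3, 5)}
print(predictions)
```

Note that every prediction compares the query against all training vectors, so the cost of one classification grows with the size of the training data.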
-
For precision (TABLE 2), the KNN classification algorithm still performs better over all data types
(i.e., training, testing, and both data). However, the difference in precision between the two algorithms is
smaller than the difference in accuracy. Moreover, for some k (i.e., k = 1, 2, and 5), the CB classification algorithm
shows slightly better precision values, and its values are consistent over all experiments. So we can say that the CB
classification algorithm gives robust precision values.
-
On the other hand, the CB classification algorithm achieves better recall than the KNN classification algorithm.
As with its precision (see TABLE 2), the CB classification algorithm shows robust recall values.
-
The KNN classification algorithm outperforms the CB classification algorithm only in accuracy. Conversely, the CB
classification algorithm shows similar precision and better recall. Therefore, from TABLE 1, 2, and 3, it is not clear
which algorithm is better for gene classification, because there is no decisive difference between the two algorithms.
-
To look more closely at the classification performance, we check how the three performance measures change
with the size (i.e., percentage) of the training data set. Here, from 10 to 90 percent of the whole data set
was selected as training data, and the remaining data was used as testing data.
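
The splitting scheme above can be sketched as follows. This is only an illustration: the 10-percent step, the random shuffling, the fixed seed, and the function names are assumptions, since the exact sampling procedure is not described here.

```python
import random


def split_by_percentage(items, train_percent, seed=0):
    """Shuffle the items and use the first train_percent percent for training."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


def learning_curve_splits(items, percentages=range(10, 100, 10), seed=0):
    """Return {percent: (training items, testing items)} for 10, 20, ..., 90."""
    return {pct: split_by_percentage(items, pct, seed) for pct in percentages}


# Hypothetical gene identifiers, used only to show the sizes of the splits.
genes = [f"gene_{i}" for i in range(20)]
splits = learning_curve_splits(genes)
for pct, (train, test) in splits.items():
    print(pct, len(train), len(test))
```

Each pair of training and testing items would then be passed to both classifiers, and the three measures recorded per percentage to draw one learning curve.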
-
FIGURE 1/2/3 show how accuracy/recall/precision of the KNN classification algorithm change with the size of the
training data. Interestingly, the shapes and magnitudes (i.e., values) of the lines for training/testing/both data are
similar. As I mentioned before, this is because KNN makes no distinction between training and testing data. One
important thing we can learn from FIGURE 1/2/3 is the fluctuation of the lines. Lines that fluctuate or cross other
lines tell us that the performance of the KNN classification algorithm can be affected by the size of the training data.
However, the amount of fluctuation is not large. The lines also rise as the amount of training data increases.
-
FIGURE 4/5/6 show the characteristics of the CB classification algorithm. With 10 percent of the data used as training
data, it has the highest accuracy, precision, and recall values on the training data, and the training curves gradually
decrease as the size of the training data increases. However, the curves for testing and both data do not change much
(they are almost constant). Moreover, there is no severe fluctuation and there are no crossing lines. The two lines for
testing and both data are almost parallel. This means that the CB classification algorithm is not sensitive to the
amount of training data; in other words, it is also robust to the amount of training data.
-
Through FIGURE 1-6, we checked each algorithm's general performance and characteristics. Additionally, we compare both
algorithms in the same categories: FIGURE 7/8/9 compare the three performance measures on training data, FIGURE 10/11/12
on testing data, and FIGURE 13/14/15 on both data. As in TABLE 1/2/3, we can see that KNN has better accuracy.
Interestingly, the shapes of the learning curves (LCs) of the two algorithms on testing and training data are very
similar except for some minor differences. The lines on testing and training data are close and nearly parallel to each
other. This means that both algorithms have similar precision and recall performance and that both are robust to the
size of the training data, even though KNN is not more robust than CB.
-
In fact, it is not an unexpected result that the KNN classification algorithm is robust to the size of the training data.
In our experiments (not shown on this webpage) using the KNN and CB classification algorithms on the vector space model
(VSM) of text document data, the CB classification algorithm clearly outperformed the KNN classification algorithm in
the three classification measures as well as in the learning curves. So we guess that the present result is due to the
hidden (i.e., implicit) characteristics of gene expression data and to the different normalization. For example, in VSM,
we simply use the frequency of each word in each document; based on this frequency, we weight each word both locally
and globally and normalize each document vector using the 2-norm. So the possible values in the matrix range from 0 to 1.
In gene expression data, however, we treat missing values as 0.0 and normalize each gene vector using the 2-norm, so the
entries can take both positive and negative values.
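
The difference in value ranges can be seen in a small sketch of the 2-norm normalization. It assumes that a missing value is represented by `None` (a hypothetical encoding) and replaces it with 0.0 before normalizing; the example vectors are invented.

```python
import math


def l2_normalize(vector):
    """Replace missing entries (None) with 0.0, then scale to unit 2-norm."""
    filled = [0.0 if x is None else float(x) for x in vector]
    norm = math.sqrt(sum(x * x for x in filled))
    if norm == 0.0:
        return filled  # an all-zero vector cannot be scaled
    return [x / norm for x in filled]


# Word-frequency style vector: entries are non-negative, so the result lies in [0, 1].
doc_vector = l2_normalize([1, 2, 2])

# Expression style vector with a missing value: the sign of each entry is preserved.
gene_vector = l2_normalize([3.0, None, -4.0])

print(doc_vector)
print(gene_vector)
```

Because normalization only rescales a vector, negative expression values remain negative, so cosines between gene vectors can range from -1 to 1 rather than from 0 to 1.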
-
In summary, the classification performance is so similar that it is not easy to say which algorithm is better. However,
this result leads to some further interesting considerations. From the point of view of computational time and
scalability, CB classification clearly outperforms the KNN classification algorithm because it only computes the angles
(i.e., cosines) between a testing gene and a small number of concept vectors. Remember that in the KNN
classification algorithm, we compute the cosines between a testing gene and all training genes at every iteration and
need additional computation to find the k nearest neighbors. So, with CB, we can expect similar (or sometimes better)
classification performance in a shorter time. Additionally, a concept vector is a normalized centroid vector, so it is a
representative vector for its class (or category). It is therefore not sensitive to erroneous or missing values,
because it results from summing and then normalizing many vectors. In other words, a few erroneous or missing values
do not change the values in a centroid vector much. This can also explain the results shown in the TABLEs above.
As mentioned before, even though we use different samples of training data at each run, the CB classification
algorithm gives nearly constant performance because the concept vector does not change much from run to run.
Furthermore, in the KNN classification algorithm, the performance is directly related to the distance between the
neighbors and a testing gene. So with a bad selection of training data, we can get a bad classification result.
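
The cost argument can be illustrated with a minimal sketch of concept-vector (CB) classification. It assumes that each training vector is 2-norm normalized before the class vectors are summed, and that a testing gene is assigned to the class whose concept vector has the largest cosine with it; the function names and toy data are hypothetical and not taken from the experiments.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit 2-norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]


def concept_vector(vectors):
    """Normalized centroid: sum the class's vectors, then normalize the sum."""
    dim = len(vectors[0])
    summed = [sum(vec[i] for vec in vectors) for i in range(dim)]
    return l2_normalize(summed)


def build_concept_vectors(train_vectors, train_labels):
    """Return one concept vector per class label."""
    by_class = {}
    for vec, label in zip(train_vectors, train_labels):
        by_class.setdefault(label, []).append(l2_normalize(vec))
    return {label: concept_vector(vecs) for label, vecs in by_class.items()}


def cb_predict(concepts, query):
    """Assign the query to the class with the largest cosine to its concept vector."""
    q = l2_normalize(query)
    # Both vectors have unit length, so the dot product equals the cosine.
    return max(concepts, key=lambda label: sum(a * b for a, b in zip(concepts[label], q)))


# Toy data, invented for illustration only.
train_vectors = [[1.0, 0.2], [0.8, -0.1], [-0.2, 1.0], [0.1, 0.9]]
train_labels = ["A", "A", "B", "B"]

concepts = build_concept_vectors(train_vectors, train_labels)
predicted = cb_predict(concepts, [0.9, 0.1])
print(predicted)
```

Here a prediction needs one cosine per class (two in this toy example) instead of one per training gene, and each concept vector averages over all genes of its class, so a single unusual training gene shifts it only slightly.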
|