Brief overview of this project
We start our project with the following questions:
Brief overview of basic classification algorithms
1. k-Nearest Neighbor(KNN) Algorithm
2. Naive Bayesian(NB) Algorithm
3. Concept Vector-based(CB) Algorithm
Brief overview of classification algorithms that are implemented and tried in this project.
4. Singular Value Decomposition(SVD)-based Algorithm
5. Variation of Naive Bayesian Algorithm
6. Variations of KNN algorithm
7. Hierarchical algorithms
The idea is to exploit the hierarchical structure of the data set in a top-down manner. By using a
known (or artificial) hierarchy, the classification problem can be decomposed into a set of smaller
problems, one for each split in the tree. We first classify at the top level of the tree to
distinguish the top-level classes. We then repeatedly consider only the child (or children) of the
node that was selected at the previous level. To implement a hierarchical classification algorithm,
we need to preprocess the given hierarchy and use additional data structures to keep track of the
hierarchical information. Details of these data structures are given in the Data Description section.
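The top-down procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: the `Node` class, the helper names, the toy hierarchy, and the choice of a cosine-similarity centroid classifier at each split are all hypothetical. Documents are represented as sparse term-weight dictionaries.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Sum of sparse vectors (scale does not matter for cosine similarity)."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return dict(total)


class Node:
    """One node of the class hierarchy (hypothetical structure)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.centroid = {}


def leaves(node):
    """Names of all leaf classes below (or at) this node."""
    if not node.children:
        return [node.name]
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def train(node, docs_by_class):
    """Give every node the centroid of all training documents beneath it."""
    vectors = [v for label in leaves(node) for v in docs_by_class.get(label, [])]
    node.centroid = centroid(vectors)
    for child in node.children:
        train(child, docs_by_class)


def classify(root, doc):
    """Start at the top level, then descend only into the selected child."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(doc, c.centroid))
    return node.name


# Toy hierarchy and training data, for illustration only.
root = Node("root", [
    Node("comp", [Node("comp.graphics"), Node("comp.windows.x")]),
    Node("rec", [Node("rec.autos"), Node("rec.sport.hockey")]),
])
docs_by_class = {
    "comp.graphics": [{"image": 2, "pixel": 1}],
    "comp.windows.x": [{"window": 2, "server": 1}],
    "rec.autos": [{"car": 2, "engine": 1}],
    "rec.sport.hockey": [{"hockey": 2, "goal": 1}],
}
train(root, docs_by_class)
print(classify(root, {"car": 1, "engine": 1}))
```

Because each test document is compared only against the children of the node chosen at the previous level, a mistake at the top level cannot be corrected further down; this is a property of the greedy sketch, and other descent strategies are possible.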
8. Combination algorithms
The idea is to reduce the dimensionality of the VSM while keeping the useful information. We first compute concept vectors for the given categories (or for classes produced by a clustering algorithm). Then, using the concept vectors as a projection matrix, we project both the training and the testing data. Finally, we apply the KNN algorithm to the projected VSM, which has reduced dimensionality.
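The three steps above (concept vectors, projection, KNN) can be sketched as follows. The sketch makes several assumptions that are not taken from the project: a concept vector is computed as the normalized centroid of each class, documents are small dense term vectors, KNN uses cosine similarity with majority voting, and all function names and data are hypothetical.

```python
import math
from collections import Counter


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n else list(v)


def concept_vectors(docs, labels):
    """One concept vector per class: the normalized sum of its documents."""
    sums = {}
    for doc, label in zip(docs, labels):
        acc = sums.setdefault(label, [0.0] * len(doc))
        for i, x in enumerate(doc):
            acc[i] += x
    return [unit(sums[label]) for label in sorted(sums)]


def project(doc, concepts):
    """Map a document from term space to concept space (one value per class)."""
    return [dot(doc, c) for c in concepts]


def knn_predict(train_proj, train_labels, query_proj, k):
    """Majority vote among the k most cosine-similar projected training docs."""
    q = unit(query_proj)
    ranked = sorted(
        zip(train_proj, train_labels),
        key=lambda pair: dot(unit(pair[0]), q),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: four terms, two classes (illustration only).
train_docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
train_labels = ["A", "A", "B", "B"]

concepts = concept_vectors(train_docs, train_labels)
train_proj = [project(d, concepts) for d in train_docs]

query = [1, 1, 0, 0]
print(knn_predict(train_proj, train_labels, project(query, concepts), k=3))
```

In this toy case the four-dimensional term space is reduced to two dimensions, one per concept vector, so the neighbor search runs on much shorter vectors than the original documents.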
We briefly describe the following two datasets used in this project and specify the options we use to preprocess them with rainbow or mc.
1. 20 Newsgroup Data Set
2. 4 Universities Data Set
3. Rainbow and MC
Common parameter setting for every algorithm:
1. k-Nearest Neighbor(KNN) Algorithm
2. Naive Bayesian(NB) Algorithm
3. Concept Vector-based(CB) Algorithm
4. Singular Value Decomposition-based Algorithm
5. Hierarchical Algorithm
6. Combination Algorithm
We select six classification algorithms and compare their performance in terms of accuracy, precision, recall, and the shape of their learning curves. However, we show only the accuracy and the learning curves for accuracy below. To obtain the corresponding precision and recall figures, the intermediate results must be kept when running each driver.
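For reference, the three measures named above can be computed from a list of true labels and a list of predicted labels as in the sketch below. The function name `evaluate` and the toy labels are hypothetical, and precision and recall are reported per class; how the project's drivers aggregate them is not specified here.

```python
def evaluate(true_labels, predicted):
    """Return overall accuracy and a dict of per-class (precision, recall)."""
    pairs = list(zip(true_labels, predicted))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    per_class = {}
    for c in sorted(set(true_labels) | set(predicted)):
        tp = sum(t == c and p == c for t, p in pairs)  # correctly assigned to c
        fp = sum(t != c and p == c for t, p in pairs)  # wrongly assigned to c
        fn = sum(t == c and p != c for t, p in pairs)  # missed members of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = (precision, recall)
    return accuracy, per_class


# Toy labels, for illustration only.
true_labels = ["A", "A", "B", "B", "B"]
predicted = ["A", "B", "B", "B", "A"]
accuracy, per_class = evaluate(true_labels, predicted)
print(accuracy, per_class)
```

Repeating this evaluation while increasing the number of training documents gives the points of a learning curve for accuracy.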
How to evaluate classification performance?
1. Accuracy for training/testing/both data
2. Learning Curves(LCs) for training/testing/both data
3. Concept Vector graphs