
Data Analysis

The first important step of a data mining process is to analyze the dataset and discover its statistical characteristics. This analysis helps us select a method that both reduces the computational cost and achieves acceptable performance. This section presents some characteristics of the Netflix subset and discusses potential approaches suggested by the analysis.
The first two charts, Figures 1 and 2, are histograms of the average ratings by movie and by user, respectively. Both charts show that the ratings are biased toward high values: the mean in each graph is around 3.6, and most of the averages fall between 2 and 5.

Figure 1: The Average Rating Distribution By Movie
\begin{figure}
\begin{center}
\epsfig{figure=graph/AveRatDistByMov.eps,width=3.1in}
\end{center}
\end{figure}

Figure 2: The Average Rating Distribution By User
\begin{figure}
\begin{center}
\epsfig{figure=graph/AveRatDistByUser.eps,width=3.1in}
\end{center}
\end{figure}

Figure 3: The Average Number of Ratings for Each Movie
\begin{figure}
\begin{center}
\epsfig{figure=graph/aveMovRatByMovCount.eps,width=3.1in}
\end{center}
\end{figure}

Figure 3 is a curve computed by dividing the movies into bins according to their average ratings and then, for each bin, computing the average number of ratings per movie. In other words, the chart shows how many ratings a movie receives given its average rating. From this graph we can infer that highly rated movies are rated more often than low-rated movies. At both extremes the number of ratings is very low, presumably because people simply ignore the worst movies and think it unnecessary to rate the best ones, whose standing is already obvious.

Figure 4: The Average Number of Ratings for Each User
\begin{figure}
\begin{center}
\epsfig{figure=graph/aveMovRatByUserCount.eps,width=3.1in}
\end{center}
\end{figure}
Figure 4 presents the number of ratings per user given the average rating of that user. It is computed by dividing users into bins based on their average ratings and then computing, for each bin, the average number of ratings. The chart suggests that users with few ratings usually give negative votes: they rate only when a movie is very disappointing. On the other hand, users with many ratings show more variety in their ratings, and their average scores usually lie between 2.5 and 5. They are true movie fans. This also means that the rating profiles of users with high average ratings give us more information and should be exploited by prediction methods.
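
The binning behind Figures 3 and 4 can be sketched as follows. This is a minimal illustration, not the code used to produce the charts: the function name `mean_count_per_bin`, the `(user, movie, rating)` triple layout, and the bin width of 0.5 are all assumptions made for the example.

```python
from collections import defaultdict


def mean_count_per_bin(ratings, key_index, bin_width=0.5):
    """Average number of ratings per movie (or user), grouped by average rating.

    ratings   -- iterable of (user, movie, rating) triples (assumed layout)
    key_index -- 0 to group by user (Figure 4), 1 to group by movie (Figure 3)
    bin_width -- width of each average-rating bin (assumed value)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in ratings:
        key = row[key_index]
        totals[key] += row[2]
        counts[key] += 1

    # Place each movie/user into a bin according to its average rating.
    bins = defaultdict(list)
    for key, n in counts.items():
        average = totals[key] / n
        bins[int(average // bin_width)].append(n)

    # Map the lower edge of each bin to the mean rating count inside it.
    return {index * bin_width: sum(ns) / len(ns)
            for index, ns in sorted(bins.items())}


sample = [(1, "A", 5), (2, "A", 4), (3, "A", 5),
          (1, "B", 1), (2, "C", 4), (3, "C", 5)]
by_movie = mean_count_per_bin(sample, key_index=1)
by_user = mean_count_per_bin(sample, key_index=0)
```

Each key of the returned dictionary is the lower edge of a bin, so plotting keys against values gives a curve of the same kind as the ones in the figures.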

Figure 5: The Average Standard Deviation for Each Movie Given the Average Rating
\begin{figure}
\begin{center}
\epsfig{figure=graph/StdDevByMovAve.eps,width=3.1in}
\end{center}
\end{figure}
Figure 5 presents the average standard deviation of the ratings for each movie given the average rating of that movie. From this graph, we learn that the ratings for the worst and the best movies are quite stable. This suggests the following idea for rating prediction: for movies with the lowest and highest average ratings we can use the average rating as the predicted value, and for movies with mid-range average ratings we can apply more complex prediction methods.
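
A rough sketch of this idea is given below. The thresholds `low` and `high`, the function names, and the `fallback` callable (standing in for whatever more complex predictor is chosen) are hypothetical; the text does not fix any of them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def movie_stats(ratings):
    """Return {movie: (average rating, standard deviation)}.

    ratings -- iterable of (user, movie, rating) triples (assumed layout)
    """
    by_movie = defaultdict(list)
    for _user, movie, rating in ratings:
        by_movie[movie].append(rating)
    return {m: (mean(rs), pstdev(rs)) for m, rs in by_movie.items()}


def predict(movie, stats, fallback, low=2.0, high=4.5):
    """Use the movie average at the extremes, a complex method in between.

    low, high -- hypothetical cut-offs for "worst" and "best" movies
    fallback  -- callable taking a movie id; placeholder for a complex predictor
    """
    average, _std = stats[movie]
    if average <= low or average >= high:
        return average
    return fallback(movie)


sample = [(1, "A", 5), (2, "A", 5),
          (1, "B", 3), (2, "B", 4), (3, "B", 2)]
stats = movie_stats(sample)
extreme = predict("A", stats, fallback=lambda m: -1.0)
middle = predict("B", stats, fallback=lambda m: -1.0)
```

In practice the cut-offs would be read off a chart such as Figure 5, choosing the average-rating ranges in which the standard deviation is small enough for the mean to be a reliable prediction.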

Figure 6: The Sparseness of the Rating Matrix of the First 100 Users x 120 Movies
\begin{figure}
\begin{center}
\epsfig{figure=graph/SparseMatrix.eps,width=3.1in}
\end{center}
\end{figure}
Figure 6, which covers 1% of the rating subset, shows the sparseness of the rating matrix. Only 1.71% of the entries are rated, which is a real challenge for any missing value prediction method.
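
The percentage of filled entries can be computed as in the sketch below. The function name `density` and the triple layout are assumptions for illustration; note that when the matrix dimensions are not passed in, users and movies without any rating are not counted, so the result may overestimate the true density.

```python
def density(ratings, n_users=None, n_movies=None):
    """Fraction of (user, movie) cells of the rating matrix that hold a rating.

    ratings  -- iterable of (user, movie, rating) triples (assumed layout)
    n_users  -- number of rows; defaults to the distinct users seen
    n_movies -- number of columns; defaults to the distinct movies seen
    """
    pairs = {(user, movie) for user, movie, _rating in ratings}
    if n_users is None:
        n_users = len({user for user, _movie in pairs})
    if n_movies is None:
        n_movies = len({movie for _user, movie in pairs})
    if n_users == 0 or n_movies == 0:
        raise ValueError("rating matrix has no rows or no columns")
    return len(pairs) / (n_users * n_movies)


sample = [(1, "A", 5), (2, "A", 4), (1, "C", 3)]
observed = density(sample)                        # 2 users x 2 movies seen
full = density(sample, n_users=2, n_movies=3)     # explicit matrix size
```

Multiplying the result by 100 gives a percentage comparable to the figure quoted above.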


Tuyen Huynh 2007-05-09