Metric Evaluation in Self-Organizing Maps

By Leif Johnson with Dana Ballard, December 2008

Self-Organizing Maps (SOMs) are unsupervised learning algorithms that map a set of r-dimensional data points onto another set of r-dimensional prototype points. The prototype points are arranged into a fixed, s-dimensional topology. The value of r is usually large, while s usually a small value like 2, which makes SOMs ideal tools for data visualization, clustering, and related low-dimensional tasks. Additionally, because of this unsupervised topological mapping, SOMs are useful tools for studying other topological mapping processes like cortical self-organization.

To store a new data point p in a SOM, one calculates the distance from p to each of the prototype points in the map using some metric φ: Rr × RrR, choosing the prototype with the smallest distance as the "winner" w. The map moves w towards p by some small proportion η of the distance between w and p, but it also moves all topological neighbors of w towards p by some small amount that decreases with the topological distance from w. As such, neighborhoods tend form in the map around prototype points that frequently win similar (in metric distance) data points.

The distance metric used to perform the mapping has a significant impact on this neighborhood effect, since it determines the choice of w. In this short study, I investigated a subspace of frequently-used distance metrics to determine how these metrics impacted the formation of neighborhoods and larger-scale features in SOMs. The remainder of this web page describes the dataset and the initialization technique that I used for my experiments, the metrics and experimental method that I used, some general observations drawn from these experiments, and some conclusions based on these observations.

Dataset

SOMs that are too small do not generally produce results with enough sophistication to be interesting, while maps that are too large are generally difficult to analyze. With this in mind, I constructed a 4-dimensional dataset representing line segments (note: not entire lines) in the first quadrant of the Cartesian plane. A line segment is determined by an angle θ from the x-axis and a distance ρ from the origin along that angle, plus the Cartesian x- and y-coordinates of the center of the segment in question along the line. This representation does not include the length of each line segment, but it can be assumed to be some small constant value.

To generate the dataset, I sampled ρ and θ uniformly so that they varied from 0 to 1 (here a theta value of 1 mapped onto an angle of π/2 radians), and then for each ρ and θ I sampled all line segments in the first quadrant of the Cartesian plane, by starting at coordinates (ρ cos θ, ρ sin θ) and moving along the line in question in chunks of length δ. For these experiments, I used 11 samples along each of ρ and θ, and δ was set to 0.1. All components of all vectors were limited to [0, 1]. The initial dataset contained 1031 4-dimensional points.

SOMs generally require each point in a dataset to be presented more than once to arrive at convergence. To convert this initial dataset into a training dataset, I sampled uniformly from the original dataset, with replacement, until the training dataset contained ten times more points than the original set. This randomly constructed but fixed training dataset avoids a source of randomness while performing experiments, since all randomness involved in constructing the training dataset is constant for each experiment.

Initialization

SOMs are normally initialized so that their prototype points contain small random values, but I wanted to avoid randomness in these experiments, and I also wanted to force learning to proceed in a known direction for a given data point presentation. With this in mind, and because for my experiments r = 4, I assigned each dimension of the prototype space to have a maximum in one corner of the map. Dimension 1, say, was initialized to be 1 at the upper-left, say, corner of the map, decreasing linearly to 0 at the diagonally opposite corner of the map. Each dimension in the map, then, had a unique maximum and a minimum corner in all experiments.

For all experiments I used square SOMs containing 2500 4-dimensional prototype points (50x50).

Metrics and method

For these experiments I explored the family of weighted Euclidean metrics with weight vector σ:

φσ(a, b) = sqrt Σ σi (bi - ai)2

I systematically varied σ so that it took on all values in {0, 1, 2, 3}4. For each value of σ I trained a single SOM, initialized as described above, using a single pass through the entire training dataset. After training I recorded the values of the prototype points in each map as a series of images:

I named each image based on the weight metric that created it (e.g., 1123 or 1220). Hover over any image on this page to see the weight vector that produced that image.

Observations

I collected several of the resulting map images and attempted to arrange them in ways that demonstrate trends in map layouts as σ changed.

Pairwise comparison

Because the training dataset had 4 dimensions, a simple linear progression of images is insufficient for exploring the weight space for these Euclidean metrics. Instead, I arranged the images for several maps into a nested grid, shown below, that reveals the progression of maps as a pair of weights in σ change. The upper-right triangle of the grid shows maps trained using a metric vector consisting of at least two 0s, while the bottom-left of the grid shows those metrics containing at least two 3s. Within each block, only two of the weights change, as indicated by the large numbers along the diagonal; i.e., in the second column and the second row, only the weight associated with the second dimension changes.

1
2
3
4

From this table I observed several trends as σ varied systematically through weight space. First, the maps trained with a weight of zero in any dimension tended to differ significantly from those maps where the weight along that dimension was nonzero. Second, maps trained with nonzero weights in dimension 1 (corresponding to ρ) or dimension 2 (corresponding to θ) tended to form "chunkier" clusters than maps trained with weight vectors having nonzero values along the third or fourth dimension (x and y, respectively). Finally, map topologies tended to be more visually different the smaller the weight vectors, especially if, as just described, any of the weights are zero.

Combinatorics

I also wanted to see what happened to maps when more than just a pair of the weights changed at a time. In this set of visualizations, all weights are either 3 or a fixed value in {0, 1, 2}. There is one group of images for each fixed value, and each group consists of a series of 1, 4, 6, 4, and then 1 map images. In the top row of each group, all weights are the initial value, and in the bottom row, all weights are 3. In each row from the top to the bottom, the number of 3s in the weight vector increases by 1.

0 1 2

Once again I noticed that weight vectors with more zeros often showed more gross structure than the more "cloudy" maps trained with all-nonzero weight vectors. In particular, in the second row of the 0 group, the images show clearly which dimension contains the nonzero weight, while the other dimensions are more or less noise. Interestingly, again, in the maps where the nonzero value is in dimension 1 or 2, there is a "chunkier" overall structure to the map.

Progression

Intrigued by the apparent dichotomy between zero and nonzero weights, I ran a small second experiment to identify, if possible, where a shift occurrs between the gross structure of the 1000 map and the cloudiness of the 1111 map. In this experiment, I started with a 1111 (20111) weight vector and repeatedly doubled the weight along the first dimension, giving 2111 (21111), 4111 (22111), and so on. Effectively, I was trying to explore what happens when the relative weights of all but one dimension approach zero. At the end of the progression is the 1000 map for comparison.





...

In particular, I was looking here for the appearance of the "chunks" in the map as the strength of the last three weights approached zero. Although some of the later maps start to look like they might be showing some of this chunkiness, the difference between 1000 and 217111 is still significant. The first few doublings reveal strong bias towards the first dimension, but instead of chunks the map seems to form stripe-shaped regions along the first dimension, which bleed into the other dimensions.

Larger maps

I trained a series of larger maps using the 1100 and 1111 weight vectors. These maps show the behavior on the same dataset as you increase the available neurons far beyond the cardinality of the dataset itself.

1100

In the 1100 case, the largest map shows some interesting behavior: because the ρ-θ pairs are few in number and are stored in unique locations in the map (thanks to the static map initialization described above), the x-y values associated with them get "dragged" into the circular areas associated with the two weighted components of each vector. The map clearly shows which ρ-θ values are more often associated with large (or small) x and y values.

1111

Conclusions

Even though I only investigated a single family of distance metrics and did not vary other SOM parameters during these experiments, the gross structure of the resulting maps is quite interesting.

The most striking conclusion I have drawn from these experiments is that, in the space of weighted Euclidean distance metrics, a zero weight is significantly different from a nonzero weight. Even numerically small weights yield significantly different map structures than zero weights.

Second, the chunkiness in the map seems to be due to the number of distinct values to be stored in the map. For instance, the map produced with weight vector 1000 displays discrete regions, while the map produced with weight vector 0001 displays more of a cloudy overall structure. This is because the training dataset contains only 10 distinct values along the first dimension, but hundreds of distinct values along the fourth dimension. Increasing the size of the map, relative to the number of distinct values to be stored, is probably the only way to ensure a map with distinct regions.

Finally, because this task was invented with an eye toward later simulation of motor cortex self-organization, I want to speculate a little on the application of this zero-weight effect in that context. These experiments used a fixed weight vector throughout the life of each SOM. It could be interesting to interpret weight vectors in some sort of neural connectivity standpoint, and imagine what happens to these vectors as, e.g., neurons die and brain connectivity is reduced. Perhaps maps that train partially and then have some zeros introduced into their weight vectors would display interesting behavior.