mersenneforum.org Special cluster analysis needed?

2005-09-09, 01:29   #1
Mystwalker

Jul 2004
Potsdam, Germany

831₁₀ Posts

Special cluster analysis needed?

Hi there!

First of all, I'm aware that the following question is at least unusual here. If you know of a better place to ask, I'd be happy to hear about it.

Here is my problem:
I have 7 sets of 8 vectors each. I know that in most cases, each vector of a set is "similar" to a vector of each other set (for lack of a better name known to me, let's call it a "class"; hence, I have 8 classes).
Now, I want to put these vectors together (in order to get a mean value for each class).

Some kind of k-means cluster analysis seems to be a good choice. But so far, I haven't found a solution such that each cluster contains exactly one vector of each set. Is there a variant that does exactly this?
Or is there an alternative approach? I hesitate to use one set as a reference system and match the vectors of each other set against it, because I'm quite sure the results would vary depending on the set I choose as the reference.

Note: A solution that can easily be applied on a computer would be ideal.

Thanks for your help!

Last fiddled with by Mystwalker on 2005-09-09 at 01:30
2005-09-09, 14:52   #2
R.D. Silverman

Nov 2003

2⁶·113 Posts


This *might* be an interesting problem, but the problem is not well posed.

You start with 8 equivalence classes of vectors. You do not state an
equivalence criterion (or criteria). You need to define one. What does
it MEAN to be "similar"? Do you mean, for example, "nearly parallel"?
Or do you mean "nearly the same norm"? Or some other criterion?

We are given 7 sets of 8 vectors each. Are the elements of each set
guaranteed to be in different equivalence classes?

You say you want to "put these vectors together". But you do not say what
this means. Nor is it clear which vectors you refer to with the word "these".
Nor do you give a definition of "mean value" for a set of vectors.

My interpretation of your problem is that for all 56 vectors you want to
identify which equivalence class each one belongs to. Then for each
class you want to compute some kind of "average" for the 7 vectors
in that class. Is this correct?
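
To make the point about the similarity criterion concrete, here is a minimal sketch showing that "nearly parallel", "nearly the same norm", and "small Euclidean distance" can rank the same vectors differently. The vectors `u`, `v`, `w` and the helper names are hypothetical and chosen only for illustration; they are not taken from the data discussed in this thread.

```python
from math import acos, degrees, dist, hypot


def norm(v):
    """Euclidean length of a vector."""
    return hypot(*v)


def angle_deg(a, b):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    cos_ab = dot / (norm(a) * norm(b))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return degrees(acos(max(-1.0, min(1.0, cos_ab))))


# Hypothetical example vectors.
u = [1.0, 0.0]
v = [3.0, 0.0]  # parallel to u, but three times as long
w = [0.0, 1.0]  # same norm as u, but perpendicular to it

for name, other in (("v", v), ("w", w)):
    print(
        name,
        "angle:", angle_deg(u, other),
        "norm difference:", abs(norm(u) - norm(other)),
        "distance:", dist(u, other),
    )
```

By angle, `v` is the closer match to `u`; by norm and by Euclidean distance, `w` is. Which vectors end up in the same class therefore depends on the criterion chosen.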

2005-09-10, 00:13   #3
Mystwalker

Jul 2004
Potsdam, Germany

3×277 Posts

Thanks for your reply.

Rereading my post, I have to admit that it is too sloppy to be understandable. Sorry.

I think I can explain it best by giving you example data. This is what a single data set looks like:

Code:
            F1      F2      F3      F4      F5      F6      F7      F8
NOO      0.957   0.080   0.030   0.111  -0.003  -0.078   0.178  -0.008
NOSTCM  -0.001   0.959   0.013   0.115  -0.009   0.039   0.005   0.072
MNOP     0.304   0.058  -0.016   0.085  -0.001  -0.056   0.942   0.003
MSOO    -0.031   0.857  -0.009   0.160  -0.008  -0.010   0.028   0.326
NONSA    0.953   0.072   0.085  -0.016  -0.003  -0.041   0.152  -0.016
NOA      0.610   0.053   0.780  -0.015  -0.001  -0.025   0.071  -0.011
NOSA    -0.065   0.005   0.993  -0.006   0.002   0.004  -0.046   0.000
NOST     0.349   0.830   0.061   0.341   0.002   0.027   0.066   0.007
MNOL    -0.008   0.575  -0.007  -0.002   0.009  -0.032   0.004   0.803
WMC      0.956   0.079   0.029   0.127  -0.006  -0.082   0.176   0.000
DIT     -0.085   0.026  -0.007   0.070  -0.010   0.990  -0.049  -0.017
NOC     -0.007  -0.010   0.001   0.002   1.000  -0.010   0.000   0.005
CBO     -0.029   0.261  -0.017   0.934   0.007   0.121   0.057   0.083
RFC      0.350   0.528  -0.002   0.677  -0.010  -0.087   0.084  -0.199
LCOM     0.943   0.028   0.031  -0.018   0.000   0.041  -0.048  -0.004

Here, the columns $F_i$ are the vectors. (They come from a principal component analysis with subsequent varimax rotation, by the way.)
I have six more such data sets.

As "similar" or "resemble" are surely badly defined (I apologize), I'll try to give a criterion:
The lower the (Euclidean) distance between the endpoints of vectors $F_x$ and $F_y$ (both starting at the same origin), the more alike they are.

I have already observed that almost all vectors of the other data sets have a small distance to one of this data set's vectors (although the order changes, e.g. $F_4$ of another project is "near" this project's $F_2$, and so on).

What I need is a method to group together the "similar" vectors (which seems to be a job for a cluster analysis), with the constraint that every cluster contains exactly one vector of each data set.

Another possibility is to take one data set as a reference and to try to match the vectors of each other data set to the vectors of this one ("$F_4$ has the smallest distance to $F_{2ref}$, $F_1$ to $F_{8ref}$, ..."). But depending on the data set I take as the reference, I might get slightly different mappings.

Concerning the averaging, you're correct. I plan to calculate the element-wise mean values in order to get an "average" vector. AFAIK, this ensures that the sum of distances (as defined above) to the vectors in that class/cluster is minimal. Is that correct? Cluster analysis also seems to take this approach to get the centers of the clusters.

I hope the problem has become clearer. If not, I'm happy to provide more information.
I'd also be happy if you could give me a tip on where to read more about geometry, because I realized that I have quite some trouble expressing my problem due to the lack of fitting (English) terms.
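
One way the constraint described above could be approached is a k-means-style loop in which the usual "nearest centroid" step is replaced by a one-to-one assignment per data set. The following is only a sketch under stated assumptions, not a method proposed in this thread: the function names are hypothetical, the first data set is used solely as the starting guess for the centroids, and the assignment is found by brute force over all permutations (8! = 40320 per data set, which is feasible for 8 vectors but would not scale much further).

```python
from itertools import permutations
from math import dist


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def best_assignment(vectors, centroids):
    """Return a permutation p such that vectors[p[j]] goes to cluster j,
    minimising the summed squared Euclidean distance (brute force)."""
    k = len(centroids)
    cost = [[dist(v, c) ** 2 for c in centroids] for v in vectors]
    return min(
        permutations(range(k)),
        key=lambda p: sum(cost[p[j]][j] for j in range(k)),
    )


def constrained_clusters(data_sets, max_iter=50):
    """data_sets: list of data sets, each a list of k vectors.
    Returns (assignments, centroids), where assignments[s][j] is the index
    of the vector of data set s that was placed in cluster j."""
    # Assumption: the first data set only provides the starting centroids.
    centroids = [list(v) for v in data_sets[0]]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [best_assignment(vs, centroids) for vs in data_sets]
        if new_assignments == assignments:
            break  # nothing changed, so the centroids would not change either
        assignments = new_assignments
        # Re-estimate every centroid from one vector per data set.
        centroids = [
            mean_vector([vs[p[j]] for vs, p in zip(data_sets, assignments)])
            for j in range(len(centroids))
        ]
    return assignments, centroids
```

Because the centroids are re-estimated from all data sets after each pass, the first data set acts as an initial guess rather than a fixed reference. The result can still depend on that starting point, so running the loop once per data set as the start and keeping the run with the lowest total cost may be worthwhile. On the averaging question: the element-wise mean minimises the sum of squared Euclidean distances to the vectors of a cluster; the sum of the plain (unsquared) distances is minimised by the geometric median, which is in general a different point.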


All times are UTC.