2005-09-09, 01:29  #1 
Jul 2004
Potsdam, Germany
831_{10} Posts 
Special cluster analysis needed?
Hi there!
First of all, I'm aware that the following question is at least unusual here. If you know of a better place to ask, I'd be happy to know.

Here is my problem: I have 7 sets of 8 vectors each. I know that in most cases, each vector of a set is "similar" to a vector of each other set (for lack of a better name known to me, let's call it a "class"; hence, I have 8 classes). Now, I want to put these vectors together (in order to get a mean value for each class).

Some kind of k-means cluster analysis seems to be a good choice. But so far, I haven't found a solution such that in each cluster, there is exactly one vector of each set. Is there a variant that does exactly this? Or is there an alternative approach?

I hesitate to use one set as a reference system and match the vectors of each other set against it, because I'm quite sure the results would vary depending on the set I choose as reference system.

Note: A solution that is easily applicable using the computer would be optimal.

Thanks for your help!

Last fiddled with by Mystwalker on 2005-09-09 at 01:30 
2005-09-09, 14:52  #2  
Nov 2003
2^{6}·113 Posts 
This *might* be an interesting problem, but the problem is not well posed.

You start with 8 equivalence classes of vectors. You do not state an equivalence criterion (or criteria). You need to define one. What does it MEAN to be "similar"? Do you mean, for example, "nearly parallel"? Or do you mean "nearly the same norm"? Or some other criterion?

We are given 7 sets of 8 vectors each. Are the elements of each set guaranteed to be in different equivalence classes?

You say you want to "put these vectors together". But you do not say what this means. Nor is it clear which vectors you refer to with the word "these". Nor do you give a definition of "mean value" for a set of vectors.

My interpretation of your problem is that for all 56 vectors you want to identify which equivalence class each one belongs to. Then for each class you want to compute some kind of "average" for the 7 vectors in that class. Is this correct? Please clarify. 

2005-09-10, 00:13  #3 
Jul 2004
Potsdam, Germany
3×277 Posts 
Thanks for your reply.
Rereading my posting, I have to admit that it is too sloppy to be understandable. Sorry. I think I can explain it best by giving you example data. This is what a single data set looks like:

Code:
            F1     F2     F3     F4     F5     F6     F7     F8
NOO      0.957  0.080  0.030  0.111  0.003  0.078  0.178  0.008
NOSTCM   0.001  0.959  0.013  0.115  0.009  0.039  0.005  0.072
MNOP     0.304  0.058  0.016  0.085  0.001  0.056  0.942  0.003
MSOO     0.031  0.857  0.009  0.160  0.008  0.010  0.028  0.326
NONSA    0.953  0.072  0.085  0.016  0.003  0.041  0.152  0.016
NOA      0.610  0.053  0.780  0.015  0.001  0.025  0.071  0.011
NOSA     0.065  0.005  0.993  0.006  0.002  0.004  0.046  0.000
NOST     0.349  0.830  0.061  0.341  0.002  0.027  0.066  0.007
MNOL     0.008  0.575  0.007  0.002  0.009  0.032  0.004  0.803
WMC      0.956  0.079  0.029  0.127  0.006  0.082  0.176  0.000
DIT      0.085  0.026  0.007  0.070  0.010  0.990  0.049  0.017
NOC      0.007  0.010  0.001  0.002  1.000  0.010  0.000  0.005
CBO      0.029  0.261  0.017  0.934  0.007  0.121  0.057  0.083
RFC      0.350  0.528  0.002  0.677  0.010  0.087  0.084  0.199
LCOM     0.943  0.028  0.031  0.018  0.000  0.041  0.048  0.004

Here, the columns F1 to F8 are the vectors. (They come from a principal component analysis with subsequent varimax rotation, by the way.) I have six more such data sets.

As "similar" or "resemble" surely are badly defined (I apologize), I'll try to give a criterion: The lower the (Euclidean) distance between the endpoints of two vectors (both starting at the same origin), the more they are alike.

I have already observed that almost all vectors of the other data sets have a small distance to one vector of this data set (although the order changes, e.g. a vector of another project may be "near" a differently numbered vector of this project, and so on).

What I need is a method to group together the "similar" vectors (which seems to be a job for a cluster analysis), with the constraint that in every cluster, there is exactly one vector of each data set.

Another possibility is to take one data set as a reference and to try to match the vectors of each other data set to the vectors of this one (recording, for each vector, the reference vector to which it has the smallest distance). But depending on the data set I take as reference, I might get slightly different mappings.

Concerning the averaging, you're correct. 
I plan to calculate the elementwise mean values in order to get an "average" vector. AFAIK, this ensures that the sum of distances (as defined above) to the vectors in that class/cluster is minimal. Is that correct? Cluster analysis also seems to take this approach to get the centers of the clusters.

I hope the problem has become clearer. If not, I'm completely willing to provide more information.

I'd also be happy if you could give me a tip on where to read more about geometry, because I realized that I have quite some trouble expressing my problem due to the lack of fitting (English) terms. 
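
One way the constraint described above could be approached is a k-means-like loop in which the assignment step is solved per data set as a one-to-one matching instead of a nearest-center lookup. The following is only a minimal sketch under assumptions, not an established method from this thread: the function names (`mean_vector`, `best_assignment`, `match_sets`) are hypothetical, each vector is assumed to be given as a plain list of numbers (one F column), and the first data set is used solely as the starting guess for the cluster centers, which are then replaced by the means over all sets.

```python
from itertools import permutations
from math import dist  # Euclidean distance, Python 3.8+


def mean_vector(vectors):
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def best_assignment(vectors, centers):
    """Return perm such that vectors[perm[c]] goes to cluster c.

    Every vector of the set is used exactly once. The permutation with the
    smallest summed squared distance is found by brute force, which is
    feasible for 8 vectors per set (8! = 40320 candidates).
    """
    k = len(centers)
    cost = [[dist(v, c) ** 2 for c in centers] for v in vectors]
    return min(
        permutations(range(k)),
        key=lambda perm: sum(cost[perm[c]][c] for c in range(k)),
    )


def match_sets(data_sets, max_iter=100):
    """Group vectors so that each cluster holds one vector per data set.

    data_sets: list of data sets, each a list of k vectors.
    Returns (assignments, centers), where assignments[s][c] is the index
    of the vector from data set s that was placed in cluster c.
    """
    centers = [list(v) for v in data_sets[0]]  # starting guess only (assumption)
    assignments = None
    for _ in range(max_iter):
        new_assignments = [best_assignment(vs, centers) for vs in data_sets]
        if new_assignments == assignments:
            break  # no vector changed cluster, so the centers are stable
        assignments = new_assignments
        centers = [
            mean_vector([vs[perm[c]] for vs, perm in zip(data_sets, assignments)])
            for c in range(len(centers))
        ]
    return assignments, centers
```

Because the centers are recomputed from all data sets after each pass, the result should depend less on the starting set than a pure reference matching would, although a loop of this kind can still stop in a local optimum, so trying each data set as the starting guess and comparing the total cost seems prudent. For more vectors per set, the brute-force search could be swapped for an assignment solver such as `scipy.optimize.linear_sum_assignment`. On the averaging question: the element-wise mean minimizes the sum of squared Euclidean distances to the vectors of a cluster; the point minimizing the sum of plain distances is the geometric median, which is generally a different point.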