mersenneforum.org  

Old 2005-09-09, 01:29   #1
Mystwalker

Special cluster analysis needed?

Hi there!

First of all, I'm aware that the following question is at least unusual here. If you know of a better place to ask, I'd be happy to know.


Here is my problem:
I have 7 sets of 8 vectors each. I know that in most cases, each vector of a set is "similar" to a vector of each other set (for lack of a better name, let's call it a "class" - hence, I have 8 classes).

Now, I want to put these vectors together (in order to get a mean value for each class).

Some kind of k-means cluster analysis seems to be a good choice. But so far, I haven't found a solution such that in each cluster, there is exactly one vector of each set.
Is there a variant that does exactly this?

Or is there an alternative approach?
I hesitate to use one set as a reference system and match the vectors of the other sets against it, because I'm quite sure the results would vary depending on which set I choose as the reference system.

Note:
A solution that is easy to apply on a computer would be ideal.

Thanks for your help!

Last fiddled with by Mystwalker on 2005-09-09 at 01:30
Old 2005-09-09, 14:52   #2
R.D. Silverman

Quote:
Originally Posted by Mystwalker
Hi there!

First of all, I'm aware that the following question is at least unusual here. If you know of a better place to ask, I'd be happy to know.


Here is my problem:
I have 7 sets of 8 vectors each. I know that in most cases, each vector of a set is "similar" to a vector of each other set (for lack of a better name, let's call it a "class" - hence, I have 8 classes).

Now, I want to put these vectors together (in order to get a mean value for each class).

Some kind of k-means cluster analysis seems to be a good choice. But so far, I haven't found a solution such that in each cluster, there is exactly one vector of each set.
Is there a variant that does exactly this?

Or is there an alternative approach?
I hesitate to use one set as a reference system and match the vectors of the other sets against it, because I'm quite sure the results would vary depending on which set I choose as the reference system.

Note:
A solution that is easy to apply on a computer would be ideal.

Thanks for your help!

This *might* be an interesting problem, but as stated it is not well posed.

You start with 8 equivalence classes of vectors, yet you do not state an equivalence criterion (or criteria). You need to define one. What does it MEAN to be "similar"? Do you mean, for example, "nearly parallel"? Or do you mean "nearly the same norm"? Or some other criterion?

We are given 7 sets of 8 vectors each. Are the elements of each set guaranteed to be in different equivalence classes?

You say you want to "put these vectors together", but you do not say what this means. Nor is it clear which vectors you refer to with the word "these". Nor do you give a definition of "mean value" for a set of vectors.

My interpretation of your problem is that for each of the 56 vectors you want to identify which equivalence class it belongs to, and then for each class you want to compute some kind of "average" of the 7 vectors in that class. Is this correct?

Please clarify.
Old 2005-09-10, 00:13   #3
Mystwalker

Thanks for your reply.
Rereading my posting, I have to admit that it is too sloppy to be understandable. Sorry.

I think I can explain it best by giving you example data.

This is what a single data set looks like:

Code:
             F1      F2      F3      F4      F5      F6      F7      F8
NOO       0.957   0.080   0.030   0.111  -0.003  -0.078   0.178  -0.008
NOSTCM   -0.001   0.959   0.013   0.115  -0.009   0.039   0.005   0.072
MNOP      0.304   0.058  -0.016   0.085  -0.001  -0.056   0.942   0.003
MSOO     -0.031   0.857  -0.009   0.160  -0.008  -0.010   0.028   0.326
NONSA     0.953   0.072   0.085  -0.016  -0.003  -0.041   0.152  -0.016
NOA       0.610   0.053   0.780  -0.015  -0.001  -0.025   0.071  -0.011
NOSA     -0.065   0.005   0.993  -0.006   0.002   0.004  -0.046   0.000
NOST      0.349   0.830   0.061   0.341   0.002   0.027   0.066   0.007
MNOL     -0.008   0.575  -0.007  -0.002   0.009  -0.032   0.004   0.803
WMC       0.956   0.079   0.029   0.127  -0.006  -0.082   0.176   0.000
DIT      -0.085   0.026  -0.007   0.070  -0.010   0.990  -0.049  -0.017
NOC      -0.007  -0.010   0.001   0.002   1.000  -0.010   0.000   0.005
CBO      -0.029   0.261  -0.017   0.934   0.007   0.121   0.057   0.083
RFC       0.350   0.528  -0.002   0.677  -0.010  -0.087   0.084  -0.199
LCOM      0.943   0.028   0.031  -0.018   0.000   0.041  -0.048  -0.004

Here, the $F_i$'s are the vectors. (They come from a principal component analysis with subsequent varimax rotation, by the way.)

I have six more such data sets. As "similar" and "resemble" are admittedly ill-defined (I apologize), let me give a criterion:
The lower the (Euclidean) distance between the endpoints of vectors $F_x$ and $F_y$ (both starting at the same origin), the more alike they are.
I have already observed that almost all vectors of the other data sets have a small distance to one vector of this data set (although the order changes, e.g. $F_4$ of another project is "near" this project's $F_2$, and so on).
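
To make this concrete, here is a minimal sketch in Python (using NumPy and SciPy; the arrays A and B are made-up stand-ins for two of the data sets, with the vectors F1..F8 as columns) of how the pairwise distances between the vectors of two data sets can be computed:

Code:
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical stand-ins for two 15x8 data sets (one column per vector F1..F8).
# In practice these would hold the values from tables like the one above.
rng = np.random.default_rng(0)
A = rng.normal(size=(15, 8))
perm = rng.permutation(8)
B = A[:, perm] + 0.01 * rng.normal(size=(15, 8))   # A's columns, permuted and slightly perturbed

# D[i, j] = Euclidean distance between column F_{i+1} of A and column F_{j+1} of B.
D = cdist(A.T, B.T)                                # shape (8, 8)
print(np.round(D, 3))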

What I need is a method to group together the "similar" vectors (which seems to be a job for a cluster analysis), with the constraint that in every cluster, there is exactly one vector of each data set.

Another possibility is to take one data set as a reference and try to match the vectors of each other data set to the vectors of this one ("$F_4$ has the smallest distance to $F_{2,\mathrm{ref}}$, $F_1$ to $F_{8,\mathrm{ref}}$, ..."). But depending on which data set I take as the reference, I might get slightly different mappings.
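
For what it's worth, matching the vectors of one data set one-to-one against a reference set so that the total distance is minimal is a linear assignment problem, which the Hungarian algorithm solves exactly; scipy.optimize.linear_sum_assignment does the same job. A minimal sketch under the same assumptions as above (made-up arrays, vectors as columns):

Code:
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_to_reference(ref, other):
    """Pair each column (vector) of other with exactly one column of ref
    so that the total Euclidean distance is minimal (linear sum assignment)."""
    D = cdist(other.T, ref.T)              # D[i, j] = distance between other's F_{i+1} and ref's F_{j+1}
    other_idx, ref_idx = linear_sum_assignment(D)
    return dict(zip(other_idx, ref_idx))   # column index in other -> matched column index in ref

# Hypothetical data: other is ref with permuted, slightly perturbed columns.
rng = np.random.default_rng(1)
ref = rng.normal(size=(15, 8))
perm = rng.permutation(8)
other = ref[:, perm] + 0.01 * rng.normal(size=(15, 8))

print(match_to_reference(ref, other))      # should recover the permutation: {i: perm[i], ...}

The dependence on the chosen reference could presumably be reduced by iterating: match every data set against the current class means instead of a fixed reference set, recompute the means from the matched vectors, and repeat until the assignment stops changing (in effect a one-vector-per-set variant of the k-means update).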

Concerning the averaging, you're correct. I plan to calculate the element-wise mean values in order to get an "average" vector. AFAIK, this ensures that the sum of distances (as defined above) to the vectors in that class/cluster is minimal. Is that correct? Cluster analysis seems to take the same approach to compute the cluster centers.
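
For illustration, the element-wise mean is a one-liner in NumPy (the array members is a made-up stand-in for the 7 vectors of one class, stacked as rows):

Code:
import numpy as np

# Hypothetical: the 7 vectors of one class, stacked as rows of a 7x15 array.
rng = np.random.default_rng(2)
members = rng.normal(size=(7, 15))

class_mean = members.mean(axis=0)   # element-wise mean, i.e. the centroid of the class
print(class_mean)

One caveat: this centroid minimizes the sum of *squared* Euclidean distances to the class members (the criterion k-means uses for its cluster centers); minimizing the plain sum of distances would instead give the geometric median.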


I hope the problem has become clearer. If not, I'm completely willing to provide more information. I'd also be happy if you could give me a tip on where to read more about geometry, because I realized that I have quite some trouble expressing my problem due to the lack of fitting (English) terms.
