
Implementation of the Gower distance function

I have a matrix of numbers (28 columns and 47 rows). The matrix has an extra row that contains a header for each column ("ordinal" or "nominal").

I want to use the Gower distance function on this matrix. The description I am following says:

The final dissimilarity between the ith and jth units is obtained as a weighted sum of dissimilarities for each variable:

    d(i,j) = sum_k(delta_ijk * d_ijk ) / sum_k( delta_ijk )

In particular, d_ijk represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:

  • factor or character columns are considered as categorical nominal variables and d_ijk = 0 if x_ik = x_jk, 1 otherwise;

  • ordered columns are considered as categorical ordinal variables and the values are substituted with the corresponding position index, r_ik, in the factor levels. These position indexes (that are different from the output of the R function rank) are transformed in the following manner:

    z_ik = (r_ik - 1)/(max(r_ik) - 1)

    These new values, z_ik, are treated as observations of an interval scaled variable.

As far as the weight delta_ijk is concerned:

  • delta_ijk = 0 if x_ik = NA or x_jk = NA;
  • delta_ijk = 1 in all the other cases.

I know that there is a gower.dist function, but I must do it that way. So, for "d_ijk", "delta_ijk" and "z_ik", I tried to make functions, as I didn't find a better way.

I started with "delta_ijk" and I tried this:

Delta=function(i,j){for (i in 1:28){for (j in 1:47){  
+{if (MyHeader[i,j]=="nominal")
+ result=0
+{else if (MyHeader[i,j]=="ordinal") result=1}}}}
+;result}

But I got an error, so I am stuck and can't do the rest.

P.S. Excuse me if I make mistakes; English is not a language I use very often.


Why do you want to reinvent the wheel, billyt? There are several functions/packages in R that will compute this for you, including daisy() in package cluster, which comes with R.

First things first though, get those "data type" headers out of your data. If this truly is a matrix then character information in this header row will make the whole matrix a character matrix. If it is a data frame, then all columns will likely be factors. What you want to do is code the type of data in each column (component of your data frame) as 'factor' or 'ordered'.

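## stringsAsFactors = TRUE is needed from R 4.0.0 on to get the factors shown below
df <- data.frame(A = c("ordinal",1:3), B = c("nominal","A","B","A"),
                 C = c("nominal",1,2,1), stringsAsFactors = TRUE)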

This gives the following; note that all columns are stored as factors because of the extra header information.

> head(df)
        A       B       C
1 ordinal nominal nominal
2       1       A       1
3       2       B       2
4       3       A       1
> str(df)
'data.frame':   4 obs. of  3 variables:
 $ A: Factor w/ 4 levels "1","2","3","ordinal": 4 1 2 3
 $ B: Factor w/ 3 levels "A","B","nominal": 3 1 2 1
 $ C: Factor w/ 3 levels "1","2","nominal": 3 1 2 1
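
The matrix case mentioned above can be checked in the same way. This is a minimal sketch in which `m` is a made-up two-column example, not the asker's data: binding a character header onto numbers coerces every cell to character, and the numbers only come back once the header row is dropped and the remainder is converted.

```r
# `m` is a hypothetical toy matrix, not the asker's 47 x 28 data
m <- rbind(c("ordinal", "nominal"), c(1, 2), c(3, 1))
is.character(m)            # the header row forced every cell to character

types <- m[1, ]            # keep the type labels in a separate vector
num <- matrix(as.numeric(m[-1, ]), nrow = nrow(m) - 1)
is.numeric(num)            # numeric again once the header row is gone
```

Keeping the labels in their own vector (here `types`) means they can still drive the recoding of each column later.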

If we get rid of the first row and recode into the correct types, we can compute Gower's coefficient easily.

> headers <- df[1,]
> df <- df[-1,]
> DF <- transform(df, A = ordered(A), B = factor(B), C = factor(C))
> ## We've previously shown you how to do this (above line) for lots of columns!
> str(DF)
'data.frame':   3 obs. of  3 variables:
 $ A: Ord.factor w/ 3 levels "1"<"2"<"3": 1 2 3
 $ B: Factor w/ 2 levels "A","B": 1 2 1
 $ C: Factor w/ 2 levels "1","2": 1 2 1
> require(cluster)
> daisy(DF)
Dissimilarities :
          2         3
3 0.8333333          
4 0.3333333 0.8333333

Metric :  mixed ;  Types = O, N, N 
Number of objects : 3

This gives the same result as gower.dist() for this data, although in a slightly different format (as.matrix(daisy(DF)) would be equivalent):

> gower.dist(DF)
          [,1]      [,2]      [,3]
[1,] 0.0000000 0.8333333 0.3333333
[2,] 0.8333333 0.0000000 0.8333333
[3,] 0.3333333 0.8333333 0.0000000

You say you can't do it this way? Can you explain why not? You seem to be going to a fair amount of effort to do something that other people have already coded up for you. This isn't homework, is it?
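
If the calculation really has to be done by hand, the formula quoted in the question can be followed fairly directly. The function below is only a sketch: `gower_manual` is a hypothetical name, it assumes every column is already a factor (nominal) or an ordered factor (ordinal), and it divides the ordinal differences by the observed range of z_ik, which is the usual treatment of interval-scaled variables but is an assumption here because the quoted text does not spell it out.

```r
# Hypothetical helper, not part of any package: Gower dissimilarity for a
# data frame whose columns are all factors (nominal) or ordered factors.
gower_manual <- function(df) {
  n <- nrow(df)
  num <- matrix(0, n, n)   # running sum over k of delta_ijk * d_ijk
  den <- matrix(0, n, n)   # running sum over k of delta_ijk
  for (k in seq_along(df)) {
    x <- df[[k]]
    if (is.ordered(x)) {
      # ordinal: position index r_ik rescaled to z_ik
      r <- as.integer(x)
      z <- (r - 1) / (max(r, na.rm = TRUE) - 1)
      # assumption: scale by the observed range of z, as for interval data
      d <- abs(outer(z, z, "-")) / diff(range(z, na.rm = TRUE))
    } else {
      # nominal: 0 if the two values match, 1 otherwise
      d <- 1 - outer(as.character(x), as.character(x), "==")
    }
    delta <- !is.na(d)     # delta_ijk is 0 when either value is NA
    d[!delta] <- 0
    num <- num + d
    den <- den + delta
  }
  num / den
}

DF <- data.frame(A = ordered(c(1, 2, 3)),
                 B = factor(c("A", "B", "A")),
                 C = factor(c(1, 2, 1)))
round(gower_manual(DF), 7)
```

For this three-row example the off-diagonal entries should match the 0.8333333 and 0.3333333 values reported by daisy() and gower.dist() above. A column with a single observed level would give a zero range, so a real implementation would need to handle that case.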


I'm not sure what your logic is doing, but you are putting too many "{" in there for your own good. I generally use the {} pairs to surround the consequent-clause:

Delta=function(i,j){
  for (i in 1:28) {
    for (j in 1:47) {
      if (MyHeader[i,j]=="nominal") {
        result=0
      # the "{" you had before "else" was sabotaging your efforts
      } else if (MyHeader[i,j]=="ordinal") {
        result=1
      }
    }
  }
  # return the value only after both loops have closed
  result
}
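
For comparison, the weight as it is defined in the question depends only on missing values, not on the "nominal"/"ordinal" labels. A minimal sketch, assuming `x` is the data without the header row and that `i`, `j` index units (rows) and `k` a variable (column); the function name and argument order are illustrative only:

```r
# delta_ijk as defined in the question: 0 if either value is NA, 1 otherwise.
# `x` is assumed to be a matrix or data frame without the header row.
delta <- function(x, i, j, k) {
  if (is.na(x[i, k]) || is.na(x[j, k])) 0 else 1
}

x <- data.frame(A = c(1, NA, 3), B = c("a", "b", "a"))
delta(x, 1, 2, 1)  # 0: A is NA for unit 2
delta(x, 1, 3, 1)  # 1: both values are present
```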


Thanks Gavin and DWin for your help. I managed to solve the problem and find the right distance matrix. I used daisy() after I recoded the class of the data and it worked.

P.S. The solution that you suggested in my other topic for changing the class of the columns:

DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)

didn't work. It changed only the first nominal and ordinal column.
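
One way to recode every column in a single step is to loop over the columns together with their type labels. The sketch below is an assumption about how the data are laid out, not the code from the other topic: `recode_by_type` and `types` are hypothetical names, with `types` holding the former header row as one "ordinal" or "nominal" entry per column.

```r
# `types` is a hypothetical character vector holding the former header row,
# one entry ("ordinal" or "nominal") per column of DF.
recode_by_type <- function(DF, types) {
  stopifnot(length(types) == ncol(DF))
  # DF[] keeps the data frame structure while replacing every column
  DF[] <- Map(function(col, type) {
    if (type == "ordinal") ordered(col) else factor(col)
  }, DF, types)
  DF
}

DF <- data.frame(V1 = c(1, 2, 3), V2 = c("A", "B", "A"), V3 = c(3, 1, 2))
DF2 <- recode_by_type(DF, c("ordinal", "nominal", "ordinal"))
str(DF2)
```

Note that ordered() sorts levels by value, so numbers that are stored as text would be ordered alphabetically ("10" before "2"); converting such columns to numeric first avoids that.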

Thanks again for your help.
