/* Filename: DistanceBootstrap.sas Purpose: SAS code to bootstrap the sampling distributiuon of measures of distance of distance between two assemblages of counts. Choose among Euclidean, Manhattan (~Brainard Robinson) and Nei distances. Last Update: FDN 4.8.02 */ data test; input site $ v1-v4; cards; a 100 9 1 19 b 83 15 2 9 ; /*lets do a chi-square test first */ proc transpose data=test out=test1(rename=(col1=count _name_=type)); by site; var _numeric_; proc print; proc freq; weight count; table site*type / chisq; exact chisq; run; proc iml; ************************************************************************** Function: EUCDIST Description:Compute Euclidean distance between two row vectors. Arguments: X1,X2: the two row vectors. pct: set to 1 if you want X1 and X2 coverted to relative frequencies first. Returns: Euclidean distance between X1 and X2. ; start eucdist(x1,x2,pct); if pct=1 then do; x1=x1/sum(x1); x2=x2/sum(x2); end; dist= sqrt((x1-x2)*(x1-x2)`); return(dist); finish eucdist; ****************************************************************************; ************************************************************************** Function: NEIDIST Description:Compute Nei's (1972) distance between two row vectorsof counts. Arguments: X1,X2: the two row vectors. Returns: Nei's (genetic) distance between X1 and X2. ; start neidist(x1,x2); *assumes input as counts; x1=x1/sum(x1); x2=x2/sum(x2); dist=1-((x1*x2`)/sqrt(x1*x1`)#(x2*x2`)); *you can change "-log" to "1-"; return(dist); finish neidist; **************************************************************************; ************************************************************************** Function: MANHATDIST Description:Computes the Manhattan distancee between two row vectors. Arguments: X1,X2: the two row vectors. pct: set to 1 if you want X1 and X2 converted to relative frequencies first. Returns: Manhattan distance between X1 and X2. ; start manhatdist(x1,x2,pct); if pct=1 then do; x1=x1/sum(x1); x2=x2/sum(x2); end; dist= sum(abs(x1-x2)); return(dist); finish manhatdist; **************************************************************************; ************************************************************************** Function: RESAMPLE Description:Samples with replacement from an vector of counts. Arguments: X: a row vector of counts e.g. type frequencies. samplesize: size of the sample to be taken. Returns: a row vector of counts sampled from a population with type frequencies given by X.; start resample(x,samplesize); xout= j(1,ncol(x),0); CUMpctvec=cusum(x)/sum(x); pct= x/sum(x); ntype=ncol(x); do i = 1 to samplesize; rannum=uniform(0); do j=1 to ncol(x); if j=1 then do; if (rannum < CUMpctvec[,j]) then xout[,j]=xout[,j]+1; end; if j ^= 1 then do; if ((rannum >= cumpctvec[,j-1]) & (rannum< cumpctvec[,j])) then xout[,j]=xout[,j]+1; end; end; end; return(xout); finish resample; **************************************************************************; **************************************************************************** Function: GETPVALUE Description:finds the depth of an observed test statistic in a sampling distrbution. Arguments: HOSampDist: The sampling distribution of the test stat, under H0. teststat: The observed value of the test statistic. Returns: The p value associated with a one-tailed test of the statistic in question; start getpvalue(HOSampDist,teststat); nsamples=nrow(HOSampdist); rindex=rank(HOSampDist); sorted=j(nsamples,1,0); sorted[rindex,] = HOSampDist[1:Nsamples,]; fvalues=((1:nsamples)-.5)/Nsamples; do i=1 to nsamples; if i=1 then do ; if teststat <= sorted then pvalue=1; end; if (1 < i) & (i < nsamples) then do; if (sorted[i,] <= teststat) & ( teststat < sorted[i+1,]) then pvalue=1-fvalues[,i]; end; if i=nsamples then do; if sorted[i,] <= teststat then pvalue=0; end; end; return(pvalue); finish getpvalue; ********************************************************************; **************************************************************************** Function: BOOTSAMPDIST Description:Bootstap estimation of the sampling distribution of the distance between two row vectors of counts. Arguments: nsamples: the number of iterations to draw. xmat: 2-row matrix of counts for the two samples. DistType: specify the distance statistic you want: 1=euclidean 2=manhattan (~brainerd-robinson) 3=nei Returns: The sampling distrbution of the distance statistic under HO that both samples came from the same population; start bootsampdist(Xmat,Nsamples,DistType); H0SampDist= j(Nsamples,1,0); do i=1 to nsamples; colmarg= xmat[+,]; n1= xmat[1,+]; n2= xmat[2,+]; x1= resample(colmarg,n1); x2= resample(colmarg,n2); if disttype=1 then dist=eucdist(x1,x2,1); if disttype=2 then dist=manhatdist(x1,x2,1); if disttype=3 then dist=neidist(x1,x2); H0Sampdist[i,]=dist; end; return (H0Sampdist); finish bootsampdist; *********************************************************************************************; /* read in the data from the sas dataset */ use test; read all var _all_ into xmat [colname=vars rowname=site]; /* call the functions */ obseucdist=eucdist(xmat[1,],xmat[2,],1); h0sampdist=bootsampdist(xmat,1000,1); pvalue= getpvalue(h0sampdist,obseucdist); print 'Observed Euclidean Distance' obseucdist; print 'Bootstrapped Probability of a Greater Value' pvalue; create h0sampdist from h0sampdist; append from h0sampdist; /*plot the sampling distribution*/ proc univariate normal plot; run;