3d Heatmap

My eye-data extraction program needed some upgrading and bug fixing.  What it does is read a series of datafiles produced by SMI Vision’s IviewX eye tracking software and displays the gaze data.   It scans each selected trial on each selected participant and counts the number of gaze points on each pixel.  Next, it smooths those counts by adding in the counts on neighboring pixels using a normal-distribution falloff function.  I typically use a radius of 20 pixels and a standard deviation of 6.  These values can now be modified by the user in the program by way of textboxes and an Update button.

Next, those smoothed are converted to a heat map.  The maximum count is set to 1 and every count thereafter is set to a percentage of that maximum (p).

The last step is to map these values to colors.  My old system was a hack, and I honestly don’t recall how I did it, but it did involve a bunch of “if” statements.  But my new one is somewhat more elegant.  Each color is defined by the standard combination of Red, Green, and Blue values.   Each of these values is determined by the function exp( -(p – center)^2 / (2*spread^2) ).   The spread for each is set at .21 and the centers vary.  The center for Blue is set at 0.01, green at .5, and red at 1.  These values and function produce curves for each component as follow. 


Thus, when the percentage is very high, the pixels are mostly red.  As the percentage decreases it fades to orange and then yellow at around a value of .76.  From there, it fades to green at .5, and turns cyan at .25.  Below that, the points are predominately blue.  I also modulate the alpha such that as they turn blue, they are faded out so that areas which receive very little attention do not show brightly.   Alpha is determined by the function 1-(1-p)^15.   It generally produces decent heatmaps as below.


That one is made from a study on latent inhibition using the learning game where people are looking at the screen when a Red sensor is being pre-exposed, collapsed across trials. You can see that the majority of the points are distributed over the sensor area (red, yellow, and green shades) with the remainder scattered about over the screen.  What I have added is the ability to light the heatmap.   

Each count is treated as a height.  Look at the square below and imagine each number is a height at a pixel location.  Every pixel, such as the “1” is surrounded by pixels 2-8 as in the square below.

2 3 4

5 1 6

7 8 9

Imagine a line from 1 to 2, 2  to 3 and back to 1 and you have a triangle.  Do that for all points surrounding 1 and you have 8 triangles. The distances in X and Y are simply the pixel coordinates, but each pixel also has a height variable associated with it, creating a set of triangles oriented in 3d space. I take the 3d direction from “1” to the other two corners of each triangle and cross those directions to get a normal.  Then I average all 8 normals to get a normal for point 1 as if it was a vertex.  I do that for all visible points and use those normals for lighting.  The lighting is a simple diffuse lighting model (intensity being the dot product of the normal with the direction of the pixel to the light).  Its not strictly a diffuse model because I do provide an ambient source of .08 intensity, and allow for the light to interact slightly with back-facing sides (.1 * –(dot(normal, direction_to_light)).

When the “With lighting” button is checked, the scene is lit as if there was a light at the mouse cursor.


Here the light is near the top left corner.  As the image shows, the lighting really turns the heatmap into a terrain with bumps and valleys that reveal details about the gaze patterns that are not evident in the heatmap coloring alone.   By holding the right mouse button down and moving it, you can move the light in real time.   The video below shows that movement using another data set.  In addition to the lighting, the video also briefly shows the effects of changing the radius and standard deviation of the smoothing functions.  Presently all the work is done on the CPU, so it’s a little herky-jerky.  Someday I may move the heavy lifting over the the GPU & make it silky.

video of 3d heatmap
Posted in Eye Tracking, Programming in general | Leave a comment

Some single-subject data.

Using the learning game, one subject (a prominent member of the Learning field) was trained on an A:R+/B:R- discrimination.  In context A: the Red sensor indicated an attack (+), but in context B: the stimulus was presented without outcome (-).   Finally, the subject was tested in context C:, a familiar context where the Red sensor had never before been encountered.


Points show responses per second to the sensor in the absence of the attack.  Thus, on + trials they show the first 5-s before the attack occurs.  On – trials they show all 20-s of responding to the stimulus as no attack occurs. Brown circles indicate responses per second during the first 5-s of the Red sensor, prior to the attack by the spaceship in context A: on each trial.  White circles indicate responding to the Red sensor during all 20-s of each trial of its presentation in context B.  Brown triangles at the far right indicate responding to the Red sensor during all 20-s of its presentation on each test trial in context C:.

The subject began to respond in anticipation of the attack on trials AR+3 and AR+4.  On the next trial (BR-1), where the sensor was first encountered in context B, the subject responded considerably across the duration of the stimulus.  The effect of the absence of the outcome was evident on the next trial (AR+ 5) in context A: where the subject responded minimally to the sensor.   The effect of the attack on that trial was evident on the following trials (BR-2, BR-3) in context B.  Here the subject responded to the sensor and showed some evidence that the attack was expected after five seconds.  Responding peaked near the 5-s mark, and then began to decline. 

On return to context A: the participant again anticipated the attack on the next three trials (AR+6 thru AR+8) as well as when the sensor was next encountered in context B (BR-4).   The absence of the attack on that trial once more reduced performance on the following trial (AR+9).  However, at that point the contexts appeared to acquire control.   The sensor failed to elicit responding in context B on trials BR-5 thru BR-8 and elicited strong responding in context A (AR+10).  

On test, the sensor elicited robust responding in context C, although it had never been presented there before.  Of the two meanings (attack, no attack) that had appeared to come under the control of contexts A and B, only the latter (no attack) appeared to be under contextual control. Anticipation of attack in the presence of the sensor generalized to context C.

The effect is similar to ABC renewal where conditioning occurs in context A, extinction in context B, and testing in context C.  The difference here is that rather than a phase of conditioning and a phase of extinction the two types of trials were intermixed in an explicit discrimination.  Whether or not this training affects the extent to which the two associations conditioned to the sensor differentially transfer to context C, relative to a standard conditioning and extinction treatment, is presently under investigation.

Posted in Uncategorized | Leave a comment

3d plotting in Excel

Ever wanted to make a 3-d Scatterplot and find that Excel just won’t do it?  I have.  There are add-ins you can use, that are of limited help in most cases.  There are also commercial add-ins you can buy.  I don’t know how useful these latter products are as I haven’t spent the money on them. 

I’m going to show you how to do it here.  There is no magic bullet, no special menu trick.  You have to engage in a little 3-d programming.   But its not as hard as you might think (its not hard at all).  With a tat of diligence, a bit of focus, and some patience, you should be on your way after going through this post.

It suddenly appeared to me that because I have done a considerable amount of 3-d graphics programming in designing my research methods, that the same techniques should apply.  All a 3d image is is simply a projection of the 3d points onto a 2d plane.  So, when a series of XYZ points is entered into Excel, they can be projected into a simple XY series and plotted in a normal 2d scatter graph.   Simple?  Well, yes, it is. 

The main step is projecting the 3d points into 2d space. Along the way we will want to do other things, like rotate and move the points.  The last thing is to make sure they are drawn from back to front.  I’ll cover each of these in this post.  You can get the spreadsheet to follow along here.  When you follow that link, google drive will show you the contents of the spreadsheet, which are far more disorganized than the spreadsheet itself. Look for the download link on that page (usually an arrow pointing down at the top of the page- move your mouse around to reveal) to save it.  It will not display correctly if you try to open it in google sheets.

Any graph will need Axes.  Since there are no 3d axes in Excel, we will have to make them.  What we will do is create a series of points that when connected will form an open box.  It will have a bottom and back sides, but no top or front.  In the Spreadsheet, the grid will look like the image below. 


Each set of XYZ coordinates defines a point, and connections will be made with lines between adjacent points when Excel does the scatterplot.  Some points are duplicated, as are some lines. That is necessary.  Sometimes when you draw from A to B, you have to re-draw from B to A to get from A to C. 

It is a good idea to set up each cell as a function ( = +/- .5*$b$2 ) with the +/- signs arranged accordingly. That way, we can modify the width (10 in this case) to change the size of our grid on the fly.  Ok- go make your grid and come back.

Next, we must project  those points from 3d to 2d.  The easiest way to do this is simply to do what is called a perspective divide.  Divide each X by Z and each Y by Z.   The resulting XY pairs are your screen coordinates.  image

These projected values are the numbers you will use in your graph series.  After the projection transform, the XY coordinates for your Axes should look like below.


Now, go to your INSERT tab on the ribbon.  Insert a SCATTER graph and choose “Scatter with straight lines and markers.”


Now you should see a blank graph.  Right-Click the graph and “Select data”.  In the following menu, click “Add” under Legend Entries (Series).


In the dialog box that will appear, name the Series “Axes”, and select all the data under Screen coordinates X in the spreadsheet for X and all the data under Y in the spreadsheet for Y.  Your resulting Graph should look a lot like this one below.


Kinda weird looking…   Its weird looking because it is trying to draw it around our head.  We did our perspective transform as if we were in the middle of that 3d universe.   So, what we need to do is add a “distance” variable.  At Cell D2, type the word “Distance” and put a “10” in Cell D3. 


We will use this distance to move our grid out away from us.  Over in our screen coordinates, where we divided each X and Y by Z, we are going to change the formula.  We will divide each X and Y by Z+Distance.  So change, for example, = A4/C4 to =A4/(C4+$D$3).  Paste that down through your Axes values.  Now your graph should start to look more sensible, like the one below. image


  Above is our box, with no top and no front sides.  Lets clean it up and make it pretty.

Right click the plot area, on one of the horizontal grid lines and delete those.   Then, right click each axis and set the min and max values to –1 and 1, respectively.   Also change where each axis crosses the other to Axis Value –1.   On the Spreadsheet change Distance to 12.  Scale the graph up a bit (moving the width & height of the graph in excel) and play with the Width/Height until your axes look more or less of equal length on the screen.   Your graph should look something like this:


Play with the Distance variable some to see its effect.  At this point, you should have a 3d set of axes.  You could start entering real data now as a new series.  Just divide each X and Y by Z+Distance to get what you would plot on the screen.   But, there is more that we will want to do.

First, pretty up your Graph.  Right click on one of the lines and choose “Format Data Series.”  Under Marker Options, choose a round point, about 8 points big.  Make it a nice color of your choice.  Do the same for Line Color- click “Solid Line” and pick your color.  Do the same for “Marker Line color”.  Now add some dazzle.  Click on “3-D format”.  Where it says Bevel, click on Top and click the first option under Bevel which is Circle.  Now you should have 3d-looking points for your Box corners.


Ok- the last thing to do with our Axes is to be able to move our graph around to be able to look at the data from different perspectives.  This process is going to involve a little matrix math.  The easiest way to implement the math we need to do is to write some functions in Visual Basic for Applications (VBA).   Its not as hard as it sounds.  I’ll put the functions here, and you can open VBA and paste them in. 

The first function we need is one to take three points, add a fourth (XYZW) and multiply them by a matrix

Function vmult(ByRef vector() As Double, matrix() As Double) As Boolean
    Dim i As Integer
    Dim j As Integer
    Dim vals(1 To 4) As Double
    vals(1) = vector(1) * matrix(1, 1) + vector(2) * matrix(2, 1) _
              + vector(3) * matrix(3, 1) + vector(4) * matrix(4, 1)
    vals(2) = vector(1) * matrix(1, 2) + vector(2) * matrix(2, 2) _
              + vector(3) * matrix(3, 2) + vector(4) * matrix(4, 2)
    vals(3) = vector(1) * matrix(1, 3) + vector(2) * matrix(2, 3) _
              + vector(3) * matrix(3, 3) + vector(4) * matrix(4, 3)
    vals(4) = vector(1) * matrix(1, 4) + vector(2) * matrix(2, 4) _
               + vector(3) * matrix(3, 4) + vector(4) * matrix(4, 4)
    vector(1) = vals(1)
    vector(2) = vals(2)
    vector(3) = vals(3)
    vector(4) = vals(4)

    vmult = True

End Function

Next, we need a function to create a rotation matrix.  This function will take an Angle in degrees by which you want to rotate the points as well as an Axis (“X” or “Y”) around which you wish to rotate.  I’m not going to include a “Z” option as with 3 axes we have 2 degrees of freedom.  Any orientation can be achieved by a combination of rotating around X and Y.  This function will be somewhat hidden.  You should never have to use it directly as it will be used in yet another function.

Function return_rotation(angle As Double, axis As String, ByRef result() As Double) As Boolean
    Dim x As Double: Dim y As Double: Dim z As Double
    Dim num As Double: Dim num1 As Double:
    Dim num2 As Double: Dim num3 As Double:
    Dim num4 As Double: Dim num5 As Double:
    Dim num6 As Double: Dim num7 As Double:
    Dim num8 As Double
    axis = UCase(axis)
    If axis = “X” Then
     x = 1: y = 0: z = 0
     x = 0: y = 1: z = 0
    End If
    angle = angle / 57.2957795
    num = Sin(angle): num2 = Cos(angle)
    num3 = x ^ 2: num4 = y ^ 2:  num5 = z ^ 2:
    num6 = x * y: num7 = x * z: num8 = y * z
    result(1, 1) = num3 + num2 * (1 – num3)
    result(1, 2) = num6 – num2 * num6 + num * z
    result(1, 3) = num7 – num2 * num7 – num * y
    result(1, 4) = 0
    result(2, 1) = num6 – num2 * num6 – num * z
    result(2, 2) = num4 + num2 * (1 – num4)
    result(2, 3) = num8 – num2 * num8 + num * x
    result(2, 4) = 0
    result(3, 1) = num7 – num2 * num7 + num * y
    result(3, 2) = num8 – num2 * num8 – num * x
    result(3, 3) = num5 + num2 * (1 – num5)
    result(3, 4) = 0
    result(4, 1) = 0
    result(4, 2) = 0
    result(4, 3) = 0
    result(4, 4) = 1

return_rotation = True

End Function

Ok- we have functions to create a rotation matrix and multiply our points by that matrix.  Now we need a function that we will call from the spreadsheet cells to put them together and do the rotation.

Function rotate(x As Double, y As Double, z As Double, xrot As Double, yrot As Double) As Double()

Dim rotate_x(1 To 4, 1 To 4) As Double
Dim rotate_y(1 To 4, 1 To 4) As Double
Dim vector(1 To 4) As Double
Dim retval(1 To 3) As Double

vector(1) = x: vector(2) = y: vector(3) = z: vector(4) = 1
ok = return_rotation(xrot, “X”, rotate_x)
ok = return_rotation(yrot, “Y”, rotate_y)

ok = vmult(vector, rotate_x)
ok = vmult(vector, rotate_y)

retval(1) = vector(1)
retval(2) = vector(2)
retval(3) = vector(3)

rotate = retval

End Function

When we want to rotate our points, we call =Rotate(xcell,ycell,zcell, x-axis rotation, y axis rotation) and it will return an array of 3 points- our original points rotated by X axis rotation around the X axis and Y axis rotation around the y axis.

To use these functions, we have to get them into Visual Basic for Applications.  That isn’t hard.  Open Excel.  Do you see a Tab that says “Developer” ?  If so- good.  If not- you have to change your Excel Options to show the Developer tab. 

If you don’t have a Developer tab. Click on the Office Button…


and choose Excel Options.



Choose “Popular” and click Show Developer Tab in the Ribbon.




Now, click the developer tab.


And on that tab click Visual Basic.


You should have just opened the Visual Basic for Applications Editor.  On the left you should see the “Project Explorer” window.  If not, you can have it displayed under the “View” menu.  Once displayed, look for “VBAproject(Book1) as circled in Red below.


Right-Click that text and select “Insert” and then “Module.” 


A new module should now appear, with a blank window to the right.  There may be text at the top of the window (e.g., “Option Explicit”).  Whether there is text or not, delete it and make sure the first line reads “Option Base 1”.


Below that line “Option Base 1” is where you will paste the functions from here.  Just copy & paste.

You might want to save your spreadsheet at this point.  It must be saved as a “Macro enabled” spreadsheet (xlsm), otherwise the VBA code won’t be saved and available.

Go back to your sheet, and click on Column F and then on the Home tab go to “Insert” and Insert 3 columns.  We’re going to need a bit more space.

To make this more fun- lets get a little fancy.  Lets set up a couple of scroll bars to use to update our X and Y axis rotation values.  Go to cell D4 and type X rotation.  In Cell E4 type Y rotation.   Put a zero in each cell below them.

Now go to the Developer Tab, click on “Insert” and under “Form Control” choose the scrollbar.


Your cursor will change to a “+”.  Left click and hold to draw the scroll bar in Column D under the X rotation cells.  Repeat these steps to put a scroll bar under the Y rotation cells.

Right click the bar under X rotation and choose “Format Control.”  On the “Control” tab, set current value to 0, Minimum Value to 0, Maximum value to 360, Incremental change to 1, and page change to 5.  Set Cell Link to the cell containing the X rotation.


Repeat that for the other scroll bar and the Y axis.  Now we have these bars linked to these cells so that updating the bar will change the cell values.  We will use these cell values in our rotate functions.

We’re almost there-  Go to G2 and type “Rotated Values” and then on G3-H3, put X, Y, and Z.

Now, select cells G4 to I4.  Type =Rotate(a4,b4,c4,d$5,e$5) and DO NOT PRESS ENTER.  Press SHIFT+CNTRL+ENTER.   This formula returns an array of 3 points.  The special way of entering the formula tells excel to expect more than one result and to put it them in the cells you’ve high lighted.

What you should see is a copy of cells A4-C4 (assuming X rotation and Y rotation are both Zero).   Highlight G4-H4 and paste them down to row 15 to apply the rotation to every point in our grid.

We can now rotate our original points.  So we need to put the rotated points onto the screen rather than the original points.  Go back to the formula for our Screen Coordinates.  Modify the values to take the Rotated Values of X, Y, and Z rather than the original ones.  For example change =A4/(C4+$d$3) to = g4/(I4+$d$3).  Modify X and Y, then paste the new formulas down the line. 

Now start playing with your scroll bars and watch the magic in your axes.

Lets create some interesting series to put in to complete the example.

The data we are going to fill in have nothing to do with the steps involved in making a 3d plot.  I’m just generating some cool data to plot.

First, set your grid to a width of 20.  Set your distance to 25.

At A17 I’m going to begin a new series of data.  I’ll call it GreenStuff. I’ll put the Labels Factor, X, Y, and Z in cells A18 – D18.

Set A19 to –10.  Set A20 to =A19+20/50 and copy that down to A69.

In Cells B19-D19, put the formula “=$A19+NORMINV(RAND(),0,1)”  and then paste it down to row 69.  These are our raw data.  Once you have the pasting done, select all those data and paste-special them back into the spreadsheet where they are.  The special paste operation will be “Values.”  Otherwise, every time we do something to the sheet, the formula will update and the pattern will change.

Now rotate them. Hilight cells G19-I19 and SHIFT-CNTRL-ENTER =Rotate(b19,c19,d19,$D$5,$E$6) into those cells.  Paste that down to row 69.

Do the perspective transform. Copy cells M15 and N15 (the last row of your Screen Coordinates) and paste them in the same columns between rows 19 and 69.

Ok- right click your graph. Select series, “add”, select “Greenstuff” for the name, and the new X values (M19..M69) for X, and the new Y screen values for Y.

You should have an ugly graph now! Click on the new data points and format the series.  Remove the connecting lines (line = none), and make your symbols something interesting, like green 3d formatted circles.

Copy all the new stuff you just made, from lines 17 – 69, and paste them beginning at line 72.  Change the name “Greenstuff” to Redstuff.

In Cells B74 put the formula “=-$A74+NORMINV(RAND(),0,1)”  Leave the other cells alone.  Copy and paste that cell down to row 124.  Once done, copy the results and paste them back into the same place using paste special –> values.

Now, enter the new screen coordinates as a new series.  Once in, format it to be red circles.  Play with your rotations.   Here is what mine looked like.


You should observe two things now.  First, small Z values are close to you initially.  That is, –Z points towards the viewer.  Second, as you rotate, the points don’t necessarily obscure each other as they should.  That is because Excel draws the series in order, from first values to last, and then from Series 1 to Series N.  Later drawn points will always be on top of earlier drawn points regardless of where they are supposed to be in space.

That can be corrected by sorting the points in Z order.  There are ways to do this on the fly, but I’m not going to get into it here, as even that won’t solve the series order problem.  But for a single series, sorting the points by Z improves things.  If you’re determined to do it, look here. 

In another spreadsheet you can download, I’ve implemented the sorting.  It does make a difference within a single series.  It also creates a neat effect to add some transparency to the points.

What if your points are out of the range of the graph?  Modify your axes width and distance.   Or, scale your points down. 

In short, create a set of XYZ points to make some form of Axes.  Make that your first series.

Enter your XYZ data for each subsequent series.

Use the Rotate function to create rotated points.

Apply the perspective division with the distance added to convert each XYZ triplet to an XY set.

Sort by Z, large to small, if you want.

Create a 2d scatterplot.  Select your series as appropriate based on their XY screen coordinates.  Format your series to your liking.

That’s about it.  You should be able to take it from here.


Posted in Programming in general, Statistics | Leave a comment

Visualizing effect sizes, the effect of N, and significance.

In my previous post, I discussed effect sizes and provided one  conceptualization of them as measures of how easy it is to see your effect, unaided by measurement, in the population.  The two measures I am going to discuss here are Cohen’s d, and eta/partial-eta squared. 

According to a variety of internet sources, Cohen has defined d’s of .2, .5. and .8 as delimiting small, medium, and large effect sizes, respectively (I cannot get access to his 88 book, so I am not able to say these things with certainty).  With respect to the proportion of variance accounted for by the effect (i.e., eta squared), he has set .02, .13, and .26 as the delimiters.  As with anything to do with statistics there are a variety of arguments against these descriptors, and they are generally considered very rough guidelines at best.  Because d represents the number of standard deviations apart a set of means is, it also represents overlap.  A d of .2 represents an overlap of about 85%, .5 at about 67% and .8 at about 53%.

I think it is hard to really visualize effect sizes.  There is an excellent demonstration maintained by Kristoffer Magnusson (his entire site is excellent and recommended reading) that allows us to visualize what is meant by effect size, to a certain extent.  The image below was copied from a screen shot of his work.


Here, we have two nicely smoothed distributions that are spaced about .9 standard deviations apart. That would be, by Cohen’s criteria, a large effect. By looking at these, it is relatively easy to see that they are two different distributions.  But, in a way, the variability represented by the width of the curves is hidden by their smoothness and the coloring.  Looking at two distributions is not quite the same as observing an effect of a variable in the absence of measurement. 

What I tried to do was to construct a representation that allows the interaction of variance within and between groups to be better represented. I created a spreadsheet that generates 2-d scatterplots of points drawn from two different populations.

You can download the spreadsheet here if you want to follow along. 

(When you go to the Google Drive to download the sheet, it will display the contents as if there were 174 different pages.  Just look for the download link (usually an arrow pointing down at the top of the page) to download the file.  The file will not open and function correctly in Google Sheets.)


imageThe cell and scroll bar at Top left allows you to enter the size of your sample, and your desired effect size as Cohen’s d.  You can type in an effect size directly at cell F3, but you must enter one 10 times greater than what you want (i.e., enter 5 for an effect size of .5). 

You can also modify the standard deviations of the points.   The spreadsheet will then generate 2000 sets of points, (1000 per group) such that the standard deviation of their variation in each of X and Y will be equal to what you set, aimagend the effect size as measured between their centroids will be more or less equal to the d you set.  These two large samples will be treated as populations and are graphed in the scatterplot on the right of the screen. 

On the left of that plot you will see a scatterplot of the two small samples drawn from the larger samples.

imageThe sample centroids are indicated by the Red triangle and Blue square. 

The spreadsheet reports the local univariate stats on the samples (i.e., means, standard deviations on X and Y separately) as well as doing a 2d ANOVA on the samples.  

All of that information is updated in real time with changes in the values of the Sample Size, desired d, and Standard Deviation.

Below, for example, is an ANOVA on two samples of size 30 were drawn from populations with an effect size of d=.7.  The ANOVA shows the means to be significantly different a p =.02 with d estimated at .59. 


To the Left of the ANOVA table are the stats on the two larger samples (populations).


To arrive at effect sizes for these 2-variable distributions I calculated a sum of squares for the difference between the distributions (distance between the two group centroids squared times the number of points per group).  Within each group the sum of squared distances from their respective centroids was computed to get the “Error” term, and finally the sum of squared distances from every point in each group from the grand mean was calculated to get the Total.  As in standard ANOVA, SS total = SS Effect + SS Error.  Eta squared is simply SS Effect / SS Total.

d was calculated by taking the distance between the group centroids and dividing it by the average within-group error.

There are many fun things to do with the spreadsheet.  First, fix a sample size and play with modifications of the effect size.  Then you can get a feel for how big effect sizes need to be to really be able to see them in your samples and populations.  To be readily apparent, they generally need to be bigger than even Cohen’s “Large” effect size.

You can also see the difference between “significance” and effect size.  For example, set the sample size to 5 and the effect size to .5.  Now, press Cntrl+alt+F9 all at the same time.  That key combination should cause the sheet to generate new samples and update all cells.  Keep doing that repeatedly.

As you press Cntrl-Alt-F9 repeatedly, there are many things to observe.  First observe that the Global parameters based on N = 1000 change very little.   The statistics for the smaller samples change more considerably.  The Red triangle and Blue Square should jump around quite a bit.  That reflects the sample-to-sample variability you get with small N. 

Notice also the p in the ANOVA table.  Seldom will it appear < .05. It is only when the sample-to-sample variability produces a large effect size that p will be < .05.    The two “populations” really are different.  They were programmed to be different with an effect size of about .5.  But, whenever the samples accurately estimate that effect size, no significant difference is found (a type II error is made). 

With a small effect size in the population, and small samples, we only make the correct inference as to the populations being different when the samples incorrectly over-estimate the population effect size.  For that, there are a number of “corrections” (.e.g., Hedges g for d, omega-squared for eta etc.) that can be applied to effect sizes to compensate for that overestimation.  A fun exercise would be to head over to Wikipedia, grab the formula and enter them into the spreadsheet to see how well these compensations work.

Now, keeping N set at 5, increase the effect size to something Cohen would think of as huge, like 1.5 and repeat your cntrl-alt-F9 cycling.  What you should observe is that we find more significant differences, and over-estimate the population effect size less in those cases.  With small N we don’t know very well if a significant difference reflects a big effect in the population, or an over-estimated effect in the population.

Next, investigate the effect of changing N.  Set the effect size to .5 again, and increase N to 50.  As you cycle with cntrl-alt-F9 you should observe considerably less variability in your group means (red triangle-blue square).  P should be significant more often, and our over-estimations of the effect size should be considerably reduced.

Creating this spreadsheet was a fun exercise, and it was very instructional for me.  I hope that you find it at least somewhat useful in understanding effect sizes, significance, and their relationships to sample size.

Posted in Statistics | Tagged , , , , | 2 Comments

Effect size rambling.

I am going to ramble about effect sizes.  Like anything I write about, I am sure the ideas are not new and have been written about by people much smarter than me.  But, I like to organize my thoughts and this blog is my way of doing that.

I may be being overly pessimistic, but I’m going to complain that effect sizes are a relatively useless statistic that editors are causing us to calculate.

When an experiment is run, the scientist typically ends up with two or more groups of data.  They then summarize those groups in some way, such as a mean or a median, and look to see the extent to which the groups are different.   Part of that process is to determine the probability of obtaining that difference by chance.  That is the process of “significance” testing.

Much has been made recently, and rightly so, about the process of significance testing in that it can (and is) easily abused.  The researcher has many degrees of freedom that he or she can use in conducting the test that can improve their odds of obtaining a “statistically significant” result.

Along with significance testing, there is now a push to report effect sizes (and even confidence intervals on effect sizes…).  Part of this push is likely in response to the abuses within significance testing.  But, effect sizes can also affected by the same researcher degrees of freedom.

An effect size is a measure of how “big” the presumed effect of some variable is.  I like to think of it as Cohen describes it (or at least that is where I believe to have encountered this idea), the effect size has to do with the ability to see any difference between the groups without any summary statistics.

Suppose you gathered people from Spain into a room.  There are people from the north and the south, and people from the north are a little taller from those in the south.   But just by looking into the room and observing heights, could you tell there are two different groups of people there?   Probably not- that would be a small effect of geography.  You’d have to measure each person and calculate some summary statistics to begin to see that there are two groups.

Now, suppose you put people from south Spain and people from the Netherlands into a room.  Just by looking into the room you could probably tell that there were two different kinds of people in the room based on height (folks are tall in the Netherlands).  In this case, there would be a large effect of geography.

Effect sizes are generally measured as a ratio of the extent to which the central tendencies of conditions differ over the variability within the conditions. The mean group differences divided by the standard deviation of the within-group differences gives us Cohen’s d.  The sum of squared differences between the groups over the sum of squared differences between the groups plus the sum of squares within each group (ss between / (ss between+ss within) gives us eta squared (or partial eta squared, depending on the number of independent variables).  These latter statistics generally represent the proportion of total variance that is attributable to the effect.

Statistical significance is also calculated in very similar ways. 

The major difference being that statistical significance also depends upon the size of the sample (the number of different people in the room).  Nevertheless, the two are generally related.   For instance, the statistic “t” used in significance testing is equal to “d”, a measure of effect size, when divided by the square rood of N  (  t / sqrt( N ) = d ).  Eta or partial eta squared is equal to F * DF1 / (F * DF1 + DF2) using the F obtained in analysis of variance.

As both the test statistic used to determine significance and the effect size are a function of the variability in the samples, anything that reduces that variability should improve both the size of the effect and the statistical significance.   I would suggest that property is what makes the concept of an effect size somewhat meaningless.

I’ll do like Staddon does in his book Adaptive Dynamics and adopt a pendulum for an example (though this is not his usage of the example).  I have two groups of 20 pendulums.  In one group the length is 10cm, and in the other the length is 11cm.  The one that is 11cm in length should swing more slowly than the one that has a length of 10cm. 

If you look at the two groups, you can probably pretty clearly determine that one group is swinging faster than the other.  Even though there may be some variability between individual pendulums (the researcher set the lengths to 10 or 11cm +/- 1mm, some pendulums have slightly tighter pivots etc..)  the variables responsible for that variability do not have enough impact to prevent you from seeing the overall difference that pendulum length makes. 

Suppose I then told you that the color should matter.  Black pendulums should swing slower than white ones in a well-lit room.  My imaginary idea here is that the black ones will retain heat better, expand slightly and thus swing more slowly.    After several hours, you probably could not see a difference.    Even after going in and measuring the individual tick-speeds you probably could not see a difference.

So- determined to prove support your hypothesis, what do you do?  You go in and start over.  You precisely measure and set the starting lengths to be exactly 10cm, with no variation in length.   You mix a concoction of WD-40 and precisely lubricate each pivot.  You place micro-switches on each pendulum to record their swing-time down to the microsecond.  You put them in glass, precisely vacuum-sealed boxes to eliminate any effect of air- and so forth.  You control every tiny variable that you can imagine might make a difference.   Now, when you start your pendulums swinging, after an hour you still can’t notice that he black ones are swinging faster than the white ones.   It again looks like you have no effect.  But lo- when you look at your computer-recorded data you see that the black ones are swinging .04 miliseconds faster than the white ones.   Within the black or white ones, there is only a .001 ms difference.  That is, between any two black ones or any two white ones, there is only .001ms timing difference, but between any black and any white one, there is a  .04 ms difference.  You have a difference between the pendulums of different colors that is 40 times bigger than the difference between pendulums of the same color.  You will calculate a HUGE effect size.

But does it matter?  Will the color variable make a difference?  Well- yes.  Will it make any practical difference in the operation of the world- absolutely not.     Effect sizes, we are taught, are supposed to help us with that latter question, but the truth is, they are relatively meaningless outside the context of the way in which the research was conducted.  We must consider the total number of variables controlled, and the quality of that control.  With greater control, effect sizes become less meaningful (imo). See also- visualizing effect sizes.


Posted in Statistics | 1 Comment


I’ve been getting a lot of views this month from Belarus-  I’m just asking out of curiosity, what is it Belarusians have found interesting?

Posted in Misc Fun Stuff | 2 Comments

Doing science.

Sometime, somewhere, somehow, my work will make a meaningful difference in the course of things. It will matter. I will receive no accolades, fame, fortune, nor security from it. I may even be dead when it happens. But some day my work will be used to connect two unconnected dots in a pattern and permit great discovery. He or she who connects all of the pieces will, likely, not recognize the importance of the individual parts in their contribution to what he or she has built, but they will be there nevertheless and my life’s labor will not have been in vain. That is the best that a good scientist can hope for.

Posted in Uncategorized | 2 Comments