R Handbook: Hypothesis Testing and p-values (2024)

Initial comments

Traditionally, when students first learn about the analysis of experiments, there is a strong focus on hypothesis testing and making decisions based on p-values. Hypothesis testing is important for determining if there are statistically significant effects. However, readers of this book should not place undue emphasis on p-values. Instead, they should realize that p-values are affected by sample size, and that a low p-value does not necessarily suggest a large effect or a practically meaningful effect. Summary statistics, plots, effect size statistics, and practical considerations should be used. The goal is to determine: a) statistical significance, b) effect size, c) practical importance. These are all different concepts, and they will be explored below.

Statistical inference

Most of what we've covered in this book so far is about producing descriptive statistics: calculating means and medians, plotting data in various ways, and producing confidence intervals. The bulk of the rest of this book will cover statistical inference: using statistical tests to draw some conclusion about the data. We've already done this a little bit in earlier chapters by using confidence intervals to conclude if means are different or not among groups.

As Dr. Nic mentions in her article in the "References and further reading" section, this is the part where people sometimes get stumped. It is natural for most of us to use summary statistics or plots, but jumping to statistical inference needs a little change in perspective. The idea of using some statistical test to answer a question isn't a difficult concept, but some of the following discussion gets a little theoretical. The video from the Statistics Learning Center in the "References and further reading" section does a good job of explaining the basis of statistical inference.

One important thing to gain from this chapter is an understanding of how to use the p-value, alpha, and decision rule to test the null hypothesis. But once you are comfortable with that, you will want to return to this chapter to have a better understanding of the theory behind this process.

Another important thing is to understand the limitations of relying on p-values, and why it is important to assess the size of effects and weigh practical considerations.

Packages used in this chapter

The packages used in this chapter include:

• lsr

The following commands will install these packages if they are not already installed:

if(!require(lsr)){install.packages("lsr")}

Hypothesis testing

The null and alternative hypotheses

The statistical tests in this book rely on testing a null hypothesis, which has a specific formulation for each test. The null hypothesis always describes the case where, e.g., two groups are not different or there is no correlation between two variables, etc.

The alternative hypothesis is the contrary of the null hypothesis, and so describes the cases where there is a difference among groups or a correlation between two variables, etc.

Notice that the definitions of null hypothesis and alternative hypothesis have nothing to do with what you want to find or don't want to find, or what is interesting or not interesting, or what you expect to find or what you don't expect to find. If you were comparing the height of men and women, the null hypothesis would be that the height of men and the height of women were not different. Yet, you might find it surprising if you found this hypothesis to be true for some population you were studying. Likewise, if you were studying the income of men and women, the null hypothesis would be that the income of men and women are not different, in the population you are studying. In this case you might be hoping the null hypothesis is true, though you might be unsurprised if the alternative hypothesis were true. In any case, the null hypothesis will take the form that there is no difference between groups, there is no correlation between two variables, or there is no effect of this variable in our model.

p-value definition

Most of the tests in this book rely on using a statistic called the p-value to evaluate if we should reject, or fail to reject, the null hypothesis.

Given the assumption that the null hypothesis is true, the p-value is defined as the probability of obtaining a result equal to or more extreme than what was actually observed in the data.

We’ll unpack this definition in a little bit.

Decision rule

The p-value for the given data will be determined by conducting the statistical test.

This p-value is then compared to a pre-determined value alpha. Most commonly, an alpha value of 0.05 is used, but there is nothing magic about this value.

If the p-value for the test is less than alpha, we reject the null hypothesis.

If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis.
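
As a minimal sketch of this decision rule in R, with a made-up p-value of 0.03 (not one from any test in this chapter):

p.value = 0.03     ### hypothetical p-value from some test
alpha   = 0.05     ### pre-determined alpha

if(p.value < alpha){
   print("Reject the null hypothesis")
} else {
   print("Fail to reject the null hypothesis")
}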

Coin flipping example

For an example of using the p-value for hypothesis testing, imagine you have a coin you will toss 100 times. The null hypothesis is that the coin is fair—that is, that it is equally likely that the coin will land on heads as land on tails. The alternative hypothesis is that the coin is not fair. Let's say for this experiment you throw the coin 100 times and it lands on heads 95 times out of those hundred. The p-value in this case would be the probability of getting 95, 96, 97, 98, 99, or 100 heads, or 0, 1, 2, 3, 4, or 5 heads, assuming that the null hypothesis is true.

This is what we call a two-sided test, since we are testing both extremes suggested by our data: getting 95 or greater heads or getting 95 or greater tails. In most cases we will use two-sided tests.

You can imagine that the p-value for this data will be quite small. If the null hypothesis is true, and the coin is fair, there would be a low probability of getting 95 or more heads or 95 or more tails.

Using a binomial test, the p-value is < 0.0001.

(Actually, R reports it as < 2.2e-16, which is shorthand for the number in scientific notation, 2.2 x 10^-16, which is 0.00000000000000022, with 15 zeros after the decimal point.)
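
As an aside, if you want R to print such a number in common decimal notation, one option is the format function:

format(2.2e-16, scientific = FALSE)     ### prints the value in fixed decimal notation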

Assuming an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis. That is, we conclude that the coin is not fair.

binom.test(5, 100, 0.5)

Exact binomial test

number of successes = 5, number of trials = 100, p-value < 2.2e-16

alternative hypothesis: true probability of success is not equal to 0.5
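
To connect this output back to the definition of the p-value, the same two-sided probability can be computed directly from the binomial distribution. A sketch, which works here because the null distribution is symmetric when the hypothesized probability is 0.5:

pbinom(5, size = 100, prob = 0.5) +             ### probability of 0 to 5 heads
   (1 - pbinom(94, size = 100, prob = 0.5))     ### plus probability of 95 to 100 heads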

Passing and failing example

As another example, imagine we are considering two classrooms, and we have counts of students who passed a certain exam. We want to know if one classroom had statistically more passes or failures than the other.

In our example each classroom will have 10 students. The data is arranged into a contingency table.


Classroom  Passed  Failed
A          8       2
B          3       7

We will use Fisher's exact test to test if there is an association between Classroom and the counts of passed and failed students. The null hypothesis is that there is no association between Classroom and Passed/Failed, based on the relative counts in each cell of the contingency table.

Input =("
Classroom Passed Failed
A 8 2
B 3 7
")

Matrix = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))

Matrix

Passed Failed
A 8 2
B 3 7

fisher.test(Matrix)

Fisher's Exact Test for Count Data

p-value = 0.06978

The reported p-value is 0.070. If we use an alpha of 0.05, then the p-value is greater than alpha, so we fail to reject the null hypothesis. That is, we did not have sufficient evidence to say that there is an association between Classroom and Passed/Failed.

More extreme data in this case would be if the counts in the upper left or lower right (or both!) were greater.


Classroom  Passed  Failed
A          9       1
B          3       7

Classroom  Passed  Failed
A          10      0
B          3       7

and so on, with Classroom B...

In most cases we would want to consider as "extreme" not only the results when Classroom A has a high frequency of passing students, but also results when Classroom B has a high frequency of passing students. This is called a two-sided or two-tailed test. If we were only concerned with one classroom having a high frequency of passing students, relatively, we would instead perform a one-sided test. The default for the fisher.test function is two-sided, and usually you will want to use two-sided tests.

Classroom  Passed  Failed
A          2       8
B          7       3

Classroom  Passed  Failed
A          1       9
B          7       3

Classroom  Passed  Failed
A          0       10
B          7       3

and so on, with Classroom B...

In both cases, "extreme" means there is a stronger association between Classroom and Passed/Failed.
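
If a one-sided test really were wanted, fisher.test accepts an alternative argument; for a 2 x 2 table the direction refers to the odds ratio. A sketch:

fisher.test(Matrix, alternative = "greater")     ### one-sided test; the default is "two.sided"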

Theory and practice of using p-values

Wait, does this make any sense?

Recall that the definition of the p-value is:

Given the assumption that the null hypothesis is true, the p-value is defined as the probability of obtaining a result equal to or more extreme than what was actually observed in the data.

The astute reader might be asking herself, "If I'm trying to determine if the null hypothesis is true or not, why would I start with the assumption that the null hypothesis is true? And why am I using a probability of getting certain data given that a hypothesis is true? Don't I want to instead determine the probability of the hypothesis given my data?"

The answer is yes, we would like a method to determine the likelihood of our hypothesis being true given our data, but we use the Null Hypothesis Significance Test approach since it is relatively straightforward, and has wide acceptance historically and across disciplines.

In practice we do use the results of the statistical tests to reach conclusions about the null hypothesis.

Technically, the p-value says nothing about the alternative hypothesis. But logically, if the null hypothesis is rejected, then its logical complement, the alternative hypothesis, is supported. Practically, this is how we handle significant p-values, though this practical approach generates disapproval in some theoretical circles.

Statistics is like a jury?

Note the language used when testing the null hypothesis. Based on the results of our statistical tests, we either reject the null hypothesis, or fail to reject the null hypothesis.

This is somewhat similar to the approach of a jury in a trial. The jury either finds sufficient evidence to declare someone guilty, or fails to find sufficient evidence to declare someone guilty.

Failing to convict someone isn't necessarily the same as declaring someone innocent. Likewise, if we fail to reject the null hypothesis, we shouldn't assume that the null hypothesis is true. It may be that we didn't have sufficient samples to get a result that would have allowed us to reject the null hypothesis, or maybe there are some other factors affecting the results that we didn't account for. This is similar to an "innocent until proven guilty" stance.

Errors in inference

For the most part, the statistical tests we use are based on probability, and our data could always be the result of chance. Considering the coin flipping example above, if we did flip a coin 100 times and came up with 95 heads, we would be compelled to conclude that the coin was not fair. But 95 heads could happen with a fair coin strictly by chance.

We can, therefore, make two kinds of errors in testing the null hypothesis:

• A Type I error occurs when the null hypothesis really is true, but based on our decision rule we reject the null hypothesis. In this case, our result is a false positive; we think there is an effect (unfair coin, association between variables, difference among groups) when really there isn't. The probability of making this kind of error is alpha, the same alpha we used in our decision rule.

• A Type II error occurs when the null hypothesis is really false, but based on our decision rule we fail to reject the null hypothesis. In this case, our result is a false negative; we have failed to find an effect that really does exist. The probability of making this kind of error is called beta.

The following table summarizes these errors.


                                        Reality
Decision of Test          Null is true            Null is false
________________________________________________________________________
Reject null hypothesis    Type I error            Correctly reject null
                          (prob. = alpha)         (prob. = 1 – beta)

Retain null hypothesis    Correctly retain null   Type II error
                          (prob. = 1 – alpha)     (prob. = beta)

Statistical power

The statistical power of a test is a measure of the ability of the test to detect a real effect. It is related to the effect size, the sample size, and our chosen alpha level.

The effect size is a measure of how unfair a coin is, how strong the association is between two variables, or how large the difference is among groups. As the effect size increases, or as the number of observations we collect increases, or as the alpha level increases, the power of the test increases.

Statistical power in the table above is indicated by 1 – beta, and power is the probability of correctly rejecting the null hypothesis.

An example should make these relationships clear. Imagine we are sampling a large group of 7th grade students for their height. That is, the group is the population, and we are sampling a sub-set of these students. In reality, for students in the population, the girls are taller than the boys, but the difference is small (that is, the effect size is small), and there is a lot of variability in students' heights. You can imagine that in order to detect the difference between girls and boys we would have to measure many students. If we fail to sample enough students, we might make a Type II error. That is, we might fail to detect the actual difference in heights between sexes.

If we had a different experiment with a larger effect size—for example the weight difference between mature hamsters and mature hedgehogs—we might need fewer samples to detect the difference.

Note also that our chosen alpha plays a role in the power of our test, too. All things being equal, across many tests, if we decrease our alpha, that is, insist on a lower rate of Type I errors, we are more likely to commit a Type II error, and so have a lower power. This is analogous to a case of a meticulous jury that has a very high standard of proof to convict someone. In this case, the likelihood of a false conviction is low, but the likelihood of letting a guilty person go free is relatively high.
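
These relationships can be explored with the power.t.test function in base R. A minimal sketch with made-up numbers loosely based on the height example—a true difference of 1 cm, a standard deviation of 7 cm, an alpha of 0.05, and a desired power of 0.80:

power.t.test(delta     = 1,       ### hypothetical true difference between groups
             sd        = 7,       ### hypothetical standard deviation
             sig.level = 0.05,    ### alpha
             power     = 0.80)    ### desired power; the output reports the n needed per group

Increasing delta (a larger effect) or decreasing sd lowers the required sample size, while decreasing sig.level (a stricter alpha) raises it.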

The 0.05 alpha value is not dogma

The level of alpha is traditionally set at 0.05 in some disciplines, though there is sometimes reason to choose a different value.

One situation in which the alpha level is increased is in preliminary studies, in which it is better to include potentially significant effects even if there is not strong evidence for keeping them. In this case, the researcher is accepting an inflated chance of Type I errors in order to decrease the chance of Type II errors.

Imagine an experiment in which you wanted to see if various environmental treatments would improve student learning. In a preliminary study, you might have many treatments, with few observations each, and you want to retain any potentially successful treatments for future study. For example, you might try playing classical music, improved lighting, complimenting students, and so on, and see if there is any effect on student learning. You might relax your alpha value to 0.10 or 0.15 in the preliminary study to see what treatments to include in future studies.

On the other hand, in situations where a Type I, false positive, error might be costly in terms of money or people's health, a lower alpha can be used, perhaps 0.01 or 0.001. You can imagine a case in which there is an established treatment for cancer, and a new treatment is being tested. Because the new treatment is likely to be expensive and to hold people's lives in the balance, a researcher would want to be very sure that the new treatment is more effective than the established treatment. In reality, the researchers would not just lower the alpha level, but also look at the effect size, submit the research for peer review, replicate the study, be sure there were no problems with the design of the study or the data collection, and weigh the practical implications.

The 0.05 alpha value is almost dogma

In theory, as a researcher, you would determine the alpha level you feel is appropriate. That is, the probability of making a Type I error when the null hypothesis is in fact true.

In reality, though, 0.05 is almost always used in most fields for readers of this book. Choosing a different alpha value will rarely go without question. It is best to keep with the 0.05 level unless you have good justification for another value, or are in a discipline where other values are routinely used.

Practical advice

One good practice is to report actual p-values from analyses. It is fine to also simply say, e.g., "The dependent variable was significantly correlated with variable A (p < 0.05)." But I prefer when possible to say, "The dependent variable was significantly correlated with variable A (p = 0.026)."

It is probably best to avoid using terms like "marginally significant" or "borderline significant" for p-values less than 0.10 but greater than 0.05, though you might encounter similar phrases. It is better to simply report the p-values of tests or effects in a straightforward manner. If you had cause to include certain model effects or results from other tests, they can be reported as, e.g., "Variables correlated with the dependent variable with p < 0.15 were A, B, and C."

Is the null hypothesis ever really true?

Considering some of the examples presented, it may have occurred to the reader to ask if the null hypothesis is ever really true. For example, in some population of 7th graders, if we could measure everyone in the population to a high degree of precision, then there must be some difference in height between girls and boys. This is an important limitation of null hypothesis significance testing. Often, if we have many observations, even small effects will be reported as significant. This is one reason why it is important to not rely too heavily on p-values, but to also look at the size of the effect and practical considerations. In this example, if we sampled many students and the difference in heights was 0.5 cm, even if significant, we might decide that this effect is too small to be of practical importance, especially relative to an average height of 150 cm. (Here, the difference would be 0.3% of the average height.)
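
As a quick illustration, a sketch with simulated data (the means, standard deviation, and sample sizes are all made up): with 10,000 students per group, a 0.5 cm difference in mean heights will typically be flagged as significant even though it is practically trivial.

set.seed(1)
Girls.sim = rnorm(10000, mean = 150.5, sd = 7)     ### simulated heights in cm
Boys.sim  = rnorm(10000, mean = 150.0, sd = 7)

t.test(Girls.sim, Boys.sim)$p.value                ### likely a very small p-value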

Effect sizes and practical importance

Practical importance and statistical significance

It is important to remember to not let p-values be the only guide for drawing conclusions. It is equally important to look at the size of the effects you are measuring, as well as take into account other practical considerations like the costs of choosing a certain path of action.

For example, imagine we want to compare the SAT scores of two SAT preparation classes with a t-test.

Class.A = c(1500, 1505, 1505, 1510, 1510, 1510, 1515, 1515, 1520, 1520)

Class.B = c(1510, 1515, 1515, 1520, 1520, 1520, 1525, 1525, 1530, 1530)

t.test(Class.A, Class.B)

Welch Two Sample t-test

t = -3.3968, df = 18, p-value = 0.003214

mean of x mean of y
1511 1521

The p-value is reported as 0.003, so we would consider there to be a significant difference between the two classes (p < 0.05).

But we have to ask ourselves the practical question: is a difference of 10 points on the SAT large enough for us to care about? What if enrolling in one class costs significantly more than the other class? Is it worth the extra money for a difference of 10 points on average?

Sizes of effects

It should be remembered that p-values do not indicate the size of the effect being studied. It shouldn't be assumed that a small p-value indicates a large difference between groups, or vice-versa.

For example, in the SAT example above, the p-value is fairly small, but the size of the effect (difference between classes) in this case is relatively small (10 points, especially small relative to the range of scores students receive on the SAT).

Conversely, the size of the effect could be relatively large, but if there is a lot of variability in the data or the sample size is not large enough, the p-value could be relatively large.

In this example, the SAT scores differ by 100 points between classes, but because the variability is greater than in the previous example, the p-value is not significant.

Class.C = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500)
Class.D = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600)

t.test(Class.C, Class.D)

Welch Two Sample t-test

t = -1.4174, df = 18, p-value = 0.1735

mean of x mean of y
1290 1390

boxplot(cbind(Class.C, Class.D))



p-values and sample sizes

It should also be remembered that p-values are affected by sample size. For a given effect size and variability in the data, as the sample size increases, the p-value is likely to decrease. For large data sets, small effects can result in significant p-values.

As an example, let's take the data from Class.C and Class.D and double the number of observations for each without changing the distribution of the values in each, and rename them Class.E and Class.F.

Class.E = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500,
            1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500)
Class.F = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600,
            1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600)

t.test(Class.E, Class.F)

Welch Two Sample t-test

t = -2.0594, df = 38, p-value = 0.04636

mean of x mean of y
1290 1390

boxplot(cbind(Class.E, Class.F))

Notice that the p-value is lower for the t-test for Class.E and Class.F than it was for Class.C and Class.D. Also notice that the means reported in the output are the same, and the box plots would look the same.

Effect size statistics

One way to account for the effect of sample size on our statistical tests is to consider effect size statistics. These statistics reflect the size of the effect in a standardized way, and are unaffected by sample size.

An appropriate effect size statistic for a t-test is Cohen's d. It takes the difference in means between the two groups and divides by the pooled standard deviation of the groups. Cohen's d equals zero if the means are the same, and increases to infinity as the difference in means increases relative to the standard deviation.

In the following, note that Cohen's d is not affected by the sample size difference in the Class.C / Class.D and the Class.E / Class.F examples.

library(lsr)

cohensD(Class.C, Class.D,
method = "raw")

[1] 0.668

cohensD(Class.E, Class.F,
method = "raw")

[1] 0.668
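
As a rough check on what the statistic is doing, Cohen's d can also be computed by hand as the difference in means divided by a pooled standard deviation. The sketch below uses standard deviations computed with a denominator of n rather than n – 1, which, as I understand it, is what the "raw" method does; other methods pool slightly differently.

SD.n = function(x){sqrt(mean((x - mean(x))^2))}      ### standard deviation with denominator n

abs(mean(Class.C) - mean(Class.D)) /
   sqrt((SD.n(Class.C)^2 + SD.n(Class.D)^2) / 2)     ### should agree with cohensD above, about 0.668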

Effect size statistics are standardized so that they are not affected by the units of measurements of the data. This makes them interpretable across different situations, or if the reader is not familiar with the units of measurement in the original data. A Cohen's d of 1 suggests that the two means differ by one pooled standard deviation. A Cohen's d of 0.5 suggests that the two means differ by one-half the pooled standard deviation.

For example, if we create new variables—Class.G and Class.H—that are the SAT scores from the previous example expressed as a proportion of a 1600 score, Cohen's d will be the same as in the previous example.

Class.G = Class.E / 1600
Class.H = Class.F / 1600

Class.G
Class.H

cohensD(Class.G, Class.H,
method="raw")

[1] 0.668

Good practices for statistical analyses

Statistics is not like a trial

When analyzing data, the analyst should not approach the task as would a lawyer for the prosecution. That is, the analyst should not be searching for significant effects and tests, but should instead be like an independent investigator using lines of evidence to find out what is most likely to be true given the data, graphical analysis, and statistical analysis available.

The problem of multiple p-values

One concept that will be important in the following discussion is that when there are multiple tests producing multiple p-values, there is an inflation of the Type I error rate. That is, there is a higher chance of making false-positive errors.

This simply follows mathematically from the definition of alpha. If we allow a probability of 0.05, or 5% chance, of making a Type I error for any one test, then as we do more and more tests, the chance that at least one of them yields a false positive becomes greater and greater.
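
For example, assuming the tests are independent, the probability of at least one false positive across k tests, each at alpha = 0.05, is 1 – (1 – 0.05)^k:

alpha = 0.05
k     = c(1, 5, 10, 20)
1 - (1 - alpha)^k      ### roughly 0.05, 0.23, 0.40, 0.64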

p-value adjustment

One way we deal with the problem of multiple p-values in statistical analyses is to adjust p-values when we do a series of tests together (for example, if we are comparing the means of multiple groups).

Don’t use Bonferroni adjustments

There are various p-value adjustments available in R. In some cases, we will use FDR, which stands for false discovery rate, and in R is an alias for the Benjamini and Hochberg method. There are also cases in which we'll use Tukey range adjustment to correct for the family-wise error rate.

Unfortunately, students in analysis of experiments courses often learn to use Bonferroni adjustment for p-values. This method is simple to do with hand calculations, but is excessively conservative in most situations, and, in my opinion, antiquated.

There are other p-value adjustment methods, and the choice of which one to use is dictated either by which are common in your field of study, or by doing enough reading to understand which are statistically most appropriate for your application.
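
A minimal sketch of adjusting a set of hypothetical p-values with the p.adjust function in base R, comparing the FDR (Benjamini–Hochberg) adjustment with the Bonferroni adjustment:

P.values = c(0.005, 0.010, 0.030, 0.040, 0.200)    ### hypothetical p-values from five tests

p.adjust(P.values, method = "fdr")                 ### false discovery rate adjustment

p.adjust(P.values, method = "bonferroni")          ### Bonferroni; at least as large as the FDR values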

Preplanned tests

The statistical tests covered in this book assume that tests are preplanned for their p-values to be accurate. That is, in theory, you set out an experiment, collect the data as planned, and then say "I'm going to analyze it with this kind of model and do these post-hoc tests afterwards", report these results, and that's all you would do.

Some authors emphasize this idea of preplanned tests. In contrast is an exploratory data analysis approach that relies upon examining the data with plots and using simple tests like correlation tests to suggest what statistical analysis makes sense.

If an experiment is set out in a specific design, then usually it is appropriate to use the analysis suggested by this design.

p-value hacking

It is important, when approaching data from an exploratory approach, to avoid committing p-value hacking. Imagine the case in which the researcher collects many different measurements across a range of subjects. The researcher might be tempted to simply try different tests and models to relate one variable to another, for all the variables. He might continue to do this until he found a test with a significant p-value.

But this would be a form of p-value hacking.

Because an alpha value of 0.05 allows us to make a false-positive error five percent of the time, finding one p-value below 0.05 after several successive tests may simply be due to chance.

Some forms of p-value hacking are more egregious. For example, one might collect some data, run a test, and then continue to collect data and run tests iteratively until a significant p-value is found.

Publication bias

A related issue in science is that there is a bias to publish, or to report, only significant results. This can also lead to an inflation of the false-positive rate. As a hypothetical example, imagine if there are currently 20 similar studies being conducted testing a similar effect—let's say the effect of glucosamine supplements on joint pain. If 19 of those studies found no effect and so were discarded, but one study found an effect using an alpha of 0.05, and was published, is this really any support that glucosamine supplements decrease joint pain?

Clarification of terms and reporting on assignments

"Statistically significant"

In the context of this book, the term "significant" means "statistically significant".

Whenever the decision rule finds that p < alpha, the difference in groups, the association, or the correlation under consideration is then considered "statistically significant" or "significant".

No effect size or practical considerations enter into determining whether an effect is "significant" or not. The only exception is that test assumptions and requirements for appropriate data must also be met in order for the p-value to be valid.

What you need to consider:

• The null hypothesis

• p, alpha, and the decision rule

• Your result. That is, whether the difference in groups, the association, or the correlation is significant or not.

What you should report on your assignments:

• The p-value

• The conclusion, e.g. "There was a significant difference in the mean heights of boys and girls in the class." It is best to preface this with the "reject" or "fail to reject" language concerning your decision about the null hypothesis.

“Size of the effect” / “effect size”

In the context of this book, I use the term "size of the effect" to suggest the use of summary statistics to indicate how large an effect is. This may be, for example, the difference in two medians. I try to reserve the term "effect size" to refer to the use of effect size statistics. This distinction isn't necessarily common.

Usually you will consider an effect in relation to the magnitude of measurements. That is, you might look at the difference in medians as a percent of the median of one group or of the global median. Or, you might look at the difference in medians in relation to the range of answers. For example, a one-point difference on a 5-point Likert item. Counts might be expressed as proportions of totals or subsets.
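
As a small sketch with made-up medians, expressing a difference in medians as a percentage of one group's median:

Median.A = 40                                ### hypothetical median for group A
Median.B = 44                                ### hypothetical median for group B

(Median.B - Median.A) / Median.A * 100       ### difference as a percent of group A's median: 10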

What you should report on assignments:

• The size of the effect. That is, the difference in medians or means, the difference in counts, or the proportions of counts among groups.

• Where appropriate, the size of the effect expressed as a percentage or proportion.

• If there is an effect size statistic—such as r, epsilon-squared, phi, Cramér's V, or Cohen's d—report this and its interpretation (small, medium, large), and incorporate this into your conclusion.

"Practical" / "Practical importance"

If there is a significant result, the question of practical importance asks if the difference or association is large enough to matter in the real world.

If there is no significant result, the question of practical importance asks if the difference or association is large enough to warrant another look, for example by running another test with a larger sample size or one that controls variability in observations better.

What you should report on assignments:

• Your conclusion as to whether this effect is large enough to be important in the real world.

• The context, explanation, or support to justify your conclusion.

• In some cases you might include considerations that aren't included in the data presented. Examples might include the cost of one treatment over another, including time investment, or whether there is a large risk in selecting one treatment over another (e.g., if people's lives are on the line).

A few xkcd comics

Significant

xkcd.com/882/

Null hypothesis

xkcd.com/892/

P-values

xkcd.com/1478/

Experiments, sampling, and causation

Types of experimental designs

Experimental designs

A true experimental design assigns treatments in a systematic manner. The experimenter must be able to manipulate the experimental treatments and assign them to subjects. Since treatments are randomly assigned to subjects, a causal inference can be made for significant results. That is, we can say that the variation in the dependent variable is caused by the variation in the independent variable.

For interval/ratio data, traditional experimental designs can be analyzed with specific parametric models, assuming other model assumptions are met. These traditional experimental designs include:

• Completely random design

• Randomized complete block design

• Factorial

• Split-plot

• Latin square

Quasi-experiment designs

Often a researcher cannot assign treatments to individual experimental units, but can assign treatments to groups. For example, if students are in a specific grade or class, it would not be practical to randomly assign students to grades or classes. But different classes could receive different treatments (such as different curricula). Causality can be inferred cautiously if treatments are randomly assigned and there is some understanding of the factors that affect the outcome.

Observational studies

In observational studies, the independent variables are not manipulated, and no treatments are assigned. Surveys are often like this, as are studies of natural systems without experimental manipulation. Statistical analysis can reveal the relationships among variables, but causality cannot be inferred. This is because there may be other unstudied variables that affect the measured variables in the study.

Sampling

Good sampling practices are critical for producing good data. In general, samples need to be collected in a random fashion so that bias is avoided.

In survey data, bias is often introduced by a self-selection bias. For example, internet or telephone surveys include only those who respond to these requests. Might there be some relevant difference in the variables of interest between those who respond to such requests and the general population being surveyed? Or bias could be introduced by the researcher selecting some subset of potential subjects, for example only surveying a 4-H program with particularly cooperative students and ignoring other clubs. This is sometimes called "convenience sampling".

In election forecasting, good pollsters need to account for selection bias and other biases in the survey process. For example, if a survey is done by landline telephone, those being surveyed are more likely to be older than the general population of voters, and so likely to have a bias in their voting patterns.

Plan ahead and be consistent

It is sometimes necessary to change experimental conditions during the course of an experiment. Equipment might fail, or unusual weather may prevent making meaningful measurements.

But in general, it is much better to plan ahead and be consistent with measurements.

Consistency

People sometimes have the tendency to change measurement frequency or experimental treatments during the course of a study. This inevitably causes headaches in trying to analyze data, and makes writing up the results messy. Try to avoid this.

Controls and checks

If you are testing an experimental treatment, include a check treatment that almost certainly will have an effect and a control treatment that almost certainly won't. A control treatment will receive no treatment and a check treatment will receive a treatment known to be successful. In an educational setting, perhaps a control group receives no instruction on the topic but on another topic, and the check group will receive standard instruction.

Including checks and controls helps with the analysis in a practical sense, since they serve as standard treatments against which to compare the experimental treatments. In the case where the experimental treatments have similar effects, controls and checks allow you to say, for example, "Means for all experimental treatments were similar, but were higher than the mean for the control, and lower than the mean for the check treatment."

Include alternate measurements

It often happens that measuring equipment fails or that a certain measurement doesn't produce the expected results. It is therefore helpful to include measurements of several variables that can capture the potential effects. Perhaps test scores of students won't show an effect, but a self-assessment question on how much students learned will.

Include covariates

Including additional independent variables that might affect the dependent variable is often helpful in an analysis. In an educational setting, you might assess student age, grade, school, town, background level in the subject, or how well they are feeling that day.

The effects of covariates on the dependent variable may be of interest in themselves. But also, including covariates in an analysis can better model the data, sometimes making treatment effects more clear or making a model better meet model assumptions.

Optional discussion: Alternative methods to the Null Hypothesis Significance Test

The NHST controversy

Particularly in the fields of psychology and education, there has been much criticism of the null hypothesis significance test approach. From my reading, the main complaints against NHST tend to be:

• Students and researchers don't really understand the meaning of p-values.

• p-values don't include important information like confidence intervals or parameter estimates.

• p-values have properties that may be misleading, for example that they do not represent effect size, and that they change with sample size.

• We often treat an alpha of 0.05 as a magical cutoffvalue.

Personally, I don't find these to be very convincing arguments against the NHST approach.

The first complaint is in some sense pedantic: like so many things, students and researchers learn the definition of p-values at some point and then eventually forget. This doesn't seem to impact the usefulness of the approach.

The second point has weight only if researchers use only p-values to draw conclusions from statistical tests. As this book points out, one should always consider the size of the effects and practical considerations of the effects, as well as present findings in table or graphical form, including confidence intervals or measures of dispersion. There is no reason why parameter estimates, goodness-of-fit statistics, and confidence intervals can't be included when a NHST approach is followed.

The properties in the third point also don't count much as criticism if one is using p-values correctly. One should understand that it is possible to have a small effect size and a small p-value, and vice-versa. This is not a problem, because p-values and effect sizes are two different concepts. We shouldn't expect them to be the same. The fact that p-values change with sample size is also in no way problematic to me. It makes sense that when there is a small effect size or a lot of variability in the data that we need many samples to conclude the effect is likely to be real.

(One case where I think the considerations in the preceding point are commonly problematic is when people use statistical tests to check for the normality or homogeneity of data or model residuals. As sample size increases, these tests are better able to detect small deviations from normality or homoscedasticity. Too many people use them and think their model is inappropriate because the test can detect a small effect size, that is, a small deviation from normality or homoscedasticity.)

The fourth point is a good one. It doesn't make much sense to come to one conclusion if our p-value is 0.049 and the opposite conclusion if our p-value is 0.051. But I think this can be ameliorated by reporting the actual p-values from analyses, and relying less on p-values to evaluate results.

Overall it seems to me that these complaints condemn poor practices that the authors observe: not reporting the size of effects in some manner; not including confidence intervals or measures of dispersion; basing conclusions solely on p-values; and not including important results like parameter estimates and goodness-of-fit statistics.

Alternatives to the NHST approach

Estimates and confidence intervals

One approach to determining statistical significance is to use estimates and confidence intervals. Estimates could be statistics like means, medians, proportions, or other calculated statistics. This approach can be very straightforward, easy for readers to understand, and easy to present clearly.
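
For example, most test functions in R report a confidence interval alongside the p-value; a sketch using the SAT preparation classes from earlier in this chapter:

t.test(Class.A, Class.B)$conf.int      ### 95% confidence interval for the difference in means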

Bayesian approach

The most popular competitor to the NHST approach is Bayesian inference. Bayesian inference has the advantage of calculating the probability of the hypothesis given the data, which is what we thought we should be doing in the "Wait, does this make any sense?" section above. Essentially it takes prior knowledge about the distribution of the parameters of interest for a population and adds the information from the measured data to reassess some hypothesis related to the parameters of interest. If the reader will excuse the vagueness of this description, it makes intuitive sense. We start with what we suspect to be the case, and then use new data to assess our hypothesis.

One disadvantage of the Bayesian approach is that it is not obvious in most cases what could be used for legitimate prior information. A second disadvantage is that conducting Bayesian analysis is not as straightforward as the tests presented in this book.

References and further reading

[Video] "Understanding statistical inference" from Statistics Learning Center (Dr. Nic). 2015. www.youtube.com/watch?v=tFRXsngz4UQ.

[Video] "Hypothesis tests, p-value" from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=0zZYBALbZgg.

[Video] "Understanding the p-value" from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=eyknGvncKLw.

[Video] "Important statistical concepts: significance, strength, association, causation" from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=FG7xnWmZlPE.

"Understanding statistical inference" from Dr. Nic. 2015. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/understanding-statistical-inference/.

"Basic concepts of hypothesis testing" in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/hypothesistesting.html.

"Hypothesis testing", section 4.3, in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics, 2nd ed. www.openintro.org/.

"Hypothesis Testing with One Sample", sections 9.1–9.2, in OpenStax. 2013. Introductory Statistics. openstax.org/textbooks/introductory-statistics.

"Proving causation" from Dr. Nic. 2013. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/proving-causation/.

[Video] "Variation and Sampling Error" from Statistics Learning Center (Dr. Nic). 2014. www.youtube.com/watch?v=y3A0lUkpAko.

[Video] "Sampling: Simple Random, Convenience, systematic, cluster, stratified" from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=be9e-Q-jC-0.

"Confounding variables" in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/confounding.html.

"Overview of data collection principles", section 1.3, in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics, 2nd ed. www.openintro.org/.

"Observational studies and sampling strategies", section 1.4, in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics, 2nd ed. www.openintro.org/.

"Experiments", section 1.5, in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics, 2nd ed. www.openintro.org/.

Exercises F

1. Which of the following pair is the null hypothesis?

A) The number of heads from the coin is not different from the number of tails.

B) The number of heads from the coin is different from the number of tails.

2. Which of the following pair is the null hypothesis?

A) The height of boys is different than the height of girls.

B) The height of boys is not different than the height of girls.

3. Which of the following pair is the null hypothesis?

A) There is an association between classroom and sex. That is, there is a difference in counts of girls and boys between the classes.

B) There is no association between classroom and sex. That is, there is no difference in counts of girls and boys between the classes.

4. We flip a coin 10 times and it lands on heads 7 times. We want to know if the coin is fair.

a. What is the null hypothesis?

b. Looking at the code below, and assuming an alpha of 0.05, what do you decide (use the reject or fail to reject language)?

c. In practical terms, what do you conclude?


binom.test(7, 10, 0.5)

Exact binomial test

number of successes = 7, number of trials = 10, p-value = 0.3438

5. We measure the height of 9 boys and 9 girls in a class, in centimeters. We want to know if one group is taller than the other.

a. What is the null hypothesis?

b. Looking at the code below, and assuming an alpha of 0.05, what do you decide (use the reject or fail to reject language)?

c. In practical terms, what do you conclude? Address the practical importance of the results.

Girls = c(152, 150, 140, 160, 145, 155, 150, 152, 147)

Boys = c(144, 142, 132, 152, 137, 147, 142, 144, 139)

t.test(Girls, Boys)

Welch Two Sample t-test

t = 2.9382, df = 16, p-value = 0.009645

mean of x mean of y
150.1111 142.1111

mean(Boys)

sd(Boys)

quantile(Boys)

mean(Girls)

sd(Girls)

quantile(Girls)

boxplot(cbind(Girls, Boys))

6. We count the number of boys and girls in two classrooms. We are interested to know if there is an association between the classrooms and the number of girls and boys. That is, does the proportion of boys and girls differ statistically across the two classrooms?

a. What is the null hypothesis?

b. Looking at the code below, and assuming an alpha of 0.05, what do you decide (use the reject or fail to reject language)?

c. In practical terms, what do you conclude?


Classroom  Girls  Boys
A          13     7
B          5      15

Input =("
Classroom Girls Boys
A 13 7
B 5 15
")

Matrix = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))

fisher.test(Matrix)

Fisher's Exact Test for Count Data

p-value = 0.02484

Matrix

rowSums(Matrix)

colSums(Matrix)

prop.table(Matrix,
margin=1)

### Proportions for each row

barplot(t(Matrix),
beside = TRUE,
legend = TRUE,
ylim = c(0, 25),
xlab = "Class",
ylab = "Count")

7. Why should you not rely solely on p-values to make a decision in the real world? (You should have at least two reasons.)

8. Create your own example to show the importance of considering the size of the effect. Describe the scenario: what the research question is, and what kind of data were collected. You may make up data and provide real results, or report hypothetical results.

9. Create your own example to show the importance of weighing other practical considerations. Describe the scenario: what the research question is, what kind of data were collected, what statistical results were reached, and what other practical considerations were brought to bear.

10. What is 5e-4 in common decimal notation?
