Consider an operations manager at a manufacturing plant who produces thousands of units of a part, say a screw. It is nearly impossible to measure every one of those thousands of screws and check its deviation. How, then, can he make sure that the screw’s measurement doesn’t shoot above a threshold? Enter our dear friend from statistics called “hypothesis testing”. We’ll see in the latter half how he can use simple measures like the mean and standard deviation of a “sample” to check whether things are under control.
What is hypothesis testing?
Hypothesis testing can sound scary to non-statisticians, but it has had its applications even in the most longstanding areas, like the judicial system.
What happens when a person is presented before the court for a crime? There is basically a hypothesis, a statement or a stand taken by the court. This is called the null hypothesis, and it always describes the default or natural case.
H0: The person is innocent
The opposite of the null hypothesis is the alternate hypothesis, which inverts the null.
Ha: The person is guilty.
The judge decides whether the person is innocent or guilty by hearing arguments. Here is where it gets interesting: you will never hear the judge say “the person is guilty” if he is a statistics man. That is because in statistics there are only 2 possible conclusions.
- Reject the null hypothesis – This means we have enough evidence to say that it is probably wrong to call the person innocent.
If the judge takes this route, you can say that the person is toast. He goes to prison.
- Do not reject the null hypothesis – This means we don’t have enough evidence to say that the null hypothesis is wrong, so we cannot say that the person is not innocent.
What the judge means by that complicated sentence is that the person might be innocent, and the person walks free.
Now once probability is in action, there are bound to be errors. In the above case, there are 2 kinds of error.
Imagine the Type 1 Error: in reality the person is innocent, but the judge rejects the null hypothesis and sends him behind bars. That is punishing an innocent man for a crime he did not commit. Now imagine the reverse, the Type 2 Error: the person is not innocent, but the judge cannot reject the null hypothesis and lets him go scot-free.
Which is more devastating? Of course, the first one, if you stand for justice for the individual. This is the pillar on which the judicial system is built: an innocent individual should not be punished. Minimise Type 1 Error at all costs, even if it means it is not possible to declare Dawood a blasts mastermind, or Zawahiri a terrorist, or even Pablo a drug peddler. But first, we need to capture them and produce them before the court, which looks like a remoter possibility than the sky turning neon green!
Hence the prosecution has to toil very hard to reject the null hypothesis, while the defence (the lawyers for the person in the box) just has to break that possibility by sowing doubts about the prosecution’s evidence. Quite a job to get paid for: sow enough doubt that the judge doesn’t have enough evidence to reject the null hypothesis.
Example of Hypothesis test
Test of a single mean
Consider the operations manager wanting to test whether the mean measurement of the screw stays within 170 mm. That could be worded as a hypothesis as
H0: Mean, mu <= 170
Ha: Mean, mu > 170
Remember, the null hypothesis always contains the equality (“=”, “<=” or “>=”); it cannot be stated with only a strict inequality symbol like “<” or “>”.
Once we set up the hypothesis, we need to take a sample and then find the test statistic. Only by using the test statistic can we draw conclusions about the hypothesis. The test statistic has a sampling distribution, which tells us whether the sample mean is far away from the hypothesized mean or close to it. If it is far away, then we have sufficient evidence to reject the null; otherwise we fail to reject the null, as the sample mean is close to the hypothesized one.
We talk about the sample, but why sample at all? Because the population is so big that it is impossible to test each case. Note that this test rests on the assumption that the sample mean is approximately normally distributed, which holds when the population is normal or the sample is large.
Computing Test statistic
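Then we can calculate a test statistic called z.
z = (xbar - mu) / (sigma / sqrt(n))
This is just standardizing the value of the sample mean. Suppose a sample of n = 400 screws has a mean xbar of 178, and the standard deviation sigma is 65. Then
z = (178 - 170) / (65 / sqrt(400)) = 8 / 3.25 = 2.46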
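The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name z_statistic and its parameter names are made up for illustration, and the numbers are the ones from the example.

```python
import math

def z_statistic(sample_mean, hypothesized_mean, sigma, n):
    """Standardize the sample mean under the null hypothesis."""
    standard_error = sigma / math.sqrt(n)  # sigma / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Numbers from the screw example: xbar = 178, mu = 170, sigma = 65, n = 400
z = z_statistic(sample_mean=178, hypothesized_mean=170, sigma=65, n=400)
print(round(z, 2))  # 2.46
```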
What can we do with this test statistic, z?
We can compare it with the boundary of the rejection region, a critical value called z alpha. Remember, there should be enough evidence to reject the null hypothesis. How much is enough is very subjective, but in statistics a probability of 5% is commonplace. If you want to enforce stricter rules to reduce Type 1 Error, use 1% or decrease it even further.
Hence z alpha can be read from the normal distribution table for a right-tail probability of 5%, which gives 1.645. This is the truth and the only truth, which you can memorise for easy calculations. Better do it for 1% from the table as well, which gives 2.326.
z vs z alpha: 2.46 vs 1.645
z > z alpha, what does it signify?
Let us put it in a graph and see.
Since 2.46 lies to the right of 1.645 and the null hypothesis is mean <= 170, the test statistic falls in the rejection region and we can reject the null hypothesis.
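Instead of a printed table, the critical value can be looked up in code. The sketch below uses NormalDist from Python's standard statistics module; the helper name critical_value is made up for illustration, and the test is assumed to be right-tailed as in the example.

```python
from statistics import NormalDist

def critical_value(alpha):
    """Right-tail critical value z_alpha of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha)

z = 2.46  # test statistic from the screw example

z_alpha_5 = critical_value(0.05)
z_alpha_1 = critical_value(0.01)
print(round(z_alpha_5, 3), round(z_alpha_1, 3))  # 1.645 2.326

# Reject the null hypothesis when z lands to the right of the critical value
reject_at_5 = z > z_alpha_5
reject_at_1 = z > z_alpha_1
print(reject_at_5, reject_at_1)  # True True
```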
Is it so simple? No, that is where you encounter the p-value. z is just a test statistic.
What is a p-value?
The p-value of a test is the probability of observing a test statistic at least as extreme as the computed one, given that the null hypothesis is true.
Just memorise the axiom below, for God’s sake; you don’t need anything else to conclude. Close your eyes and conclude like a true statistical practitioner. Your words will never be rejected even if you reject the null hypothesis.
When p-value < 0.05, reject the null hypothesis.
When p-value >= 0.05, do not reject the null hypothesis.
For our example the p-value would be
p-value = P(xbar > 178, given mu = 170) = P(z > 2.46) = 1 - P(z < 2.46) = 1 - 0.9931 = 0.0069
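The same right-tail probability can be computed without a table. This is a minimal sketch using the standard normal CDF from Python's statistics module; the function name right_tail_p_value is made up for illustration.

```python
from statistics import NormalDist

def right_tail_p_value(z):
    """P(Z > z) for a standard normal Z, i.e. the one-tailed p-value."""
    return 1 - NormalDist().cdf(z)

p_value = right_tail_p_value(2.46)
print(round(p_value, 4))  # 0.0069

alpha = 0.05
print(p_value < alpha)  # True, so the null hypothesis is rejected
```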
How to interpret this?
If the null hypothesis is true, the probability of observing a sample mean at least as extreme as 178 is 0.0069. This is very small, well below 0.05. Hence reject the null hypothesis.
Is .0069 small or significant enough? That is where statistics has defined different ranges for describing the strength of evidence and levels of significance.
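One way to read a level of significance such as 5% is as the long-run Type 1 Error rate: if the null hypothesis is really true and we repeat the test on many samples, we should wrongly reject it about 5% of the time. The simulation below is a rough sketch of that idea, not a proof. It assumes the population is normal with mean 170 and standard deviation 65; the function name, the number of trials and the seed are arbitrary choices made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def type_one_error_rate(mu0=170, sigma=65, n=400, alpha=0.05,
                        trials=2000, seed=0):
    """Fraction of samples drawn under a true null that get rejected."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    standard_error = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where the null is exactly true
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / standard_error
        if z > z_alpha:
            rejections += 1  # a Type 1 Error: rejecting a true null
    return rejections / trials

rate = type_one_error_rate()
print(rate)  # expected to land close to 0.05
```

Lowering alpha to 0.01 in the call should bring the rate down to about 1%, which is the stricter rule mentioned earlier.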
Variations of the test
What we saw above is a 1-tailed test, where we declared the null hypothesis with the mean less than or equal to 170 and the alternate hypothesis with the mean greater than 170. That means the rejection region is on the right side.
Sometimes we need to test a strict condition, where we declare the null with the mean equal to 170 and the alternate with the mean not equal to 170. It could be greater or lesser than 170. In that case, the rejection region is on either side of 170. This is called a 2-tailed test.
1-tailed test:
H0: mu <= 170
Ha: mu > 170
2-tailed test:
H0: mu = 170
Ha: mu != 170
So what happens to the p-value in a 2-tailed test? The alpha, or level of significance, is split between the two tails, so each tail gets .025 instead of .05. Equivalently, the p-value adds up the probability in both tails.
p-value = P(z > 2.46) + P(z < -2.46) = 0.0069 + 0.0069 = 0.0138
Still less than 0.05 and hence reject the null hypothesis.
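A 2-tailed p-value for z = 2.46 can be sketched by doubling the tail probability, since the standard normal distribution is symmetric. The function name two_tailed_p_value is made up for illustration.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """P(|Z| > |z|) for a standard normal Z: both tails added together."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_two_tailed = two_tailed_p_value(2.46)
# About 0.0139 at full precision; doubling the rounded 0.0069 gives 0.0138
print(round(p_two_tailed, 4))

print(p_two_tailed < 0.05)  # True, so the null hypothesis is still rejected
```

Taking abs(z) means the same function works whether the sample mean falls above or below the hypothesized mean.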
What are the types?
There are different types of hypothesis tests. What we looked at is testing against a population mean via a single-mean z-test.
We can test for:
- Differences of means between 2 different populations via the independent t-test
- Differences between paired samples, typically before-and-after measurements of a single population, via the paired t-test
- Differences in means for 3 or more independent populations via 1-way ANOVA (Analysis of Variance)
- The association between 2 categorical variables via the chi-square test
Every test calculates the Holy Grail metric called p-value, and remember the axiom always!
Applications in Business
Hypothesis testing is used in many domains such as healthcare, insurance and manufacturing, and predominantly across functions like operations and marketing. Whenever there is a big population and we cannot check case by case, samples are taken and hypothesis testing is performed. For example:
- In the case of healthcare, performing a blood test and inferring the presence of a microorganism;
- In the case of product development, measuring web traffic to conduct A/B testing and decide on the best features;
- In the case of manufacturing, using the mean measurement computed from a sample of shipped products and concluding whether the batch is defective;
- In the case of retail marketing, comparing the performance of a sample campaign and concluding whether the campaign’s performance has improved.
Magesh is a data science professional with close to a decade of experience in the Analytics and Retail domain. He has a masters in management from IIM Calcutta. He has been a self-starter throughout his career, solving problems in ambiguous situations.