Applied Statistics | 𝚃𝚛𝚊𝚗𝚜𝚙𝚘𝚗𝚜𝚝𝚎𝚛

Testing Alcohol level

Wed, 23 May 2018 21:13:14 -0500

Is there really 5.4% alcohol in that beer brand?

Many beer brands print an alcohol level of 5.4% on their labels. Let’s say we collected the alcohol by volume for one of those brands: we sampled beers at random and measured the alcohol level ourselves.

So we believe the actual alcohol content should be 5.4%, but as beer consumers we sometimes feel it is not.

If we measured one beer and found 6.7%, we would immediately complain that the brand is lying to us about the 5.4%. The brand may argue that our measuring apparatus or technique is not 100% accurate. There is no way of finding out how inaccurate our measurement is without repeating it or measuring multiple beers. It might be that our measurement is 100% accurate and the beer has more alcohol than the company says. We don’t really know. Also, we can’t measure every single beer they have ever manufactured. This is the perfect time to put our statistical sense to work. Below is a list of measurements from randomly bought beers, some from midtown, some from Walmart. Let’s do a t-test.

level = c(5.1,5.2,6,7,5.01,5.0,6.5,5.6,5.2,6.1,6.2,5.0)
t.test(level, mu = 5.4)

## 
##  One Sample t-test
## 
## data:  level
## t = 1.3139, df = 11, p-value = 0.2156
## alternative hypothesis: true mean is not equal to 5.4
## 95 percent confidence interval:
##  5.225010 6.093323
## sample estimates:
## mean of x 
##  5.659167

The p-value (0.2156) is greater than 0.05, and the 95% confidence interval is [5.23, 6.09]. This means that if 100 people repeated this random sampling of beers and each calculated a confidence interval, about 95 of those intervals would contain the true mean. Our interval contains 5.4, so we cannot reject the brand's claim.
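The t statistic and the interval in the R output can be reproduced by hand. Here is a minimal Python sketch using only the standard library. The helper name `one_sample_t` is made up for illustration, and the critical value 2.201 (two-sided 95%, 11 degrees of freedom) is assumed from a t table because the standard library has no t distribution.

```python
import math
import statistics

def one_sample_t(sample, mu, t_crit):
    """Return (t statistic, CI lower, CI upper) for a one-sample t-test.

    t_crit is the two-sided critical value for len(sample) - 1 degrees
    of freedom; it is passed in rather than computed.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    t_stat = (mean - mu) / se
    return t_stat, mean - t_crit * se, mean + t_crit * se

level = [5.1, 5.2, 6, 7, 5.01, 5.0, 6.5, 5.6, 5.2, 6.1, 6.2, 5.0]
# 2.201 is an assumed table value (df = 11, 95% two-sided).
t_stat, lower, upper = one_sample_t(level, mu=5.4, t_crit=2.201)
# Should agree with the R output up to rounding.
print(round(t_stat, 4), round(lower, 3), round(upper, 3))
```

Since 5.4 sits between `lower` and `upper`, the sketch reaches the same conclusion as `t.test`.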

Enough with the statistical jargon? Okay, let’s enjoy the beer.

Police Data Challenge

Sun, 23 Jul 2017 21:13:14 -0500

Police Data Challenge: Winner Recommendations

February 1, 2018

The Police Data Challenge contest brought talented high school and undergraduate students across the nation to show their passion for the good statistics can do.

With the Police Foundation’s efforts to make the information available, the 70 teams used real crime data sets from Baltimore, Seattle and Cincinnati police departments to analyze the best possible solutions for safer communities.

Check out below how the winning teams analyzed the best way to fight crime through statistics:

Winona State University, Winona, MN: Jimmy Hickey, Kapil Khanal, and Luke Peacock divided the crimes into more detailed categories than the Seattle Police Department data provided. They used the crime types and locations to discover that gun-related crimes are concentrated in specific areas. Their recommendation was to raise public awareness of the times and locations with high crime and to add more police patrols.

Secretary Problem

Sun, 23 Jul 2017 21:13:14 -0500

When to give up? Exploration vs Exploitation

A lot of hard-working students don’t end up being selected for scholarships. I should know, because I lost 3 years trying.

Now I turn to an information-theoretic game to find out when I should have quit the whole process.

Assumption: Your best score will get you the scholarship if you are one of the sufficiently prepared students.

Say the entrance exams are the games. We can all agree they behave like one. If a student is well prepared, as indicated by practice questions and exams, then getting their name on the scholarship list is basically a game of chance. This is not to say that it is impossible, but given that we all have time and money constraints in our lives, when is the right time to quit? Thus, a player in this game is a sufficiently prepared, hard-working student. Everyone else has to become sufficiently prepared before playing this game.

Now that we agree that getting your name on that list is a matter of chance, say you are prepared to give the entrance exam 10 times, at a cost of time and money. The 10 exams can be ranked from your best score to your worst, from 1 to 10. We can agree on one thing: your best score has the highest chance of getting the scholarship [this may not be true for everyone, but our player is smart, hard-working, and well prepared].

Now we give the exams one by one, and the score one gets is random above some cutoff [for me it was 90]. We can all relate to the “fact” that some questions are effectively random, and they determine our fate.

So we don’t know which exam will give us our best score, and it is reasonable to assume that it is random. After each exam, though, we can certainly rank the exams given so far from best to worst.

The optimal strategy is to give the first n/e exams without stopping, and then stop at the first exam whose score beats every score before it.

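import numpy as np
import matplotlib.pyplot as plt


def quit_candidate(n, reject=np.e):
    '''Choose an exam to quit after, from a list of n exams, using
    the optimal strategy. 1 = best time to quit, n = worst time to quit.

    reject: np.e (default) skips the first n/e exams; any other value
    is read as the percentage of exams to skip.'''

    exams = np.arange(1, n+1)
    np.random.shuffle(exams)

    # Number of exams to give before we are allowed to stop
    if reject == np.e:
        stop = int(round(n/reject))
    else:
        stop = int(round(reject*n/100))
    best_from_rejected = np.min(exams[:stop])
    rest = exams[stop:]

    try:
        # First later exam that beats everything seen so far
        return rest[rest < best_from_rejected][0]
    except IndexError:
        # Nothing better came along, so we end on the last exam
        return exams[-1]


# Now let's see if it actually holds, by having 100,000 students give 100 exams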

sim = np.array([quit_candidate(n=100) for i in range(100000)])

plt.figure(figsize=(10, 6))
plt.hist(sim, bins=100)
plt.xticks(np.arange(0, 101, 10))
plt.ylim(0, 40000)
plt.xlabel('Chosen candidate')
plt.ylabel('frequency')
plt.show()

[Figure: histogram of the chosen exam's rank across the 100,000 simulated students]

We see that the single most common outcome is quitting at the prime time [rank 1 is the prime time to quit].

best_candidate = []
for r in range(5, 101, 5):
    sim = np.array([quit_candidate(n=100, reject=r) for i in range(100000)])
    # np.histogram counts frequency of each candidate
    best_candidate.append(np.histogram(sim, bins=100)[0][0]/100000)

plt.figure(figsize=(10, 6))
plt.scatter(range(5, 101, 5), best_candidate)
plt.xlim(0, 100)
plt.xticks(np.arange(0, 101, 10))
plt.ylim(0, 0.4)
plt.xlabel('% of candidates rejected')
plt.ylabel('Probability of choosing best candidate')
plt.grid(True)
plt.axvline(100/np.e, ls='--', c='black')
plt.show()

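[Figure: probability of choosing the best candidate against the % of candidates rejected, with a dashed line at 100/e]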

Hence, the optimal strategy is to give the first 37% of the exams and then quit at the first exam whose score is higher than every score you got before.
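The simulated curve can be cross-checked against the exact result for the classical secretary problem. This is a minimal sketch, assuming the standard formula \(P(r) = \frac{r}{n}\sum_{i=r}^{n-1}\frac{1}{i}\) for the chance of ending on the single best of n items after letting the first r go by. `success_probability` is a hypothetical helper name, not part of the code above.

```python
def success_probability(n, r):
    """Exact chance of ending on the best of n when the first r are skipped
    and the first later item that beats all of them is chosen."""
    if r == 0:
        return 1.0 / n  # no look-around phase: the first item is taken
    return (r / n) * sum(1.0 / i for i in range(r, n))

n = 100
# Try every possible cutoff and keep the one with the highest probability.
best_r = max(range(1, n), key=lambda r: success_probability(n, r))
print(best_r, round(success_probability(n, best_r), 3))  # 37 0.371
```

For 100 exams the best cutoff is 37, which matches the 100/e line in the plot.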

So I was ready to give 8 exams, and my scores were [84, 87, 88, 94, 90, 92].

37% of 8 ≈ 3.

My score was improving after the 3rd exam, so I guess I was right to keep giving exams. But on the 5th exam my score went down, and I guess I should have quit then instead of giving one more exam. I lost another 3 months preparing for that.

:-by Kapil Khanal

Why do you have to wait more for the buses?

Sun, 23 Jul 2017 21:13:14 -0500

Average for group vs Individual

Inspection Paradox

Buses and trains are supposed to arrive at constant intervals, but in practice some intervals are longer than others. This means the buses do not follow the schedule exactly. There is always some randomness. With your luck, you might think you are more likely to arrive during a long interval. It turns out you are right: a random arrival is more likely to fall in a long interval because, well, it’s longer!

Let’s think of a scenario…

Suppose a bus service in your city says a bus passes a station every 10 minutes. If you show up at the station at a random time, you would expect to wait 5 minutes on average. In practice you will wait longer than that: if the buses arrive completely at random (exponentially distributed gaps averaging 10 minutes), the average wait is actually 10 minutes.
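The waiting-time claim can be checked with a few lines of Python. This is a rough sketch, not a model of any real timetable: it assumes the rider arrives at a uniformly random moment, and the exponential gaps in the last part are an assumption standing in for "completely random" buses. `mean_wait` is a name made up for this example.

```python
import random

def mean_wait(gaps):
    """Average wait for a rider who shows up at a uniformly random moment.

    A gap of length g is hit with probability proportional to g, and the
    rider then waits g / 2 on average, which gives sum(g^2) / (2 * sum(g)).
    """
    return sum(g * g for g in gaps) / (2 * sum(gaps))

print(mean_wait([10, 10, 10, 10]))  # perfectly regular buses: 5.0
print(mean_wait([5, 15, 5, 15]))    # same 10-minute average gap: 6.25

# Assumption: fully random arrivals, modelled as exponential gaps
# with a 10-minute mean. The result should land close to 10.
random.seed(0)
random_gaps = [random.expovariate(1 / 10) for _ in range(100_000)]
print(round(mean_wait(random_gaps), 1))
```

The more uneven the gaps, the further the average wait drifts above half the advertised interval.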

Another example of this paradox: most schools report their average class size. But for you as a student, that average is not accurate. Say there are 4 classes of size 75, 13, 12, and 10. Then the average the college will report is \((75 + 13 + 12 + 10)/4 = 27.5\), but for you as a prospective student, the average is different.

You are more likely to be in the room with 75 students, so the average class size you experience is \((75 \times 75 + 13 \times 13 + 12 \times 12 + 10 \times 10)/110 = 54.89\). Hence, the reported average is not for you. This kind of paradox happens everywhere.
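The two averages are easy to compute side by side. A small sketch, with function names chosen only for this example:

```python
def average_per_class(sizes):
    """What the school reports: every class counts once."""
    return sum(sizes) / len(sizes)

def average_per_student(sizes):
    """What students experience: a class of size s is counted s times,
    once for each student sitting in it."""
    return sum(s * s for s in sizes) / sum(sizes)

sizes = [75, 13, 12, 10]
print(average_per_class(sizes))               # 27.5
print(round(average_per_student(sizes), 2))   # 54.89
```

The second function weights each class by its own size, which is exactly why the big class dominates.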

To generalize it in a more abstract way: this is a case where the perspectives of the individual and the group differ. For the group, the average describes what happens, but for an individual that average does not reflect their own experience.

Verifying empirical rule and Chebyshev's theorem

Mon, 01 Jan 0001 00:00:00 +0000

Empirical rule and Chebyshev’s theorem

Let’s talk about a really simple but powerful concept: data distributions. A data distribution is an abstract concept (a function) that gives the possible values of the data and how often each value is generated. When you want to talk about all the data from your experiment at once, talk about its distribution. A data distribution gives us the probability that a given value will be the output if we keep repeating the experiment.

We rarely have the complete dataset from an experiment. So it is powerful to have an idea of how the data is distributed and which values occur more often than others. We can intuitively understand some distributions, like the heights of a population. We know there will be a few very short people and a few very tall ones, but we are sure that most people will be in between. It is really convenient to know the spread and frequency of the data in advance.

Interestingly, there is more than one kind of distribution in the world, so the convenience of knowing the spread of the data in advance is helpful. There is a famous theorem that gives us an idea of how our data is distributed. It’s called Chebyshev’s theorem.

Image credit: LibreTexts

It says that at least 3/4 of our data will lie within two standard deviations of the mean, whatever the distribution.
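Chebyshev's bound for k standard deviations is \(1 - 1/k^2\), which gives the 3/4 for k = 2. Below is a short sketch that compares the bound with what a small skewed dataset actually does. The numbers in `skewed` are made up purely for illustration, and both function names are hypothetical.

```python
import statistics

def chebyshev_bound(k):
    """Minimum share of data within k standard deviations (k > 1)."""
    return 1 - 1 / k**2

def share_within(data, k):
    """Observed share of values within k population SDs of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

# Made-up, heavily skewed values, used only for illustration.
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10, 40]
print(chebyshev_bound(2), share_within(skewed, 2))  # 0.75 0.9
```

Even with one extreme value, the observed share stays above the guaranteed minimum.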

library(tidyverse)
library(knitr)
library(kableExtra)
stock<- read.csv("~/OneDrive - MNSCU/FALL 2019/MathStat/Data/Stock Trade.csv",stringsAsFactors = FALSE)

Now let’s clean the name,

stock<- stock %>% select(percentStock = X..of.Shares.Outstanding)

The empirical rule says that, for roughly normal data, about 68% of the data will be within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
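For an exactly normal distribution, the share within z standard deviations can be computed from the error function. A minimal sketch (the function name is just for this example):

```python
import math

def normal_share_within(z):
    """P(|Z| < z) for a standard normal variable Z."""
    return math.erf(z / math.sqrt(2))

for z in (1, 2, 3):
    print(z, round(normal_share_within(z), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These are the reference values that the proportions from real data get compared against.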

The function below:
1. standardizes the data
2. counts the data within z standard deviations
3. outputs the proportion

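data_within<- function(df, z){
  # z-score: distance from the mean in units of standard deviation
  func_normalize<-function(x){(x-mean(x))/sd(x)}
  # keep only rows with percentStock < 11 (this drops the data point at 11 or above)
  df<-df %>% filter(percentStock<11)
  df_scaled<- df %>% mutate(percentStock_normal = func_normalize(percentStock)) %>% filter(abs(percentStock_normal)<z)
  # share of the remaining rows that lie within z standard deviations
  proportion = dim(df_scaled)[1]/dim(df)[1]
  return (round(proportion,2))
}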

Let’s collect the output in a small tibble.

tb<- tibble(
  first_std_dev = data_within(stock,1),
  second_std_dev = data_within(stock,2),
  third_std_dev = data_within(stock,3),
)

kable(tb)

first_std_dev	second_std_dev	third_std_dev
0.64	0.92	1

We can also test if our function is working correctly,

library(testthat)

## Warning: package 'testthat' was built under R version 3.5.2

normal_generated = tibble(percentStock = rnorm(10,mean = 6.2,sd = 1.2))

#Testing our function
tb_test<- tibble(
  first_std_dev = data_within(normal_generated,1),
  second_std_dev = data_within(normal_generated,2),
  third_std_dev = data_within(normal_generated,3),
)


kable(tb_test)

first_std_dev	second_std_dev	third_std_dev
0.7	1	1

testthat::expect_gt(tb_test$first_std_dev, 0.68,label = "data proportion within first deviation")

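Hence, our function appears to be working correctly. Note that the data is randomly generated every time the code is run, so with only 10 points the proportion changes from run to run and the check can fail on another draw.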
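How often a check like this passes can be estimated by simulation. The following is a rough Python sketch, not a port of the R code: it skips the `percentStock < 11` filter, uses the same mean 6.2 and SD 1.2, and the names `share_within_one_sd` and `pass_rate` are invented for this example.

```python
import random
import statistics

def share_within_one_sd(sample):
    """Share of values strictly within one sample SD of the sample mean."""
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)
    return sum(abs(x - mean) < sd for x in sample) / len(sample)

def pass_rate(n, trials=2000, seed=1):
    """Share of simulated normal samples of size n that clear the 0.68 threshold."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        sample = [rng.gauss(6.2, 1.2) for _ in range(n)]
        if round(share_within_one_sd(sample), 2) > 0.68:
            passed += 1
    return passed / trials

rate = pass_rate(10)
print(rate)
```

With 10 points the proportion can only be a multiple of 0.1, so passing means at least 7 of the 10 values fall within one standard deviation.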