<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>R | 𝚃𝚛𝚊𝚗𝚜𝚙𝚘𝚗𝚜𝚝𝚎𝚛</title>
    <link>https://almostkapil.netlify.com/categories/r/</link>
      <atom:link href="https://almostkapil.netlify.com/categories/r/index.xml" rel="self" type="application/rss+xml" />
    <description>R</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2018 Kapil Khanal</copyright><lastBuildDate>Wed, 24 Jul 2019 21:13:14 -0500</lastBuildDate>
    <image>
      <url>https://almostkapil.netlify.com/img/aph-salt-spring-zoom.jpg</url>
      <title>R</title>
      <link>https://almostkapil.netlify.com/categories/r/</link>
    </image>
    
    <item>
      <title>Sankey diagrams for Bacteria and antibiotics</title>
      <link>https://almostkapil.netlify.com/post/sankey/</link>
      <pubDate>Wed, 24 Jul 2019 21:13:14 -0500</pubDate>
      <guid>https://almostkapil.netlify.com/post/sankey/</guid>
      <description>


&lt;div id=&#34;visually-classifying-bacteria-and-antibiotics&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visually Classifying Bacteria and Antibiotics&lt;/h2&gt;
&lt;p&gt;After World War II, antibiotics earned the moniker “wonder drugs” for quickly treating previously incurable diseases. Data was gathered to determine which drug worked best for each bacterial infection. Comparing drug performance was an enormous aid for practitioners and scientists alike. In the fall of 1951, Will Burtin published a &lt;a href = &#34;https://mbostock.github.io/protovis/ex/antibiotics-burtin.html&#34;&gt;graph&lt;/a&gt; showing the effectiveness of three popular antibiotics on &lt;B&gt;16&lt;/B&gt; different bacteria, measured in terms of minimum inhibitory concentration.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://almostkapil.netlify.com/post/sankey_files/avb.jpg&#34; alt=&#34;Image credit: Ask a biologist&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Image credit: Ask a biologist&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I am reproducing in ggplot2 this &lt;a href = &#34;https://www.dropbox.com/s/68ahri9xnnabce4/Bacteria-sigmoid-howto.docx?dl=0&#34;&gt;wonderful visualization&lt;/a&gt; by my professor, &lt;a href = &#34;http://driftlessdata.space/&#34;&gt;Silas Bergen&lt;/a&gt;, who built the original in Tableau.&lt;/p&gt;
&lt;p&gt;Let’s load the dataset:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(knitr)
library(kableExtra)
df &amp;lt;- read.csv(&amp;quot;https://cdn.rawgit.com/plotly/datasets/5360f5cd/Antibiotics.csv&amp;quot;, stringsAsFactors = FALSE)
#stringsAsFactors is a demon. Better not bring it here! We rarely need that beast.
#There are 16 bacteria, so we give each one an ID to reference later.
df &amp;lt;- df %&amp;gt;% mutate(ID = row_number())&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kable(head(df,n = 16))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Bacteria
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Penicillin
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Streptomycin
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Neomycin
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Gram
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
ID
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.000
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Salmonella schottmuelleri
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.80
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.090
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Proteus vulgaris
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.100
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Klebsiella pneumoniae
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
850.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.000
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Brucella abortus
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.020
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Pseudomonas aeruginosa
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
850.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.400
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
6
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Escherichia coli
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
100.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.40
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.100
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Salmonella (Eberthella) typhosa
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.40
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.008
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Aerobacter aerogenes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
870.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.600
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Brucella antracis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.001
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.01
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.007
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
positive
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Streptococcus fecalis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.000
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.100
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
positive
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Staphylococcus aureus
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.030
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.03
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.001
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
positive
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
12
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Staphylococcus albus
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.007
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.001
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
positive
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
13
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Streptococcus hemolyticus
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.001
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
14.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10.000
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
positive
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
14
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Streptococcus viridans
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.005
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
40.000
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
positive
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
15
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Diplococcus pneumoniae
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.005
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10.000
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
positive
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
16
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Before going further with the data manipulation, we need to think about the format of the visualization. We will build it at the bacterium level: for each bacterium we need its Gram stain and the concentration of each drug required.&lt;/p&gt;
&lt;p&gt;The table above has all the data we need, but not in the shape we want. It holds everything about a bacterium in a single row, whereas we want one row per bacterium and drug.
Let’s change the format of the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;key_value &amp;lt;- df %&amp;gt;% gather(&amp;quot;Drug&amp;quot;, &amp;quot;Concentration&amp;quot;, Penicillin:Neomycin)
kable(head(key_value))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Bacteria
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Gram
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
ID
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Drug
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Concentration
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Penicillin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Salmonella schottmuelleri
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Penicillin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Proteus vulgaris
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Penicillin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Klebsiella pneumoniae
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Penicillin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
850
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Brucella abortus
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Penicillin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Pseudomonas aeruginosa
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
6
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Penicillin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
850
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
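&lt;p&gt;As a side note, &lt;code&gt;gather()&lt;/code&gt; still works, but newer versions of tidyr offer &lt;code&gt;pivot_longer()&lt;/code&gt; for the same reshaping. The sketch below is only an illustration on a two-row toy data frame: &lt;code&gt;toy&lt;/code&gt; and &lt;code&gt;toy_long&lt;/code&gt; are names made up for this example, with values copied from the first table. Note that &lt;code&gt;pivot_longer()&lt;/code&gt; orders the rows by bacterium rather than by drug.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyr)

#Toy wide data (hypothetical name): one row per bacterium, one column per drug
toy &amp;lt;- data.frame(
  Bacteria = c(&amp;quot;Proteus vulgaris&amp;quot;, &amp;quot;Brucella abortus&amp;quot;),
  Penicillin = c(3, 1),
  Streptomycin = c(0.1, 2),
  Neomycin = c(0.1, 0.02),
  stringsAsFactors = FALSE
)

#Long data: one row per bacterium and drug
toy_long &amp;lt;- pivot_longer(toy, cols = Penicillin:Neomycin,
                         names_to = &amp;quot;Drug&amp;quot;, values_to = &amp;quot;Concentration&amp;quot;)
toy_long&lt;/code&gt;&lt;/pre&gt;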
&lt;p&gt;Next, we need the minimum concentration for each bacterium, that is, the lowest of its three drug concentrations, as an extra column. The one thing to keep in mind is that we have to group the gathered table by bacterium and take the minimum within each group. We could have computed this before gathering, but this is how my thought process went.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_min&amp;lt;- key_value  %&amp;gt;% 
  group_by(Bacteria) %&amp;gt;% summarise(Min = min(Concentration))
kable(head(df_min))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Bacteria
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Min
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Aerobacter aerogenes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Brucella abortus
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.020
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Brucella antracis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.001
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Diplococcus pneumoniae
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.005
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Escherichia coli
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.100
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Klebsiella pneumoniae
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
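&lt;p&gt;The same minimum can also be computed before gathering, directly on the wide table. Here is a minimal sketch using base R’s &lt;code&gt;pmin()&lt;/code&gt; on a toy data frame; &lt;code&gt;toy_wide&lt;/code&gt; is a made-up name and its values are copied from the first table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Toy wide data (hypothetical name) with values taken from the first table
toy_wide &amp;lt;- data.frame(
  Bacteria = c(&amp;quot;Aerobacter aerogenes&amp;quot;, &amp;quot;Brucella abortus&amp;quot;),
  Penicillin = c(870, 1),
  Streptomycin = c(1, 2),
  Neomycin = c(1.6, 0.02),
  stringsAsFactors = FALSE
)

#pmin() takes the minimum across the three columns, row by row
toy_wide$Min &amp;lt;- pmin(toy_wide$Penicillin, toy_wide$Streptomycin, toy_wide$Neomycin)
toy_wide[, c(&amp;quot;Bacteria&amp;quot;, &amp;quot;Min&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For these two bacteria the result agrees with the grouped summary above: 1 and 0.02.&lt;/p&gt;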
&lt;p&gt;Now let’s join the &lt;code&gt;df_min&lt;/code&gt; data frame from above with &lt;code&gt;df&lt;/code&gt;, so that each row carries its minimum, and record which drug achieves it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df&amp;lt;- inner_join(df,df_min,by = &amp;quot;Bacteria&amp;quot;)
df&amp;lt;- df %&amp;gt;% mutate(Best = case_when(
  Penicillin == Min~ &amp;quot;Penicillin&amp;quot;,
  Neomycin == Min~ &amp;quot;Neomycin&amp;quot;,
  Streptomycin == Min~ &amp;quot;Streptomycin&amp;quot;
))&lt;/code&gt;&lt;/pre&gt;
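&lt;p&gt;One detail worth knowing: &lt;code&gt;case_when()&lt;/code&gt; returns the value of the first condition that is true, so when two drugs tie for the minimum, the one listed first wins. A small sketch with the Proteus vulgaris values, where Streptomycin and Neomycin tie at 0.1 (&lt;code&gt;tie_df&lt;/code&gt; is a made-up name for this example):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

#One-row toy data (hypothetical name): two drugs share the minimum
tie_df &amp;lt;- data.frame(Penicillin = 3, Streptomycin = 0.1, Neomycin = 0.1, Min = 0.1)

tie_df &amp;lt;- tie_df %&amp;gt;% mutate(Best = case_when(
  Penicillin == Min ~ &amp;quot;Penicillin&amp;quot;,
  Neomycin == Min ~ &amp;quot;Neomycin&amp;quot;, #checked before Streptomycin, so it wins the tie
  Streptomycin == Min ~ &amp;quot;Streptomycin&amp;quot;
))
tie_df$Best&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you would rather have ties go to another drug, reorder the conditions.&lt;/p&gt;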
&lt;p&gt;The data is now ready and in the format we want:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kable(head(df))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Bacteria
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Penicillin
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Streptomycin
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Neomycin
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Gram
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
ID
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Min
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Best
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5.0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.00
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.00
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Salmonella schottmuelleri
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.09
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.09
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Proteus vulgaris
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.10
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.10
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Klebsiella pneumoniae
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
850
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.2
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.00
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.00
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Brucella abortus
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.02
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.02
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Pseudomonas aeruginosa
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
850
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.40
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
6
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.40
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This step might be a little unintuitive, but it makes sense if we think in terms of the &lt;code&gt;grammar of graphics&lt;/code&gt; philosophy.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;seq1 &amp;lt;- rep(1:16,each=100)
seq2 &amp;lt;-rep(seq(-6,6,length=100),16)
newdat &amp;lt;-data.frame(ID=seq1,T=seq2)
write.csv(newdat,&amp;quot;new_data.csv&amp;quot;,row.names=FALSE)&lt;/code&gt;&lt;/pre&gt;
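&lt;p&gt;We are making a new data frame that holds the points of the sigmoid curve: 100 evenly spaced values of &lt;code&gt;T&lt;/code&gt; between -6 and 6 for each of the 16 IDs. You could draw a sigmoid curve in R directly, but this way it is linked to our data through &lt;code&gt;ID&lt;/code&gt;.&lt;/p&gt;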
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Joining the data by ID
final_df&amp;lt;-inner_join(df,newdat,by = &amp;quot;ID&amp;quot;)
kable(head(final_df))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Bacteria
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Penicillin
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Streptomycin
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Neomycin
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Gram
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
ID
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Min
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Best
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
T
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
-6.000000
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
-5.878788
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
-5.757576
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
-5.636364
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
-5.515151
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Mycobacterium tuberculosis
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
800
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
negative
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Neomycin
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
-5.393939
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Sigmoid value for every T
final_df &amp;lt;- final_df %&amp;gt;% mutate(Sigmoid = 1/(1 + exp(-T)))&lt;/code&gt;&lt;/pre&gt;
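&lt;p&gt;The sigmoid squeezes any value of &lt;code&gt;T&lt;/code&gt; into the range between 0 and 1. A quick check in base R (the &lt;code&gt;sigmoid&lt;/code&gt; helper is defined here only for illustration) shows that over our grid from -6 to 6 it runs from almost 0 to almost 1, and that it matches the built-in &lt;code&gt;plogis()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Same formula as in the mutate() call above (helper name is made up)
sigmoid &amp;lt;- function(t) 1 / (1 + exp(-t))

sigmoid(c(-6, 0, 6))
#approximately 0.0025, 0.5 and 0.9975

#plogis() is the logistic distribution function, i.e. the same curve
all.equal(sigmoid(c(-6, 0, 6)), plogis(c(-6, 0, 6)))&lt;/code&gt;&lt;/pre&gt;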
&lt;p&gt;Now that we have the final dataset, we can enter ggplot2 land.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- ggplot(data = final_df , aes(x = T , y = Sigmoid ))
p + geom_point() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://almostkapil.netlify.com/post/sankey_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;1344&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Making best slope
#Different slopes will separate our curves
final_df&amp;lt;-final_df %&amp;gt;% mutate(bestBacSlope = case_when(
  Best ==&amp;quot;Streptomycin&amp;quot; ~ 4 - ID,
  Best ==&amp;quot;Neomycin&amp;quot; ~ 9 - ID,
  Best ==&amp;quot;Penicillin&amp;quot; ~ 14 - ID
))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_df&amp;lt;-final_df %&amp;gt;% mutate(curveBest = ID + bestBacSlope * Sigmoid)
#Figuring out ID and labels: one row per bacterium, ordered by ID
label_df &amp;lt;- final_df %&amp;gt;% distinct(Bacteria, ID) %&amp;gt;% arrange(ID)&lt;/code&gt;&lt;/pre&gt;
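&lt;p&gt;To see why the slope separates the curves, consider a single bacterium. The sketch below uses base R only, and its variable names are made up for illustration. It takes ID 1, whose best drug is Neomycin, so its slope is 9 - 1 = 8. The curve starts near its own ID on the left and ends near 9, the height reserved for Neomycin, on the right:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;id &amp;lt;- 1
slope &amp;lt;- 9 - id #Neomycin sits at height 9
t_grid &amp;lt;- seq(-6, 6, length.out = 100) #same grid as in newdat
sigmoid_vals &amp;lt;- 1 / (1 + exp(-t_grid))
curve_best &amp;lt;- id + slope * sigmoid_vals

range(curve_best)
#approximately 1.02 and 8.98&lt;/code&gt;&lt;/pre&gt;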
&lt;p&gt;Below are the labels we will use on the y-axis, ordered by &lt;code&gt;ID&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;label_y= c(&amp;quot;Mycobacterium tuberculosis&amp;quot; ,  &amp;quot;Salmonella schottmuelleri&amp;quot;  ,    
           &amp;quot;Proteus vulgaris&amp;quot;        ,        &amp;quot;Klebsiella pneumoniae&amp;quot;  ,        
           &amp;quot;Brucella abortus&amp;quot;      ,          &amp;quot;Pseudomonas aeruginosa&amp;quot;    ,     
           &amp;quot;Escherichia coli&amp;quot;    ,            &amp;quot;Salmonella (Eberthella) typhosa&amp;quot;,
           &amp;quot;Aerobacter aerogenes&amp;quot;     ,       &amp;quot;Brucella antracis&amp;quot;    ,          
           &amp;quot;Streptococcus fecalis&amp;quot;    ,       &amp;quot;Staphylococcus aureus&amp;quot;      ,    
           &amp;quot;Staphylococcus albus&amp;quot;    ,        &amp;quot;Streptococcus hemolyticus&amp;quot;      ,
           &amp;quot;Streptococcus viridans&amp;quot;    ,      &amp;quot;Diplococcus pneumoniae&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now it’s &lt;code&gt;plotting time&lt;/code&gt;!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Plotting the sigmoid plots
library(ggthemes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sankey &amp;lt;- ggplot(data = final_df, aes(x = T, y = curveBest, color = Gram, size = Min, group = Bacteria)) +
  geom_line(alpha = 0.9) + #alpha is a constant, so it is set outside aes()
  scale_color_manual(values = c(&amp;quot;green&amp;quot;, &amp;quot;red&amp;quot;)) + #the lines are mapped to color, not fill
  scale_y_continuous(breaks = 1:16, labels = label_y) +
  annotate(&amp;quot;text&amp;quot;, x = 6, y = 14, label = &amp;quot;Penicillin&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, x = 6, y = 9, label = &amp;quot;Neomycin&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, x = 6, y = 4, label = &amp;quot;Streptomycin&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, x = 5.5, y = 15, label = &amp;quot;Best Antibiotics&amp;quot;, size = 5, colour = &amp;#39;blue&amp;#39;) +
  theme_minimal() + #apply the complete theme first so the tweaks below are kept
  theme(axis.title.y = element_blank(), axis.line.x = element_blank(), axis.ticks.x = element_blank(),
        axis.title.x = element_blank(), axis.text.x = element_blank())&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sankey&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:sankey&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://almostkapil.netlify.com/post/sankey_files/figure-html/sankey-1.png&#34; alt=&#34;Classification of Bacteria&#34; width=&#34;1344&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Classification of Bacteria
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Testing Alcohol level</title>
      <link>https://almostkapil.netlify.com/post/beer/</link>
      <pubDate>Wed, 23 May 2018 21:13:14 -0500</pubDate>
      <guid>https://almostkapil.netlify.com/post/beer/</guid>
      <description>


&lt;div id=&#34;is-there-really-5.4-alcohol-in-that-beer-brand&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Is there really 5.4% alcohol in that beer brand?&lt;/h2&gt;
&lt;p&gt;Many beer brands print on their label that the alcohol level is 5.4%. Let’s say we bought beers of one such brand at random and measured the alcohol by volume ourselves.&lt;/p&gt;
&lt;p&gt;We expect the actual alcohol level to be 5.4%, but as beer consumers we sometimes feel it is not.&lt;/p&gt;
&lt;p&gt;If we measured a single beer and found 6.7%, we would immediately complain that the brand is lying about the 5.4%. The brand may argue that our measuring apparatus or technique is not 100% accurate. There is no way of finding out how inaccurate our measurement is without repeating it or measuring several beers. It might also be that our measurement is perfectly accurate and the beer really has more alcohol than the company says. We don’t know, and we can’t measure every single beer they have ever made.
This is the perfect time to put our statistical sense to the test.
Below is a list of measurements from beers bought at random, some from midtown, some from Walmart.
Let’s do a t-test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;level = c(5.1,5.2,6,7,5.01,5.0,6.5,5.6,5.2,6.1,6.2,5.0)
t.test(level, mu = 5.4)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  One Sample t-test
## 
## data:  level
## t = 1.3139, df = 11, p-value = 0.2156
## alternative hypothesis: true mean is not equal to 5.4
## 95 percent confidence interval:
##  5.225010 6.093323
## sample estimates:
## mean of x 
##  5.659167&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The p-value (0.2156) is greater than 0.05, and the 95% confidence interval is [5.23, 6.09], which contains 5.4. So we cannot reject the claim that the mean alcohol level is 5.4%. The interval means that if 100 people repeated this random sampling of beers and each calculated a confidence interval, about 95 of those intervals would be expected to contain the true mean.&lt;/p&gt;
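&lt;p&gt;The numbers in the t-test output can be reproduced by hand, which makes it clearer where they come from. Here is a minimal sketch in base R that reuses the same measurements; all variable names other than &lt;code&gt;level&lt;/code&gt; are made up for this example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;level &amp;lt;- c(5.1, 5.2, 6, 7, 5.01, 5.0, 6.5, 5.6, 5.2, 6.1, 6.2, 5.0)
n &amp;lt;- length(level)

se &amp;lt;- sd(level) / sqrt(n) #standard error of the mean
t_stat &amp;lt;- (mean(level) - 5.4) / se #distance from 5.4 in standard errors
p_value &amp;lt;- 2 * pt(-abs(t_stat), df = n - 1) #two-sided p-value
ci &amp;lt;- mean(level) + c(-1, 1) * qt(0.975, df = n - 1) * se #95% confidence interval

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The four values should agree with the &lt;code&gt;t.test()&lt;/code&gt; output above.&lt;/p&gt;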
&lt;p&gt;Enough with the statistical jargon? Okay let’s enjoy the beer&lt;img src=&#34;https://almostkapil.netlify.com/post/Beer_files/giphy.gif&#34; alt=&#34;Cold Beer and Confidence Interval&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Why do you have to wait more for the buses?</title>
      <link>https://almostkapil.netlify.com/post/inspectionparadox/</link>
      <pubDate>Sun, 23 Jul 2017 21:13:14 -0500</pubDate>
      <guid>https://almostkapil.netlify.com/post/inspectionparadox/</guid>
      <description>


&lt;div id=&#34;average-for-group-vs-individual&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Average for group vs Individual&lt;/h1&gt;
&lt;p&gt;&lt;B&gt;Inspection Paradox&lt;/B&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://almostkapil.netlify.com/post/InspectionParadox_files/sajha.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Buses and trains are supposed to arrive at constant intervals, but in practice some intervals are longer than others. This means the buses do not follow the schedule exactly. There is always some randomness. With your luck, you might think you are more likely to arrive during a long interval. It turns out you are right: a random arrival is more likely to fall in a long interval because, well, it’s longer!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Let’s think of a scenario…&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Suppose a bus service in your city says a bus passes a station every 10 minutes. If you show up at the station at a random time, you would expect to wait 5 minutes on average. More often, though, you will wait longer than five minutes; if the gaps between buses are completely random (exponentially distributed with a mean of 10 minutes), the average wait is actually 10 minutes.&lt;/p&gt;
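&lt;p&gt;The bus scenario can be simulated. The sketch below assumes that the gaps between buses are exponentially distributed with a mean of 10 minutes and that passengers show up at uniformly random times; both are modelling assumptions, and a real timetable would be less random than this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
# assumption: random gaps between buses, 10 minutes on average
gaps &amp;lt;- rexp(100000, rate = 1/10)
arrivals &amp;lt;- cumsum(gaps)   # times at which buses reach the station

# passengers arrive at random times during the same period
passengers &amp;lt;- runif(100000, min = 0, max = max(arrivals))

# index of the first bus that comes after each passenger
next_bus &amp;lt;- findInterval(passengers, arrivals) + 1
waits &amp;lt;- arrivals[next_bus] - passengers

mean(gaps)    # average gap between buses, about 10
mean(waits)   # average wait for a passenger, also about 10 rather than 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The average gap and the average wait come out nearly equal because a random passenger is more likely to land in one of the long gaps.&lt;/p&gt;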
&lt;p&gt;Another example of this paradox: most schools report their average class size, but for you as a student that average is not accurate. Say there are 4 classes of size 75, 13, 12 and 10. Then the average the college will report is &lt;span class=&#34;math inline&#34;&gt;\((75 + 13 +12 +10)/4 = 27.5\)&lt;/span&gt;, but for you as a prospective student the average is different.&lt;/p&gt;
&lt;p&gt;You are more likely to be in the room with 75 students, so the average class size experienced by a student is &lt;span class=&#34;math inline&#34;&gt;\(((75*75) + (13*13)+(12*12)+(10*10))/110 = 54.89\)&lt;/span&gt;. Hence, the reported average is not for you. This kind of paradox happens everywhere.&lt;/p&gt;
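&lt;p&gt;The two averages for the class sizes can be reproduced in a few lines of R. The variable name &lt;code&gt;sizes&lt;/code&gt; is just an illustrative choice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sizes &amp;lt;- c(75, 13, 12, 10)

# average from the point of view of the school: one vote per class
mean(sizes)                       # 27.5

# average from the point of view of a student: one vote per student,
# so each class is weighted by its own size
sum(sizes * sizes) / sum(sizes)   # 54.89 after rounding
weighted.mean(sizes, w = sizes)   # same calculation with a built-in function&lt;/code&gt;&lt;/pre&gt;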
&lt;p&gt;To generalize in a more abstract way: this is a case where the perspectives of the individual and the group differ. For the group, the average describes what happens, but for an individual that average may not match what they experience.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Verifying empirical rule and Chebyshev&#39;s theorem</title>
      <link>https://almostkapil.netlify.com/post/untitled/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://almostkapil.netlify.com/post/untitled/</guid>
      <description>


&lt;div id=&#34;empirical-rule-and-chebyshevs-theorem&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Empirical rule and Chebyshev’s theorem&lt;/h2&gt;
&lt;p&gt;Let’s talk about a really simple but powerful concept: &lt;code&gt;Data Distributions&lt;/code&gt;. A data distribution is an abstract concept (a function) that gives the possible values of the data and how often each value is generated. When you want to talk about all the data from your experiments at once, talk about the data distribution. It gives us the probability of each value being the output if we keep repeating the experiment.&lt;/p&gt;
&lt;p&gt;We rarely have the complete dataset from an experiment, so it is powerful to have an idea of how the data is distributed and which values occur more often than others. We can intuitively understand some distributions, like the heights of a population. We know there will be a few really short people and a few really tall people, but we are sure that most people will be in between. It is really convenient to know the spread and frequency of the data in advance.
&lt;img src=&#34;https://almostkapil.netlify.com/post/Untitled_files/normal.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, there is more than one kind of distribution in the world, so it helps to have a way of knowing the spread of the data in advance whatever the distribution is. There is a famous theorem that gives us an idea of how our data is spread. It’s called Chebyshev’s theorem.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://almostkapil.netlify.com/post/Untitled_files/chebyshev.jpg&#34; alt=&#34;image credit: libretext&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;image credit: libretext&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It says that at least 3/4 of our data will lie within two standard deviations of the mean, whatever the distribution.&lt;/p&gt;
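&lt;p&gt;More generally, Chebyshev’s theorem guarantees that at least &lt;span class=&#34;math inline&#34;&gt;\(1 - 1/k^2\)&lt;/span&gt; of the data lies within &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; standard deviations of the mean, for any &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; greater than 1. A tiny helper makes the bound easy to evaluate; the name &lt;code&gt;chebyshev_bound&lt;/code&gt; is hypothetical and not part of any package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# lower bound on the proportion of data within k standard deviations
# (hypothetical helper, meaningful for k greater than 1)
chebyshev_bound &amp;lt;- function(k) 1 - 1 / k^2

chebyshev_bound(2)   # 0.75, the 3/4 quoted above
chebyshev_bound(3)   # 0.8888889&lt;/code&gt;&lt;/pre&gt;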
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(knitr)
library(kableExtra)
stock&amp;lt;- read.csv(&amp;quot;~/OneDrive - MNSCU/FALL 2019/MathStat/Data/Stock Trade.csv&amp;quot;,stringsAsFactors = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s clean up the column name:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;stock&amp;lt;- stock %&amp;gt;% select(percentStock = X..of.Shares.Outstanding)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The empirical rule says that, for normally distributed data, about 68% of the data will be within one standard deviation of the mean, 95% within two, and 99.7% within three.&lt;/p&gt;
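&lt;p&gt;For an exactly normal distribution, the proportion of data within one, two and three standard deviations of the mean can be computed from the normal CDF with &lt;code&gt;pnorm&lt;/code&gt;. The helper name &lt;code&gt;normal_within&lt;/code&gt; is an illustrative choice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# proportion of a normal distribution within z standard deviations of the mean
normal_within &amp;lt;- function(z) pnorm(z) - pnorm(-z)

round(normal_within(1:3), 3)   # 0.683 0.954 0.997&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Real data is only approximately normal, so observed proportions will sit near these values and not exactly on them.&lt;/p&gt;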
&lt;p&gt;This function below:
1. standardizes the data &lt;br&gt;
2. counts data within &lt;code&gt;z&lt;/code&gt; standard deviations &lt;br&gt;
3. outputs the proportion&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_within&amp;lt;- function(df, z){
  func_normalize&amp;lt;-function(x){(x-mean(x))/sd(x)}
  # drop the data point(s) with percentStock of 11 or more before standardizing
  df&amp;lt;-df %&amp;gt;% filter(percentStock&amp;lt;11)
  df_scaled&amp;lt;- df %&amp;gt;% mutate(percentStock_normal = func_normalize(percentStock)) %&amp;gt;% filter(abs(percentStock_normal)&amp;lt;z)
  proportion = dim(df_scaled)[1]/dim(df)[1]
  return (round(proportion,2))
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s collect the output in a small tibble.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tb&amp;lt;- tibble(
  first_std_dev = data_within(stock,1),
  second_std_dev = data_within(stock,2),
  third_std_dev = data_within(stock,3),
)

kable(tb)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
first_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
second_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
third_std_dev
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.64
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.92
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can also test if our function is working correctly,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(testthat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;testthat&amp;#39; was built under R version 3.5.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;normal_generated = tibble(percentStock = rnorm(10,mean = 6.2,sd = 1.2))

#Testing our function
tb_test&amp;lt;- tibble(
  first_std_dev = data_within(normal_generated,1),
  second_std_dev = data_within(normal_generated,2),
  third_std_dev = data_within(normal_generated,3),
)


kable(tb_test)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
first_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
second_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
third_std_dev
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.7
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;testthat::expect_gt(tb_test$first_std_dev, 0.68,label = &amp;quot;data proportion within first deviation&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hence, our function is working correctly. Note that the data is randomly generated every time the code is run, so with only 10 points the proportions (and the result of the test) can change from run to run.&lt;/p&gt;
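&lt;p&gt;A larger sample with a fixed seed gives a steadier check of the empirical rule. The sketch below uses base R only and does not call &lt;code&gt;data_within&lt;/code&gt;; the sample size and the seed are arbitrary choices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2019)
x &amp;lt;- rnorm(10000, mean = 6.2, sd = 1.2)

# standardize, then measure the share of points within 1, 2 and 3 standard deviations
z &amp;lt;- (x - mean(x)) / sd(x)
proportions &amp;lt;- sapply(1:3, function(k) mean(abs(z) &amp;lt; k))

round(proportions, 2)   # expected to be near 0.68, 0.95 and 1&lt;/code&gt;&lt;/pre&gt;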
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
