In this report, I will be investigating the exponential distribution in R and compare it with the Central Limit Theorem (CLT). I will be investigating the distribution of averages of 40 exponentials, and a total of a thousand simulations.
From Wikipedia: Exponential distribution describes times between events happening at constant rate lambda with expected value 1/lambda.
In summary, this report will 1. Show the sample mean and compare it to the theoretical mean of the distribution 2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution 3. Show that the distribution is approximately normal
The exponential distribution is simulated with rexp(n, lambda)
, where lambda is the rate parameter. The mean of exponential distribution and standard deviation is 1/lambda. Lambda is set at 0.2 for all simulations.
set.seed(2021) # for reproducability
nosim <- 1000 # no of simulations
n <- 40 # no of exponentials
lambda <- 0.2 # rate parameter
simdata <- matrix(rexp(nosim * n, rate=lambda), nosim)
sim_mean <- rowMeans(simdata) # row means
# calculate mean, sd and variance of sample exp dist
simdata_mean <- mean(sim_mean)
simdata_sd <- sd(sim_mean)
simdata_var <- var(sim_mean)
With the no of simulations, no of exponentials, and the rate parameter, we can simulate the exponential distribution by multiplying the exponential by the no of simulations, giving us 1000 simulations of 40 exponentials. We put it in matrix form, and use the apply function to find the mean for each row.
With this, we can then find the sample mean, standard deviation and variance.
# calculate mean, df and variance of theoretical exp dist
t_mean = 1/lambda
t_sd = (1/lambda) * (1/sqrt(n))
t_var = t_sd^2
The blue histogram represents the simulated exponential distribution, as you can see most of the data is at the left side of the plot because of the properties of the exponential distribution.
Observing the histogram for the averages of simulated exponentials, we can see it’s following the form of a normal distribution. This will be further investigated later on in the report.
The green line represents the theoretical mean of the distribution, and our simulated distribution is centered closely at the line.
## Sample_stats Theoretical_stats diff
## Mean 5.0086386 5.0000000 0.008638602
## Std 0.7882570 0.7905694 0.002312423
## Variance 0.6213491 0.6250000 0.003650915
Observing the table, the sample mean of the exponential distribution is centered at 5.008 whereas the theoretical mean, 1/lambda is 5
The difference between the sample and theoretical mean is 0.0086
The sample Variance is is 0.621, which is very close to the theoretical variance, 0.625.
The difference between them is 0.0037
The red line is the theoretical normal distribution density, whereas the blue line is the density of the sample distribution. You can see that the sample distribution is approximately normal.
Observing the normal Q-Q plot, we can conclude that the sample distribution approximates the theoretical normal distribution quite closely, with the tails being less normal.
Based on the comparisons and the plots, the simulated sample distribution (as n grows larger) does indeed have similar means and variance with the theoretical distribution. This proves the Central Limit Theorem is in fact true.
An important condition for the central limit theorem is that the random variables are IID, which stands for Independent and Identically Distributed. These conditions are satisfied as we simulated the data using R.
R codes for table and plots
par(mfrow = c(1, 2))
hist(simdata,
col = "steelblue",
main = "Simulated exponential distribution",
xlab = "40 random exponentials")
hist(sim_mean,
col = "red3",
main = "Averages of Simulated Exponentials",
xlab = "Average of 40 exponentials")
abline(v = t_mean, col = "green", lwd = 2) # theoretical mean
Sample_stats <- c(simdata_mean, simdata_sd, simdata_var)
Theoretical_stats <- c(t_mean, t_sd, t_var)
diff <-
c(abs(t_mean - simdata_mean),
abs(t_sd - simdata_sd),
t_var - simdata_var)
names <- c("Mean", "Std", "Variance")
data.frame(Sample_stats,
Theoretical_stats,
diff,
row.names = c("Mean", "Std", "Variance"))
library(ggplot2)
simdata_mean <- data.frame(sim_mean)
ggplot(simdata_mean, aes(sim_mean)) +
geom_histogram(
binwidth = .3,
fill = "steelblue",
color = "black",
aes(y = ..density..)
) +
geom_density(color = "blue", lwd = 1) +
labs(title = "Distribution of Random Exponential Values with 1000 simulations",
x = "Average of 40 Exponentials", y = "Density") +
stat_function(
fun = dnorm,
args = list(mean = t_mean, sd = t_sd),
color = "red",
lwd = 1
) +
theme_bw()
qqnorm(sim_mean, col="black") # sample distribution
qqline(sim_mean, col="red", lwd=3) #theoretical
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] knitr_1.30 magrittr_2.0.1 tidyselect_1.1.0 munsell_0.5.0
## [5] colorspace_2.0-0 R6_2.5.0 rlang_0.4.8 stringr_1.4.0
## [9] dplyr_1.0.2 tools_4.0.2 grid_4.0.2 gtable_0.3.0
## [13] xfun_0.19 withr_2.3.0 htmltools_0.5.0 ellipsis_0.3.1
## [17] yaml_2.2.1 digest_0.6.27 tibble_3.0.4 lifecycle_0.2.0
## [21] crayon_1.3.4 farver_2.0.3 purrr_0.3.4 vctrs_0.3.5
## [25] glue_1.4.2 evaluate_0.14 rmarkdown_2.5 labeling_0.4.2
## [29] stringi_1.5.3 compiler_4.0.2 pillar_1.4.7 generics_0.1.0
## [33] scales_1.1.1 pkgconfig_2.0.3