Lecture 1
Dr. Mariam Hassan Ali Hassan @ ArabiStat
2023-12-05
Statistical Analysis using R
In this lecture:
- Intro to Statistics
- Data Types in Statistics
- Methods of summarizing data
What is statistics:
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of numerical or quantitative data.
Note: Statistics is a branch of mathematics; thus, understanding every concept from its roots is sometimes not possible UNLESS you have studied mathematical statistics and the math behind statistics.
Statistics deals with:
Cleaning data: --> Strictly speaking, this is not part of statistical analysis from a theoretical point of view; in practice, however, it is an important part of data analysis.
Collection: --> Based on the fields of sampling and experimental design.
Analysis: --> What happened (descriptive analysis), can we generalize it to the population (inferential analysis), and what will happen (predictive analysis).
Presentation: --> Tables or graphs.
Interpretation: --> Explaining what the results mean in context.
What statistics can do: A/B tests, e.g., which ad is more effective in getting people to purchase a product?
What statistics can't do: Tell us whether the excessive yellow in the second ad is what drove people to purchase the product.
Types of Data:
From a statistical standpoint, data can be broadly classified into two main categories: Numeric (Quantitative) and Categorical (Qualitative).
Numeric (Quantitative):
Numeric data represents measurable quantities and can further be classified into two subtypes: Continuous and Discrete.
- Continuous (measured): Continuous data consists of variables that can take any value within a given range. Examples include:
- Speed
- Weight
- Time (sometimes discrete)
- Discrete (counted): Discrete data involves variables that can only take distinct, separate values, typically in whole numbers. Examples include:
- Number of children
- Number of packages shipped
Categorical (Qualitative):
Categorical data represents characteristics and qualities, rather than quantities. It can be further divided into two subtypes: Nominal and Ordinal.
- Nominal (Un-ordered): Nominal data involves categories without any inherent order. Examples include:
- Country of residence
- Marital status (married/unmarried)
- Gender
- Ordinal (Ordered): Ordinal data consists of categories with a meaningful order but without a consistent interval. Examples include:
- Customer satisfaction (agree, neutral, disagree)
- Temperature (low, medium, high)
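A minimal sketch (the variable names here are hypothetical, not from the lecture) of how these statistical types are usually represented in R: continuous and discrete values as numeric and integer vectors, nominal data as unordered factors, and ordinal data as ordered factors.
height  <- c(1.72, 1.80, 1.65)                  # continuous -> numeric (double)
n_kids  <- c(0L, 2L, 1L)                        # discrete -> integer
country <- factor(c("Egypt", "Italy", "Egypt")) # nominal -> unordered factor
rating  <- factor(c("low", "high", "medium"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)               # ordinal -> ordered factor
str(list(height, n_kids, country, rating))      # inspect how R stores each one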
Why Types of Data Are Important:
Understanding the types of data is crucial in the field of statistics for several reasons:
Appropriate Analysis: Different types of data require different statistical methods. Recognizing the type of data helps in choosing the right analysis techniques.
Data Interpretation: Knowing whether data is numeric or categorical aids in the interpretation of results. For instance, mean values make sense for continuous data, but not for categorical data.
Effective Visualization: Properly categorizing data types allows for the selection of appropriate visualizations. Histograms are suitable for numeric data, while bar charts are often used for categorical data.
Informed Decision-Making: Understanding the nature of the data enhances the ability to make informed decisions based on statistical insights.
In summary, a clear comprehension of the types of data is fundamental for accurate statistical analysis and interpretation, contributing to informed decision-making in various fields.
Why Types of Data Are Important in R:
In R, as in any statistical programming language, knowing the types of data is crucial for effective data manipulation, analysis, and visualization. Here are some reasons why understanding data types is important when using R:
- Data Import and Loading:
  - Different functions in R are designed to handle specific data types. Knowing the data type helps you choose the appropriate function for importing or loading data into R. For example, read.csv() is suitable for loading tabular data, while readRDS() is used for reading serialized R objects.
- Data Cleaning and Transformation:
  - Before analysis, data often requires cleaning and transformation. Understanding data types is essential for detecting and handling missing values, outliers, and inconsistencies. R provides specific functions for these tasks, such as na.omit() for dropping rows with missing values.
- Statistical Analysis:
- Different statistical methods are applicable to different data types. For instance, linear regression is commonly used for numeric data, while chi-squared tests are more suitable for categorical data. Knowing the data type informs the selection of appropriate statistical tests and models in R.
- Data Manipulation with dplyr:
  - The dplyr package in R is widely used for data manipulation. Functions like mutate() and filter() operate differently depending on the data type. Understanding whether a variable is numeric or categorical is crucial for writing effective and accurate data manipulation code.
- Visualization with ggplot2:
  - Visualization is a key aspect of data exploration and communication. The ggplot2 package in R is a powerful tool for creating visualizations. The type of plot and the aesthetics used (e.g., color, shape) depend on the data types. For instance, histograms are suitable for numeric data, while bar plots are more appropriate for categorical data.
- Efficient Memory Usage:
- Knowing the types of data helps in optimizing memory usage. R provides data types like integers, doubles, and factors. Choosing the appropriate data type for each variable can save memory and improve the efficiency of your R code, especially when working with large datasets.
- Programming and Debugging:
- Understanding data types is fundamental for effective programming in R. It helps in writing robust and error-free code. For example, ensuring that a function receives the expected data type as input can prevent errors during execution.
In summary, knowing the types of data is essential when using R because it influences data handling, analysis, visualization, and overall programming efficiency. It enables users to make informed decisions and write code that is accurate, efficient, and aligned with the nature of the data being analyzed.
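To make the "Programming and Debugging" point above concrete, here is a minimal sketch; the function safe_mean is a made-up name for illustration, not a library function.
safe_mean <- function(x) {
  stopifnot(is.numeric(x)) # fail early, with a clear error, if x is not numeric
  mean(x, na.rm = TRUE)
}
safe_mean(c(1, 2, NA, 4)) # returns 2.333...
# safe_mean(c("a", "b")) # would stop with an informative error instead of running on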
Example of the importance of data types in visualization
A misleading visualization vs. an appropriate one
library(ggplot2)
ggplot(PlantGrowth, #data
aes(x=weight,y=group
))+
geom_point()
ggplot(PlantGrowth, #data
aes(x=weight,y=group
))+
geom_boxplot()
print("This plot is not accipatble statistically but Sometimes acceptable in data analysis")
## [1] "This plot is not accipatble statistically but Sometimes acceptable in data analysis"
ggplot(PlantGrowth, #data
aes(y=weight,x=group
))+
geom_boxplot()
# Creating a data frame with continent and GDP information
continent_gdp <- data.frame(
Continent = c("Asia", "Africa", "North America", "South America", "Europe", "Oceania"),
GDP = c(100, 50, 150, 80, 120, 10)
)
ggplot(continent_gdp, aes(x = Continent, y = GDP, fill = Continent)) +
geom_bar(stat = "identity") +
#labs(title = "GDP by Continent", x = "Continent", y = "GDP") +
theme_minimal()
ggplot(continent_gdp, aes(x = Continent, y = GDP, fill = Continent)) +
geom_bar(stat = "identity") +
#labs(title = "GDP by Continent", x = "Continent", y = "GDP") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(continent_gdp, aes(y = Continent, x = GDP, fill = Continent)) +
geom_bar(stat = "identity") +
#labs(title = "GDP by Continent", x = "Continent", y = "GDP") +
theme_minimal()
Methods of Summarising Data:
- Descriptive statistics
- Frequency tables and cross tables (part of descriptive statistics, but for categorical variables)
- Correlation analysis
- Visualization
- Histograms
- Scatter plot
- Line plot
- Bar charts
- Heat-maps
- Pareto analysis
1. Descriptive Statistics:
Descriptive statistics refers to the branch of statistics that involves the collection, analysis, interpretation, presentation, and organization of data. The primary goal of descriptive statistics is to summarize and describe the main features of a dataset, providing a clear and concise overview that facilitates understanding.
Key components of descriptive statistics include:
- Measures of Central Tendency:
- Mean: The average value of a dataset, calculated by summing all values and dividing by the number of observations.
library(readxl)
library(data.table)
df_sleep=read.csv("E:\\Arabistat\\Arabistat Academy\\R courses\\R\\Statistics with R advanced/Animals_sleep.csv")
head(df_sleep)
## species body_wt brain_wt non_dreaming dreaming total_sleep
## 1 Africanelephant 6654.000 5712.0 NA NA 3.3
## 2 Africangiantpouchedrat 1.000 6.6 6.3 2.0 8.3
## 3 ArcticFox 3.385 44.5 NA NA 12.5
## 4 Arcticgroundsquirrel 0.920 5.7 NA NA 16.5
## 5 Asianelephant 2547.000 4603.0 2.1 1.8 3.9
## 6 Baboon 10.550 179.5 9.1 0.7 9.8
## life_span gestation predation exposure danger
## 1 38.6 645 3 5 3
## 2 4.5 42 3 1 3
## 3 14.0 60 1 1 1
## 4 NA 25 5 2 3
## 5 69.0 624 3 5 4
## 6 27.0 180 4 4 4
mean(df_sleep$total_sleep) # the mean fails to be calculated. Why?
## [1] NA
df_sleep = na.omit(df_sleep) # remove any row that has missing (NA) values
mean(df_sleep$total_sleep)
## [1] 10.64286
hist(df_sleep$total_sleep,5)
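A side note (a sketch, not from the lecture): na.omit() drops every row containing an NA in any column. If only this one column matters, mean() can skip its NAs directly, which keeps rows whose NAs are in other columns, so the result can differ from the na.omit() approach.
# Alternative: skip NAs in just this column, keeping all rows of the data frame
mean(df_sleep$total_sleep, na.rm = TRUE)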
- Median: The middle value of a dataset when it is ordered from least to greatest.
median(df_sleep$total_sleep)
## [1] 9.8
hist(df_sleep$total_sleep,5)
abline(v = mean(df_sleep$total_sleep), col = "red")
abline(v = median(df_sleep$total_sleep), col = "blue")
tea_price=c(15,17,13.3, 80,15.1,15.4,10,12.5,17,20,15)
print(mean(tea_price))
## [1] 20.93636
print(median(tea_price))
## [1] 15.1
hist(tea_price)
abline(v = mean(tea_price), col = "red")
abline(v = median(tea_price), col = "blue")
After removing the outlier
tea_price=c(15,17,13.3, 15.1,15.4,10,12.5,17,20,15.7)
print(mean(tea_price))
## [1] 15.1
print(median(tea_price))
## [1] 15.25
hist(tea_price)
abline(v = mean(tea_price), col = "red")
abline(v = median(tea_price), col = "blue")
It is not always easy to remove all outliers 🙂
Look at the following example:
df_food=read_excel("E:\\Arabistat\\Arabistat Academy\\R courses\\R\\Statistics with R advanced/food_consumption.xlsx")
head(df_food)
## # A tibble: 6 × 4
## country food_category consumption co2_emission
## <chr> <chr> <dbl> <dbl>
## 1 Argentina pork 10.5 37.2
## 2 Argentina poultry 38.7 41.5
## 3 Argentina beef 55.5 1712
## 4 Argentina lamb_goat 1.56 54.6
## 5 Argentina fish 4.36 6.96
## 6 Argentina eggs 11.4 10.5
hist(df_food$co2_emission,100)
abline(v = mean(df_food$co2_emission), col = "red")
abline(v = median(df_food$co2_emission), col = "blue")
- Mode: The value that appears most frequently in a dataset.
library(DescTools)
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:data.table':
##
## %like%
num=c(1,1,2,3,4,1,1,1)
Mode(num)
## [1] 1
## attr(,"freq")
## [1] 5
num2=c(1.1,1.5,2.7,3.8,4.3,1.9,1.2,1.0)
Mode(num2)
## [1] NA
## attr(,"freq")
## [1] 1
cat1=c("danger", "safe", "danger", "safe","danger","danger","danger","danger")
Mode(cat1)
## [1] "danger"
## attr(,"freq")
## [1] 6
cat2=c("dasafenger", "safe", "danger", "safe","safe","safe","safe","danger")
Mode(cat2)
## [1] "safe"
## attr(,"freq")
## [1] 5
Mean vs Median vs Mode
- The mean is sensitive to extreme values (outliers) and can be influenced by them, but it considers ALL data points.
- The median is less sensitive to outliers and provides a better measure of central tendency in skewed datasets, but it considers only the middle one or two data points.
- The mode is less commonly used and may not exist in every dataset. It is especially useful for categorical data.
NOTE: if you need the mode of a continuous (or, in general, numerical) variable, re-categorize the data into bins first, as in the example below, and then take the mode of the bins.
# Create a sample numeric vector
data <- c(10, 15, 20, 30, 40, 50, 60)
# Define the breakpoints (bin edges) for the intervals
breaks <- c(0, 20, 40, 60)
# Use cut() to create intervals based on the specified breakpoints
# The labels argument specifies labels for the intervals
result <- cut(data, breaks = breaks, labels = c("Low", "Medium", "High"))
# Display the result
print(result)
## [1] Low Low Low Medium Medium High High
## Levels: Low Medium High
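Completing the note above with a small sketch: once the numeric values are binned, the mode is well defined again. This assumes DescTools is attached as earlier in the lecture.
library(DescTools)
Mode(as.character(result)) # "Low" is the most frequent bin here (frequency 3)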
- Measures of Dispersion:
Why do we need these measures?
The following example shows coffee prices over three months. Is the mean enough to judge such a market?
#coffee price in three months
coffee_price_City1=c(4.9 , 5 , 5.1)
coffee_price_City2=c(1 , 5 , 9)
mean(coffee_price_City1)
## [1] 5
mean(coffee_price_City2)
## [1] 5
Is the market of City2 as stable as that of City1?
It is clear that the market in the second city is more volatile than the first one, although they have the same mean! But how can we measure such a claim numerically? That is why we need measures of dispersion:
- Variance, standard deviation (std), and coefficient of variation (C.V%)
- Range
- Interquartile range (IQR)
- Percentiles
- Variance: A measure of how spread out the values in a dataset are from the mean.
- Standard Deviation: The square root of the variance, providing a more interpretable measure of the spread of data.
- Coefficient of variation(C.V%): The coefficient of variation (CV) is a statistical measure expressing the relative variability of a dataset, calculated as the ratio of the standard deviation to the mean and expressed as a percentage, making it a normalized indicator of dispersion relative to the scale of the data.
#coffee price in three months
coffee_price_City1=c(4.9,5,5.1)
coffee_price_City2=c(1,5,9)
# Manually Calculate the Variance
distance=coffee_price_City1-mean(coffee_price_City1)
distance_square=distance^2
sum_distance_square=sum(distance_square)
average_sum_distance_square=sum_distance_square/(length(coffee_price_City1)-1) #(n-1 for samples)
# Variance Function in r
var(coffee_price_City1)
## [1] 0.01
# Manually calculate the Standard deviation
root_average_sum_distance_square=sqrt(average_sum_distance_square)
# Standard deviation function in r
sd(coffee_price_City1)
## [1] 0.1
Now, which prices are more dispersed: New York's in dollars or India's in rupees?
# price among three months
coffee_price_NewYork=c(4,5,6)
coffee_price_India=c(400,500,600)
sd(coffee_price_NewYork)
## [1] 1
sd(coffee_price_India)
## [1] 100
Now the standard deviation for New York is 1 dollar, which is less than India's 100 rupees. But is it acceptable to compare different units? That is why we need the coefficient of variation.
i.e., the C.V% transforms the dispersion into a percentage of dispersion around the mean.
# Manually calculating the CV (as a ratio; multiply by 100 for C.V%)
CV_NewYork=sd(coffee_price_NewYork)/mean(coffee_price_NewYork)
CV_NewYork
## [1] 0.2
CV_India=sd(coffee_price_India)/mean(coffee_price_India)
CV_India
## [1] 0.2
# Calculating the CV using functions in R
library(DescTools)
# Calculate the coefficient of variation
cv_result <- CoefVar(coffee_price_NewYork)
# Print the result
cat("Coefficient of Variation:", cv_result)
## Coefficient of Variation: 0.2
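As a small follow-up sketch: the "C.V%" itself is just this ratio times 100.
CV_percent_NewYork <- sd(coffee_price_NewYork) / mean(coffee_price_NewYork) * 100
CV_percent_NewYork # 20, i.e., the spread is 20% of the mean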
Note: The variance, std, and CV all depend on the mean in their calculation; thus, they share the same advantages and disadvantages as the mean.
- Range: The difference between the maximum and minimum values in a dataset.
It depends on only two observations, so it is strongly affected by outliers.
#coffee price in three months
coffee_price_City1=c(4.9,5,5.1)
coffee_price_City2=c(1,5,9)
coffee_price_City3=c(1,5,9,90) # beware of the outlier
range1=max(coffee_price_City1)-min(coffee_price_City1)
range1
## [1] 0.2
range2=max(coffee_price_City2)-min(coffee_price_City2)
range2
## [1] 8
range3=max(coffee_price_City3)-min(coffee_price_City3)
range3
## [1] 89
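A base-R note (a sketch): range() returns c(min, max), so diff(range(x)) gives the same number computed manually above.
diff(range(coffee_price_City3)) # 89, same as range3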
- Interquartile Range (IQR) is a measure of statistical dispersion that represents the range between the first quartile (25th percentile) and the third quartile (75th percentile) of a dataset, providing a robust measure of the spread of the middle 50% of the data.
It isn’t affected by outliers
One big disadvantage is that it neglects 50% of the data 🙁
prices=c(1,2,3,4,5,6,7,8,9,10)
q1=quantile(prices,0.25)
q3=quantile(prices,0.75)
iqr=q3-q1
print(paste("The first quartile Q1 =",q1, " and the third quartile Q3 =",q3, " and the IQR =", iqr))
## [1] "The first quartile Q1 = 3.25 and the third quartile Q3 = 7.75 and the IQR = 4.5"
The IQR is important because it is part of the standard calculation for flagging outliers.
Outlier: a data point that is substantially different from the others.
How do we know what a substantial difference is? A data point is an outlier if:
data.point < Q1 − 1.5×IQR or
data.point > Q3 + 1.5×IQR
prices=c(1,2,3,4,5,6,7,8,9,10,80)
q1=quantile(prices,0.25)
q3=quantile(prices,0.75)
iqr=q3-q1
lower_bound=q1-1.5*iqr
higher_bound=q3+1.5*iqr
prices>higher_bound #there is one upper outlier
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
prices[prices>higher_bound]
## [1] 80
prices<lower_bound
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
prices[prices<lower_bound] #no lower outliers
## numeric(0)
Now we want to apply this to a column in a data frame:
q1=quantile(df_sleep$body_wt,0.25)
q3=quantile(df_sleep$body_wt,0.75)
iqr <- q3-q1
lower_bound <- q1-1.5* iqr
upper_bound<- q3+1.5* iqr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_sleep %>%
filter(body_wt < lower_bound | body_wt > upper_bound) #Or
## species body_wt brain_wt non_dreaming dreaming total_sleep life_span
## 1 Asianelephant 2547.00 4603 2.1 1.8 3.9 69.0
## 2 Braziliantapir 160.00 169 5.2 1.0 6.2 30.4
## 3 Chimpanzee 52.16 440 8.3 1.4 9.7 50.0
## 4 Cow 465.00 423 3.2 0.7 3.9 30.0
## 5 Goat 27.66 115 3.3 0.5 3.8 20.0
## 6 Grayseal 85.00 325 4.7 1.5 6.2 41.0
## 7 Horse 521.00 655 2.1 0.8 2.9 46.0
## 8 Man 62.00 1320 6.1 1.9 8.0 100.0
## 9 Pig 192.00 180 6.5 1.9 8.4 27.0
## 10 Sheep 55.50 175 3.2 0.6 3.8 20.0
## gestation predation exposure danger
## 1 624 3 5 4
## 2 392 4 5 4
## 3 230 1 1 1
## 4 281 5 5 5
## 5 148 5 5 5
## 6 310 1 3 1
## 7 336 5 5 5
## 8 267 1 1 1
## 9 115 4 4 4
## 10 151 5 5 5
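An alternative sketch: base R's boxplot.stats() flags points beyond the whiskers using the same 1.5 x IQR idea (it uses boxplot hinges rather than quantile(), so results can differ slightly from the manual bounds above).
boxplot.stats(df_sleep$body_wt)$out # values flagged as outliers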
- Quartiles and Percentiles
Quartiles and percentiles are measures of statistical dispersion that divide a dataset into parts, indicating the relative position of values.
Quartiles:
- Quartiles are values that divide a dataset into four equal parts.
- The three quartiles are the first quartile (Q1 or the 25th percentile), the second quartile (Q2 or the median), and the third quartile (Q3 or the 75th percentile).
- Q1 represents the 25% mark, Q2 is the median at the 50% mark, and Q3 is the 75% mark.
library(ggplot2)
quartiles=quantile(df_sleep$total_sleep)
print(quartiles)
## 0% 25% 50% 75% 100%
## 2.90 8.05 9.80 13.60 19.90
ggplot(df_sleep, aes(y = total_sleep))+
geom_boxplot()
Percentiles:
- Percentiles divide a dataset into one hundred equal parts, providing a more detailed view of the distribution. They are a general form of quartiles: you can choose any level from 1 to 100.
- The nth percentile represents the value below which n% of the data falls.
- For example, the 75th percentile (P75) indicates the value below which 75% of the data lies.
quantile(df_sleep$total_sleep, probs =c(0,0.2,0.4,0.6,0.8,1))
## 0% 20% 40% 60% 80% 100%
## 2.90 6.28 8.80 10.82 14.28 19.90
quantile(df_sleep$total_sleep, probs = seq(0,1,0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 2.90 4.05 6.28 8.33 8.80 9.80 10.82 13.20 14.28 17.36 19.90
Both quartiles and percentiles help in understanding the distribution of data and identifying central tendencies and spread.
H.W
In the world_happiness_sugar data frame:
1. Find the countries with the highest sugar usage in grams (the top 5%) and sort them in descending order of sugar usage.
2. Find the countries with the lowest sugar usage in grams (the bottom 5%) and sort them in ascending order of sugar usage.
3. Look up the mean absolute deviation (MAD), write down its definition, and calculate it in R for the sugar usage in grams.
- Measures of Shape:
- Skewness: Indicates the asymmetry or lack of symmetry in a dataset’s distribution.
- Kurtosis: Measures the “tailedness” of a dataset, indicating whether the data are heavy-tailed or light-tailed compared to a normal distribution.
Skewness is a statistical measure that describes the asymmetry of a distribution. It indicates the extent and direction of skew (departure from horizontal symmetry) in a dataset. Skewness can be classified into three types:
I. Symmetrical Distribution (Zero Skewness):
- A distribution is considered symmetrical when the left and right sides are mirror images of each other.
- Skewness is zero for a perfectly symmetrical distribution.
- Example: Normal distribution.
# Generate a random sample from a normal distribution
library(moments)
set.seed(123)
data_zero_skew <- rnorm(1000)
# Calculate skewness
skewness_zero <- skewness(data_zero_skew)
# Plot the histogram
hist(data_zero_skew, main="Zero Skewness (Symmetrical Distribution)", col="lightblue", border="black")
- Negative Skewness:
- A distribution is negatively skewed when the left tail is longer or fatter than the right tail.
- Skewness is negative for a negatively skewed distribution.
- Example: a beta distribution with a large first shape parameter and a small second one, which piles up near its upper bound (as generated below).
# Generate a random sample from a left-skewed (scaled) beta distribution
set.seed(456)
data_negative_skew <- rbeta(100000,100,1)*10
# Calculate skewness
skewness_negative <- skewness(data_negative_skew)
# Plot the histogram
hist(data_negative_skew, main="Negative Skewness", col="lightcoral", border="black")
- Positive Skewness:
- A distribution is positively skewed when the right tail is longer or fatter than the left tail.
- Skewness is positive for a positively skewed distribution.
- Example: Chi-squared distribution.
# Generate a random sample from a chi-squared distribution
set.seed(789)
data_positive_skew <- rchisq(1000, df=3)
# Calculate skewness
skewness_positive <- skewness(data_positive_skew)
# Plot the histogram
hist(data_positive_skew, main="Positive Skewness", col="lightgreen", border="black")
In the above code snippets, the skewness function from the moments package in R is used to calculate the skewness of each dataset. The histograms visualize the distribution of each dataset.
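If you want to see the numbers themselves rather than just the histograms, printing the stored values makes the sign of each skewness explicit:
print(skewness_zero)     # approximately 0 for the symmetric sample
print(skewness_negative) # negative for the left-skewed sample
print(skewness_positive) # positive for the right-skewed sample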
Note: Make sure to install and load the moments package before running the code:
install.packages("moments")
library(moments)
Kurtosis is a statistical measure that describes the distribution of data in a dataset. It indicates the tailedness or peakedness of a distribution compared to a normal distribution. There are three main types of kurtosis:
- Mesokurtic (Normal Kurtosis):
  - A mesokurtic distribution has a kurtosis of about 3, i.e., an excess kurtosis of zero, as in the normal distribution. (Note that the kurtosis function in the moments package reports raw kurtosis, not excess kurtosis.)
  - In R, you can generate a random sample from a normal distribution and calculate its kurtosis using the kurtosis function from the moments package:
# Install and load the moments package
library(moments)
# Generate a random sample from a normal distribution
norm_data <- rnorm(1000)
# Calculate kurtosis
norm_kurtosis <- kurtosis(norm_data)
# Print kurtosis
print(paste("Kurtosis of normal distribution:", norm_kurtosis))
## [1] "Kurtosis of normal distribution: 3.21179927527682"
- Leptokurtic Distribution:
  - A leptokurtic distribution has kurtosis greater than 3 (positive excess kurtosis), indicating heavier tails and a sharper peak than the normal distribution.
  - In R, you can generate a leptokurtic distribution using the rt function from the stats package (loaded by default):
# Generate a random sample from a t-distribution with degrees of freedom = 3
leptokurtic_data <- rt(1000, df = 3)
# Calculate kurtosis
leptokurtic_kurtosis <- kurtosis(leptokurtic_data)
# Print kurtosis
print(paste("Kurtosis of leptokurtic distribution:", leptokurtic_kurtosis))
## [1] "Kurtosis of leptokurtic distribution: 13.0027823091812"
- Platykurtic Distribution:
  - A platykurtic distribution has kurtosis less than 3 (negative excess kurtosis), indicating lighter tails and a flatter peak than the normal distribution.
  - One way to construct a platykurtic sample in R is to concentrate the values on a few points near the center, as below (simply widening the standard deviation of rnorm would not change the kurtosis):
# Construct a sample concentrated on a few central values (platykurtic)
platykurtic_data <- c(rep(61, each = 10), rep(64, each = 18), rep(65, each = 23),
                      rep(67, each = 32), rep(70, each = 27), rep(73, each = 17))
# Calculate kurtosis
platykurtic_kurtosis <- kurtosis(platykurtic_data)
# Print kurtosis
print(paste("Kurtosis of platykurtic distribution:", platykurtic_kurtosis))
## [1] "Kurtosis of platykurtic distribution: 2.25831795322904"
For visualization, you can use a histogram to observe the shape of the distributions:
# Plot histograms for the three distributions
par(mfrow = c(1, 3))
hist(norm_data, main = "Normal Distribution", col = "lightblue")
hist(leptokurtic_data, main = "Leptokurtic Distribution", col = "lightgreen")
hist(platykurtic_data, main = "Platykurtic Distribution", col = "lightcoral")
This code will create a side-by-side comparison of the histograms for the normal, leptokurtic, and platykurtic distributions, allowing you to visually compare their shapes.
- Frequency Distribution:
  - A table that displays the frequency of values or ranges of values in a dataset.
- Percentiles and Quartiles:
- Percentiles: Values that divide a dataset into 100 equal parts, providing insight into the relative standing of a particular value. They are a general form of quartiles.
- Quartiles: Values that divide a dataset into four equal parts, representing the median and the two midpoints of the lower and upper halves.
Descriptive statistics are fundamental in summarizing the main characteristics of a dataset and are often the first step in data analysis. They provide a snapshot of the data’s central tendency, variability, and distribution, making it easier for researchers, analysts, and decision-makers to understand and interpret the information at hand.