Review of Statistical Estimation

STA4173: Biostatistics
Spring 2025

Introduction

In this lecture, we will review estimation for two kinds of variables:
- Continuous variables
  - Mean
  - Median
  - Percentiles / quartiles
  - Variance and standard deviation
  - Interquartile range
- Categorical variables
  - Count
  - Percentage
We will also discuss exploring data graphically

R: Introduction

In this course, we will review formulas, but we will use R for the computations.
- Remember to refer to the lecture notes for specific code needed
- Code is also available on this course’s GitHub repository
You can install R and RStudio if you wish; both are free.
- We have access to the Posit Workbench (“the server”) through HMCSE.
I know that this is probably the first time you are seeing R (or any sort of programming).
- That is why we have “R lab” time built in to our course.
- Remember that I am not looking for perfection, but instead for competency.

Today’s Data: Palmer Penguins

Today we will be demonstrating the basics using the Palmer Penguins dataset, available through R.

library(tidyverse)  # assumed setup: provides %>%, summarize(), and ggplot() used below
penguins <- palmerpenguins::penguins

Types of Variables

Continuous Variables

A continuous variable is a variable that can take on an infinite set of possible values.

Between any two possible values, there are an infinite number of possible values.
These typically arise from measurement. (Height, weight, etc.)

Discrete Variables

A discrete variable is a variable that can only take on a finite or countable set of possible values.

The possible values can usually be listed.
These typically arise from categorizing (work vs. home) or counting.

Types of Variables

Types of Continuous Variables

Ratio Variables

A ratio variable is a variable that has a meaningful zero point, allowing comparisons of magnitude.

True zero point indicates the absence of the quantity being measured.
All arithmetic operations (addition, subtraction, multiplication, division) are meaningful.

Interval Variables

An interval variable has an arbitrary zero point and differences between values are meaningful.

The zero point does not indicate a true absence.
A 1 unit difference always represents the same amount.

Types of Discrete Variables

Ordinal Variables

An ordinal variable has a meaningful order of responses; the exact differences between responses are not necessarily equal.

We understand which value is “greater” or “less,” but not by how much.
Arithmetic is not meaningful.

Nominal Variables

A nominal variable has no intrinsic order among its categories.

Categories are used merely as labels or names.
No arithmetic or ordering operations are meaningful.
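
As a minimal sketch, here is one way the two discrete types can be represented in R, along with the count and percentage summaries used for categorical variables. The values are made-up toy data (not from the penguins dataset), and the names size and species are chosen only for illustration.

```r
# Toy values for illustration only (not taken from the penguins data).

# Ordinal: an ordered factor, so comparisons such as < are defined.
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Nominal: a plain factor, where categories are labels only.
species <- factor(c("Adelie", "Gentoo", "Adelie", "Chinstrap"))

# Ordering is meaningful for ordinal data.
size[1] < size[2]

# Count and percentage for a categorical variable.
counts <- table(species)
percents <- 100 * prop.table(counts)
counts
percents
```

Note that the levels argument is what tells R the intended order; without it, factor levels are sorted alphabetically.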

Measures of Centrality: Mean

Sample Mean

The sample mean provides a single number that can represent a “typical” or central value in your data.

\bar{x} = \frac{\sum_{i=1}^n x_i}{n}

R syntax:

dataset_name %>% summarize(mean(variable_name, na.rm = TRUE))
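
To connect the formula to the R function, the sketch below computes the mean by hand and with mean(). The vector x holds toy values chosen for illustration, with one NA to show why na.rm = TRUE appears in the syntax above.

```r
# Toy values for illustration only; NA marks a missing measurement.
x <- c(3, 7, NA, 10)

# Apply the formula by hand after dropping the missing value.
x_obs <- x[!is.na(x)]
mean_by_hand <- sum(x_obs) / length(x_obs)

# mean() with na.rm = TRUE drops missing values before averaging.
mean_builtin <- mean(x, na.rm = TRUE)

# Without na.rm = TRUE, the missing value propagates and the result is NA.
mean(x)

c(by_hand = mean_by_hand, builtin = mean_builtin)
```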

Measures of Centrality: Mean

Let’s find the average weight (body_mass_g) of the penguins.

penguins %>% summarize(mean(body_mass_g, na.rm = TRUE))

Let’s find the average flipper length (flipper_length_mm) of the penguins.

penguins %>% summarize(mean(flipper_length_mm, na.rm = TRUE))

Measures of Centrality: Median

Sample Median

The sample median is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger.

If n is odd, the median is the single middle value.
If n is even, the median is the average of the two middle values.

R syntax:

dataset_name %>% summarize(median(variable_name, na.rm = TRUE))
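
The odd and even cases can be checked on two small toy vectors (illustrative values, not from the penguins data):

```r
# Toy values for illustration only.
odd_n  <- c(5, 1, 9)       # sorted: 1, 5, 9
even_n <- c(5, 1, 9, 7)    # sorted: 1, 5, 7, 9

median_odd  <- median(odd_n)    # single middle value
median_even <- median(even_n)   # average of the two middle values, (5 + 7) / 2

c(odd = median_odd, even = median_even)
```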

Measures of Centrality: Median

Let’s find the median weight (body_mass_g) of the penguins.

penguins %>% summarize(median(body_mass_g, na.rm = TRUE))

Let’s find the median flipper length (flipper_length_mm) of the penguins.

penguins %>% summarize(median(flipper_length_mm, na.rm = TRUE))

Measures of Spread: Variance and Standard Deviation

Sample Variance

The sample variance measures how “widely spread” the data points are around the mean.

s^2 = \frac{\sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}}{n-1}

When we have a mound-shaped and symmetric distribution, roughly 95% of observations will fall within 2 standard deviations of the mean.
Variance is expressed in squared units (e.g., g² for body mass), which are typically hard to interpret.

Sample Standard Deviation

The sample standard deviation also measures how “widely spread” the data points are around the mean.

s = \sqrt{s^2}

Standard deviation is the square root of the variance, measuring spread in the original units of the data.
R syntax:

dataset_name %>% summarize(var(variable_name, na.rm = TRUE), 
                           sd(variable_name, na.rm = TRUE))
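
As a quick check that the formula above agrees with R, the sketch below applies it by hand to a toy vector (illustrative values only) and compares the result with var() and sd().

```r
# Toy values for illustration only.
x <- c(2, 4, 4, 6, 9)
n <- length(x)

# Sample variance from the computational formula:
# (sum of squares - (sum)^2 / n) / (n - 1)
s2_by_hand <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Standard deviation is the square root of the variance.
s_by_hand <- sqrt(s2_by_hand)

# Compare with the built-in functions.
c(var_by_hand = s2_by_hand, var_builtin = var(x))
c(sd_by_hand = s_by_hand, sd_builtin = sd(x))
```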

Measures of Spread: Variance and Standard Deviation

Let’s find the variance and standard deviation of the weight (body_mass_g) of the penguins.

penguins %>% summarize(var(body_mass_g, na.rm = TRUE),
                       sd(body_mass_g, na.rm = TRUE))

Let’s find the variance and standard deviation of the flipper length (flipper_length_mm) of the penguins.

penguins %>% summarize(var(flipper_length_mm, na.rm = TRUE),
                       sd(flipper_length_mm, na.rm = TRUE))

Measures of Spread: Interquartile Range

Sample Interquartile Range

The sample interquartile range measures the spread of the middle 50% of data.

\text{IQR} = P_{75}-P_{25}

R syntax:

dataset_name %>% summarize(IQR(variable_name, na.rm = TRUE))
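
The percentiles in the formula can be obtained with quantile(). The sketch below uses a toy vector (illustrative values only) to show that the difference between the 75th and 25th percentiles matches IQR(). Other software may use a slightly different percentile rule than R's default, so hand calculations can differ a little on small samples.

```r
# Toy values for illustration only.
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)

# 25th and 75th percentiles (first and third quartiles).
q <- quantile(x, probs = c(0.25, 0.75))

# IQR by hand: P75 - P25 (unname() drops the "75%" label).
iqr_by_hand <- unname(q[2] - q[1])

# Built-in IQR for comparison.
c(by_hand = iqr_by_hand, builtin = IQR(x))
```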

Measures of Spread: Interquartile Range

Let’s find the IQR of the weight (body_mass_g) of the penguins.

penguins %>% summarize(IQR(body_mass_g, na.rm = TRUE))

Let’s find the IQR of the flipper length (flipper_length_mm) of the penguins.

penguins %>% summarize(IQR(flipper_length_mm, na.rm = TRUE))

Mean & Standard Deviation vs. Median & IQR

When should we use the mean vs. the median to describe the center of the distribution?
- Mound-shaped and symmetric \to mean (\bar{x}) and standard deviation (s).
- Skewed or otherwise not mound-shaped \to median (M) and \text{IQR}.
… How do we know the shape of the distribution?
We will explore histograms.
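
To see why shape matters, the sketch below compares the mean and median on two toy vectors (illustrative values only): one symmetric, and one where a single extreme value creates a long right tail.

```r
# Toy values for illustration only.
symmetric <- c(2, 3, 4, 5, 6)
skewed    <- c(2, 3, 4, 5, 60)   # one extreme value in the right tail

# Symmetric data: mean and median agree.
c(mean = mean(symmetric), median = median(symmetric))

# Skewed data: the mean is pulled toward the extreme value,
# while the median stays at the middle observation.
c(mean = mean(skewed), median = median(skewed))
```

The same pattern holds for spread: the standard deviation is inflated by the extreme value, while the IQR depends only on the middle 50% of the data.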

Graphs: Histograms

Graphs: Histograms (R code)

We are using the ggplot2 package for graphing.
- It will always start with ggplot().
- We will then layer elements on top.
R syntax:

dataset_name %>% 
  ggplot(aes(x=variable_name)) + 
  geom_histogram()
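
A histogram divides the range of the variable into bins and counts the observations in each bin. The sketch below illustrates that idea in base R on toy values, using hist() with plot = FALSE to return the counts without drawing anything. It is meant only to show what the bars represent, not how ggplot2 works internally. The break points are an arbitrary choice for this example, and ggplot2's default binning will generally differ.

```r
# Toy values for illustration only.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Bin edges chosen arbitrarily for this sketch: 0, 2, 4, 6, 8, 10.
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of observations in each bin (the bar heights)

# Every observation lands in exactly one bin.
sum(h$counts) == length(x)
```

In ggplot2, the bin width can be set in a similar spirit with the binwidth or bins arguments of geom_histogram().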

Graphs: Histograms

Let’s look at the histogram of penguin weight (body_mass_g):

penguins %>% 
  ggplot(aes(x=body_mass_g)) + 
  geom_histogram()

Graphs: Histograms

Now let’s add axis labels, a title, and a theme to the histogram of penguin weight (body_mass_g):

penguins %>% 
  ggplot(aes(x=body_mass_g)) + 
  geom_histogram() +
  labs(x = "Body Mass (g)",
       y = "Number of Penguins",
       title = "Penguin Weight Distribution") +
  theme_bw()

Wrap Up

Today we reviewed estimation.
Next week, we will review statistical inference.
- Confidence intervals
- Hypothesis testing
Get-to-know-you quiz: complete it in RStudio.
- .qmd \to Quarto
- .R \to R script
Join the Discord server!
- If you are already a Discord user, this is a friendly reminder that you can change your display name…
