STA2023 Review: Point Estimation Data Visualization
June 17, 2025 Tuesday
Introduction: Topics
Basic descriptives:
Continuous variables:
Mean
Median
Variance and standard deviation
Range and interquartile range
Categorical variables:
Count
Overall percentage
Row percentage
Column percentage
Introduction: Data
Name: the pony’s name
Type: type of pony (Earth, Pegasus, Unicorn, Alicorn)
Sex: sex/age of pony (Coal, Filly, Stallion, Mare)
Flying speed: average flying speed (km/hr) for winged ponies
Friendship: a harmony index from friendship activities (0-10)
Magical energy: measured magical energy output (sparkles) for magical ponies
Tail shimmer: how much light reflected by the pony’s tail (lux)
Types of Variables: Qualitative
A qualitative or categorical variable classifies an observation into one of two or more groups or categories.
Nominal: purely qualitative and unordered
Ordinal: data can be ranked, but intervals between ranks may not be equivalent
Examples:
satisfaction rating
favorite color
type of pet
education level
blood type
Types of Variables: Quantitative
A quantitative or continuous variable takes numerical values for which arithmetic operations such as adding and averaging make sense; typically has a unit of measure.
Interval: meaningful differences between values, but no true zero point
Ratio: meaningful differences and a true zero point
Examples:
age (years)
temperature (Celsius)
daily hours of sleep
ACT or SAT score
height (inches)
Types of Variables: Example
Name: the pony’s name
Type: type of pony (Earth, Pegasus, Unicorn, Alicorn)
Sex: sex/age of pony (Coal, Filly, Stallion, Mare)
Flying speed: average flying speed (km/hr) for winged ponies
Friendship: a harmony index from friendship activities (0-10)
Magical energy: measured magical energy output (sparkles) for magical ponies
Tail shimmer: how much light reflected by the pony’s tail (lux)
Describing Data: Why?
Why do we describe data? We want to tell a story!
Summarize n observations into a single description
Understand what is in the data
Spot patterns, missing data, or outliers
Compare groups or spot differences or oddities
Describing Data: How?
How do we describe data?
Numbers
Frequency table
Mean & standard deviation
Median & IQR
Graphs
Bar charts
Box plots
Histograms
Point Estimation: Mean
Mean: the average of a set of values
\bar{y} = \frac{\sum_{i=1}^n y_i}{n}
Find the mean for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
\bar{y} = \frac{\sum_{i=1}^n y_i}{n} = \frac{10 + 20 + 30 + 40 + 100}{5} = 40
The average flying speed for winged ponies is 40 km/hr.
Point Estimation: Variance
Variance: A measure of spread; the average of squared differences from the mean.
Higher variance = data has more spread.
In squared units of the data.
s_y^2 = \frac{\sum_iy_i^2 - (\sum_iy_i)^2/n}{n-1}
Find the variance for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
s_y^2 = \frac{\sum_iy_i^2 - (\sum_iy_i)^2/n}{n-1} = \frac{(10^2+...+100^2)-(10+...+100)^2/5}{4} = 1250
The variance is 1250 (km/hr)2
Point Estimation: Standard Deviation
Standard Deviation: A measure of spread; the average distance from the mean.
Higher standard deviation = data has more spread.
Same units as the data.
s_y = \sqrt{s_y^2}
Find the standard deviation for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
s_y = \sqrt{s^2_y} = \sqrt{1250} \approx 35.36
The standard deviation is 35.36 km/hr.
Point Estimation: Range
Range: difference between the maximum and minimum values
\text{range} = \text{max}(y) - \text{min}(y)
Find the range for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
\begin{align*}
\text{range} = \text{max}(y) - \text{min}(y) = 100 - 10 = 90
\end{align*}
The range of the flying speeds is 90 km/hr.
Point Estimation: Interquartile Range
Interquartile Range (IQR): range of the middle 50% of the data.
\text{IQR} = \text{P}_{75} − \text{P}_{25}
Find the IQR for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
Recall that the median is 30.
We then find P_{25} using {10, 20} and P_{75} using {40, 100}
Thus, P_{25} = 15 and P_{75} = 70 .
\begin{align*}
\text{IQR} = \text{P}_{75} − \text{P}_{25} = 70 - 15 = 55
\end{align*}
The IQR of the flying speeds is 55 km/hr.
Point Estimation: Proportion
Proportion: a type of mean for categorical data
Often expressed as a percentage
Useful for categorical responses
\hat{p} = \frac{\sum_{i=1}^n y_i}{n},
y_i =
\begin{cases}
1 & \text{if in category }i \\
0 & \text{otherwise}
\end{cases}
Point Estimation: Proportion
Find the proportion of ponies that have wings in the following sample: {Y, N, Y, Y, N, Y}
Count the number of “Y” responses and divide by total:
\hat{p} = \frac{\sum_{i=1}^n y_i}{n} = \frac{4}{6} \approx 0.67
The proportion of ponies with wings is 0.667 (or 66.7%).
Point Estimation: Frequency Table
Frequency table: A table showing how often each value appears in a dataset.
Useful for categorical responses.
For each category, i , we report n_i (\%_i )
Find the freqency table for the following sample of 8 ponies: {Earth, Pegasus, Unicorn, Earth, Pegasus, Pegasus, Unicorn, Alicorn}
Frequencies:
Alicorn: n_{\text{A}} = 1
Earth: n_{\text{E}} = 2
Pegasus: n_{\text{P}} = 3
Unicorn: n_{\text{U}} = 2
Proportions:
Alicorn: \hat{p}_{\text{A}} = 1/8 = 0.125
Earth: \hat{p}_{\text{E}} = 2/8 = 0.250
Pegasus: \hat{p}_{\text{P}} = 3/8 = 0.375
Unicorn: \hat{p}_{\text{U}} = 2/8 = 0.250
Point Estimation: Frequency Table
Putting this into a table,
Point Estimation: Contingency Table
Contingency table: A table that summarizes two qualitative variables and their overlap.
We will not concern ourselves with the derivation, but will rely on R.
Consider this data,
Point Estimation: Contingency Table
The resulting contingency table would look someting like this:
We are using column totals as our denominators.
# A tibble: 4 × 3
pony_type No Yes
<chr> <chr> <chr>
1 Alicorn 0 (0.0%) 1 (25.0%)
2 Earth 2 (50.0%) 0 (0.0%)
3 Pegasus 0 (0.0%) 3 (75.0%)
4 Unicorn 2 (50.0%) 0 (0.0%)
Graphs: Box Plots
Box plots display the distribution of a continuous variable using the five number summary:
Whisker: Minimum
Beginning of box: 25th percentile (first quartile; Q1, P25)
“Middle” of box: Median (50th percentile, second quartile; Q2, P50)
End of box: 75th percentile (third quartile; Q3, P75)
Whisker: Maximum
We use box and whisker plots to get an idea of the spread and skewness of the data.
Note: there are different ways to define the whiskers.
I use the min/max as whiskers when sketching by hand.
ggplot()
uses 1.75 \times IQR.
Graphs: Histograms
Histograms show the distribution of a continuous variable.
What is the shape of the distribution?
Is the distribution symmetric? Skewed? How skewed?
Values are grouped into intervals (“bins”), then the bin height demonstrates how many values fall into that interval.
This allows us to quickly see if there are any oddities.
Increased proportion of a specific value/bin.
Zero inflation? Value used to indicate missing?
Any values that are “out in the tail”.
Outlier? Data entry error?
Graphs: Bar Graphs
Bar graphs display the distribution of categorical data.
The frequency or proportion of observations is displayed on the bar graph.
Bar graphs usually have categories on the x -axis and counts or proportions on the y -axis.
Note that we could flip the axes to create a vertical bar graph.
Note that the bars are separated on the x -axis to indicate the lack of continuity.
Graphs: Bar Graphs
Consider the bar graph, below.
Graphs: Side-by-Side Bar Graphs
Consider the bar graph, below.
Graphs: Stacked Bar Graphs
Consider the bar graph, below.
Graphs: Histograms vs Bar Graphs
We have now reviewed two “bar style” graphs that we see regularly: histograms and bar graphs.
We use histograms to see the distribution of continuous variables.
The x -axis represents numeric intervals.
The bars touch each other to represent continuity.
We use bar graphs to see the distribution of categorical variables.
The x -axis represents categories.
The bars do not touch each other, implying distinct categories.
Graphs: Scatterplots
Scatterplots allow us to look at the relationship between two continuous variables.
Each point on the graph represents one observation.
What statisticians use scatterplots for:
Explore patterns (aka trends or relationships).
Linear relationships.
Non-linear relationships.
Detect clusters of observations.
Find oddities in the data (outliers).
When we describe the relationship, we are really answering the question, “As x increases, what happens to y ?”
Graphs: Scatterplots
Consider the scatterplot, below.
Graphs: Scatterplots
Consider the scatterplot, below.
Graphs: Scatterplots
Consider the scatterplot, below.
Graphs: Scatterplots
Consider the scatterplot, below.
Graphs: Scatterplots
Consider the scatterplot, below.
Introduction to R
In this course, we will review formulas, but we will use R for computational purposes.
Remember to refer to the lecture notes for specific code needed.
Code is also available on this course’s GitHub repository .
You can install R and RStudio if you wish; both are free.
We also have access to the Posit Workbench (“the server”) through HMCSE.
I know that this is probably the first time you are seeing R (or any sort of programming).
That is why we have “R lab” time built in to our course.
Remember that I am not looking for perfection, but for competency.
Introduction to R
Please download today’s activity from Canvas and log into the server .
Click on “New session”
Click on “Create session”
Upload today’s activity to the server:
In the bottom right pane, click on the white square with a golden up arrow on in
Click on “Choose file”
Select the downloaded file
Click “Open”
Click “OK”
Open today’s activity on the server:
In the bottom right pane, scroll to the bottom
Click on the name of the .qmd file for today’s activity
Introduction to R: .R scripts
.R scripts:
Only allows code
Can comment out code using pound sign
Can run code line-by-line
Can run multiple lines of code at a time
Results output to Console window pane (bottom left)
I use .R scripts for my day-to-day analyses
Introduction to R: .qmd file
.qmd files:
Allow both text and code
Can comment out text using html code
Can comment out code in chunk using pound sign
Uses “code chunks” to evaluate code
Button: run all chunks before
Button: run this chunk
Ctrl+enter / cmd+return: line-by-line
Rendering results in .html file
I use .qmd files to create sharable documents
Reproducible research forever and always
Introduction to R: Disclaimer!
My major disclaimer as a biostatistician: I am a statistician first, programmer second.
My expertise is in statistics, not programming.
I do not know everything about R.
I do not claim to write the most efficient code.
Our goal is to correctly apply statistics to answer research questions using data.
R is a tool for us to apply statistics.
My major disclaimer as a professor: yes, I know this is likely the first time you are seeing R or programming in general.
We have “R lab” time built into the course.
Code you need to answer questions will be provided in lecture.
This means you must revisit lecture slides to find the code you need.
Introduction to R: Dr. Seals’s Expectations
I expect students to try their best . This includes:
referring back to lectures as needed.
asking when you have a question.
using the resources provided to learn .
You must know how to answer questions using R.
You will not be expected to write code beyond what is shown in class.
Note: Sometimes I include bonus questions…
When grading, I am looking for competency .
What is the appropriate analysis for the question at hand?
What are the assumptions of the analysis? Do we meet them?
Are the correct conclusions drawn given the information provided?
Functions in R: base R vs. packages
R functions are like baking recipes. They:
Take input (ingredients, or data),
Does something with it (follow recipe, or perform calculations),
Gives back a result (baked good, or statistics).
Some functions in R are available as soon as you open RStudio (this is “base R”).
Other functions are not available and must be called in after you start RStudio (these are “packages”).
e.g., after I call in library(tidyverse)
, I can use summarize(mean(), sd())
We will always need library(tidyverse)
because of %>%
(pipe operator).
Functions in R: tidyverse
library(tidyverse)
is a collection of R packages designed for data science.
All packages share a common philosophy and are meant to work together.
This is ideal for using the “same syntax” - I promise it’s better than base R!
Core library(tidyverse)
packages we will use:
library(readr)
: read in data files
library(dplyr)
: manipulate and summarize data
library(ggplot2)
: create data visualizations
If you are interested, there are resources:
Functions in R: Summarizing Continuous Data
We will use mean_median()
from library(ssstats)
to summarize continuous variables.
It will return both the mean (standard deviation) and median (IQR).
dataset_name %>%
mean_median (var1, var2, ...)
We can add group_by()
from library(tidyverse)
to split the summaries by categories.
dataset_name %>%
group_by (grouping_var1, grouping_var2, ...) %>%
mean_median (var1, var2, ...)
Functions in R: Summarizing Continuous Data
Let’s use mean_median()
to summarize the MLP dataset.
mlp_data %>%
mean_median (friendship, tail_shimmer, magical_energy)
Functions in R: Summarizing Continuous Data
Let’s use mean_median()
to summarize the MLP dataset.
mlp_data %>%
mean_median (friendship, tail_shimmer, magical_energy)
# A tibble: 3 × 3
variable mean_sd median_iqr
<chr> <chr> <chr>
1 friendship 7.6 (1.6) 8.0 (2.0)
2 magical_energy 9.9 (9.6) 7.0 (11.1)
3 tail_shimmer 256.6 (65.7) 253.0 (103.0)
Functions in R: Summarizing Continuous Data
Let’s use mean_median()
to summarize the MLP dataset by pony type.
mlp_data %>%
group_by (type) %>%
mean_median (friendship, tail_shimmer, magical_energy)
Functions in R: Summarizing Continuous Data
Let’s use mean_median()
to summarize the MLP dataset by pony type.
mlp_data %>%
group_by (type) %>%
mean_median (friendship, tail_shimmer, magical_energy)
# A tibble: 12 × 4
type variable mean_sd median_iqr
<chr> <chr> <chr> <chr>
1 Alicorn friendship 7.8 (1.3) 8.0 (2.0)
2 Earth friendship 7.6 (1.6) 8.0 (2.0)
3 Pegasus friendship 7.6 (1.6) 8.0 (2.0)
4 Unicorn friendship 7.5 (1.6) 8.0 (2.0)
5 Alicorn magical_energy 9.0 (8.0) 6.2 (10.9)
6 Earth magical_energy NaN (NA) NA (NA)
7 Pegasus magical_energy NaN (NA) NA (NA)
8 Unicorn magical_energy 9.9 (9.6) 7.0 (11.1)
9 Alicorn tail_shimmer 280.1 (64.5) 297.0 (104.0)
10 Earth tail_shimmer 252.2 (65.2) 246.0 (100.0)
11 Pegasus tail_shimmer 263.5 (67.0) 265.0 (110.0)
12 Unicorn tail_shimmer 261.2 (65.0) 260.0 (100.0)
Functions in R: Summarizing Categorical Data
dataset_name %>%
n_pct (var1)
For two variables – this returns n_{ij} \ (\%_{\text{col}}) :
dataset_name %>%
n_pct (var1, var2)
Functions in R: Summarizing Categorical Data
Let’s use n_pct()
to summarize the MLP dataset.
mlp_data %>%
n_pct (type, rows = 4 )
Functions in R: Summarizing Categorical Data
Let’s use n_pct()
to summarize the MLP dataset.
mlp_data %>%
n_pct (type, rows = 4 )
type n (pct)
Alicorn 41 (1.4%)
Earth 1678 (58.4%)
Pegasus 487 (17.0%)
Unicorn 665 (23.2%)
Functions in R: Summarizing Categorical Data
Let’s use n_pct()
to summarize the MLP dataset.
mlp_data %>%
n_pct (friendship, type, rows = 4 )
Functions in R: Summarizing Categorical Data
Let’s use n_pct()
to summarize the MLP dataset.
mlp_data %>%
n_pct (friendship, type, rows = 4 )
# A tibble: 4 × 5
friendship Alicorn Earth Pegasus Unicorn
<dbl> <chr> <chr> <chr> <chr>
1 1 0 (0.0%) 0 (0.0%) 0 (0.0%) 1 (0.2%)
2 2 0 (0.0%) 5 (0.3%) 1 (0.2%) 1 (0.2%)
3 3 0 (0.0%) 14 (0.8%) 6 (1.2%) 13 (2.0%)
4 4 2 (4.9%) 54 (3.2%) 14 (2.9%) 16 (2.4%)
Graphs in R: Using ggplot()
We will construct data visualizations using library(ggplot2)
, which loads in when we load library(tidyverse)
.
This package allows us to create a layered visualization.
ggplot()
creates the base layer.
geom_X()
creates the individual pieces.
geom_point()
creates a scatterplot.
geom_line()
creates connected lines.
geom_bar()
creates a bar chart.
geom_histogram()
creates a histogram.
Graphs in R: Using ggplot()
We use ggplot()
because it is very flexible - it allows us to customize every part of the graph.
Note that customization is less important in this course, but incredibly important in real life.
The R Graphics Cookbook is a great place to get basic code for graphs.
Remember that I do not expect you to memorize code. I do not have the code memorized.
Things I regularly ask Google for help with:
How to suppress the legend.
How to specify the tickmarks on the axis.
How to change the font size.
Graphs in R: The ggplot()
Layer
Calling ggplot()
creates the initial layer the graph lasagna.
Graphs in R: The ggplot()
Layer
We specify the aesthetics through aes()
in ggplot()
.
mlp_data %>% ggplot (aes (x = tail_shimmer, y = flying_speed))
Graphs in R: Overriding Defaults
We can override plot defaults using additional layers.
mlp_data %>%
ggplot (aes (x = tail_shimmer, y = flying_speed)) +
labs (x = "Tail Shimmer" ,
y = "Flying Speed" ) +
theme_bw ()
Graphs in R: Box Plots
Construct a box plot for the tail shimmer of the ponies (tail_shimmer ).
mlp_data %>% ggplot (aes (x = tail_shimmer)) +
geom_boxplot () +
labs (x = "Tail Shimmer" ) +
theme_bw () +
theme (axis.ticks.y = element_blank (),
axis.text.y = element_blank ())
Graphs in R: Box Plots
Construct a box plot for the tail shimmer of the ponies (tail_shimmer ).
Graphs in R: Box Plots
Construct a box plot for the tail shimmer of the ponies (tail_shimmer ).
mlp_data %>% ggplot (aes (y = tail_shimmer)) +
geom_boxplot () +
labs (y = "Tail Shimmer" ) +
theme_bw () +
theme (axis.ticks.x = element_blank (),
axis.text.x = element_blank ())
Graphs in R: Box Plots
Construct a box plot for the tail shimmer of the ponies (tail_shimmer ).
Graphs in R: Histograms
Construct a histogram for the flying speed of ponies (flying_speed ).
mlp_data %>% ggplot (aes (x = flying_speed)) +
geom_histogram (bins = 15 ,
color = "#2E7D32" ,
fill = "#4CAF50" ) +
labs (x = "Flying Speed" ,
y = "Number of Ponies" ) +
theme_bw ()
Graphs in R: Histograms
Describe the histogram of the flying speed of ponies (flying_speed ):
Graphs in R: Histograms
Construct a histogram for the magical energy of ponies (magical_energy ).
mlp_data %>% ggplot (aes (x = magical_energy)) +
geom_histogram (bins = 15 ,
color = "#8B6C42" ,
fill = "#F0E9DD" ) +
labs (x = "Magical Energy" ,
y = "Number of Ponies" ) +
theme_bw ()
Graphs in R: Histograms
Describe the histogram of the magical energy of ponies (magical_energy ):
Graphs in R: Bar Graphs
Construct a bar graph for the combined age and sex of ponies.
mlp_data %>%
count (sex) %>%
ggplot (aes (x = sex, y = n)) +
geom_col () +
labs (x = "Age and Sex of Pony" ,
y = "Number of Ponies" )+
theme_bw ()
Graphs in R: Bar Graphs
Construct a bar graph for the combined age and sex of ponies.
Graphs in R: Bar Graphs
Construct a bar graph for the type of pony.
mlp_data %>%
count (type) %>%
ggplot (aes (x = type, y = n)) +
geom_col () +
labs (x = "Type of Pony" ,
y = "Number of Ponies" )+
theme_bw ()
Graphs in R: Bar Graphs
Construct a bar graph for the type of pony.
Graphs in R: Scatterplots
Construct a scatterplot with magical energy (magical_energy ) on the x -axis and tail shimmer (tail_shimmer ) on the y -axis.
mlp_data %>% ggplot (aes (y = tail_shimmer, x = magical_energy)) +
geom_point (size = 2 ) +
labs (x = "Magical Energy" ,
y = "Tail Shimmer" ) +
theme_bw ()
Graphs in R: Scatterplots
Construct a scatterplot with magical energy (magical_energy ) on the x -axis and tail shimmer (tail_shimmer ) on the y -axis.
Graphs in R: Scatterplots
Construct a scatterplot with magical energy (magical_energy ) on the x -axis and flying speed (flying_speed ) on the y -axis.
mlp_data %>% ggplot (aes (x = magical_energy, y = flying_speed)) +
geom_point (size = 2 ) +
labs (y = "Tail Shimmer" ,
x = "Flying Speed (km/h)" ) +
theme_bw ()
Graphs in R: Scatterplots
Construct a scatterplot with magical energy (magical_energy ) on the x -axis and flying speed (flying_speed ) on the y -axis.
Wrap Up
We have covered (“reminded” ourselves of) a lot today!
Always remember that I do not expect you to:
Memorize code.
Produce code in a timed environment.
Automatically know how to do these things.
I do expect you to:
Use your resources (lecture slides, GitHub website, Discord).
Try your best.
Wrap Up
Today’s lecture:
Basic summaization of data.
Basic data visualization.
Introductory R.
Next class:
Review of statistical inference.
Confidence intervals and hypothesis tests.
One sample means.
Two sample means.
Independent data.
Dependent data.
Wrap Up
Daily activity: the .qmd we worked on during class.
Due date: Monday, June 23, 2025.
You will upload the resulting .html file on Canvas.
Please refer to the help guide on the Biostat website if you need help with submission.