Introduction to Technology

STA6349: Applied Bayesian Analysis

Introduction

Welcome to Applied Bayesian Analysis - Fall 2024!
- Canvas set up
- Syllabus
- Discord
- R/RStudio
- Quarto
- GitHub
- Resources

Introduction

Days we definitely will not meet on Zoom:
- Monday 09/02 (Labor Day)
- Monday 10/14 (Dr. Seals traveling)
- Wednesday 10/16 (Project 1)
- Monday 11/11 (Veterans Day)
- Wednesday 11/20 (Project 2)
- Wednesday 11/27 (Thanksgiving)

Introduction

General topics:
- Probability rules and distributions
- Bayes' Theorem
- Prior distributions
- Posterior distributions
- Conjugate families
- Beta-Binomial, Normal-Normal, and Gamma-Poisson models
- Posterior simulation
- Posterior inference
- Linear regression
This is an applied class.

Weekly Schedule

Lecture weeks:
- Lecture Monday
- Lecture Wednesday
- Throughout: activities in breakout rooms
Project weeks:
- Short meeting on Monday to introduce project
- No meeting on Wednesday
- Project due the following Monday
Final Exam:
- There will be a proctored and written final exam on Wednesday, 12/04.
- If you are in Pensacola, the exam is 2:00-4:30 pm.
- If you are online, you will schedule your exam based on the guidelines from MathStat proctoring.

GitHub

Our course lectures and labs are posted on GitHub.
Please bookmark the repository: GitHub for STA6349.
You will want to look at my .qmd files for formatting / LaTeX purposes.
Feel free to poke around my GitHub to see materials for other classes.

R/RStudio

We will be using R in this course.
- I use the RStudio IDE; however, if you would like to use another IDE, that is fine.
It is okay if you have not used R before!
Full disclosure: I am a biostatistician first, programmer second.
- This means that I focus on the application of statistical methods and not on “understanding” the inner workings of R.
  - R is a tool that we use, like how SAS, JMP, Stata, SPSS, Excel, etc. are tools.
- Sometimes my code is not elegant/efficient, and that’s okay! Because our focus is on the application of methods, we are interested in the code working.
- I have learned so much from my students since implementing R in the classroom.
  - Do not be afraid to teach me new things!
This is an applied class.

R/RStudio

You can install R and RStudio on your computer for free.
- R from CRAN
- RStudio from Posit
Alternative to installing: RStudio Server hosted by UWF HMCSE
Do not use Citrix.
I encourage you to install R on your own machine if you are able.
- In the “real world,” you will not have access to the server.
- Installing on your own machine will help your future self troubleshoot issues.

Tidy Data

Journal article: Tidy Data by Wickham (2014, Journal of Statistical Software)

Book chapter: Data Tidying by Wickham, Çetinkaya-Rundel, and Grolemund

There are three interrelated rules that make a dataset tidy:
1. Each variable is a column; each column is a variable.
2. Each observation is a row; each row is an observation.
3. Each value is a cell; each cell is a single value.
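
As a minimal sketch of what the three rules look like in practice, the example below reshapes a small "wide" table into a tidy one with tidyr's pivot_longer(). The data and the column names (subject, sbp_visit1, sbp_visit2) are hypothetical and made up for illustration only.

```r
library(tidyverse)

# Hypothetical "wide" data: the visit number is hidden inside the column names,
# so "visit" is a variable that does not have its own column
wide <- tibble(
  subject    = c("A", "B"),
  sbp_visit1 = c(120, 135),
  sbp_visit2 = c(118, 140)
)

# Tidy version: one row per subject per visit, one column per variable
long <- wide %>%
  pivot_longer(
    cols            = starts_with("sbp_visit"),
    names_to        = "visit",
    names_prefix    = "sbp_visit",
    names_transform = list(visit = as.integer),
    values_to       = "sbp"
  )

long
```

After reshaping, visit and sbp are each a column, and every row is a single subject-visit observation.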

Tidyverse

tibble for modern data frames.
readr and haven for data import.
- readr is pulled in with tidyverse
- haven needs to be called in on its own
tidyr for data tidying.
dplyr for data manipulation.
ggplot2 for data visualization.
It is not possible for me to teach you everything you will ever need to know about programming in R.
- Good resource for tidyverse: data science in a box
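
The difference between readr and haven can be sketched as follows. To keep the example self-contained, it writes a small, hypothetical tibble to a temporary SPSS file and reads it back; in practice the file would come from a collaborator.

```r
library(tidyverse)  # attaches readr, dplyr, tidyr, ggplot2, tibble, ...
library(haven)      # not attached by library(tidyverse); call it in on its own

# Hypothetical toy data, written to a temporary .sav file for illustration
toy <- tibble(id = 1:3, bmi = c(24.1, 31.7, 28.3))
sav_path <- tempfile(fileext = ".sav")
write_sav(toy, sav_path)

# read_sav() comes from haven, so it is only found after library(haven)
toy_sav <- read_sav(sav_path)
toy_sav
```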

Tidyverse

A major advantage of using tidyverse is the common “language” between the functions.
Another advantage: the pipe operator, %>%.
- Yes, base R now includes its own pipe operator, |> (since R 4.1.0). No, I do not use it.
  - Here is a discussion of similarities and differences from Hadley himself.
- By default, %>% deposits everything that came before into the first argument of the next function.
  - If we want to insert it elsewhere, we can indicate that with a “.” in the function.

lm(body_mass_g ~ flipper_length_mm, data = penguins)

penguins %>% lm(body_mass_g ~ flipper_length_mm, data = .)
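
A quick way to convince yourself that the two calls above are equivalent is to fit both and compare the coefficients. The sketch below uses mtcars (which ships with base R) instead of penguins, only so that it runs without an extra data package; the object names fit_direct and fit_piped are made up for illustration.

```r
library(magrittr)  # provides %>%; library(tidyverse) also makes it available

# Without the pipe: the data frame is passed through the data argument
fit_direct <- lm(mpg ~ wt, data = mtcars)

# With the pipe: the "." tells %>% to place mtcars in the data argument
# rather than in the first argument of lm()
fit_piped <- mtcars %>% lm(mpg ~ wt, data = .)

# Both calls fit the same model
all.equal(coef(fit_direct), coef(fit_piped))
```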

Tidyverse

If we try to use a function before calling its package in, we will see an error.

sw <- tibble(starwars) %>% filter(mass < 100)

Error in tibble(starwars) %>% filter(mass < 100): could not find function "%>%"

We are good to go after calling in tidyverse.

library(tidyverse)
sw <- tibble(starwars) %>% filter(mass < 100)
head(sw, n=3)

Importing Data

Let’s import data from the Jackson Heart Study.

jhs_csv <- read_csv("/path/to/folder/analysislong.csv")
head(jhs_csv, n = 6)

Importing Data

Be comfortable with Googling for help with code to import data.
As a collaborative statistician, I have received the following file types:
- .sas7bdat
- .sav
- .dat
- .csv
- .xls
- .xlsx
- .txt
- Google Sheet
- handwritten
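
As a rough guide, each file type in the list above has a matching import function. The sketch below round-trips a small, hypothetical tibble through temporary .csv and .txt files so that it runs on its own; the commented lines name the readers I would reach for with the other formats, assuming the haven and readxl packages are installed.

```r
library(tidyverse)

# Hypothetical toy data, written to temporary files for illustration
toy <- tibble(id = 1:3, group = c("a", "b", "a"))
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")
write_csv(toy, csv_path)
write_tsv(toy, txt_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)  # comma-delimited
from_txt <- read_tsv(txt_path, show_col_types = FALSE)  # tab-delimited text

# Other formats follow the same pattern with a different reader, e.g.
#   haven::read_sas()     for .sas7bdat
#   haven::read_sav()     for .sav
#   readxl::read_excel()  for .xls and .xlsx
```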

Importing Data

There have been times when I have received data as a .xlsx, but I can’t get it to import properly.
- Usually, the issue is that there is a character variable with too much text.
- Sometimes, it’s that the variable type changes mid-dataset.
  - i.e., both a number and a character stored in the same vector.
Sometimes the solution is saving it as a different file type (I default to .csv).
Get comfortable Googling error messages.
- I am still consulting Dr. Google for assistance on a daily basis!
Try not to do any data management within the original file type!
- We want to be able to retrace our steps.
- Reproducible research!
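
One possible way to handle a column whose type changes mid-dataset, without editing the original file, is to import that column as text and convert it in code. This is a minimal sketch under that assumption, not the only fix; the inline data and the column names (id, value, value_num) are hypothetical.

```r
library(tidyverse)

# Hypothetical raw file contents: "value" is numeric except for one text entry
raw <- I("id,value\n1,10.5\n2,unknown\n3,12.0\n")

# Read the problem column as character so nothing is lost at import
dat <- read_csv(raw,
                col_types = cols(id = col_integer(),
                                 value = col_character()))

# Convert in code: entries that are not numbers become NA,
# and the original text stays available in "value" for review
dat <- dat %>%
  mutate(value_num = suppressWarnings(as.numeric(value)))

dat
```

Because the cleaning happens in a script rather than in the spreadsheet, every step can be retraced later.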

Data Manipulation

Functions:
- select(): Selecting columns.
- filter(): Filtering the observations.
- mutate(): Adding or transforming columns.
- summarise(): Summarizing data.
- group_by(): Grouping data for summary operations.
- %>%: Pipelines.

Data Manipulation

select(): Selecting columns.

jhs_csv %>% 
  select(subjid, visit, age, sex) %>% 
  head(n=4)

Data Manipulation

filter(): Filtering rows.

jhs_csv %>% 
  filter(visit == 1) %>% 
  head(n=3)

Data Manipulation

mutate(): Adding or transforming columns.

jhs_csv %>% 
  filter(visit == 1) %>%
  select(subjid, sex) %>%
  mutate(male = if_else(sex == "Male", 1, 0)) %>%
  head(n=3)

Data Manipulation

summarise(): Summarizing data. (summarize() is an identical alias.)

jhs_csv %>% 
  filter(visit == 1) %>%
  summarize(n = n(),
            mean_BMI = round(mean(BMI, na.rm = TRUE),2),
            sd_BMI = round(sd(BMI, na.rm = TRUE),2),
            n_female = sum(sex == "Female", na.rm = TRUE),
            pct_female = round(sum(sex == "Female", na.rm = TRUE)*100/n(),2))

Data Manipulation

group_by(): Grouping data for summary operations.

jhs_csv %>% 
  filter(visit == 1) %>%
  group_by(HTN) %>%
  summarize(n = n(),
            mean_BMI = round(mean(BMI, na.rm = TRUE),2),
            sd_BMI = round(sd(BMI, na.rm = TRUE),2),
            n_female = sum(sex == "Female", na.rm = TRUE),
            pct_female = round(sum(sex == "Female", na.rm = TRUE)*100/n(),2))
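
The Jackson Heart Study file is not bundled with R, so here is a sketch of the same verbs chained together on the starwars data that comes with dplyr. The variable human and the summary column names are made up for illustration.

```r
library(tidyverse)

sw_summary <- starwars %>%
  filter(!is.na(mass)) %>%                   # keep rows with a recorded mass
  select(name, species, height, mass) %>%    # keep only the columns we need
  # 1/0 indicator; rows with a missing species stay NA
  mutate(human = if_else(species == "Human", 1, 0)) %>%
  group_by(human) %>%                        # one summary row per group
  summarize(n = n(),
            mean_mass = round(mean(mass, na.rm = TRUE), 2),
            sd_mass = round(sd(mass, na.rm = TRUE), 2))

sw_summary
```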

Wrap Up

Today we have gently introduced data management in R.
I do not expect you to become an expert R programmer, but the more you practice, the easier it becomes.
Today’s activity: Quiz 0