Review of Technology

STA6349: Applied Bayesian Analysis
Spring 2025

Introduction

  • Welcome to Applied Bayesian Analysis - Spring 2025!
    • Canvas set up
    • Syllabus
    • Discord
    • R/RStudio
    • Quarto
    • GitHub
    • Resources

Introduction

  • General topics:
    • Probability rules and distributions
    • Bayes Theorem
    • Prior distributions
    • Posterior distributions
    • Conjugate families
    • Beta-Binomial, Normal-Normal, and Gamma-Poisson models
    • Posterior simulation
    • Posterior inference
    • Linear regression
  • This is an applied class.

GitHub

  • Our course lectures and labs are posted on GitHub.

  • Please bookmark the repository: GitHub for STA6349.

  • You will want to look at my .qmd files for formatting / \LaTeX purposes.

  • Feel free to poke around my GitHub to see materials for other classes.

R/RStudio

  • We will be using R in this course.
    • I use the RStudio IDE, however, if you would like to use another IDE, that is fine.
  • It is okay if you have not used R before!
  • Full disclosure: I am a biostatistician first, programmer second.
    • This means that I focus on the application of statistical methods and not on “understanding” the innerworkings of R.
      • R is a tool that we use, like how SAS, JMP, Stata, SPSS, Excel, etc. are tools.
    • Sometimes my code is not elegant/efficient, and that’s okay! Because our focus is on the application of methods, we are interested in the code working.
    • I have learned so much from my students since implementing R in the classroom.
      • Do not be afraid to teach me new things!
  • This is an applied class.

R/RStudio

  • You can install R and RStudio on your computer for free.
  • Alternative to installing: RStudio Server hosted by UWF HMCSE
  • Do not use Citrix.
  • I encourage you to install R on your own machine if you are able.
    • In the “real world,” you will not have access to the server.
    • Installing on your own machine will help your future self troubleshoot issues.

Tidy Data

Journal article: Tidy Data by Wickham (2014, Journal of Statistical Software)

Book chapter: Data Tidying by Wickham, Çetinkaya-Rundel, and Grolemund

  • There are three interrelated rules that make a dataset tidy:
    1. Each variable is a column; each column is a variable.
    2. Each observation is a row; each row is an observation.
    3. Each value is a cell; each cell is a single value.

Tidyverse

Tidyverse

  • tibble for modern data frames.
  • readr and haven for data import.
    • readr is pulled in with tidyverse
    • haven needs to be called in on its own
  • tidyr for data tidying.
  • dplyr for data manipulation.
  • ggplot2 for data visualization.
  • It is not possible for me to teach you everything you will ever need to know about programming in R.

Tidyverse

  • A major advantage of using tidyverse is the common “language” between the functions.
  • Another advantage: the pipe operator, %>%.
    • Yes, there is a pipe operator now included in base R. No, I do not use it.
    • By default, %>% deposits everything that came before into the first argument of the next function.
      • If we want to insert it elsewhere, we can indicate that with a “.” in the function.
lm(body_mass_g ~ flipper_length_mm, data = penguins)

penguins %>% lm(body_mass_g ~ flipper_length_mm, data = .)

Tidyverse

  • If we try to use a function before calling its package in, we will see an error.
sw <- tibble(starwars) %>% filter(mass < 100)
Error in tibble(starwars) %>% filter(mass < 100): could not find function "%>%"
  • We are good to go after calling in tidyverse.
library(tidyverse)
sw <- tibble(starwars) %>% filter(mass < 100)
head(sw, n=3)

Importing Data

jhs_csv <- read_csv("/path/to/folder/analysislong.csv")
head(jhs_csv)

Importing Data

  • Be comfortable with Googling for help with code to import data.
  • As a collaborative statistician, I have received the following file types:
    • .sas7bdat
    • .sav
    • .dat
    • .csv
    • .xls
    • .xlsx
    • .txt
    • Google Sheet
    • hand written

Importing Data

  • There have been times where I have received data as a .xlsx, but I can’t get it to import properly.
    • Usually, the issue is that there is a character variable with too much text.
    • Sometimes, it’s that the variable type changes mid-dataset.
      • i.e., both a number and a character stored in the same vector.
  • Sometimes the solution is saving it as a different file type (I default to .csv).
  • Get comfortable Googling error messages.
    • I am still consulting Dr. Google for assistance on a daily basis!
  • Try not to do any data management within the original file type!
    • We want to be able to retrace our steps.
    • Reproducible research!

Data Manipulation

  • Functions:
    • select(): Selecting columns.
    • filter(): Filtering the observations.
    • mutate(): Adding or transforming columns.
    • summarise(): Summarizing data.
    • group_by(): Grouping data for summary operations.
    • %>%: Pipelines.

Data Manipulation

  • select(): Selecting columns.
jhs_csv %>% 
  select(subjid, visit, age, sex) %>% 
  head(n=4)

Data Manipulation

  • filter(): Filtering rows.
jhs_csv %>% 
  filter(visit == 1) %>% 
  head(n=3)

Data Manipulation

  • mutate(): Adding or transforming columns.
jhs_csv %>% 
  filter(visit == 1) %>%
  select(subjid, sex) %>%
  mutate(male = if_else(sex == "Male", 1, 0)) %>%
  head(n=3)

Data Manipulation

  • summarise(): Summarizing data.
jhs_csv %>% 
  filter(visit == 1) %>%
  summarize(n = n(),
            mean_BMI = round(mean(BMI, na.rm = TRUE),2),
            sd_BMI = round(sd(BMI, na.rm = TRUE),2),
            n_female = sum(sex == "Female", na.rm = TRUE),
            pct_female = round(sum(sex == "Female", na.rm = TRUE)*100/n(),2))

Data Manipulation

  • group_by(): Grouping data for summary operations.
jhs_csv %>% 
  filter(visit == 1) %>%
  group_by(HTN) %>%
  summarize(n = n(),
            mean_BMI = round(mean(BMI, na.rm = TRUE),2),
            sd_BMI = round(sd(BMI, na.rm = TRUE),2),
            n_female = sum(sex == "Female", na.rm = TRUE),
            pct_female = round(sum(sex == "Female", na.rm = TRUE)*100/n(),2))

Wrap Up

  • Today we have gently introduced data management in R.

  • I do not expect you to become an expert R programmer, but the more you practice, the easier it becomes.

  • Today’s activity: Assignment 0