Introduction to Technology

STA6349: Applied Bayesian Analysis

Introduction

  • Welcome to Applied Bayesian Analysis - Fall 2024!

    • Canvas set up
    • Syllabus
    • Discord
    • R/RStudio
    • Quarto
    • GitHub
    • Resources

Introduction

  • Days we definitely will not meet on Zoom:

    • Monday 09/02 (Labor Day)
    • Monday 10/14 (Dr. Seals traveling)
    • Wednesday 10/16 (Project 1)
    • Monday 11/11 (Veterans Day)
    • Wednesday 11/20 (Project 2)
    • Wednesday 11/27 (Thanksgiving)

Introduction

  • General topics:

    • Probability rules and distributions
    • Bayes' Theorem
    • Prior distributions
    • Posterior distributions
    • Conjugate families
    • Beta-Binomial, Normal-Normal, and Gamma-Poisson models
    • Posterior simulation
    • Posterior inference
    • Linear regression
  • This is an applied class.

Weekly Schedule

  • Lecture weeks:

    • Lecture Monday
    • Lecture Wednesday
    • Throughout: activities in breakout rooms
  • Project weeks:

    • Short meeting on Monday to introduce project
    • No meeting on Wednesday
    • Project due the following Monday
  • Final Exam:

    • There will be a proctored and written final exam on Wednesday, 12/04.
    • If you are in Pensacola, the exam is 2:00-4:30 pm.
    • If you are online, you will schedule your exam based on the guidelines from MathStat proctoring.

GitHub

  • Our course lectures and labs are posted on GitHub.

  • Please bookmark the repository: GitHub for STA6349.

  • You will want to look at my .qmd files for formatting / LaTeX purposes.

  • Feel free to poke around my GitHub to see materials for other classes.

R/RStudio

  • We will be using R in this course.

    • I use the RStudio IDE; however, if you would like to use another IDE, that is fine.
  • It is okay if you have not used R before!

  • Full disclosure: I am a biostatistician first, programmer second.

    • This means that I focus on the application of statistical methods and not on “understanding” the inner workings of R.

      • R is a tool that we use, like how SAS, JMP, Stata, SPSS, Excel, etc. are tools.
    • Sometimes my code is not elegant/efficient, and that’s okay! Because our focus is on the application of methods, we are interested in the code working.

    • I have learned so much from my students since implementing R in the classroom.

      • Do not be afraid to teach me new things!
  • This is an applied class.

R/RStudio

  • You can install R and RStudio on your computer for free.

  • Alternative to installing: RStudio Server hosted by UWF HMCSE

  • Do not use Citrix.

  • I encourage you to install R on your own machine if you are able.

    • In the “real world,” you will not have access to the server.

    • Installing on your own machine will help your future self troubleshoot issues.

Tidy Data

Journal article: Tidy Data by Wickham (2014, Journal of Statistical Software)

Book chapter: Data Tidying by Wickham, Çetinkaya-Rundel, and Grolemund

  • There are three interrelated rules that make a dataset tidy:

    1. Each variable is a column; each column is a variable.
    2. Each observation is a row; each row is an observation.
    3. Each value is a cell; each cell is a single value.
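
To make the three rules concrete, here is a minimal sketch that reshapes an untidy table into a tidy one with tidyr's pivot_longer(). The data frame bp_wide, its column names, and its values are made up for illustration; they are not course data.

```r
library(tidyverse)

# Hypothetical untidy data: the variable "visit" is hidden in the column names
bp_wide <- tibble(
  subjid  = c(1, 2),
  visit_1 = c(120, 135),
  visit_2 = c(118, 140)
)

# Tidy version: one row per subject-visit, one column per variable
bp_long <- bp_wide %>%
  pivot_longer(
    cols         = starts_with("visit_"),
    names_to     = "visit",
    names_prefix = "visit_",   # drop the "visit_" text, keep the visit number
    values_to    = "sbp"
  )

bp_long
```

In bp_long, each variable (subjid, visit, sbp) has its own column and each subject-visit measurement has its own row.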

Tidyverse

  • tibble for modern data frames.

  • readr and haven for data import.

    • readr is pulled in with tidyverse
    • haven needs to be called in on its own
  • tidyr for data tidying.

  • dplyr for data manipulation.

  • ggplot2 for data visualization.

  • It is not possible for me to teach you everything you will ever need to know about programming in R.

Tidyverse

  • A major advantage of using tidyverse is the common “language” between the functions.

  • Another advantage: the pipe operator, %>%.

    • Yes, base R now includes its own pipe operator, |> (since R 4.1.0). No, I do not use it.

    • By default, %>% deposits everything that came before into the first argument of the next function.

      • If we want to insert it elsewhere, we can indicate that with a “.” in the function.
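# Standard call: the data frame is passed by name to the data argument
# (assumes the penguins data, e.g. from the palmerpenguins package, is loaded)
lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Piped call: "." marks where the piped data frame should go
penguins %>% lm(body_mass_g ~ flipper_length_mm, data = .)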
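
The two pipe behaviors can be checked side by side. This is a small sketch that swaps in the built-in mtcars data (not the penguins data) so that it runs without any extra data package; the object names are only for illustration.

```r
library(magrittr)  # provides %>%; it is also available after library(tidyverse)

# Default: the left-hand side becomes the FIRST argument of the next function
x <- c(1, 4, 9)
piped_sqrt <- x %>% sqrt()   # same as sqrt(x)

# lm() takes the formula first and the data second,
# so "." tells the pipe where to put the data frame
fit_direct <- lm(mpg ~ wt, data = mtcars)
fit_piped  <- mtcars %>% lm(mpg ~ wt, data = .)

# Both comparisons should come back TRUE
all.equal(piped_sqrt, sqrt(x))
all.equal(coef(fit_direct), coef(fit_piped))
```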

Tidyverse

  • If we try to use a function before calling its package in, we will see an error.
sw <- tibble(starwars) %>% filter(mass < 100)
Error in tibble(starwars) %>% filter(mass < 100): could not find function "%>%"
  • We are good to go after calling in tidyverse.
library(tidyverse)
sw <- tibble(starwars) %>% filter(mass < 100)
head(sw, n=3)

Importing Data

jhs_csv <- read_csv("/path/to/folder/analysislong.csv")
head(jhs_csv, n = 6)
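
Because the path above is a placeholder, here is a self-contained sketch of the same import step. It writes a tiny made-up CSV to a temporary file and reads it back with read_csv(); the file contents and the name toy_csv are hypothetical.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example runs anywhere
tmp <- tempfile(fileext = ".csv")
writeLines(c("subjid,visit,BMI",
             "J001,1,27.4",
             "J001,2,28.1",
             "J002,1,31.0"), tmp)

# read_csv() guesses each column's type; show_col_types = FALSE hides the message
toy_csv <- read_csv(tmp, show_col_types = FALSE)

head(toy_csv, n = 2)  # first two rows
spec(toy_csv)         # the column types readr guessed
```

Checking spec() right after import is a quick way to catch a numeric column that was read in as text.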

Importing Data

  • Be comfortable with Googling for help with code to import data.

  • As a collaborative statistician, I have received the following file types:

    • .sas7bdat
    • .sav
    • .dat
    • .csv
    • .xls
    • .xlsx
    • .txt
    • Google Sheet
    • handwritten
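
For SAS and SPSS files, the haven package supplies the import functions: read_sas() for .sas7bdat and read_sav() for .sav. The sketch below is only an illustration: it creates a made-up data frame, saves it as a temporary .sav file, and reads it back, since no real data file is available here.

```r
library(haven)  # not attached by library(tidyverse); call it in on its own

# Made-up data, saved as an SPSS file in a temporary location
toy <- data.frame(subjid = c(1, 2, 3), age = c(54, 61, 47))
tmp_sav <- tempfile(fileext = ".sav")
write_sav(toy, tmp_sav)

# Import works the same way as read_csv(): give the function a file path
toy_sav <- read_sav(tmp_sav)
toy_sav
```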

Importing Data

  • There have been times when I have received data as a .xlsx, but I can’t get it to import properly.

    • Usually, the issue is that there is a character variable with too much text.

    • Sometimes, it’s that the variable type changes mid-dataset.

      • i.e., both a number and a character stored in the same vector.
  • Sometimes the solution is saving it as a different file type (I default to .csv).

  • Get comfortable Googling error messages.

    • I am still consulting Dr. Google for assistance on a daily basis!
  • Try not to do any data management within the original file type!

    • We want to be able to retrace our steps.

    • Reproducible research!
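
The "variable type changes mid-dataset" problem can be reproduced in a few lines of base R. The values below are made up; the point is only that a vector holds one type, so a single text entry turns the whole column into text.

```r
# One stray text entry forces every value in the vector to become character
weight_raw <- c(70, 82.5, "unknown", 91)
class(weight_raw)  # "character"

# Converting back to numeric: anything that is not a number becomes NA
weight_num <- suppressWarnings(as.numeric(weight_raw))
weight_num

# Flag the offending positions so they can be checked against the source
bad_rows <- which(is.na(weight_num) & !is.na(weight_raw))
bad_rows
```

Doing this fix in code, rather than by editing the original file, keeps the steps retraceable.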

Data Manipulation

  • Functions:

    • select(): Selecting columns.
    • filter(): Filtering the observations.
    • mutate(): Adding or transforming columns.
    • summarise(): Summarizing data.
    • group_by(): Grouping data for summary operations.
    • %>%: Pipelines.

Data Manipulation

  • select(): Selecting columns.
jhs_csv %>% 
  select(subjid, visit, age, sex) %>% 
  head(n=4)

Data Manipulation

  • filter(): Filtering rows.
jhs_csv %>% 
  filter(visit == 1) %>% 
  head(n=3)

Data Manipulation

  • mutate(): Adding or transforming columns.
jhs_csv %>% 
  filter(visit == 1) %>%
  select(subjid, sex) %>%
  mutate(male = if_else(sex == "Male", 1, 0)) %>%
  head(n=3)
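
Here is a runnable version of the same idea on a made-up tibble (toy_jhs is hypothetical and only borrows the column names). It also shows what if_else() does when sex is missing: the new indicator is NA rather than 0.

```r
library(tidyverse)

# Made-up data with one missing value for sex
toy_jhs <- tibble(
  subjid = c("J001", "J002", "J003"),
  sex    = c("Male", "Female", NA)
)

# 1 for Male, 0 for Female, and NA stays NA
toy_male <- toy_jhs %>%
  mutate(male = if_else(sex == "Male", 1, 0))

toy_male
```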

Data Manipulation

  • summarise(): Summarizing data. (dplyr accepts both spellings; summarize() is the same function.)
jhs_csv %>% 
  filter(visit == 1) %>%
  summarize(n = n(),
            mean_BMI = round(mean(BMI, na.rm = TRUE),2),
            sd_BMI = round(sd(BMI, na.rm = TRUE),2),
            n_female = sum(sex == "Female", na.rm = TRUE),
            pct_female = round(sum(sex == "Female", na.rm = TRUE)*100/n(),2))
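
One detail in the percentage above: n() counts every row, including rows where sex is missing. The sketch below uses a made-up tibble (toy_visit1 is hypothetical) to compare that denominator with one that uses only the rows where sex is known.

```r
library(tidyverse)

# Made-up data: 4 rows, one with missing sex
toy_visit1 <- tibble(sex = c("Female", "Female", "Male", NA))

toy_summary <- toy_visit1 %>%
  summarize(n = n(),
            n_missing_sex = sum(is.na(sex)),
            # denominator = all rows (what the code above does)
            pct_female_all = round(sum(sex == "Female", na.rm = TRUE) * 100 / n(), 2),
            # denominator = rows with sex recorded
            pct_female_known = round(mean(sex == "Female", na.rm = TRUE) * 100, 2))

toy_summary
```

Neither denominator is wrong; which one to report depends on how missing values should be treated in the analysis.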

Data Manipulation

  • group_by(): Grouping data for summary operations.
jhs_csv %>% 
  filter(visit == 1) %>%
  group_by(HTN) %>%
  summarize(n = n(),
            mean_BMI = round(mean(BMI, na.rm = TRUE),2),
            sd_BMI = round(sd(BMI, na.rm = TRUE),2),
            n_female = sum(sex == "Female", na.rm = TRUE),
            pct_female = round(sum(sex == "Female", na.rm = TRUE)*100/n(),2))
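
To see what group_by() changes, here is a small sketch on made-up data (toy_htn is hypothetical; HTN is coded 0/1 here as an assumption). The same summarize() call returns one row per group instead of one row overall.

```r
library(tidyverse)

# Made-up data: two people without hypertension, three with
toy_htn <- tibble(
  HTN = c(0, 0, 1, 1, 1),
  BMI = c(24, 26, 31, NA, 29)
)

# One row per HTN group; na.rm = TRUE drops the missing BMI before averaging
toy_by_htn <- toy_htn %>%
  group_by(HTN) %>%
  summarize(n = n(),
            mean_BMI = round(mean(BMI, na.rm = TRUE), 2))

toy_by_htn
```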

Wrap Up

  • Today we have gently introduced data management in R.

  • I do not expect you to become an expert R programmer, but the more you practice, the easier it becomes.

  • Today’s activity: Quiz 0