Alejandra Plazola and Map5667
This is the 'College' dataset, which contains information about college majors across the U.S. and their graduates' median incomes. The data come from the American Community Survey 2010-2012 Public Use Microdata Series, via TidyTuesday: https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-10-16. It has 21 variables and 173 observations. The variables are Rank, Major_code, Major, Major_category, Total, Sample_size, Men, Women, ShareWomen, Employed, Full_time, Part_time, Full_time_year_round, Unemployed, Unemployment_rate, Median, P25th, P75th, College_jobs, Non_college_jobs, and Low_wage_jobs.
library(mvtnorm)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(lmtest)
library(glmnet)
College <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-10-16/recent-grads.csv")
cg <- College %>% na.omit()
College %>% group_by(Major_category) %>% count()
## # A tibble: 16 x 2
## # Groups: Major_category [16]
## Major_category n
## <chr> <int>
## 1 Agriculture & Natural Resources 10
## 2 Arts 8
## 3 Biology & Life Science 14
## 4 Business 13
## 5 Communications & Journalism 4
## 6 Computers & Mathematics 11
## 7 Education 16
## 8 Engineering 29
## 9 Health 12
## 10 Humanities & Liberal Arts 15
## 11 Industrial Arts & Consumer Services 7
## 12 Interdisciplinary 1
## 13 Law & Public Policy 5
## 14 Physical Sciences 10
## 15 Psychology & Social Work 9
## 16 Social Science 9
# Subset the data to keep only five Major_category levels
College <- College %>% filter(Major_category %in% c("Business",
    "Computers & Mathematics", "Education", "Engineering",
    "Humanities & Liberal Arts"))
ggplot(College, aes(x = Employed, y = Median)) + geom_point(alpha = 0.1) +
geom_density_2d(h = 2) + coord_fixed() + facet_wrap(~Major_category)
MANC <- manova(cbind(Employed, Median) ~ Major_category, data = College)
summary(MANC)
## Df Pillai approx F num Df den Df Pr(>F)
## Major_category 4 0.77805 12.575 8 158 6.901e-14 ***
## Residuals 79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(MANC)
## Response Employed :
## Df Sum Sq Mean Sq F value Pr(>F)
## Major_category 4 4.5140e+10 1.1285e+10 5.0807 0.001074 **
## Residuals 79 1.7547e+11 2.2211e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Median :
## Df Sum Sq Mean Sq F value Pr(>F)
## Major_category 4 9576798960 2394199740 28.774 9.28e-15 ***
## Residuals 79 6573456755 83208313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
College %>% group_by(Major_category) %>% summarize(mean(Employed),
mean(Median))
## # A tibble: 5 x 3
## Major_category `mean(Employed)` `mean(Median)`
## <chr> <dbl> <dbl>
## 1 Business 83749. 43538.
## 2 Computers & Mathematics 21627. 42745.
## 3 Education 29990. 32350
## 4 Engineering 14496. 57383.
## 5 Humanities & Liberal Arts 36275. 31913.
pairwise.t.test(College$Employed, College$Major_category, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: College$Employed and College$Major_category
##
## Business Computers & Mathematics Education
## Computers & Mathematics 0.0019 - -
## Education 0.0031 0.6517 -
## Engineering 3.3e-05 0.6703 0.2943
## Humanities & Liberal Arts 0.0095 0.4360 0.7116
## Engineering
## Computers & Mathematics -
## Education -
## Engineering -
## Humanities & Liberal Arts 0.1502
##
## P value adjustment method: none
pairwise.t.test(College$Median, College$Major_category, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: College$Median and College$Major_category
##
## Business Computers & Mathematics Education
## Computers & Mathematics 0.8325 - -
## Education 0.0015 0.0047 -
## Engineering 1.9e-05 2.1e-05 2.3e-13
## Humanities & Liberal Arts 0.0012 0.0037 0.8944
## Engineering
## Computers & Mathematics -
## Education -
## Engineering -
## Humanities & Liberal Arts 2.6e-13
##
## P value adjustment method: none
# Probability of at least one type I error
1 - (0.95^23)
## [1] 0.6926431
# Bonferroni Correction
0.05/23
## [1] 0.002173913
A one-way MANOVA was conducted to determine the effect of Major_category (Engineering, Business, Computers & Mathematics, Education, and Humanities & Liberal Arts) on two dependent variables (Median and Employed). The MANOVA found a significant difference among the 5 major categories: Pillai trace = 0.77805, pseudo F(8, 158) = 12.575, p < 0.0001. Univariate ANOVAs for each dependent variable were conducted as follow-up tests, using the Bonferroni method to control the Type I error rate for multiple comparisons. The univariate ANOVAs for Employed and Median were both significant, F(4, 79) = 5.0807, p < .005, and F(4, 79) = 28.774, p < .0001, respectively. Post hoc pairwise t tests were then performed to determine which major categories differ on Employed and Median; the Bonferroni-corrected results are discussed below.
MANOVA has several assumptions, including multivariate normality, homogeneity of covariance matrices across groups, linear relationships among the dependent variables, no extreme univariate or multivariate outliers, and no multicollinearity. Based on the bivariate density plot for each group, it is likely that not all assumptions are met, as some groups show visible outliers.
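As a rough screen of the normality assumption, univariate Shapiro-Wilk tests can be run on each response within each group; this is only a sketch, since univariate normality is necessary but not sufficient for multivariate normality.
# Rough per-group normality screen for each response variable
# (univariate only; does not establish multivariate normality)
College %>% group_by(Major_category) %>%
    summarize(p_Employed = shapiro.test(Employed)$p.value,
        p_Median = shapiro.test(Median)$p.value)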
I performed 1 MANOVA, 2 ANOVAs, and 20 t tests, 23 tests in all, so the Bonferroni-corrected significance level is α = .05/23 = .0021739. The probability of at least one Type I error, unadjusted, is 0.6926431. At the corrected level, the pairwise results are more selective than the unadjusted tables suggest: for Employed, only Business differs significantly from Computers & Mathematics and from Engineering; for Median, Engineering differs from all four other categories, and Business differs from Education and from Humanities & Liberal Arts. Both response variables therefore still show significant group differences after the Bonferroni correction.
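The same conclusion can be checked by letting R adjust the p-values directly; a sketch (note that pairwise.t.test adjusts only within each family of 10 pairwise tests, not across all 23 tests, so it is slightly less conservative than α = .05/23).
# Bonferroni-adjusted pairwise tests; compare to alpha = 0.05 directly
pairwise.t.test(College$Employed, College$Major_category, p.adj = "bonferroni")
pairwise.t.test(College$Median, College$Major_category, p.adj = "bonferroni")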
set.seed(348)
summary(aov(Median ~ Major_category, data = College))
## Df Sum Sq Mean Sq F value Pr(>F)
## Major_category 4 9.577e+09 2.394e+09 28.77 9.28e-15 ***
## Residuals 79 6.573e+09 8.321e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
obs_F <- 28.77 #this is the observed F-statistic
Fs <- replicate(5000, {
# do everything in curly braces 5000 times and save the
# output
new <- College %>% mutate(Med = sample(Median)) #randomly permute response variable
# compute the F-statistic by hand
SSW <- new %>% group_by(Major_category) %>% summarize(SSW = sum((Med -
mean(Med))^2)) %>% summarize(sum(SSW)) %>% pull
SSB <- new %>% mutate(mean = mean(Med)) %>% group_by(Major_category) %>%
mutate(groupmean = mean(Med)) %>% summarize(SSB = sum((mean -
groupmean)^2)) %>% summarize(sum(SSB)) %>% pull
    (SSB/4)/(SSW/79) #compute F statistic (num df = K-1 = 5-1 = 4, denom df = N-K = 84-5 = 79)
})
hist(Fs, prob = T)
abline(v = obs_F, col = "red")
mean(Fs > obs_F)
## [1] 0
The null hypothesis is that the true mean of Median income is the same for all 5 major category groups (Engineering, Computers & Mathematics, Education, Humanities & Liberal Arts, and Business). The alternative hypothesis is that at least one of the group means differs. I conducted a randomization test on the ANOVA F statistic; the p-value, mean(Fs > obs_F), is 0. None of the 5000 F statistics generated under the null hypothesis exceeded the observed F of 28.77, so we reject the null hypothesis and conclude that mean Median income differs across the major category groups.
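One fragile spot above is the hard-coded, rounded obs_F; a small sketch pulling the observed statistic straight from the aov table instead:
# Extract the observed F-statistic from the aov table rather than
# hard-coding the rounded value 28.77
obs_F <- summary(aov(Median ~ Major_category, data = College))[[1]]$`F value`[1]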
College$M <- College$Men - mean(College$Men)
College$W <- College$Women - mean(College$Women)
fit1 <- lm(Median ~ W + M, data = College)
summary(fit1)
##
## Call:
## lm(formula = Median ~ W + M, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23071 -9453 -2661 6324 64526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+04 1.435e+03 30.661 < 2e-16 ***
## W -1.933e-01 5.506e-02 -3.511 0.000733 ***
## M 1.414e-01 6.048e-02 2.339 0.021822 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13150 on 81 degrees of freedom
## Multiple R-squared: 0.1321, Adjusted R-squared: 0.1107
## F-statistic: 6.165 on 2 and 81 DF, p-value: 0.00322
ggplot(College, aes(y = M, x = W, color = Median)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
qqnorm(College$M)
qqnorm(College$W)
shapiro.test(head(College$M))
##
## Shapiro-Wilk normality test
##
## data: head(College$M)
## W = 0.56745, p-value = 0.000182
shapiro.test(head(College$W))
##
## Shapiro-Wilk normality test
##
## data: head(College$W)
## W = 0.51881, p-value = 4.234e-05
bptest(fit1)
##
## studentized Breusch-Pagan test
##
## data: fit1
## BP = 1.7016, df = 2, p-value = 0.4271
library(sandwich)
coeftest(fit1, vcov = vcovHC(fit1))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44007.14286 1453.53766 30.2759 < 2.2e-16 ***
## W -0.19331 0.06623 -2.9188 0.004548 **
## M 0.14143 0.05465 2.5880 0.011439 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit1)$r.sq
## [1] 0.1321102
fit2 <- lm(Median ~ Men, data = College) #main effect of Men only
summary(fit2)
##
## Call:
## lm(formula = Median ~ Men, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22054 -10295 -3865 7221 65950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.405e+04 1.796e+03 24.529 <2e-16 ***
## Men -2.400e-03 4.746e-02 -0.051 0.96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14030 on 82 degrees of freedom
## Multiple R-squared: 3.118e-05, Adjusted R-squared: -0.01216
## F-statistic: 0.002557 on 1 and 82 DF, p-value: 0.9598
fit3 <- lm(Median ~ Women, data = College)
summary(fit3)
##
## Call:
## lm(formula = Median ~ Women, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24115 -10415 -950 7291 63812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.622e+04 1.710e+03 27.031 <2e-16 ***
## Women -1.061e-01 4.159e-02 -2.551 0.0126 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13510 on 82 degrees of freedom
## Multiple R-squared: 0.07351, Adjusted R-squared: 0.06221
## F-statistic: 6.506 on 1 and 82 DF, p-value: 0.01261
# install.packages('interactions')
library(interactions)
interact_plot(fit1, M, W) #interaction plot
Controlling for the centered number of men (M), each additional woman in a major is associated with a 0.193 dollar decrease in median income; controlling for the centered number of women (W), each additional man is associated with a 0.141 dollar increase. The linearity assumption appears violated based on the ggplot, and the q-q plots do not suggest normality. The Shapiro-Wilk tests give p-values < 0.05, implying the distributions are significantly different from normal, so we cannot assume normality (note these tests were run on only the first six values via head(), so they are at best suggestive). The Breusch-Pagan test gives p = 0.4271 > 0.05, so we fail to reject the null hypothesis of homoskedasticity; there is no evidence of heteroskedasticity, though robust standard errors are computed below for comparison. With robust SEs, the t statistic for M increases and its p-value decreases, the t statistic for W decreases and its p-value increases, and the intercept barely changes. The proportion of variation in the outcome explained by this model is R² = 0.1321102.
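For completeness, the same robust variance estimate can be used for confidence intervals; a sketch using coefci() from the already-loaded lmtest package:
# 95% CIs for the coefficients under the same robust (HC) variance estimate
coefci(fit1, vcov. = vcovHC(fit1))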
set.seed(348)
fit4 <- lm(Median ~ Men * Women, data = College)
boot_sd <- College[sample(nrow(College), replace = T), ]
samp_distn <- replicate(1000, {
boot_sd <- College[sample(nrow(College), replace = T), ]
fit5 <- lm(Median ~ Men * Women, data = boot_sd)
coef(fit5)
})
samp_distn %>% t %>% as.data.frame %>% summarize_all(sd)
## (Intercept) Men Women Men:Women
## 1 2436.675 0.07474423 0.1286051 1.675863e-06
# Uncorrected SEs
summary(fit4)
##
## Call:
## lm(formula = Median ~ Men * Women, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23170 -9449 -2706 6294 64437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.534e+04 1.949e+03 23.265 < 2e-16 ***
## Men 1.342e-01 8.733e-02 1.537 0.12831
## Women -1.973e-01 6.541e-02 -3.017 0.00343 **
## Men:Women 1.032e-07 8.944e-07 0.115 0.90839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13240 on 80 degrees of freedom
## Multiple R-squared: 0.1323, Adjusted R-squared: 0.09971
## F-statistic: 4.064 on 3 and 80 DF, p-value: 0.009666
# Corrected SEs
coeftest(fit4, vcov = vcovHC(fit4))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5342e+04 2.2933e+03 19.7712 < 2e-16 ***
## Men 1.3420e-01 6.1517e-02 2.1815 0.03208 *
## Women -1.9733e-01 9.8574e-02 -2.0018 0.04870 *
## Men:Women 1.0325e-07 8.4765e-07 0.1218 0.90336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The uncorrected (OLS) SEs for the intercept and Women are smaller than the robust SEs, while the uncorrected SEs for Men and Men:Women are slightly larger. As a result, with robust SEs the p-value for Men decreases (0.128 to 0.032, becoming significant) while the p-value for Women increases (0.0034 to 0.0487); the interaction is non-significant either way. The bootstrap SEs are the largest of the three for the intercept, Women, and Men:Women, and fall between the robust and OLS SEs for Men.
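The same bootstrap draws also give percentile confidence intervals without any normality assumption; a sketch reusing samp_distn from above:
# 95% bootstrap percentile CIs from the resampled coefficients
samp_distn %>% t %>% as.data.frame %>% pivot_longer(everything()) %>%
    group_by(name) %>% summarize(lower = quantile(value, 0.025),
    upper = quantile(value, 0.975))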
College_new <- College %>% mutate(Major_cat = if_else(Major_category ==
"Engineering", "Eng", "Other"))
College_new$y <- ifelse(College_new$Major_cat == "Eng", 1, 0)
fit6 <- glm(y ~ P25th + P75th, data = College_new, family = "binomial")
coeftest(fit6)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0572e+01 2.2396e+00 -4.7205 2.353e-06 ***
## P25th 1.1565e-04 6.2369e-05 1.8543 0.063703 .
## P75th 1.0598e-04 3.9637e-05 2.6737 0.007502 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef(fit6))
## (Intercept) P25th P75th
## 2.561903e-05 1.000116e+00 1.000106e+00
College_new$probs <- predict(fit6, type = "response")
table(predict = as.numeric(College_new$probs > 0.5), truth = College_new$y) %>%
addmargins()
## truth
## predict 0 1 Sum
## 0 48 5 53
## 1 7 24 31
## Sum 55 29 84
# Sensitivity(TPR)
24/29
## [1] 0.8275862
# Specificity(TNR)
48/55
## [1] 0.8727273
# Precision(PPV)
24/31
## [1] 0.7741935
## GIVE IT PREDICTED PROBS AND TRUTH LABELS (0/1), RETURNS
## VARIOUS DIAGNOSTICS
class_diag <- function(probs, truth) {
# CONFUSION MATRIX: CALCULATE ACCURACY, TPR, TNR, PPV
if (is.character(truth) == TRUE)
truth <- as.factor(truth)
if (is.numeric(truth) == FALSE & is.logical(truth) == FALSE)
truth <- as.numeric(truth) - 1
tab <- table(factor(probs > 0.5, levels = c("FALSE", "TRUE")),
factor(truth, levels = c(0, 1)))
acc = sum(diag(tab))/sum(tab)
sens = tab[2, 2]/colSums(tab)[2]
spec = tab[1, 1]/colSums(tab)[1]
ppv = tab[2, 2]/rowSums(tab)[2]
f1 = 2 * (sens * ppv)/(sens + ppv)
# CALCULATE EXACT AUC
ord <- order(probs, decreasing = TRUE)
probs <- probs[ord]
truth <- truth[ord]
TPR = cumsum(truth)/max(1, sum(truth))
FPR = cumsum(!truth)/max(1, sum(!truth))
dup <- c(probs[-1] >= probs[-length(probs)], FALSE)
TPR <- c(0, TPR[!dup], 1)
FPR <- c(0, FPR[!dup], 1)
n <- length(TPR)
auc <- sum(((TPR[-1] + TPR[-n])/2) * (FPR[-1] - FPR[-n]))
data.frame(acc, sens, spec, ppv, f1, auc)
}
probs1 <- predict(fit6, type = "response")
class_diag(probs1, College_new$y)
## acc sens spec ppv f1 auc
## 1 0.8571429 0.8275862 0.8727273 0.7741935 0.8 0.9213166
College_new$logit <- predict(fit6, type = "link")
College_new %>% ggplot() + geom_density(aes(logit, color = Major_cat,
fill = Major_cat), alpha = 0.4) + theme(legend.position = c(0.83,
0.83)) + geom_vline(xintercept = 0) + xlab("logit (log-odds)") +
geom_rug(aes(logit, color = Major_cat))
library(plotROC)
# ROCCurve
ROCplot <- ggplot(College_new) + geom_roc(aes(d = y, m = probs),
n.cuts = 0) + geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
lty = 2)
ROCplot
# AUC
calc_auc(ROCplot)
## PANEL group AUC
## 1 1 -1 0.9213166
From the logistic regression model, the odds of being an Engineering major when P25th income = 0 and P75th income = 0 are 2.561903e-05. Controlling for P75th, every additional dollar of P25th income multiplies the odds of being an Engineering major by a factor of 1.000116. Controlling for P25th, every additional dollar of P75th income multiplies the odds by a factor of 1.000106.
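Because a one-dollar change is tiny, rescaling makes the odds ratios easier to read; a sketch expressing them per $1,000 of salary:
# Odds ratios per $1,000 increase in P25th and P75th
exp(1000 * coef(fit6)[c("P25th", "P75th")])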
The TPR/sensitivity is 0.8275862 and the TNR/specificity is 0.8727273, so the model does well at classifying a major as Engineering versus other (the other 4 categories grouped together). The PPV/precision is 0.7741935, the proportion of majors classified as Engineering that actually are Engineering. The accuracy is 0.8571429, which is quite good. The AUC is 0.9213166, meaning the model is "great" at discriminating between the positive and negative classes.
The ROC curve shows the trade-off between sensitivity (TPR) and specificity (TNR). The AUC calculated from the curve is 0.9213166, matching the value from class_diag, so again the model discriminates between Engineering and other majors very well.
# log regression using binary
cn <- College_new %>% select(-c("Major", "Major_category", "Major_cat"))
fit8 <- glm(y ~ ., data = cn, family = "binomial")
coeftest(fit8)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.5485e+15 8.8044e+07 63019497 < 2.2e-16 ***
## Rank -1.8253e+13 3.8359e+05 -47584893 < 2.2e-16 ***
## Major_code -7.1379e+11 6.0623e+03 -117743093 < 2.2e-16 ***
## Total 2.3652e+11 8.7626e+03 26992064 < 2.2e-16 ***
## Men -1.1283e+11 1.5267e+03 -73903241 < 2.2e-16 ***
## Women NA NA NA NA
## ShareWomen -4.2508e+15 6.6713e+07 -63717250 < 2.2e-16 ***
## Sample_size -1.5769e+12 1.8438e+05 -8552133 < 2.2e-16 ***
## Employed -1.7954e+12 2.0057e+04 -89513548 < 2.2e-16 ***
## Full_time 1.3103e+12 1.7577e+04 74549230 < 2.2e-16 ***
## Part_time 1.1620e+12 1.5114e+04 76883612 < 2.2e-16 ***
## Full_time_year_round 3.6174e+11 1.0886e+04 33230886 < 2.2e-16 ***
## Unemployed -8.6835e+11 2.6140e+04 -33219897 < 2.2e-16 ***
## Unemployment_rate -5.5195e+15 2.8385e+08 -19444982 < 2.2e-16 ***
## Median -2.1246e+10 2.7160e+03 -7822417 < 2.2e-16 ***
## P25th 3.4401e+10 2.0545e+03 16744641 < 2.2e-16 ***
## P75th -1.8411e+10 1.5804e+03 -11649499 < 2.2e-16 ***
## College_jobs -1.3887e+10 1.6381e+03 -8477182 < 2.2e-16 ***
## Non_college_jobs 5.9507e+10 3.2233e+03 18461503 < 2.2e-16 ***
## Low_wage_jobs -1.3730e+11 1.0496e+04 -13080908 < 2.2e-16 ***
## [ reached getOption("max.print") -- omitted 4 rows ]
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef(fit8))
## (Intercept) Rank Major_code
## Inf 0 0
## Total Men Women
## Inf 0 NA
## ShareWomen Sample_size Employed
## 0 0 0
## Full_time Part_time Full_time_year_round
## Inf Inf Inf
## Unemployed Unemployment_rate Median
## 0 0 0
## P25th P75th College_jobs
## Inf 0 0
## Non_college_jobs Low_wage_jobs M
## Inf 0 NA
## W probs logit
## NA Inf NA
probs2 <- predict(fit8, type = "response")
class_diag(probs2, cn$y)
## acc sens spec ppv f1 auc
## 1 0.9404762 0.9655172 0.9272727 0.875 0.9180328 0.946395
summary(fit8)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = cn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 8.49
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.548e+15 8.804e+07 63019497 <2e-16 ***
## Rank -1.825e+13 3.836e+05 -47584893 <2e-16 ***
## Major_code -7.138e+11 6.062e+03 -117743093 <2e-16 ***
## Total 2.365e+11 8.763e+03 26992064 <2e-16 ***
## Men -1.128e+11 1.527e+03 -73903241 <2e-16 ***
## Women NA NA NA NA
## ShareWomen -4.251e+15 6.671e+07 -63717250 <2e-16 ***
## Sample_size -1.577e+12 1.844e+05 -8552133 <2e-16 ***
## Employed -1.795e+12 2.006e+04 -89513548 <2e-16 ***
## Full_time 1.310e+12 1.758e+04 74549230 <2e-16 ***
## Part_time 1.162e+12 1.511e+04 76883612 <2e-16 ***
## Full_time_year_round 3.617e+11 1.089e+04 33230886 <2e-16 ***
## Unemployed -8.684e+11 2.614e+04 -33219897 <2e-16 ***
## Unemployment_rate -5.520e+15 2.839e+08 -19444982 <2e-16 ***
## Median -2.125e+10 2.716e+03 -7822417 <2e-16 ***
## P25th 3.440e+10 2.054e+03 16744641 <2e-16 ***
## P75th -1.841e+10 1.580e+03 -11649499 <2e-16 ***
## College_jobs -1.389e+10 1.638e+03 -8477182 <2e-16 ***
## Non_college_jobs 5.951e+10 3.223e+03 18461503 <2e-16 ***
## Low_wage_jobs -1.373e+11 1.050e+04 -13080908 <2e-16 ***
## [ reached getOption("max.print") -- omitted 4 rows ]
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 108.27 on 83 degrees of freedom
## Residual deviance: 360.44 on 64 degrees of freedom
## AIC: 400.44
##
## Number of Fisher Scoring iterations: 12
set.seed(348)
k = 10
data <- cn[sample(nrow(cn)), ]
folds <- cut(seq(1:nrow(cn)), breaks = k, labels = F)
diags <- NULL
for (i in 1:k) {
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$y
fit <- glm(y ~ ., data = train, family = "binomial")
probs <- predict(fit, newdata = test, type = "response")
diags <- rbind(diags, class_diag(probs, truth))
}
summarize_all(diags, mean)
## acc sens spec ppv f1 auc
## 1 0.7638889 0.735 0.7828571 0.6666667 0.6797619 0.8003571
# lasso regression
set.seed(348)
cn <- College_new %>% select(-c("Major", "Major_category", "Major_cat"))
library(glmnet)
y <- as.matrix(cn$y)
x <- cn %>% select(-y) %>% mutate_all(scale) %>% as.matrix
head(x)
## Rank Major_code Total Men Women ShareWomen
## [1,] -1.334998 -0.5954497 -0.6135406 -0.54597552 -0.5765891 -1.3952364
## [2,] -1.316422 -0.5973626 -0.6389159 -0.58843378 -0.5823396 -1.4736612
## [3,] -1.297847 -0.5980003 -0.6373129 -0.58701645 -0.5808248 -1.2591408
## [4,] -1.279271 -0.5967250 -0.6308689 -0.57475347 -0.5807126 -1.4507725
## Sample_size Employed Full_time Part_time Full_time_year_round
## [1,] -0.5236655 -0.6015220 -0.5834324 -0.6517978 -0.5711890
## [2,] -0.5656982 -0.6274359 -0.6117473 -0.6602564 -0.5940255
## [3,] -0.5714958 -0.6272807 -0.6117035 -0.6633861 -0.5953639
## [4,] -0.5526536 -0.6251471 -0.6005133 -0.6619481 -0.5855489
## Unemployed Unemployment_rate Median P25th P75th College_jobs
## [1,] -0.6129443 -1.6039785 4.730927 5.6640426 4.071532 -0.5993097
## [2,] -0.6006506 1.5782654 2.221831 2.0320334 2.027737 -0.6614842
## [3,] -0.6183228 -1.4199900 2.078454 1.5780323 2.903649 -0.6559179
## [4,] -0.6121759 -0.5821417 1.863388 0.9424307 1.443795 -0.6520845
## Non_college_jobs Low_wage_jobs M W probs logit
## [1,] -0.5185841 -0.5265120 -0.54597552 -0.5765891 1.903466 4.907594
## [2,] -0.5230610 -0.5485560 -0.58843378 -0.5823396 1.889406 2.106488
## [3,] -0.5264501 -0.5562637 -0.58701645 -0.5808248 1.898339 2.446391
## [4,] -0.5295462 -0.5562637 -0.57475347 -0.5807126 1.748794 1.283963
## [ reached getOption("max.print") -- omitted 2 rows ]
cv <- cv.glmnet(x, y)
lasso1 <- glmnet(x, y, lambda = cv$lambda.1se)
coef(lasso1)
## 24 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.34523810
## Rank -0.03610480
## Major_code -0.09511901
## Total .
## Men .
## Women .
## ShareWomen -0.08307175
## Sample_size .
## Employed .
## Full_time .
## Part_time .
## Full_time_year_round .
## Unemployed .
## Unemployment_rate .
## Median 0.05774517
## P25th .
## P75th .
## College_jobs .
## Non_college_jobs .
## Low_wage_jobs .
## M .
## W .
## probs 0.12926262
## logit .
prob <- predict(lasso1, newx = x, type = "response")
class_diag(prob, cn$y)
## acc sens spec ppv f1 auc
## 1 0.8928571 0.862069 0.9090909 0.8333333 0.8474576 0.9768025
k = 10
data <- cn %>% sample_frac #put rows of dataset in random order
folds <- ntile(1:nrow(data), n = 10) #create fold labels
diags <- NULL
for (i in 1:k) {
train <- data[folds != i, ] #create training set (all but fold i)
test <- data[folds == i, ] #create test set (just fold i)
truth <- test$y #save truth labels from fold i
fit <- glm(y ~ Rank + Major_code + ShareWomen + Median, data = train,
family = "binomial")
probs <- predict(fit, newdata = test, type = "response")
diags <- rbind(diags, class_diag(probs, truth))
}
diags %>% summarize_all(mean)
## acc sens spec ppv f1 auc
## 1 0.8847222 0.8738095 0.869881 0.8490476 0.8389744 0.9580952
The logistic regression using all the remaining variables gives acc = 0.9404762, sens = 0.9655172, spec = 0.9272727, ppv = 0.875, and AUC = 0.946395, so in-sample the model looks like a great fit for the predicted probabilities of Engineering (though the enormous coefficients and the "singularities" note in the summary suggest the fit is numerically unstable). The 10-fold CV gives acc = 0.7638889, sens = 0.735, spec = 0.7828571, ppv = 0.6666667, and AUC = 0.8003571. The out-of-sample AUC is only fair, and the drop from the in-sample AUC indicates the full model does not generalize well; it is overfitting.
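Part of the in-sample optimism likely comes from leakage: cn still contains the probs and logit columns saved from fit6, plus the centered M and W copies of Men and Women, so the full model can partially reconstruct earlier fitted values. A hedged sketch of a cleaner refit, assuming those derived columns should be excluded:
# Drop columns derived from earlier steps before refitting
# (probs/logit are fitted values from fit6; M/W duplicate Men/Women)
cn_clean <- cn %>% select(-probs, -logit, -M, -W)
fit8b <- glm(y ~ ., data = cn_clean, family = "binomial")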
The variables the lasso retains are Rank, Major_code, ShareWomen, and Median (it also keeps the stored probs column, a fitted value from the earlier model, which is excluded from the CV fit below). The lasso predictions give acc = 0.8928571, sens = 0.862069, spec = 0.9090909, and ppv = 0.8333333 in-sample. The in-sample AUC of the lasso model is 0.9768025, the highest so far.
The out-of-sample AUC of the 10-fold CV using the lasso-selected variables is 0.9580952. This is higher than the 10-fold CV AUC of the full logistic regression above (0.8003571), though slightly lower than the lasso model's in-sample AUC (0.9768025).