I have a large dataset that has to be presented as boxplots. Each group may contain outliers, and I want to run the statistical tests only after excluding them. Sample data frame and code:
library(dplyr)   # provides %>%
library(ggpubr)  # attaches ggplot2 and provides stat_compare_means()
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
                score = c(2,3,4,5,100,5,8,9,10,95))
df %>% ggplot(aes(x=name,y=score)) + geom_boxplot() +
  stat_compare_means(comparisons = list(c("Bob","Tom")), method="t.test", paired=FALSE)
I use stat_compare_means because the real dataset has many more groups and facets, which makes removing outliers by hand very tedious (unless it can be done across the whole dataset at once). Is it possible to make the function ignore the outliers when it computes the statistical tests? Thanks
If you exclude the outliers only from the statistical test, the plot will show a test result computed without outliers on top of boxplots that still contain them, which is misleading. Instead, you could remove the outliers per group beforehand and then run the t.test. Below, the first plot puts the t.test computed without outliers on boxplots that still include them, and the second plot uses the outlier-free data for both the boxplots and the t.test:
library(dplyr)
library(ggpubr)
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
score = c(2,3,4,5,100,5,8,9,10,95))
# Replace values outside the 1.5 * IQR fences with NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - val)] <- NA
  y[x > (qnt[2] + val)] <- NA
  y
}
df2 <- df %>%
group_by(name) %>%
mutate(score = remove_outliers(score)) %>%
ungroup()
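# Drop the flagged rows; filtering on !is.na() also works when no outliers are found,
# whereas negative indexing with an empty index would return zero rows
df_clean <- df2 %>% filter(!is.na(score))
# Plot 1: boxplots of the original data, test computed on the outlier-free data
df %>% ggplot(aes(x=name,y=score)) +
  geom_boxplot() +
  stat_compare_means(data = df_clean, comparisons = list(c("Bob","Tom")),
                     method="t.test",
                     paired=FALSE)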
# Plot 2: boxplots and test both use the outlier-free data
df2 %>% ggplot(aes(x=name,y=score)) +
  geom_boxplot() +
  stat_compare_means(comparisons = list(c("Bob","Tom")),
                     method="t.test",
                     paired=FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 2 rows containing non-finite values (stat_signif).
Created on 2022-08-10 by the reprex package (v2.0.1)
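As a sanity check, the same 1.5 * IQR rule can be applied in base R to see which rows are dropped and what an unpaired t-test gives on the remaining data. This is only a sketch: `is_outlier`, the `outlier` column and `cleaned` are names made up for this illustration, and `ave()` is used here in place of the `group_by()`/`mutate()` step above. The test is base R's `t.test()`, which defaults to Welch's unequal-variance version, so its p-value should be comparable to the one drawn on the plot but is not guaranteed to be formatted identically.

```r
# Sample data from the question
df <- data.frame(
  name  = c(rep("Bob", 5), rep("Tom", 5)),
  score = c(2, 3, 4, 5, 100, 5, 8, 9, 10, 95)
)

# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences
is_outlier <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  x < (qnt[1] - val) | x > (qnt[2] + val)
}

# ave() applies the helper within each group; more grouping columns
# (for example a facet variable) can be passed as extra arguments.
# ave() returns 0/1 because the input is numeric, hence as.logical().
df$outlier <- as.logical(ave(df$score, df$name, FUN = is_outlier))

# Keep only the non-outlying rows
cleaned <- df[!df$outlier, ]
print(cleaned)

# Unpaired two-sample t-test on the outlier-free data
res <- t.test(score ~ name, data = cleaned)
print(res$p.value)
```

With the sample data this flags one value per group (100 for Bob, 95 for Tom), which matches the two rows reported as removed in the warnings above. Tom's score of 5 sits exactly on the lower fence and is kept, because the comparison is strict.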