r - Removing outliers from statistical testing of stat_compare_means

I have a large dataset that has to be presented as boxplots. Each group may contain outliers, and I would like to run the statistical test only after excluding them. A sample data frame and the plotting code:

library(dplyr)   # provides %>%
library(ggpubr)  # provides stat_compare_means() and loads ggplot2

df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
                score = c(2,3,4,5,100,5,8,9,10,95))
df %>% ggplot(aes(x=name,y=score)) + geom_boxplot() + 
    stat_compare_means(comparisons = list(c("Bob","Tom")),method="t.test", paired=FALSE)
I am using stat_compare_means because the full dataset has many more groups and facets, which makes removing outliers by hand very tedious (unless the removal can be applied to the whole dataset at once). Is there a way to make the function ignore the outliers when it computes the statistical tests? Thanks
                outlier.shape = NA removes the outliers from the boxplot only in the visual sense: it hides the points without dropping the data, so the statistical calculation still includes them.
– Jeff238
                Aug 10, 2022 at 12:07
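The point in the comment above can be checked without any plotting. The following is a minimal base R sketch, not part of the original thread: it uses `boxplot.stats()` to find the points a boxplot would draw as outliers and compares a Welch t-test with and without them. The variable names (`outliers_by_group`, `is_outlier`, `p_all`, `p_trimmed`) are made up for illustration.

```r
# Same toy data as in the question
df <- data.frame(name = c(rep("Bob", 5), rep("Tom", 5)),
                 score = c(2, 3, 4, 5, 100, 5, 8, 9, 10, 95))

# Points a boxplot would draw as outliers in each group (1.5 * IQR rule on the hinges)
outliers_by_group <- tapply(df$score, df$name,
                            function(x) boxplot.stats(x)$out)

# Hiding those points in the plot does not change the data,
# so a test on df still uses every row
p_all <- t.test(score ~ name, data = df)$p.value

# Flag outliers within each group, then test on the remaining rows
is_outlier <- ave(df$score, df$name,
                  FUN = function(x) x %in% boxplot.stats(x)$out) == 1
p_trimmed <- t.test(score ~ name, data = df[!is_outlier, ])$p.value

print(outliers_by_group)
print(c(all_rows = p_all, without_outliers = p_trimmed))
```

The two p-values differ because only the second test runs on rows that were actually dropped.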
                First, you need to define what an "outlier" is; second, you need to justify why you will remove this data; third, you should conclude that removing data is misleading and bad practice.
– user2974951
                Aug 10, 2022 at 12:21
                In a large dataset, investigating the origin of each individual outlier is tedious, but simply removing everything that looks like an outlier makes any inference from the data invalid. So, this is bad practice. An alternative is to use methods that are resistant or robust to the presence of outliers, instead of assuming normality and altering your data. If you have enough data, another possibility is to compute confidence intervals by bootstrapping.
– Pedro J. Aphalo
                Aug 17, 2022 at 13:52
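The two alternatives mentioned in the comment above could look roughly like this in base R. This is only a sketch under assumptions: the commenter did not name a specific method, so the rank-based Wilcoxon test and a percentile bootstrap of the difference in medians are chosen here for illustration, and with five values per group the bootstrap interval is very rough.

```r
# Same toy data as in the question
df <- data.frame(name = c(rep("Bob", 5), rep("Tom", 5)),
                 score = c(2, 3, 4, 5, 100, 5, 8, 9, 10, 95))

# Rank-based test: an extreme value only counts through its rank,
# so no rows need to be removed. exact = FALSE because the data contain ties.
p_wilcox <- wilcox.test(score ~ name, data = df, exact = FALSE)$p.value

# Percentile bootstrap interval for the difference in medians (Tom - Bob).
# Each group is resampled separately, with replacement.
set.seed(1)
bob <- df$score[df$name == "Bob"]
tom <- df$score[df$name == "Tom"]
boot_diff <- replicate(2000, {
  median(sample(tom, replace = TRUE)) - median(sample(bob, replace = TRUE))
})
ci <- quantile(boot_diff, probs = c(0.025, 0.975))

print(p_wilcox)
print(ci)
```

In the plot itself, the rank-based test can be requested by passing method = "wilcox.test" to stat_compare_means instead of method = "t.test".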
If you exclude the outliers from the statistical test only, the plot will show a test result computed without outliers on top of a boxplot that still contains them, which is misleading. You could instead remove the outliers beforehand and then run the t.test. Below, the first plot puts the t.test computed without outliers on a boxplot with outliers, and the second plot puts the same t.test on a boxplot without outliers:
library(dplyr)
library(ggpubr)
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
                score = c(2,3,4,5,100,5,8,9,10,95))
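# Replace values more than 1.5 * IQR beyond the quartiles with NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - val)] <- NA
  y[x > (qnt[2] + val)] <- NA
  y  # return the vector with outliers set to NA
}
# Apply the outlier rule within each group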
df2 <- df %>% 
  group_by(name) %>% 
  mutate(score = remove_outliers(score)) %>% 
  ungroup() 
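# Keep only the rows that were not flagged; a logical filter also works when nothing was flagged
df_no_outliers <- df2[!is.na(df2$score), ]
# First plot: boxplot with outliers, t.test without outliers
df %>% ggplot(aes(x=name,y=score)) + 
  geom_boxplot() + 
  stat_compare_means(data = df_no_outliers, comparisons = list(c("Bob","Tom")), 
                     method="t.test", 
                     paired=FALSE)
# Second plot: boxplot and t.test both without outliers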
df2 %>% ggplot(aes(x=name,y=score)) + 
  geom_boxplot() + 
  stat_compare_means(comparisons = list(c("Bob","Tom")), 
                     method="t.test", 
                     paired=F)
#> Warning: Removed 2 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 2 rows containing non-finite values (stat_signif).
Created on 2022-08-10 by the reprex package (v2.0.1)
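The question mentions many groups and facets. The per-group filtering above extends to that case by grouping on every variable that defines a box. The sketch below is an illustration only: the data frame `big`, its facetting column `panel`, and the helper `flag_outliers` are hypothetical and not from the original post.

```r
library(dplyr)

# Hypothetical larger dataset: 'panel' stands in for a facetting variable
set.seed(42)
big <- expand.grid(rep = 1:8,
                   name = c("Bob", "Tom"),
                   panel = c("A", "B"),
                   stringsAsFactors = FALSE)
big$score <- rnorm(nrow(big), mean = 10, sd = 2)
big$score[c(1, 20)] <- c(100, 95)  # plant two extreme values

# TRUE for values more than 1.5 * IQR beyond the quartiles
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  lim <- 1.5 * IQR(x, na.rm = TRUE)
  x < q[[1]] - lim | x > q[[2]] + lim
}

# Apply the rule separately inside every panel/name combination
big_clean <- big %>%
  group_by(panel, name) %>%
  filter(!flag_outliers(score)) %>%
  ungroup()

print(nrow(big) - nrow(big_clean))  # number of rows dropped
```

`big_clean` can then be plotted in the same way as `df2` above, with a facetting layer such as facet_wrap(~ panel) added, so that the boxplots and the tests use the same rows.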