NA behavior in prop.test #748

Open
AmeliaMN opened this issue Feb 6, 2020 · 10 comments

@AmeliaMN

AmeliaMN commented Feb 6, 2020

I was having trouble interpreting the error message

Error in stats::prop.test(t(table_from_formula), p = p, conf.level = conf.level,  : 
  'x' must have 2 columns

but this old issue made it clear that it occurs when one or more of your variables has more than two categories. It turns out that my data has more than two categories because it is a factor with two levels... plus NA. Here's a reprex:

library(mosaic)
prop.test(anysub~sex, data=HELPrct)

What is the recommended way to deal with this in mosaic?

@rpruim
Contributor

rpruim commented Feb 6, 2020

I don't think there is a single best way to handle missing data. In this particular example, missing is the dominant category:

tally(~ anysub, data = HELPrct)
## anysub
##   no  yes <NA> 
##   56  190  207

So it is good that we don't simply proceed to run the test. This function doesn't have an na.rm argument, but neither does stats::prop.test() which this is based on. We could consider adding that, but for now, you will need to deal with missingness before calling prop.test().
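One way to deal with the missingness up front can be sketched in base R. This is only an illustration under stated assumptions: `toy` is made-up data standing in for a two-level factor that also contains NA (it is not HELPrct), and the test is called through stats::prop.test() on a two-column table rather than through the mosaic formula interface.

```r
# Made-up data (an assumption for illustration): 70 rows, 10 with a missing outcome.
toy <- data.frame(
  anysub = factor(c(rep("no", 20), rep("yes", 40), rep(NA, 10))),
  sex    = factor(rep(c("female", "male"), 35))
)

# Make the missing values visible before deciding what to do with them.
print(table(toy$anysub, useNA = "ifany"))

# Explicitly keep only the rows where both variables are observed.
complete <- toy[!is.na(toy$anysub) & !is.na(toy$sex), ]

# Groups in rows, outcome in columns. stats::prop.test() treats the first
# column ("no" here) as the successes and the second as the failures.
counts <- table(complete$sex, complete$anysub)
result <- stats::prop.test(counts)
print(result)
```

Printing the table with `useNA = "ifany"` first keeps the dropped rows visible, so the removal is a deliberate step rather than a silent one.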

I'll have to look to see if it is easy to give a more informative error message. One of the tricky things with prop.test() is that it implements several different kinds of tests.

@AmeliaMN
Author

AmeliaMN commented Feb 6, 2020

Sure. I was just trying to make a reprex here, my real data has only one NA value.

This is my first time teaching really intro R labs in a long while, and I'm trying to avoid talking too much about data wrangling. Most of our datasets are pretty clean, but they do tend to have some NA values!

I'm just wondering what people generally do about NAs when teaching formula syntax. I would rather not show $, so I think my solution will be to use dplyr::filter() without the pipe, but it feels like there could be a mosaic function analogous to tidyr::drop_na() that could be used around a dataset, in the same way that mosaic has factorize().

@rpruim
Contributor

rpruim commented Feb 7, 2020

My point is that we shouldn't automate throwing away missing values -- certainly not silently.

Since an analysis can involve multiple R functions each using different sets of variables, I think dealing with NAs is best done outside of a particular function as a step before summarizing and modeling begins. data %>% drop_na() or drop_na(data) seem like fine options here if that's the dropping rule you want. (One could insert a call to select() to eliminate unused columns first.)
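The parenthetical about selecting columns first matters because dropping on the whole data frame removes any row with an NA in any column, used or not. A minimal base R sketch of the difference, with na.omit() standing in for drop_na() and column indexing standing in for select(); the `toy` data frame and its column names are made up for illustration:

```r
# Made-up data (an assumption): `unused` plays no role in the analysis
# but has missing values of its own.
toy <- data.frame(
  outcome = c("yes", "no", NA, "yes", "no"),
  group   = c("a", "b", "a", "b", "a"),
  unused  = c(NA, 1, 2, NA, 3)
)

# Dropping on every column also discards rows that are missing only `unused`.
all_cols <- na.omit(toy)

# Restricting to the analysis variables first discards only the row
# that is missing `outcome`.
used_only <- na.omit(toy[c("outcome", "group")])

print(nrow(all_cols))
print(nrow(used_only))
```

Here the whole-frame rule keeps 2 of the 5 rows, while the select-then-drop rule keeps 4.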

Do you have a proposal for a function that behaves differently from drop_na()?

@nicholasjhorton
Contributor

Thanks @AmeliaMN for sharing your question/suggestion.

@rpruim Is there a way to make the error message more user friendly? Offer a hint in addition to the error message?

@AmeliaMN
Author

I think @nicholasjhorton knows this, but I'm not sure if you do, @rpruim -- this semester I'm teaching two R labs, one in "formula" syntax (mosaic and ggformula packages) and one in "tidy" syntax (tidyverse and infer). It has been (and I'm sure will continue to be!) a very interesting exercise.

I think drop_na() is great, and that's what I taught in my tidy labs when we started doing summary statistics:

GSS %>%
  drop_na(highest_year_of_school_completed) %>%
  summarize(
    mean = mean(highest_year_of_school_completed),
    median = median(highest_year_of_school_completed)
  )

Even at this stage, I am struggling with what to do in my formula labs. I ended up showing two approaches:

mean(~highest_year_of_school_completed, data = GSS, na.rm = TRUE)

and

options(na.rm = TRUE) 
median(~highest_year_of_school_completed, data = GSS)

My understanding is that having a global option about na.rm comes from mosaic, but I could be wrong about that. Students seemed to prefer setting the options, rather than repeating na.rm=TRUE in each of their summary statistics functions, but of course this is dangerous because it lulls students into a false sense of security about NAs.

Maybe what I'm requesting is for mosaic to import drop_na() from tidyr (via importFrom). I really don't want to add another package to my library() calls for my formula labs just for that one function.

Wrapping back to prop.test(), then my interactive workflow would be something like, run

prop.test(anysub~sex, data=HELPrct)

and get an unhelpful error message. Hopefully, then I'd realize it was about NAs. Then I would do data processing, and run it again,

HELPrct_no_nas_for_prop <- drop_na(HELPrct, anysub, sex)
prop.test(anysub~sex, data=HELPrct_no_nas_for_prop)

?

(I guess I could do a one-liner, prop.test(anysub~sex, data=drop_na(HELPrct, anysub, sex)), but I generally try to avoid nesting like that in intro classes. And, in my formula labs I'm trying to avoid showing %>%.)

Anyway, that's just a bunch of thoughts, but maybe useful.

@rpruim
Contributor

rpruim commented Feb 29, 2020

Current working version:

prop.test(anysub~sex, data = HELPrct)
## Error: anysub has 3 levels (including NA).  Only 2 are allowed.

Now thinking about implementing an na.rm option for the case when a formula is used.

@rpruim
Contributor

rpruim commented Feb 29, 2020

@AmeliaMN : How does this look?

Note: na.rm can be a vector of dimensions from which to drop NAs or TRUE (all dimensions) or FALSE (none). Remaining NAs are treated as a category and a warning is emitted identifying the variable(s) in question.

library(mosaic)

prop.test(anysub ~ link, data = HELPrct)
#> Error: anysub has 3 levels (including NA).  Only 2 are allowed.

prop.test(anysub ~ link, data = HELPrct, na.rm = TRUE)
#> 
#>  2-sample test for equality of proportions with continuity correction
#> 
#> data:  tally(anysub ~ link)
#> X-squared = 9.2749, df = 1, p-value = 0.002323
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.29428286 -0.05895097
#> sample estimates:
#>    prop 1    prop 2 
#> 0.1567164 0.3333333

prop.test(link ~ anysub, data = HELPrct)
#> Error: link has 3 levels (including NA).  Only 2 are allowed.

prop.test(link ~ anysub, data = HELPrct, na.rm = 1)
#> Warning: NA is being treated as a category for anysub
#> 
#>  3-sample test for equality of proportions without continuity
#>  correction
#> 
#> data:  tally(link ~ anysub)
#> X-squared = 19.25, df = 2, p-value = 6.607e-05
#> alternative hypothesis: two.sided
#> sample estimates:
#>    prop 1    prop 2    prop 3 
#> 0.3750000 0.6174863 0.6979167

prop.test(link ~ anysub, data = HELPrct, na.rm = TRUE)
#> 
#>  2-sample test for equality of proportions with continuity correction
#> 
#> data:  tally(link ~ anysub)
#> X-squared = 9.2749, df = 1, p-value = 0.002323
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.3991840 -0.0857887
#> sample estimates:
#>    prop 1    prop 2 
#> 0.3750000 0.6174863

Created on 2020-02-29 by the reprex package (v0.3.0)
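
The na.rm rule described in the note above (TRUE for all dimensions, FALSE for none, or a vector of dimensions) could be sketched roughly as follows. This is not mosaic's implementation: `tally_with_na_rm()` is a hypothetical helper, `toy` is made-up data, and the warning text only imitates the one shown in the reprex.

```r
# Hypothetical helper (not part of mosaic): cross-tabulate `vars`, dropping NAs
# only from the dimensions selected by `na.rm`.
tally_with_na_rm <- function(data, vars, na.rm = FALSE) {
  # Translate na.rm into positions within `vars`.
  dims <- if (isTRUE(na.rm)) {
    seq_along(vars)          # TRUE: clean every dimension
  } else if (isFALSE(na.rm)) {
    integer(0)               # FALSE: clean none
  } else {
    as.integer(na.rm)        # numeric vector: clean only these dimensions
  }

  # Keep rows that are observed in every selected dimension.
  keep <- rep(TRUE, nrow(data))
  for (v in vars[dims]) keep <- keep & !is.na(data[[v]])
  data <- data[keep, vars, drop = FALSE]

  # Any NA that remains becomes its own category, with a warning.
  for (v in vars) {
    if (anyNA(data[[v]])) warning("NA is being treated as a category for ", v)
  }
  table(data, useNA = "ifany")
}

# Made-up data (an assumption): one NA in each variable.
toy <- data.frame(
  y = factor(c("no", "yes", NA, "yes", "no", "yes")),
  g = factor(c("a", "b", "a", NA, "b", "a"))
)

print(tally_with_na_rm(toy, c("y", "g"), na.rm = TRUE))  # NAs dropped from both
print(tally_with_na_rm(toy, c("y", "g"), na.rm = 1))     # NA kept for g, with a warning
```

With `na.rm = 1` only the first variable is cleaned, so the second keeps NA as a third category, which mirrors the 3-sample result in the reprex.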

rpruim pushed a commit that referenced this issue Feb 29, 2020
@rpruim
Contributor

rpruim commented Feb 29, 2020

I hope to send this to CRAN early next week.

@AmeliaMN
Author

AmeliaMN commented Mar 1, 2020

This looks amazing! Thank you for addressing this.

@rpruim
Contributor

rpruim commented Mar 14, 2020

Just confirming that this went to CRAN and closing the issue.
