NA behavior in prop.test #748

AmeliaMN · 2020-02-06T15:48:52Z

I was having trouble interpreting the error message

Error in stats::prop.test(t(table_from_formula), p = p, conf.level = conf.level,  : 
  'x' must have 2 columns

but this old issue made it clear that it occurs when one or more of your variables has more than two categories. It turns out that my data has more than two categories because it is a factor with two levels... plus NA. Here's a reprex:

prop.test(anysub~sex, data=HELPrct)

What is the recommended way to deal with this in mosaic?
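To make the cause concrete, here is a minimal base R sketch. The toy vectors `outcome` and `group` are hypothetical and are not taken from HELPrct. It assumes, based on the error message above, that the formula interface builds a table in which NA counts as its own category; `table(..., useNA = "ifany")` stands in for that step.

```r
# Hypothetical toy data: a two-level factor with one missing value
outcome <- factor(c("yes", "no", "yes", NA, "no", "yes"))
group   <- factor(c("a", "a", "b", "b", "a", "b"))

# NA is not a factor level, so the factor itself reports two levels
nlevels(outcome)

# Once NA is tabulated as a category, the table gains a third column
tab <- table(group, outcome, useNA = "ifany")
ncol(tab)

# stats::prop.test() rejects a matrix that does not have exactly 2 columns;
# capture the error message instead of stopping the script
msg <- tryCatch(stats::prop.test(tab), error = function(e) conditionMessage(e))
msg
```

So a variable can look binary when you inspect its levels and still produce a three-column table.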


rpruim · 2020-02-06T18:10:54Z

I don't think there is a single best way to handle missing data. In this particular example, missing is the dominant category:

tally(~ anysub, data = HELPrct)
## anysub
##   no  yes <NA> 
##   56  190  207

So it is good that we don't simply proceed to run the test. This function doesn't have an na.rm argument, but neither does stats::prop.test() which this is based on. We could consider adding that, but for now, you will need to deal with missingness before calling prop.test().
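One way to deal with missingness explicitly before testing, sketched in base R. The data frame `dat` and its columns are hypothetical, and this is only one possible dropping rule, not a recommendation from mosaic. The point is that the number of discarded rows is reported rather than dropped silently.

```r
# Hypothetical toy data (not HELPrct): 105 rows, 5 with a missing outcome
dat <- data.frame(
  outcome = factor(c(rep("yes", 30), rep("no", 20), rep(NA, 5),
                     rep("yes", 15), rep("no", 35))),
  group   = factor(c(rep("a", 55), rep("b", 50)))
)

# Keep only rows that are complete for the two variables in the test
keep <- complete.cases(dat[, c("outcome", "group")])
n_dropped <- sum(!keep)
message(n_dropped, " row(s) dropped because of missing values")

dat_complete <- dat[keep, ]

# Rows are groups, columns are outcomes; stats::prop.test() treats the
# first column (here "no") as the count of successes
tab <- with(dat_complete, table(group, outcome))
result <- stats::prop.test(tab)
result
```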

I'll have to look to see if it is easy to give a more informative error message. One of the tricky things with prop.test() is that it implements several different kinds of tests.

AmeliaMN · 2020-02-06T21:03:23Z

Sure. I was just trying to make a reprex here, my real data has only one NA value.

This is my first time teaching really intro R labs in a long while, and I'm trying to avoid talking too much about data wrangling. Most of our datasets are pretty clean, but they do tend to have some NA values!

I'm just wondering what people generally do about NAs when teaching formula syntax. I would rather not show $, so I think my solution will be to use dplyr::filter() without the pipe, but it feels like there could be a mosaic function analogous to tidyr::drop_na() that could be used around a dataset, in the same way that mosaic has factorize().

rpruim · 2020-02-07T04:33:04Z

My point is that we shouldn't automate throwing away missing values -- certainly not silently.

Since an analysis can involve multiple R functions, each using different sets of variables, I think dealing with NAs is best done outside of a particular function, as a step before summarizing and modeling begin. data %>% drop_na() or drop_na(data) seem like fine options here if that's the dropping rule you want. (One could insert a call to select() to eliminate unused columns first.)
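A small sketch of why the select() step matters, assuming dplyr and tidyr are installed. The data frame `dat` and its `unused` column are hypothetical. Calling drop_na() on every column also discards rows that are missing only in variables the analysis never touches.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: `unused` plays no part in the analysis
dat <- data.frame(
  outcome = c("yes", "no", NA, "yes", "no"),
  group   = c("a", "b", "a", "b", "a"),
  unused  = c(1, NA, 3, NA, 5)
)

# Drops any row with an NA in any column, including `unused`
all_cols <- drop_na(dat)

# Restrict to the variables of interest first, then drop
used_cols <- dat %>%
  select(outcome, group) %>%
  drop_na()

# Alternative: keep every column but check only the named ones
named_cols <- drop_na(dat, outcome, group)

nrow(all_cols)
nrow(used_cols)
nrow(named_cols)
```

Here the first call keeps 2 rows, while the other two keep 4, because only one row is missing a value in `outcome` or `group`.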

Do you have a proposal for a function that behaves differently from drop_na()?

nicholasjhorton · 2020-02-08T16:39:09Z

Thanks @AmeliaMN for sharing your question/suggestion.

@rpruim Is there a way to make the error message more user friendly? Offer a hint in addition to the error message?

AmeliaMN · 2020-02-29T16:02:31Z

I think @nicholasjhorton knows this, but I'm not sure if you do, @rpruim -- this semester I'm teaching two R labs, one in "formula" syntax (mosaic and ggformula packages) and one in "tidy" syntax (tidyverse and infer). It has been (and I'm sure will continue to be!) a very interesting exercise.

I think drop_na() is great, and that's what I taught in my tidy labs when we started doing summary statistics:

GSS %>%
  drop_na(highest_year_of_school_completed) %>%
  summarize(
    mean = mean(highest_year_of_school_completed),
    median = median(highest_year_of_school_completed)
  )

But even at this stage, I am struggling with what to do in my formula labs. I ended up showing two approaches:

mean(~highest_year_of_school_completed, data = GSS, na.rm = TRUE)

and

options(na.rm = TRUE) 
median(~highest_year_of_school_completed, data = GSS)

My understanding is that having a global option about na.rm comes from mosaic, but I could be wrong about that. Students seemed to prefer setting the options, rather than repeating na.rm=TRUE in each of their summary statistics functions, but of course this is dangerous because it lulls students into a false sense of security about NAs.

Maybe what I'm requesting is for mosaic to import drop_na() from tidyr (via importFrom) and re-export it. I really don't want to add another package to my library() calls for my formula labs just for that one function.

Wrapping back to prop.test(), then my interactive workflow would be something like, run

prop.test(anysub~sex, data=HELPrct)

and get an unhelpful error message. Hopefully, then I'd realize it was about NAs. Then I would do data processing, and run it again,

HELPrct_no_nas_for_prop <- drop_na(HELPrct, anysub, sex)
prop.test(anysub~sex, data=HELPrct_no_nas_for_prop)

?

(I guess I could do a one-liner, prop.test(anysub~sex, data=drop_na(HELPrct, anysub, sex)), but I generally try to avoid nesting like that in intro classes. And, in my formula labs I'm trying to avoid showing %>%.)

Anyway, that's just a bunch of thoughts, but maybe useful.

rpruim · 2020-02-29T19:36:47Z

Current working version:

prop.test(anysub~sex, data = HELPrct)
## Error: anysub has 3 levels (including NA).  Only 2 are allowed.

Now thinking about implementing an na.rm option for the case when a formula is used.

rpruim · 2020-02-29T20:06:55Z

@AmeliaMN : How does this look?

Note: na.rm can be a vector of dimensions from which to drop NAs or TRUE (all dimensions) or FALSE (none). Remaining NAs are treated as a category and a warning is emitted identifying the variable(s) in question.

library(mosaic)

prop.test(anysub ~ link, data = HELPrct)
#> Error: anysub has 3 levels (including NA).  Only 2 are allowed.

prop.test(anysub ~ link, data = HELPrct, na.rm = TRUE)
#> 
#>  2-sample test for equality of proportions with continuity correction
#> 
#> data:  tally(anysub ~ link)
#> X-squared = 9.2749, df = 1, p-value = 0.002323
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.29428286 -0.05895097
#> sample estimates:
#>    prop 1    prop 2 
#> 0.1567164 0.3333333

prop.test(link ~ anysub, data = HELPrct)
#> Error: link has 3 levels (including NA).  Only 2 are allowed.

prop.test(link ~ anysub, data = HELPrct, na.rm = 1)
#> Warning: NA is being treated as a category for anysub
#> 
#>  3-sample test for equality of proportions without continuity
#>  correction
#> 
#> data:  tally(link ~ anysub)
#> X-squared = 19.25, df = 2, p-value = 6.607e-05
#> alternative hypothesis: two.sided
#> sample estimates:
#>    prop 1    prop 2    prop 3 
#> 0.3750000 0.6174863 0.6979167

prop.test(link ~ anysub, data = HELPrct, na.rm = TRUE)
#> 
#>  2-sample test for equality of proportions with continuity correction
#> 
#> data:  tally(link ~ anysub)
#> X-squared = 9.2749, df = 1, p-value = 0.002323
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.3991840 -0.0857887
#> sample estimates:
#>    prop 1    prop 2 
#> 0.3750000 0.6174863

Created on 2020-02-29 by the reprex package (v0.3.0)

rpruim · 2020-02-29T20:23:23Z

I hope to send this to CRAN early next week.

AmeliaMN · 2020-03-01T16:06:56Z

This looks amazing! Thank you for addressing this.

rpruim · 2020-03-14T17:43:00Z

Just confirming that this went to CRAN and closing the issue.

rpruim pushed a commit that referenced this issue Feb 29, 2020

Addressing #748: NAs in prop.test()

2fb37ef
