Wondering about multiple Dependent variables #746

Open
MichaelJMahometa opened this issue Sep 24, 2019 · 20 comments

Comments

@MichaelJMahometa

First, I love mosaic -- I've been transitioning to it for HS students using R.

I use it also with an undergraduate regression course. In the past I've used something like describe() from psych to get a quick look at the descriptives for multiple variables:

library(dplyr)   # for select() and one_of()
library(psych)   # for describe()
names(GoosePermits)
vars <- c("bid","keep","sell")
describe(select(GoosePermits, one_of(vars)))

But I'd really like to stick with mosaic as much as possible (and with the tidyverse and piping, if possible). Is it possible to get favstats() to produce a table for multiple variables (a summary for several variables at once)? Something like:

#This does NOT work:
favstats(bid + keep + sell ~ NULL, data=GoosePermits)

Any direction or advice is appreciated,
Michael

@rpruim
Contributor

rpruim commented Sep 25, 2019

I'll have to give some thought to this. In principle, it should be possible: we just need to process the formula differently, loop over the LHS variables, and decorate the output so it is clear what's what.

It might be easier to implement in df_stats()
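
A minimal sketch of the approach described above (loop over the left-hand-side variables and label each block of output), written in base R rather than with the actual df_stats() internals. The function name multi_response_stats() and its arguments are hypothetical and used only for illustration:

```r
# Hypothetical helper, not part of mosaic or mosaicCore.
# Computes each statistic for each response variable within each group
# and records the response variable's name in its own column.
multi_response_stats <- function(data, responses, group,
                                 stats = list(mean = mean, sd = sd)) {
  pieces <- lapply(responses, function(v) {
    by_group <- split(data[[v]], data[[group]])
    rows <- lapply(names(by_group), function(g) {
      # One named value per statistic for this response/group combination
      vals <- vapply(stats, function(f) f(by_group[[g]]), numeric(1))
      data.frame(response = v, grp = g, as.list(vals),
                 stringsAsFactors = FALSE)
    })
    do.call(rbind, rows)
  })
  out <- do.call(rbind, pieces)
  names(out)[2] <- group   # label the grouping column with its own name
  out
}

multi_response_stats(iris, c("Sepal.Length", "Sepal.Width"), "Species")
```

The result has one row per response/group pair, which is the same long layout as the df_stats() output shown in this thread.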

@rpruim
Contributor

rpruim commented Sep 25, 2019

Proof of concept:

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
##       _target_    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

@rpruim
Contributor

rpruim commented Sep 25, 2019

To do list

  • Update documentation and examples.
  • Decide on name for column indicating LHS (currently _response_, see below).
  • Decide whether that column should be included even if only one LHS variable is processed.
  • Decide how to deal with custom statistics and naming. (Long names, the current default, don't work well here.)

Some options for the last item:

  1. Change the default from long_names = TRUE to long_names = FALSE. (If we keep response even when there is only one response, this doesn't really lose any information.)
  2. Introduce long_names = "default" and handle the two cases differently.
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
##     _response_    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966

@rpruim
Contributor

rpruim commented Sep 25, 2019

Regarding doing this for favstats()...

  • The code there is not as clean, so it would be harder to implement.
  • The output format is not a data frame, so it is less clear where to record the response variable.

I'm inclined to do this for df_stats() only at this point.

@rpruim
Contributor

rpruim commented Sep 25, 2019

@nicholasjhorton : Any thoughts about naming? We want to avoid using a name that might be among the names of the variables in the data set. Using underscore makes things harder to use downstream, however.

Perhaps we could use response as long as response is not in the names of the data and use _response_ otherwise.
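
As a rough illustration of that fallback rule, here is a small sketch. The helper pick_response_name() is hypothetical and is not claimed to be how mosaicCore implements it:

```r
# Hypothetical helper: use the preferred column name unless the data
# already has a variable with that name, in which case use the backup.
pick_response_name <- function(data_names,
                               preferred = "response",
                               backup = "_response_") {
  if (preferred %in% data_names) backup else preferred
}

pick_response_name(names(iris))                   # no clash in iris
pick_response_name(c("response", "dose", "sex"))  # clash, so use the backup
```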

@rpruim
Contributor

rpruim commented Sep 25, 2019

When processing multiple response expressions, long_names will be set to FALSE. We could additionally make that the default for a single response, but that would result in a change in behavior for old code.

Some examples:

## df_stats(Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
## ##      response    Species  mean        sd
## ## 1 Sepal.Width     setosa 3.428 0.3790644
## ## 2 Sepal.Width versicolor 2.770 0.3137983
## ## 3 Sepal.Width  virginica 2.974 0.3224966

df_stats(Sepal.Width ~ Species, data = iris, mean, sd)
##      response    Species mean_Sepal.Width sd_Sepal.Width
## 1 Sepal.Width     setosa            3.428      0.3790644
## 2 Sepal.Width versicolor            2.770      0.3137983
## 3 Sepal.Width  virginica            2.974      0.3224966

df_stats(Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
##      response    Species  mean        sd
## 1 Sepal.Width     setosa 3.428 0.3790644
## 2 Sepal.Width versicolor 2.770 0.3137983
## 3 Sepal.Width  virginica 2.974 0.3224966

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd)
##       response    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966

# long_names = TRUE is ignored in this situation
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = TRUE)
##       response    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966
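
The naming rule these examples illustrate can be sketched as follows. The helper stat_column_names() is a hypothetical name used for illustration only; it mirrors the behavior shown above (long names only when there is a single response and long_names = TRUE) and is not the actual mosaicCore code:

```r
# Hypothetical helper: decide how the summary columns are named.
# Long names (e.g. mean_Sepal.Width) are used only for a single response;
# with several responses, long_names is ignored and short names are used.
stat_column_names <- function(stat_names, responses, long_names = TRUE) {
  if (long_names && length(responses) == 1L) {
    paste(stat_names, responses, sep = "_")
  } else {
    stat_names
  }
}

stat_column_names(c("mean", "sd"), "Sepal.Width")
stat_column_names(c("mean", "sd"), "Sepal.Width", long_names = FALSE)
stat_column_names(c("mean", "sd"), c("Sepal.Length", "Sepal.Width"))
```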

@rpruim
Contributor

rpruim commented Sep 25, 2019

Updated to do list

  • Documentation and examples
  • Should long_names = FALSE be the default? [Current code has TRUE as default]
  • Should response / _response_ be included in output when long_names = TRUE and there is only one response? [Current code includes.]

@rpruim
Contributor

rpruim commented Sep 25, 2019

Additional item: Need to consider what df_stats( ~ a + b, data = ... ) should do. Currently it is equivalent to a ~ b, but we could make it equivalent to a + b ~ 1.

@rpruim
Contributor

rpruim commented Sep 25, 2019

Here's a proof of concept for the change:

df_stats(~ Sepal.Length + Sepal.Width, data = iris)
##       response min  Q1 median  Q3 max     mean        sd   n missing
## 1 Sepal.Length 4.3 5.1    5.8 6.4 7.9 5.843333 0.8280661 150       0
## 2  Sepal.Width 2.0 2.8    3.0 3.3 4.4 3.057333 0.4358663 150       0

df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)
##       response    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

@MichaelJMahometa
Author

MichaelJMahometa commented Sep 25, 2019

Would

df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)

be equivalent to:

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)

And, would

df_stats(~ Sepal.Length + Sepal.Width, data = iris)

be equivalent to:

df_stats(Sepal.Length + Sepal.Width ~ NULL, data = iris)

(thinking of equivalency with favstats() and mean() concepts in mosaic)

@rpruim
Contributor

rpruim commented Sep 25, 2019

Yes. Basically ~ rhs | cond gets converted into rhs ~ 1 | cond.

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
##       response    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

I'll need to do a bit more testing to make sure I didn't break anything, but this seems to be working as I intended.
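
For readers curious how a conversion like ~ rhs | cond into rhs ~ 1 | cond might look, here is a sketch using base R formula manipulation. The helper to_two_sided() is hypothetical; the real conversion in mosaicCore may well be done differently:

```r
# Hypothetical helper: move the variables of a one-sided formula to the
# left-hand side, keeping any condition after `|` on the right.
to_two_sided <- function(f) {
  stopifnot(inherits(f, "formula"))
  if (length(f) == 3L) return(f)   # already two-sided, leave unchanged
  rhs <- f[[2]]
  if (is.call(rhs) && identical(rhs[[1]], as.name("|"))) {
    lhs <- rhs[[2]]                     # the part before `|`
    new_rhs <- call("|", 1, rhs[[3]])   # becomes 1 | cond
  } else {
    lhs <- rhs
    new_rhs <- 1
  }
  # Evaluating the `~` call builds a formula in the original environment
  eval(call("~", lhs, new_rhs), envir = environment(f))
}

to_two_sided(~ Sepal.Length + Sepal.Width)
to_two_sided(~ Sepal.Length + Sepal.Width | Species)
```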

@rpruim
Contributor

rpruim commented Sep 25, 2019

@MichaelJMahometa, If you want to try it out:

devtools::install_github("ProjectMOSAIC/mosaicCore", ref = "beta")

@dtkaplan
Contributor

I'd recommend against starting a name with underscore since, as you know, it requires back-ticks in many settings. Also, I'm against having the names of the output columns (as opposed to their values) differ depending on the names of variables in the input data frame. I don't think there's any real need, since "response" will be duplicated in the output only if the user creates such a name in the ... of the call to df_stats().

Why "response" and not "variable" or "name" or "variable_name"?

Do you want to allow a formula like . ~ Species to handle all of the variables?

@rpruim
Contributor

rpruim commented Sep 29, 2019

naming the response variable column

I'm not sure what the best name is. variable is perhaps too generic (Species is also a variable in the example above.) I'd like something that makes it clear that this is the thing the mean/sd/etc are computed OF. Do we have a word for that? I chose response because it sits in the "response slot" of the formula if you are thinking about models. But this is easy to change if we come up with something we like better.

I just modified the "backup name" to be response_var_. That avoids needing to escape and is unlikely to collide with things. (But as you say, response is also not likely to collide with names in the output data frame, so this is just some extra caution and not likely to be a behavior many users see.)

long vs short names for summaries

Sounds like your vote is for long_names = FALSE. Especially if there is a column containing the response variable name, I think I'm happy with that (even though it will be a change from previous versions).

expanding .

I thought about handling . on the left side but I haven't decided if we should.

Currently y ~ . works because model.frame() takes care of the expansion for us, and . ~ x does not -- just as in model.frame().

One wrinkle if we allow . ~ x is that if . expands to include both quantitative and categorical variables, the summaries will likely not be meaningful for some of the variables.
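
One possible way around that wrinkle, sketched here purely as an assumption (nothing in this thread says df_stats() does this), would be to expand . only to the numeric columns that are not already on the right-hand side. The helper expand_lhs_dot() is hypothetical:

```r
# Hypothetical helper: choose the variables that `.` on the left-hand side
# could stand for, keeping only numeric columns not used on the right.
expand_lhs_dot <- function(data, rhs_vars) {
  candidates <- setdiff(names(data), rhs_vars)
  is_num <- vapply(data[candidates], is.numeric, logical(1))
  candidates[is_num]
}

expand_lhs_dot(iris, "Species")
```

Silently dropping categorical variables has its own costs, so whether to filter or to let the user see the less meaningful summaries is a design decision rather than a given.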

@rpruim
Contributor

rpruim commented Sep 29, 2019

Since it occurred to both of us, I decided to try implementing support for . ~ rhs. This can be abused with less than desirable results, but I guess there are legitimate use cases.

Example:

df_stats(. ~ Species, data = iris, mean, sd)

##        response    Species  mean        sd
## 1  Sepal.Length     setosa 5.006 0.3524897
## 2  Sepal.Length versicolor 5.936 0.5161711
## 3  Sepal.Length  virginica 6.588 0.6358796
## 4   Sepal.Width     setosa 3.428 0.3790644
## 5   Sepal.Width versicolor 2.770 0.3137983
## 6   Sepal.Width  virginica 2.974 0.3224966
## 7  Petal.Length     setosa 1.462 0.1736640
## 8  Petal.Length versicolor 4.260 0.4699110
## 9  Petal.Length  virginica 5.552 0.5518947
## 10  Petal.Width     setosa 0.246 0.1053856
## 11  Petal.Width versicolor 1.326 0.1977527
## 12  Petal.Width  virginica 2.026 0.2746501

@nicholasjhorton
Contributor

nicholasjhorton commented Oct 2, 2019 via email

@nicholasjhorton
Contributor

nicholasjhorton commented Oct 2, 2019 via email

@rpruim
Contributor

rpruim commented Jun 24, 2020

Looks like this got left on a development branch and didn't get merged into master. I guess I should fix that ;-)

@rpruim
Contributor

rpruim commented Jun 24, 2020

Looks like I need to fix some tests that are written assuming the old behavior.

@rpruim
Contributor

rpruim commented Jun 28, 2020

Tests adjusted (in mosaicCore) to match new behavior.

4 participants