Wondering about multiple Dependent variables #746

MichaelJMahometa · 2019-09-24T15:10:54Z

First, I love mosaic -- I've been transitioning to it for HS students using R.

I use it also with an undergraduate regression course. In the past I've used something like describe() from psych to get a quick look at the descriptives for multiple variables:

names(GoosePermits)
vars <- c("bid","keep","sell")
library(psych)
describe(select(GoosePermits, one_of(vars)))

But I'd really like to stick with mosaic as much as possible (and with the tidyverse, using piping if possible). Is it possible to get favstats() to produce a multiple-variable table (a summary for several variables at once)? Something like:

#This does NOT work:
favstats(bid + keep + sell ~ NULL, data=GoosePermits)

Any direction or advice is appreciated,
Michael


rpruim · 2019-09-25T13:38:46Z

I'll have to give some thought to this. In principle, it should be possible: we just need to process the formula differently, loop over the LHS variables, and decorate the output so it is clear what's what.

It might be easier to implement in df_stats()
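A minimal base-R sketch of the idea described above (split the left-hand side into its variables, loop over them, and label each block of rows). This is an illustration only, not the mosaicCore implementation: the function name multi_response_stats and its stats argument are hypothetical, and the sketch assumes a single grouping variable on the right-hand side.

```r
# Hypothetical helper, not part of mosaic or mosaicCore.
# Assumes a formula of the form  y1 + y2 ~ g  with one grouping variable g.
multi_response_stats <- function(formula, data,
                                 stats = list(mean = mean, sd = sd)) {
  lhs_vars  <- all.vars(formula[[2]])     # variables left of ~
  group_var <- all.vars(formula[[3]])[1]  # first variable right of ~

  pieces <- lapply(lhs_vars, function(v) {
    by_group <- split(data[[v]], data[[group_var]])
    # "Decorate" the output: one column naming the variable summarised.
    out <- data.frame(response = v, group = names(by_group),
                      stringsAsFactors = FALSE)
    names(out)[2] <- group_var
    for (s in names(stats)) {
      out[[s]] <- vapply(by_group, stats[[s]], numeric(1), USE.NAMES = FALSE)
    }
    out
  })

  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

res <- multi_response_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
print(res)
```

The result has one row per combination of response variable and group, which is the same long layout as the df_stats() output shown in this thread.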

rpruim · 2019-09-25T13:42:06Z

Proof of concept:

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
##       _target_    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

rpruim · 2019-09-25T13:51:18Z

To do list

Update documentation and examples.
Decide on name for column indicating LHS (currently _response_, see below).
Decide whether that column should be included even if only one LHS variable is processed.
Decide how to deal with custom statistics and naming. (Long names, the current default, don't work well here.)

Some options for the last item:

change default from long_names = TRUE to long_names = FALSE. (If we keep response even when there is only one response, this doesn't really lose any information.)
Introduce long_names = "default" and handle the two cases differently.

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
##     _response_    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966

rpruim · 2019-09-25T13:53:32Z

Regarding doing this for favstats()...

The code there is not as clean, so it would be harder to implement.
The output format is not a data frame, so it is less clear where to record the response variable.

I'm inclined to do this for df_stats() only at this point.

rpruim · 2019-09-25T13:58:05Z

@nicholasjhorton : Any thoughts about naming? We want to avoid using a name that might be among the names of the variables in the data set. Using underscore makes things harder to use downstream, however.

Perhaps we could use response as long as response is not in the names of the data and use _response_ otherwise.
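The fallback rule suggested above could be sketched as follows. The helper name response_col_name and its arguments are hypothetical and are not taken from mosaicCore.

```r
# Hypothetical helper: prefer "response", but fall back to "_response_"
# when the data already has a column with the preferred name.
response_col_name <- function(data_names,
                              preferred = "response",
                              fallback  = "_response_") {
  if (preferred %in% data_names) fallback else preferred
}

response_col_name(names(iris))            # no clash with the iris columns
response_col_name(c("response", "dose"))  # clash, so the fallback is used
```

One consequence of this rule is that the name of an output column depends on the names in the input data, which is the trade-off being weighed in this thread.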

rpruim · 2019-09-25T14:29:26Z

When processing multiple response expressions, long_names will be set to FALSE. We could additionally make that the default for a single response, but that would result in a change in behavior for old code.

Some examples:

## df_stats(Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
## ##      response    Species  mean        sd
## ## 1 Sepal.Width     setosa 3.428 0.3790644
## ## 2 Sepal.Width versicolor 2.770 0.3137983
## ## 3 Sepal.Width  virginica 2.974 0.3224966

df_stats(Sepal.Width ~ Species, data = iris, mean, sd)
##      response    Species mean_Sepal.Width sd_Sepal.Width
## 1 Sepal.Width     setosa            3.428      0.3790644
## 2 Sepal.Width versicolor            2.770      0.3137983
## 3 Sepal.Width  virginica            2.974      0.3224966

df_stats(Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
##      response    Species  mean        sd
## 1 Sepal.Width     setosa 3.428 0.3790644
## 2 Sepal.Width versicolor 2.770 0.3137983
## 3 Sepal.Width  virginica 2.974 0.3224966

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd)
##       response    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966

# long_names = TRUE is ignored in this situation
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = TRUE)
##       response    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966
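The naming behavior in the examples above can be summarized in a small base-R sketch. The function stat_col_names is hypothetical; it only mimics the column names visible in the output above and is not the actual mosaicCore code.

```r
# Hypothetical helper: build names for the summary columns.
# Long names (statistic_variable) are used only for a single response;
# with several responses, long_names is effectively forced to FALSE.
stat_col_names <- function(stat_names, responses, long_names = TRUE) {
  if (length(responses) > 1 || !long_names) {
    stat_names
  } else {
    paste(stat_names, responses, sep = "_")
  }
}

stat_col_names(c("mean", "sd"), "Sepal.Width")
stat_col_names(c("mean", "sd"), "Sepal.Width", long_names = FALSE)
stat_col_names(c("mean", "sd"), c("Sepal.Length", "Sepal.Width"))
```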

rpruim · 2019-09-25T14:32:54Z

Updated to do list

Documentation and examples
Should long_names = FALSE be the default? [Current code has TRUE as default]
Should response / _response_ be included in output when long_names = TRUE and there is only one response? [Current code includes.]

rpruim · 2019-09-25T16:23:50Z

Additional item: Need to consider what df_stats( ~ a + b, data = ... ) should do. Currently it is equivalent to a ~ b, but we could make it equivalent to a + b ~ 1.

rpruim · 2019-09-25T16:31:43Z

Here's POC for the change:

df_stats(~ Sepal.Length + Sepal.Width, data = iris)
##       response min  Q1 median  Q3 max     mean        sd   n missing
## 1 Sepal.Length 4.3 5.1    5.8 6.4 7.9 5.843333 0.8280661 150       0
## 2  Sepal.Width 2.0 2.8    3.0 3.3 4.4 3.057333 0.4358663 150       0

df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)
##       response    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

MichaelJMahometa · 2019-09-25T16:34:57Z

Would

df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)

have the equivalent:

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)

And, would

df_stats(~ Sepal.Length + Sepal.Width, data = iris)

have the equivalent:

df_stats(Sepal.Length + Sepal.Width ~ NULL, data = iris)

(thinking of equivalency with favstats() and mean() concepts in mosaic)

rpruim · 2019-09-25T16:38:02Z

Yes. Basically ~ rhs | cond gets converted into rhs ~ 1 | cond.

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
##       response    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

I'll need to do a bit more testing to make sure I didn't break anything, but this seems to be working as I intended.
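For readers curious how the conversion of ~ rhs | cond into rhs ~ 1 | cond might look, here is a base-R sketch that manipulates the formula as a language object. The function one_sided_to_two_sided is a hypothetical name and is not the code on the beta branch.

```r
# Hypothetical helper: turn a one-sided formula into a two-sided one.
#   ~ rhs         becomes  rhs ~ 1
#   ~ rhs | cond  becomes  rhs ~ 1 | cond
# Two-sided formulas are returned unchanged.
one_sided_to_two_sided <- function(f) {
  stopifnot(inherits(f, "formula"))
  if (length(f) == 3) return(f)

  rhs <- f[[2]]
  if (is.call(rhs) && identical(rhs[[1]], as.name("|"))) {
    new_lhs <- rhs[[2]]
    new_rhs <- call("|", 1, rhs[[3]])
  } else {
    new_lhs <- rhs
    new_rhs <- 1
  }
  # Evaluating a call to `~` builds a formula in the original environment.
  eval(call("~", new_lhs, new_rhs), envir = environment(f))
}

f1 <- one_sided_to_two_sided(~ Sepal.Length + Sepal.Width)
f2 <- one_sided_to_two_sided(~ Sepal.Length + Sepal.Width | Species)
print(f1)
print(f2)
```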

rpruim · 2019-09-25T16:39:44Z

@MichaelJMahometa, If you want to try it out:

devtools::install_github("ProjectMOSAIC/mosaicCore", ref = "beta")

dtkaplan · 2019-09-28T22:50:15Z

I'd recommend against starting a name with underscore since, as you know, it requires back-ticks in many settings. Also, I'm against having the names of the output columns (as opposed to their values) differ depending on the names of variables in the input data frame. I don't think there's any real need, since "response" will be duplicated in the output only if the user creates such a name in the ... of the call to df_stats().

Why "response" and not "variable" or "name" or "variable_name"?

Do you want to allow a formula like . ~ Species to handle all of the variables?

rpruim · 2019-09-29T00:56:26Z

naming the response variable column

I'm not sure what the best name is. variable is perhaps too generic (Species is also a variable in the example above.) I'd like something that makes it clear that this is the thing the mean/sd/etc are computed OF. Do we have a word for that? I chose response because it sits in the "response slot" of the formula if you are thinking about models. But this is easy to change if we come up with something we like better.

I just modified the "backup name" to be response_var_. That avoids needing to escape and is unlikely to collide with things. (But as you say, response is also not likely to collide with names in the output data frame, so this is just some extra caution and not likely to be a behavior many users see.)

long vs short names for summaries

Sounds like your vote is for long_names = FALSE. Especially if there is a column containing the response variable name, I think I'm happy with that (even though it will be a change from previous versions).

expanding .

I thought about handling . on the left side but I haven't decided if we should.

Currently y ~ . works because model.frame() takes care of the expansion for us, but . ~ x does not, just as in model.frame().

One wrinkle if we allow . ~ x is that if . expands to include both quantitative and categorical variables, the summaries will likely not be meaningful for some of the variables.
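One way to picture the expansion and the wrinkle just mentioned: expand . to every column not used on the right-hand side, and optionally keep only the numeric ones. The helper expand_lhs_dot and its numeric_only argument are hypothetical; whether df_stats() should filter like this is exactly the open question.

```r
# Hypothetical helper: list the variables that `.` on the left-hand side
# would stand for, optionally dropping non-numeric columns.
expand_lhs_dot <- function(formula, data, numeric_only = TRUE) {
  stopifnot(identical(formula[[2]], as.name(".")))
  rhs_vars   <- all.vars(formula[[3]])
  candidates <- setdiff(names(data), rhs_vars)
  if (numeric_only) {
    is_num <- vapply(data[candidates], is.numeric, logical(1))
    candidates <- candidates[is_num]
  }
  candidates
}

# A small made-up data frame with one numeric and one character column.
toy <- data.frame(x = 1:3, g = c("a", "b", "a"), label = c("u", "v", "w"),
                  stringsAsFactors = FALSE)

expand_lhs_dot(. ~ Species, iris)
expand_lhs_dot(. ~ g, toy)
expand_lhs_dot(. ~ g, toy, numeric_only = FALSE)
```

Without the filter, a statistic such as mean would be applied to the character column label, which is the kind of summary that is unlikely to be meaningful.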

rpruim · 2019-09-29T01:45:17Z

Since it occurred to both of us, I decided to try implementing support for . ~ rhs. This can be abused with less than desirable results, but I guess there are legitimate use cases.

Example:

df_stats(. ~ Species, data = iris, mean, sd)

##        response    Species  mean        sd
## 1  Sepal.Length     setosa 5.006 0.3524897
## 2  Sepal.Length versicolor 5.936 0.5161711
## 3  Sepal.Length  virginica 6.588 0.6358796
## 4   Sepal.Width     setosa 3.428 0.3790644
## 5   Sepal.Width versicolor 2.770 0.3137983
## 6   Sepal.Width  virginica 2.974 0.3224966
## 7  Petal.Length     setosa 1.462 0.1736640
## 8  Petal.Length versicolor 4.260 0.4699110
## 9  Petal.Length  virginica 5.552 0.5518947
## 10  Petal.Width     setosa 0.246 0.1053856
## 11  Petal.Width versicolor 1.326 0.1977527
## 12  Petal.Width  virginica 2.026 0.2746501

nicholasjhorton · 2019-10-02T14:29:26Z

I really like the . addition: nicely done!


nicholasjhorton · 2019-10-02T14:30:09Z

I like this proposal.

On Sep 28, 2019, at 8:56 PM, Randall Pruim wrote: I just modified the "backup name" to be response_var_.

rpruim · 2020-06-24T22:36:41Z

Looks like this got left on a development branch and didn't get merged into master. I guess I should fix that ;-)

rpruim · 2020-06-24T22:39:38Z

Looks like I need to fix some tests that are written assuming the old behavior.

rpruim · 2020-06-28T03:47:15Z

Tests adjusted (in mosaicCore) to match new behavior.

rpruim mentioned this issue Jun 28, 2020

Allow multiple response variables in df_stats() ProjectMOSAIC/mosaicCore#27

Closed
