Merge pull request #93 from alan-turing-institute/dev

Revamp readme
JuliaAI · Feb 12, 2020 · 67cbae6
2 parents aa5ebd0 + 1659a3e
commit 67cbae6

Showing 2 changed files with 226 additions and 77 deletions.
diff --git a/README.md b/README.md
@@ -4,21 +4,16 @@
 | :-----------: | :------: |
 | [![Build Status](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl.svg?branch=master)](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl) | [![codecov.io](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl/coverage.svg?branch=master)](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl?branch=master) |
 
-A light-weight, dependency-free Julia interface for implementing conventions
-about the scientific interpretation of data.
-This package should only be used by developers who intend to define their own
-scientific type convention.
-The [MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl) packages implements such a convention used in the [MLJ](https://github.com/alan-turing-institute/MLJ.jl)
-universe.
+A light-weight, dependency-free Julia interface defining a collection
+of types (without instances) for implementing conventions about the
+scientific interpretation of data.
 
-## Purpose
+This package makes the distinction between the **machine type** and
+**scientific type** of data:
 
-The package makes the distinction between **machine type** and **scientific type**:
-
-* the _machine type_ is a Julia type the data is currently encoded as (for instance: `Float64`)
-* the _scientific type_ is a type defined by this package which
-  encapsulates how the data should be _interpreted_ (for instance:
-  `Continuous` or `Multiclass`)
+* The _machine type_ is a Julia type the data is currently encoded as (e.g., `Float64`)
+* The _scientific type_ is a type defined by this package which
+  encapsulates how the data should be _interpreted_ (e.g., `Continuous` or `Multiclass`)
 
 The distinction is useful because the same machine type is often used
 to represent data with *differing* scientific interpretations - `Int`
@@ -27,11 +22,38 @@ is used for product numbers (a factor) but also for a person's weight
 type is frequently represented by *different* machine types - both
 `Int` and `Float64` are used to represent weights, for example.
 
-### Type hierarchy
 
-The package provides a hierarchy of Julia types representing data types for use
-in method dispatch (e.g., for trait values). Instances of the types play no
-role.
+#### Contents
+
+ - [Who is this repository for?](#who-is-this-repository-for)
+ - [What is provided here?](#what-is-provided-here)
+ - [Defining a new convention](#defining-a-new-convention)
+
+
+## Who is this repository for?
+
+This package
+should only be used by developers who intend to define their own
+scientific type convention.  The
+[MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl)
+package implements such a convention used in the
+[MLJ](https://github.com/alan-turing-institute/MLJ.jl) universe.
+
+The purpose of this package is to provide a mechanism for articulating
+conventions around the scientific interpretation of data. With such a
+convention in place, a numerical algorithm declares its data
+requirements in terms of scientific types, the user has a convenient
+way to check compliance of their data with that requirement, and the
+developer understands precisely the constraints their data specification
+places on the actual machine type of the data supplied.
+
+## What is provided here?
+
+#### 1. Scientific types
+
+ScientificTypes provides a hierarchy of Julia types
+representing data types for use in method dispatch (e.g., for trait
+values). Instances of the types play no role.
 
 ```
 Found
@@ -50,99 +72,224 @@ Found
 └─ Unknown
 ```
 
-## Defining a new convention
+The types `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all
+parametrised by the number of levels `N`, while `Image{W,H}`,
+`GrayImage{W,H}` and `ColorImage{W,H}` are all parametrised by the
+image width and height dimensions, `(W, H)`. 
 
-If you want to implement your own convention, you can consider the [MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl) as a blueprint.
+The `Table` type also has a type parameter, for conveying the
+scientific type(s) of table columns. See [More on the `Table`
+type](#more-on-the-table-type).
 
-The steps below summarise the possible steps in defining such a convention:
+The Julia-native `Missing` type is also regarded as a scientific
+type.
 
-* declare a new convention,
-* declare new traits,
-* add new scientific types,
-* add explicit `scitype` and `Scitype` definitions,
-* define a `coerce` function.
+#### 2. The `scitype` and `Scitype` methods
 
-Each step is explained below taking the MLJ convention as an example.
+ScientificTypes provides a method `scitype` for articulating a
+particular convention: `scitype(X)` is the scientific type of object
+`X`. For example, in the `MLJ` convention, implemented by
+[MLJScientificTypes](https://github.com/alan-turing-institute/MLJScientificTypes.jl),
+one has `scitype(3.14) = Continuous` and `scitype(42) = Count`.
 
-### Declaring a new convention
+> *Aside.* `scitype` is *not* a mapping from types to types but from
+> *instances* to types. This is because one may want to distinguish
+> the scientific type of objects having the same machine type. For
+> example, in the `MLJ` convention, some
+> `CategoricalArrays.CategoricalValue` objects have the scitype
+> `OrderedFactor` but others are `Multiclass`. In CategoricalArrays.jl
+> the `ordered` attribute is not a type parameter and so it can only
+> be extracted from instances. 
 
-In the module, define a
+The developer implementing a particular scientific type convention
+[overloads](#defining-a-new-convention) the `scitype` method
+appropriately. However, this package provides certain rudimentary
+fallback behaviour; only Property 1 below should be altered by the
+developer:
+
+**Property 0.** `scitype(missing) = Missing` (`Missing` is the only native type also regarded as a scientific type).
+
+**Property 1.** `scitype(X) = Unknown`, unless `X` is a tuple, an
+abstract array, or `missing`.
+
+**Property 2.** The scitype of a `k`-tuple is `Tuple{S1, S2, ...,
+Sk}` where `Sj` is the scitype of the `j`th element.
+
+For example, in the `MLJ` convention:
 
 ```julia
-struct MyConvention <: ScientificTypes.Convention end
+julia> scitype((1, 4.5))
+Tuple{Count, Continuous}
 ```
 
-and add an init function with:
+**Property 3.** The scitype of an `AbstractArray`, `A`, is
+always `AbstractArray{U}` where `U` is the union of the scitypes of the
+elements of `A`, with one exception: If `typeof(A) <:
+AbstractArray{Union{Missing,T}}` for some `T` different from `Any`,
+then the scitype of `A` is `AbstractArray{Union{Missing, U}}`, where
+`U` is the union over all non-missing elements, **even if `A` has no
+missing elements.**
+
+The exception is made for performance reasons. In `MLJ`:
 
 ```julia
-function __init__()
-  ScientificTypes.set_convention(MyConvention())
-end
+julia> v = [1.3, 4.5, missing]
+julia> scitype(v)
+AbstractArray{Union{Missing, Continuous},1}
 ```
 
-Subsequently you will have functions dispatching over `::MyConvention` for
-instance in the MLJ case:
-
 ```julia
-ScientificTypes.scitype(::Integer, ::MLJ) = Count
+julia> scitype(v[1:2])
+AbstractArray{Union{Missing, Continuous},1}
 ```
 
-### Declaring new traits
+> *Performance note.* Computing type unions over large arrays is
+> expensive and, depending on the convention's implementation and the
+> array eltype, computing the scitype can be slow. In the common case
+> that the scitype of an array can be determined from the machine type
+> of the object alone, the implementer of a new convention can speed
+> up computations by implementing a `Scitype` method.  Do
+> `?ScientificTypes.Scitype` for details.
+
+
+#### 3. Trait dictionary
+
+ScientificTypes provides a dictionary `TRAIT_FUNCTION_GIVEN_NAME` for
+registering names (symbols) for boolean-valued trait functions used to
+dispatch `scitype` in cases where direct type-dispatch is
+inadequate. See [below](#adding-explicit-scitype-declarations) for
+details.
+
+#### 4. Convenience methods
+
+ScientificTypes provides the following convenience functions:
+
+- `trait(X)` - return the trait name associated with the trait holding for `X`
 
-It's useful to mark containers that meet explicit traits; by default everything
-is marked as `:other`. In the MLJ convention, we specifically consider all
-containers that meet the [`Tables.jl`](https://github.com/JuliaData/Tables.jl)
-interface. In order to declare this you have to add a key to the
-`TRAIT_FUNCTION_GIVEN_NAME` dictionary with a boolean function that verifies
-the trait. This must also be placed in your `__init__` function.
-In the case of the MLJ convention:
+- `set_convention(C)` - activate the convention named `C`
+
+- `set_convention()` - inspect the active convention
+
+- `scitype_union(A)` - return the union of the scitypes of all elements of iterable `A`
+
+- `elscitype(A)` - return the "element scitype" of array `A`
+
+Query the doc-strings for details.
+
+
+#### More on the `Table` type
+
+An object of scitype `Table{K}` is expected to have a notion of
+"columns", which are `AbstractVector`s, and the intention of the type
+parameter `K` is to encode the scientific type(s) of its
+columns. Specifically, developers are requested to adhere to the
+following:
+
+**Tabular data convention.** If `scitype(X) <: Table`, then in fact
 
 ```julia
-function __init__()
-    ScientificTypes.set_convention(MLJ())
-    ScientificTypes.TRAIT_FUNCTION_GIVEN_NAME[:table] = Tables.istable
-end
+scitype(X) == Table{Union{scitype(c1), ..., scitype(cn)}}
+```
+
+where `c1`, `c2`, ..., `cn` are the columns of `X`. With this
+definition, common type checks can be performed with tables.  For
+instance, you could check that each column of `X` has an element
+scitype that is either `Continuous` or `Finite`:
+
+```@example 5
+scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
 ```
 
-### Adding scientific types
+A built-in `Table` constructor provides a shorthand for the right-hand side:
+
+```@example 5
+scitype(X) <: Table(Continuous, Finite)
+```
+
+Note that `Table(Continuous,Finite)` is a *type* union and not a `Table` *instance*.
+
+
+## Defining a new convention
 
-You may want to extend the type hierarchy defined above. This is done as usual
-with something like
+If you want to implement your own convention, you can consider the
+[MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl)
+as a blueprint.
+
+The steps below summarise the possible steps in defining such a convention:
+
+* declare a new convention,
+* add explicit `scitype` (and `Scitype`) definitions,
+* register any traits that were needed to define scitypes,
+* optionally define `coerce` methods for your convention.
+
+Each step is explained below, taking the MLJ convention as an example.
+
+### Naming the convention
+
+In the module, define a
 
 ```julia
-struct MyNewType{P} <: Known end
+struct MyConvention <: ScientificTypes.Convention end
 ```
 
-Recall that Scientific Types are only used for dispatching and so should not
-have fields.
+and add an init function with:
 
-### Adding explicit `scitype` and `Scitype` definitions
+```julia
+function __init__()
+  ScientificTypes.set_convention(MyConvention())
+end
+```
 
-The `scitype` functions indicate default mappings from *machine type* to a
-*scientific type*. For instance in the MLJ convention:
+### Adding explicit `scitype` declarations
+
+When overloading `scitype` one needs to dispatch over the convention,
+as in this example:
 
 ```julia
-ScientificType.scitype(::Integer, ::MLJ) = Count
+ScientificTypes.scitype(::Integer, ::MLJ) = Count
 ```
 
-where `::MLJ` refers to the convention.
+In some cases, however, the scientific type to be attributed to an
+object might depend on the evaluation of a boolean-valued trait
+function. There is a mechanism for "registering" such traits to
+streamline trait-based dispatch of the `scitype` method. This is best
+illustrated with an example.
+
+In the MLJ convention, all containers that meet the
+[`Tables.jl`](https://github.com/JuliaData/Tables.jl) interface are
+deemed to have scitype `Table`. These are detected using the Tables.jl
+trait `istable`. Our first step is to choose a name for the trait, in
+this case `:table`. Our `scitype` declaration then reads:
+
+```
+function ScientificTypes.scitype(X, ::MLJ, ::Val{:table})
+   K = <some type depending on columns of X>
+   return Table{K}
+end
+```
 
-The `Scitype` functions will typically match a few of your `scitype` functions
-to automatically obtain the scientific type of arrays of a type.
-For instance in the MLJ convention:
+For this to work we now need to register the trait, which means adding
+to the `TRAIT_FUNCTION_GIVEN_NAME` dictionary, which should be
+performed within the init function of the defining package:
 
 ```julia
-ST.Scitype(::Type{<:Integer}, ::MLJ) = Count
+function __init__()
+    ScientificTypes.set_convention(MLJ())
+    ScientificTypes.TRAIT_FUNCTION_GIVEN_NAME[:table] = Tables.istable
+end
 ```
 
-meaning that the scitype of an array such as `[1,2,3]` will directly be
-inferred as an array of `Count`.
+**Important limitation.** One may not add a trait function to
+the `TRAIT_FUNCTION_GIVEN_NAME` dictionary if it returns `true` for some
+object `X` for which an already registered trait function also returns `true`.
+
 
 ### Defining a `coerce` function
 
-It may be very useful to define a function allowing you to convert an object
-with one scitype to another scitype. In the MLJ convention, this is assumed by
-the `coerce` function.
+It may be very useful to define a function to coerce machine types so
+as to correct an unintended scientific interpretation, according to a
+given convention.  In the `MLJ` convention, this is implemented by
+defining `coerce` methods (no stub provided by `ScientificTypes`).
 
 For instance consider the simplified:
 
@@ -153,11 +300,13 @@ function coerce(y::AbstractArray{T}, T2::Type{<:Union{Missing,Continuous}}
 end
 ```
 
-This maps an array of Real to an array of `AbstractFloat` (which are mapped to
-`Continuous` in the MLJ convention).
+Under this definition, `coerce([1, 2, 4], Continuous)` is mapped to
+`[1.0, 2.0, 4.0]`, which has scitype `AbstractVector{Continuous}`.
+
+In the case of tabular data, one might additionally define `coerce`
+methods to selectively coerce data in specified columns. See
+[MLJScientificTypes](https://github.com/alan-turing-institute/MLJScientificTypes.jl)
+for examples.
+
+
 
-Further, if you work with specific containers, you may want to define a
-`coerce` function that works on the container by applying `coerce` on each
-of the features. In the MLJ convention, we work with tabular objects and
-define a `coerce` function which applies specific coercion on each of the
-columns.
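Note on the `coerce` example in the README hunk above: the diff shows only part of that function, so the sketch below is not the MLJ implementation. It is a minimal, self-contained illustration of the stated behaviour (`[1, 2, 4]` coerced to `[1.0, 2.0, 4.0]`). The name `coerce_sketch` and the locally defined `Continuous` stand-in are assumptions made so that the snippet runs without ScientificTypes.jl installed.

```julia
# Stand-in scientific type so the sketch runs without ScientificTypes.jl;
# the real `Continuous` is defined by that package.
abstract type Continuous end

# Hypothetical, simplified coercion: re-encode real-valued data as floats,
# the machine type that the MLJ convention interprets as `Continuous`.
coerce_sketch(y::AbstractArray{<:Real}, ::Type{Continuous}) = float(y)

v = coerce_sketch([1, 2, 4], Continuous)
println(v)          # the coerced values
println(eltype(v))  # the new machine element type
```

Handling of `missing` values, which the truncated signature in the README hints at, is deliberately left out of this sketch.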
diff --git a/src/ScientificTypes.jl b/src/ScientificTypes.jl
@@ -93,7 +93,7 @@ const TRAIT_FUNCTION_GIVEN_NAME = Dict{Symbol,Function}()
 """
     trait(X)
 
-Check `X` against traits specified in `TRAIT_FUNCTION_GIVEN_NAME` and returns
+Check `X` against traits specified in `TRAIT_FUNCTION_GIVEN_NAME` and return
 a symbol corresponding to the matching trait, or `:other` if `X` didn't match
 any of the trait functions.
 """