Commit

Merge pull request #93 from alan-turing-institute/dev
Revamp readme
ablaom authored Feb 12, 2020
2 parents aa5ebd0 + 1659a3e commit 67cbae6
Showing 2 changed files with 226 additions and 77 deletions.
301 changes: 225 additions & 76 deletions README.md
@@ -4,21 +4,16 @@
| :-----------: | :------: |
| [![Build Status](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl.svg?branch=master)](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl) | [![codecov.io](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl/coverage.svg?branch=master)](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl?branch=master) |

A light-weight, dependency-free Julia interface for implementing conventions
about the scientific interpretation of data.
This package should only be used by developers who intend to define their own
scientific type convention.
The [MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl) packages implements such a convention used in the [MLJ](https://github.com/alan-turing-institute/MLJ.jl)
universe.
A light-weight, dependency-free, Julia interface defining a collection
of types (without instances) for implementing conventions about the
scientific interpretation of data.

## Purpose
This package makes the distinction between the **machine type** and
**scientific type** of data:

The package makes the distinction between **machine type** and **scientific type**:

* the _machine type_ is a Julia type the data is currently encoded as (for instance: `Float64`)
* the _scientific type_ is a type defined by this package which
encapsulates how the data should be _interpreted_ (for instance:
`Continuous` or `Multiclass`)
* The _machine type_ is a Julia type the data is currently encoded as (e.g., `Float64`)
* The _scientific type_ is a type defined by this package which
encapsulates how the data should be _interpreted_ (e.g., `Continuous` or `Multiclass`)

The distinction is useful because the same machine type is often used
to represent data with *differing* scientific interpretations - `Int`
@@ -27,11 +22,38 @@ is used for product numbers (a factor) but also for a person's weight
type is frequently represented by *different* machine types - both
`Int` and `Float64` are used to represent weights, for example.

### Type hierarchy

The package provides a hierarchy of Julia types representing data types for use
in method dispatch (e.g., for trait values). Instances of the types play no
role.
#### Contents

- [Who is this repository for?](#who-is-this-repository-for)
- [What is provided here?](#what-is-provided-here)
- [Defining a new convention](#defining-a-new-convention)


## Who is this repository for?

This package
should only be used by developers who intend to define their own
scientific type convention. The
[MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl)
package implements such a convention used in the
[MLJ](https://github.com/alan-turing-institute/MLJ.jl) universe.

The purpose of this package is to provide a mechanism for articulating
conventions around the scientific interpretation of data. With such a
convention in place, a numerical algorithm declares its data
requirements in terms of scientific types, the user has a convenient
way to check that their data complies with that requirement, and the
developer understands precisely the constraints their data specification
places on the actual machine type of the data supplied.

## What is provided here?

#### 1. Scientific types

ScientificTypes provides a hierarchy of Julia types
representing data types for use in method dispatch (e.g., for trait
values). Instances of the types play no role.

```
Found
@@ -50,99 +72,224 @@ Found
└─ Unknown
```

## Defining a new convention
The types `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all
parametrised by the number of levels `N`, while `Image{W,H}`,
`GrayImage{W,H}` and `ColorImage{W,H}` are all parametrised by the
image width and height dimensions, `(W, H)`.

If you want to implement your own convention, you can consider the [MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl) as a blueprint.
The `Table` type also has a type parameter, for conveying the
scientific type(s) of table columns. See [More on the `Table`
type](#more-on-the-table-type).

The steps below summarise the possible steps in defining such a convention:
Julia's native `Missing` type is also regarded as a scientific
type.

* declare a new convention,
* declare new traits,
* add new scientific types,
* add explicit `scitype` and `Scitype` definitions,
* define a `coerce` function.
#### 2. The `scitype` and `Scitype` methods

Each step is explained below taking the MLJ convention as an example.
ScientificTypes provides a method `scitype` for articulating a
particular convention: `scitype(X)` is the scientific type of object
`X`. For example, in the `MLJ` convention, implemented by
[MLJScientificTypes](https://github.com/alan-turing-institute/MLJScientificTypes.jl),
one has `scitype(3.14) = Continuous` and `scitype(42) = Count`.

### Declaring a new convention
> *Aside.* `scitype` is *not* a mapping from types to types but from
> *instances* to types. This is because one may want to distinguish
> the scientific type of objects having the same machine type. For
> example, in the `MLJ` convention, some
> `CategoricalArrays.CategoricalValue` objects have the scitype
> `OrderedFactor` but others are `Multiclass`. In CategoricalArrays.jl
> the `ordered` attribute is not a type parameter and so it can only
> be extracted from instances.
In the module, define a
The developer implementing a particular scientific type convention
[overloads](#defining-a-new-convention) the `scitype` method
appropriately. However, this package provides certain rudimentary
fallback behaviour; only Property 1 below should be altered by the
developer:

**Property 0.** `scitype(missing) = Missing` (`Missing` is the only native type also regarded as a scientific type).

**Property 1.** `scitype(X) = Unknown`, unless `X` is a tuple, an
abstract array, or `missing`.

**Property 2.** The scitype of a `k`-tuple is `Tuple{S1, S2, ...,
Sk}` where `Sj` is the scitype of the `j`th element.

For example, in the `MLJ` convention:

```julia
struct MyConvention <: ScientificTypes.Convention end
julia> scitype((1, 4.5))
Tuple{Count, Continuous}
```
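Properties 0 to 2 can be illustrated without the package. The following is a minimal, self-contained sketch and not the package's implementation: the type definitions are simplified stand-ins for those exported by ScientificTypes, and `toy_scitype` is a hypothetical name whose numeric rules merely imitate the `MLJ` convention quoted above.

```julia
# Stand-in types (assumption: simplified copies of part of the package's hierarchy).
abstract type Found end
abstract type Known <: Found end
struct Unknown <: Found end
abstract type Infinite <: Known end
struct Continuous <: Infinite end
struct Count <: Infinite end

# `toy_scitype` is a hypothetical name, not part of ScientificTypes.
toy_scitype(::Any) = Unknown               # Property 1: fallback
toy_scitype(::Missing) = Missing           # Property 0
toy_scitype(::AbstractFloat) = Continuous  # imitates the MLJ convention
toy_scitype(::Integer) = Count             # imitates the MLJ convention

# Property 2: a k-tuple maps to the Tuple type of its elements' scitypes.
toy_scitype(t::Tuple) = Tuple{map(toy_scitype, t)...}

toy_scitype((1, 4.5)) == Tuple{Count, Continuous}   # true
```

Because the rules are ordinary methods, a convention can add cases simply by defining more specific methods, while the `::Any` fallback keeps unrecognised objects at `Unknown`.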

and add an init function with:
**Property 3.** The scitype of an `AbstractArray`, `A`, is
always `AbstractArray{U}` where `U` is the union of the scitypes of the
elements of `A`, with one exception: if `typeof(A) <:
AbstractArray{Union{Missing,T}}` for some `T` different from `Any`,
then the scitype of `A` is `AbstractArray{Union{Missing, U}}`, where
`U` is the union over all non-missing elements, **even if `A` has no
missing elements.**

The exception is made for performance reasons. In `MLJ`:

```julia
function __init__()
ScientificTypes.set_convention(MyConvention())
end
julia> v = [1.3, 4.5, missing]
julia> scitype(v)
AbstractArray{Union{Missing, Continuous},1}
```

Subsequently you will have functions dispatching over `::MyConvention` for
instance in the MLJ case:

```julia
ScientificTypes.scitype(::Integer, ::MLJ) = Count
julia> scitype(v[1:2])
AbstractArray{Union{Missing, Continuous},1}
```
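The array rule and its exception can be sketched in plain Julia. This is an illustrative approximation under stated assumptions, not the package's code: the three structs are stand-ins for the package's types, and `toy_scitype` and `toy_union` are hypothetical names.

```julia
# Stand-in types (assumption: simplified versions of the package's types).
struct Unknown end
struct Continuous end
struct Count end

# Element rules imitating the MLJ convention (hypothetical names).
toy_scitype(::Any) = Unknown
toy_scitype(::Missing) = Missing
toy_scitype(::AbstractFloat) = Continuous
toy_scitype(::Integer) = Count

# Union of the scitypes of all elements of an iterable.
function toy_union(itr)
    U = Union{}
    for x in itr
        U = Union{U, toy_scitype(x)}
    end
    return U
end

function toy_scitype(A::AbstractArray{T,N}) where {T,N}
    if T !== Any && Missing <: T
        # Exception: trust the eltype, so `Missing` is included even when
        # no element is actually missing.
        return AbstractArray{Union{Missing, toy_union(skipmissing(A))}, N}
    end
    return AbstractArray{toy_union(A), N}
end

v = [1.3, 4.5, missing]
toy_scitype(v) == toy_scitype(v[1:2])   # true: the eltype of v[1:2] still admits `missing`
```

The slice `v[1:2]` keeps the element type `Union{Missing, Float64}`, which is why both calls give the same answer even though the slice contains no missing values.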

### Declaring new traits
> *Performance note.* Computing type unions over large arrays is
> expensive and, depending on the convention's implementation and the
> array eltype, computing the scitype can be slow. In the common case
> that the scitype of an array can be determined from the machine type
> of the object alone, the implementer of a new convention can speed
> up computations by implementing a `Scitype` method. Run
> `?ScientificTypes.Scitype` for details.
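The idea behind the performance note can be sketched as a type-level shortcut that is consulted before any element is inspected. The code below is a hypothetical illustration only: `toy_Scitype` and `toy_scitype` are made-up names, the structs are stand-ins, and the way the real `Scitype` method interacts with `scitype` may differ.

```julia
# Stand-in types (assumption: simplified versions of the package's types).
struct Unknown end
struct Continuous end
struct Count end

# Instance-level rules (hypothetical, imitating the MLJ convention).
toy_scitype(::Any) = Unknown
toy_scitype(::AbstractFloat) = Continuous
toy_scitype(::Integer) = Count

# Type-level shortcut: answers from the element type alone when it can.
toy_Scitype(::Type) = Unknown
toy_Scitype(::Type{<:AbstractFloat}) = Continuous
toy_Scitype(::Type{<:Integer}) = Count

function toy_scitype(A::AbstractArray{T,N}) where {T,N}
    S = toy_Scitype(T)
    S === Unknown || return AbstractArray{S,N}   # fast path: no element scan
    U = Union{}                                   # slow path: inspect every element
    for x in A
        U = Union{U, toy_scitype(x)}
    end
    return AbstractArray{U,N}
end

toy_scitype(rand(1000)) == AbstractArray{Continuous,1}   # true, via the fast path
```

In this sketch the cost of the fast path does not depend on the array length, whereas the slow path visits every element.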

#### 3. Trait dictionary

ScientificTypes provides a dictionary `TRAIT_FUNCTION_GIVEN_NAME` for
registering names (symbols) for boolean-valued trait functions used to
dispatch `scitype` in cases where direct type-dispatch is
inadequate. See [below](#adding-explicit-scitype-declarations) for
details.

#### 4. Convenience methods

ScientificTypes provides the following convenience functions:

- `trait(X)` - return the trait name associated with the trait holding for `X`

It's useful to mark containers that meet explicit traits; by default everything
is marked as `:other`. In the MLJ convention, we specifically consider all
containers that meet the [`Tables.jl`](https://github.com/JuliaData/Tables.jl)
interface. In order to declare this you have to add a key to the
`TRAIT_FUNCTION_GIVEN_NAME` dictionary with a boolean function that verifies
the trait. This must also be placed in your `__init__` function.
In the case of the MLJ convention:
- `set_convention(C)` - activate the convention named `C`

- `set_convention()` - inspect the active convention

- `scitype_union(A)` - return the union of the scitypes of all elements of iterable `A`

- `elscitype(A)` - return the "element scitype" of array `A`

Query the doc-strings for details.


#### More on the `Table` type

An object of scitype `Table{K}` is expected to have a notion of
"columns", which are `AbstractVector`s, and the intention of the type
parameter `K` is to encode the scientific type(s) of its
columns. Specifically, developers are requested to adhere to the
following:

**Tabular data convention.** If `scitype(X) <: Table`, then in fact

```julia
function __init__()
ScientificTypes.set_convention(MLJ())
ScientificTypes.TRAIT_FUNCTION_GIVEN_NAME[:table] = Tables.istable
end
scitype(X) == Table{Union{scitype(c1), ..., scitype(cn)}}
```

where `c1`, `c2`, ..., `cn` are the columns of `X`. With this
definition, common type checks can be performed with tables. For
instance, you could check that each column of `X` has an element
scitype that is either `Continuous` or `Finite`:

```@example 5
scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
```

### Adding scientific types
A built-in `Table` constructor provides a shorthand for the right-hand side:

```@example 5
scitype(X) <: Table(Continuous, Finite)
```

Note that `Table(Continuous,Finite)` is a *type* union and not a `Table` *instance*.
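The subtype checks above rely only on Julia's type system, so they can be tried with stand-in types. In this sketch the structs are simplified assumptions, and `toy_table` is a hypothetical helper that mirrors the shorthand described above rather than the package's own `Table` constructor.

```julia
# Stand-in types (assumption: simplified versions of the package's types).
abstract type Known end
struct Continuous <: Known end
struct Count <: Known end
abstract type Finite{N} <: Known end
struct Multiclass{N} <: Finite{N} end
struct Table{K} <: Known end

# Hypothetical shorthand: builds a type (a UnionAll), not an instance.
toy_table(Ts...) = Table{<:Union{map(T -> AbstractVector{<:T}, Ts)...}}

# Scitypes that two tables might have under the tabular data convention.
S1 = Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{3}}}}
S2 = Table{Union{AbstractVector{Continuous}, AbstractVector{Count}}}

S1 <: toy_table(Continuous, Finite)   # true
S2 <: toy_table(Continuous, Finite)   # false: a `Count` column is not permitted
```

The check succeeds exactly when every column scitype in the union is a subtype of one of the permitted vector types.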


## Defining a new convention

You may want to extend the type hierarchy defined above. This is done as usual
with something like
If you want to implement your own convention, you can use
[MLJScientificTypes.jl](https://github.com/alan-turing-institute/MLJScientificTypes.jl)
as a blueprint.

The steps for defining such a convention are:

* declare a new convention,
* add explicit `scitype` (and `Scitype`) definitions,
* register any traits that were needed to define scitypes,
* optionally define `coerce` methods for your convention.

Each step is explained below, taking the MLJ convention as an example.

### Naming the convention

In the module, define a convention type:

```julia
struct MyNewType{P} <: Known end
struct MyConvention <: ScientificTypes.Convention end
```

Recall that Scientific Types are only used for dispatching and so should not
have fields.
and add an init function with:

### Adding explicit `scitype` and `Scitype` definitions
```julia
function __init__()
ScientificTypes.set_convention(MyConvention())
end
```

The `scitype` functions indicate default mappings from *machine type* to a
*scientific type*. For instance in the MLJ convention:
### Adding explicit `scitype` declarations

When overloading `scitype` one needs to dispatch over the convention,
as in this example:

```julia
ScientificType.scitype(::Integer, ::MLJ) = Count
ScientificTypes.scitype(::Integer, ::MLJ) = Count
```

where `::MLJ` refers to the convention.
In some cases, however, the scientific type to be attributed to an
object might depend on the evaluation of a boolean-valued trait
function. There is a mechanism for "registering" such traits to
streamline trait-based dispatch of the `scitype` method. This is best
illustrated with an example.

In the MLJ convention, all containers that meet the
[`Tables.jl`](https://github.com/JuliaData/Tables.jl) interface are
deemed to have scitype `Table`. These are detected using the Tables.jl
trait `istable`. Our first step is to choose a name for the trait, in
this case `:table`. Our `scitype` declaration then reads:

```
function ScientificTypes.scitype(X, ::MLJ, ::Val{:table})
K = <some type depending on columns of X>
return Table{K}
end
```

The `Scitype` functions will typically match a few of your `scitype` functions
to automatically obtain the scientific type of arrays of a type.
For instance in the MLJ convention:
For this to work we now need to register the trait, which means adding
to the `TRAIT_FUNCTION_GIVEN_NAME` dictionary, which should be
performed within the init function of the defining package:

```julia
ST.Scitype(::Type{<:Integer}, ::MLJ) = Count
function __init__()
ScientificTypes.set_convention(MLJ())
ScientificTypes.TRAIT_FUNCTION_GIVEN_NAME[:table] = Tables.istable
end
```

meaning that the scitype of an array such as `[1,2,3]` will directly be
inferred as an array of `Count`.
**Important limitation.** One may not add a trait function to
the `TRAIT_FUNCTION_GIVEN_NAME` dictionary if it holds `true` on some
object `X` for which an existing trait already holds true.
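The registration mechanism can be imitated in a few lines of plain Julia. Everything here is a hypothetical sketch: `TOY_TRAITS` stands in for `TRAIT_FUNCTION_GIVEN_NAME`, `toy_trait` for `trait`, and a named tuple of vectors stands in for a Tables.jl table. The convention argument is omitted for brevity.

```julia
# Stand-in types (assumption: simplified versions of the package's types).
struct Unknown end
struct Continuous end
struct Table{K} end

# Hypothetical stand-in for `TRAIT_FUNCTION_GIVEN_NAME`.
const TOY_TRAITS = Dict{Symbol,Function}()

# Return the name of a registered trait that holds for `X`, else `:other`.
function toy_trait(X)
    for (name, f) in TOY_TRAITS
        f(X) && return name
    end
    return :other
end

# The one-argument entry point dispatches on the trait name via `Val`.
toy_scitype(X) = toy_scitype(X, Val(toy_trait(X)))
toy_scitype(X, ::Val{:other}) = Unknown

# Toy column rule: anything that is not a float vector is left `Unknown`.
column_scitype(::AbstractVector{<:AbstractFloat}) = AbstractVector{Continuous}
column_scitype(::AbstractVector) = AbstractVector{Unknown}

toy_scitype(X, ::Val{:table}) = Table{Union{map(column_scitype, values(X))...}}

# Register the trait (assumption: the MLJ convention uses `Tables.istable` here).
is_toy_table(X) = X isa NamedTuple && all(c -> c isa AbstractVector, values(X))
TOY_TRAITS[:table] = is_toy_table

toy_scitype((x = [1.0, 2.0],)) == Table{AbstractVector{Continuous}}   # true
```

In a lookup written this way, iteration order over a `Dict` is not guaranteed, so two trait functions that both hold for the same object would make the returned name arbitrary. That is one plausible reading of why overlapping traits are disallowed.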


### Defining a `coerce` function

It may be very useful to define a function allowing you to convert an object
with one scitype to another scitype. In the MLJ convention, this is assumed by
the `coerce` function.
It may be very useful to define a function to coerce machine types so
as to correct an unintended scientific interpretation, according to a
given convention. In the `MLJ` convention, this is implemented by
defining `coerce` methods (no stub is provided by `ScientificTypes`).

For instance, consider the simplified definition:

@@ -153,11 +300,13 @@ function coerce(y::AbstractArray{T}, T2::Type{<:Union{Missing,Continuous}}
end
```

This maps an array of Real to an array of `AbstractFloat` (which are mapped to
`Continuous` in the MLJ convention).
Under this definition, `coerce([1, 2, 4], Continuous)` is mapped to
`[1.0, 2.0, 4.0]`, which has scitype `AbstractVector{Continuous}`.
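The behaviour just described can be reproduced with a one-line stand-in. This is a hypothetical sketch, not the `coerce` method from MLJScientificTypes: `toy_coerce` is a made-up name, `Continuous` is redefined locally, and missing values are not handled.

```julia
# Stand-in type (assumption: simplified version of the package's `Continuous`).
struct Continuous end

# Hypothetical simplified coercion: arrays of reals become floating point.
toy_coerce(y::AbstractArray{<:Real}, ::Type{Continuous}) = float(y)

toy_coerce([1, 2, 4], Continuous) == [1.0, 2.0, 4.0]   # true
```

The result has element type `Float64`, which a convention like the one quoted above would then interpret as `Continuous`.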

In the case of tabular data, one might additionally define `coerce`
methods to selectively coerce data in specified columns. See
[MLJScientificTypes](https://github.com/alan-turing-institute/MLJScientificTypes.jl)
for examples.



Further, if you work with specific containers, you may want to define a
`coerce` function that works on the container by applying `coerce` on each
of the features. In the MLJ convention, we work with tabular objects and
define a `coerce` function which applies specific coercion on each of the
columns.
2 changes: 1 addition & 1 deletion src/ScientificTypes.jl
@@ -93,7 +93,7 @@ const TRAIT_FUNCTION_GIVEN_NAME = Dict{Symbol,Function}()
"""
trait(X)
Check `X` against traits specified in `TRAIT_FUNCTION_GIVEN_NAME` and returns
Check `X` against traits specified in `TRAIT_FUNCTION_GIVEN_NAME` and return
a symbol corresponding to the matching trait, or `:other` if `X` didn't match
any of the trait functions.
"""