make release-tag: Merge branch 'master' into stable
csala committed Aug 15, 2020
2 parents 61871c4 + 4d989b5 commit c49f7b6
Showing 28 changed files with 2,410 additions and 1,321 deletions.
12 changes: 12 additions & 0 deletions HISTORY.md
@@ -1,5 +1,17 @@
# History

## 0.1.1 (2020-08-15)

This release includes a few new features to make DeepEcho work on more types of datasets,
as well as to make it easier to add new datasets to the benchmarking framework
(see the usage sketch after this list).

* Add `segment_size` and `sequence_index` arguments to `fit` method.
* Add `sequence_length` as an optional argument to `sample` and `sample_sequence` methods.
* Update the Dataset storage format to add `sequence_index` and versioning.
* Separate the sequence assembling process in its own `deepecho.sequences` module.
* Add function `make_dataset` to create a dataset from a dataframe and just a few column names.
* Add notebook tutorial to show how to create datasets and use them.
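
A minimal usage sketch of the features listed above. The keyword names come straight from these release notes, but the toy dataframe, the column names, and the `PARModel` settings are illustrative assumptions, and the exact signatures may differ.

```python
import pandas as pd

from deepecho import PARModel  # top-level export assumed

# Toy data: an entity column, a context column, a sequence index
# and one data column. Entirely illustrative.
data = pd.DataFrame({
    'entity': [1, 1, 1, 2, 2, 2],
    'context': ['a', 'a', 'a', 'b', 'b', 'b'],
    'timestamp': [0, 1, 2, 0, 1, 2],
    'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

model = PARModel(epochs=128)
model.fit(
    data,
    entity_columns=['entity'],
    context_columns=['context'],
    sequence_index='timestamp',  # new `sequence_index` argument
    segment_size=2,              # new `segment_size` argument
)

# `sequence_length` is the new optional argument to `sample`.
sampled = model.sample(num_entities=5, sequence_length=10)
```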

## 0.1.0 (2020-08-11)

First release.
18 changes: 9 additions & 9 deletions README.md
@@ -23,15 +23,15 @@

# Overview

**DeepEcho** is a Python library that implements generative models for **mixed-type**,
**multivariate** time series.

1. Provide multiple models both, from classical **statistical** modeling of time series to the
latest in **Deep Learning** based models.
2. Provide a robust **benchmarking framework** for evaluating these methods under a set of
multiple metrics.
3. Provide ability for a **Machine Learning researchers** to submit a new method following our
model and sample API and get evaluated.
**DeepEcho** is a **Synthetic Data Generation** Python library for **mixed-type**, **multivariate
time series**. It provides:

1. Multiple models based both on **classical statistical modeling** of time series and the latest
in **Deep Learning** techniques.
2. A robust [benchmarking framework](benchmark) for evaluating these methods on multiple datasets
and with multiple metrics.
3. Ability for **Machine Learning researchers** to submit new methods following our `model` and
`sample` API and get evaluated.

## Try it out now!

73 changes: 38 additions & 35 deletions benchmark/README.md
@@ -18,41 +18,44 @@
Most notably, many datasets from this collection are Time Series Classification datasets
downloaded from the [timeseriesclassification.com](http://www.timeseriesclassification.com/)
website.

This is the complete list of datasets and their characteristics:

| dataset | size | entities | entity_columns | context_columns | data_columns | max_sequence_len | min_sequence_len |
|---------------------------|-----------|------------|------------------|-------------------|----------------|--------------------|--------------------|
| Libras | 108.74 KB | 360 | 1 | 1 | 4 | 45 | 45 |
| AtrialFibrillation | 111.02 KB | 30 | 1 | 1 | 4 | 640 | 640 |
| BasicMotions | 196.06 KB | 80 | 1 | 1 | 8 | 100 | 100 |
| ERing | 223.5 KB | 300 | 1 | 1 | 6 | 65 | 65 |
| RacketSports | 235.39 KB | 303 | 1 | 1 | 8 | 30 | 30 |
| Epilepsy | 439.75 KB | 275 | 1 | 1 | 5 | 206 | 206 |
| PenDigits | 441.87 KB | 10992 | 1 | 1 | 4 | 8 | 8 |
| JapaneseVowels | 475.01 KB | 640 | 1 | 1 | 14 | 29 | 7 |
| StandWalkJump | 504.3 KB | 27 | 1 | 1 | 6 | 2500 | 2500 |
| FingerMovements | 764.23 KB | 416 | 1 | 1 | 30 | 50 | 50 |
| EchoNASDAQ | 968.61 KB | 19 | 1 | 2 | 8 | 9401 | 82 |
| Handwriting | 1.38 MB | 1000 | 1 | 1 | 5 | 152 | 152 |
| UWaveGestureLibrary | 1.46 MB | 440 | 1 | 1 | 5 | 315 | 315 |
| NATOPS | 1.78 MB | 360 | 1 | 1 | 26 | 51 | 51 |
| ArticularyWordRecognition | 1.93 MB | 575 | 1 | 1 | 11 | 144 | 144 |
| Cricket | 3.13 MB | 180 | 1 | 1 | 8 | 1197 | 1197 |
| SelfRegulationSCP2 | 3.84 MB | 380 | 1 | 1 | 9 | 1152 | 1152 |
| LSST | 4.2 MB | 4925 | 1 | 1 | 8 | 36 | 36 |
| SelfRegulationSCP1 | 4.34 MB | 561 | 1 | 1 | 8 | 896 | 896 |
| CharacterTrajectories | 4.97 MB | 2858 | 1 | 1 | 5 | 182 | 60 |
| HandMovementDirection | 5.24 MB | 234 | 1 | 1 | 12 | 400 | 400 |
| EthanolConcentration | 10.75 MB | 524 | 1 | 1 | 5 | 1751 | 1751 |
| SpokenArabicDigits | 15.81 MB | 8798 | 1 | 1 | 15 | 93 | 4 |
| Heartbeat | 28.25 MB | 409 | 1 | 1 | 63 | 405 | 405 |
| PhonemeSpectra | 50.42 MB | 6668 | 1 | 1 | 13 | 217 | 217 |
| MotorImagery | 70.96 MB | 378 | 1 | 1 | 66 | 3000 | 3000 |
| DuckDuckGeese | 104.82 MB | 100 | 1 | 1 | 1347 | 270 | 270 |
| PEMS-SF | 110.03 MB | 440 | 1 | 1 | 965 | 144 | 144 |
| EigenWorms | 128.72 MB | 259 | 1 | 1 | 8 | 17984 | 17984 |
| InsectWingbeat | 195.23 MB | 50000 | 1 | 1 | 202 | 22 | 2 |
| FaceDetection | 331.16 MB | 9414 | 1 | 1 | 146 | 62 | 62 |
This is the complete list of available datasets and some of their characteristics:

| dataset | size | entities | data_columns | max_sequence_len |
|---------------------------|-----------|------------|---------------|--------------------|
| Libras | 108.74 KB | 360 | 4 | 45 |
| AtrialFibrillation | 111.02 KB | 30 | 4 | 640 |
| BasicMotions | 196.06 KB | 80 | 8 | 100 |
| ERing | 223.5 KB | 300 | 6 | 65 |
| RacketSports | 235.39 KB | 303 | 8 | 30 |
| Epilepsy | 439.75 KB | 275 | 5 | 206 |
| PenDigits | 441.87 KB | 10992 | 4 | 8 |
| JapaneseVowels | 475.01 KB | 640 | 14 | 29 |
| StandWalkJump | 504.3 KB | 27 | 6 | 2500 |
| FingerMovements | 764.23 KB | 416 | 30 | 50 |
| EchoNASDAQ | 968.61 KB | 19 | 8 | 9401 |
| Handwriting | 1.38 MB | 1000 | 5 | 152 |
| UWaveGestureLibrary | 1.46 MB | 440 | 5 | 315 |
| NATOPS | 1.78 MB | 360 | 26 | 51 |
| ArticularyWordRecognition | 1.93 MB | 575 | 11 | 144 |
| Cricket | 3.13 MB | 180 | 8 | 1197 |
| SelfRegulationSCP2 | 3.84 MB | 380 | 9 | 1152 |
| LSST | 4.2 MB | 4925 | 8 | 36 |
| SelfRegulationSCP1 | 4.34 MB | 561 | 8 | 896 |
| CharacterTrajectories | 4.97 MB | 2858 | 5 | 182 |
| HandMovementDirection | 5.24 MB | 234 | 12 | 400 |
| EthanolConcentration | 10.75 MB | 524 | 5 | 1751 |
| SpokenArabicDigits | 15.81 MB | 8798 | 15 | 93 |
| Heartbeat | 28.25 MB | 409 | 63 | 405 |
| PhonemeSpectra | 50.42 MB | 6668 | 13 | 217 |
| MotorImagery | 70.96 MB | 378 | 66 | 3000 |
| DuckDuckGeese | 104.82 MB | 100 | 1347 | 270 |
| PEMS-SF | 110.03 MB | 440 | 965 | 144 |
| EigenWorms | 128.72 MB | 259 | 8 | 17984 |
| InsectWingbeat | 195.23 MB | 50000 | 202 | 22 |
| FaceDetection | 331.16 MB | 9414 | 146 | 62 |

For further details about the format in which these datasets are stored, as well as how to
create your own, please [follow this tutorial](../tutorials/02_DeepEcho_Benchmark_Datasets.ipynb)
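
For a quick look at this list from Python, the benchmark package exposes a `get_datasets_list` function (the CLI code later in this commit imports it). A hedged sketch, assuming it returns a `pandas.DataFrame` whose columns match the table above plus the `size_in_kb` column that the CLI formats:

```python
from deepecho.benchmark import get_datasets_list

# Column names are assumptions based on the table above and on the
# CLI code in this same commit, which formats `size_in_kb`.
datasets = get_datasets_list()
print(datasets[['dataset', 'size_in_kb', 'entities', 'data_columns']].head())
```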

### Modeling and Sampling process

2 changes: 1 addition & 1 deletion benchmark/deepecho/__init__.py
@@ -2,7 +2,7 @@

__author__ = 'MIT Data To AI Lab'
__email__ = 'dailabmit@gmail.com'
__version__ = '0.1.0'
__version__ = '0.1.1.dev1'
__path__ = __import__('pkgutil').extend_path(__path__, __name__)

from deepecho.base import DeepEcho
9 changes: 6 additions & 3 deletions benchmark/deepecho/benchmark/__init__.py
@@ -18,7 +18,7 @@


DEFAULT_MODELS = {
    'PARModel': (PARModel, {'epochs': 256, 'cuda': True})
    'PARModel': (PARModel, {'epochs': 1024, 'cuda': True})
}


@@ -105,7 +105,7 @@ def _draw_stop(self, **kwargs):


def run_benchmark(models=None, datasets=None, metrics=None, max_entities=None,
                  distributed=False, output_path=None):
                  segment_size=None, distributed=False, output_path=None):
"""Score the indicated models on the indicated datasets.
Args:
@@ -128,6 +128,9 @@ def run_benchmark(models=None, datasets=None, metrics=None, max_entities=None,
        max_entities (int):
            Max number of entities to load per dataset.
            Defaults to ``None``.
        segment_size (int):
            If specified, cut each training sequence in several segments of the
            indicated size.
        distributed (bool):
            Whether to use dask for distributed computing.
            Defaults to ``False``.
@@ -152,7 +155,7 @@ def run_benchmark(models=None, datasets=None, metrics=None, max_entities=None,
    delayed = []
    for name, model in models.items():
        result = evaluate_model_on_datasets(
            name, model, datasets, metrics, max_entities, distributed)
            name, model, datasets, metrics, max_entities, segment_size, distributed)
        delayed.extend(result)

    if distributed:
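
A hedged sketch of calling `run_benchmark` with the new `segment_size` argument. The signature and the metric names come from this diff, and the models dictionary mirrors the `DEFAULT_MODELS` structure shown above; the `PARModel` import path and the `output_path` value are assumptions.

```python
from deepecho import PARModel  # import path assumed
from deepecho.benchmark import run_benchmark

# Same structure as DEFAULT_MODELS: model name -> (model class, init kwargs).
models = {
    'PARModel': (PARModel, {'epochs': 1024, 'cuda': True}),
}

results = run_benchmark(
    models=models,
    datasets=['Libras', 'BasicMotions'],  # names from the datasets table
    metrics=['sdmetrics'],
    max_entities=100,
    segment_size=100,  # new: cut each training sequence into segments of 100 rows
    output_path='results.csv',
)
print(results)
```
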
32 changes: 27 additions & 5 deletions benchmark/deepecho/benchmark/__main__.py
@@ -3,13 +3,14 @@
"""DeepEcho Command Line Interface module."""

import argparse
import copy
import logging
import sys

import humanfriendly
import tabulate

from deepecho.benchmark import get_datasets_list, run_benchmark
from deepecho.benchmark import DEFAULT_MODELS, get_datasets_list, run_benchmark


def _logging_setup(verbosity):
@@ -37,12 +38,27 @@ def _run(args):

        Client(LocalCluster(n_workers=args.workers, threads_per_worker=args.threads))

    if args.epochs is not None:
        models = args.models
        if models is None:
            models = DEFAULT_MODELS.keys()

        args.models = {}
        for model_name in models:
            model = copy.deepcopy(DEFAULT_MODELS[model_name])
            model_kwargs = model[1]
            if 'epochs' in model_kwargs:
                model_kwargs['epochs'] = args.epochs

            args.models[model_name] = model

    # run
    results = run_benchmark(
        args.models,
        args.datasets,
        args.metrics,
        args.max_entities,
        args.segment_size,
        args.distributed,
        args.output_path,
    )
@@ -57,8 +73,9 @@

def _datasets_list(args):
    _logging_setup(args.verbose)
    datasets = get_datasets_list(args.extended)
    datasets['size'] = datasets['size'].apply(humanfriendly.format_size)
    datasets = get_datasets_list()
    datasets['size_in_kb'] = datasets['size_in_kb'].apply(humanfriendly.format_size)
    datasets = datasets.rename({'size_in_kb': 'size'})

    print('Available DeepEcho Datasets:')
    print(tabulate.tabulate(
@@ -80,8 +97,6 @@ def _get_parser():
        'datasets-list', help='Get the list of available DeepEcho Datasets')
    datasets_list.set_defaults(action=_datasets_list)
    datasets_list.set_defaults(user=None)
    datasets_list.add_argument('-e', '--extended', action='store_true',
                               help='Add dataset details (Slow).')
    datasets_list.add_argument('-v', '--verbose', action='count', default=0,
                               help='Be verbose. Use -vv for increased verbosity.')

@@ -100,6 +115,13 @@ def _get_parser():
                     help='Datasets/s to be used. Accepts multiple names.')
    run.add_argument('-M', '--max-entities', type=int,
                     help='Maximum number of entities to load per dataset.')
    run.add_argument('-S', '--segment-size', type=int,
                     help=(
                         'If specified, cut each training sequence in several segments '
                         'of the indicated size.'
                     ))
    run.add_argument('-E', '--epochs', type=int,
                     help='Number of epochs to be performed by the models.')
    run.add_argument('-s', '--metrics', nargs='+',
                     choices=['sdmetrics', 'classification', 'rf_detection', 'lstm_detection'],
                     help='Metric/s to use. Accepts multiple names.')
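
Putting the new flags together, the benchmark CLI might be invoked as below. Only flags visible in this diff are used; the `python -m deepecho.benchmark` entry point is an assumption based on this being the package's `__main__` module.

```bash
# List the available datasets (sizes formatted from the size_in_kb column).
python -m deepecho.benchmark datasets-list -v

# Run the default models (DEFAULT_MODELS) for 128 epochs (-E), cutting each
# training sequence into segments of 100 rows (-S), loading at most 500
# entities per dataset (-M) and scoring with the sdmetrics metric (-s).
python -m deepecho.benchmark run -E 128 -S 100 -M 500 -s sdmetrics
```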