make release-tag: Merge branch 'master' into stable
csala committed Aug 15, 2020
2 parents 61871c4 + 4d989b5 commit c49f7b6
Showing 28 changed files with 2,410 additions and 1,321 deletions.
12 changes: 12 additions & 0 deletions HISTORY.md
@@ -1,5 +1,17 @@
# History

## 0.1.1 (2020-08-15)

This release includes a few new features to make DeepEcho work on more types of datasets,
as well as to make it easier to add new datasets to the benchmarking framework
(see the usage sketch after this list).

* Add `segment_size` and `sequence_index` arguments to `fit` method.
* Add `sequence_length` as an optional argument to `sample` and `sample_sequence` methods.
* Update the Dataset storage format to add `sequence_index` and versioning.
* Separate the sequence assembling process in its own `deepecho.sequences` module.
* Add function `make_dataset` to create a dataset from a dataframe and just a few column names.
* Add notebook tutorial to show how to create datasets and use them.
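
A minimal usage sketch of the features listed above. The keyword names come straight from these release notes, but the toy dataframe, the column names, and the `PARModel` settings are illustrative assumptions, and the exact signatures may differ.

```python
import pandas as pd

from deepecho import PARModel  # top-level export assumed

# Toy data: an entity column, a context column, a sequence index
# and one data column. Entirely illustrative.
data = pd.DataFrame({
    'entity': [1, 1, 1, 2, 2, 2],
    'context': ['a', 'a', 'a', 'b', 'b', 'b'],
    'timestamp': [0, 1, 2, 0, 1, 2],
    'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

model = PARModel(epochs=128)
model.fit(
    data,
    entity_columns=['entity'],
    context_columns=['context'],
    sequence_index='timestamp',  # new `sequence_index` argument
    segment_size=2,              # new `segment_size` argument
)

# `sequence_length` is the new optional argument to `sample`.
sampled = model.sample(num_entities=5, sequence_length=10)
```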

## 0.1.0 (2020-08-11)

First release.
18 changes: 9 additions & 9 deletions README.md
@@ -23,15 +23,15 @@

# Overview

**DeepEcho** is a Python library that implements generative models for **mixed-type**,
**multivariate** time series.

1. Provide multiple models both, from classical **statistical** modeling of time series to the
latest in **Deep Learning** based models.
2. Provide a robust **benchmarking framework** for evaluating these methods under a set of
multiple metrics.
3. Provide ability for a **Machine Learning researchers** to submit a new method following our
model and sample API and get evaluated.
**DeepEcho** is a **Synthetic Data Generation** Python library for **mixed-type**, **multivariate
time series**. It provides:

1. Multiple models based both on **classical statistical modeling** of time series and the latest
in **Deep Learning** techniques.
2. A robust [benchmarking framework](benchmark) for evaluating these methods on multiple datasets
and with multiple metrics.
3. Ability for **Machine Learning researchers** to submit new methods following our `model` and
`sample` API and get evaluated.

## Try it out now!

73 changes: 38 additions & 35 deletions benchmark/README.md
@@ -18,41 +18,44 @@
Most notably, many datasets from this collection are Time Series Classification datasets
downloaded from the [timeseriesclassification.com](http://www.timeseriesclassification.com/)
website.

This is the complete list of datasets and their characteristics:

| dataset | size | entities | entity_columns | context_columns | data_columns | max_sequence_len | min_sequence_len |
|---------------------------|-----------|------------|------------------|-------------------|----------------|--------------------|--------------------|
| Libras | 108.74 KB | 360 | 1 | 1 | 4 | 45 | 45 |
| AtrialFibrillation | 111.02 KB | 30 | 1 | 1 | 4 | 640 | 640 |
| BasicMotions | 196.06 KB | 80 | 1 | 1 | 8 | 100 | 100 |
| ERing | 223.5 KB | 300 | 1 | 1 | 6 | 65 | 65 |
| RacketSports | 235.39 KB | 303 | 1 | 1 | 8 | 30 | 30 |
| Epilepsy | 439.75 KB | 275 | 1 | 1 | 5 | 206 | 206 |
| PenDigits | 441.87 KB | 10992 | 1 | 1 | 4 | 8 | 8 |
| JapaneseVowels | 475.01 KB | 640 | 1 | 1 | 14 | 29 | 7 |
| StandWalkJump | 504.3 KB | 27 | 1 | 1 | 6 | 2500 | 2500 |
| FingerMovements | 764.23 KB | 416 | 1 | 1 | 30 | 50 | 50 |
| EchoNASDAQ | 968.61 KB | 19 | 1 | 2 | 8 | 9401 | 82 |
| Handwriting | 1.38 MB | 1000 | 1 | 1 | 5 | 152 | 152 |
| UWaveGestureLibrary | 1.46 MB | 440 | 1 | 1 | 5 | 315 | 315 |
| NATOPS | 1.78 MB | 360 | 1 | 1 | 26 | 51 | 51 |
| ArticularyWordRecognition | 1.93 MB | 575 | 1 | 1 | 11 | 144 | 144 |
| Cricket | 3.13 MB | 180 | 1 | 1 | 8 | 1197 | 1197 |
| SelfRegulationSCP2 | 3.84 MB | 380 | 1 | 1 | 9 | 1152 | 1152 |
| LSST | 4.2 MB | 4925 | 1 | 1 | 8 | 36 | 36 |
| SelfRegulationSCP1 | 4.34 MB | 561 | 1 | 1 | 8 | 896 | 896 |
| CharacterTrajectories | 4.97 MB | 2858 | 1 | 1 | 5 | 182 | 60 |
| HandMovementDirection | 5.24 MB | 234 | 1 | 1 | 12 | 400 | 400 |
| EthanolConcentration | 10.75 MB | 524 | 1 | 1 | 5 | 1751 | 1751 |
| SpokenArabicDigits | 15.81 MB | 8798 | 1 | 1 | 15 | 93 | 4 |
| Heartbeat | 28.25 MB | 409 | 1 | 1 | 63 | 405 | 405 |
| PhonemeSpectra | 50.42 MB | 6668 | 1 | 1 | 13 | 217 | 217 |
| MotorImagery | 70.96 MB | 378 | 1 | 1 | 66 | 3000 | 3000 |
| DuckDuckGeese | 104.82 MB | 100 | 1 | 1 | 1347 | 270 | 270 |
| PEMS-SF | 110.03 MB | 440 | 1 | 1 | 965 | 144 | 144 |
| EigenWorms | 128.72 MB | 259 | 1 | 1 | 8 | 17984 | 17984 |
| InsectWingbeat | 195.23 MB | 50000 | 1 | 1 | 202 | 22 | 2 |
| FaceDetection | 331.16 MB | 9414 | 1 | 1 | 146 | 62 | 62 |
This is the complete list of available datasets and some of their characteristics:

| dataset | size | entities | data_columns | max_sequence_len |
|---------------------------|-----------|------------|---------------|--------------------|
| Libras | 108.74 KB | 360 | 4 | 45 |
| AtrialFibrillation | 111.02 KB | 30 | 4 | 640 |
| BasicMotions | 196.06 KB | 80 | 8 | 100 |
| ERing | 223.5 KB | 300 | 6 | 65 |
| RacketSports | 235.39 KB | 303 | 8 | 30 |
| Epilepsy | 439.75 KB | 275 | 5 | 206 |
| PenDigits | 441.87 KB | 10992 | 4 | 8 |
| JapaneseVowels | 475.01 KB | 640 | 14 | 29 |
| StandWalkJump | 504.3 KB | 27 | 6 | 2500 |
| FingerMovements | 764.23 KB | 416 | 30 | 50 |
| EchoNASDAQ | 968.61 KB | 19 | 8 | 9401 |
| Handwriting | 1.38 MB | 1000 | 5 | 152 |
| UWaveGestureLibrary | 1.46 MB | 440 | 5 | 315 |
| NATOPS | 1.78 MB | 360 | 26 | 51 |
| ArticularyWordRecognition | 1.93 MB | 575 | 11 | 144 |
| Cricket | 3.13 MB | 180 | 8 | 1197 |
| SelfRegulationSCP2 | 3.84 MB | 380 | 9 | 1152 |
| LSST | 4.2 MB | 4925 | 8 | 36 |
| SelfRegulationSCP1 | 4.34 MB | 561 | 8 | 896 |
| CharacterTrajectories | 4.97 MB | 2858 | 5 | 182 |
| HandMovementDirection | 5.24 MB | 234 | 12 | 400 |
| EthanolConcentration | 10.75 MB | 524 | 5 | 1751 |
| SpokenArabicDigits | 15.81 MB | 8798 | 15 | 93 |
| Heartbeat | 28.25 MB | 409 | 63 | 405 |
| PhonemeSpectra | 50.42 MB | 6668 | 13 | 217 |
| MotorImagery | 70.96 MB | 378 | 66 | 3000 |
| DuckDuckGeese | 104.82 MB | 100 | 1347 | 270 |
| PEMS-SF | 110.03 MB | 440 | 965 | 144 |
| EigenWorms | 128.72 MB | 259 | 8 | 17984 |
| InsectWingbeat | 195.23 MB | 50000 | 202 | 22 |
| FaceDetection | 331.16 MB | 9414 | 146 | 62 |

For further details about the format in which these datasets are stored, as well as how to
create your own, please [follow this tutorial](../tutorials/02_DeepEcho_Benchmark_Datasets.ipynb)
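
For a quick look at this list from Python, the benchmark package exposes a `get_datasets_list` function (the CLI code later in this commit imports it). A hedged sketch, assuming it returns a `pandas.DataFrame` whose columns match the table above plus the `size_in_kb` column that the CLI formats:

```python
from deepecho.benchmark import get_datasets_list

# Column names are assumptions based on the table above and on the
# CLI code in this same commit, which formats `size_in_kb`.
datasets = get_datasets_list()
print(datasets[['dataset', 'size_in_kb', 'entities', 'data_columns']].head())
```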

### Modeling and Sampling process

2 changes: 1 addition & 1 deletion benchmark/deepecho/__init__.py
@@ -2,7 +2,7 @@

__author__ = 'MIT Data To AI Lab'
__email__ = 'dailabmit@gmail.com'
__version__ = '0.1.0'
__version__ = '0.1.1.dev1'
__path__ = __import__('pkgutil').extend_path(__path__, __name__)

from deepecho.base import DeepEcho
9 changes: 6 additions & 3 deletions benchmark/deepecho/benchmark/__init__.py
@@ -18,7 +18,7 @@


DEFAULT_MODELS = {
    'PARModel': (PARModel, {'epochs': 256, 'cuda': True})
    'PARModel': (PARModel, {'epochs': 1024, 'cuda': True})
}


@@ -105,7 +105,7 @@ def _draw_stop(self, **kwargs):


def run_benchmark(models=None, datasets=None, metrics=None, max_entities=None,
                  distributed=False, output_path=None):
                  segment_size=None, distributed=False, output_path=None):
"""Score the indicated models on the indicated datasets.
Args:
@@ -128,6 +128,9 @@ def run_benchmark(models=None, datasets=None, metrics=None, max_entities=None,
        max_entities (int):
            Max number of entities to load per dataset.
            Defaults to ``None``.
        segment_size (int):
            If specified, cut each training sequence in several segments of the
            indicated size.
        distributed (bool):
            Whether to use dask for distributed computing.
            Defaults to ``False``.
@@ -152,7 +155,7 @@ def run_benchmark(models=None, datasets=None, metrics=None, max_entities=None,
    delayed = []
    for name, model in models.items():
        result = evaluate_model_on_datasets(
            name, model, datasets, metrics, max_entities, distributed)
            name, model, datasets, metrics, max_entities, segment_size, distributed)
        delayed.extend(result)

    if distributed:
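
A hedged sketch of calling `run_benchmark` with the new `segment_size` argument. The signature and the metric names come from this diff, and the models dictionary mirrors the `DEFAULT_MODELS` structure shown above; the `PARModel` import path and the `output_path` value are assumptions.

```python
from deepecho import PARModel  # import path assumed
from deepecho.benchmark import run_benchmark

# Same structure as DEFAULT_MODELS: model name -> (model class, init kwargs).
models = {
    'PARModel': (PARModel, {'epochs': 1024, 'cuda': True}),
}

results = run_benchmark(
    models=models,
    datasets=['Libras', 'BasicMotions'],  # names from the datasets table
    metrics=['sdmetrics'],
    max_entities=100,
    segment_size=100,  # new: cut each training sequence into segments of 100 rows
    output_path='results.csv',
)
print(results)
```
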
32 changes: 27 additions & 5 deletions benchmark/deepecho/benchmark/__main__.py
@@ -3,13 +3,14 @@
"""DeepEcho Command Line Interface module."""

import argparse
import copy
import logging
import sys

import humanfriendly
import tabulate

from deepecho.benchmark import get_datasets_list, run_benchmark
from deepecho.benchmark import DEFAULT_MODELS, get_datasets_list, run_benchmark


def _logging_setup(verbosity):
@@ -37,12 +38,27 @@ def _run(args):

        Client(LocalCluster(n_workers=args.workers, threads_per_worker=args.threads))

    if args.epochs is not None:
        models = args.models
        if models is None:
            models = DEFAULT_MODELS.keys()

        args.models = {}
        for model_name in models:
            model = copy.deepcopy(DEFAULT_MODELS[model_name])
            model_kwargs = model[1]
            if 'epochs' in model_kwargs:
                model_kwargs['epochs'] = args.epochs

            args.models[model_name] = model

    # run
    results = run_benchmark(
        args.models,
        args.datasets,
        args.metrics,
        args.max_entities,
        args.segment_size,
        args.distributed,
        args.output_path,
    )
@@ -57,8 +73,9 @@

def _datasets_list(args):
    _logging_setup(args.verbose)
    datasets = get_datasets_list(args.extended)
    datasets['size'] = datasets['size'].apply(humanfriendly.format_size)
    datasets = get_datasets_list()
    datasets['size_in_kb'] = datasets['size_in_kb'].apply(humanfriendly.format_size)
    datasets = datasets.rename({'size_in_kb': 'size'})

    print('Available DeepEcho Datasets:')
    print(tabulate.tabulate(
@@ -80,8 +97,6 @@ def _get_parser():
        'datasets-list', help='Get the list of available DeepEcho Datasets')
    datasets_list.set_defaults(action=_datasets_list)
    datasets_list.set_defaults(user=None)
    datasets_list.add_argument('-e', '--extended', action='store_true',
                               help='Add dataset details (Slow).')
    datasets_list.add_argument('-v', '--verbose', action='count', default=0,
                               help='Be verbose. Use -vv for increased verbosity.')

@@ -100,6 +115,13 @@ def _get_parser():
                     help='Datasets/s to be used. Accepts multiple names.')
    run.add_argument('-M', '--max-entities', type=int,
                     help='Maximum number of entities to load per dataset.')
    run.add_argument('-S', '--segment-size', type=int,
                     help=(
                         'If specified, cut each training sequence in several segments '
                         'of the indicated size.'
                     ))
    run.add_argument('-E', '--epochs', type=int,
                     help='Number of epochs to be performed by the models.')
    run.add_argument('-s', '--metrics', nargs='+',
                     choices=['sdmetrics', 'classification', 'rf_detection', 'lstm_detection'],
                     help='Metric/s to use. Accepts multiple names.')
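
Putting the new flags together, the benchmark CLI might be invoked as below. Only flags visible in this diff are used; the `python -m deepecho.benchmark` entry point is an assumption based on this being the package's `__main__` module.

```bash
# List the available datasets (sizes formatted from the size_in_kb column).
python -m deepecho.benchmark datasets-list -v

# Run the default models (DEFAULT_MODELS) for 128 epochs (-E), cutting each
# training sequence into segments of 100 rows (-S), loading at most 500
# entities per dataset (-M) and scoring with the sdmetrics metric (-s).
python -m deepecho.benchmark run -E 128 -S 100 -M 500 -s sdmetrics
```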