- Tristan - Determine what kind of machine learning method to use (supervised, reinforcement, unsupervised, transfer, etc.)
- Tristan - Investigate how to partition data into training, validation and test subsets.
- Tristan - Run the clustering algorithm using the different data subset partitions.
- Tristan - Select a cross-validation method to obtain a more reliable estimate of each model's performance.
- Tristan - Interpret the clustering results and make adjustments if needed.
- Ryan - Determine how far the vector space can be reduced while losing no more than a minimal amount of the information needed for classification.
- Ryan - Decide on a dimension reduction method and document the rationale behind the choice.
- Ryan - Implement the dimension reduction.
- Ryan - Compare application results when reducing to different numbers of dimensions.
- Ryan - Document whether the optimal number of dimensions is relatively consistent across different document embedding techniques.
- Ryan - Design high-quality, informative visualizations of the reduced 2D vector space.
- Ryan - (Potential) Employ a different method or tool to reduce dimensions and report whether the results remain consistent or differ.
- Yahya - Break document into word pairs.
- Yahya - Break document into sentences.
- Yahya - Filter data to keep relevant word tokens.
- Yahya - Embed tokens into high-dimensional semantic vectors.
- Yahya - Compare embedding methods and optimize using the best method.
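
As a rough illustration of the tokenization and data partitioning tasks above, here is a minimal Python sketch using only the standard library. Everything named in it is an assumption made for illustration: the `STOP_WORDS` set, the regex-based sentence splitter, the 70/15/15 split ratios, and all function names are hypothetical and not choices the team has made. Embedding, dimension reduction, and clustering are left out because the tasks do not yet name a method for them.

```python
import random
import re

# Hypothetical stop-word list (assumption); a real run would likely use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on"}


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", sentence.lower())


def filter_tokens(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def word_pairs(tokens):
    """Return adjacent word pairs from a token list."""
    return list(zip(tokens, tokens[1:]))


def partition(items, train=0.7, validation=0.15, seed=0):
    """Shuffle items and split them into training, validation and test subsets.

    The test subset receives whatever remains after the first two subsets.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

For example, `word_pairs(filter_tokens(tokenize("The cat sat.")))` yields the single pair `("cat", "sat")`, and `partition(range(20))` yields subsets of 14, 3 and 3 items. Running the clustering step on several seeds of `partition` is one simple way to try "different data subset partitions" before settling on a cross-validation scheme.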