forked from cemoody/lda2vec
TODO
Add tests for target
Add tests for global targets
Add examples of specific documents to 20ng example
Add better naming to categorical variables, e.g. like target variables
Keep track of doc counts between model serializations
Add bigramming
Add better README
Add an example script with HN with doc id, client id, and predicted score
Add super simple explanatory models
Remove spacy dep
Change EmbedMixture naming to possible values and n latent factors
Print out topics while training
Add doctests to lda2vec main classes
Randomize chunking order on fit
Add loss tracking and reporting classes to code
Finish filling out docstrings
Add multiple targets for one component
Add convergence criterion
Add docs on:
Installation
HN Tutorial
Parse document into vector
Setup LDA for document
Measure perplexity
Visualize topics
Add supervised component
Measure perplexity
Visualize topics
Add another component for time
Measure perplexity
Visualize topics
Visualize topics, changing temperature
Data formats
Loose
Compact
Flat
Contexts
Categorical contexts
Other contexts TBA
Targets
RMSE
Logistic
Softmax
Advanced
Options
GPU
Gradient Clipping
Online learning, fraction argument
Logging progress
Perplexity
Model saving, prediction
Dropout fractions
Nomenclature
Categorical Feature
Each category in set has n_possible_values
Each feature has n_latent_factors
Each feature has a single target
Components
Each component defines the total number of documents and the number of topics
Each component may also have supervised targets
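The nomenclature above could be captured in a small container like the one below. This is only a sketch: the class name CategoricalFeature, the mixture_shape helper, and the example values are hypothetical and are not part of the lda2vec API. It also assumes the single target is optional.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CategoricalFeature:
    """Hypothetical container mirroring the nomenclature notes (not lda2vec API)."""
    name: str
    n_possible_values: int        # size of the category set, e.g. number of documents
    n_latent_factors: int         # number of latent factors, e.g. number of topics
    target: Optional[str] = None  # assumption: at most one target, e.g. "logistic"

    def mixture_shape(self) -> Tuple[int, int]:
        # One weight per (possible value, latent factor) pair
        return (self.n_possible_values, self.n_latent_factors)


# Illustrative values only: 1000 documents mixed over 20 topics, no target
doc_feature = CategoricalFeature("document_id", n_possible_values=1000,
                                 n_latent_factors=20)
```

Naming the two sizes explicitly keeps "how many categories exist" separate from "how many factors each category mixes over", which is the distinction the EmbedMixture renaming item asks for.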
Done:
Add BoW mode
Add logger
Add fake data generator
Add perplexity measurements
Add tracking utility
Add utilities for converting corpora
Add license
Add masks / skips / pads
Add reindexing on the fly
Convert docstrings to numpy format
Implement corpus conversion from loose to dense and vice versa
Add fit function for all data at once
Add CI & coverage & license icons
Add readthedocs support
Add examples to CI
Add dropout
Change component naming to 'categorical feature'
Add linear layers between input latent and output context
Merge skipgram branch
Add topic numbers to topic print out
Try higher importance to the prior
Change prob model to just model prob of word in topic
Add word dropout
Add an example script with 20 newsgroups -- LDA
Add visualization for topic-word
Implement skipgram contexts
Prevent mixing between documents
Add temperature to perplexity measurements
Add temperature to viz
Add model saving
Add model predicting
Hook up RTD to docstrings