-
-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How should we do something like More Like This (MLT) in Rubix ML? #75
Comments
Great question @marclaporte! I am not expert but I read the document you linked and it looks like they're performing some type of full-text search using TF-IDF (Term Frequency Inverse Document Frequency) scores. Here is the algorithm that they are using. https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html It looks like they use document distance as the method of choosing related documents. If that is the case then selecting the document that is 'most like this' is just a matter of finding the document's nearest neighbor. You can accomplish this in Rubix ML with the Word Count Vectorizer and TF-IDF Transformer and the KDNeighbors classifier with k set to 1. The label of each sample document in this case will be the identifier of a unique entry in a database such as another document or question/answer pair. use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TFIDFTransformer;
use Rubix\ML\Classifiers\KDNeighbors;
// Import training samples and labels
$dataset = Labeled::build($samples, $labels)
->apply(new WordCountVectorizer())
->apply(new TFIDFTransformer());
$estimator = new KDNeighbors(1);
$estimator->train($dataset);
$mlt = $estimator->predictSample($sample); Where $sample is the TF-IDF encoded document that you want to find the MLT of and $mlt is the label or index of the MLT document. Since you'll be using a spatial tree for the queries it will be very fast O(n log n) instead of O(n^2) as with the brute force method. Some other approaches that you can take with Rubix ML include other Learners such as Naive Bayes or a Multilayer Perceptron. Your input will still be the TF-IDF weighted word counts but these models may work better (or worse) for your particular problem. Here is an example project that classifies the sentiment of a document using an MLP. https://github.com/RubixML/Sentiment Thanks again for the great question! |
In Marc's case, a typical scenario involves finding the 10 to 50 most similar documents among a possibly large set (say, 10K documents). So we would have to use k=10 to 50, not 1. How well does the k-Nearest Neighbour algorithm scale? How long would it take to identify the k=10 nearest neighbours among a set of say, 5K documents, each consisting of say, 2 pages? From what I understand, ES's MLT algorithm proceeds in 3 phases:
This allows it to scale to very large sets of documents, because it does not need to evaluate the document similarity with each and every document in the dataset. |
Hey @alaindesilets Yes you'd increase k with the number of similar documents you wanted retrieved and instead of the Scaling KNN depends on the underlying implementation. In Rubix we have Estimators for both brute force and spatial tree accelerated implementations. There are tradeoffs with each but the tree versions are much faster. The search time complexity of the tree-based Estimator is O(n log n) and since it uses summary nodes (Ball or Box nodes) we are able to pre-prune branches before entering them. Note the difference in the benchmarks below. The test conditions are 2,500 training and 10,000 test samples of 4 dimensions. I haven't done a benchmark using TF-IDF scores but I'd encourage you to do some on a prototype model if performance and scalability are a concern. Having that said, due to the lazy learning nature of KNN, it's not the most scalable algorithm to choose from in the library. I go a little deeper into the pro and cons of each Estimator here. It sounds like we're on the same page with the Elasticsearch algorithm. One way you can narrow the input feature space down in Rubix is through the Word Count Vectorizer. Particularly the max vocabulary and min document frequency parameters. The example below will use only the top 1000 words/tokens (ranked by term frequency) that appear in at least 3 different documents as the vocabulary. You can change the tokenizer from Word to N-gram or Skip Gram as well. This is not exactly the same as the Elasticsearch algorithm just pointing that out. use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
$dataset->apply(new WordCountVectorizer(1000, 3, new NGram(1, 2))); Another way to reduce the size of the input feature space would be to perform dimensionality reduction or feature selection after computing the TF-IDF values. I go a little into this here. I'd say if you wanted to replicate the Elasticsearch algorithm using ML then just use Elasticsearch but if you wanted to take control of the algorithm and make changes to it over time then you can design something in Rubix. |
We added maximum document frequency to Word Count Vectorizer in the last commit (5203397) so now you can narrow the vocabulary by removing frequent tokens similarly to Elasticsearch. Example use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
$dataset->apply(new WordCountVectorizer(1000, 3, 500, new NGram(1, 2))); |
We discussed this at the recent community meeting, and Victor and Andrew should meet up about this in coming weeks. |
Great @marclaporte I already began doing some experiments on my end Victor and I will take it from there when ready |
In fact, I experimented with this as well a bit but have some questions. @andrewdalpino meet you on telegram! |
Here is what I came up with - it uses the positive reviews from the Sentiment dataset. You should be able to drop this code right into the project (ex <?php
include __DIR__ . '/vendor/autoload.php';
use Rubix\ML\Other\Loggers\Screen;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Graph\Trees\BallTree;
use Rubix\ML\Kernels\Distance\Cosine;
ini_set('memory_limit', '-1');
$logger = new Screen();
$logger->info('Loading data into memory');
$samples = $labels = [];
foreach (glob("train/positive/*.txt") as $file) {
$samples[] = [file_get_contents($file)];
$labels[] = basename($file);
}
$dataset = new Labeled($samples, $labels);
$dataset = $dataset->apply(new HTMLStripper())
->apply(new TextNormalizer())
->apply(new WordCountVectorizer(5000, 3, 5000))
->apply(new TfIdfTransformer());
$testing = $dataset->randomize()->take(1);
$tree = new BallTree(30, new Cosine());
$logger->info('Growing spatial tree');
$start = microtime(true);
$tree->grow($dataset);
$minutes = (microtime(true) - $start) / 60.0;
$logger->info("Done in $minutes minutes");
$start = microtime(true);
[$samples, $labels, $distances] = $tree->nearest($testing->sample(0), 5);
$seconds = microtime(true) - $start;
$logger->info("Search took $seconds seconds");
echo PHP_EOL;
echo 'Input Document:' . PHP_EOL;
echo file_get_contents("train/positive/{$testing->label(0)}") . PHP_EOL;
echo PHP_EOL;
echo 'Most like this:' . PHP_EOL;
echo file_get_contents("train/positive/$labels[0]") . PHP_EOL;
echo PHP_EOL;
echo 'Top 5 Files:' . PHP_EOL;
print_r($labels);
echo PHP_EOL;
echo 'Top 5 Document Distances:' . PHP_EOL;
print_r($distances) . PHP_EOL; |
What if I want to train an existing model and then just query against it in runtime? I suppose BallTree cannot be persisted with existing persisters and I'd need to use KDNeighbours? In fact, this works with pretty similar speed, so I'm planning to use it unless there is an option to persist a tree that is already grown: <?php
require('vendor/autoload.php');
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Kernels\Distance\Cosine;
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Pipeline;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Transformers\StopWordFilter;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Classifiers\KDNeighbors;
use Rubix\ML\Graph\Trees\BallTree;
$stopwords = ['i','me','my','myself','we','our','ours','ourselves','you','your','yours','yourself','yourselves','he','him','his','himself','she','her','hers','herself','it','its','itself','they','them','their','theirs','themselves','what','which','who','whom','this','that','these','those','am','is','are','was','were','be','been','being','have','has','had','having','do','does','did','doing','a','an','the','and','but','if','or','because','as','until','while','of','at','by','for','with','about','against','between','into','through','during','before','after','above','below','to','from','up','down','in','out','on','off','over','under','again','further','then','once','here','there','when','where','why','how','all','any','both','each','few','more','most','other','some','such','no','nor','not','only','own','same','so','than','too','very','s','t','can','will','just','don','should','now'];
$nearest = 20;
$start = microtime(true);
$persister = new Filesystem('trained.model');
if (is_file('trained.model')) {
$estimator = $persister->load();
} else {
$dataset = Labeled::fromIterator(new CSV('tracker_46.csv'));
$mdf = $dataset->numRows()/2;
$estimator = new Pipeline([
new TextNormalizer(),
new StopWordFilter($stopwords),
new WordCountVectorizer(10000, 1, $mdf),
new TfIdfTransformer(),
], new KDNeighbors($nearest, true, new BallTree($nearest, new Cosine())));
$estimator->train($dataset);
$persister->save($estimator);
}
$mlt = $estimator->probaSample(["problem resolution and decision trees"]);
arsort($mlt);
$mlt = array_slice($mlt, 0, $nearest, true);
print_r($mlt);
echo (memory_get_peak_usage()/1024/1024)." MB\n";
$time = microtime(true) - $start;
echo $time." s\n"; |
Any object can be persisted (Ex. |
Just a heads up I've added a BM25 transformer to the Extras package that you can try out as well. This should be an improvement over TF-IDF for document retrieval. So far I've been getting great results. You can just replace the TF-IDF transformer with BM25 and viola! https://github.com/RubixML/Extras/blob/master/docs/transformers/bm25-transformer.md |
So @kroky with the new BM25 Transformer we can replicate 2 of Lucene's search strategies. The first is their BM25 method which we replicate by using the new BM25 Transformer in place of TF-IDF in the examples above. https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html The second is Lucene's TF-IDF strategy. Since their version of TF-IDF is just a special case of BM25 with https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html And to be complete, you can also use the TF-IDF Transformer like before, however, this does not have document length normalization so is simpler than Lucene's TF-IDF strategy. In addition TF-IDF transformer does not dampen the effect of high term counts by taking the square root of the term frequency (however the new BM25 Transformer does use a non-linear TF dampening function controlled by the |
Also @kroky just a heads up so you don't spend hours scratching your head like I did ... things get a little weird with Cosine distance and zero vectors (norm = 0). Cosine is undefined for zero vectors (a or b or both with 0 term counts from the vocabulary or a blank document) but we currently return 1 in this scenario which is correct in terms of the equation but breaks down under two circumstances. The first is that the distance from a zero vector to a non-zero vector will return 1 when really it should be 2 (the furthest distance possible) and the second is the distance between two zero vectors is computed as 1 when it really should be 0 (two blank documents can be interpreted as being the same). Unfortunately, the only way I could think of so far to detect this condition is if we had the vector norms (if norm = 0 then vector contains all zeros) and so far I haven't been able to make this jive with the new optimization you implemented. However, the good news is that it does work with the implementation that I outlined here since it does compute the exact norms. This wouldn't seem like such a big issue but unfortunately it causes Ball Tree to enter an infinite loop during construction :/ Can you think of another way around this using your implementation? You can test the bug by inserting a zero vector as one of the samples and then training/growing the tree. You'll know there's a problem when the universe expands into nothingness before the algorithm completes. Adding the following to Cosine distance compute right before the final return statement fixes the issue. if ($ssA === 0.0 and $ssB === 0.0) {
return 0.0;
}
if ($ssA === 0.0 or $ssB === 0.0) {
return 2.0;
} |
Yes, I noticed the infinite loop in Ball::split when one of the dataset samples was empty. It kept finding the same centeroid. I thought a solution would be to remove empty samples but then you are right that we could have zero vectors with 0 term counts from the vocabulary. How about this solution? public function compute(array $a, array $b) : float
{
$sigma = $ssA = $ssB = 0.0;
$zeroA = $zeroB = false;
foreach ($a as $i => $valueA) {
$valueB = $b[$i];
$sigma += $valueA * $valueB;
$zeroA = $zeroA || $valueA;
$zeroB = $zeroB || $valueB;
}
$zeroA = !$zeroA;
$zeroB = !$zeroB;
if ($zeroA && $zeroB) {
return 0.0;
}
if ($zeroA || $zeroB) {
return 2.0;
}
if (abs($sigma) > EPSILON) {
foreach ($a as $i => $valueA) {
$ssA += $valueA ** 2;
$ssB += $b[$i] ** 2;
}
}
return 1.0 - ($sigma / (sqrt($ssA * $ssB) ?: EPSILON));
} I am running benchmarks with |
BM25 is great! I read more about its superiority to TFIDF and seems like a very nice replacement. Our use-case is not related so much to the document length as most of the documents have similar lengths but a way to discard extremely popular words and rank higher the documents containing more of the search string words is awesome. I also found it is faster than TFIDF, not sure why - considerably faster for small corpus size and slightly faster for bigger ones. |
It could be that the learner can construct a better Ball Tree (more efficient at clustering similar documents) with the new feature representation ¯_(ツ)_/¯
BM25 dampens the effect of high term counts but it doesn't discard them - that is the job of Word Count Vectorizer (through not including the word in the vocab in the first place). Any word with a document frequency greater than max document frequency (I think we'll change this setting to a proportion of the training set in 1.0.0) for example will not be included in the vocabulary. |
Hey @kroky there was an issue with the benchmark but it's fixed now. I guess it wasn't calling the setUp() method to instantiate the kernel. Green is your original optimization, red is the new one with compensation for zero vectors, and green is the one I outlined here with zero vector compensation. I'm thinking we go with blue unless we can beat it or come close with more elegant code. |
Also, you may find this useful. I experimented with adding dimensionality reduction to the features. Went from 10,000 to 500 features with hardly any loss in "relevancy". It does take longer to train (due to the O(n^3) matmul), but it may be worth it as the search time was reduced to less than 1 second. You may even get good results with alot less than 500 dimensions depending on your dataset. In my experiment I added Gaussian Random Projector to the pipeline. This effectively projects the BM25 vectors onto a spherically random hyperplane yet most of the information needed to identify the documents remains encoded in this new vector space of lower dimensionality. It's pretty crazy, you can look up the Johnson–Lindenstrauss lemma for more info as to why this works. Apparently this technique is called random indexing in the NLP world. Here's a paper I found that explains the concept. use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\BM25Transformer;
use Rubix\ML\Transformers\GaussianRandomProjector;
$dataset = $dataset->apply(new HTMLStripper())
->apply(new TextNormalizer())
->apply(new WordCountVectorizer(10000, 3, 500))
->apply(new BM25Transformer(1.2, 0.75))
->apply(new GaussianRandomProjector(500)); This is the result Note that after random projection, there no longer is a column for the word 'Stargate' or any words from the vocabulary for that matter. However, the 500 dimensions still encode the entire vocabulary in the new vector space. |
Very useful, thanks, I will take it into consideration. Regarding optimization - I agree we should go with the fastest possible method as speed of one computation in this case affects performance a lot. |
Ok @kroky, the fix will be out in 0.1.5 then (I went with the blue one) ... I really liked your implementation though (clever and elegant), I'm bummed it didn't work out with this undefined behavior Btw I'd also recommend trying Sparse Random Projector (they use it in the random indexing paper). Hint in 0.2.0 it will be getting a feature to allow you to control the sparsity of the projections. For my dataset of 10,000ish movie reviews, 500 dimensions seems to be sufficient to capture most of the information contained in the high-dimensional (10,000) BM25 weighted word count vectors. |
A couple more things @kroky - just added to the Extras repo is the new Token Hashing Vectorizer that works well for low memory footprint applications. It doesn't build a vocabulary like Word Count Vectorizer. Instead, it employs a hashing trick to bucket token counts into their place in a fixed-length feature vector. It does suffer from collisions though so beware of that, especially in lower dimensional vector spaces (I'd guess < 10,000). It doesn't have any control over very rare or stop words either (no max/min document frequency). $dataset = $dataset->apply(new HTMLStripper())
->apply(new TextNormalizer())
// ->apply(new WordCountVectorizer(10000, 3, 500))
->apply(new TokenHashingVectorizer(10000))
->apply(new BM25Transformer(1.2, 0.75)); I did a drop-in replacement of Word Count Vectorizer in the Sentiment demo and there was about a .05 drop in the F1 score. My guess is that this is do to noise introduced by colliding tokens. Note that unlike Python, we do not have a sparse array implementation like scipy so we can't just define an insanely large vector to avoid collisions as it will actually need the memory and compute. Also, in 0.2.0, the TF-IDF Transformer will have variable Laplace (additive) smoothing. In previous versions it used add-one smoothing but now it is variable with add-one as the default. This parameter essentially borrows probability mass from the fitted vocabulary to give to words/tokens it has never seen before effectively reducing the impact of high inverse document frequencies. Setting smoothing to less than 1.0 (the default), however, will actually give some probability mass back to the fitted vocabulary. https://docs.rubixml.com/en/0.2.0/transformers/tf-idf-transformer.html#parameters Finally, it is possible to pre-normalize the samples using L1 or L2 normalizers. This would allow us to forego normalization in the Cosine kernel. What we could do is implement a 'Dot' distance kernel that only computes the dot product without normalization. I'd imagine this would shave a bit more time off both search and tree building. Let me know your thoughts! |
Andrew, thanks for all that info and new releases! I was a bit away from the topic but getting back in full speed. I am very near to the point where RubixML will be actually usable inside Tiki and our first model will be the MLT example we discuss. Your latest suggestions sound very good and we can play with them a bit more once the integration is in place as it will allow non-developers to build, train and execute models. Will share more details a I have them. |
I just wanted to bring up the infinite loop problem as it occurred once again with v0.2.0 and the cosine kernel updates in place - that is returning proper results for zero vectors. I think we might have an additional edge case in BallTree::split method which we need to take care of. Have to debug it more, though. My actual dataset is a very small set (94 samples) of mostly testing data (vector dimension is just 42), so the vectors contain a lot of zeros but still, I believe the infinite loop should be handled more elegantly. Will debug more and get back to you. |
Andrew, submitted a PR #118 regarding the infinite loop problem... |
Just saw this now @kroky thanks for the PR, we also found similar bugs in a few other trees as well thanks to your excellent debugging skills |
For the record, Rubix ML is now in Tiki, with MLT: https://doc.tiki.org/Machine-Learning Congrats @andrewdalpino and @kroky ! Thank you @alaindesilets for your insight. |
Since 0.4.0 we now can perform Latent Semantic Indexing to the document matrix. use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\BM25Transformer;
use Rubix\ML\Transformers\TruncatedSVD;
$dataset = $dataset->apply(new HTMLStripper())
->apply(new TextNormalizer())
->apply(new WordCountVectorizer(10000, 3, 500))
->apply(new BM25Transformer(1.2, 0.75))
->apply(new TruncatedSVD(100)); A dimensionality reduction factor of 100 (10000 / 100) gives about the same order of speedup. CC: @kroky @marclaporte |
@andrewdalpino: This is fantastic! This is something we massively use, and performance is an important factor. @kroky: Can you upgrade Tiki to use this new version? Thanks! |
Hey @andrewdalpino! Thanks for the update. I tried to use it but fail with this error in Tensor extension: The data is a sample one:
Not sure what is wrong with this data. If I run the same through the rubix/tensor php package it works fine. However, it seems the pecl extension is not working fine. Can you please check? Also, noticed an issue with the rubix/ml php package - MissingExtension is misspelled in the filename like this: MIssingExtension.php, so it throws a fatal error instead of catchable MissingExtension error if pecl extension is not loaded. Didn't have time to submit a PR but thought it is worth mentioning. |
Hello @andrewdalpino & @kroky I have working on the natural language searching in a ecommerce(php & Mysql) search like agolia , so here i am using the rubixml that but encounter when train the dataset canFatal error: Uncaught TypeError: Unsupported operand types: string - float in C:\laragon\www\machine-learning\vendor\rubix\ml\src\Kernels\Distance\Euclidean.php:47 Stack trace: #0 C:\laragon\www\machine-learning\vendor\rubix\ml\src\Classifiers\KNearestNeighbors.php(304): Rubix\ML\Kernels\Distance\Euclidean->compute(Array, Array)Here is my coderequire 'vendor/autoload.php'; // Import the necessary classes $host = 'localhost'; $dsn = "mysql:host=$host;dbname=$db;charset=UTF8"; try { $query = $connection->prepare('SELECT id, name FROM products '); $dataset = Labeled::fromIterator($extractor)->apply(new NumericStringConverter()) // Create a Word2Vec transformer with 100 dimensions and a window size of 2 // Create a KNN classifier with 3 neighbors // Create a pipeline with the transformer and the estimator // Train the pipeline on the dataset "; print_r($dataset); die(); |
I suppose the data you feed to the iterator or that comes from the extractor is not all numeric but contains strings and that's why the vector distance kernel throws the error - please check your data. |
ok thanks will check |
MoreLikeThis is something we use massively. I am confident there is a way to replicate this with Rubix ML (and even have a learning system), but I am wondering about a suggested approach.
Related links:
Thanks!
The text was updated successfully, but these errors were encountered: