Add segment for data scanning errors #15

Merged (1 commit) on Oct 27, 2023
src/content/docs/features/fine-tuning.md (25 changes: 24 additions, 1 deletion)
description: A page describing fine-tuning concept.
---

### Why?

Code models are trained on a vast amount of code from the internet, which may not perfectly align with your specific codebase, APIs, objects, or coding style. By fine-tuning the model, you can make it more familiar with your codebase and coding patterns. This allows the model to better understand your specific needs and provide more relevant and accurate code suggestions. Fine-tuning essentially helps the model memorize the patterns and structures commonly found in your code, resulting in improved suggestions tailored to your coding style and requirements.

### Which Files to Feed?

It's a good idea to give the model the current code of your projects, because any new code in the same project is likely to be similar -- that's what makes suggestions relevant and useful. However, it's NOT a good idea to feed in the 3rd party libraries that you use, as the model may learn to generate code similar to the internals of those libraries.
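As a rough illustration of the guideline above, you could collect only first-party source files when assembling a dataset. This is a hypothetical helper, not part of Refact; the directory names and suffixes are assumptions you should adjust to your project layout:

```python
from pathlib import Path

# Directories that typically hold third-party code (illustrative list).
THIRD_PARTY_DIRS = {"node_modules", "vendor", "site-packages", "third_party"}

def first_party_files(root: str, suffixes=(".py", ".js", ".ts")) -> list:
    """Collect project source files, skipping vendored dependency directories."""
    return [
        p for p in Path(root).rglob("*")
        if p.suffix in suffixes and not THIRD_PARTY_DIRS & set(p.parts)
    ]
```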

### Data Scanning

During the scanning process, files uploaded to Refact as a dataset for the fine-tuning process are validated to determine their suitability.

Files rejected during the validation process are dismissed and won't be incorporated into the dataset used in the fine-tuning stage.

The potential rejection reasons are listed below:

1. **Linguist error** - Indicates that Refact couldn't open the file, or that the file may be corrupted.
2. **Not text** - Applies to binary files, which are not suitable for fine-tuning.
3. **File is too large** - Files larger than 512 KB are not included in the dataset.
4. **Excluded by mask** - Refers to files that are manually excluded.
5. **Duplicates** - Duplicated files are rejected from the dataset.
6. **Lots of digits** - If the percentage of digits in a file exceeds a certain threshold, the file is rejected from the filtered dataset.
7. **Filter empty** - Applies when perplexity (a metric assessing the model's predictive probability) cannot be calculated.
:::note
Files that didn't pass the linguist scanning cannot be included manually after the filtering process.
:::

### GUI

Use `Sources` and `Finetune` tabs in the web UI to upload files (`.zip`, `.gz`, `.bz2` archive, or a link to your git repository) and run the fine-tune process. After the fine-tuning process finishes (which should take several hours), you can dynamically turn it on and off and observe the difference it makes for code suggestions.
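If your project isn't in a git repository, one way to produce an upload archive is with Python's standard `zipfile` module. This is a generic sketch of packaging a directory as a `.zip`; the function name and layout are illustrative, not a Refact API:

```python
import zipfile
from pathlib import Path

def pack_for_upload(root: str, out: str) -> int:
    """Zip every regular file under `root` into `out`; return the file count."""
    n = 0
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(Path(root).rglob("*")):
            if p.is_file():
                # Store paths relative to the project root inside the archive.
                zf.write(p, p.relative_to(root))
                n += 1
    return n
```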

:::note
Both VS Code and JB plugins cache the responses. To force the model to produce a new suggestion (rather than immediately responding with a cached one), you can change the text a few lines above, for example, a comment. Alternatively, you can use the Manual Suggestion Trigger (a key combination), which always produces a new suggestion.
:::