gaby: exclude stale(?) web document #63

Open
hyangah opened this issue Nov 26, 2024 · 2 comments

@hyangah

hyangah commented Nov 26, 2024

From golang/go#67901 (comment)

Docs like https://go.dev/doc/go1.17_spec#Package_initialization are kept for historical purposes.
We may come up with a workaround for this specific issue. I am not sure about general solutions.

Some approaches I am thinking of:

  • Label such docs manually in the document source and exclude them

  • Label such docs using an LLM (e.g. as "obsolete"?) and exclude them

    (we can also do the same for issues that we don't want to appear in the related info, by labelling/classifying them appropriately)

  • Before posting, drop near-duplicates (e.g. via pairwise similarity comparison)
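
A minimal sketch of the last approach, assuming each candidate result already carries an embedding vector. The `Result` type, the `dropNearDuplicates` helper, and the 0.95 threshold are hypothetical illustrations, not part of gaby:

```go
package main

import (
	"fmt"
	"math"
)

// Result is a hypothetical search hit: a document URL plus its embedding.
type Result struct {
	URL string
	Vec []float64
}

// cosine returns the cosine similarity of a and b.
// It returns 0 if the lengths differ or either vector has zero norm.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// dropNearDuplicates keeps results in their original order and skips any
// result whose similarity to an already-kept result is >= threshold.
func dropNearDuplicates(results []Result, threshold float64) []Result {
	var kept []Result
	for _, r := range results {
		dup := false
		for _, k := range kept {
			if cosine(r.Vec, k.Vec) >= threshold {
				dup = true
				break
			}
		}
		if !dup {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	// Toy vectors chosen for illustration only.
	results := []Result{
		{URL: "https://go.dev/ref/spec#Package_initialization", Vec: []float64{1, 0, 0}},
		{URL: "https://go.dev/doc/go1.17_spec#Package_initialization", Vec: []float64{0.99, 0.01, 0}},
		{URL: "https://go.dev/doc/effective_go", Vec: []float64{0, 1, 0}},
	}
	for _, r := range dropNearDuplicates(results, 0.95) {
		fmt.Println(r.URL)
	}
}
```

Because the first result seen wins, this only helps if the current document is ranked ahead of the historical copy; otherwise the stale page is the one that survives.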

@gopherbot

Change https://go.dev/cl/633395 mentions this issue: internal/gaby: exclude go1.17_spec docs from crawling

@gopherbot

Change https://go.dev/cl/635176 mentions this issue: internal/devtools/cmd/rmdoc: delete crawled pages from corpus

gopherbot pushed a commit that referenced this issue Dec 15, 2024
This page was temporarily added to help spec revision.
It will be removed at the start of go1.25.
Until then, ignore this page.
(We have two entries for this page in our DB.)

For #63

Change-Id: Ibf369100ca25f47ca487bb87f7327388ef8dcef3
Reviewed-on: https://go-review.googlesource.com/c/oscar/+/633395
Reviewed-by: Tatiana Bradley <tatianabradley@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Dec 15, 2024
Gaby splits each crawled webpage into docs for embedding, computes
their embeddings, and stores them in the vector DB. Delete all the docs
and their embeddings.

This is meant to be run after the webpage is excluded from
crawling with Crawler.Deny.

For #63

Change-Id: I095a65b9a834ccf48062facc3654f40b43562e15
Reviewed-on: https://go-review.googlesource.com/c/oscar/+/635176
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Jonathan Amsterdam <jba@google.com>
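
The cleanup described in the second commit can be sketched as follows. This is an illustration under stated assumptions, not the rmdoc implementation: the `corpus` type is a hypothetical in-memory stand-in for the doc store and vector DB, and doc IDs are assumed to be the page URL optionally followed by `#fragment`.

```go
package main

import (
	"fmt"
	"strings"
)

// corpus is a hypothetical in-memory stand-in for the doc store and vector DB.
// Doc IDs are assumed to be the page URL, optionally followed by "#fragment".
type corpus struct {
	docs map[string]string    // doc ID -> text
	vecs map[string][]float32 // doc ID -> embedding
}

// belongsTo reports whether doc ID id is the page itself or a fragment of it.
// Matching on page+"#" avoids deleting pages that merely share a URL prefix.
func belongsTo(id, page string) bool {
	return id == page || strings.HasPrefix(id, page+"#")
}

// removePage deletes every doc and embedding that belongs to page and
// returns the number of docs deleted.
func (c *corpus) removePage(page string) int {
	n := 0
	for id := range c.docs {
		if belongsTo(id, page) {
			delete(c.docs, id) // deleting during range is safe in Go
			n++
		}
	}
	// Sweep embeddings separately so orphaned vectors are removed too.
	for id := range c.vecs {
		if belongsTo(id, page) {
			delete(c.vecs, id)
		}
	}
	return n
}

func main() {
	c := &corpus{
		docs: map[string]string{
			"https://go.dev/doc/go1.17_spec#Introduction":           "old intro",
			"https://go.dev/doc/go1.17_spec#Package_initialization": "old init",
			"https://go.dev/ref/spec#Package_initialization":        "current init",
		},
		vecs: map[string][]float32{
			"https://go.dev/doc/go1.17_spec#Introduction":           {0.1},
			"https://go.dev/doc/go1.17_spec#Package_initialization": {0.2},
			"https://go.dev/ref/spec#Package_initialization":        {0.3},
		},
	}
	n := c.removePage("https://go.dev/doc/go1.17_spec")
	fmt.Println("deleted docs:", n, "remaining docs:", len(c.docs))
}
```

As the commit message notes, a cleanup like this is only useful once the page is excluded from crawling; otherwise the next crawl would add the docs back.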