After working with this software for a while, I'm becoming aware that there are many valid site configurations out there that we are unable to navigate due to the limitations of the spider and harvesting system.
Given the above planned features for the spider, it would significantly improve code testing to set up a simple web server, with a `robots.txt` and `sitemap.xml` at the base, that delivers content in some of the ways commonly used by data repositories. For example, we could test navigation of JavaScript elements that render JSON-LD content after the page loads (e.g. MagIC, DataONEorg/member-repos#16), an `application/ld+json` delivery system (e.g. Harvard Dataverse, DataONEorg/member-repos#52), some valid but alternative configurations of schema.org data (e.g. CanWIN, DataONEorg/member-repos#67), and perhaps some misconfigured `robots.txt` scenarios (e.g. Borealis, DataONEorg/member-repos#51), all without needing to crawl the repositories themselves.
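Not wedded to any particular framework, but as a rough sketch of what such a test fixture could look like (assuming Python with only the standard library; the port, paths, dataset identifiers, and JSON-LD payload below are made up for illustration, not taken from any real repository):

```python
# Minimal sketch of a local fixture server for spider tests.
# Assumptions: localhost:8000, a single hypothetical dataset landing page,
# and an illustrative schema.org/Dataset record.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS_TXT = """User-agent: *
Allow: /
Sitemap: http://localhost:8000/sitemap.xml
"""

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://localhost:8000/dataset/1</loc></url>
</urlset>
"""

# schema.org/Dataset record embedded in the landing page as application/ld+json.
DATASET_JSONLD = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset 1",
    "identifier": "urn:example:dataset:1",
}

LANDING_PAGE = f"""<!DOCTYPE html>
<html>
  <head>
    <script type="application/ld+json">{json.dumps(DATASET_JSONLD)}</script>
  </head>
  <body><h1>Example dataset 1</h1></body>
</html>
"""


class FixtureHandler(BaseHTTPRequestHandler):
    """Serve robots.txt, sitemap.xml, and a JSON-LD landing page."""

    def do_GET(self):
        routes = {
            "/robots.txt": ("text/plain", ROBOTS_TXT),
            "/sitemap.xml": ("application/xml", SITEMAP_XML),
            "/dataset/1": ("text/html", LANDING_PAGE),
        }
        if self.path in routes:
            content_type, body = routes[self.path]
            payload = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
        else:
            self.send_error(404)


if __name__ == "__main__":
    # Point the spider at http://localhost:8000/ instead of a live repository.
    HTTPServer(("localhost", 8000), FixtureHandler).serve_forever()
```

The other scenarios (JSON-LD rendered by JavaScript after page load, alternative schema.org layouts, a deliberately misconfigured `robots.txt`) could then be added as extra routes or variant handlers on the same server.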