ClearlyDefined's crawler runs as the cdcrawler-dev App Service. The environmental variables for this App Service include:
- CRAWLER_AZBLOB_CONNECTION_STRING
- CRAWLER_AZBLOB_CONTAINER_NAME
- CRAWLER_DEADLETTER_PROVIDER
- CRAWLER_GITHUB_TOKEN
- CRAWLER_HOST
- CRAWLER_INSIGHTS_KEY
- CRAWLER_NAME
- CRAWLER_QUEUE_AZURE_CONNECTION_STRING
- CRAWLER_QUEUE_PREFIX
- CRAWLER_QUEUE_PROVIDER
- CRAWLER_SCANCODE_PARALLELISM
- CRAWLER_SERVICE_AUTH_TOKEN
- CRAWLER_SERVICE_URL
- CRAWLER_STORE_PROVIDER
- CRAWLER_WEBHOOK_TOKEN
- CRAWLER_WEBHOOK_URL
- DOCKER_CUSTOM_IMAGE_NAME
- DOCKER_ENABLE_CI
- DOCKER_REGISTRY_SERVER_PASSWORD
- HARVEST_AZBLOB_CONNECTION_STRING
- HARVEST_AZBLOB_CONTAINER_NAME
- WEBSITE_HTTPLOGGING_RETENTION_DAYS
- WEBSITES_ENABLE_APP_SERVICE_STORAGE
That's a lot! Let's break them down; several of them configure how the crawler connects to other Azure services.
CRAWLER_AZBLOB_CONNECTION_STRING and CRAWLER_AZBLOB_CONTAINER_NAME point the crawler at an Azure Storage Blob container. Blobs are a good place to store unstructured data. Blob containers belong to an Azure Storage Account, which can hold many blob containers (as well as queues and tables). The Azure Storage Account where we keep this (and other) containers is clearlydefineddev.
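As a rough illustration of how a Node-based service might use these two values, here is a sketch assuming the @azure/storage-blob SDK; the crawler's actual storage code may differ.

```typescript
import { BlobServiceClient } from "@azure/storage-blob";

// Connect to the storage account with the connection string, then open the
// container named by CRAWLER_AZBLOB_CONTAINER_NAME.
const blobService = BlobServiceClient.fromConnectionString(
  process.env.CRAWLER_AZBLOB_CONNECTION_STRING!
);
const container = blobService.getContainerClient(
  process.env.CRAWLER_AZBLOB_CONTAINER_NAME!
);

// Example: persist a harvest document as a JSON blob.
async function saveDocument(name: string, doc: object): Promise<void> {
  const body = JSON.stringify(doc);
  await container.getBlockBlobClient(name).upload(body, Buffer.byteLength(body));
}
```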
CRAWLER_DEADLETTER_PROVIDER controls where failed requests go. When the crawler tries to process a request more than N times (possibly 5) and fails, it puts the request in the deadletter box; most often it's a request for a component that no longer exists (or never did).
In this case, the CRAWLER_DEADLETTER_PROVIDER value is cd(azblob).
Deadletter documents are stored in the same blob container that the CRAWLER_AZBLOB variables point to, under a "deadletter" folder.
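A hypothetical sketch of writing such a deadletter document follows; the folder layout and document shape are illustrative, not the crawler's actual format.

```typescript
import { ContainerClient } from "@azure/storage-blob";

// Hypothetical: after a request has failed too many times, persist it under
// the "deadletter/" prefix of the harvest container.
async function deadletter(
  container: ContainerClient,
  requestId: string,
  request: object,
  error: string
): Promise<void> {
  const doc = JSON.stringify({ request, error, deadletteredAt: new Date().toISOString() });
  await container
    .getBlockBlobClient(`deadletter/${requestId}.json`)
    .upload(doc, Buffer.byteLength(doc));
}
```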
The crawler makes extensive use of the GitHub API, and CRAWLER_GITHUB_TOKEN is the API token that allows it to do so.
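For example, an authenticated call with that token might look roughly like this; the endpoint shown is only an illustration, not necessarily one the crawler uses.

```typescript
// Fetch a repository's license metadata from the GitHub API using the token.
async function getLicense(owner: string, repo: string): Promise<unknown> {
  const response = await fetch(`https://api.github.com/repos/${owner}/${repo}/license`, {
    headers: {
      Authorization: `token ${process.env.CRAWLER_GITHUB_TOKEN}`,
      Accept: "application/vnd.github+json"
    }
  });
  if (!response.ok) throw new Error(`GitHub API error: ${response.status}`);
  return response.json();
}
```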
CRAWLER_HOST is used to identify a group of crawler deployments.
Note that we only use it in the development environment, not in the production environment.
We use Azure Application Insights to monitor the crawler application. Application Insights requires a key, and CRAWLER_INSIGHTS_KEY is where it is kept.
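In a Node application this is typically a one-line setup; the sketch below assumes the applicationinsights npm package and may not match the crawler's exact configuration.

```typescript
import * as appInsights from "applicationinsights";

// Start sending telemetry to Application Insights using the configured key.
appInsights.setup(process.env.CRAWLER_INSIGHTS_KEY).start();
```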
CRAWLER_NAME is a name to refer to the crawler with. Note that we set it in the App Service in the development environment and in the Dockerfile for the production environment.
The crawler does not harvest repos or package repositories immediately when they are found. It queues them up to be harvested, then takes items off that queue and harvests them in a timely fashion; this avoids overloading our infrastructure with too many harvests at once.
We use an Azure Storage Queue for this, kept within the same Azure Storage Account as the blobs used by this environment, and the CRAWLER_QUEUE_* variables configure it (see the sketch below).
CRAWLER_QUEUE_AZURE_CONNECTION_STRING is an optional variable used if you want to keep your queues in a different Azure account from the results azblobs. It is intended for people hosting the crawler themselves while submitting results to ClearlyDefined's Azure blobs, so that their queue data is segregated from the ClearlyDefined Azure account for security and compliance reasons.
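Here is a minimal sketch of what using such a queue could look like, assuming the @azure/storage-queue SDK and a hypothetical queue-naming scheme built from CRAWLER_QUEUE_PREFIX; the crawler's real queue code and queue names may differ.

```typescript
import { QueueServiceClient } from "@azure/storage-queue";

// Prefer the dedicated queue connection string; otherwise assume the queue
// lives in the same storage account as the crawler's blobs.
const queueConnectionString =
  process.env.CRAWLER_QUEUE_AZURE_CONNECTION_STRING ||
  process.env.CRAWLER_AZBLOB_CONNECTION_STRING!;
const queueService = QueueServiceClient.fromConnectionString(queueConnectionString);

// Build a queue name from the prefix, e.g. "<prefix>-normal" (hypothetical).
const queue = queueService.getQueueClient(`${process.env.CRAWLER_QUEUE_PREFIX}-normal`);

// Queue a harvest request; Azure Storage Queue messages are plain strings.
async function enqueueHarvestRequest(request: object): Promise<void> {
  await queue.sendMessage(JSON.stringify(request));
}
```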
CRAWLER_SCANCODE_PARALLELISM is the number of scancode-toolkit processes to run in parallel. scancode-toolkit is one of the main tools that collect the licensing data that goes into the final definition, and increasing parallelism, if the CPU allows, speeds up processing of individual definitions. The default value is 2, and a good ballpark value is ~80% of the total CPUs available to the crawler.
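As a small illustration of that guidance (none of this is taken from the crawler's code):

```typescript
import * as os from "os";

// Use the configured parallelism, falling back to the documented default of 2.
const configured = Number(process.env.CRAWLER_SCANCODE_PARALLELISM) || 2;

// Rule of thumb from above: roughly 80% of the CPUs available to the crawler.
const suggested = Math.max(1, Math.floor(os.cpus().length * 0.8));

console.log(`scancode parallelism: configured=${configured}, suggested=${suggested}`);
```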
CRAWLER_SERVICE_AUTH_TOKEN is the token used by the ClearlyDefined service to send requests to the ClearlyDefined crawler.
CRAWLER_SERVICE_URL is the url of the App Service running the crawler.
It's unclear where this environmental variable is used within the crawler.
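Putting the last two variables together, the service-to-crawler direction might look something like the sketch below; the request path, header name, and payload shape are hypothetical, not the actual API contract.

```typescript
// Hypothetical: the ClearlyDefined service asks the crawler to harvest a
// component, authenticating with CRAWLER_SERVICE_AUTH_TOKEN.
async function requestHarvest(componentCoordinates: string): Promise<void> {
  const response = await fetch(`${process.env.CRAWLER_SERVICE_URL}/requests`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Token": process.env.CRAWLER_SERVICE_AUTH_TOKEN!
    },
    body: JSON.stringify({ type: "component", coordinates: componentCoordinates })
  });
  if (!response.ok) throw new Error(`Crawler request failed: ${response.status}`);
}
```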
We use multiple services to store the crawler's harvests of license information.
If you look at the value of the CRAWLER_STORE_PROVIDER environmental variable, you will see that it is "cdDispatch+cd(azblob)+webhook".
This value is consumed by the crawler's configuration code. Let's break it down:
- cd(azblob): This indicates that we are storing the content of the harvests (the license information about the component) in an Azure Storage Blob. In this case, we are using the same blob as we use for the CRAWLER_AZBLOB environmental variables.
- webhook: This refers to a webhook on the ClearlyDefined service (the backend API). This is what the crawler calls when it completes a harvest. The ClearlyDefined service then takes action on the information in the harvest.
- cdDispatch: We use a few different "dispatchers", which fetch GitHub repos or package repositories to harvest the license data from: one for GitHub repos, one for npm packages, one for crates, and so on. cdDispatch refers to the generic base file that handles calls to the various dispatchers.
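For intuition, here is a hypothetical sketch of how a "provider+provider(argument)" value like this could be split apart; the crawler's real configuration code is more involved.

```typescript
// Hypothetical: split a provider spec such as "cdDispatch+cd(azblob)+webhook"
// into its parts. Only illustrates the format, not the crawler's parser.
function parseStoreProvider(spec: string): { name: string; arg?: string }[] {
  return spec.split("+").map(part => {
    const match = /^([^(]+)(?:\(([^)]*)\))?$/.exec(part);
    if (!match) throw new Error(`Unrecognized provider: ${part}`);
    return { name: match[1], arg: match[2] };
  });
}

// parseStoreProvider("cdDispatch+cd(azblob)+webhook")
// => [{ name: "cdDispatch" }, { name: "cd", arg: "azblob" }, { name: "webhook" }]
```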
CRAWLER_WEBHOOK_URL and CRAWLER_WEBHOOK_TOKEN define the ClearlyDefined service's webhook URL (what the crawler calls after it completes a harvest) and the credential for calling it.
In Dev the webhook url is "https://dev-api.clearlydefined.io/webhook".
The token is what we use to authenticate to the API, so that only the crawler can call that part of the ClearlyDefined service API.
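A minimal sketch of that call from the crawler's side; the header name and payload shape are assumptions for illustration.

```typescript
// Hypothetical: tell the ClearlyDefined service that a harvest finished.
async function notifyHarvestComplete(result: object): Promise<void> {
  const response = await fetch(process.env.CRAWLER_WEBHOOK_URL!, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Crawler-Token": process.env.CRAWLER_WEBHOOK_TOKEN!
    },
    body: JSON.stringify(result)
  });
  if (!response.ok) throw new Error(`Webhook call failed: ${response.status}`);
}
```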
The DOCKER_* environmental variables define what container image is used for the crawler, what registry that image is kept in, and the authentication info for that registry.
DOCKER_ENABLE_CI in particular is used by Azure App Service to redeploy based on the image registry's webhook.
HARVEST_AZBLOB_CONNECTION_STRING and HARVEST_AZBLOB_CONTAINER_NAME hold values identical to the ones stored in the CRAWLER_AZBLOB environmental variables. It's not clear why they are separate; they do not appear to be used anywhere in the crawler and may be able to be removed.
WEBSITE_HTTPLOGGING_RETENTION_DAYS does not appear to be used anywhere in the crawler. It may be able to be removed.
WEBSITES_ENABLE_APP_SERVICE_STORAGE also does not appear to be used anywhere in the crawler. It may be able to be removed.
And that concludes our discussion of cdcrawler-dev.
Now let's move on to the most complex App Service: clearlydefined-api-dev.