A multi-pronged service built to collect training data for the USC research project "Early Fire Detection". It includes:
- an ArangoDB instance that stores the URLs of ALERTWildfire's cameras (as collected by `scripts/enumerator.py`) and Tweets of interest as collected by the Tweet monitor
- a distributed, asynchronous scraper that collects classic cam images from http://www.AlertWildfire.org and uploads a zip-compressed archive of the images to Google Drive after each full run
- a Tweet monitor that saves Tweets mentioning @AlertWildfire's Twitter account (potentially in regard to a wildfire) to the database
- an asynchronous scraper that retrieves infrared cam images from http://beta.alertwildfire.org/infrared-cameras/ and uploads the images to Google Drive
> "ALERTWildfire is a network of over 900 specialized camera installations in California, Nevada, Idaho and Oregon used by first responders and volunteers to detect and monitor wildfires." - Nevada Today
- Create a Twitter Developer account, start a new project, and set the SEARCHTWEETS_ENDPOINT, SEARCHTWEETS_BEARER_TOKEN, SEARCHTWEETS_CONSUMER_KEY, and SEARCHTWEETS_CONSUMER_SECRET environment variables in `docker-compose.yml` accordingly (see Twitter's "Step-by-step guide to making your first request to the new Twitter API v2").
- Create a Google Developer account, create a new project with the Google Drive API enabled (ensure that the scopes include read access to file metadata and write/file-upload access to Drive), authenticate a user outside of Docker (I used Google's quickstart; a modified version of it lives at `scripts/gdrive-token-helper.py`), and set the PROJECT_ID, TOKEN, REFRESH_TOKEN, and GDRIVE_PARENT_DIR environment variables accordingly.
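A minimal sketch of what the token helper might look like, adapted from Google's Drive API Python quickstart (which the step above references). The scopes and the `env_lines` helper are assumptions for illustration; the real `scripts/gdrive-token-helper.py` may differ.

```python
# Hypothetical sketch of scripts/gdrive-token-helper.py, based on Google's
# Drive API quickstart. Running the OAuth flow locally yields the token and
# refresh token to copy into docker-compose.yml.
import json

# Scopes assumed from the description above: metadata read + file upload.
SCOPES = [
    "https://www.googleapis.com/auth/drive.metadata.readonly",
    "https://www.googleapis.com/auth/drive.file",
]

def env_lines(creds_json: str) -> list:
    """Turn the credentials JSON produced by the OAuth flow into the
    TOKEN / REFRESH_TOKEN lines expected in docker-compose.yml."""
    creds = json.loads(creds_json)
    return [
        f"TOKEN={creds['token']}",
        f"REFRESH_TOKEN={creds['refresh_token']}",
    ]

def main() -> None:
    # Requires google-auth-oauthlib; run outside Docker, as noted above.
    from google_auth_oauthlib.flow import InstalledAppFlow

    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)
    print("\n".join(env_lines(creds.to_json())))

if __name__ == "__main__":
    main()
```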
```shell
docker-compose build --parallel && docker-compose up -d
```
ArangoDB database instance that stores all classic camera URLs (as collected by `scripts/enumerator.py`), infrared camera URLs, and Tweets from the Tweet Alerts monitor.
Technologies:
- Docker
- ArangoDB (latest)
`cameras` example:

```json
{
  "url": "http://www.alertwildfire.org/orangecoca/index.html?camera=Axis-DeerCanyon1",
  "timestamp": "2021-08-24T20:51:37.433870",
  "axis": "orangecoca.Axis-DeerCanyon1"
}
```

`tweets` example:

```json
{
  "id": "1430287078156234757",
  "text": "RT @CphilpottCraig: Evening timelapse 5:25-6:25pm #CaldorFire Armstrong Lookout camera. @AlertWildfire viewing North from South side of fir…",
  "scrape_timestamp": "2021-08-24T22:55:25.862109"
}
```

`ir-cameras` example:

```json
{
  "axis": "Danaher_606Z_Thermal",
  "epoch": 1631050791,
  "url": "https://weathernode.net/img/flir/Danaher_606Z_Thermal_1631050791.jpg",
  "timestamp": "2021-09-09T18:54:53.195532"
}
```
Celery backend for scraping app.
Technologies:
- Docker
- Redis (latest)
Celery broker for scraping app.
Technologies:
- Docker
- RabbitMQ (latest)
The RabbitMQ config file is located at `rabbitmq/myrabbit.conf`. `consumer_timeout` is set to 1 hour (in milliseconds), 10 minutes longer than the timeout (in seconds) explicitly set for each scraping task in the scraper's producer.
```ini
## Consumer timeout
## If a message delivered to a consumer has not been acknowledged before this
## timer triggers, the channel will be force-closed by the broker. This ensures
## that faulty consumers that never ack will not hold on to messages indefinitely.
##
## Set to 1 hour in milliseconds
consumer_timeout = 3600000
```
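The units relationship above is easy to get wrong (the broker value is in milliseconds, the task timeout in seconds), so here is a small numeric sanity check. The 50-minute per-task limit is inferred from "10 minutes longer" and is an assumption; the producer's exact value isn't shown here.

```python
# Sanity check of the broker-vs-task timeout relationship described above.
CONSUMER_TIMEOUT_MS = 3_600_000                          # broker: 1 hour, in ms
TASK_TIMEOUT_S = CONSUMER_TIMEOUT_MS // 1000 - 10 * 60   # assumed: 3000 s = 50 min

def headroom_minutes(consumer_timeout_ms: int, task_timeout_s: int) -> float:
    """Minutes the broker waits beyond the task's own timeout before
    force-closing an unacked channel."""
    return (consumer_timeout_ms / 1000 - task_timeout_s) / 60

# headroom_minutes(CONSUMER_TIMEOUT_MS, TASK_TIMEOUT_S) → 10.0
```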
Classic cameras image scraping queue producer. This process runs when a new Tweet mentioning AlertWildfire's Twitter account is detected; Tweets are queried every minute. If a camera is mentioned by name or axis in a Tweet's text, that camera is prioritized when scraping.
Technologies:
- Docker
- ArangoDB (latest)
- Python 3.9
- Celery (5.1.2)
- searchtweets-v2
- Redis (latest)
- RabbitMQ (latest)
RABBITMQ_HOST: RabbitMQ host
RABBITMQ_PORT: RabbitMQ port
RABBITMQ_DEFAULT_USER: RabbitMQ user
RABBITMQ_DEFAULT_PASS: RabbitMQ password
REDIS_HOST: Redis host
REDIS_PORT: Redis port
CONCURRENCY: integer number of concurrent celery tasks
DB_HOST: database host
DB_PORT: (arangodb) database port
DB_NAME: (arangodb) database name
DB_USER: (arangodb) database user
DB_PASS: (arangodb) database password
SEARCHTWEETS_ENDPOINT: Twitter Developer API endpoint
SEARCHTWEETS_BEARER_TOKEN: Twitter Developer API bearer token
SEARCHTWEETS_CONSUMER_KEY: Twitter Developer API key
SEARCHTWEETS_CONSUMER_SECRET: Twitter Developer API secret
CHUNK_SIZE: integer number of camera urls to be retrieved by asynchronous HTTP requests per celery task
QUEUE: name of the queue to push tasks to
Logs are sent to stdout and stderr. This can be changed in `classic-producer/conf/supervise-producer.conf`.
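The prioritization and chunking described above can be sketched as follows. Function names are illustrative, not the producer's actual API; the matching rule (case-insensitive substring on axis or camera name) is an assumption.

```python
# Hypothetical sketch: cameras whose name or axis appears in a Tweet's text
# are moved to the front of the list before it is split into CHUNK_SIZE-sized
# groups, one group per Celery task.
def prioritize(cameras: list, tweet_text: str) -> list:
    """Put cameras mentioned in the tweet (by axis or camera name) first."""
    text = tweet_text.lower()

    def mentioned(cam: dict) -> bool:
        axis = cam["axis"].lower()        # e.g. "orangecoca.axis-deercanyon1"
        name = axis.split(".", 1)[-1]     # e.g. "axis-deercanyon1"
        return axis in text or name in text

    # sorted() is stable, so the original order is kept within each group.
    return sorted(cameras, key=lambda cam: not mentioned(cam))

def chunked(urls: list, chunk_size: int) -> list:
    """Split URLs into CHUNK_SIZE groups, one group per task."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
```

In the real producer, each chunk would then be dispatched as a Celery task to the queue named by QUEUE.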
Distributed, asynchronous service that scrapes classic images from ALERTWildfire cameras.
Technologies:
- Docker
- Python 3.9
- Celery (5.1.2)
- requests_html (0.10.0)
- Redis (latest)
- RabbitMQ (latest)
- Google Drive API
- Free Proxyscrape API
RABBITMQ_HOST: RabbitMQ host
RABBITMQ_PORT: RabbitMQ port
RABBITMQ_DEFAULT_USER: RabbitMQ user
RABBITMQ_DEFAULT_PASS: RabbitMQ password
REDIS_HOST: Redis host
REDIS_PORT: Redis port
CONCURRENCY: integer number of concurrent celery tasks
LOGLEVEL: logging level (e.g. info)
QUEUE: name of the queue to retrieve tasks from
DB_HOST: database host
DB_PORT: (arangodb) database port
DB_NAME: (arangodb) database name
DB_USER: (arangodb) database user
DB_PASS: (arangodb) database password
CLIENT_ID: Twitter API client ID
CLIENT_SECRET: Twitter API client secret
PROJECT_ID: Google Drive API project ID
TOKEN: Google Drive API token
REFRESH_TOKEN: Google Drive API refresh token
GDRIVE_PARENT_DIR: ID of Google Drive directory in which to save zip archives of the scraped images
Logs are sent to stdout and stderr. This can be changed in `classic-scraper/conf/supervise-celery.conf`.
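The end-of-run upload described above (zip the scraped images, push the archive to the Drive folder GDRIVE_PARENT_DIR) might look like this. The archive naming and function names are assumptions; the Drive call follows the standard google-api-python-client pattern.

```python
# Sketch: archive scraped images and upload the zip to Google Drive.
import os
import zipfile
from datetime import datetime

def make_archive(image_dir: str, out_dir: str) -> str:
    """Zip every file in image_dir into a timestamped archive; return its path."""
    name = f"classic-{datetime.utcnow():%Y%m%dT%H%M%S}.zip"
    path = os.path.join(out_dir, name)
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for fname in sorted(os.listdir(image_dir)):
            zf.write(os.path.join(image_dir, fname), arcname=fname)
    return path

def upload(path: str) -> None:
    # Requires google-api-python-client; credentials are rebuilt from the
    # TOKEN / REFRESH_TOKEN environment variables (a real setup also needs
    # the token URI and client ID/secret for the refresh to work).
    from google.oauth2.credentials import Credentials
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    creds = Credentials(token=os.environ["TOKEN"],
                        refresh_token=os.environ["REFRESH_TOKEN"])
    drive = build("drive", "v3", credentials=creds)
    drive.files().create(
        body={"name": os.path.basename(path),
              "parents": [os.environ["GDRIVE_PARENT_DIR"]]},
        media_body=MediaFileUpload(path, mimetype="application/zip"),
    ).execute()
```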
Infrared cameras image scraping queue producer.
Technologies:
- Docker
- ArangoDB (latest)
- Python 3.9
- Celery (5.1.2)
- Redis (latest)
- RabbitMQ (latest)
RABBITMQ_HOST: RabbitMQ host
RABBITMQ_PORT: RabbitMQ port
RABBITMQ_DEFAULT_USER: RabbitMQ user
RABBITMQ_DEFAULT_PASS: RabbitMQ password
REDIS_HOST: Redis host
REDIS_PORT: Redis port
CONCURRENCY: integer number of concurrent celery tasks
DB_HOST: database host
DB_PORT: (arangodb) database port
DB_NAME: (arangodb) database name
DB_USER: (arangodb) database user
DB_PASS: (arangodb) database password
QUEUE: name of the queue to push tasks to
Logs are sent to stdout and stderr. This can be changed in `infrared-producer/conf/supervise-producer.conf`.
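The producer's job, assembling the broker URL from the RABBITMQ_* variables above and pushing tasks onto the queue named by QUEUE, can be sketched as follows. The task name `scrape_ir_chunk` is illustrative, not the project's actual task name.

```python
# Hypothetical sketch of the infrared producer's dispatch step.
import os

def broker_url() -> str:
    """Assemble the AMQP broker URL from the RABBITMQ_* variables above."""
    return ("amqp://{RABBITMQ_DEFAULT_USER}:{RABBITMQ_DEFAULT_PASS}"
            "@{RABBITMQ_HOST}:{RABBITMQ_PORT}//").format(**os.environ)

def dispatch(url_chunks) -> None:
    from celery import Celery  # requires celery 5.x

    app = Celery("infrared-producer", broker=broker_url())
    for chunk in url_chunks:
        # Task name and signature are assumptions for illustration.
        app.send_task("scrape_ir_chunk", args=[chunk],
                      queue=os.environ["QUEUE"])
```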
Distributed, asynchronous service that scrapes infrared images from ALERTWildfire cameras.
Technologies:
- Docker
- ArangoDB (latest)
- Python 3.9
- Celery (5.1.2)
- requests_html (0.10.0)
- Redis (latest)
- RabbitMQ (latest)
- Google Drive API
- Free Proxyscrape API
RABBITMQ_HOST: RabbitMQ host
RABBITMQ_PORT: RabbitMQ port
RABBITMQ_DEFAULT_USER: RabbitMQ user
RABBITMQ_DEFAULT_PASS: RabbitMQ password
REDIS_HOST: Redis host
REDIS_PORT: Redis port
CONCURRENCY: integer number of concurrent celery tasks
LOGLEVEL: logging level (e.g. info)
QUEUE: name of the queue to retrieve tasks from
DB_HOST: database host
DB_PORT: (arangodb) database port
DB_NAME: (arangodb) database name
DB_USER: (arangodb) database user
DB_PASS: (arangodb) database password
CLIENT_ID: Twitter API client ID
CLIENT_SECRET: Twitter API client secret
PROJECT_ID: Google Drive API project ID
TOKEN: Google Drive API token
REFRESH_TOKEN: Google Drive API refresh token
GDRIVE_PARENT_DIR: ID of Google Drive directory in which to save zip archives of the scraped images
Logs are sent to stdout and stderr. This can be changed in `infrared-scraper/conf/supervise-celery.conf`.
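The `ir-cameras` example document earlier suggests that infrared image URLs follow a simple axis/epoch pattern on weathernode.net. This helper reconstructs that pattern; it is inferred from the example document, not confirmed project code.

```python
# Rebuild an infrared image URL from the axis and epoch fields stored in the
# ir-cameras collection (pattern inferred from the example document above).
def ir_image_url(axis: str, epoch: int) -> str:
    return f"https://weathernode.net/img/flir/{axis}_{epoch}.jpg"

# ir_image_url("Danaher_606Z_Thermal", 1631050791)
# → "https://weathernode.net/img/flir/Danaher_606Z_Thermal_1631050791.jpg"
```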