A service for getting movie and TV show metadata for an IMDb ID via HTTP or gRPC, using the official IMDb datasets.
First you need to import the data of the IMDb dataset into a database, then start the web service that is backed by the database, and finally you can query it via HTTP or gRPC.
First you need to import the data of the IMDb dataset into a database. We support BadgerDB and bbolt.
Steps:
- Download the title.basics.tsv.gz dataset from https://datasets.imdbws.com
  - For more info about IMDb datasets see https://www.imdb.com/interfaces/
  - ⚠ Warning: IMDb.com, Inc is the copyright owner of the data in the IMDb datasets. You may only use the data for personal and non-commercial use. For more info see "Can I use IMDb data in my software?" and their copyright/conditions of use statement.
- Extract the TSV file somewhere
- Run the import tool with the appropriate CLI arguments
  - Example: imdb2meta-import -tsvPath "/home/john/Downloads/data.tsv" -badgerPath "/home/john/imdb2meta/badger"
Note: The import takes a while (much longer with bbolt than with BadgerDB), requires a lot of memory, and produces a fairly big database.
With a 6-core, 12-thread CPU and a mid-range SSD, an import of all data (7351639 rows as of 2020-11-21) into BadgerDB takes 4 minutes, uses up to 1.03 GB of memory, and results in a final DB size of 1.29 GB.
When skipping TV episodes and storing only the minimal metadata, it takes 1 minute and 5 seconds, uses up to 530 MB of memory, and results in a final DB size of 314 MB.
CLI reference:
Usage of imdb2meta-import:
  -badgerPath string
        Path to the directory with the BadgerDB files
  -boltPath string
        Path to the bbolt DB file
  -limit int
        Limit the number of rows to process (excluding the header row)
  -minimal
        Only store minimal metadata (ID, type, title, release/start year)
  -skipEpisodes
        Skip storing individual TV episodes
  -skipMisc
        Skip title types like "videoGame", "audiobook" and "radioSeries"
  -tsvPath string
        Path to the "data.tsv" file that's inside the "title.basics.tsv.gz" archive
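To give a rough idea of what the import has to deal with, here is a minimal Go sketch that parses a single data row of the TSV file. It is not the import tool's actual code. It assumes the nine-column layout of title.basics (tconst, titleType, primaryTitle, originalTitle, isAdult, startYear, endYear, runtimeMinutes, genres) and the `\N` placeholder for missing values. The `Title` struct, the function names and the sample row (built from the example response further down) are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Title holds the subset of a title.basics row that this sketch keeps.
// The struct and its field names are hypothetical.
type Title struct {
	ID           string
	Type         string
	PrimaryTitle string
	StartYear    int // 0 when the dataset has no value
	Runtime      int // in minutes, 0 when the dataset has no value
	Genres       []string
}

// missing is the placeholder for absent values in the TSV file.
const missing = `\N`

// intOrZero converts a column to an int and maps the placeholder to 0.
func intOrZero(s string) (int, error) {
	if s == missing {
		return 0, nil
	}
	return strconv.Atoi(s)
}

// parseRow parses one tab-separated data row (not the header row).
func parseRow(line string) (Title, error) {
	cols := strings.Split(line, "\t")
	if len(cols) != 9 {
		return Title{}, fmt.Errorf("expected 9 columns, got %d", len(cols))
	}
	startYear, err := intOrZero(cols[5])
	if err != nil {
		return Title{}, fmt.Errorf("startYear: %w", err)
	}
	runtime, err := intOrZero(cols[7])
	if err != nil {
		return Title{}, fmt.Errorf("runtimeMinutes: %w", err)
	}
	title := Title{
		ID:           cols[0],
		Type:         cols[1],
		PrimaryTitle: cols[2],
		StartYear:    startYear,
		Runtime:      runtime,
	}
	if cols[8] != missing {
		title.Genres = strings.Split(cols[8], ",")
	}
	return title, nil
}

func main() {
	// Illustrative row, not copied from the dataset.
	row := "tt1254207\tshort\tBig Buck Bunny\tBig Buck Bunny\t0\t2008\t\\N\t10\tAnimation,Comedy,Short"
	title, err := parseRow(row)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", title)
}
```

A real importer would read the file line by line (for example with bufio.Scanner), skip the header row and write each parsed title to the database.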
After importing the data you can start the web service.
Example: imdb2meta-service -badgerPath "/home/john/imdb2meta/badger"
CLI reference:
Usage of imdb2meta-service:
  -badgerPath string
        Path to the directory with the BadgerDB files
  -bindAddr string
        Local interface address to bind to. "localhost" only allows access from the local host. "0.0.0.0" binds to all network interfaces. (default "localhost")
  -boltPath string
        Path to the bbolt DB file
  -grpcPort int
        Port to listen on for gRPC requests (default 8081)
  -httpPort int
        Port to listen on for HTTP requests (default 8080)
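The following Go sketch shows how the bind address and the two ports from the reference above could combine into listen addresses. It is only an illustration using the standard flag and net packages, not the service's actual code. The `config` type and the function names are hypothetical, and the database path flags are left out.

```go
package main

import (
	"flag"
	"fmt"
	"net"
	"os"
	"strconv"
)

// config holds the network-related flags. The type is hypothetical.
type config struct {
	bindAddr string
	httpPort int
	grpcPort int
}

// parseFlags parses the given arguments using the flag names and defaults
// from the CLI reference above.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("imdb2meta-service", flag.ContinueOnError)
	fs.StringVar(&c.bindAddr, "bindAddr", "localhost", "Local interface address to bind to")
	fs.IntVar(&c.httpPort, "httpPort", 8080, "Port to listen on for HTTP requests")
	fs.IntVar(&c.grpcPort, "grpcPort", 8081, "Port to listen on for gRPC requests")
	err := fs.Parse(args)
	return c, err
}

// listenAddr joins host and port into the "host:port" form that
// net.Listen and http.ListenAndServe expect.
func listenAddr(host string, port int) string {
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	c, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("HTTP listen address:", listenAddr(c.bindAddr, c.httpPort))
	fmt.Println("gRPC listen address:", listenAddr(c.bindAddr, c.grpcPort))
}
```

With the default "localhost" the service is only reachable from the same machine, while "0.0.0.0" makes both ports reachable from other hosts on the network.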
You can also run the service as a Docker container.
- Update the image:
docker pull doingodswork/imdb2meta-service
- Start the container:
docker run --name imdb2meta -v /path/to/badger:/data -p 8080:8080 -p 8081:8081 doingodswork/imdb2meta-service -badgerPath "/data"
- Note: Ctrl-C only detaches from the container. It doesn't stop it.
  - When detached, you can attach again with docker attach imdb2meta
- To stop the container:
docker stop imdb2meta
- To start the (still existing) container again:
docker start imdb2meta
After starting the web service you can query it via HTTP or gRPC:
Example request: curl "http://localhost:8080/meta/tt1254207"
Example response:
{
  "id": "tt1254207",
  "titleType": "SHORT",
  "primaryTitle": "Big Buck Bunny",
  "startYear": 2008,
  "runtime": 10,
  "genres": [
    "Animation",
    "Comedy",
    "Short"
  ]
}
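The same HTTP request can be made from Go. The sketch below is a minimal client that assumes the service runs on localhost:8080 and only decodes the fields shown in the example response above. The service may return more fields for other titles. The `Meta` struct and the function names are hypothetical and not part of this project.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// Meta mirrors the fields of the example response. It is a hypothetical
// type for this sketch, not the project's own type.
type Meta struct {
	ID           string   `json:"id"`
	TitleType    string   `json:"titleType"`
	PrimaryTitle string   `json:"primaryTitle"`
	StartYear    int      `json:"startYear"`
	Runtime      int      `json:"runtime"`
	Genres       []string `json:"genres"`
}

// parseMeta decodes a JSON response body into a Meta value.
func parseMeta(data []byte) (Meta, error) {
	var m Meta
	err := json.Unmarshal(data, &m)
	return m, err
}

// fetchMeta requests the metadata for an IMDb ID from the service at baseURL.
func fetchMeta(baseURL, id string) (Meta, error) {
	resp, err := http.Get(baseURL + "/meta/" + url.PathEscape(id))
	if err != nil {
		return Meta{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return Meta{}, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return Meta{}, err
	}
	return parseMeta(body)
}

func main() {
	m, err := fetchMeta("http://localhost:8080", "tt1254207")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("%s (%d), %d min, genres: %v\n", m.PrimaryTitle, m.StartYear, m.Runtime, m.Genres)
}
```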
Example request (using grpcurl): grpcurl -plaintext -d '{"id":"tt1254207"}' localhost:8081 imdb2meta.MetaFetcher/Get
(In Windows/PowerShell you have to use '{\"id\":\"tt1254207\"}')
Example response:
{
  "id": "tt1254207",
  "titleType": "SHORT",
  "primaryTitle": "Big Buck Bunny",
  "startYear": 2008,
  "runtime": 10,
  "genres": [
    "Animation",
    "Comedy",
    "Short"
  ]
}
To re-generate the meta.pb.go file from the meta.proto file, run: protoc -I="./protos" --go_out=./pb --go_opt=paths=source_relative meta.proto
To re-generate the service.pb.go and service_grpc.pb.go files from the service.proto file, run: protoc -I="./protos" --go_out=./pb --go_opt=paths=source_relative --go-grpc_out=./pb --go-grpc_opt=paths=source_relative service.proto
IMDb.com, Inc is the copyright owner of the data in the IMDb datasets. You may only use the data for personal and non-commercial use. For more info see "Can I use IMDb data in my software?" and their copyright/conditions of use statement.