A small XTDB utility to download CSV datasets from Kaggle and turn them into XTDB transaction operations.
At the moment, it’s only got a transformer for one dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata. If you do implement transformers for others, please do submit them as PRs!
xtdb-kaggle is a REPL-based tool at the moment. To get set up:
- Clone the repo
- Get yourself a Kaggle API key file: create an account, head to your account settings and download an API key JSON file
- Set a KAGGLE_KEY_FILE environment variable pointing to the key file
- Start a REPL and connect to it in your usual way.
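Once the REPL is up, it can be worth checking that the JVM actually sees the key file. The snippet below is a minimal sketch using only the Clojure standard library; key-file-status is a hypothetical helper written for this example, not part of xtdb-kaggle.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper (not part of xtdb-kaggle): reports whether a key file
;; path is unset, points at a missing file, or looks usable.
(defn key-file-status [path]
  (cond
    (nil? path)              :unset
    (.isFile (io/file path)) :ok
    :else                    :missing))

;; Check the variable as seen by the running JVM.
(key-file-status (System/getenv "KAGGLE_KEY_FILE"))
```

If this returns :unset, bear in mind that environment variables are read when the JVM process starts, so the variable has to be set before launching the REPL.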
Then, find yourself an interesting dataset on Kaggle.
You need to tell xtdb-kaggle which files you’d like to download, and then how to turn each file into XTDB operations. This is done using multimethods.
Using that movie dataset as an example, we have an :owner-slug of "tmdb", a :dataset-slug of "tmdb-movie-metadata", and two files: "tmdb_5000_movies.csv" and "tmdb_5000_credits.csv".
We define dataset-file-names to specify the files, and one instance of csv-row->ops-fn for each file:
(defmethod dataset-file-names ["tmdb" "tmdb-movie-metadata"] [_]
  #{"tmdb_5000_movies.csv" "tmdb_5000_credits.csv"})

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_movies.csv"] [_]
  (fn [{:strs [id title runtime budget revenue keywords genres] :as row}]
    [[::xt/put {:xt/id (keyword (name 'tmdb.movie) id)
                :tmdb/type :movie
                :tmdb.movie/id (Long/parseLong id)
                :tmdb.movie/title title
                :tmdb.movie/budget (some-> budget Long/parseLong)
                :tmdb.movie/revenue (some-> revenue Long/parseLong)
                :tmdb.movie/keywords (->> (json/read-value keywords)
                                          (into #{} (map #(get % "name"))))
                :tmdb.movie/genres (->> (json/read-value genres)
                                        (into #{} (map #(get % "name"))))}]]))

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_credits.csv"] [_]
  (fn [{:strs [movie_id cast] :as row}]
    (let [movie-id (Long/parseLong movie_id)]
      (->> (for [{cast-name "name", :strs [credit_id id character]} (json/read-value cast)]
             [[::xt/put {:xt/id (keyword (name 'tmdb.cast) (str id))
                         :tmdb/type :cast
                         :tmdb.cast/id id
                         :tmdb.cast/name cast-name}]
              [::xt/put {:xt/id (keyword (name 'tmdb.credit) credit_id)
                         :tmdb/type :credit
                         :tmdb.movie/id movie-id
                         :tmdb.cast/id id
                         :tmdb.cast/character character}]])
           (apply concat)))))
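If you're writing a transformer for another dataset, it can help to try the row function on a hand-written row before downloading anything. The sketch below does that for an invented dataset. Everything in it is an assumption made for illustration: the "example-owner" / "example-cities" slugs, the "cities.csv" file and its columns, and the two defmulti forms, which are stand-ins so the snippet runs on its own (the real multimethods live in xtdb-kaggle and their dispatch function may well differ). It writes :xtdb.api/put in full, on the assumption that xt is an alias for xtdb.api.

```clojure
;; Stand-ins for xtdb-kaggle's multimethods, so this sketch is self-contained.
;; The dispatch functions here are assumptions, not the library's own.
(defmulti dataset-file-names (juxt :owner-slug :dataset-slug))
(defmulti csv-row->ops-fn (juxt :owner-slug :dataset-slug :file-name))

;; Hypothetical dataset with a single CSV file.
(defmethod dataset-file-names ["example-owner" "example-cities"] [_]
  #{"cities.csv"})

;; One put operation per CSV row; rows arrive as maps with string keys.
(defmethod csv-row->ops-fn ["example-owner" "example-cities" "cities.csv"] [_]
  (fn [{city-name "name", :strs [id population]}]
    [[:xtdb.api/put {:xt/id (keyword "example.city" id)
                     :example/type :city
                     :example.city/name city-name
                     :example.city/population (some-> population Long/parseLong)}]]))

;; Look up the row function and apply it to a hand-written row.
(def row->ops
  (csv-row->ops-fn {:owner-slug "example-owner"
                    :dataset-slug "example-cities"
                    :file-name "cities.csv"}))

(row->ops {"id" "42", "name" "Springfield", "population" "30000"})
```

The call returns a vector containing a single put operation for the city document. Note that some-> only guards against nil, so a column that may contain empty strings would need its own check before Long/parseLong.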
Then, we can stream the dataset to a local file of XTDB transaction ops using:
(->> (dataset->ops {:owner-slug "tmdb", :dataset-slug "tmdb-movie-metadata"})
     (ops->stream (io/output-stream (io/file "/tmp/movies.edn"))))
Have fun!