Following up from But why? lets look at the USDA foundation foods 24MB CSV, together with the PDF description also mentioned in the website. I've added them both here in the demo folder.
I looked around with Apple Numbers. A couple of observations:
- looks like db tables with ids to other tables
- the main tables seem to be food.csv, nutrient.csv, and food_nutrient.csv
- in nutrient.csv the most interesting ids seem to be 1003 (protein), 1004 (fat), 1005 (carbohydrates), 1063 (sugars total), 1235 (added sugars), 1079 (fibre), 1008 (energy), but there's lots of entries and some like energy and carbohydrates seem to be repeated
- in food.csv, the important ones seem to be data_type=foundation_food
- food_nutrient.csv links food.csv and nutrient.csv and the PDF says amount of nutrient is per 100g
- filtering food_nutrient.csv by food freezes my Numbers app
At this point I think it'd be pretty easy to load up data from these CSVs into memory and do some basic filtering to only get foundation food nutrient for the ones I identified.
But I don't really know if the nutrients I picked are the right ones. And I'm a bit resentful of Numbers crashing on some basic filtering over a 8mb CSV, which meant I had to eyeball stuff and scroll back and forth a lot. Surely I can do better than scrolling and cmd+f with my high powered laptop.
Lets load this data up into a in-memory database and take a look around.
First thing we'll need is a CSV parsing lib, and clojure has a first party one.
Add it to fdbconfig.edn
under extra-deps
.
There's no need to restart fdb watch
, it will see that fdbconfig.edn
changed and reload itself.
:extra-deps {org.clojure/data.csv {:mvn/version "1.1.0"}}
Lets take it for a spin in ~/fdb/demos/reference/nutrition/fdb.repl.edn
:
(require '[babashka.fs :as fs]
'[clojure.data.csv :as csv]
'[clojure.java.io :as io]
'[fdb.call :as call])
;; From https://github.com/clojure/data.csv?tab=readme-ov-file#parsing-into-maps
(defn csv-data->maps [csv-data]
(map zipmap
(->> (first csv-data) ;; First row is the header
(map keyword) ;; Drop if you want string keys instead
repeat)
(rest csv-data)))
(def csv-dir (-> (call/arg)
:self-path
fs/parent
(fs/path "FoodData_Central_foundation_food_csv_2023-10-26")))
(defn read-csv [filename]
(with-open [reader (->> filename (fs/path csv-dir) str io/reader)]
(doall
(csv-data->maps (csv/read-csv reader)))))
;; defonce so we don't do this again each time we eval
;; if you need to reset them, just change it to def
(defonce food (read-csv "food.csv"))
(defonce nutrient (read-csv "nutrient.csv"))
(defonce food-nutrient (read-csv "food-nutrient.csv"))
;; Take a peek at the data we got
(take 5 food)
You should see a some data printed in ~/fdb/demos/reference/nutrition/repl-out.fdb.clj
.
If you have a code editor configured for Clojure development you can connect to the fdb watch
nREPL server on port 2525 and eval the fdb.repl.edn
there instead.
;; => ({:fdc_id "319874",
;; :data_type "sample_food",
;; :description "HUMMUS, SABRA CLASSIC",
;; :food_category_id "16",
;; :publication_date "2019-04-01"}
;; {:fdc_id "319875",
;; :data_type "market_acquisition",
;; :description "HUMMUS, SABRA CLASSIC",
;; :food_category_id "16",
;; :publication_date "2019-04-01"}
;; {:fdc_id "319876",
;; :data_type "market_acquisition",
;; :description "HUMMUS, SABRA CLASSIC",
;; :food_category_id "16",
;; :publication_date "2019-04-01"}
;; {:fdc_id "319877",
;; :data_type "sub_sample_food",
;; :description "Hummus",
;; :food_category_id "16",
;; :publication_date "2019-04-01"}
;; {:fdc_id "319878",
;; :data_type "sub_sample_food",
;; :description "Hummus",
;; :food_category_id "16",
;; :publication_date "2019-04-01"})
Results look good enough to shove into a db and query. Let's make a in-memory XTDB node and just push everything there. This isn't going to be in the FileDB database itself - it's just a temporary db we're making to explore the data. Making temporary in-memory databases in Clojure is pretty easy.
We're going to have to pay some attention to ids.
We want to map them to :xt/id
so we can use pull to scoop up referenced data.
food
and food-nutrient
have their own ids and they don't seem to collide, so we can use those directly.
food
has fdc_id, and also doesn't doesn't seem to collide with the other ids.
Add this to the end of the repl file:
(require '[xtdb.api :as xt])
(defonce node (xt/start-node {}))
(defn add-xtid-and-submit
[coll from-k]
(->> coll
(mapv (fn [m]
[::xt/put (assoc m :xt/id (get m from-k))]))
(xt/submit-tx node)))
(defonce food-tx (add-xtid-and-submit food :fdc_id))
(defonce nutrient-tx (add-xtid-and-submit nutrient :id))
(defonce food-nutrient-tx (add-xtid-and-submit food-nutrient :id))
;; Wait for the db to catch up to the txs before querying
(xt/sync node)
(xt/q (xt/db node)
'{:find [(pull ?e [*])]
:where [[?e :data_type "foundation_food"]]
:limit 3})
And in the outputs you should see:
;; => [[{:fdc_id "1104647",
;; :data_type "foundation_food",
;; :description "Garlic, raw",
;; :food_category_id "11",
;; :publication_date "2020-10-30",
;; :xt/id "1104647"}]
;; [{:fdc_id "1104705",
;; :data_type "foundation_food",
;; :description "Flour, soy, defatted",
;; :food_category_id "16",
;; :publication_date "2020-10-30",
;; :xt/id "1104705"}]
;; [{:fdc_id "1104766",
;; :data_type "foundation_food",
;; :description "Flour, soy, full-fat",
;; :food_category_id "16",
;; :publication_date "2020-10-30",
;; :xt/id "1104766"}]]
Let's now get all the nutrients too. XTDB is pretty good at this with the pull projection.
(xt/q (xt/db node)
'{:find [(pull ?e [:description
{:_fdc_id [:amount
{:nutrient_id [:name
:unit_name]}]}])]
:where [[?e :data_type "foundation_food"]]
:limit 1})
Which gets us:
;; => [[{:description "Garlic, raw",
;; :_fdc_id
;; ({}
;; {:amount "2.7",
;; :nutrient_id {:name "Fiber, total dietary", :unit_name "G"}}
;; {:amount "63.1", :nutrient_id {:name "Water", :unit_name "G"}}
;; {:amount "9.8",
;; :nutrient_id {:name "Selenium, Se", :unit_name "UG"}}
;; {:amount "1.06", :nutrient_id {:name "Nitrogen", :unit_name "G"}}
;; {:amount "1.71", :nutrient_id {:name "Ash", :unit_name "G"}}
;; {:amount "0.38",
;; :nutrient_id {:name "Total lipid (fat)", :unit_name "G"}}
;; {:amount "10.0",
;; :nutrient_id
;; {:name "Vitamin C, total ascorbic acid", :unit_name "MG"}}
;; {:amount "6.62", :nutrient_id {:name "Protein", :unit_name "G"}}
;; {:amount "28.2",
;; :nutrient_id
;; {:name "Carbohydrate, by difference", :unit_name "G"}}
;; {:amount "143.0", :nutrient_id {:name "Energy", :unit_name "KCAL"}}
;; {:amount "597.0", :nutrient_id {:name "Energy", :unit_name "kJ"}}
;; {:amount "130",
;; :nutrient_id
;; {:name "Energy (Atwater Specific Factors)", :unit_name "KCAL"}}
;; {:amount "143",
;; :nutrient_id
;; {:name "Energy (Atwater General Factors)", :unit_name "KCAL"}})}]]
The pull projection is a bit gnarly:
(pull ?e [:description
{:_fdc_id [:amount
{:nutrient_id [:name
:unit_name]}]}])
It means:
- for the entity (each foundation food) get
:description
and what's in the{}
:_fdc_id
pulls all references back to this entity through the:fdc_id
key (you can do this with any key by prefixing it with_
)- for those we're getting
:amount
what's in the{}
:nutrient_id
pulls what's referenced in this entity in the:nutrient_id
key- for those we're getting
:name
and:amount
It's a bit easier if we look at it as built iteratively, each time replacing one of the pulled keys by {:key pull-pattern}
:
;; pull description and all the id of all the entities that references ?e (backrefs) via the :fdc_id key
;; NB: this won't actually return the backrefs ids, it's just easier to think about
(pull ?e [:description
:_fdc_id])
;; now for those entities, pull :amount and :nutrient_id keys
(pull ?e [:description
{:_fdc_id [:amount
:nutrient_id]}])
;; for the entity referenced in :nutrient_id, pull :name and :unit_name
(pull ?e [:description
{:_fdc_id [:amount
{:nutrient_id [:name
:unit_name]}]}])
It might feel daunting but at the end of the day it's an impressive amount of power and expressiveness in 4 lines. This is exactly what I'm looking for when hacking together my own tools.
At this point we've ascertained we can get the data we want from the CSV sources, in about 60 lines of code and pretty fast feedback cycles. I didn't make the code I put here right on the first try, but I did iterate and debug it with these fast feedback cycles until it was doing what I wanted, and it didn't take a long time.
Now I feel pretty confident about what I want to extract from these CSVs. What do I do with the data after extracting it though?
I want to:
- look it up offline on both desktop and mobile
- reference it
- calculate meal, daily, and weekly nutrition totals
I've been able to do some of these things from apps but never all of these things. Point 2 is particularly hard since apps are so isolated and I don't really own my data in most of them.
I'm a heavy Obsidian user so my thinking right now is that I should make a markdown file for each foundation food with its nutrition data in YML properties. Then I can reference it easily (point 1), and I have a little food database on desktop and mobile (point 2).
We could transact this data directly into the fdb node instead of putting it in a file. But that would mean we couldn't do point 1 or 2 as easily, and it wouldn't survive a database wipe or file sync to another machine.
I don't see any obvious way of calculating stuff (point 3) in Obsidian from YML properties though, nor from any data stored in markdown. But we have triggers so we should be able to somehow trigger the calculation and write back to the file. I'm a bit fuzzy on this part right now, but it seems doable, and it's a starting point.
You don't have to use Obsidian if you don't want to. You can use just markdown files on disk. Or something else. You do you. Even the most budget computer or phone around right now is more than capable of working through all of the personal data you can accumulate during your whole lifetime. Make it work for you.
Obsidian with YML is pretty straightforward: shove the yml into ---
fences at the start of the doc.
Clojure has a first party yml lib we can use so we don't have to think too much about it, add it to fdbconfig.edn
under extra-deps
:
clj-commons/clj-yaml {:mvn/version "1.0.27"}
And add this code to make a markdown file with yml:
(require '[clj-yaml.core :as yaml])
(defonce foods
(->> '{:find [(pull ?e [:description
{:_fdc_id [:amount
{:nutrient_id [:xt/id
:name
:unit_name]}]}])]
:where [[?e :data_type "foundation_food"]]}
(xt/q (xt/db node))
(map first)))
(defonce nutrients #{"1003" "1004" "1005" "1008" "1063" "1079"})
(defn to-markdown-yml [food]
(str "---\n"
(yaml/generate-string
(->> (:_fdc_id food)
(filter #(-> % :nutrient_id :xt/id nutrients))
(map (fn [{:keys [amount] {:keys [name unit_name]} :nutrient_id}]
[(format "%s (%s)" name unit_name) (parse-double amount)]))
sort
(into {}))
:dumper-options {:flow-style :block})
"---\n"))
(println (->> foods
first
to-markdown-yml))
This should print:
;; ---
;; Carbohydrate, by difference (G): 4.91
;; Energy (KCAL): 50.0
;; Protein (G): 3.35
;; Sugars, Total (G): 4.89
;; Total lipid (fat) (G): 1.9
;; ---
Which looks about right, considering the ;;
are just from the output format.
Now we write all these markdown files to disk.
(require '[fdb.utils :as u])
(defonce food-folder (u/sibling-path (:self-path (call/arg)) "foods/"))
(fs/create-dirs food-folder)
(defn write-to-md
[{:keys [description] :as food}]
(spit (str food-folder "/" (u/filename-str description) ".md")
(to-markdown-yml food)))
(run! write-to-md foods)
TODO: not all these foods have energy, but they seem to have protein/fat/carbs, maybe add a step to verify? think I need a fallback to the atwater ones
TODO: lean into obsidian, put something that can be used as vault in demo, but work over md to allow everyone to follow TODO: put in md, reference, trigger compute TODO: u/file-replace, for regex replacement on files TODO: some cool datalog queries for nutrition data, like foods with low carbs or high protein (hard without md reader)