This is an open-source dataset provided by Instacart (source).
This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the day of the week and hour of day the order was placed, and a relative measure of time between orders.
Below is the full data schema:
orders (3.4m rows, 206k users):
- order_id: order identifier
- user_id: customer identifier
- eval_set: which evaluation set this order belongs in (see SET described below)
- order_number: the order sequence number for this user (1 = first, n = nth)
- order_dow: the day of the week the order was placed on
- order_hour_of_day: the hour of the day the order was placed on
- days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)
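The missing days_since_prior value on a user's first order is easy to mishandle when loading the orders table. The following is a minimal sketch of one way to parse it with the Python standard library. The sample rows are made up for illustration, and the assumption that a missing value appears as an empty field or the literal NA is ours, not something the schema states.

```python
import csv
import io
from typing import Dict, List, Optional

# Made-up sample rows that follow the orders schema above (not real data).
SAMPLE_ORDERS_CSV = """\
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior
101,1,prior,1,2,8,
102,1,prior,2,3,7,15.0
103,1,train,3,3,12,30.0
"""


def parse_days_since_prior(raw: str) -> Optional[float]:
    """Return None for a missing value (assumed to be '' or 'NA'), else the day count."""
    raw = raw.strip()
    if raw == "" or raw.upper() == "NA":
        return None
    return float(raw)


def load_orders(text: str) -> List[Dict[str, object]]:
    """Parse CSV text into typed order records."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "order_id": int(rec["order_id"]),
            "user_id": int(rec["user_id"]),
            "eval_set": rec["eval_set"],
            "order_number": int(rec["order_number"]),
            "order_dow": int(rec["order_dow"]),
            "order_hour_of_day": int(rec["order_hour_of_day"]),
            "days_since_prior": parse_days_since_prior(rec["days_since_prior"]),
        })
    return rows


orders = load_orders(SAMPLE_ORDERS_CSV)

# First orders are the ones with no prior-order gap recorded.
first_orders = [o for o in orders if o["days_since_prior"] is None]
```

Keeping the missing value as None rather than 0 avoids confusing "no previous order" with "reordered the same day".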
products (50k rows):
- product_id: product identifier
- product_name: name of the product
- aisle_id: foreign key
- department_id: foreign key
aisles (134 rows):
- aisle_id: aisle identifier
- aisle: the name of the aisle
departments (21 rows):
- department_id: department identifier
- department: the name of the department
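The aisle_id and department_id foreign keys in products resolve to names through the aisles and departments tables. As a rough sketch of that lookup, the example below builds the three tables in an in-memory SQLite database and joins them. The rows are invented for illustration, and the column types are an assumption, since the schema above does not specify them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types are assumed. REFERENCES documents the relationship;
# SQLite only enforces it when PRAGMA foreign_keys is turned on.
conn.executescript("""
CREATE TABLE aisles (aisle_id INTEGER PRIMARY KEY, aisle TEXT);
CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department TEXT);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    aisle_id INTEGER REFERENCES aisles(aisle_id),
    department_id INTEGER REFERENCES departments(department_id)
);
""")

# Invented sample rows, not taken from the real dataset.
conn.executemany("INSERT INTO aisles VALUES (?, ?)",
                 [(1, "sample aisle A"), (2, "sample aisle B")])
conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [(1, "sample dept X"), (2, "sample dept Y")])
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)",
                 [(10, "Sample Product 1", 1, 1),
                  (11, "Sample Product 2", 2, 2)])

# Resolve both foreign keys to human-readable names.
rows = conn.execute("""
    SELECT p.product_name, a.aisle, d.department
    FROM products AS p
    JOIN aisles AS a ON a.aisle_id = p.aisle_id
    JOIN departments AS d ON d.department_id = p.department_id
    ORDER BY p.product_id
""").fetchall()
```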
order_products__SET (30m+ rows):
- order_id: foreign key
- product_id: foreign key
- add_to_cart_order: order in which each product was added to cart
- reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the following three evaluation sets (eval_set in orders):
- "prior": orders prior to that user's most recent order (~3.2m orders)
- "train": training data supplied to participants (~131k orders)
- "test": test data reserved for machine learning competitions (~75k orders)