RFC: Structured metadata #266

Closed
bakkot opened this issue Aug 31, 2022 · 85 comments
Labels
enhancement New feature or request

Comments

@bakkot
Contributor

bakkot commented Aug 31, 2022

I have a proposal for a spec for metadata, laying out goals and a formal spec.

I'm happy to implement this if there's buy-in.

Thoughts?


RFC: Structured metadata

Currently when generating images from the CLI (but not the web), metadata for that is stored as a string kind-of corresponding to the prompt. That metadata is enough to reproduce the original image... sometimes.

I'd like to:

  • be more precise about the metadata which gets stored
  • allow reproducing any output just from the metadata and necessary input files
    • necessary input files meaning the model weights, the image for img2img, and the embeddings if using embeddings
    • metadata should allow you to confirm that you have the right inputs, by storing hashes of all of those files
    • "any" output includes outputs from seed fuzzing and interpolations (which I haven't written yet, in part because I wanted to work out the metadata format first)
  • store it in a structured format, namely JSON
  • expand the metadata so it works with grids
  • expand the metadata so it works with stuff like variations and interpolations

To that end, I'd like to propose the following spec for metadata.

In this doc, "hash" means "the first 8 characters of the hex-encoded sha-256".
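
As a rough illustration of that definition, here is a minimal Python sketch. The function names `short_hash` and `short_hash_file` are made up for this example and are not part of the proposal; the second one streams a file in chunks, on the assumption that weights files are too large to read into memory at once.

```python
import hashlib


def short_hash(data: bytes) -> str:
    """First 8 characters of the hex-encoded SHA-256 of `data`."""
    return hashlib.sha256(data).hexdigest()[:8]


def short_hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Same digest, computed by streaming the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:8]
```

Note that 8 hex characters is only 32 bits, so this is enough to catch an accidental mismatch but not to resist a deliberate collision.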

Data location

Metadata is a JSON string following the "top-level data" schema, stored in an uncompressed PNG tEXt or iTXt chunk named "sd-metadata". (This corresponds to what PIL does already when adding text data - it will choose tEXt or iTXt depending on whether it contains non-latin-1 characters. I just figure it's worth writing this down.)
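
To make the chunk layout concrete, here is a standard-library-only sketch of writing and reading such a chunk. In practice PIL would do this for you; the helper names (`embed_metadata`, `read_metadata`, `tiny_png`) are invented for this example, and placing the chunk directly after IHDR is an assumption of the sketch rather than something the proposal requires.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_END = len(PNG_SIGNATURE) + 4 + 4 + 13 + 4  # IHDR is the first chunk, 13 data bytes


def make_chunk(kind: bytes, payload: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_metadata(png: bytes, metadata: dict, key: str = "sd-metadata") -> bytes:
    """Insert the metadata after IHDR as an uncompressed tEXt or iTXt chunk."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    text = json.dumps(metadata, ensure_ascii=False)
    keyword = key.encode("latin-1")
    try:
        # tEXt layout: keyword, NUL, Latin-1 text
        chunk = make_chunk(b"tEXt", keyword + b"\x00" + text.encode("latin-1"))
    except UnicodeEncodeError:
        # iTXt layout: keyword, NUL, compression flag (0), compression method (0),
        # empty language tag + NUL, empty translated keyword + NUL, UTF-8 text
        chunk = make_chunk(b"iTXt", keyword + b"\x00" * 5 + text.encode("utf-8"))
    return png[:IHDR_END] + chunk + png[IHDR_END:]


def read_metadata(png: bytes, key: str = "sd-metadata"):
    """Return the decoded metadata dict, or None if no matching chunk exists."""
    for kind, data in iter_chunks(png):
        if kind not in (b"tEXt", b"iTXt"):
            continue
        keyword, _, rest = data.partition(b"\x00")
        if keyword.decode("latin-1") != key:
            continue
        if kind == b"tEXt":
            return json.loads(rest.decode("latin-1"))
        if rest[0] != 0:
            raise ValueError("compressed iTXt is outside this sketch")
        _language, _, rest = rest[2:].partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        return json.loads(text.decode("utf-8"))
    return None


def tiny_png() -> bytes:
    """A 1x1 grayscale PNG, only so the example has something to write into."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))
```

A prompt containing only Latin-1 characters ends up in a tEXt chunk; anything else (an emoji, say) falls through to iTXt, and `read_metadata(embed_metadata(tiny_png(), meta))` should round-trip either way.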

Top-level data

The top-level metadata should have the following fields:

  • model: "stable diffusion"
  • model_id: string identifying the model. Must be the model_id field of a Model card. Optional; there is no default value, but consuming applications may infer a value from model_hash if they recognize that value.
  • model_url: a string giving a URL where the model can be downloaded (if public) or read about (if not). Optional, does not have a default.
  • model_hash: hash of the weights [precise format TBD depending on implementation feasibility]; see the "model information" section below
  • app_id: a string identifying the application consuming the model. It is recommended, but not required, that applications hosted on GitHub use the username/repo_name of the repository in this field; for example, the fork we're on would use lstein/stable-diffusion.
  • app_version: a string giving the version of the app from app_id. It is recommended, but not required, that projects with numbered versions use a string of the form v1.0, and that projects built from git repos use the short-form git hash of the commit. Optional, defaults to "unknown".
  • app_url: a string giving the canonical location of the application on the web. Optional, does not have a default.
  • embeddings_hashes: an array of the hashes of any textual-inversion embeddings in use. Optional, defaults to an empty array.
  • arch: "cuda", "MPS", or another helpful value indicating the GPU architecture. Optional, defaults to "unknown".
  • grid: a boolean, whether this was a grid. Optional, defaults to false.
  • metadata_version: the string "1.0". Optional, defaults to "1.0". Breaking changes to this metadata format should update this field.

and then also one of the following two fields, depending on whether this is a grid:

  • image: an object in one of the formats specified below
  • images: an array of such objects
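
One way a consuming application might apply the top-level defaults is sketched below. Treating `model`, `model_hash` and `app_id` as the required fields is my reading of the list above (they are the ones not marked optional), and `normalize_top_level` and `images_of` are hypothetical names, not part of the spec.

```python
import copy

# Defaults exactly as listed in the spec; fields without a default are left absent.
TOP_LEVEL_DEFAULTS = {
    "app_version": "unknown",
    "embeddings_hashes": [],
    "arch": "unknown",
    "grid": False,
    "metadata_version": "1.0",
}

# Assumption: every field not marked "Optional" is required.
REQUIRED_TOP_LEVEL = ("model", "model_hash", "app_id")


def normalize_top_level(metadata: dict) -> dict:
    """Return a copy of `metadata` with omitted optional fields filled in."""
    missing = [name for name in REQUIRED_TOP_LEVEL if name not in metadata]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if metadata["model"] != "stable diffusion":
        raise ValueError("model must be 'stable diffusion'")
    result = {**copy.deepcopy(TOP_LEVEL_DEFAULTS), **metadata}
    if result["metadata_version"] != "1.0":
        raise ValueError("unsupported metadata_version")
    # Exactly one of image / images, chosen by the grid flag.
    wanted, other = ("images", "image") if result["grid"] else ("image", "images")
    if wanted not in result or other in result:
        raise ValueError(f"expected '{wanted}' and not '{other}' when grid={result['grid']}")
    return result


def images_of(metadata: dict) -> list:
    """List the per-image objects regardless of whether this is a grid."""
    normalized = normalize_top_level(metadata)
    return list(normalized["images"]) if normalized["grid"] else [normalized["image"]]
```

Keeping the defaults in one table is what lets every implementation agree on what an omitted field means.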

Image data

Every image has the following fields:

  • type: either "txt2img" or "img2img"
  • postprocessing: either null, indicating no postprocessing was done, or an arbitrary object representing the postprocessing performed. Spec for this will depend on individual postprocessors, but I'll write something up for the ones we support. Optional, defaults to null.
  • sampler: one of these samplers
  • prompt: a nonempty array of { prompt: string, weight: number } pairs. The single-prompt case is [{ prompt: prompt, weight: 1 }]
  • seed: a seed
  • variations: an array of { seed: number, weight: number } pairs used to generate variations. Optional, defaults to an empty array.
  • steps: the number of steps configured to be taken
  • cfg_scale: the unconditional guidance scale
  • step_number: the number of steps actually taken. Normally this will be the full number of steps, but for intermediate images it may be less. Optional, defaults to steps (or strength_steps in the case of img2img).
  • width: the specified width (as a number of pixels). Optional only when this metadata is embedded in an image whose width is the same as this value would be, in which case it defaults to that image's width.
  • height: the specified height (as a number of pixels). Optional only when this metadata is embedded in an image whose height is the same as this value would be, in which case it defaults to that image's height.
  • extra: an object containing any necessary additional information to generate this image. Not to be used for other data, like contact information. Optional, defaults to the empty object.

Images of type img2img also have the following fields:

  • orig_hash: hash of the input image
  • strength_steps: the configured strength for running img2img (as an integer; as discussed here, that's what it actually is).

Height/width may be omitted only when the metadata is embedded in an image file, since in that case you can infer them from the file.
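
The per-image defaults are slightly conditional (step_number depends on the image type, width/height depend on whether there is a containing image), so here is a hedged sketch of how a reader might resolve them. The function name `normalize_image` and the `embedded_size` parameter are assumptions made for this example.

```python
# Assumption: fields not marked "Optional" in the list above are required.
REQUIRED_IMAGE_FIELDS = ("type", "sampler", "prompt", "seed", "steps", "cfg_scale")


def normalize_image(image: dict, embedded_size=None) -> dict:
    """Fill in omitted optional fields of one image object.

    `embedded_size` is the (width, height) of the PNG the metadata was read
    from, or None when the metadata is being used on its own.
    """
    missing = [name for name in REQUIRED_IMAGE_FIELDS if name not in image]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if image["type"] not in ("txt2img", "img2img"):
        raise ValueError("type must be 'txt2img' or 'img2img'")
    if not image["prompt"]:
        raise ValueError("prompt must be a nonempty array")

    result = dict(image)
    result.setdefault("postprocessing", None)
    result.setdefault("variations", [])
    result.setdefault("extra", {})

    if result["type"] == "img2img":
        for name in ("orig_hash", "strength_steps"):
            if name not in result:
                raise ValueError(f"img2img images need '{name}'")
        # For img2img, the default step count is strength_steps.
        result.setdefault("step_number", result["strength_steps"])
    else:
        result.setdefault("step_number", result["steps"])

    # width/height may only be omitted when a containing image supplies them.
    for index, axis in enumerate(("width", "height")):
        if axis not in result:
            if embedded_size is None:
                raise ValueError(f"'{axis}' is required outside an image file")
            result[axis] = embedded_size[index]
    return result
```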

Thoughts on storing the model information

I am proposing to store a hash of the loaded model, which is a lot faster than reading the file from disk a second time, but it means the hash won't correspond to the file on disk. Better than nothing, though.

Is it worth also storing a hash of the model config? I don't think so, since you're always going to need the original config for a given model weights file.

@psychedelicious
Collaborator

Perhaps fork could be a commit reference rather than a repo reference - if a repo changes its implementation, you may not be able to recreate the output.

@bakkot
Contributor Author

bakkot commented Sep 1, 2022

I'd love to include the current commit, but unfortunately a lot of users get the source by downloading the zip file from github, which does not include the hash anywhere, as far as I can tell (in particular it doesn't include the .git directory).

I guess it could be an optional field to be included if the code generating the metadata can reasonably figure it out, though.
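
For what "reasonably figure it out" could look like, a minimal sketch: ask git for the short hash and give up quietly if that fails, which is what happens for a zip download with no .git directory or a machine without git. The function name `current_commit` is made up for this example.

```python
import subprocess
from typing import Optional


def current_commit(repo_dir: str = ".") -> Optional[str]:
    """Short git hash of HEAD in `repo_dir`, or None if it cannot be determined."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,   # non-zero exit (e.g. not a repository) raises
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        # git missing, directory missing, not a repository, or timed out
        return None
    return completed.stdout.strip() or None
```

One caveat: git searches parent directories for a repository, so an unpacked zip sitting inside some other checkout would report that checkout's commit.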

@fat-tire
Contributor

fat-tire commented Sep 1, 2022

Thought: Could this be coordinated with the URI definition standard request I proposed here? So that attribute names match, etc.? And how might they work together? Would a URI point to the image and a second argument produce this structured metadata?

@fat-tire
Contributor

fat-tire commented Sep 1, 2022

Also, fwiw, the single-file, cross-platform PyQt5-based SD GUI I was contributing to already saves its settings in json, but I was realizing that the json settings actually define the image itself in a way.... Dunno if this is something to build on for a proof-of-concept as it's super easy to add to.

[Image: horses2]

@bakkot
Contributor Author

bakkot commented Sep 1, 2022

@fat-tire URIs are inherently somewhat unsuited for structured data, and a full specification for a SD image (when you take into account stuff like variations and grids) is inherently structured. So if you want to do a URI, I think it would be best to have only a single key-value pair, where the value is JSON in the format of the spec proposed here. Then you don't need to try to coordinate two different formats for this specification.
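
A sketch of that single key-value idea, so it's clear how little machinery it needs. The `sd-image` scheme and the `metadata` key are placeholders invented for this example; nothing in the thread settles on either name.

```python
import json
from urllib.parse import parse_qs, urlencode


def to_uri(metadata: dict, scheme: str = "sd-image") -> str:
    """Wrap the whole JSON document in one percent-encoded query parameter."""
    payload = json.dumps(metadata, separators=(",", ":"))  # compact form
    return f"{scheme}:?{urlencode({'metadata': payload})}"


def from_uri(uri: str) -> dict:
    """Inverse of to_uri: pull the single parameter back out and parse it."""
    _, _, query = uri.partition("?")
    return json.loads(parse_qs(query)["metadata"][0])
```

The URI form is purely a transport wrapper here: the field names and structure come from the JSON spec, so there is only one format to keep in sync.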

@psychedelicious
Collaborator

I'd love to include the current commit, but unfortunately a lot of users get the source by downloading the zip file from github, which does not include the hash anywhere, as far as I can tell (in particular it doesn't include the .git directory).

I guess it could be an optional field to be included if the code generating the metadata can reasonably figure it out, though.

Ah, right, didn't think about that. Optional field sounds good - and now that I think about it, if we are including a commit reference, we ought to include a branch reference as well (or does a commit imply a specific branch? I don't know).

@fat-tire
Contributor

fat-tire commented Sep 1, 2022

@fat-tire URIs are inherently somewhat unsuited for structured data, and a full specification for a SD image (when you take into account stuff like variations and grids) is inherently structured. So if you want to do a URI, I think it would be best to have only a single key-value pair, where the value is JSON in the format of the spec proposed here. Then you don't need to try to coordinate two different formats for this specification.

This is good, except it makes the URI super long (is the json value in your keypair further encoded/compressed in some way?) and is effectively a wrapper around the json. I'm wondering if there's an abbreviated but human-readable and easily-edited format that could be used to reference an image(s) resource?

Like imagine an image browser for SD-- if the complete json would show up in View Source, what would go in the URL/Address bar? On a reddit post featuring some cool approach to generating images, you wouldn't attach a json file, but what if you could put a short self-contained link that could be copy/pasted or tweaked by hand by any non-technical person-- what would it ideally look like? Or say I wanted a single text file or google doc full of accumulated copy/pasted reference images, say, for some art project-- what's the smallest, one-line-per-image (for example) way to collect them? What might fit in a small QR code?

These are the types of use cases I'm imagining. I feel like a structured json file, even though it was designed to be interpreted by humans, may be too large and unreadable for non-programmers to easily understand and make changes simply and quickly and that a compressed URI with &param=value and sane defaults is more familiar. Maybe I'm talkin' crazy tho, dunno.

@bakkot
Contributor Author

bakkot commented Sep 1, 2022

Eh, they don't get that long unless you have a lot of data, and then, well, it is actually long. The data isn't compressed but there's simply not that much of it.

If you're really worried about length we could make some of the fields optional and specify default values - e.g. variations is assumed to be empty if omitted, etc. If we do that you end up with something like

{"ai-type":"stable diffusion","fork":"lstein","weights-hash":123456,"image":{"type":"txt2img","sampler":"k_euler","prompt":"the whole prompt goes here","seed":1234567890,"steps":50,"cfg_scale":0.7}}

which is pretty much as human-readable as a URL would be. More, arguably. There's not really a lot of overhead from the JSON format itself relative to URI-style k=v&k2=v2 style, just a few extra quotation marks (but fewer characters wasted on %20). And this really is the minimum information you need to unambiguously refer to a specific image.

I think copy-pasting things like the above is at least as easy as copy-pasting a URI, and has the benefit of not confusing people; URIs look like you should be able to point your browser at them, and that's not the case here. Plus URI encoding for spaces - which will come up a lot, because every prompt has spaces in it - is a huge pain for humans to deal with.

@fat-tire
Contributor

fat-tire commented Sep 2, 2022

What you've said makes sense to me in principle, especially w/regard to %20 throwing people, but yes, sensible fallback defaults for omitted fields should be part of this standard to make it compact and easy for normal use by non-techies. However, as I think more about this, any defaults would need to be standard to a particular scenario-- i.e., they'd have to be "known" so that all apps supporting this metadata structure fall back to the same defaults, in the same way. But how would everyone know that, say, with the v4 model, "512x512 is the agreed-upon size default, so always assume that"? Is there some rule, like "the default size for any model is always the shape of the training images"? That seems too rigid. Who decides the defaults, and where would "sensible defaults" be published so that anyone writing an application would know where to find them?

Along the lines of defaults, I wonder if some fields should be deemed "required" vs "optional" or something-- the required ones would contain the bare minimum needed to generate an image-- people can then build out from there with greater specificity.

Also, you mention that height/width could be determined from the image (I assume, as this json is metadata embedded within an image) or when offered alongside it-- but if the json metadata alone is defined as sufficient to produce an image from scratch, I'd think you need to include width/height in the metadata. ie, don't assume anything should be inferred from the image, as an image may not be included or may have been resized or screenshotted or whatever (I acknowledge, I've expanded the use case beyond metadata in the file as originally conceived, but if this is a format also meant for copy/pasting in forums or via other human-means, it needs to be complete)

Also-- aside from the hash of the weights file, don't you want to have something to indicate about which version of a model is to be (or was) used, which at the very least would be helpful in providing feedback to the end user when they don't match or when a model is missing? Example-- SD is about to release v1-5 right? The model card already contains this metadata in the form of the model_id, such as "CompVis/stable-diffusion-v1-5"

Other random unformed ideas/thoughts:

  • "Ai-Type"- maybe "AI-Family"? "type" makes me think of data types or model types
  • "fork" -- maybe "variant"? Perhaps always using the model card model_id (if it exists) might be more consistent (example: "CompVis/stable-diffusion-v1-4" and even "lstein/stable-diffusion-v1.5" if those are meant to be used together). For that matter, would branch also be useful as an option-- some experimental branch for example might have some special property or function that isn't on main...
  • Not sure if "fork" should always refer to a fork of the code (CompVis vs lstein) or a fork of a model, such as a retrained model which is almost certain to emerge (stable-diffusion-v1-5+-with-extra-nsfw). Perhaps this can be broken into variant and model_id.
  • Timestamp? Since this won't be able to be derived from an image when that image doesn't exist, it might be nice to have a time/date for organizing or even versioning different takes on the same image. This also may or may not match the time of the created image.
  • contact_info: string or object
  • image_name: string (preferred name to be created-- if the referenced image doesn't already exist)
  • When grid=True, is it the grid image that has this metadata? Do the individual output images contain the tEXt chunks as well? Does the grid image contain references to the names of the child images in /samples, and do the child images know about which grid image they belong to?
  • grid_name : string (if this image is part of a grid- ie, which numbered image is it?)
  • Description: string? Keywords for searching through images apart from what's in the prompt?
  • Citation: string? The model card for sd contains additional data about sd itself that may (or may not) be appropriate to include in an SD-created image. Maybe it doesn't belong there, but I wanted to at least offer it for consideration
  • Licensing info? either a link to the model card or a string with whatever license might be relevant.
  • Notes :string ? -- perhaps for the artist/developer who created it? "I tried to emphasize french expressionists to bring out X,Y,Z. Feel free to ask questions on twitter at.."
  • NSFW: boolean- again, probably doesn't belong here, but since it's emphasized so strongly in the sd repository and enabled by default, this flag might make sense.

Sorry there's a lot here and apologies if some of these don't quite relate to the lstein fork-- I'm only just getting familiar with the main repo. I don't wanna get too big and unwieldy either, but I think some basics like H&W and model name/version have to be there if it's going to be used as a simple text-based generate-from-scratch "trading card", useful for learning, sharing, research, etc.

And I guess once everything is hammered out, if this turns out to be too heavy a way to do this, someone can always "URI-ize" it, especially once the field names are deemed stable and a good standard.

Anyway, this is pretty exciting stuff! Thanks.

ft

@bakkot
Contributor Author

bakkot commented Sep 2, 2022

Along the lines of defaults, I wonder if some fields should be deemed "required" vs "optional" or something-- the required ones would contain the bare minimum needed to generate an image-- people can then build out from there with greater specificity.

Yeah, I was imagining only certain fields could be omitted. As to how people would know how to interpret missing fields, we'd write it down. So for example you could omit grid, and that would be defined to mean false.

Also, you mention that height/width could be determined from the image (I assume, as this json is metadata embedded within an image) or when offered alongside it-- but if the json metadata alone is defined as sufficient to produce an image from scratch, I'd think you need to include width/height in the metadata

Yeah that's a good point. I'll update the OP to add those as fields which are optional only when the metadata is embedded in an image file from which it is possible to derive those values.

Also-- aside from the hash of the weights file, don't you want to have something to indicate about which version of a model is to be (or was) used, which at the very least would be helpful in providing feedback to the end user when they don't match or when a model is missing?

The hash of the weights file is sufficient to uniquely determine the model.

For the sake of giving users helpful feedback, rather than just "the weights file you have doesn't match", it would be kind of nice to also have a version to report, but we don't necessarily know that - for example, the installation instructions for this repo suggest putting the weights at models/ldm/stable-diffusion-v1/model.ckpt. There's nothing there to indicate what version of the model is in use, as far as I'm aware.

That said, I do think that having a list of known hashes would be helpful. That wouldn't be part of the metadata spec per se, though it might live along side it, and tools could hardcode that list to give more useful feedback when they see a hash they know.
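
To show the kind of feedback such a list would enable, a small sketch follows. The table entries are placeholders (made-up hashes and names, not real model hashes), and `explain_model_mismatch` is a hypothetical helper.

```python
from typing import Mapping, Optional

# Placeholder entries only: these hashes and names are invented for illustration.
KNOWN_MODELS = {
    "00000000": "example-org/example-model-v1",
    "11111111": "example-org/example-model-v2",
}


def explain_model_mismatch(
    expected: str, actual: str, known: Mapping[str, str] = KNOWN_MODELS
) -> Optional[str]:
    """Return a human-readable message if the hashes differ, else None."""
    if expected == actual:
        return None
    wanted = known.get(expected, f"an unrecognized model (hash {expected})")
    have = known.get(actual, f"an unrecognized model (hash {actual})")
    return f"This image was generated with {wanted}, but the loaded weights are {have}."
```

A tool that doesn't recognize either hash still degrades gracefully to "the weights file you have doesn't match", just with the hashes included.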

"fork" -- maybe "variant"?

I'm fine with either name. I don't just want to use the model card ID because the point of this field is to distinguish this repo from others. Including the commit hash or branch name would be nice, but as discussed in the comments above it's often not possible.

When grid=True, is it the grid image that has this metadata?

Yes.

Do the individual output images contain the tEXt chunks as well?

At least in this repo, there are not any individual output images. There's just the grid.

Does the grid image contain references to the names of the child images in /samples, and do the child images know about which grid image they belong to?

Per above answer, there are no such images. But if there were, the answer to both of these would be no.

Timestamp, contact_info, image_name , Description, Citation, Licensing, NSFW

For all of this stuff, I don't think it belongs in this spec - this is a specification specifically for the metadata for images to tell you the settings used to generate the image. If you want to include other information alongside the image, put it somewhere else. E.g., wrap the data from this spec: so { contact_info: 'whatever', name: 'whatever', metadata: { [this spec] } }.

That way we don't have to keep a registry of additional optional fields, which in my experience never works, and all of the fields in this spec can be automatically derived, which is important.

@fat-tire
Contributor

fat-tire commented Sep 2, 2022

Yeah, I was imagining only certain fields could be omitted. As to how people would know how to interpret missing fields, we'd write it down. So for example you could omit grid, and that would be defined to mean false.

That makes sense, so long as everyone who is implementing this standard can agree on what the fallback defaults are.

Also-- aside from the hash of the weights file, don't you want to have something to indicate about which version of a model is to be (or was) used, which at the very least would be helpful in providing feedback to the end user when they don't match or when a model is missing?

The hash of the weights file is sufficient to uniquely determine the model.

Yeah, but it's only practical for confirming that the model you have is valid, which can be determined via other methods (usually whoever provides the model in the first place will offer a hash). Wouldn't it make more sense to name the model, version, and creator (the model_id from the card)? I mean, that's what it's for, it would make it easy to find the model or even the latest version of the model, and you wouldn't need to have to maintain a giant table of hashes and their names, versions, and sources.

That said, I do think that having a list of known hashes would be helpful. That wouldn't be part of the metadata spec per se, though it might live along side it, and tools could hardcode that list to give more useful feedback when they see a hash they know.

To me, that seems backwards-- the metadata should point to the model used, not the hash of the model- then you don't need any table that has to be maintained and updated (see below re generic model.ckpt). Validating the file's integrity seems outside the scope of what you would expect-- a regular image file doesn't offer the hash of the binary of photoshop that created it-- though it may have a string that indicates the tool and version used..

At the very least I'd expect a model_id should be included in addition to the model hash (for verifying integrity, I guess). I believe that's what the model_id is specifically intended for-- as a unique id meant to identify the model's name, origin, and version.

For the sake of giving users helpful feedback, rather than just "the weights file you have doesn't match", it would be kind of nice to also have a version to report, but we don't necessarily know that - for example, the installation instructions for this repo suggest putting the weights at models/ldm/stable-diffusion-v1/model.ckpt. There's nothing there to indicate what version of the model is in use, as far as I'm aware.

That's why I'm suggesting using model_id to indicate the version of the model. It seems to me to be a responsibility of the sd application implementing this spec to know what model it's using (by name), not just offer a hash of whatever's there and say "good luck figuring out what this actually is!". Sure, maybe you can use the hash as a checksum to verify you have the right model once you know which model to use, when you're about to recreate the image.

To me, only providing the model hash is a bit like saying "to bake this cake, go to the store and buy the one ingredient they have in a red box that weighs 12 oz and costs exactly $13.54". Okay, I guess if I had a list of the store's inventory with associated prices I could find it, but it would have to be an always maintained, up-to-date list. Why not just tell me the exact ingredient I need, so I can ask for it directly? (Yes, as a safety, I can verify the product with the price and weight, but that's not how I want to look for it.) And going back to the error message for the user, it's a lot clearer to say "To bake this cake you need a 12oz bag of Whitman's Quality Flour." vs "Sorry, you are missing a product in a 12oz red box that costs $13.54." and cross-check that with a hopefully current list of all possible ingredients.

"fork" -- maybe "variant"?

I'm fine with either name. I don't just want to use the model card ID because the point of this field is to distinguish this repo from others. Including the commit hash or branch name would be nice, but as discussed in the comments above it's often not possible.

But wouldn't a variant called lstein/stable-diffusion do exactly that? That is, it distinguishes this repo from fat-tire/stable-diffusion or others-it tells you where to get it, and what it is-- or maybe I'm missing something?

Do the individual output images contain the tEXt chunks as well?

At least in this repo, there are not any individual output images. There's just the grid.

I've not used lstein, but upstream when you create a grid with say, 4 images, you also specify the number of rows (which has a default) and you get 5 images back-- a "grid" image containing the four images arranged in, well, a grid, and the 4 individual images in the /samples folder contained within the outputs folder. So I think of the grid as more of a "preview" image, and if you want any image independently you can grab it from the /samples. It would be nice at one point to have the grid do smaller versions of the originals-- right now they are full-sized.

Does the grid image contain references to the names of the child images in /samples, and do the child images know about which grid image they belong to?

Per above answer, there are no such images. But if there were, the answer to both of these would be no.

Okay- sounds like someone took out the child images in the lstein fork... fwiw, if it's to be compatible with the upstream repository- might want the child images to look like any other generated individual image. Having an --n_iter of 1 and grid of true I believe will give you BOTH a grid of one image and a child image in /samples, but don't hold me to that.

Timestamp, contact_info, image_name , Description, Citation, Licensing, NSFW

For all of this stuff, I don't think it belongs in this spec - this is a specification specifically for the metadata for images to tell you the settings used to generate the image. If you want to include other information alongside the image, put it somewhere else. E.g., wrap the data from this spec: so { contact_info: 'whatever', name: 'whatever', metadata: { [this spec] } }.

That way we don't have to keep a registry of additional optional fields, which in my experience never works, and all of the fields in this spec can be automatically derived, which is important.

Two more thoughts then, for possible expansion at a later time:

spec_version : int or string -- a way to identify version 1.0 of this spec to a version 10 years from now, for future backward compatibility
future : object -- a place to attach more data that might become important later or to be used for "whatever" someone wants to stick in there.

@bakkot
Contributor Author

bakkot commented Sep 2, 2022

Wouldn't it make more sense to name the model, version, and creator (the model_id from the card)?

That information simply isn't available for most users, is the problem. Users just download a weights file and use it. So the hash of the weights file is literally the only information we can use here.

But wouldn't a variant called lstein/stable-diffusion do exactly that?

Yes, like I said I'm fine with calling this field "variant" instead of "fork". If you are asking for some other difference from what I've proposed, I don't know what it is you're asking for.

(We can't use the commit because that information often isn't available; I'd include it if I could, and I think I probably will add an optional field to store that.)

spec_version

Interesting thought. I'm fine with it because I'm not worried about extra space, but if we're trying to minimize fields it's not strictly necessary - we could just say that any new versions will add a new field. (I'd call it metadata_version, though.)

future: object a place to attach more data that might become important later or to be used for "whatever" someone wants to stick in there

I actually just modified the spec to include an extra field! But it's explicitly not supposed to be used for arbitrary other data, only things necessary to reproduce the image; I do feel that data not related to that goal doesn't belong in this metadata and should live somewhere else.

@fat-tire
Contributor

fat-tire commented Sep 2, 2022

Wouldn't it make more sense to name the model, version, and creator (the model_id from the card)?

That information simply isn't available for most users, is the problem. Users just download a weights file and use it. So the hash of the weights file is literally the only information we can use here.

For that use case, a hash makes sense if you somehow need to reference an otherwise unknown model file. But looking forward-- there is a near-universal, industry-standard way of uniquely identifying a specific model, adopted across all disciplines of AI-- the model_id. To not include this string, even as an optional field, would be a major omission, IMO.

Any retrained model stemming from a single architecture would have its own name in most cases. If for some reason it doesn't, then fall back to the hash.

Also, when you say "users just download a weights file and use it"... how so? Presently, people are sophisticated enough to download stable-diffusion or a GAN or whatever but somehow have no idea what model they are using or where it originated? I get that right now sd has a single place to put a model named model.ckpt or whatever, but that model got put there by someone who had to understand where it came from and what the license was, etc. Using unknown binaries (even as a model) isn't a very good idea, generally speaking.

But wouldn't a variant called lstein/stable-diffusion do exactly that?

Yes, like I said I'm fine with calling this field "variant" instead of "fork". If you are asking for some other difference from what I've proposed, I don't know what it is you're asking for.

I'd suggest calling it model_id to be consistent with the customary nomenclature and to know you are referring to the model and not the codebase. And define it to correspond to a model_id from a model card. Make it optional, in case this info is unknown, in which case the hash would be a secondary way to try to identify it and a primary way to validate that the hash is the expected one.

For a specific codebase, maybe use variant or something more helpful like a pointer to where the application or its source can be found

This schema needs to account for both the codebase and the model, both of which have (a) a name, (b) a source/author, (c) a version/tag

Additionally, both can have hashes/signatures and I guess you want to track it for the model as you're accounting for scenarios in which a model's (a), (b), and (c) are all unknown but you still want to build the image from "scratch"

how about:

model_id -- string -- from the model card (generally contains a/b/c)
model_hash -- to satisfy your use case of the unknown weights file

app_id (formerly app_variant) -- string -- this can be a description "lstein 1.2" or whatever human readable identifier for the code that you need to use the model
app_uri -- the remote repository used to build this. Could contain a URL which would include a/b/c, including the branch name. The same type of URL used, say, with git clone-- could be git:// https:// cvs:// svn:// file:// etc. If it's a closed-source program, this can link to a .zip file, an .exe, a .deb, an installer, a home page, a finger://, a deep link for mobile apps, or a web front end for a cloud service at huggingface, etc., or a wasm application.

(We can't use the commit because that information often isn't available; I'd include it if I could, and I think I probably will add an optional field to store that.)

Well a commit is only a single change to the repository, and a single commit might be on multiple branches anyway. But a tagged branch is usually not meant to change-- adding branch or tag as an optional field might make sense.

spec_version

Interesting thought. I'm fine with it because I'm not worried about extra space, but if we're trying to minimize fields it's not strictly necessary - we could just say that any new versions will add a new field. (I'd call it metadata_version, though.)

Yeah I didn't mean literally spec_version as that's super ambiguous.

future: object a place to attach more data that might become important later or to be used for "whatever" someone wants to stick in there

I actually just modified the spec to include an extra field! But it's explicitly not supposed to be used for arbitrary other data, only things necessary to reproduce the image; I do feel that data not related to that goal doesn't belong in this metadata and should live somewhere else.

Sounds good-- you're underlining my primary goal-- to have something short and sweet to paste in something like reddit or wherever.

Imagine the stringified json object had a name, and for lack of a better term I'll just call it ailink for now.

So imagine a post that says "See this cool picture? if you paste this ailink in your GUI too you'll get the image I made"..

It would parse the data, and with a tap of a button start popping out the image(s) in as reproducible a way as possible.

Then maybe you just made the image but think it could be better. So you make a tweak to the prompt and get a great result. You'd hit the "Copy ailink" button and then paste THAT into your reddit thread. Or maybe just hit the copy to QRCode and paste that along with your beautiful image so people can see how you did it.

Similarly, when someone sends you a cool image and you want to know what exactly went into making it, you'd load in that image, hit the Get info shortcut key and see the same formatted, editable ailink info, exactly as it appeared in the above example. You make some more tweaks, copy the ailink (again, this probably needs a way better name), paste it in a post, and everyone can see your improvements and how you did it.

But as time passes, it may be the case that you grab someone's ailink and paste it and it goes-- "oh dang, for this you need a specific upgraded model-- here's where you can get it." Or it may say "this image can only be built with version 2.1 of this app-- here's where to get that".

Again, hope this is all making sense.

@bakkot
Contributor Author

bakkot commented Sep 2, 2022

For that use case, a hash makes sense if you somehow need to reference an otherwise unknown model file. But looking forward-- there is a near-universal, industry-standard way of uniquely identifying a specific model, adopted across all disciplines of ui-- the model_id. To not include this string, even as an optional field, would be a major omission, IMO.

Sure, I'm happy to have this as an additional, optional field, though no existing repository will be able to use it because no existing repository has the model card. (I guess I could, and probably should, hardcode specific hashes and their corresponding model cards, though.)
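
A minimal sketch of that hardcoded lookup, assuming the application has the weights in hand. `KNOWN_MODELS`, `short_hash`, and `model_fields` are hypothetical names, and the table entry is a placeholder rather than the hash of any real checkpoint. The hash follows the RFC's definition (first 8 hex characters of the SHA-256).

```python
import hashlib

# Hypothetical table: short hash -> model_id.
# The entry below is a placeholder, not the hash of a real checkpoint.
KNOWN_MODELS = {
    "deadbeef": "example-org/example-model",
}


def short_hash(data: bytes) -> str:
    """First 8 hex characters of the SHA-256 digest, per the RFC's definition of "hash"."""
    return hashlib.sha256(data).hexdigest()[:8]


def model_fields(weights: bytes) -> dict:
    """Always emit model_hash; add the optional model_id only when the hash is known."""
    fields = {"model_hash": short_hash(weights)}
    model_id = KNOWN_MODELS.get(fields["model_hash"])
    if model_id is not None:
        fields["model_id"] = model_id
    return fields
```

With this shape, unknown weights still produce valid metadata; they just lack `model_id`.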

Also, when you say "users just download a weights file and use it"... how so?

This is an empirical description of the way people are currently using stable diffusion. They download the weights and use that. They are not currently in the habit of additionally downloading a model card, nor would I want to complicate the setup instructions by requiring that they do so.

It's not that people couldn't download this information alongside the weights, it's that right now they don't, and I really don't want to add additional setup burden. Were it me I'd've embedded this information in the checkpoint file, but as far as I'm aware this is not currently done.

model_id -- string -- from the model card (generally contains a/b/c)
model_hash -- to satisfy your use case of the unknown weights file
app_id (formerly app_variant) -- string -- this can be a description "lstein 1.2" or whatever human readable identifier for the code that you need to use the model
app_uri -- the remote repository used to build this. Could contain a URL which would include a/b/c, including the branch name. The same type of URL used, say, with git clone-- could be git:// https:// cvs:// svn:// file:// etc. If it's a closed-source program, this can link to a .zip file, an .exe, a .deb, an installer, a home page, a finger://, a deep link for mobile apps, or a web front end for a cloud service at huggingface, etc., or a wasm application.

That mostly sounds reasonable - with the caveat that model_id, but not model_hash, would be optional - but I don't think having "app id" and also "app uri" is necessary; I think just saying "this is a string which identifies the app used" is sufficient, and I would leave it up to the app to decide how to do that. For repositories on GitHub, as almost all projects are, I think username/repo_name would be sufficient - that's what I'd use here - but it would be up to each codebase to decide the appropriate way to identify itself. And just one field is enough for that.

Well a commit is only a single change to the repository, and a single commit might be on multiple branches anyway. But a tagged branch is usually not meant to change-- adding branch or tag as an optional field might make sense.

A commit hash unambiguously refers to a single state of the code, which is the important part. Neither branches nor tags have that property, so I don't much care about them. But sure, I am fine adding repo_tag as additional, optional field, alongside repo_commit (or whatever).

Again, hope this is all making sense.

Yup, I think we're mostly on the same page.

I'll update the OP later today.

@bakkot
Contributor Author

bakkot commented Sep 2, 2022

OK, updated the spec in the OP. @fat-tire want to take another look?

@fat-tire
Contributor

fat-tire commented Sep 3, 2022

Sure, I'm happy to have this as an additional, optional field, though no existing repository will be able to use it because no existing repository has the model card. (I guess I could, and probably should, hardcode specific hashes and their corresponding model cards, though.)

Sorry I'm a little confused-- I meant that the model card is just for humans to read to associate the model with an id-- so either it's bundled with an sd application, or the user has explicitly installed it, or the app has noted a specific model as a named requirement for recreating an image from its metadata.

Also, when you say "users just download a weights file and use it"... how so?

This is an empirical description of the way people are currently using stable diffusion. They download the weights and use that. They are not currently in the habit of additionally downloading a model card, nor would I want to complicate the setup instructions by requiring that they do so.

Oh of course not. They wouldn't need the actual card- but like you said, if there's a mystery .ckpt there, they could verify they have the "right one" for a-- (sigh) ailink by its hash as you suggest. But if they don't have it, and need to get it, that's where the id/uri come in.

It's not that people couldn't download this information alongside the weights, it's that right now they don't, and I really don't want to add additional setup burden. Were it me I'd've embedded this information in the checkpoint file, but as far as I'm aware this is not currently done.

Oh I never meant to suggest they would have to...

That mostly sounds reasonable - with the caveat that model_id, but not model_hash, would be optional - but I don't think having "app id" and also "app uri" is necessary; I think just saying "this is a string which identifies the app used" is sufficient, and I would leave it up to the app to decide how to do that.

But an app would have to know that it's incapable of supporting a particular model. I would suggest that model_hash would be optional if model_id is provided- but I guess it can't hurt to have a hash of the model for verification that it's the right one.

For repositories on GitHub, as almost all projects are, I think username/repo_name would be sufficient - that's what I'd use here - but it would be up to each codebase to decide the appropriate way to identify itself. And just one field is enough for that.

You mean for model_id right? The URL though would be needed to distinguish between "bob/mymodel" on github and "bob/mymodel" on gitlab or "bob/mymodel, tag release-2" on bitbucket or huggingface, or wherever else... I don't see how a URI wouldn't be indispensable. It even directs you directly to the specific tag and would be a virtual requirement for someone building from scratch who doesn't know the model or the ai community to know where to find the specific model required. (a direction to a commit hash would be good as well.)

Again, hope this is all making sense.

Yup, I think we're mostly on the same page.

I'll update the OP later today.

Cool thanks-- yeah it seems we're going in the same direction. Hopefully others will chime in as there may be use cases or scenarios neither of us have contemplated.

Update-- took a look at the spec now-- looks great! My only holdouts are about the URI and possible confusion between github/gitlab/gitea/huggingface/etc. A URI pointing to a hash would clear up any ambiguity and offer clear direction for a user or automated process searching for the correct model or app w/o having to download and then check hashes.

Also to throw a wrench in this-- we're assuming cross platform support for the apps. I think we SHOULD care that we have the correct app, say some experimental new feature is supported here, but what if this won't work on my platform? What should happen in this case? Or maybe there's a Mac version of this windows program that WOULD work... what then? Maybe you're SOL- same as if you don't have enough memory or the right graphics card or whatever... or no?

again, nice work. Don't hate me-- I'm just playing devil's advocate here :)

@bakkot
Contributor Author

bakkot commented Sep 3, 2022

Sorry I'm a little confused-- I meant that the model card is just for humans to read to associate the model with an id-- so either it's bundled with an sd application, or the user has explicitly installed it, or the app has noted a specific model as a named requirement for recreating an image from its metadata.

I'm thinking about how an application like this one would populate the metadata, not how a user would consume it. Right now, there is no reasonable way for an application like this to populate model_id, in general, without adding an additional step to the installation instructions which requires the person using the application to input that ID. And I don't want to add an additional step to the installation instructions. So we can't require model_id in the metadata.

I'm fine with having model_id as an optional field, and I've added it to the current draft. I just don't think it will get much use, because I don't see how applications could possibly populate it except for a few known model weights unless they ask users of the application to provide that information as an additional step.

I would suggest that model_hash would be optional if model_id is provided- but I guess it can't hurt to have a hash of the model for verification that it's the right one.

Making model_hash optional is only sensible if we can actually trust users' manually input model_id, and we definitely can't - I absolutely guarantee users will copy-paste wrong or forget to update when switching out weights. The hash is something the application derives for itself, so it's trustworthy. It can't be optional.
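
As a rough illustration of "derives for itself": the application can hash the weights file directly instead of relying on anything the user typed. `file_short_hash` is a hypothetical helper, and reading in chunks is only an assumption made to avoid loading a multi-gigabyte checkpoint into memory at once.

```python
import hashlib


def file_short_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the first 8 hex characters."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()[:8]
```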

You mean for model_id right?

No, I mean the app_id. I agree that in theory "username/repo_name" could be ambiguous, but having a convention that "username/repo_name" means specifically that username/repo on GitHub is fine. This is a convention other specifications use without issue. Requiring that you prefix the common case with https://github.com/ doesn't add any benefit except making the ID larger and harder for humans to distinguish at a glance, given such a convention.

I guess I am OK with having an extra app_url field, but it just seems like needless overhead to me.

Of course if your project is hosted somewhere other than GitHub you can put a full URL in the app_id. But many projects won't have a URL to use; many people have private forks they're tinkering with, and it's still useful to uniquely identify those. So I don't want to require a URL.
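
One way a consumer could apply that convention, as a sketch only: treat a bare "username/repo_name" as GitHub, pass full URLs through untouched, and give up on anything else. `app_id_to_url` is a hypothetical helper, not part of the spec.

```python
from urllib.parse import urlparse


def app_id_to_url(app_id: str):
    """Return a URL for app_id, or None when it is an opaque identifier."""
    parsed = urlparse(app_id)
    if parsed.scheme and parsed.netloc:
        return app_id  # already a full URL, e.g. a project hosted outside GitHub
    parts = app_id.split("/")
    if len(parts) == 2 and all(parts):
        return f"https://github.com/{app_id}"  # the "username/repo_name" convention
    return None  # e.g. a private fork with no public location


print(app_id_to_url("lstein/stable-diffusion"))
```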

Also to throw a wrench in this-- we're assuming cross platform support for the apps. I think we SHOULD care that we have the correct app, say some experimental new feature is supported here, but what if this won't work on my platform? What should happen in this case? Or maybe there's a Mac version of this windows program that WOULD work... what then? Maybe you're SOL- same as if you don't have enough memory or the right graphics card or whatever... or no?

I don't think we're assuming cross platform support, really? We're just saying "here's how this was generated". Nothing is stopping you from inputting the same settings into a different application; you're just not guaranteed to get the same output.

again, nice work. Don't hate me-- I'm just playing devil's advocate here :)

Not to worry: I work on a standards committee; I am extremely used to working with this kind of feedback. And it's helpful to getting the best version of the spec. Doing it before finalizing the spec is the best time for that!

@fat-tire
Contributor

fat-tire commented Sep 3, 2022

I'm thinking about how an application like this one would populate the metadata, not how a user would consume it. Right now, there is no reasonable way for an application like this to populate model_id, in general, without adding an additional step to the installation instructions which requires the person using the application to input that ID. And I don't want to add an additional step to the installation instructions. So we can't require model_id in the metadata.

Well, for me this started when I was contributing to this qt-based GUI repo, and when looking at the settings file I was like- wait everything currently in the settings would be all you needed to define the image. It was already being saved as json and I was like-- this- this in some way IS the image and the user could effectively swap out settings files as different images and trade them...

I thought a URL-encoded version might be simpler, but regardless, that's where I started.

But if the-- and I'm going to stop calling it an ailink right now, but if this whatever-text is universal enough between programs, it has to be able to direct a user that loads the json into an app how to get anything that's missing. This appears to be the opposite of your scenario, where you're more concerned with how the application would create the metadata. I want to know how it can be consumed and then any missing components like the model (let alone a post-processing step or whatever) most clearly be addressed.

I'm fine with having model_id as an optional field, and I've added it to the current draft. I just don't think it will get much use, because I don't see how applications could possibly populate it except for a few known model weights unless they ask users of the application to provide that information as an additional step.

The first thing I plan to do once this is settled as a working standard is to implement it in that GUI, so I'll be using the model_id probably always, in concert with the hash. It can be populated by hand or as you suggest from a lookup table, or derived from checking the URI, etc.

I would suggest that model_hash would be optional if model_id is provided- but I guess it can't hurt to have a hash of the model for verification that it's the right one.

Making model_hash optional is only sensible if we can actually trust users' manually input model_id, and we definitely can't - I absolutely guarantee users will copy-paste wrong or forget to update when switching out weights. The hash is something the application derives for itself, so it's trustworthy. It can't be optional.

Okay.

You mean for model_id right?

No, I mean the app_id. I agree that in theory "username/repo_name" could be ambiguous, but having a convention that "username/repo_name" means specifically that username/repo on GitHub is fine. This is a convention other specifications use without issue. Requiring that you prefix the common case with https://github.com/ doesn't add any benefit except making the ID larger and harder for humans to distinguish at a glance, given such a convention.

As long as there is a URI to clarify the actual source, to make it easy to find or download, that's fine.

I guess I am OK with having an extra app_url field, but it just seems like needless overhead to me.

Of course if your project is hosted somewhere other than GitHub you can put a full URL in the app_id. But many projects won't have a URL to use; many people have private forks they're tinkering with, and it's still useful to uniquely identify those. So I don't want to require a URL.

Okay, don't require it, but I do think it will prove to be extremely useful, especially as models splinter and are retrained, etc.

I don't think we're assuming cross platform support, really? We're just saying "here's how this was generated". Nothing is stopping you from inputting the same settings into a different application; you're just not guaranteed to get the same output.

That's fair enough.

Not to worry: I work on a standards committee; I am extremely used to working with this kind of feedback. And it's helpful to getting the best version of the spec. Doing it before finalizing the spec is the best time for that!

I agree!

@lstein
Collaborator

lstein commented Sep 3, 2022

I turn my back for a few days and this thread has grown to 18 comments!

Is the current RFC still the first posting, or is it an external file somewhere? It might be good to put it into the repository so that we can track version changes. Or even a Google doc.

@tildebyte
Contributor

Or even a Google doc.

Agreed, given the (seeming; I don't have time to read it all 😁) length and breadth of the discussion in here.

@bakkot
Contributor Author

bakkot commented Sep 3, 2022

I've been editing the original message; it's current with what I am proposing. You can look at the revision history if you want but it's not very exciting.

I might need to tweak it a couple more times - once to accommodate a richer variations format, for interpolations, and once to add a field for seamless, plus to note that individual applications may add additional fields if there is required information to generate an image. But other than those changes I'm happy with it, and I'll probably start implementing it (starting with a refactoring of the existing metadata code, probably in its own PR) later this weekend.

When I submit the PR implementing this format for metadata I'll include the spec as a markdown file and link to it from the readme. Don't want to formalize it before then because very few specs survive contact with first implementation, in my experience; I want to make sure it's at least possible to implement before I declare it good.

@tildebyte
Contributor

@bakkot; I'm sure that you know this, but if we also have a JSON schema, it makes everyone's life much easier (if nothing else, the schema IS the spec)

@bakkot
Contributor Author

bakkot commented Sep 3, 2022

I'll be sure to provide a JSON schema also.
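
For a sense of what that could look like, here is a deliberately partial sketch covering only the model and app fields discussed in this thread. It is not the actual schema: the required/optional split is inferred from the discussion above (only model_hash was said to be non-optional), and `check` is a tiny hypothetical stand-in for a real JSON Schema validator.

```python
# Hypothetical, partial schema fragment in JSON Schema style.
SCHEMA_FRAGMENT = {
    "type": "object",
    "required": ["model_hash"],
    "properties": {
        "model_hash": {"type": "string"},
        "model_id": {"type": "string"},
        "model_url": {"type": "string"},
        "app_id": {"type": "string"},
        "app_version": {"type": "string"},
        "app_url": {"type": "string"},
    },
}


def check(metadata: dict, schema: dict = SCHEMA_FRAGMENT) -> list:
    """Check required keys and string types only; return a list of problems."""
    problems = [
        f"missing required field: {key}"
        for key in schema["required"]
        if key not in metadata
    ]
    for key, rule in schema["properties"].items():
        if key in metadata and rule["type"] == "string" and not isinstance(metadata[key], str):
            problems.append(f"{key} must be a string")
    return problems
```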

@tildebyte
Contributor

@fat-tire; I hope I don't come across as rude, but "See this cool picture? if you paste this ailink in your GUI too you'll get the image I made" is pixie dust.

This fork is several hundred commits ahead of upstream (which btw for all intents and purposes is dead: in the last 2-ish weeks, it's had README updates, fixes to deps, safety tweaks (ew), and a license change). There are two other forks which are "leading" forks. This one is No. 2 in terms of stars and forks, but has twice the commits of No. 1.

I say all that to say: no WAY is someone going to be able to take unedited generation parameters straight from here, and run them through one of the other forks and get anything like the same image (this repo itself has issues with reproducibility on macOS!).

I don't mean to say that reproducibility isn't a goal, or should be ignored; more that it doesn't really exist now (across forks), and I don't believe that anyone is coordinating anything like it (across forks)

@fat-tire
Contributor

fat-tire commented Sep 3, 2022

I guess I am OK with having an extra app_url field, but it just seems like needless overhead to me.

@bakkot you're still going to add a model_uri or model_url, as an optional field right? I see it as critical. Once the spec (or json schema) is good to go I plan to implement it in that quick qt-based GUI, and use the field to direct the user to download the model if model.ckpt is not there or does not have the correct hash. Ideally, the app would auto-download & install it, but it needs to know where to find it. Relying on an up-to-date hash-lookup I think is not practical. If the model_url is missing (since it's optional) then it can try a table of hashes either in the app or some central place or just say "here's the hash of the model that's missing, good luck finding it!"
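
Roughly, the fallback order I have in mind, as a sketch. `locate_model` and its return values are made up for illustration, `known_urls` stands in for whatever hash table the app or a central service might keep, and the actual download step is left out.

```python
import hashlib


def locate_model(metadata: dict, local_weights, known_urls: dict):
    """Decide how to obtain the model named in metadata.

    local_weights is the bytes of the installed model, or None if there is none.
    Returns an (action, detail) pair.
    """
    wanted = metadata["model_hash"]
    if local_weights is not None:
        have = hashlib.sha256(local_weights).hexdigest()[:8]
        if have == wanted:
            return ("use_local", None)  # the installed model is the right one
    url = metadata.get("model_url")
    if url:
        return ("download", url)  # the optional field says where to get it
    url = known_urls.get(wanted)
    if url:
        return ("download", url)  # fall back to a table of known hashes
    return ("not_found", wanted)  # all the app can do is report the hash
```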

I say all that to say: no WAY is someone going to be able to take unedited generation parameters straight from here, and run them through one of the other forks and get anything like the same image (this repo itself has issues with reproducibility on macOS!).

@tildebyte Of course no one is expecting you to be able to do that-- that is the very reason for providing the optional app_id field and app_version -- to tell you specifically how you CAN reproduce the image-- or at least what was used to create the image you're looking at. If you load it into the wrong program it says "No, that won't work. You need version 2.1 of XYZ app to create the image". I was hoping for an app_url too to save you from googling and so you'd know specifically which fork/commit/web site to get it from.

Incidentally, was trying to come up with a good name to replace 'ailink':

  • Flexible, AI Resource Text (FAIRT)
  • AI Resource Card (AIRCard)
  • AIMF (AI Metadata File)

Meh. I'll keep thinking.

@bakkot
Contributor Author

bakkot commented Sep 3, 2022

I say all that to say: no WAY is someone going to be able to take unedited generation parameters straight from here, and run them through one of the other forks and get anything like the same image

I actually think it's feasible, believe it or not! This fork does have tons of features, but most of them are optional, and we've been pretty good about not making "breaking changes" in the sense of "changing the output for a previously-working prompt"; if you aren't doing anything "fancy" like upscaling or variations, you probably actually can get images to reproduce on other forks. Indeed part of the point of this spec was the hope that other forks would implement the same format and support loading metadata from other applications, when the metadata indicated the image was generated from the subset of features which they support.

@bakkot
Contributor Author

bakkot commented Sep 3, 2022

@fat-tire

you're still going to add a model_uri or model_url, as an optional field right?

Ah, right. I've just added model_url and app_url as optional fields.

@psychedelicious
Collaborator

@bakkot did not include the filename for privacy reasons and instead opted for a hash: #266 (comment)

But that does not allow for metadata to be used to reproduce img2img images. You'd have to figure out which image was hashed. Maybe I'm missing something but that makes metadata worthless for img2img unless you remember which image you used.

IMO a filename and hash should be used. Privacy concerns are handled by the implementation, e.g. the back end copies init images to wherever they need to be and, if privacy mode is enabled, changes the filename to a UUID and strips metadata from them.

@fat-tire
Contributor

IMO a filename and hash should be used

Maybe a URL instead of a filename or path? Even if it's file:// would be nice to reference an image resource accessible via http:// or ssh:// or whatever.

@psychedelicious
Collaborator

Perhaps we have filename (e.g. my_init_image.png), location (e.g. path/to/my_init_image.png or https://www.website.com/my_init_image.png or whatever is appropriate), and hash.

@fat-tire
Contributor

Since file://path/to/image.jpg is a legit URL just like https://www.example.com/image.jpg, maybe just a single image_url field would work, which could cover both "filename" and "location" and tell you how to get it. In the case of file:// it might prefer relative over absolute paths (to avoid inclusion of user or account names). There is of course a difference between posix-style paths and windows-style paths, but that's easily translated, especially for relative paths.
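
The path translation could be as small as this sketch, which normalizes a relative path to forward slashes and refuses absolute ones so that user or account names stay out of the metadata. `portable_relative_path` is a hypothetical helper, and it assumes backslashes are always separators, which would be wrong for a POSIX filename that really contains one.

```python
from pathlib import PurePosixPath, PureWindowsPath


def portable_relative_path(path: str) -> str:
    """Return a forward-slash relative path; reject absolute paths of either style."""
    windows_view = PureWindowsPath(path)  # understands both "\\" and "/" separators
    if windows_view.is_absolute() or PurePosixPath(path).is_absolute():
        raise ValueError("absolute paths may leak user or account names")
    return windows_view.as_posix()


print(portable_relative_path(r"init\cat.png"))
```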

I was also just thinking that a nice thing about the node-based pipeline description of how to generate an image is that if img2img is used with an image which in turn contains metadata about how it was generated, that image could theoretically be imported and hooked into the graph to be recreated from its metadata too.

lstein added a commit that referenced this issue Sep 16, 2022
* Feature complete for #266 with exception of several small deviations:
1. initial image and model weight hashes use full sha256 hash rather than first 8 digits
2. Initialization parameters for post-processing steps not provided
3. Uses top-level "images" tags for both a single image and a grid of images. This change was suggested in a comment.

* Added scripts/sd_metadata.py to retrieve and print metadata from PNG files
* New ldm.dream.args.Args class is a namespace-like object which holds all defaults and can be modified during execution to hold current settings.
* Modified dream.py and server.py to accommodate Args class.
@psychedelicious
Collaborator

I'd like to question grid being relevant. Why do we need that? Yes, grids were in the initial scripts, but they aren't an inherent part of SD or other generation technology. Grid belongs to the presentation, not the image itself.

@codedealer
Contributor

I'd like to question grid being relevant.

I agree. Also not clear on the purpose of variations in image_data, why does it matter which variations were created from that image or is it necessary for the reproduction?

The prompt field currently doesn't account for so called "negative prompts" which is different from prompts with negative weights, see: Sygil-Dev/sygil-webui#999

Lastly I'd like to voice my concerns about putting file paths of any nature into metadata. Privacy aside this is too unreliable a feature in my opinion. If we allow (and I think we should) these images to be shared across the community the paths (both relative and absolute) can change arbitrarily but it shouldn't impede reproducibility of an image in any case. The original spec already has orig_hash and that should be enough to verify that the image supplied is init_image. Yes it falls onto a user to remember which image was the original and to supply it with the shared one. An alternative to that would only be embedding the entire init_image into metadata itself.
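
A sketch of that verification step, assuming the user supplies the candidate init image alongside the shared one. Since the commit referenced above stores the full SHA-256 while the RFC text uses the first 8 hex characters, this hypothetical `matches_orig_hash` accepts either length.

```python
import hashlib


def matches_orig_hash(image_bytes: bytes, orig_hash: str) -> bool:
    """True if image_bytes hashes to orig_hash (8-character short form or full 64)."""
    recorded = orig_hash.lower()
    if len(recorded) not in (8, 64):
        return False  # neither the short form nor a full SHA-256 hex digest
    return hashlib.sha256(image_bytes).hexdigest().startswith(recorded)
```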

@Kyle0654
Contributor

We may want to consider including a handful of fields in the file metadata (separately from this) to indicate metadata spec version and format. I can imagine we'll eventually want to gzip or otherwise compress the metadata (if not come up with a binary format).

@bakkot
Contributor Author

bakkot commented Sep 17, 2022

The PNG format where we're sticking this metadata already supports compressed text, in zTXt chunks and in iTXt chunks (where a compression flag inside the chunk marks it), so that's already future-proof.

(We almost certainly do not want to come up with a binary format.)
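
For reference, a standard-library sketch of a compressed iTXt chunk, following the PNG chunk layout (length, type, data, CRC) and the iTXt field order (keyword, compression flag, compression method, language tag, translated keyword, text). The helper names are hypothetical, and a real implementation would more likely let PIL write the chunk.

```python
import struct
import zlib


def png_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Assemble one PNG chunk: 4-byte length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def itxt_chunk(keyword: str, text: str, compress: bool = True) -> bytes:
    """Build an iTXt chunk with empty language tag and translated keyword."""
    payload = text.encode("utf-8")
    if compress:
        payload = zlib.compress(payload)
    data = (
        keyword.encode("latin-1") + b"\x00"
        + (b"\x01" if compress else b"\x00")  # compression flag
        + b"\x00"                             # compression method 0 (zlib/deflate)
        + b"\x00"                             # empty language tag, then terminator
        + b"\x00"                             # empty translated keyword, then terminator
        + payload
    )
    return png_chunk(b"iTXt", data)


def read_itxt(chunk: bytes):
    """Parse a chunk produced by itxt_chunk and return (keyword, text)."""
    length, _chunk_type = struct.unpack(">I4s", chunk[:8])
    data = chunk[8:8 + length]
    keyword, rest = data.split(b"\x00", 1)
    compressed = rest[0] == 1
    _language, rest = rest[2:].split(b"\x00", 1)
    _translated, payload = rest.split(b"\x00", 1)
    if compressed:
        payload = zlib.decompress(payload)
    return keyword.decode("latin-1"), payload.decode("utf-8")
```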

@psychedelicious
Collaborator

psychedelicious commented Sep 17, 2022

Ok, so suppose the client has the responsibility of keeping track of init images. Aren't we, in practice, deferring a part of the spec (association of init image to result image) to client implementation?

Scenario 1: I have generated a lot of images via SD img2img. My computer crashes but thankfully I had a data backup. I reinstall whatever software I used to create the images. How does my software figure out which init images go with which results?

Works

  • Embedding init image
  • non-cryptographic file identifier (filename) + including the files
  • database keeping track of everything (if it wasn't lost in the crash)

Scenario 2: I have generated a lot of images via SD img2img. A new, vastly improved UI is created and I want to migrate to it. How does that happen?

Works

  • Embedding init image
  • non-cryptographic file identifier (filename) + including the files

Doesn't work

  • database keeping track of everything (new software has had to roll their own way of keeping track)

Scenario 3: I have generated a lot of images via SD img2img. My friend wants to iterate on my work. They don't use the same software I use. How do I send them my best results and include the init images?

Works

  • Embedding init image
  • non-cryptographic file identifier (filename) + including the files

Doesn't work

  • database keeping track of everything

Scenario 4: A new img2img method is invented in which an arbitrary number of init images are provided and you get a cool mix of all of them. How do we indicate which images are used?

Works

  • array of non-cryptographic file identifiers (filenames) + including the files
  • database keeping track of everything

Doesn't work

  • Embedding (arbitrary number of files)

I understand that not including a reference to the init image besides a hash may be "correct", but I don't think it's functional. We're not building a metadata spec to be correct, we're building it to be used in the real world, right?

@psychedelicious
Collaborator

psychedelicious commented Sep 17, 2022

After making a tea and doing some testing, I think embedding the init images as base64 is probably a good enough solution. I embedded 50x base64 images in a PNG's metadata without issues writing or reading the data back. The PNG is now 34 MB but well that comes with the territory.

Edit: according to this official-looking website, the max chunk length is a Very Large Number™️. So we can almost fit an arbitrary number of init images. http://www.libpng.org/pub/png/spec/1.2/PNG-Structure.html
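
The embedding itself can be sketched like this. The `init_images` key is a made-up field name, not something in the spec, and base64 makes each image about a third larger than its raw bytes.

```python
import base64
import json


def embed_init_images(metadata: dict, images: list) -> str:
    """Return metadata as JSON with each init image added as a base64 string."""
    out = dict(metadata)  # shallow copy so the caller's dict is left alone
    out["init_images"] = [base64.b64encode(raw).decode("ascii") for raw in images]
    return json.dumps(out)


def extract_init_images(metadata_json: str) -> list:
    """Decode the embedded init images back to bytes (empty list if none)."""
    encoded = json.loads(metadata_json).get("init_images", [])
    return [base64.b64decode(item) for item in encoded]
```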

@fat-tire
Contributor

To save space, perhaps only include the images in leaf/terminal nodes as any intermediate images (flipped, rotated, combined, tiled, etc.) should be able to be derived from those, right?

@lstein lstein closed this as completed Sep 17, 2022
@codedealer
Contributor

@psychedelicious I'm all for embedding an arbitrary number of images inside metadata as base64 if they all are needed to regenerate the image. You are going to need them anyway, whether they come packed into one file or several. Sharing/uploading just one file is easier in terms of general UX.

As a side note: why was this marked as completed? Doesn't feel like a conclusive solution was achieved?

@Kyle0654
Contributor

It may be a good idea to provide a way to get the images with the metadata stripped out, especially if they're significantly larger because of it.

@psychedelicious
Collaborator

The client could handle exporting an image without metadata e.g. "Share Image" vs "Share Image with SD Metadata"

So when init images (or masks or anything else that is invented) get embedded, we will need to strip them of their metadata, else when you chain img2img's, you end up with massive metadata. This goes back to considering init images and any other input to the current working image as atoms which come with no context of their own. Hope that makes sense
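
As a sketch of that stripping step, the function below drops every text chunk (tEXt, iTXt, zTXt) from a PNG byte string while leaving the pixel data untouched. It uses only the standard library and assumes a well-formed PNG; the function name and the sample metadata are invented for the example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
TEXT_CHUNK_TYPES = {b"tEXt", b"iTXt", b"zTXt"}


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: 4-byte length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def strip_text_chunks(png: bytes) -> bytes:
    """Return the PNG with all text chunks removed (assumes well-formed input)."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    kept = [PNG_SIGNATURE]
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        end = pos + 12 + length  # length + type + data + CRC
        if png[pos + 4:pos + 8] not in TEXT_CHUNK_TYPES:
            kept.append(png[pos:end])
        pos = end
    return b"".join(kept)


# Build a 1x1 grayscale PNG, then a copy that carries a metadata chunk.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
plain = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
         + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
         + make_chunk(b"IEND", b""))
text = make_chunk(b"tEXt", b'sd-metadata\x00{"steps": 50}')
with_metadata = plain[:-12] + text + plain[-12:]  # insert before IEND

# The init image is embedded as a bare "atom": pixels only, no context.
stripped = strip_text_chunks(with_metadata)
```

Running this on every init image before embedding it keeps the metadata from growing with the length of an img2img chain.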

@psychedelicious
Collaborator

I appear to have a "reopen" button that works. I have used it. This must have been closed by mistake, @lstein was doing some out of season spring cleaning.

@fat-tire
Contributor

I was thinking-- so we're going to embed required images but not, say, embed the actual weights file, right? (of course not)

Since multiple images can share a single weights file, it's reasonable that maybe one init image will be shared between images too. There's a value in including that init file as a "standalone" image, but if you're grabbing 50 images that all share the same init image, you don't want the size of that image repeated 50x.

So maybe the notion of "static" vs "shared" (as in libraries) might be applicable... just to make things simpler (or maybe more confusing).

Maybe to manage such scenarios, have something like:

resources: array of resource objects; there can be 0 to N of them

A resource would have:

  • resource_id: string. This is the reference "handle" that the pipeline nodes use to refer to this resource. Required and must be unique. Could be a hash of the file, but then we won't need the next field.
  • resource_hash: string. Assuming this is a sha512 hash of the final binary form of the resource, it can be used to check whether the resource is locally available (as a file or even within other files) and, if not, to retrieve it "on demand".
  • resource_mime: string. A description of the type of resource ("image/jpeg"). Optional; the default would maybe just be "image/*".
  • resource_url: string. A URL pointing to the resource. Can use any scheme: ssh://, file://, http://, etc. Optional.
  • resource_content: base64-encoded resource binary. I presume this gets compressed. Very optional.
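
For illustration only, the fields above could be sketched as a Python dataclass like this. The field names follow the list; treating `resource_hash` as required, the `make_resource` helper, and its `embed` flag are assumptions added for the example.

```python
import base64
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Resource:
    resource_id: str                        # unique handle used by pipeline nodes
    resource_hash: str                      # hex sha512 of the resource's bytes
    resource_mime: str = "image/*"          # optional, defaults to a generic image
    resource_url: Optional[str] = None      # optional pointer to the resource
    resource_content: Optional[str] = None  # optional base64 of the bytes


def make_resource(resource_id: str, data: bytes, mime: str = "image/*",
                  url: Optional[str] = None, embed: bool = False) -> Resource:
    """Describe `data` as a resource; only embed the bytes when asked to."""
    return Resource(
        resource_id=resource_id,
        resource_hash=hashlib.sha512(data).hexdigest(),
        resource_mime=mime,
        resource_url=url,
        resource_content=base64.b64encode(data).decode("ascii") if embed else None,
    )


init_bytes = b"pretend these are PNG bytes"

# "Static": the image travels inside the metadata.
embedded = make_resource("init-0", init_bytes, mime="image/png", embed=True)
# "Shared": only the hash and a location are stored, so 50 outputs can point
# at one copy of the same init image.
shared = make_resource("init-0", init_bytes, mime="image/png",
                       url="file://path/to/init.png")

resources_json = json.dumps([asdict(shared)])
```

Both variants carry the same `resource_hash`, so a consumer can tell they refer to the same bytes whether or not the content is embedded.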

I'm probably missing something, and as always it's important to consider security implications of throwing a big blob in there.

I figured call this resource instead of image as to not confuse it with the final image(s) that are produced from the pipeline and for expandability as perhaps someday such resources will include more than just plain images.

In fact, the weights file itself can be seen as a resource-- not that you'd shove that 10G file in an image-- but instead of doing this:

model: "stable diffusion"
model_id: string identifying the model. Must be the model_id field of a Model card. Optional; there is no default value, but consuming applications may infer a value from model_hash if they recognize that value.
model_url: a string giving a URL where the model can be downloaded (if public) or read about (if not). Optional, does not have a default.
model_hash: hash of the weights [precise format TBD depending on implementation feasibility]; see the "model information" section below

You could just make it a resource :

resource_id: "CompVis/stable-diffusion-v1-4" (from the model card)
resource_hash: [hash of the stable-diffusion-v1-4 weights file]
resource_mime: model/pytorch (or something. I can't find any MIME types for model checkpoints)
resource_url: https://huggingface.co/CompVis/stable-diffusion-v1-4 or a file://path/to/model.ckpt (?)
resource_content: null [you get it separately. It's not in the image.]

One advantage to this is that for a pipeline with nodes, you might be using several models-- stable-diffusion to generate the image, then ESRGAN or something else to do more processing.
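
A sketch of how a consumer might resolve such resources, whether they are models or images: use the embedded content if present, otherwise look for a local file with a matching hash, and otherwise report that it has to be fetched from `resource_url`. The `local_store` index, the ESRGAN entry, and the node layout are all invented for the example, and the "weights" are just placeholder bytes.

```python
import base64
import hashlib
from typing import Dict, Optional


def sha512_hex(data: bytes) -> str:
    return hashlib.sha512(data).hexdigest()


def resolve_resource(resource: dict, local_store: Dict[str, bytes]) -> Optional[bytes]:
    """Return the resource bytes, or None if they must be fetched from resource_url.

    `local_store` is a hypothetical index of locally available files keyed by hash.
    """
    content = resource.get("resource_content")
    if content is not None:
        data = base64.b64decode(content)
    else:
        data = local_store.get(resource["resource_hash"])
    if data is None:
        return None  # not embedded and not available locally
    if sha512_hex(data) != resource["resource_hash"]:
        raise ValueError("hash mismatch for " + resource["resource_id"])
    return data


sd_weights = b"placeholder for the stable diffusion checkpoint"
esrgan_weights = b"placeholder for the upscaler checkpoint"

resources = [
    {"resource_id": "CompVis/stable-diffusion-v1-4",
     "resource_hash": sha512_hex(sd_weights),
     "resource_url": "https://huggingface.co/CompVis/stable-diffusion-v1-4",
     "resource_content": None},
    {"resource_id": "esrgan",  # hypothetical id for a second model
     "resource_hash": sha512_hex(esrgan_weights),
     "resource_content": None},
]
by_id = {r["resource_id"]: r for r in resources}

# Pipeline nodes refer to models by resource_id only.
nodes = [{"op": "txt2img", "model": "CompVis/stable-diffusion-v1-4"},
         {"op": "upscale", "model": "esrgan"}]

# Only the stable diffusion weights are available on this machine.
local_store = {sha512_hex(sd_weights): sd_weights}
missing = [n["model"] for n in nodes
           if resolve_resource(by_id[n["model"]], local_store) is None]
```

With this layout a second model in the pipeline is just one more entry in the same array, and the same check tells you which inputs you still need to download.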

Anyway, this is just typing out loud, so maybe none of this is good... dunno. Maybe we're trying to do too much all at once. But it can't hurt to think a few steps ahead about what may be possible so that it's not THAT hard to redo later.

@codedealer
Contributor

> So when init images (or masks or anything else that is invented) get embedded, we will need to strip them of their metadata, else when you chain img2img's, you end up with massive metadata. This goes back to considering init images and any other input to the current working image as atoms which come with no context of their own. Hope that makes sense

I can't know what will be invented in the future, but at least for img2img I don't expect the generated PNG to become excessively large (unless upscaled). An image that is the result of a chain of 100 img2img generations still needs to embed only the 99th image, because that is the only one needed to regenerate it.

I don't propose storing the whole chain in the init_image field, only the previous image. Even if that image itself embeds an init_image, that should be stripped out. Only the actual image data of the init image's PNG is relevant for regeneration, not how that image was produced (it could have been generated with SD, downloaded from a hosting site, or taken with a phone; it shouldn't matter).

@psychedelicious
Collaborator

Yeah, we are suggesting the same thing here. I brought it up in reference to a past conversation somewhere on this repo in which this same question was raised, i.e. if we embed/store an init image as metadata for a result image, should we also store that init image's metadata?

@lstein
Collaborator

lstein commented Sep 19, 2022

I am going crazy. I cannot see this discussion in GitHub GUI. The only way I can find it is to manually type the full URL. I've also tried to pin it, but it doesn't show up. Does someone understand what's going on here?

@lstein lstein added the enhancement New feature or request label Sep 19, 2022
@invoke-ai invoke-ai locked and limited conversation to collaborators Sep 19, 2022
@lstein lstein converted this issue into discussion #689 Sep 19, 2022
@blessedcoolant blessedcoolant unpinned this issue Sep 24, 2022