diff --git a/README.md b/README.md
index 263aef8..9577d0b 100644
--- a/README.md
+++ b/README.md
@@ -232,6 +232,41 @@ Details:
 
 * As described [above](#customizing-the-role-per-prompt), you can also supply a `role` value in these objects, so that the full form is `{ role, type, content }`. However, for now, using any role besides the default `"user"` role with an image or audio prompt will reject with a `"NotSupportedError"` `DOMException`. (As we explore multimodal outputs, this restriction might be lifted in the future.)
 
+### Multiple modalities in a single message
+
+Consider a prompt such as `Here is an image: <<>>. Please describe it.`. This is intended to be a single prompt from the user role. To express this, you can use an array for the `content` value:
+
+```js
+const response = await session.prompt({
+  role: "user",
+  content: [
+    "Here is an image: ",
+    { type: "image", content: imageBytes },
+    ". Please describe it."
+  ]
+});
+```
+
+This has _different semantics_ than prompting with multiple user messages:
+
+```js
+// THESE ARE PROBABLY NOT WHAT YOU WANT
+const probablyWrongResponse = await session.prompt([
+  "Here is an image: ",
+  { type: "image", content: imageBytes },
+  ". Please describe it."
+]);
+
+// Equivalent (and also probably wrong)
+const probablyWrongResponse2 = await session.prompt([
+  { role: "user", type: "text", content: "Here is an image: " },
+  { role: "user", type: "image", content: imageBytes },
+  { role: "user", type: "text", content: ". Please describe it." }
+]);
+```
+
+Those examples involve three separate user messages, which the underlying model will likely interpret differently. (To see this, compare with [our above multi-user example](#customizing-the-role-per-prompt), or with how you react when someone texts you three messages in a row vs. a single message.)
+
 ### Structured output or JSON output
 
 To help with programmatic processing of language model responses, the prompt API supports structured outputs defined by a JSON schema.
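To make the distinction concrete, here is a sketch of the canonical messages the two spellings correspond to, derived from the `interpreted as` rules in the IDL hunk below. This is illustrative only, not part of the proposed API surface; `imageBytes` is the same payload as in the examples above:

```js
// Canonical form of the single-message prompt: one user message whose
// content is a three-part array (text, image, text).
const singleMessage = {
  role: "user",
  content: [
    { type: "text", content: "Here is an image: " },
    { type: "image", content: imageBytes },
    { type: "text", content: ". Please describe it." }
  ]
};

// Canonical form of the "probably wrong" variants: three separate
// one-part user messages, which the model may treat as distinct turns.
const threeMessages = [
  { role: "user", content: [{ type: "text", content: "Here is an image: " }] },
  { role: "user", content: [{ type: "image", content: imageBytes }] },
  { role: "user", content: [{ type: "text", content: ". Please describe it." }] }
];
```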
@@ -618,17 +653,36 @@ dictionary LanguageModelExpectedInput {
 
 typedef (LanguageModelPrompt or sequence<LanguageModelPrompt>) LanguageModelPromptInput;
 
-// Prompt lines
-
 typedef (
-  DOMString                  // interpreted as { role: "user", type: "text", content: providedValue }
-  or LanguageModelPromptDict // canonical form
+  // canonical form
+  LanguageModelPromptDict
+  // interpreted as { role: providedValue.role, content: [{ type: providedValue.type, content: providedValue.content }] }
+  or LanguageModelPromptDictFlattened
+  // interpreted as { role: "user", content: [{ type: "text", content: providedValue }] }
+  or DOMString
 ) LanguageModelPrompt;
 
+typedef (
+  // canonical form
+  LanguageModelPromptContentDict
+  // interpreted as { type: "text", content: providedValue }
+  or DOMString
+) LanguageModelPromptContent;
+
 dictionary LanguageModelPromptDict {
+  LanguageModelPromptRole role = "user";
+  required (LanguageModelPromptContent or sequence<LanguageModelPromptContent>) content;
+};
+
+dictionary LanguageModelPromptDictFlattened {
   LanguageModelPromptRole role = "user";
   LanguageModelPromptType type = "text";
-  required LanguageModelPromptContent content;
+  required LanguageModelPromptContentValue content;
+};
+
+dictionary LanguageModelPromptContentDict {
+  LanguageModelPromptType type = "text";
+  required LanguageModelPromptContentValue content;
 };
 
 enum LanguageModelPromptRole { "system", "user", "assistant" };
@@ -640,7 +694,7 @@ typedef (
   or AudioBuffer
   or BufferSource
   or DOMString
-) LanguageModelPromptContent;
+) LanguageModelPromptContentValue;
 ```
 
 ### Instruction-tuned versus base models
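The `interpreted as` comments in this IDL fully determine how the shorthand forms normalize. As a minimal sketch of that normalization, assuming hypothetical helper names `normalizePrompt` and `normalizePromptContent` (neither is part of the proposed API):

```js
// LanguageModelPromptContent: a DOMString is interpreted as
// { type: "text", content: providedValue }; a dict passes through.
function normalizePromptContent(piece) {
  return typeof piece === "string" ? { type: "text", content: piece } : piece;
}

// LanguageModelPrompt: normalize all three accepted shapes to the
// canonical { role, content: [...] } form.
function normalizePrompt(prompt) {
  if (typeof prompt === "string") {
    // DOMString: { role: "user", content: [{ type: "text", content: prompt }] }
    return { role: "user", content: [{ type: "text", content: prompt }] };
  }
  if ("type" in prompt) {
    // Flattened form: { role, type, content } wraps into a one-piece message.
    return {
      role: prompt.role ?? "user",
      content: [{ type: prompt.type, content: prompt.content }]
    };
  }
  // Canonical form: accept a single content piece or a sequence of pieces.
  const pieces = Array.isArray(prompt.content) ? prompt.content : [prompt.content];
  return { role: prompt.role ?? "user", content: pieces.map(normalizePromptContent) };
}
```

Under this reading, `normalizePrompt("hello")`, `normalizePrompt({ role: "user", type: "text", content: "hello" })`, and `normalizePrompt({ role: "user", content: "hello" })` all produce the same canonical message, which is why the earlier examples with a `content` array are the only way to express several modalities in a single message.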