Recommended/best practice for chat implementation: extend input_ids, or use _gen_begin_reuse()/_gen_feed_tokens()? #770
karlsolomon started this conversation in General
I'm looking at the example "minimal_chat.py". To maintain a running chat history/context, the example appends to an ever-growing context_ids list on every interaction (see the sketch below). Meanwhile, if I understand correctly, previous input_ids and generations are already stored in the KV cache.

Are repeated input/context_ids skipped over somehow, or is this ever-growing list of inputs re-tokenized/encoded on every turn? If it is re-encoded every time, wouldn't TTFT degrade as the chat context gets longer? Would it be better practice to use an implementation akin to _gen_feed_tokens() or _gen_begin_reuse()? And is there a way to leverage the Dynamic Generator in the same way as the Streaming Generator's _gen_feed_tokens() and/or _gen_begin_reuse()?

Apologies if any of these questions are obvious. I'm neither experienced with (local) LLMs nor a strong Python developer. Thanks! :)
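For reference, here's a rough sketch of the growing-context pattern I'm describing, based on my reading of the example. The exllamav2 calls (begin_stream, stream, set_stop_conditions, etc.) are written from memory and may not match the current API exactly, and the model path is a placeholder, so please treat this as illustrative rather than exact:

```python
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Standard model setup (path is a placeholder)
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])  # I believe this is needed to end turns
settings = ExLlamaV2Sampler.Settings()

# The ever-growing context the question is about
context_ids = torch.empty((1, 0), dtype=torch.long)

while True:
    user_text = input("User: ")

    # Every turn, the newly encoded input is appended to the full history,
    # and the *entire* context is handed back to the generator. This is the
    # part I'm asking about w.r.t. the KV cache: is the shared prefix reused,
    # or is everything re-processed each turn?
    context_ids = torch.cat([context_ids, tokenizer.encode(user_text)], dim=-1)

    generator.begin_stream(context_ids, settings)
    response_ids = torch.empty((1, 0), dtype=torch.long)
    while True:
        chunk, eos, tokens = generator.stream()
        print(chunk, end="")
        response_ids = torch.cat([response_ids, tokens], dim=-1)
        if eos:
            break
    print()

    # The model's response is also appended, so context_ids keeps growing
    context_ids = torch.cat([context_ids, response_ids], dim=-1)
```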