Recommended/best practice for chat implementation: extend input_ids, or use _gen_begin_reuse()/_gen_feed_tokens()? #770
karlsolomon started this conversation in General
I'm looking at the example "minimal_chat.py". To maintain a running chat history/context, the example appends to an ever-growing context_ids list on every interaction (see the sketch below). Meanwhile, if I understand correctly, previous input_ids and generations are already stored in the KV cache.

Are repeated input/context_ids skipped over somehow, or is this ever-growing list of inputs re-tokenized/encoded on every turn? If it is re-encoded every time, wouldn't TTFT degrade as the chat context gets longer? Would it be better practice to use an implementation akin to _gen_feed_tokens() or _gen_begin_reuse()? And is there a way to leverage the Dynamic Generator in the same way as the Streaming Generator's _gen_feed_tokens() and/or _gen_begin_reuse()?

Apologies if any of these questions are obvious. I'm neither experienced with (local) LLMs nor a strong Python developer. Thanks! :)
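For reference, here's a rough sketch of the growing-context pattern I'm describing, based on my reading of the example. The exllamav2 calls (begin_stream, stream, set_stop_conditions, etc.) are written from memory and may not match the current API exactly, and the model path is a placeholder, so please treat this as illustrative rather than exact:

```python
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Standard model setup (path is a placeholder)
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])  # I believe this is needed to end turns
settings = ExLlamaV2Sampler.Settings()

# The ever-growing context the question is about
context_ids = torch.empty((1, 0), dtype=torch.long)

while True:
    user_text = input("User: ")

    # Every turn, the newly encoded input is appended to the full history,
    # and the *entire* context is handed back to the generator. This is the
    # part I'm asking about w.r.t. the KV cache: is the shared prefix reused,
    # or is everything re-processed each turn?
    context_ids = torch.cat([context_ids, tokenizer.encode(user_text)], dim=-1)

    generator.begin_stream(context_ids, settings)
    response_ids = torch.empty((1, 0), dtype=torch.long)
    while True:
        chunk, eos, tokens = generator.stream()
        print(chunk, end="")
        response_ids = torch.cat([response_ids, tokens], dim=-1)
        if eos:
            break
    print()

    # The model's response is also appended, so context_ids keeps growing
    context_ids = torch.cat([context_ids, response_ids], dim=-1)
```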