Chatbots should let users select their preferred reading speed, measured in words per minute.
By dynamically adjusting batch sizes based on user-defined reading speeds, you could distribute requests more effectively, especially in large-scale distributed systems. Requests from users who prefer slower token generation can be packed into larger batches, maximising GPU throughput without compromising user experience (these users have explicitly indicated they are indifferent to, or may even prefer, higher latency).
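As a concrete illustration, here's a minimal sketch of how a scheduler might turn a reading-speed preference into a batch-size decision. Everything in it is an assumption for illustration: the 0.75 words-per-token ratio is a common rough heuristic that varies by tokenizer, `GPU_TOKENS_PER_SECOND` stands in for a replica's measured decode throughput, and `batch_size_for` is a hypothetical helper, not any real serving framework's API.

```python
from dataclasses import dataclass

WORDS_PER_TOKEN = 0.75         # rough heuristic; actual ratio depends on the tokenizer
GPU_TOKENS_PER_SECOND = 1_000  # assumed sustained decode throughput of one replica

@dataclass
class Request:
    prompt: str
    reading_wpm: int | None  # user-selected reading speed; None = "as fast as possible"

def target_tokens_per_second(req: Request) -> float:
    """Convert a words-per-minute preference into a per-request token-rate budget."""
    return req.reading_wpm / WORDS_PER_TOKEN / 60.0

def batch_size_for(req: Request, max_batch: int = 64) -> int:
    """Pick how many concurrent requests this one can share a replica with.

    Slower readers need fewer tokens per second, so the replica can pack
    more of their requests into one batch without anyone noticing a delay.
    """
    if req.reading_wpm is None:
        return 1  # latency-sensitive: don't make this request wait on a batch
    rate = target_tokens_per_second(req)
    return max(1, min(max_batch, int(GPU_TOKENS_PER_SECOND // rate)))

# A 200 wpm reader only needs ~4.4 tokens/s, so dozens of such requests
# can share one replica; a 1000 wpm reader needs ~22 tokens/s and gets
# a smaller batch; a "no pacing" request gets the replica to itself.
print(batch_size_for(Request("explain monads", reading_wpm=200)))   # 64 (capped)
print(batch_size_for(Request("explain monads", reading_wpm=1000)))  # 45
print(batch_size_for(Request("refactor this", reading_wpm=None)))   # 1
```

A real scheduler would also have to account for prompt length, continuous batching, and queueing, but the inverse relationship is the point: the lower the per-request token rate a user will accept, the more requests one GPU can serve at once.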
On the user side, the preferred reading speed varies by task. When generating code, I want responses as quickly as possible. But when I'm bouncing ideas off an LLM, I'd prefer a more readable pace rather than a wall of text.