In this talk, Arnav Garg, ML Engineering Leader at Predibase, discusses new innovations in fine-tuned model inference. Specifically, he dives deep into Turbo LoRA, a new parameter-efficient fine-tuning method pioneered at Predibase that increases text generation throughput by 2-3x while matching the task-specific response quality of standard LoRA.
While existing fine-tuning methods focus only on improving response quality, often at the cost of throughput, Turbo LoRA improves throughput over both standard LoRA and even the base model. This means lower inference costs, lower latency, and higher accuracy, all in a single adapter that can be created with just one line of code. Watch the replay for a technical deep dive on how this new approach to GenAI inference works under the hood.
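For context on the LoRA side of the technique, here is a minimal, illustrative sketch of a LoRA-adapted linear layer (not Predibase's implementation; all variable names and dimensions are assumptions for the example). LoRA freezes the base weight and learns a low-rank update, so the adapter adds very few trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 16, 16, 4           # rank << d_in keeps the adapter tiny
W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))             # trainable up-projection, init to zero

def lora_forward(x, scale=1.0):
    """Base output plus the low-rank correction: y = x W^T + scale * x A^T B^T."""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), x @ W.T)
```

After training, the update `B @ A` can be merged into `W` for inference, so the adapted model runs with no extra matmuls per token; Turbo LoRA's throughput gains come from additional inference-time techniques discussed in the talk.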
--------------------------------------------------------------------------------------------------------------------------------------
Session slides: [ Link ]
Try Predibase for free: [ Link ]