Latent Space Podcast 8/10/23 [Summary]: LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML

Explore the magic of MLC with Tianqi Chen: deploying 70B models on browsers & iPhones. Dive into XGBoost, TVM's creation, & the future of universal AI deployments.

Prof. Otto Nomos · Oct 05, 2023 · 6 min read

Original Link: LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML

Summary

About TQ and XGBoost

In the recent episode of the Latent Space podcast, hosts Alessio and Swyx sat down with Tianqi Chen (TQ), an assistant professor at Carnegie Mellon University and a leading figure in the machine learning community. Tianqi wears many hats, including being associated with Catalyst Group and OctoML, and has a significant footprint in the open-source ecosystem, especially with projects like Apache TVM, XGBoost, and MXNet.

In a candid conversation, TQ shared that beyond his technical persona, he has a unique hobby of sketching design diagrams in real sketchbooks, chronicling his journey through various projects. These sketches serve as a blueprint for his software projects and provide a tangible record of his thought processes over the years.

Tianqi’s acclaimed project, XGBoost, came up for discussion, highlighting its origins and unexpected success. Originally designed as an alternative to the rising trend of deep learning models, XGBoost ended up establishing its own niche, particularly for tabular data, where tree-based models excel. The discussion gravitated toward the balance and potential amalgamation of tree-based models and deep learning. TQ believes in the lasting relevance of tree-based models, especially given their natural rule structure, scalability, and interpretability. The talk wrapped up with a glimpse into the future, hinting at the merging of transformer models and tree-based algorithms for enhanced data processing.
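
As a quick illustration of tree ensembles on tabular data, here is a minimal XGBoost sketch on synthetic data; the dataset and hyperparameters are invented for illustration and are not from the episode.

```python
# Minimal XGBoost sketch: a tree ensemble learning a nonlinear rule
# on synthetic tabular data. Dataset and hyperparameters are made up
# purely for illustration.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)).astype(np.float32)
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)  # nonlinear target

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dtest = xgb.DMatrix(X[800:], label=y[800:])

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.3}
booster = xgb.train(params, dtrain, num_boost_round=50)

preds = (booster.predict(dtest) > 0.5).astype(int)
print("held-out accuracy:", (preds == y[800:]).mean())
```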


TVM Compiler, MXNet

Alessio brought up Tianqi's development of the TVM compiler framework, which was released around the same time as ONNX, seeking clarity on their relationship. Tianqi recalled his early history with deep learning, mentioning his work on ImageNet classification using convolutional restricted Boltzmann machines before the emergence of AlexNet. He recounted spending months handcrafting NVIDIA CUDA kernels, only to find the resulting model wasn't very effective. This experience introduced him to the complexities of optimizing performance on GPUs.

Following his work on XGBoost, Tianqi collaborated on MXNet, which emerged in the same era as Caffe and before PyTorch. Recognizing the difficulty of optimizing for different hardware, Tianqi sought a more automated and general solution, leading to the development of TVM. The TVM compiler takes in machine learning programs, applies optimization techniques, and generates low-level code for a range of backends, including both NVIDIA and non-NVIDIA platforms.
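
As a rough illustration of that flow, the sketch below compiles a one-layer model with TVM's classic Relay front end and runs it on CPU. The model and target are placeholders, and the newer TVM Unity/Relax stack uses different entry points.

```python
# Sketch of the TVM flow: high-level model in, optimized low-level
# code out. Uses the classic Relay API; the model and target are
# placeholders for illustration.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Describe a tiny model (one dense layer) in TVM's high-level IR.
x = relay.var("x", shape=(1, 64), dtype="float32")
w = relay.var("w", shape=(32, 64), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

# Compile: TVM runs its optimization passes and emits backend code.
# "llvm" targets CPU here; "cuda", "metal", "vulkan", or "webgpu"
# would target the other backends mentioned in the conversation.
lib = relay.build(mod, target="llvm")

# Execute the generated code.
dev = tvm.cpu()
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("x", np.random.rand(1, 64).astype("float32"))
rt.set_input("w", np.random.rand(32, 64).astype("float32"))
rt.run()
print(rt.get_output(0).shape)  # (1, 32)
```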

While Tianqi's shift from XGBoost to TVM seemed significant to Alessio, Tianqi clarified his motivation was less about impact and more about enjoying the coding process and addressing challenges. He identified as a problem-solver, and when faced with challenges, he seeks out tools or creates new ones to address them. This approach, he mentioned, is in line with an emerging trend in machine learning systems that consider both algorithmic and system optimizations.

Discussing the community's growth, Tianqi highlighted MLSys, a conference focused on machine learning systems. Swyx noted Tianqi's involvement in major conferences like ICML and NeurIPS, suggesting that community organization plays a role in his work, to which Tianqi responded affirmatively, noting it is part of his academic responsibilities.


MLSys, MLC LLM & MLC

In a conversation between Swyx, Tianqi, and Alessio, the discussion revolves around MLSys, MLC LLM, and machine learning compilation (MLC). Here are the key takeaways:

  1. MLSys and MLC LLM: Swyx notes Tianqi's recent venture in MLSys and its integration with MLC LLM on mobile phones. He mentions using Llama 2 and Vicuna but seeks clarity about the other models on offer.

  2. Tianqi's MLC Journey: Tianqi explains his venture into MLC as an evolution of the initial TVM project. The main goal is to build an effective machine learning compiler. Building on the experience gained with TVM, the team embarked on a second iteration named TVM Unity. MLC LLM is essentially an MLC initiative to develop machine learning compilation technology that can be applied broadly. One achievement is getting large language models to run on phones and other consumer devices, including Apple's M2 Macs (a usage sketch follows this list).

  3. Integration with PyTorch: Addressing Swyx's query about model integrations, Tianqi highlights that while many models are built in PyTorch, the aim is to bring them into TVM's program representation, TVMScript (an example appears after this list). The goal is to optimize the models across various platforms and ensure they are portable and efficient.

  4. MLC as a Discipline: Swyx points out that while many people specialize in compilers, Tianqi's niche in MLC seems innovative. Tianqi believes machine learning compilation will grow as a field, drawing inspiration from existing compiler optimizations and incorporating knowledge from machine learning and systems.

  5. Optimization and Libraries: Discussing the limitations of relying solely on existing libraries for optimization, Tianqi elaborates on TVM's approach, which combines calling into available vendor libraries and automatically generating kernels where none exist. This hybrid method makes it possible to support hardware that lacks mature libraries.

  6. Core Optimization Techniques: Tianqi touches upon four essential optimization techniques: kernel fusion (combining adjacent operations into a single kernel), memory planning (allocating memory ahead of time), loop transformation (restructuring loops so generated code runs efficiently), and weight quantization (shrinking weights to reduce memory usage; a sketch follows this list). These methods allow machine learning models to run both efficiently and portably across various platforms.
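
As a concrete companion to item 2: around the time of this episode, MLC LLM shipped a Python chat wrapper over its compiled models. The sketch below assumes the `mlc_chat.ChatModule` interface and a prebuilt quantized Llama 2 model id; both names are assumptions that may have changed since, so treat this as illustrative rather than authoritative.

```python
# Hypothetical usage sketch of MLC LLM's Python chat interface as of
# mid-2023. The package, class, and model names are assumptions;
# consult the current MLC LLM docs for the real entry points.
from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")  # prebuilt, quantized
print(cm.generate(prompt="What can run on an iPhone?"))
```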
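
For item 3, here is a rough sketch of what a trivial function looks like once expressed in TVMScript, TVM's Python-syntax program representation. The exact annotation syntax varies across TVM versions; this follows the general style of TVM's documentation.

```python
# Rough sketch of TVMScript, TVM's Python-syntax program
# representation. Annotation syntax varies across TVM versions.
import numpy as np
import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class AddModule:
    @T.prim_func
    def add(A: T.Buffer((8,), "float32"),
            B: T.Buffer((8,), "float32"),
            C: T.Buffer((8,), "float32")):
        T.func_attr({"global_symbol": "add", "tir.noalias": True})
        for i in range(8):
            with T.block("C"):
                vi = T.axis.spatial(8, i)
                C[vi] = A[vi] + B[vi]

# Once a program lives in this form, TVM can transform its loops and
# emit code for any supported backend.
rt_mod = tvm.build(AddModule, target="llvm")
a = tvm.nd.array(np.arange(8, dtype="float32"))
b = tvm.nd.array(np.ones(8, dtype="float32"))
c = tvm.nd.empty((8,), "float32")
rt_mod["add"](a, b, c)
print(c.numpy())  # [1. 2. 3. 4. 5. 6. 7. 8.]
```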
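
And for item 6, a self-contained NumPy sketch of the group-wise 4-bit weight quantization idea. This is a simplified illustration of the memory-saving principle, not MLC's actual scheme (real implementations also pack two 4-bit values per byte, among other details).

```python
# NumPy sketch of group-wise symmetric 4-bit weight quantization,
# the memory-reduction idea behind schemes like MLC's q4f16.
# Simplified and illustrative only.
import numpy as np

def quantize_4bit(w, group_size=32):
    """Quantize a 1-D float32 vector to signed 4-bit ints per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4: -7..7
    scale = np.maximum(scale, 1e-8)                          # avoid /0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_4bit(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
# Storage: 4 bits per weight plus one scale per 32 weights,
# versus 32 bits per weight in float32, roughly a 7x reduction.
```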

In essence, the conversation underscores the significance of MLC and the evolution of platforms and optimization techniques to make machine learning models more universally applicable and efficient.


LLM in Browser

In a discussion with Swyx, Tianqi sheds light on the emerging trend of academics, like himself, transitioning from solely publishing insights to building tangible products, such as open-source projects and applications. Tianqi believes this hands-on approach enables academics to confront real-world problems directly and to ensure that their research provides immediate value to the public. In his field, machine learning systems, Tianqi sees the potential of deploying these systems into users' hands to drive innovation and solve genuine problems.

He elaborates on his experience with running a 70 billion parameter model in a browser, specifically highlighting the challenges and requirements of executing such a feat. Using the latest MacBook with an M2 Max and WebGPU technology, Tianqi's team was able to successfully run the model, showcasing the possibility of operating powerful models on consumer devices without needing installations. He envisions diverse application scenarios, including hybrid models that run both on-edge and server components.

Alessio asks about browser model integrations, and Tianqi introduces an NPM package, WebLLM, which allows developers to embed these models in their web apps. Additionally, an OpenAI-compatible REST API is under development to streamline integration further.
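
Because that planned server would speak the OpenAI wire format, client code should look like any OpenAI-style call. A sketch, where the endpoint URL and model id are assumptions:

```python
# Sketch of calling an OpenAI-compatible REST endpoint. The local
# URL and model id below are hypothetical placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",  # hypothetical server
    json={
        "model": "Llama-2-7b-chat-q4f16_1",       # hypothetical model id
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```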

Lastly, Swyx touches upon the challenge of model downloads, and Tianqi points to Chrome's cache system, which prevents redundant downloads across web apps that share the same model. When asked about the proliferation of local model projects, Tianqi emphasizes the importance of enhancing API capabilities and encourages a collaborative ecosystem focused on universal deployment.


OctoML & Conclusion

Alessio initiates the conversation by discussing Tianqi's role as co-founder of OctoML and its recent release of OctoAI, a compute service focused on model runtime optimization. He inquires about OctoML's evolution from a traditional MLOps tool to its current stance with OctoAI, particularly in the context of the market's shift toward pre-trained generative models.

Tianqi explains that they identified challenges related to scalability, integration, and optimization. As the market shifts toward generative AI, OctoAI aims to simplify the process and alleviate complexity for users, allowing them to focus on their models while OctoML handles the underlying infrastructure.

Alessio points out that a significant bottleneck in the market is around the execution of AI models. Earlier, the challenge was around building models due to a lack of talent, but now, with numerous models available, the challenge lies in running them efficiently.

Tianqi underscores the nuances associated with "running" AI models. Given the diversity in hardware availability and ever-changing user requests, execution challenges have multiplied. Efficiently managing model locations and ensuring proximity to the execution environment are paramount. The future, according to Tianqi, involves leveraging all available hardware to reduce costs and optimize the interplay between edge devices and the cloud.

When Alessio probes about the challenges of abstracting hardware details from end users, Tianqi emphasizes the importance of compatibility with various hardware and the ongoing iterative process of refining their product based on user needs and feedback.

Swyx steers the conversation towards the broader AI landscape, where Tianqi shares his enthusiasm for open-source projects, especially those that champion inter-model interactions, and the prospect of a diverse ecosystem of AI agents.

Swyx then inquires about potential architectures succeeding transformers. Tianqi mentions models like RWKV and other recurrent networks integrated with transformers, emphasizing the ongoing growth in the model space.

In the lightning round, Tianqi reveals his surprise at the swift emergence of conversational chatbots. When questioned about the most intriguing unsolved question in AI, he expresses his fascination with continuous learning and lifelong learning for AI.

As a final takeaway, Tianqi encourages listeners to adopt a holistic approach when building AI applications. A successful AI system demands the fusion of algorithms, system optimizations, and data curations.
