GenAI TODAY NEWS

Larger Models, Faster Inference: Cloudflare's AI Platform Gets a GPU Boost

By Greg Tavarez

As large language models continue to advance, efficiency work is steadily shrinking the size and computational demands of many of them. This trend is driving the development of smaller, more efficient LLMs that run on a wider range of devices, from powerful servers to smartphones and even wearables.

Even with these advancements, network speeds will still play a crucial role in the overall user experience and in customer adoption of AI-powered applications.

The main network bottleneck is that hosted LLMs require large amounts of data to move between the user's device and the remote server running the model. When network speeds are slow or unreliable, the result is frustrating delays and interruptions: a user's query can take several seconds, or even minutes, to be processed and answered, which undermines the perceived responsiveness and effectiveness of the AI.
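To make that concrete, the delay can be sketched with back-of-the-envelope arithmetic. The numbers below (round-trip time, payload size, bandwidth, inference time) are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope latency for a hosted-LLM query.
# All inputs are hypothetical, for illustration only.

def query_latency_ms(rtt_ms: float, payload_kb: float,
                     bandwidth_mbps: float, inference_ms: float) -> float:
    """Rough end-to-end latency: one round trip, plus payload transfer,
    plus the server's inference time."""
    transfer_ms = (payload_kb * 8) / (bandwidth_mbps * 1000) * 1000
    return rtt_ms + transfer_ms + inference_ms

# A distant server (150 ms RTT) vs. a nearby edge location (20 ms RTT):
far = query_latency_ms(rtt_ms=150, payload_kb=4, bandwidth_mbps=10, inference_ms=800)
near = query_latency_ms(rtt_ms=20, payload_kb=4, bandwidth_mbps=10, inference_ms=800)
# Moving inference closer saves the full RTT difference on every exchange,
# and real chat sessions involve many such exchanges.
```

With these assumed numbers the nearby path is 130 ms faster per request; over a multi-turn conversation, and over unreliable links with retries, the gap compounds.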

Additionally, the increasing complexity of AI applications (such as real-time translation, voice assistants and content generation) will demand even higher network bandwidths.

As LLMs become more accessible and widely integrated into various aspects of our lives, network infrastructure will need to evolve to meet the growing demands for speed and reliability.

So, Cloudflare is taking on that challenge.

Cloudflare helps businesses make their online operations faster and more secure. Its technology improves the speed and reliability of websites, applications and networks while protecting them from cyber threats, helping businesses reduce costs and improve overall performance.

Recently, Cloudflare announced new capabilities for Workers AI, the serverless AI platform, and its suite of AI application building blocks to help developers build faster, more performant AI applications.

Workers AI is a platform that allows developers to build and deploy custom AI applications directly at the edge of the network. This means that AI processing happens closer to the user, resulting in reduced latency and improved performance.

Workers AI offers a wide range of capabilities, including natural language processing (NLP), image and video analysis, machine learning capabilities and more. Developers use Workers AI to create innovative applications such as real-time chatbots, personalized recommendations, content moderation and predictive analytics.

Now, Cloudflare is upgrading its network with more powerful GPUs for Workers AI to improve AI inference performance. This will allow it to run larger AI models, like Llama 3.1 and 3.2, which can handle more difficult tasks and provide faster response times. Workers AI now has GPUs in more than 180 cities, positioned to deliver low-latency inference to end users around the world.

With this network of GPUs, Workers AI has one of the largest global footprints of any AI platform, and it is designed to run AI inference as close to the user as possible while keeping customer data closer to home. Applications built on Workers AI now benefit from faster inference, bigger models, improved performance analytics and more.
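The benefit of a wide footprint comes down to routing: a request can be served from whichever inference location answers with the lowest round-trip time. A minimal sketch of that selection, with hypothetical city names and RTTs:

```python
# Hypothetical round-trip times (ms) from one user to three inference locations.
rtts_ms = {"Frankfurt": 12.0, "London": 19.0, "Ashburn": 95.0}

def nearest_location(rtts: dict[str, float]) -> str:
    """Pick the inference location with the lowest measured round-trip time."""
    return min(rtts, key=rtts.get)

print(nearest_location(rtts_ms))  # prints "Frankfurt"
```

The more locations in the table, the closer the winning RTT gets to the user's last-mile latency, which is why a 180-city footprint matters more than raw GPU count for perceived responsiveness.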

Cloudflare's AI Gateway now has a new feature that allows developers to store users' prompts and model responses for extended periods. This helps them better understand how their application performs and refine it based on user feedback. Over 2 billion requests have been processed through AI Gateway since its launch last year.
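As a rough sketch of what persistent prompt/response logging enables (the field names and the flagging rule below are invented for illustration, not AI Gateway's actual schema):

```python
import time

# An in-memory stand-in for a gateway's persisted request log.
gateway_log: list[dict] = []

def record(model: str, prompt: str, response: str, latency_ms: float) -> None:
    """Append one prompt/response pair for later analysis."""
    gateway_log.append({
        "ts": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_ms": latency_ms,
    })

record("llama-3.1-8b", "Summarize my order status.", "Your order shipped Monday.", 412.0)
record("llama-3.1-8b", "Summarize my order status.", "", 1980.0)

# Mining the log later: flag empty replies and slow requests for review.
problems = [e for e in gateway_log if not e["response"] or e["latency_ms"] > 1000]
```

Keeping this history around is what lets a developer spot failure patterns and refine prompts or model choices based on real user traffic.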

Additionally, Cloudflare's vector database, Vectorize, now generally available, has been improved to support larger indexes and faster query times. This allows AI applications to find relevant information more quickly and efficiently.
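Under the hood, a vector database answers one question: which stored embeddings are closest to the query embedding? A minimal cosine-similarity sketch of that lookup (toy three-dimensional vectors; real services like Vectorize index far larger embeddings at far larger scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(index: dict[str, list[float]], query: list[float], k: int = 1) -> list[str]:
    """Return the k stored items most similar to the query vector."""
    return sorted(index, key=lambda doc: cosine(index[doc], query), reverse=True)[:k]

# Toy index: two documents embedded as 3-d vectors.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
}
print(top_k(index, [0.8, 0.2, 0.1]))  # prints ['refund policy']
```

Larger indexes and faster queries, the improvements cited above, directly speed up this retrieval step, which often sits on the critical path of every AI response.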

“As AI workloads shift from training to inference, performance and regional availability are going to be critical to supporting the next phase of AI,” said Matthew Prince, co-founder and CEO of Cloudflare. “Cloudflare is the most global AI platform on the market, and having GPUs in cities around the world is going to be what takes AI from a novel toy to a part of our everyday life, just like faster Internet did for smartphones.”

Be part of the discussion about the latest trends and developments in the Generative AI space at Generative AI Expo, taking place February 11-13, 2025 in Fort Lauderdale, Florida. Generative AI Expo covers the evolution of GenAI and will feature conversations focused on the potential for GenAI across industries and how the technology is already being used to help businesses improve operations, enhance customer experiences and unlock new growth.




Edited by Alex Passett

