How Criteo scaled to 290M QPS and cut its server footprint by 78%

Author: Steve Tuohy - Director of Product Marketing

Discover how Criteo scaled to 290M queries per second, swapped Couchbase + Memcached for Aerospike, cut servers by 78%, and kept sub-ms latency.



Criteo serves more than 700 million users daily with personalized ads, processing billions of events in milliseconds. At its peak, Criteo’s infrastructure handles 290 million key-value queries per second (QPS).

Until recently, this real-time engine ran on a complex stack of Couchbase and Memcached, propped up by custom C clients and costly operational overhead. Performance during rebalancing was fragile, cache warming required manual traffic shaping, and ….

But when Criteo rebuilt its stack using Aerospike’s patented Hybrid Memory Architecture (HMA), the company consolidated two systems into one, simplified operations, and cut its server footprint by 78%, all while maintaining sub-millisecond latency at global scale.

Maintaining real-time access at AdTech scale


Criteo isn’t just serving banner ads. It runs real-time auctions on the open web, responding to 20 million bid requests per second. Each request requires dozens of micro-decisions: audience scoring, campaign pacing, frequency caps, and more.

"Our key-value storage system peaks at about 290 million queries per second, which is fairly large,” said Maxime Brugidou, vice president of engineering, Criteo.

All of that happens in under 100 ms. The stack is written in C, runs on premises across 40,000 servers and seven data centers, and operates through Kubernetes and Apache Mesos.

The demands on storage? Sub-millisecond latency, no downtime during rebalancing or upgrades, and a globally distributed footprint.

The legacy stack: Couchbase + Memcached + custom C logic


Before adopting Aerospike, Criteo’s real-time infrastructure relied on Couchbase for the data, with Memcached as a caching layer. A custom-built C client orchestrated dual writes and replication between tiers, and maintained consistency between the two systems. “It was very difficult to make it perform well during rebalancing or maintenance,” Brugidou said. “It was quite unstable.”
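
As a rough illustration of the complexity that design pushed onto the client, the sketch below shows the kind of read-through and dual-write logic such a layer has to carry. The couchbase_* and memcached_* helpers are hypothetical placeholders for the two tiers, not Criteo's code or any real client library's API:

    /* Hypothetical client-side cache-aside / dual-write layer.
     * couchbase_* and memcached_* are illustrative placeholders,
     * not real library calls. */
    #include <stddef.h>

    int couchbase_get(const char *key, char *out, size_t out_len);
    int couchbase_put(const char *key, const char *value);
    int memcached_get(const char *key, char *out, size_t out_len);
    int memcached_set(const char *key, const char *value);

    /* Read path: try the cache, fall back to the store, then re-warm the cache. */
    int read_value(const char *key, char *out, size_t out_len)
    {
        if (memcached_get(key, out, out_len) == 0)
            return 0;                        /* cache hit, served from RAM */
        if (couchbase_get(key, out, out_len) != 0)
            return -1;                       /* miss in both tiers */
        memcached_set(key, out);             /* manual cache warming */
        return 0;
    }

    /* Write path: the durable tier first, then the cache. If the second
     * write fails, the tiers drift apart -- exactly the consistency problem
     * the custom client had to detect and repair. */
    int write_value(const char *key, const char *value)
    {
        if (couchbase_put(key, value) != 0)
            return -1;
        return memcached_set(key, value);
    }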


Operationally, this design required careful tuning and manual rerouting of traffic. Any node failure or rebalancing event degraded performance and demanded hands-on intervention.


The turning point: Aerospike’s Hybrid Memory Architecture (HMA)

Aerospike’s HMA changed the game. It decouples index storage from data storage (a configuration sketch follows the list below):

  • Indexes stay in RAM for fast access.
  • Data lives on SSDs (NVMe drives, in Criteo's case).
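
A minimal aerospike.conf excerpt along these lines might look like the following. Parameter names follow pre-7.0 server configuration; the namespace name, sizes, and device paths are placeholders rather than Criteo's settings:

    # aerospike.conf (excerpt): hybrid memory, index in RAM, records on SSD
    namespace profiles {
        replication-factor 2
        memory-size 64G                # budget for the in-RAM primary index

        storage-engine device {
            device /dev/nvme0n1        # records live on raw NVMe devices
            device /dev/nvme1n1
            write-block-size 128K
            data-in-memory false       # serve reads from SSD via the RAM index
        }
    }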


This model gave Criteo Memcached-level low latency with the durability of disk and let the company collapse its architecture from two databases into one. “There was this nice design with the index being in memory and the data on disk,” Brugidou said. “It allowed for really good tradeoffs.”

By keeping indexes in memory and serving records from fast SSDs, Aerospike hit sub-millisecond reads without needing RAM to hold entire datasets. For Criteo, this not only simplified its system design but also saved money.

"Aerospike combined with NVMe disks… we had basically Memcached performance except that we were using persistent storage,” Brugidou said. “That was quite impressive…Aerospike was able to keep reading steadily at high throughput and very, very low latency despite all the mess we were putting on the servers."

Kubernetes-native deployment: From custom scripts to automated ops


Criteo replaced both Couchbase and Memcached with Aerospike. The company removed its custom client drivers, adopted Aerospike’s native C client, and rolled out a Kubernetes-native deployment on-prem with the Aerospike Kubernetes Operator (sketched below).

Operational wins included:

  • Automatic node recovery and rebalancing via Kubernetes
  • Elimination of the custom C client, reducing complexity
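
For a rough sense of what that deployment model looks like, the Aerospike Kubernetes Operator drives clusters from a declarative custom resource. The sketch below is illustrative only; the API version, image tag, and exact field layout vary by operator release, and the values are placeholders, not Criteo's manifest:

    # Illustrative AerospikeCluster custom resource (fields abbreviated)
    apiVersion: asdb.aerospike.com/v1
    kind: AerospikeCluster
    metadata:
      name: kv-store                  # placeholder name
      namespace: aerospike
    spec:
      size: 6                         # desired node count; changing this lets
                                      # the operator handle rolling rebalancing
      image: aerospike/aerospike-server-enterprise:6.4.0.2
      aerospikeConfig:
        namespaces:
          - name: profiles
            replication-factor: 2
            storage-engine:
              type: device            # HMA layout: index in RAM, data on NVMe

Scaling or upgrading then becomes an edit to this resource; the operator performs the rolling changes that previously required custom scripts.
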
Multi-bin optimization: Collapsing data models for performance


Aerospike’s multi-bin optimization simplified things even more. Criteo merged multiple datasets into one namespace, which helped reduce index memory usage and improve access efficiency. “You can write to some bins very easily and read all the bins at once. You only pay for indexing once in memory. That was quite a significant win,” Brugidou said.
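
A minimal sketch of that access pattern with Aerospike's C client is below. The namespace, set, and bin names are invented for illustration, and error handling is trimmed to keep the example short:

    #include <aerospike/aerospike.h>
    #include <aerospike/aerospike_key.h>
    #include <aerospike/as_key.h>
    #include <aerospike/as_record.h>

    /* One record per user, several bins: each pipeline can update only the
     * bins it owns, while the serving path reads every bin in one request.
     * The primary index entry for the record is paid for once, in RAM. */
    void update_and_read(aerospike *as, as_error *err, const char *user_id)
    {
        as_key key;
        as_key_init_str(&key, "profiles", "users", user_id);

        /* Write just two bins; other bins on the record are left untouched. */
        as_record rec;
        as_record_inita(&rec, 2);
        as_record_set_int64(&rec, "freq_cap", 3);
        as_record_set_str(&rec, "segment", "in_market");
        aerospike_key_put(as, err, NULL, &key, &rec);

        /* Read all bins of the record in a single round trip. */
        as_record *full = NULL;
        if (aerospike_key_get(as, err, NULL, &key, &full) == AEROSPIKE_OK) {
            /* ... use the bins for scoring, pacing, frequency capping ... */
            as_record_destroy(full);
        }
    }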


Aerospike also made rebalancing and node failure recovery seamless to the team. “We do that transparently,” Brugidou said. “Aerospike is able to rebalance so easily… This is fully automated and very easy to do with the Kubernetes Operator.”

Multi-bin optimization also meant:

  • Reduced index memory usage by avoiding duplication
  • Faster lookups, because all bins could be read in one fetch

This further streamlined operations, aligning with Criteo's infrastructure-as-code strategy and reducing human intervention during failure recovery.

Infrastructure impact: 78% server reduction and lower carbon footprint


By replacing Couchbase and Memcached with Aerospike, Criteo reduced server count from more than 3,000 to just 720. This meant:

  • Lowered total RAM and disk footprint
  • Reduced power consumption and cooling

Criteo also began running on 100% renewable energy in 2022, accomplished with the purchase of certificates and the relocation of data centers to more sustainable locations.


“We’re doing many more queries for less electricity in the end. Way less,” said Brugidou. This migration supports Criteo’s science-based climate target to reduce electricity consumption by 42% by 2030.

Lessons learned: Performance tuning, not over-engineering


Despite the performance of the new system, Criteo didn’t jump straight to exotic optimizations. Instead, it prioritized simplicity:

  • Avoided pre-loading datasets in RAM
  • Let Aerospike handle replication and rebalancing
  • Tuned bin and namespace configs incrementally based on workload metrics

Brugidou emphasized that consistency and throughput remained stable even under stress tests, without resorting to low-level tuning.

Takeaway: Scaling real-time systems without scaling costs


By replacing a brittle, multi-tiered stack with Aerospike’s real-time engine, Criteo gained both technical efficiency and cost savings. It consolidated two systems into one, improved performance, reduced latency, and slashed infrastructure overhead, without sacrificing resiliency or scale.

For teams building real-time systems with massive QPS and tight SLAs, Criteo’s experience shows what becomes possible with the right storage architecture.

With Aerospike, Criteo:

  • Consolidated two systems into one
  • Maintained sub-millisecond latency
  • Reduced infrastructure costs and energy use
  • Gained reliability through automation

Want to learn more? Check out the webinar replay or talk to our team about what real-time data infrastructure can do for you.



Source:
