Wake Up, Jarvis — It’s Time to Rethink DXPs

Published: Jul 15, 2025


"Draft a blog post summarising yesterday’s design review," the developer mutters while sipping coffee.

"Would you like a conversational tone with the same forest-themed visuals you used last week?"

"Perfect. And highlight the second paragraph. I want to rework that."

"Highlighting now. Would you like a rephrasing suggestion as well?"

There’s a pause. No typing. No tabs open. Just a fluid, natural exchange between human and machine: contextual, intelligent, and efficient.

This isn’t science fiction.

It’s a signal, faint for some, obvious to others, that a new kind of interface is quietly emerging. 

One where AI doesn’t just react; it remembers. 

It doesn’t just understand commands; it grasps context. One day soon, it might anticipate what you need before you even ask.

We are close to building our own Jarvis, Iron Man’s AI assistant. 

And not the cinematic kind, but something real, evolving, and powered by two monumental shifts in AI: context engineering and multimodal intelligence.

From Prompts to Perception

Lately, there’s been a noticeable buzz in the AI community about something deeper than prompt hacks or clever APIs: context engineering, which is quickly becoming the backbone of how advanced AI systems are built.

Prompt engineering got us started and helped us figure out how to talk to machines. But context engineering will teach them how to think.

Context engineering is the emerging discipline of giving AI systems a memory, a worldview and a sense of continuity. It’s about designing systems that don’t just answer in the moment. Instead, they recall what happened before, understand what matters now and anticipate what comes next.

It’s already reshaping how we build AI (a quick sketch follows the list):

  • Retrieval-augmented generation (RAG) fetches relevant knowledge on the fly.
  • Memory systems retain preferences, past decisions and interaction history.
  • Agents make decisions based on evolving user behaviour, not static inputs.
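
To make the first two of those ideas concrete, here is a minimal Python sketch. The keyword-overlap retriever and the in-memory preference store are stand-ins for a real vector database and model call, and every name in it is illustrative rather than any particular product’s API; the point is simply that the prompt the model finally sees is assembled from retrieved knowledge plus remembered preferences, not from the user’s request alone.

```python
from dataclasses import dataclass, field

@dataclass
class ContextEngine:
    documents: list[str]                        # knowledge base for retrieval (RAG)
    memory: dict = field(default_factory=dict)  # long-lived preferences and history

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Naive keyword-overlap retrieval; a real stack would use vector search."""
        words = set(query.lower().split())
        ranked = sorted(self.documents,
                        key=lambda doc: len(words & set(doc.lower().split())),
                        reverse=True)
        return ranked[:k]

    def remember(self, key: str, value: str) -> None:
        self.memory[key] = value

    def build_prompt(self, user_request: str) -> str:
        """Assemble what the model actually sees: request + retrieved docs + memory."""
        knowledge = "\n".join(f"- {doc}" for doc in self.retrieve(user_request))
        prefs = "\n".join(f"- {k}: {v}" for k, v in self.memory.items())
        return (f"User request: {user_request}\n"
                f"Relevant knowledge:\n{knowledge}\n"
                f"Known preferences:\n{prefs}")

engine = ContextEngine(documents=[
    "Yesterday's design review covered the new navigation layout.",
    "The forest-themed visuals tested well with readers last week.",
    "Quarterly metrics are due at the end of the month.",
])
engine.remember("tone", "conversational")
engine.remember("visual style", "forest-themed")

# The assembled prompt, not the raw request, is what would be sent to the model.
print(engine.build_prompt("Draft a blog post summarising yesterday's design review"))
```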

But this is just the beginning. 

Our lives aren’t lived in neatly typed prompts. 

They’re messy, dynamic and multimodal. That’s where the next challenge begins.

Teaching AI to Feel the Room

Multimodality is the ability of AI systems to process and respond to information across multiple forms: text, images, audio, video or any combination of these. It mirrors how we, as humans, perceive and engage with our surroundings: not through one channel, but through a rich blend of sensory inputs.

Think of it as the moment the lights come on in a dark room. Context gives the AI a memory. But without multimodality, it’s blind and deaf to the world around it.

We interact with the world through voice, vision, text, touch, and space. So why should our digital experiences be any different?

For AI to serve as a true companion, it must:

  • Listen to your voice, even in noisy environments
  • Read and summarise what’s on your screen
  • Watch videos and understand their context
  • Recognise patterns in your behaviour across devices

This is what multimodality enables: a seamless blend of inputs that make digital systems more natural, intuitive and human.
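
One way to picture this is the “typed content parts” pattern that many multimodal model interfaces follow: each modality becomes a tagged piece of a single request, so the model can reason over all of them together. The sketch below is a generic, hypothetical shape rather than any specific vendor’s schema; the field names and helper functions are assumptions for illustration.

```python
import base64
from dataclasses import dataclass

@dataclass
class Part:
    kind: str   # "text", "image", "audio", ...
    data: str   # raw text, or base64-encoded bytes for binary modalities

def text_part(content: str) -> Part:
    return Part(kind="text", data=content)

def image_part(image_bytes: bytes) -> Part:
    # Binary modalities are encoded so everything travels in one request payload.
    return Part(kind="image", data=base64.b64encode(image_bytes).decode())

# One request can carry a spoken transcript, a screenshot and an instruction,
# so the system reasons over them together instead of one channel at a time.
request = [
    text_part("Transcript: 'Let's reuse the forest visuals from last week.'"),
    image_part(b"<placeholder bytes for a design-review screenshot>"),
    text_part("Summarise what changed in the navigation mock-up."),
]

for part in request:
    print(part.kind, "->", part.data[:40])
```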

But here’s the twist: multimodality and context engineering aren’t separate efforts; they’re interdependent.

When Memory Meets the Senses

A multimodal system without context is just a fancy sensor. It can see, hear and read, but doesn’t understand.

On the other hand, context engineering without multimodality is like trying to understand a movie with just the script. You miss the nuance, the emotion and the full picture.

To build truly intelligent agents, we need both:

  • Systems that can ingest voice, images and behaviour and make sense of them
  • Memory architectures that store, relate and retrieve multimodal experiences
  • Normalisation logic that connects a video clip to a spoken comment to a blog draft

Only then can an agent say, “You mentioned this idea in last week’s call. Shall we expand it into a post?” and show you the clip.
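
As a rough illustration of the second and third points above, here is a hypothetical sketch of a multimodal memory store: each experience keeps its modality, a normalised text summary and a set of shared topic tags, so a later query can pull related items back together regardless of how they originally arrived. The record fields, the tags and the example data are all assumptions made up for the sketch, not a description of how any particular platform does it.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryRecord:
    modality: str      # "video", "audio", "text", ...
    summary: str       # normalised textual description of the content
    topics: set[str]   # shared vocabulary that links records across modalities
    timestamp: datetime = field(default_factory=datetime.now)

class MultimodalMemory:
    def __init__(self) -> None:
        self.records: list[MemoryRecord] = []

    def store(self, record: MemoryRecord) -> None:
        self.records.append(record)

    def related(self, topic: str) -> list[MemoryRecord]:
        """Fetch every record touching a topic, whatever its original modality."""
        return [r for r in self.records if topic in r.topics]

memory = MultimodalMemory()
memory.store(MemoryRecord("video", "Clip from last week's call pitching a green-onboarding idea",
                          {"green onboarding", "call"}))
memory.store(MemoryRecord("audio", "Spoken comment: that onboarding idea could become a post",
                          {"green onboarding", "blog"}))
memory.store(MemoryRecord("text", "Unfinished blog draft on onboarding flows",
                          {"blog", "onboarding"}))

# When you come back to the draft, the agent can surface the clip and the comment.
for record in memory.related("green onboarding"):
    print(record.modality, "->", record.summary)
```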

DXPs, But Make Them Smart

Today, DXPs are still interface-first. You click, type and scroll. But what if they were conversation-first? What if they understood you before you acted?

With context-rich, multimodal AI, that’s what we’re building:

  • A DXP that knows your intent based on your history
  • One that moves with you across devices and modalities
  • One that can listen, respond and evolve as you do

It’s not just personalisation. It’s participation. Your DXP becomes an active contributor in your content creation journey.

Indeed, the future is exciting, but it’s also complex:

  • Memory at scale: How do you retain useful context without overloading the system?
  • Multimodal alignment: How do you connect a sentence to a screenshot to a spoken cue?
  • Latency: Can this work in real-time, across platforms?
  • Privacy and trust: How do we protect users while empowering intelligence?

At Contentstack, we’re tackling these challenges head-on because they’re worth it.

“By the way,” the assistant adds, “you mentioned in your memoir draft last weekend that you love the Chikmagalur pour-over from Third Wave. They’ve just opened a cozy new outlet about 10 minutes from your flat. You’ve got a 45-minute break before your next call. Should I see if there’s a quiet corner free so you can prep there?”

"Thanks, Jarvis. Go ahead and book it."

...and that’s how AI will stop being software and start being a presence in the not-so-distant future.


Suryanarayanan Ramamurthy is the Head of Data Science at Contentstack. Follow him on LinkedIn.

About Contentstack

The Contentstack team comprises highly skilled professionals specializing in product marketing, customer acquisition and retention, and digital marketing strategy. With extensive experience holding senior positions at renowned technology companies across Fortune 500, mid-size, and start-up sectors, our team offers impactful solutions based on diverse backgrounds and extensive industry knowledge.

Contentstack is on a mission to deliver the world’s best digital experiences through a fusion of cutting-edge content management, customer data, personalization, and AI technology. Iconic brands, such as AirFrance KLM, ASICS, Burberry, Mattel, Mitsubishi, and Walmart, depend on the platform to rise above the noise in today's crowded digital markets and gain their competitive edge.

In January 2025, Contentstack proudly secured its first-ever position as a Visionary in the 2025 Gartner® Magic Quadrant™ for Digital Experience Platforms (DXP). Further solidifying its prominent standing, Contentstack was recognized as a Leader in the Forrester Research, Inc. March 2025 report, “The Forrester Wave™: Content Management Systems (CMS), Q1 2025.” Contentstack was the only pure headless provider named as a Leader in the report, which evaluated 13 top CMS providers on 19 criteria for current offering and strategy.

Follow Contentstack on LinkedIn.
