Why Your AI Assistant Sometimes Forgets What You Just Said
Have you ever had a conversation with an AI like ChatGPT where it suddenly seemed to forget important details you mentioned earlier? You’re not alone. This frustrating experience happens to everyone from students to CEOs, and understanding why can transform how you work with these powerful tools.
How AI Memory Works
Unlike humans, who can recall conversations from weeks ago, AI assistants operate with a fixed memory constraint called a “context window.” This isn’t just a design choice — it’s a fundamental technical limitation.
What is a context window? Simply put, it's the AI's short-term memory capacity. When you chat with an AI assistant, it can only "see" a certain amount of text at once. Typical figures (these vary by model and version):
ChatGPT (GPT-4): ~32,000 tokens
Claude: ~100,000 tokens
Gemini: ~32,000 tokens
What’s a token? Tokens are the building blocks of AI language processing — not exactly words, but pieces of text:
Short words (“the”, “and”) = 1 token each
Longer words (“university”) = multiple tokens (e.g., “uni” + “versity”)
Punctuation and spaces also count as tokens
For reference, this paragraph uses approximately 60 tokens.
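If you're curious how this splitting actually happens, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer. The exact splits and counts differ from model to model, so treat the numbers as approximate:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

text = "Short words like 'the' stay whole, but 'university' may split into pieces."
tokens = enc.encode(text)

print(len(tokens), "tokens")              # approximate token count for this sentence
print([enc.decode([t]) for t in tokens])  # how the text was actually split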
The Forgetting Phenomenon
When your conversation exceeds the context window limit, the earliest parts get pushed out to make room for new information, just like items falling off a conveyor belt. The AI hasn’t truly “forgotten” information; it simply can’t access what’s no longer in its context window. This isn’t a glitch or random failure — it’s what happens when you hit the context window limit.
Example: Imagine you're analyzing Harry Potter and the Sorcerer's Stone with AI assistance:
You paste in the entire text of the book, which on its own consumes tens of thousands of tokens.
You discuss the key characters, like Harry, Hermione, and Ron, along with their relationships (~5,000 tokens).
You then ask about Harry’s first encounter with Voldemort at the end of the book.
At this point, you’ve likely exceeded the AI’s context window. The AI responds with general information about the battle between good and evil, but completely misses the specific details about Voldemort and Harry’s encounter because that information has been pushed out of its memory.
This is the fundamental limitation of all large language models, whether it’s Claude, GPT, or others. They can only “see” so much of the conversation history at once. Once that limit is exceeded, earlier information is forgotten, leaving gaps in the response.
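Chat products don't publish their exact truncation logic, but conceptually it behaves like a sliding window over the conversation. Here is a minimal sketch of that idea; the 8,000-token budget and the messages are purely illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 8_000  # illustrative limit, not any particular model's real window

def count_tokens(message: str) -> int:
    return len(enc.encode(message))

def trim_history(messages: list[str], budget: int = CONTEXT_BUDGET) -> list[str]:
    """Keep only the most recent messages that fit within the budget.

    Walks backwards from the newest message and stops once the budget is
    used up, so the oldest messages 'fall off the conveyor belt' first.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    "(the full text of the book...)",
    "Discussion of Harry, Hermione and Ron and their relationships...",
    "Question about Harry's first encounter with Voldemort",
]
print(trim_history(history))  # anything that no longer fits is simply gone
```

Everything the function drops is invisible to the model on the next turn, which is exactly why the book's details vanish in the example above.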
“Lost in the Middle Effect”
Even when information fits within the context window, AI models have another quirk: they focus more on what's at the beginning and end of your conversation, often losing track of details in the middle.
This happens because of how transformer models (the architecture behind most modern AI) process attention across text. It’s similar to how you might remember the introduction and conclusion of a long lecture but forget details from the middle section.
The Technical Reality: Why Size Matters
For technically inclined readers, context window limitations directly relate to hardware constraints:
Memory Requirements: Each token in the context window requires storing multiple vectors (arrays of floating-point numbers) for each layer of the model.
Computational Complexity: The self-attention mechanism that processes relationships between tokens scales quadratically. Doubling the context window means 4x the computational load.
Hardware Limitations: Running with larger context windows requires significantly more GPU memory (VRAM). Rough figures (a back-of-the-envelope estimate follows just below):
8K tokens on a 7B parameter model: ~16GB VRAM
32K tokens on a 70B parameter model: ~80GB VRAM
This is why running large models locally on consumer hardware typically means accepting shorter context windows.
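Here is that rough estimate, covering only the key/value cache (the per-token vectors mentioned above). The layer count and hidden size are illustrative 7B-class figures, and real deployments often shrink this with grouped-query attention or quantization, so treat the result as an upper bound that sits on top of the model weights themselves:

```python
def kv_cache_gib(context_tokens: int, num_layers: int,
                 hidden_size: int, bytes_per_value: int = 2) -> float:
    """Rough key/value-cache size in GiB.

    Every token stores one key vector and one value vector (the factor of 2)
    of `hidden_size` values for each layer, assumed here to be fp16 (2 bytes).
    """
    total_bytes = 2 * num_layers * hidden_size * bytes_per_value * context_tokens
    return total_bytes / 1024**3

# Illustrative 7B-class settings: 32 layers, hidden size 4096
print(f"{kv_cache_gib(8_000, 32, 4096):.1f} GiB")    # ~3.9 GiB at 8K tokens
print(f"{kv_cache_gib(32_000, 32, 4096):.1f} GiB")   # ~15.6 GiB at 32K tokens
```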
LM Studio: Context Windows in Action
For those who’ve experimented with LM Studio, these limitations become tangibly clear. When running models locally, you’re directly confronted with the tradeoffs between model size, context length, and performance.
A developer using LM Studio to run Mistral-7B on a system with 16GB VRAM tries to process a long code-refactoring conversation. After capping the context window at 8K tokens to stay within memory, the model successfully maintains the thread of the complex code discussion. However, when they try the same settings with a larger model like Llama-3-70B, the application crashes due to insufficient VRAM.
Let's walk through the key settings LM Studio exposes when you load a model:
Context Length: This is the number of tokens the model can process at once. In this case, the model supports up to 131,072 tokens, but the current setting is 100,555 tokens. Adjusting this changes how much context the model can keep in view when generating a response.
GPU Offload: This option specifies how much of the model is handed to the GPU. The current setting is 0 out of 34, meaning none of the load is offloaded to the GPU; the higher you set it, the more work the GPU handles.
CPU Thread Pool Size: This controls how many CPU threads are used for processing. In this case, it's set to 6. Increasing this can speed up processing on systems with multiple CPU cores. All three settings map onto the short code sketch below.
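LM Studio is a GUI, but its local inference is built on llama.cpp, and the same knobs are exposed by the llama-cpp-python bindings. A minimal sketch, assuming that package and a hypothetical local model file:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=8192,       # context length: how many tokens the model can see at once
    n_gpu_layers=0,   # GPU offload: 0 = CPU only; raise it to push more layers onto the GPU
    n_threads=6,      # CPU thread pool size
)

out = llm("Q: In one sentence, what is a context window?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```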
How to Work Smarter With AI Memory Limitations
Understanding these constraints lets you adapt your approach:
Start fresh conversations when switching topics. A new chat gives the AI a clean slate. Jumping from recipes to Instagram captions to quadratic equations in one convo is like cooking pasta in a chemistry lab… fun, but chaotic. Keep it clean, keep it crisp, one topic, one chat!
Try the "journalist approach": Put the most important information at the beginning of your message, then add details in order of decreasing importance (a tiny sketch of this ordering follows this list).
Break complex projects into focused sessions. Instead of one 2-hour conversation about your entire marketing strategy, have separate, focused chats for competitor analysis, messaging, and distribution channels.
Periodic summarization: In longer conversations, take a moment to recap the key points before moving forward with “So far we’ve discussed X, Y, and Z. Now let’s talk about…”
Watch the token count: Some AI interfaces show token usage — keep an eye on this number relative to the model’s maximum.
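There is no single correct template for the "journalist approach", but a sketch of the idea might look like this; the helper function and the example facts are made up purely for illustration:

```python
def build_prompt(key_facts: list[str], details: list[str], question: str) -> str:
    """Lead with what matters most, then taper off into supporting detail."""
    lines = ["Key facts (most important first):"]
    lines += [f"- {fact}" for fact in key_facts]
    lines += ["", "Supporting details:"]
    lines += [f"- {d}" for d in details]
    # Restating the question at the end keeps it near the "edge" of the prompt,
    # where models tend to pay the most attention
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

print(build_prompt(
    key_facts=["Launch date is 1 March", "Budget cap is $50k"],
    details=["Last campaign ran on Instagram only"],
    question="Draft a channel plan for the launch.",
))
```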
For Technical Users:
Use embeddings for document retrieval: Instead of feeding entire documents into the context window, use embedding-based search to retrieve only relevant sections (a minimal sketch follows this list).
Implement RAG (Retrieval Augmented Generation): Store information externally and retrieve only what’s needed for specific queries.
Context compression techniques: Apply summarization to compress earlier parts of the conversation while preserving key information.
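As a concrete illustration of the first two points, here is a minimal retrieval sketch: embed document chunks once, then pull only the most relevant chunks into the prompt instead of pasting the whole document. It assumes the sentence-transformers and numpy packages, and the model name and chunks are purely illustrative:

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, widely used embedding model

chunks = [
    "Harry first faces Voldemort, who shares Quirrell's body, in front of the Mirror of Erised.",
    "Hermione solves the potions riddle guarding the final chamber.",
    "The Sorting Hat places Harry in Gryffindor at the welcome feast.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query = "Describe Harry's first encounter with Voldemort."
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = chunk_vecs @ query_vec              # cosine similarity (vectors are normalized)
top = np.argsort(scores)[::-1][:2]           # keep the two best-matching chunks
context = "\n".join(chunks[i] for i in top)  # only this small slice enters the prompt
print(context)
```

The same pattern scales to whole document collections: store the embeddings in a vector store, retrieve at question time, and the context window only ever needs to hold the question plus a handful of relevant passages.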
Security Implications for Businesses
Larger context windows create new security considerations that businesses should be aware of:
Prompt injection vulnerabilities: More context space means more room for malicious prompts.
Data leakage risks: Extended memory increases the chance of sensitive information being included in responses.
Getting the Most from Your Digital Assistant
By understanding how AI memory works, you can craft more effective interactions for everything from homework help to enterprise-level tasks. Remember that even the most advanced AI systems have these fundamental memory limitations, not because they’re poorly designed, but because of the inherent challenges of processing language at scale.
Next time your AI assistant starts “acting weird” during a long conversation, you’ll know exactly what’s happening and how to fix it. It’s not broken, it’s just running out of memory for your brilliant conversation.
In Conclusion
AI assistants rely on a “context window” — once it’s full, they forget earlier messages.
Use focused, topic-specific chats. Jumping between tasks confuses the assistant.
Prioritize important details early in your prompt (“journalist approach”).
Large models and long memory = high GPU/VRAM demands (watch your specs!).
Extended memory brings security risks: prompt injection, sensitive data leaks.
Keep interactions clean, safe, and structured for the best results.
See you next Thursday! Keep using AI intelligently!