Research & Case Study

Solving the Knowledge Cutoff Problem in Stremini AI

How I made my chatbot understand current events

This was probably the biggest challenge I faced when building Stremini AI. The problem is simple but really frustrating: AI models only know things up to the point when they were trained. Ask about anything after that date, and they're basically guessing.

Understanding the Problem

So here's what I was dealing with. Gemini (the AI model I'm using) has a knowledge cutoff in early 2025. If you ask it about anything that happened after that date, it simply doesn't know. On its own, it can't tell you today's weather, current stock prices, recent news, or who won yesterday's game.

For a student using an educational chatbot, this is a huge problem. Imagine asking about a recent scientific discovery or current events for a homework assignment and getting "I don't have that information" or, worse, an answer the AI made up.

The Knowledge Cutoff Timeline:
Training cutoff: January 2025
Today's date: December 2025
Gap: Almost a full year of missing information

My Solution: Real-Time Data Integration

Instead of accepting this limitation, I built a system that gives the AI access to current information. The approach is pretty straightforward: when someone asks a question that needs recent data, the chatbot searches the web first, then passes those results to the model so it can use them in its answer.

How I Detect Time-Sensitive Questions

The first thing I needed was a way to figure out which questions actually need current data. I wrote a detection function:

function needsRealTimeData(message) {
  const lower = message.toLowerCase();
  
  const realTimeKeywords = [
    'today', 'now', 'current', 'currently', 'latest', 'recent', 'recently',
    'this week', 'this month', 'this year', 
    '2024', '2025',
    'news', 'update', 'happening', 'going on', 'breaking',
    'price', 'stock', 'weather', 'score', 'result', 'live',
    'status', 'situation', 'development', 'announcement',
    'just', 'yesterday', 'last week', 'last month'
  ];
  
  if (realTimeKeywords.some(keyword => lower.includes(keyword))) {
    return true;
  }
  
  const currentQuestionPatterns = [
    /what('s| is) (the )?(latest|current|today|new|happening)/i,
    /who (is|are) (the )?(current|now)/i,
    /how (much|many) (is|are|does|cost)/i,
    /when (is|did|will|does)/i,
  ];
  
  if (currentQuestionPatterns.some(pattern => pattern.test(message))) {
    return true;
  }
  
  return false;
}

Why this works: The function checks for explicit time references like "today" or "current", plus it looks for question patterns that usually need fresh data. Words like "latest" or "recent" are dead giveaways.
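
One caveat: `includes()` matches substrings, so the keyword 'now' also fires on "know" and 'live' fires on "deliver". A minimal sketch of a stricter variant is shown here, assuming whole-word matching is acceptable for this keyword list. `matchesKeyword`, `escapeRegExp`, and the shortened `KEYWORDS` array are hypothetical names for illustration, not part of the Stremini AI code:

```javascript
// Hypothetical helper: match keywords as whole words instead of substrings.
// KEYWORDS is a shortened stand-in for the realTimeKeywords array above.
const KEYWORDS = ['today', 'now', 'latest', 'live', 'just', 'this week', 'price'];

// Escape regex metacharacters so every keyword is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// One regex for the whole list. \b anchors each keyword to word boundaries,
// and the optional "s" keeps simple plurals such as "prices" matching.
const KEYWORD_PATTERN = new RegExp(
  `\\b(?:${KEYWORDS.map(escapeRegExp).join('|')})s?\\b`,
  'i'
);

function matchesKeyword(message) {
  return KEYWORD_PATTERN.test(message);
}

console.log(matchesKeyword('What is the price of gold today?'));
console.log(matchesKeyword('How do I know if my proof is correct?'));
```

The first call matches on "price" and "today"; the second does not match, because "know" is no longer treated as containing "now". The trade-off is that irregular word forms need to be listed explicitly.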

Adding Current Date Context

One simple but important thing I do is inject the current date into every conversation. This helps the AI understand temporal context:

function getCurrentDateTime() {
  const now = new Date();
  return {
    utc: now.toUTCString(),
    ist: now.toLocaleString('en-IN', { timeZone: 'Asia/Kolkata' }),
    timestamp: now.toISOString(),
    // Date parts are all pinned to UTC so they agree with each other
    // regardless of the server's local time zone.
    year: now.getUTCFullYear(),
    month: now.toLocaleString('en-US', { month: 'long', timeZone: 'UTC' }),
    day: now.getUTCDate(),
    dayOfWeek: now.toLocaleString('en-US', { weekday: 'long', timeZone: 'UTC' }),
    unixTimestamp: Math.floor(now.getTime() / 1000)
  };
}

Then I include this in the system prompt:

Current Date: ${dateTime.dayOfWeek}, ${dateTime.month} ${dateTime.day}, ${dateTime.year}
Current Time (IST): ${dateTime.ist}

This seems basic, but it's actually really important. Without it, the AI might not even realize when a question is time-sensitive.
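
Here is a simplified sketch of how the date and any search results can be combined into one system prompt. The prompt wording, the function name `buildSystemPromptSketch`, and the `title` and `snippet` result fields are assumptions for illustration, not the exact production prompt:

```javascript
// Sketch only: prompt wording and result fields (title, snippet, url)
// are assumptions, not the production Stremini AI prompt.
function buildSystemPromptSketch(dateTime, searchResults) {
  const lines = [
    'You are an educational assistant.',
    `Current Date: ${dateTime.dayOfWeek}, ${dateTime.month} ${dateTime.day}, ${dateTime.year}`,
    `Current Time (IST): ${dateTime.ist}`,
  ];

  if (searchResults && searchResults.length > 0) {
    lines.push(
      '',
      'Use the web search results below for anything after your training cutoff,',
      'and cite the source URL for each fact you take from them:'
    );
    searchResults.forEach((r, i) => {
      lines.push(`[${i + 1}] ${r.title} - ${r.snippet} (${r.url})`);
    });
  } else {
    lines.push(
      '',
      'No web search results are available. If a question needs information',
      'newer than your training data, say so instead of guessing.'
    );
  }

  return lines.join('\n');
}

// Example with fixed values so the output is reproducible.
const prompt = buildSystemPromptSketch(
  { dayOfWeek: 'Monday', month: 'December', day: 1, year: 2025, ist: '1/12/2025, 10:00:00 am' },
  [{ title: 'Example headline', snippet: 'Example snippet.', url: 'https://example.com/a' }]
);
console.log(prompt);
```

The explicit "no results" branch is a design choice: it tells the model to admit the gap rather than guess when the search returns nothing.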

Category Detection for Better Search

Different types of questions need different approaches. I categorize queries to optimize the search:

function detectCategory(message) {
  const lower = message.toLowerCase();
  
  // Check for news first: needsRealTimeData() also matches words like
  // "news" and "latest", so it would otherwise shadow this branch.
  if (/\b(news|headlines|breaking)\b/i.test(lower)) {
    return 'news';
  }
  
  if (needsRealTimeData(lower)) {
    return 'realtime';
  }
  
  if (/\b(math|calculus|algebra|geometry|equation|formula|theorem)\b/i.test(lower)) {
    return 'math';
  }
  if (/\b(biology|chemistry|physics|science|experiment|molecule|atom)\b/i.test(lower)) {
    return 'science';
  }
  if (/\b(code|programming|javascript|python|function|algorithm|debug)\b/i.test(lower)) {
    return 'programming';
  }
  
  return 'general';
}

Category Examples:

"What's the latest news in AI?" → News category
"Current stock price of Tesla" → Realtime category
"How does photosynthesis work in biology?" → Science category (matched on "biology")

Using Trusted Sources

Not all websites are equally reliable, especially for students. I maintain lists of trusted sources by category:

const TRUSTED_SOURCES = {
  general: [
    'wikipedia.org', 'britannica.com', 'khanacademy.org', 
    'coursera.org', 'edu'
  ],
  science: [
    'ncbi.nlm.nih.gov', 'nature.com', 'sciencedirect.com', 
    'arxiv.org', 'scientificamerican.com'
  ],
  math: [
    'wolframalpha.com', 'mathworld.wolfram.com', 'brilliant.org'
  ],
  programming: [
    'stackoverflow.com', 'github.com', 'mdn.mozilla.org', 
    'w3schools.com', 'geeksforgeeks.org'
  ],
  news: [
    'bbc.com', 'reuters.com', 'apnews.com', 'theguardian.com', 
    'cnn.com', 'cnbc.com', 'news'
  ],
};

When search results come back, I flag which ones are from trusted domains:

const trustedDomains = [
  ...TRUSTED_SOURCES.general,
  ...(TRUSTED_SOURCES[category] || []),
  ...(TRUSTED_SOURCES.news || [])
];

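// Compare hostnames (not raw URL substrings) so 'edu' means the .edu TLD.
const isTrusted = (url) => {
  try {
    const host = new URL(url).hostname.toLowerCase();
    return trustedDomains.some(d => host === d || host.endsWith(`.${d}`));
  } catch { return false; }
};

const finalResults = results.map(r => ({
  ...r,
  trusted: isTrusted(r.url),
  provider: 'Serper'
}));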
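
Once results carry the `trusted` flag, it can be used to put reliable sources first. A minimal sketch, assuming the search engine's order should be kept within each group; `prioritizeTrusted` and its `limit` parameter are hypothetical names:

```javascript
// Hypothetical helper: move trusted results to the front while keeping the
// search engine's original order within each group.
function prioritizeTrusted(results, limit = 5) {
  const trusted = results.filter(r => r.trusted);
  const others = results.filter(r => !r.trusted);
  return [...trusted, ...others].slice(0, limit);
}

const ranked = prioritizeTrusted([
  { url: 'https://example.com/post', trusted: false },
  { url: 'https://www.bbc.com/news/story', trusted: true },
  { url: 'https://blog.example.org/x', trusted: false },
  { url: 'https://www.nature.com/articles/y', trusted: true },
]);
console.log(ranked.map(r => r.url));
```

Untrusted results are kept rather than dropped, so the model still has something to work with when no trusted source covers the question.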

The Complete Implementation

Here's how everything works together in the actual endpoint:

chatRoutes.post('/message', async (c) => {
  const { message, enableResearch = true } = await c.req.json();
  
  const sanitizedMessage = sanitizeInput(message);
  const dateTime = getCurrentDateTime();
  
  // Check if we need real-time data
  let searchResults = null;
  if (enableResearch && needsRealTimeData(sanitizedMessage)) {
    const category = detectCategory(sanitizedMessage);
    const searchQuery = buildSearchQuery(sanitizedMessage, category);
    
    console.log(`Research enabled: "${searchQuery}" [${category}]`);
    searchResults = await performWebSearch(searchQuery, category, c.env);
    
    if (searchResults && searchResults.length > 0) {
      console.log(`Using ${searchResults.length} real-time sources`);
    }
  }
  
  // Build system prompt with current date and search results
  const genAI = new GoogleGenerativeAI(apiKey);
  const model = genAI.getGenerativeModel({ 
    model: 'gemini-2.5-flash',
    systemInstruction: buildSystemPrompt(dateTime, searchResults)
  });
  
  // Generate response
  const result = await model.generateContent(sanitizedMessage);
  const text = result.response.text();
  
  return c.json({
    success: true,
    response: text,
    timestamp: dateTime.timestamp,
    sources: searchResults || [],
    researchPerformed: !!searchResults,
    searchProvider: searchResults ? 'Serper' : null,
    resultCount: searchResults?.length || 0
  });
});

Before vs After

WITHOUT Temporal Awareness

Q: "What happened in tech news this week?"
A: "I don't have access to current news. My knowledge cutoff is January 2025."
(Completely useless)

WITH Temporal Awareness

Q: "What happened in tech news this week?"
A: "This week's major tech news includes OpenAI's new model release and Apple's updated product lineup. (Sources: techcrunch.com, theverge.com)"
(Actually helpful)

WITHOUT Temporal Awareness

Q: "Is it going to rain today?"
A: "I cannot provide real-time weather information."
(Not useful)

WITH Temporal Awareness

Q: "Is it going to rain today?"
A: "Today's forecast shows a 70% chance of rain in your area starting around 3 PM. (Source: weather.com)"
(Actually answers the question)

Handling Edge Cases

There were a bunch of edge cases I had to handle:

API Timeouts

Sometimes the search API takes too long. I added timeout handling:

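const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 8000);

let response;
try {
  response = await fetch('https://google.serper.dev/search', {
    method: 'POST',
    headers: {
      'X-API-KEY': apiKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ q: query, num: 5 }),
    signal: controller.signal
  });
} catch (err) {
  // fetch rejects when the timer aborts it (or the network fails)
  console.warn(`Search request failed: ${err.name}`);
  return null;
} finally {
  clearTimeout(timeoutId); // always clear the timer, even on failure
}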

No Results Found

What if the search returns nothing? I handle that gracefully:

if (!results || results.length === 0) {
  console.warn('No search results found');
  return null;
}

// In the main handler:
if (searchResults && searchResults.length > 0) {
  console.log(`Using ${searchResults.length} real-time sources`);
} else {
  console.warn('No search results - Check SERPER_API_KEY');
}

Year Context for Current Events

When someone asks about current events without specifying a year, I automatically add it:

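if (category === 'realtime' || category === 'news') {
  // Use the actual current year so this keeps working after 2025
  const currentYear = new Date().getUTCFullYear();
  if (!/\b20\d{2}\b/.test(query)) {
    query += ` ${currentYear}`;
  }
}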

This prevents getting outdated results from previous years.

Results After Implementation

After adding temporal awareness:

Can answer questions about events that happened after the model's training cutoff
Provides accurate current information (weather, prices, news)
Automatically cites sources for verification
Search completes in under 2 seconds on average
Falls back gracefully when search fails

What I Learned

Context is everything. Just adding the current date to the system prompt made a huge difference in how the AI understands time-sensitive questions.
Detection matters more than I thought. Getting the time-sensitivity detection right was harder than the actual search implementation. False positives waste API calls; false negatives give outdated answers.
Trusted sources help a lot. Students (and teachers) care about where information comes from. Prioritizing .edu domains and established sources builds trust.
Timeouts are necessary. Without timeout handling, a slow API can hang the entire chatbot. 8 seconds is my sweet spot.
Year context prevents confusion. Adding the current year to searches for recent events prevents getting results from previous years with similar events.

Things I'd Improve

If I were starting over or had more time:

Add caching for frequently asked current questions (like "today's weather")
Implement a confidence score for whether search is actually needed
Add fallback search providers in case Serper is down
Build a feedback system where users can flag outdated or incorrect information
Create better handling for questions that need both historical and current data
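
For the caching idea, a small in-memory cache with a time-to-live would be a reasonable starting point. This is a sketch of one possible design, not something Stremini AI currently does; `createSearchCache`, the 10-minute default TTL, and the injectable clock are all assumptions:

```javascript
// Hypothetical in-memory cache with a time-to-live (TTL) per entry.
// `now` is injectable so expiry can be tested without waiting.
function createSearchCache(ttlMs = 10 * 60 * 1000, now = () => Date.now()) {
  const entries = new Map();

  // Normalize so "Today's Weather" and " today's weather " share one entry.
  const keyFor = (query) => query.trim().toLowerCase();

  return {
    get(query) {
      const key = keyFor(query);
      const entry = entries.get(key);
      if (!entry) return null;
      if (now() - entry.storedAt >= ttlMs) {
        entries.delete(key); // expired: drop it and report a miss
        return null;
      }
      return entry.results;
    },
    set(query, results) {
      entries.set(keyFor(query), { results, storedAt: now() });
    },
  };
}

// Usage with a fake clock to show a hit followed by expiry.
let fakeTime = 0;
const cache = createSearchCache(1000, () => fakeTime);
cache.set("Today's weather", [{ url: 'https://example.com/weather' }]);
const hit = cache.get(" today's weather ");
fakeTime = 1500;
const miss = cache.get("Today's weather");
console.log(hit !== null, miss === null);
```

If the backend runs on a serverless platform, in-memory state like this only lives inside a single instance, so a shared store would be needed to cache across instances.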

Why This Approach Works

The key insight is that you don't need to retrain the entire model to give it current information. You just need to:

Detect when current information is needed
Fetch that information from reliable sources
Inject it into the context before the AI responds
Make sure the AI knows to use that fresh data

This is way more practical than trying to continuously retrain models, and it keeps the chatbot useful for students who need accurate, current information for their assignments.

Solving the knowledge cutoff problem was one of those things that seemed impossible at first but turned out to be pretty manageable once I broke it down. The chatbot went from being stuck in January 2025 to being able to answer questions about what's happening today, as long as the search finds it. That's a pretty significant upgrade.
