
Copilot hallucinations: grounding and confidence for federal use

Bottom line for federal teams

  • Large language models can generate fluent but incorrect statements. This behavior is inherent and persists even when content filters are enabled, so systems must be designed to mitigate and detect it rather than assume it has been eliminated [1].
  • Microsoft Copilot mitigates errors through grounding (Bing web results for Copilot on the web, Microsoft Graph data for Copilot for Microsoft 365), but grounding reduces rather than eliminates hallucinations [2, 3, 4, 5].
  • Agencies should treat web-grounded answers as hypotheses to be verified, prefer enterprise-grounded answers for mission decisions, require citations, and implement evaluation and human oversight consistent with OMB M-24-10 and the NIST AI RMF [6, 7].

Why Copilot sometimes hallucinates

  • Foundation models predict plausible continuations of text without guaranteed truthfulness. Microsoft’s service documentation explicitly cautions that outputs can be inaccurate or misleading and require application-level mitigations and user verification [1].
  • Content moderation filters address harmful or inappropriate content, not factuality, so they do not prevent confident but wrong answers [8].

How grounding works in Copilot

  • Copilot on the web uses Bing’s retrieval and the Prometheus orchestration layer to inject fresh web results into the model context, and it returns inline citations to the sources it used, enabling users to trace claims to external pages [2, 3].
  • Copilot for Microsoft 365 grounds responses in an organization’s Microsoft Graph data (files, emails, meetings, and other content the user is authorized to access) before invoking the model, improving task relevance to enterprise context [4].
  • For custom copilots, the Retrieval Augmented Generation (RAG) pattern in Azure OpenAI allows developers to supply authoritative documents (for example, via Azure AI Search) as the grounding corpus; Microsoft guidance notes this reduces the likelihood of hallucinations by anchoring responses to source content [5]. A minimal sketch of this pattern follows this list.
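
To make the RAG pattern concrete, the sketch below shows one way to call Azure OpenAI’s “on your data” feature with an Azure AI Search index as the grounding corpus, using the openai Python SDK. The endpoint, key, deployment, and index names are placeholders, and the exact api_version and payload fields may differ by SDK and service version; treat this as an illustration of the pattern, not a definitive implementation.

```python
# Sketch: grounding an Azure OpenAI chat completion in an Azure AI Search
# index ("on your data" / RAG). All resource names below are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-AOAI-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-AOAI-KEY",                                       # placeholder
    api_version="2024-02-15-preview",  # a version supporting data_sources; verify
)

response = client.chat.completions.create(
    model="YOUR-GPT-DEPLOYMENT",  # your Azure OpenAI deployment name
    messages=[
        {"role": "system",
         "content": "Answer only from the retrieved documents. If the documents "
                    "do not support an answer, say you do not know."},
        {"role": "user", "content": "Summarize our records retention policy."},
    ],
    # extra_body carries the Azure-specific grounding configuration
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://YOUR-SEARCH.search.windows.net",  # placeholder
                "index_name": "agency-policies",                       # placeholder
                "authentication": {"type": "api_key", "key": "YOUR-SEARCH-KEY"},
            },
        }]
    },
)

print(response.choices[0].message.content)
# Grounded responses also carry citation metadata in the message context,
# which applications should surface so users can verify claims.
```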

What “confidence signals” Copilot actually provides

  • Web Copilot’s primary verifiability signal is citations to the specific web sources that informed the answer, surfaced inline for user review [2, 3].
  • GitHub Copilot Enterprise exposes code referencing that links natural-language answers to specific files and repositories in your tenant, providing traceability for the software artifacts used to generate an explanation or suggestion [9].
  • For system builders, Azure AI Evaluate provides a groundedness metric that estimates whether model claims are supported by the provided sources; teams can use it in pre-deployment testing and ongoing monitoring to detect hallucinations at scale [10]. A sketch of such an evaluation gate follows this list.
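
As one example of an evaluation gate, the sketch below uses the GroundednessEvaluator from the azure-ai-evaluation Python package, which scores whether a response is supported by the supplied context (typically on a 1–5 scale). The package usage, result keys, and threshold here are assumptions to verify against current Azure documentation; the point is the pattern: score each answer against its sources and route low scores to human review.

```python
# Sketch: gating responses on a groundedness score before release.
# Assumes the azure-ai-evaluation package and a judge-model deployment;
# verify class names and the score scale against current Azure docs.
from azure.ai.evaluation import GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://YOUR-AOAI-RESOURCE.openai.azure.com",  # placeholder
    "api_key": "YOUR-AOAI-KEY",                                       # placeholder
    "azure_deployment": "YOUR-JUDGE-DEPLOYMENT",                      # placeholder
}

groundedness = GroundednessEvaluator(model_config)

GROUNDEDNESS_THRESHOLD = 4.0  # example policy threshold on the 1-5 scale

def review_or_release(query: str, answer: str, sources: str) -> str:
    """Score an answer against its retrieved sources; escalate low scores."""
    result = groundedness(query=query, response=answer, context=sources)
    score = result["groundedness"]
    if score < GROUNDEDNESS_THRESHOLD:
        # Below threshold: hold the answer for human review instead of releasing it.
        return f"ESCALATE (groundedness={score})"
    return f"RELEASE (groundedness={score})"
```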

Note: The presence of a citation is a verification affordance, not a guarantee of correctness; agencies should enforce review of cited sources for material decisions, consistent with AI risk guidance [7].

When to trust a Copilot answer

Trust thresholds should be tied to mission impact and aligned to federal AI risk policy:

  • Low-impact, exploratory tasks: Web-grounded answers with citations may be acceptable as starting points, provided users verify claims before reuse [2, 3, 7].
  • Medium/high-impact or rights-affecting tasks: Prefer enterprise-grounded or custom RAG copilots limited to authoritative corpora; require citations to internal sources and human review before action [4, 5, 6, 7].
  • Safety-impacting or otherwise sensitive AI uses: Follow OMB M-24-10 safeguards (impact assessments, testing and evaluation, human oversight) and NIST AI RMF practices; do not rely on web-grounded outputs that have not been independently verified for final determinations [6, 7].

Actions federal teams can take now

  1. Choose the right grounding path per task
  • Default to enterprise grounding for mission workflows in Copilot for Microsoft 365, where the model is orchestrated over Microsoft Graph data the user is authorized to access [4].
  • For custom solutions, implement Retrieval Augmented Generation with Azure OpenAI and Azure AI Search to constrain responses to your approved sources; this design explicitly reduces hallucination risk by anchoring outputs to provided documents [5].
  2. Require verifiability and enforce review
  • Mandate citations in user-facing answers and require users to open and verify cited passages before acting on any material decision, consistent with the NIST AI RMF “Measure” and “Manage” functions, which emphasize traceability and human oversight [7].
  • In developer workflows, enable GitHub Copilot Enterprise’s code referencing so engineers can trace suggestions to your repositories during code review [9].
  3. Constrain model behavior and output surface (see the sketch after this list)
  • Use prompt instructions that explicitly require the model to cite sources and to say “I don’t know” when the answer is not supported by the provided content; Microsoft prompt engineering guidance documents these patterns to reduce unsupported assertions [11].
  • Where appropriate, use function calling and response schemas to limit outputs to allowed actions and structured formats, shrinking the space for speculative text [12].
  • Apply content filters in Azure OpenAI to block unsafe categories, recognizing that they complement but do not replace groundedness controls [8].
  4. Evaluate groundedness before and after deployment
  • Integrate Azure AI Evaluate to score groundedness and relevance on test sets and production samples; fail the build or route for human review when groundedness falls below thresholds [10].
  • Red-team and stress-test generative applications per OMB M-24-10 mandates for testing and monitoring of AI systems, especially for safety-impacting uses [6].
  5. Limit unnecessary web exposure
  • For custom copilots, restrict knowledge sources to vetted internal repositories in Copilot Studio rather than open websites unless the use case explicitly requires external knowledge; Copilot Studio supports connecting to curated data sources such as SharePoint, files, and selected websites under your control [13].
  • When web access is required, scope the allowed domains and audit citations to ensure they point to acceptable sources before enabling downstream automation [7, 13].
  6. Operate in compliant clouds and document governance
  • Build and operate mission copilots on Azure Government to align with federal security authorizations (for example, FedRAMP High) and isolation requirements; follow Microsoft’s platform guidance for deploying Azure OpenAI Service in Azure Government environments [14, 15].
  • Implement OMB M-24-10 requirements for AI use-case inventories, impact assessments, human oversight, and ongoing monitoring; record evaluation results (including groundedness scores) and post-deployment incidents as part of AI governance artifacts [6].
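
To illustrate step 3, the sketch below combines a citation-requiring system prompt with a function (tool) definition so the model’s output is confined to a structured schema. The submit_answer tool, its fields, and the prompt wording are illustrative assumptions, not a prescribed Microsoft pattern; adapt them to your own allowed action surface.

```python
# Sketch: constraining output with a system prompt plus a tool schema.
# The submit_answer tool is hypothetical; define only actions you allow.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-AOAI-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-AOAI-KEY",                                       # placeholder
    api_version="2024-02-15-preview",
)

SYSTEM_PROMPT = (
    "Answer only using the provided documents. Cite the document ID for every "
    "claim. If the documents do not support an answer, reply exactly: I don't know."
)

tools = [{
    "type": "function",
    "function": {
        "name": "submit_answer",  # hypothetical tool; constrains the output shape
        "description": "Return an answer with supporting citations.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "citations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "IDs of source documents supporting the answer",
                },
            },
            "required": ["answer", "citations"],
        },
    },
}]

response = client.chat.completions.create(
    model="YOUR-GPT-DEPLOYMENT",  # placeholder deployment name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What does document POL-12 say about retention?"},
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "submit_answer"}},
)
# The model must respond through submit_answer, so downstream code can reject
# any result whose citations list is empty before taking action.
```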

Microsoft platform mapping for federal deployments

  • Copilot for Microsoft 365: Use for productivity scenarios that benefit from Graph-grounded context; ensure the role-based access and DLP policies already in place in Microsoft 365 continue to govern what data can be surfaced to a user [4].
  • Azure AI Foundry and Azure OpenAI: Use RAG with Azure AI Search, prompt patterns, function calling, content filters, and Azure AI Evaluate groundedness metrics to build verifiable mission copilots [5, 8, 10, 11, 12].
  • Azure Government: Host AI workloads and data in Azure Government to align with federal compliance and isolation requirements; deploy Azure OpenAI Service in Azure Government as documented [14, 15].
  • GitHub Copilot Enterprise: Enable code referencing in enterprise chat for traceable developer assistance; pair it with secure SDLC controls under your agency policy [9].

Implementation checklist

  • Define mission risk tiers and trust thresholds; map use cases to grounding strategy and required review steps per OMB M-24-10 and the NIST AI RMF [6, 7].
  • Build or configure copilots to use enterprise/RAG grounding, require citations, and reject unsupported answers [4, 5, 11].
  • Add evaluation gates: automated groundedness checks pre-deployment and sampled production evaluation, with escalation for low-groundedness responses [10].
  • Constrain the surface: function calling, response schemas, and least-privilege data access; disable unnecessary web knowledge sources [12, 13].
  • Operate in compliant environments and maintain governance artifacts: inventories, impact assessments, test results, and monitoring records [6, 14, 15].

References

  1. Azure OpenAI Service overview — https://learn.microsoft.com/en-us/azure/ai-services/openai/overview
  2. Reinventing search with the new AI-powered Bing and Edge — https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-the-new-ai-powered-bing-and-edge-your-copilot-for-the-web/
  3. Data, Privacy, and Security for Microsoft 365 Copilot — https://learn.microsoft.com/en-us/copilot/microsoft-365/microsoft-365-copilot-privacy
  4. Overview of Microsoft Copilot for Microsoft 365 — https://learn.microsoft.com/en-us/microsoft-365-copilot/overview
  5. Use your data with Azure OpenAI Service — https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data
  6. OMB M-24-10 Advancing Governance, Innovation, and Risk Management for Agency Use of AI — https://www.whitehouse.gov/wp-content/uploads/2024/03/M-24-10.pdf
  7. NIST AI Risk Management Framework 1.0 — https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
  8. Azure OpenAI Service content filtering — https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter
  9. What is GitHub Copilot — https://docs.github.com/en/enterprise-cloud@latest/copilot/get-started/what-is-github-copilot
  10. Evaluate generative AI systems and groundedness in Azure AI — https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics
  11. Prompt engineering for Azure OpenAI models — https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/prompt-engineering
  12. Function calling with Azure OpenAI — https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/function-calling
  13. What is Microsoft Copilot Studio — https://learn.microsoft.com/en-us/microsoft-copilot-studio/fundamentals-what-is-copilot-studio
  14. What is Azure Government — https://learn.microsoft.com/en-us/azure/azure-government/what-is-azure-government
  15. Azure OpenAI Service in Azure Government — https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/azure-government