Everything You Need to Know About Sentinel Data Lake
As with everything Microsoft Cloud, this product is going to be in a state of constant flux. Below are my notes on Data Lake, enhanced with information from official sources and organised by AI.
1. Overview
- Announced: Public Preview on 22 July 2025, General Availability on 30 September 2025.
- Purpose: Unifies security data across an organisation in a cost-effective way, enabling long-term retention and multi-modal analysis (KQL, Jupyter, Spark, and future graph capabilities).
- Management Interface: Entirely managed through the Microsoft Defender portal (not the Azure portal).
- Setup Requirements:
  - Subscription Owner or User Access Administrator permissions for billing/resource provisioning
  - Security Administrator or Global Administrator (Entra ID) for configuration
  - For new customers (on or after 1 July 2025), automatic onboarding to the Defender portal may occur
  - Setup and onboarding guidance available via Microsoft Learn
- Important Note: Microsoft Sentinel in the Azure portal will be retired by July 2026.
2. What You Get
2.1 Graph-powered Interactive Visualisation (Public Preview)
- Enables graph-based investigation and threat hunting directly in the Defender portal.
- Graphs draw data from multiple sources (e.g. Sentinel, Entra ID, M365, Defender suite).
- Designed to assist with exploring incident relationships and latent threat paths.
- Serves as foundation for agentic defence and deeper security insights.
2.2 Query Over Historical Data
- Use KQL to query all retained data for threat hunting, anomaly detection, and behavioural or predictive analytics.
- Spark is available for advanced processing and ML workloads via Microsoft-managed Spark clusters provisioned automatically.
- Accessed through the Visual Studio Code Sentinel extension for notebook management.
- Supports AI agents via the MCP server, allowing tools like GitHub Copilot to query and automate security tasks.
2.3 Built-in, Flexible Data Tiers
- Offers purpose-built tiers for SOC operations:
  - Analytics tier: For hot data powering ongoing detections, automation, and SOAR.
  - Data Lake tier: For long-term storage of raw or normalised data at lower cost.
- Data can flow seamlessly between tiers, with mirroring for unified access.
2.4 Data from Everywhere
- Integrates data from 350+ native connectors, spanning cloud, on-premises, and third-party sources.
- Supports custom connector creation.
- Automatically includes asset data from Microsoft 365, Entra ID, and Azure without additional setup.
2.5 Centralised Cost Management
- Central dashboard for managing and optimising data costs:
  - Track usage and estimate charges (see the query sketch after this list)
  - Onboard data sources
  - Adjust retention and tiering policies
  - Manage storage costs and analytics vs lake retention
- 10 GB/day trial available for 31 days
- Commitment tiers start at 100 GB/day with predictable pricing
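As a minimal sketch of the kind of usage tracking this supports, the standard Log Analytics `Usage` table can estimate billable volume per table; this is a generic query, not the data lake cost dashboard itself, and the 31-day window simply matches the trial period mentioned above:

```kql
// Billable ingestion volume per table over the last 31 days.
// Usage reports Quantity in MB; convert to GB for comparison with tier pricing.
Usage
| where TimeGenerated > ago(31d)
| where IsBillable == true
| summarize IngestedGB = round(sum(Quantity) / 1024, 2) by DataType
| sort by IngestedGB desc
```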
2.6 Platform Architecture
- Storage architecture components:
- Analytics tier - Hot data for real-time detection and SOAR
- Data Lake tier - Long-term, cost-effective storage
- Asset store - Asset metadata and entity information (platform component)
- Activity store - Activity logs and security events (platform component)
- TI Store - Dedicated Threat Intelligence storage (platform component)
- AI/ML layer: Includes models, embeddings, and entity analyser for advanced analytics
- SIEM storage: Bridges traditional SIEM capabilities with modern data lake architecture
- Semantic capabilities: Graph database and embeddings enable semantic search across all security data
- Deep Security Copilot integration: Bidirectional integration for AI-powered security operations
3. Data Ingestion and Mirroring
3.1 Ingestion Methods
- Uses existing Sentinel connectors and DCRs — no reconfiguration required.
- Data can be:
  - Sent directly to the Data Lake, or
  - Mirrored from the Analytics tier automatically.
- Data ingested into the Analytics tier is automatically mirrored to the lake, preserving a single copy.
- Unified data connectors feed into the entire platform, not just the data lake.
3.2 Mirroring
- Automatically synchronises data between Analytics (hot) and Data Lake (cold) tiers.
- Free of charge when Data Lake retention matches the Analytics tier retention (default 90 days).
- The Analytics tier includes 90 days of free retention; extending Analytics retention beyond 90 days incurs retention charges in that tier.
- Additional Data Lake charges apply only for lake retention extended beyond the Analytics tier retention period.
- Existing data in the Analytics tier is not retroactively moved into the Lake — mirroring is forward only.
- Creates a single copy of data accessible across both tiers without duplication charges.
4. Permissions and Onboarding
- Pre-requisite: The Sentinel workspace must be connected to the Defender portal before onboarding.
- Roles and Access:
  - Setup requires Subscription Owner or User Access Administrator for billing and resource provisioning
  - Security Administrator or Global Administrator (Entra ID) for data ingestion authorisation
  - Log Analytics Contributor permission required to modify table settings in the Azure portal/Log Analytics
  - In the Defender portal, table management requires Security Administrator/Operator (Entra ID) or the Data (Manage) Unified RBAC permission
- All management occurs through Defender Portal → Microsoft Sentinel → Configuration → Tables / Data Connectors.
- Unified RBAC: Starting July 2025, data lake permissions are provided through Microsoft Defender XDR Unified RBAC.
4.1 Critical Prerequisites for Data Lake Onboarding
Defender Portal Connection (Mandatory First Step)
- CRITICAL: Sentinel workspace MUST be connected to Defender portal BEFORE data lake onboarding
- Navigate to: Defender Portal → System → Settings → Microsoft Sentinel → Connect a workspace
- Select and designate a primary workspace during connection
- This is a separate prerequisite step, NOT part of the data lake setup itself
Subscription Ownership Requirements
- Must be direct subscription owner or User Access Administrator - management-group-level ownership is insufficient
- Required for billing setup and resource provisioning
- For new customers (on/after July 1, 2025), these permissions may result in automatic onboarding to Defender portal
Regional Limitations & Consent
- Data lake provisions in same region as primary Sentinel workspace
- Only workspaces in same region as primary can attach to data lake
- Consent Required: If Microsoft 365 data resides in different region, you consent to data ingestion into data lake region
- BCDR Not Supported in:
- EU customers (EUDB compliance limitations)
- Israel
- Azure operated by 21Vianet
- Data Lake BCDR Specifically Not Supported in: Australia East, UK South, Switzerland North, Canada Central, Japan East, Central India, Southeast Asia, France Central, Israel Central
Policy Exemptions
- Azure Policy definitions may block deployment of required resources
- Configure policy exemption scoped to the resource group during onboarding
- Specifically exempt the resource type Microsoft.SentinelPlatformServices/sentinelplatformservices
- A DL103 error indicates that policies are preventing creation of the required Azure managed resources
Important Limitations
- CMK Not Supported: Customer-Managed Keys not supported at GA - data lake uses Microsoft-managed keys only
- No Retroactive Migration: Existing analytics tier data not moved to lake - forward mirroring only
- Data Availability Delays:
- Configuration changes: Take effect within 1-2 minutes
- Data visibility after tier switch or first-time enablement: 90-120 minutes for data to appear in the new tier
- Asset Data population: Can take up to 24 hours to fully populate
- Regular data ingestion: No additional delays once configured
- Managed Identity Created: Onboarding creates a managed identity with the prefix 'msg-resources-' followed by a GUID
- Requires Azure Reader role over subscriptions
- For custom table creation, needs Log Analytics Contributor role
- Auxiliary Logs: Once integrated into data lake, accessible through Data Lake exploration but no longer available in Advanced Hunting
- Basic Logs: NOT supported in Data Lake tier - must convert to Analytics tier first
5. Data Lake Permissions and Access Control
5.1 Dual Permission Model Overview
The Microsoft Sentinel Data Lake uses a dual-model permission structure that combines:
- Microsoft Entra ID Roles - Provide tenant-wide, broad access across ALL workspaces in the data lake
- Azure RBAC / Defender XDR Unified RBAC - Provide granular, workspace-specific permissions
This represents a significant shift from traditional Sentinel RBAC, particularly for job management and cross-workspace operations.
5.2 Permission Requirements by Role
Read/Query Permissions
| Scope | Permission Type | Required Roles/Permissions |
|---|---|---|
| All workspaces | Microsoft Entra ID | • Global Reader • Security Reader • Security Operator • Security Administrator • Global Administrator |
| System tables | Defender XDR Unified RBAC | Custom role with “Security data basics (read)” permission |
| Specific workspace | Azure RBAC | • Log Analytics Reader • Microsoft Sentinel Reader • Microsoft Sentinel Contributor • Reader • Contributor • Owner |
Write Permissions
| Scope | Permission Type | Required Roles/Permissions |
|---|---|---|
| All data lake tables | Microsoft Entra ID | • Security Operator • Security Administrator • Global Administrator |
| System tables | Defender XDR Unified RBAC | Custom role with “Data (Manage)” permission |
| Specific workspace | Azure RBAC | Roles with these permissions: • microsoft.operationalinsights/workspaces/write • microsoft.operationalinsights/workspaces/tables/write • microsoft.operationalinsights/workspaces/tables/delete • Built-in roles: Log Analytics Contributor, Owner, Contributor |
Job Management Permissions
| Task | Required Permission | Important Note |
|---|---|---|
| Create/manage KQL jobs | Microsoft Entra ID: • Security Operator • Security Administrator • Global Administrator | ⚠️ No workspace-level delegation available |
| Schedule analytics jobs | Defender XDR Unified RBAC: "Analytics Jobs Schedule" (Read/Manage) | Requires Unified RBAC activation |
5.3 Defender XDR Unified RBAC Permissions (Preview)
Starting July 2025, new granular permissions are available through Microsoft Defender XDR Unified RBAC:
| Permission | Level | Capabilities |
|---|---|---|
| Data | Manage | • Manage data retention policies • Move data between tiers (Analytics ↔ Data Lake) • Create and delete data lake tables • Configure and manage data connectors • Modify table settings |
| Analytics Jobs Schedule | Read | • View scheduled jobs and their configurations • Access Lake Exploration (read-only) • View notebook outputs |
| Analytics Jobs Schedule | Manage | • Create and modify scheduled KQL jobs • Use Lake Exploration with write capabilities • Execute notebooks in Azure Data Explorer • Enable/disable/delete existing jobs |
| Security data basics | Read | • Query all data lake tables • Access system tables • View data lake experiences in Defender portal |
5.4 Permission Requirements by Common Tasks
| Task | Minimum Permission Required | Notes |
|---|---|---|
| Run KQL queries (all workspaces) | Security Reader (Entra ID) | Broadest access |
| Run KQL queries (specific workspace) | Log Analytics Reader (Azure RBAC) | Workspace-scoped |
| Create scheduled KQL jobs | Security Operator (Entra ID) | Cannot be delegated per workspace |
| Modify table retention/tiers | Security Operator (Entra ID) OR custom role with "Data (Manage)" | Affects billing |
| Write query results to tables | Security Operator (Entra ID) | Via KQL jobs |
| Configure data connectors | Custom role with "Data (Manage)" OR Microsoft Sentinel Contributor (Azure RBAC) | Per workspace |
| Use Jupyter notebooks | Security Operator (Entra ID) OR "Analytics Jobs Schedule (Manage)" | Requires VS Code |
| View job status/history | "Analytics Jobs Schedule (Read)" | Read-only access |
5.5 Key Limitations and Considerations
Critical Limitations
- Job Management is All-or-Nothing
  - No workspace-level job permissions available
  - Requires high-privilege Entra ID roles (Security Operator minimum)
  - Cannot delegate job creation to workspace-specific teams
- System Tables Special Handling
  - Require Defender XDR Unified RBAC custom roles
  - Cannot be managed through traditional Azure RBAC
- Cross-Workspace Operations
  - Require Entra ID roles for any operation spanning multiple workspaces
  - No way to grant cross-workspace access without tenant-wide permissions
Migration Considerations
- Existing Azure RBAC roles continue to work for workspace-specific operations
- Microsoft Sentinel Reader/Contributor roles automatically work with data lake for their assigned workspaces
- New capabilities (job management, cross-workspace queries) require Entra ID roles
- Unified RBAC activation required for granular permissions (available July 2025+)
5.6 Activating Defender XDR Unified RBAC
To use the new granular permissions:
1. Navigate to Defender Portal → Permissions → Microsoft Defender XDR → Roles
2. Activate Unified RBAC for your tenant
3. Create custom roles with specific Data Lake permissions
4. Assign roles to users/groups with appropriate data source scopes
Important: Once activated, some traditional RBAC behaviours may change. Test in a non-production environment first.
6. Table Management
6.1 Configuration Options
- Manage table-level storage and retention directly in the portal.
- Choose:
  - Analytics tier, Data Lake tier, or both
  - Analytics retention: up to 2 years
  - Total retention: up to 12 years
- Changes apply only to new data — existing data remains in its current tier.
6.2 Tier Guidance
- Analytics tier: Best for high-fidelity data (EDR, email, identity, SaaS, cloud logs) requiring real-time detection.
- Data Lake tier: Ideal for high-volume, low-fidelity logs (NetFlow, firewall, proxy logs).
6.3 Important Notes
- Moving tables to the Data Lake tier only (i.e. disabling Analytics tier ingestion) impacts:
- Analytics rules that rely on hot data
- Hunting queries expecting Analytics tier data
- SOAR playbooks that depend on current data
- These features won’t get new data but may still function on historical Analytics tier data
- Auxiliary logs are automatically integrated into the Data Lake.
- Basic Logs are NOT supported in Data Lake tier - must be converted to Analytics tier first.
- Basic and Auxiliary Logs users should plan migration as no further enhancements are planned for these legacy structures.
7. Querying Data in the Data Lake
7.1 KQL Query Interface
- Found under Defender Portal → Data Lake Exploration → KQL Queries.
- The UI mirrors Advanced Hunting:
  - Tables list (left)
  - KQL editor (centre)
  - Query results (bottom)
- Supports full KQL capabilities, including machine learning functions and advanced analytics (see the example sketch after this list).
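As a minimal sketch of the kind of long-lookback hunt this enables, assuming SigninLogs is retained in the Data Lake tier for the selected workspace; the 180-day window and 2.5 anomaly threshold are arbitrary illustration values:

```kql
// Baseline daily failed sign-ins per user across 180 days of lake-retained data,
// then flag anomalous spikes with KQL's native time-series functions.
SigninLogs
| where TimeGenerated > ago(180d)
| where ResultType != "0"                       // failed sign-ins only
| make-series FailedSignIns = count()
    on TimeGenerated from ago(180d) to now() step 1d
    by UserPrincipalName
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(FailedSignIns, 2.5)
| mv-expand TimeGenerated to typeof(datetime), FailedSignIns to typeof(long), Anomalies to typeof(double)
| where Anomalies == 1
| project TimeGenerated, UserPrincipalName, FailedSignIns
```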
7.2 Asset Data ("Default Scope")
- Automatically includes rich asset metadata from the default workspace:
  - Azure Resource Graph tables: ARGAuthorizationResources, ARGResourceContainers, ARGResources
  - Entra ID tables: EntraApplications, EntraGroupMemberships, EntraGroups, EntraMembers, EntraOrganizations, EntraServicePrincipals, EntraUsers
- Provides contextual enrichment for investigations, e.g. group memberships, app access, device ownership (see the sketch after this list).
- Appears under the "default" workspace scope, alongside Sentinel workspace data.
- No setup required — automatically populated on Data Lake onboarding.
- These tables store the current state of your environment; they are not chronological logs.
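A minimal enrichment sketch follows. The EntraUsers column names used here (AccountEnabled, UserPrincipalName) are assumptions for illustration only; check the real schemas in the tables pane. It also assumes SigninLogs and the asset tables are queryable from the same scope, since the asset tables live under the "default" workspace; otherwise run it as two steps.

```kql
// Hypothetical sketch: surface recent sign-in activity from accounts that the
// EntraUsers asset snapshot currently marks as disabled.
// EntraUsers column names (AccountEnabled, UserPrincipalName) are assumed.
EntraUsers
| where AccountEnabled == false
| project UserPrincipalName
| join kind=inner (
    SigninLogs
    | where TimeGenerated > ago(30d)
  ) on UserPrincipalName
| project TimeGenerated, UserPrincipalName, AppDisplayName, IPAddress, ResultType
```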
8. Data Lake Jobs
- Data Lake jobs = scheduled KQL queries that write results to:
  - A new custom table (suffix _KQL_CL), or
  - An existing table (the schema of the KQL output must match the destination table).
- Configurable options include:
  - Job name, description, schedule, target workspace, and output table.
  - Supports both ad hoc and recurring runs.
- Monitoring: Job runs and status are visible under Defender Portal → Jobs (a job query sketch follows this list).
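As a minimal sketch of a job query, assuming CommonSecurityLog (CEF firewall logs) is retained in the lake tier; the output shape would suit a new custom table, with a hypothetical name such as FirewallHourly_KQL_CL:

```kql
// Hourly roll-up of firewall traffic, intended to be written by a scheduled
// KQL job to a custom summary table (hypothetical name: FirewallHourly_KQL_CL).
CommonSecurityLog
| where TimeGenerated > ago(1d)
| summarize
    Connections   = count(),
    BytesSent     = sum(SentBytes),
    BytesReceived = sum(ReceivedBytes)
    by SourceIP, DestinationIP, DestinationPort, bin(TimeGenerated, 1h)
```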
8.1 Comparison: KQL Jobs vs Summary Rules
| Feature | KQL Jobs | Summary Rules |
|---|---|---|
| Query language | Full KQL (JOINs, UNIONs, advanced operators) | Limited subset |
| Lookback period | Up to 12 years | Up to 24 hours |
| Scheduling | Customisable | Fixed intervals |
| Output target | Any custom/existing table | Analytics tier only |
| Data Lake support | ✅ Supported | ✅ Supported (can query a single table) |
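To make the difference concrete, here is a hedged sketch of a query only a KQL job can run (a union plus a one-year lookback, beyond the 24-hour window and limited operator set of summary rules), assuming both sign-in tables are retained in the lake tier:

```kql
// Weekly sign-in volume and distinct source IPs per user over a full year,
// combining interactive and non-interactive sign-in logs.
union SigninLogs, AADNonInteractiveUserSignInLogs
| where TimeGenerated > ago(365d)
| where ResultType == "0"                       // successful sign-ins
| summarize SignInCount = count(), DistinctIPs = dcount(IPAddress)
    by UserPrincipalName, bin(TimeGenerated, 7d)
```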
9. Jupyter Notebooks and Spark Integration
9.1 Overview
- Supports Spark-backed Jupyter notebooks for advanced analytics and ML workloads.
- Accessed through Visual Studio Code with the Microsoft Sentinel extension.
- Enables deep analysis for forensics, incident response, and anomaly detection.
9.2 Sentinel VS Code Extension
- View and manage:
  - Data Lake tables and schemas
  - Job definitions and run status
- Create, run, and schedule notebooks directly in VS Code.
- Integrates with GitHub Copilot for AI-assisted security operations.
9.3 Execution Environment
- Microsoft provisions and manages a dedicated Spark cluster for each Data Lake tenant.
- Select the kernel and compute pool size when executing notebooks:
  - Small: Single-table queries
  - Medium: Joins and data aggregation
  - Large: Multi-table joins or ML workloads
- Charged based on compute time (per active session).
- Spark pools run only during query execution, reducing idle cost.
10. Microsoft Sentinel MCP Server (Preview)
10.1 Overview
- Model Context Protocol (MCP) server provides unified, hosted interface for AI-driven security operations.
- Enables natural language queries against security data without requiring schema knowledge.
- Integrates with GitHub Copilot and VS Code for building intelligent security agents.
- Uses semantic search and embeddings for intelligent data discovery.
10.2 Key Capabilities
- Semantic Search: Query tools find relevant tables based on natural language prompts.
- Query Tools: Execute KQL queries and retrieve data using conversational language.
- Custom Analysis: Build security agents that automate enrichment, anomaly detection, and forensics.
- Entity Analysis: Leverages entity analyser for advanced correlation.
10.3 Tool Collections
Available collections include:
| Collection | Purpose | URL |
|---|---|---|
| Data Exploration | Search tables and query data in natural language | https://sentinel.microsoft.com/mcp/data-exploration |
| Agent Creation | Build Security Copilot agents for complex workflows | https://sentinel.microsoft.com/mcp/security-copilot-agent-creation |
10.4 Agent Creation Features
- Generate agent YAML files from natural language descriptions (e.g., “Build an agent that triages compromised accounts”).
- Discover relevant Security Copilot tools and skills automatically.
- Deploy agents at user or workspace scope.
- Iterative development through conversational refinement.
- Integrates with first-party (1P) Microsoft tools.
10.5 Integration Requirements
- Requires onboarding to Sentinel Data Lake.
- Security Reader role minimum for access.
- Works with Visual Studio Code and Security Copilot platforms.
11. Pricing and Billing
11.1 Consumption Model
- Ingestion: Data processing charge of $0.10/GB for data lake-only tables (not for mirrored data).
- Storage: Compressed at 6:1 ratio across all data sources, billed per GB/month.
- Compute: Charged for KQL queries and Spark notebook execution time.
- Mirroring: Free when retention matches analytics tier (90 days).
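For a rough worked example using only the figures above (the 60 GB/day scenario is an arbitrary assumption for illustration): sending 60 GB/day directly to lake-only tables incurs roughly 60 × $0.10 = $6/day in data processing, and at the quoted 6:1 compression that volume adds about 10 GB/day to billed storage, charged per GB/month at the lake storage rate; mirrored analytics-tier data skips the processing charge.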
11.2 Cost Optimisation
- Use commitment tiers for predictable pricing.
- Data Lake tier costs less than 15% of traditional analytics logs.
- New cost management experience (preview) provides usage insights and alerts.
- Set usage-based alerts on specific meters (query usage, notebook time).
11.3 Important Notes
- Archive logs automatically transition to Data Lake billing upon enablement.
- Auxiliary logs switch to Data Lake meters after onboarding.
- Defender XDR data must have total retention set beyond the default 30 days before it is ingested into the data lake tier.
12. Summary
| Area | Key Points |
|---|---|
| Integration | Uses existing Sentinel connectors; no pipeline changes |
| Cost | Mirroring free within retention; <15% cost vs traditional logs |
| Setup | Global/Security Admins with subscription ownership OR Security Admin + Sentinel Contributor |
| Analysis Methods | KQL, scheduled KQL jobs, Spark, Jupyter, graph visualisations, MCP/AI agents |
| Management | Centralised in Defender portal (Azure portal retiring July 2026) |
| AI Capabilities | MCP server enables natural language queries and automated agent creation |
| Storage Model | Three-tier architecture: Asset, Activity, and TI stores on raw storage |
| Advantages | Unified storage, flexible analysis, rich asset context, cross-source visibility |
13. Future Roadmap
- SQL query support planned for Data Lake exploration.
- Consolidated KQL interface merging Advanced Hunting and Data Lake exploration.
- Enhanced MCP capabilities for more sophisticated AI-driven security operations.
- Customer-Managed Keys (CMK) support for enhanced data sovereignty.
- Government cloud availability anticipated soon.