Dreamforce 2025: AI Agents for Energy & Utilities - Grid Optimization and Sustainability

I attended the “AI for Energy and Utilities” track at Dreamforce 2025 and wanted to share insights on how Agentforce is being deployed in the energy sector. This is a fascinating intersection of AI, critical infrastructure, and sustainability.

Why Energy Sector AI Is Unique

The energy sector has constraints and opportunities that differ fundamentally from other industries:

Critical Infrastructure:

  • Power grid failures affect millions
  • Real-time decisions (milliseconds matter)
  • Safety-critical operations (lives at stake)
  • 24/7/365 reliability requirements

Regulatory Complexity:

  • NERC CIP (North American Electric Reliability Corporation Critical Infrastructure Protection)
  • FERC regulations (Federal Energy Regulatory Commission)
  • State utility commissions (50+ different regulatory bodies)
  • Environmental regulations (EPA, state-level)

Physical-Digital Convergence:

  • AI agents control physical infrastructure (transformers, switches, generators)
  • Operational Technology (OT) + Information Technology (IT) integration
  • Industrial control systems (ICS) security
  • Real-time sensor data (millions of IoT devices)

Sustainability Imperative:

  • Net-zero commitments (2030-2050 targets)
  • Renewable energy integration (solar, wind variability)
  • Grid decentralization (distributed energy resources)
  • Carbon accounting and reporting

Dreamforce Energy Sector Use Cases

1. Predictive Maintenance for Grid Infrastructure (PG&E)

What it does:

  • Analyzes data from 150,000+ transformers, substations, transmission lines
  • Predicts equipment failures 2-6 weeks in advance
  • Prioritizes maintenance crews (highest-risk equipment first)
  • Reduces wildfire risk (California Public Safety Power Shutoff optimization)

Technical architecture:

IoT Sensors (temperature, vibration, oil quality)
    ↓
Edge Computing (local anomaly detection)
    ↓
Data 360 (unified sensor + maintenance history)
    ↓
Agentforce Predictive Maintenance Agent
    ↓
Work Order System (Salesforce Field Service)
    ↓
Dispatch crews to highest-risk equipment
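
To make the dispatch step concrete, here is a minimal sketch of the prioritization logic in Python. This is illustrative only, not PG&E's actual implementation; the asset fields, the wildfire-zone weighting, and the trained failure-probability model it presumes are all assumptions.

from dataclasses import dataclass

@dataclass
class Asset:
    asset_id: str
    failure_prob: float   # model output (0..1) over a 2-6 week horizon
    wildfire_zone: bool   # located in a high fire-threat district

def dispatch_priority(asset: Asset) -> float:
    # Assumption: wildfire-zone equipment is weighted above raw failure risk.
    return asset.failure_prob * (2.0 if asset.wildfire_zone else 1.0)

def rank_for_dispatch(assets: list[Asset], crews: int) -> list[Asset]:
    """Return the highest-risk assets, one work order per available crew."""
    return sorted(assets, key=dispatch_priority, reverse=True)[:crews]

fleet = [Asset("TX-1042", 0.81, True), Asset("TX-0007", 0.92, False),
         Asset("TX-3310", 0.45, True)]
for a in rank_for_dispatch(fleet, crews=2):
    print(f"Create work order: {a.asset_id} (priority {dispatch_priority(a):.2f})")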

Results (18 months):

  • Unplanned outages: 3,400/year → 1,950/year (-43%)
  • Wildfire-related incidents: 87% reduction
  • Maintenance cost savings: $42M/year (optimized crew dispatch)
  • Customer satisfaction (uptime): 3.8 → 4.3 out of 5

Key insight: AI agents bridge the gap between sensor data and physical action. This is operational technology (OT) meeting AI.

2. Renewable Energy Forecasting and Grid Balancing (NextEra Energy)

What it does:

  • Forecasts solar/wind generation 24-72 hours ahead
  • Balances variable renewable supply with demand
  • Optimizes battery storage charging/discharging
  • Coordinates with fossil fuel “peaker” plants

The grid balancing challenge:

Supply must EXACTLY match demand every second
  - Too much supply → frequency spikes, equipment damage
  - Too little supply → brownouts, blackouts

Traditional grid (predictable):
  Coal/gas plants ramp up/down slowly

Renewable grid (variable):
  Solar: drops 80% in 5 minutes (cloud cover)
  Wind: fluctuates with weather patterns

AI agents must predict and balance in real-time.

NextEra’s multi-agent system:

Agent 1: Weather Forecasting Agent

  • Ingests NOAA, European weather models
  • Predicts cloud cover, wind speeds (hyperlocal)
  • Updates every 15 minutes

Agent 2: Demand Forecasting Agent

  • Analyzes historical usage patterns
  • Accounts for weather (heat waves spike AC usage)
  • Detects anomalies (major events, holidays)

Agent 3: Grid Optimization Agent

  • Receives forecasts from Agents 1 & 2
  • Decides: charge batteries, discharge batteries, or start peaker plants
  • Sends control signals to distributed energy resources (DER)

Agent 4: Market Trading Agent

  • Buys/sells electricity in wholesale markets
  • Optimizes for price + carbon footprint
  • Coordinates with grid operator (ISO/RTO)
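
To show how the agents' outputs compose, here is a hedged sketch of one grid-optimization decision step. The function name, thresholds, and numbers are illustrative; the real system runs continuously against live market and SCADA feeds.

def grid_optimization_step(renewable_mw: float, demand_mw: float,
                           battery_soc: float) -> str:
    """Pick one balancing action from the upstream forecasts.
    battery_soc is battery state of charge, 0..1 (illustrative thresholds)."""
    surplus = renewable_mw - demand_mw
    if surplus > 0 and battery_soc < 0.95:
        return "charge_batteries"          # soak up excess renewables
    if surplus < 0 and battery_soc > 0.10:
        return "discharge_batteries"       # cover the shortfall from storage
    if surplus < 0:
        return "start_peaker_plant"        # last resort: fossil backup
    return "hold"

# Agents 1 and 2 would refresh these inputs every 15 minutes:
print(grid_optimization_step(renewable_mw=4_200, demand_mw=4_550,
                             battery_soc=0.62))   # discharge_batteries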

Results:

  • Renewable energy curtailment: 12% → 3% (less waste)
  • Battery utilization: +38% (better ROI on storage)
  • Wholesale market savings: $18M/year (buy low, sell high)
  • Carbon emissions: -22% (optimized fossil fuel usage)

Key insight: Multi-agent orchestration is critical. Grid balancing requires 4+ agents working together in real-time.

3. Customer Energy Efficiency Agent (Duke Energy)

What it does:

  • Analyzes customer energy usage patterns
  • Recommends personalized efficiency improvements
  • Calculates ROI on home upgrades (solar panels, insulation, HVAC)
  • Coordinates rebates and incentives

Conversational UI example:

Customer: "Why is my electric bill so high this month?"

Energy Efficiency Agent:
"Your August bill was $287, up 34% from July ($214).

Analysis of your usage:
  - Air conditioning: 62% of bill (+$48 vs July)
  - Water heater: 18% of bill
  - Appliances: 20% of bill

Your AC is running 40% more than similar homes in your area.
This could indicate:
  - Dirty air filter (replace every 3 months)
  - Poor insulation (check attic, windows)
  - Older AC unit (16+ years old, inefficient)

Recommendations:
  1. Replace air filter → Save $12/month (free)
  2. Attic insulation upgrade → Save $38/month ($1,200 upfront, 32-month payback)
  3. New high-efficiency AC → Save $67/month ($6,500 upfront, 97-month payback)

Duke Energy rebates available:
  - Insulation: $400 rebate (reduces payback to 21 months)
  - AC replacement: $1,200 rebate (reduces payback to 79 months)

[Schedule Energy Audit] [Apply for Rebates] [Talk to Specialist]"
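
The payback figures above reduce to simple arithmetic. A toy version of the calculation (the production agent would also model seasonality and rate tiers, which this ignores):

def payback_months(upfront: float, monthly_saving: float,
                   rebate: float = 0.0) -> int:
    """Months until cumulative savings cover the net upfront cost,
    rounded to the nearest month."""
    return round((upfront - rebate) / monthly_saving)

print(payback_months(1_200, 38))          # 32 (insulation)
print(payback_months(1_200, 38, 400))     # 21 (insulation + rebate)
print(payback_months(6_500, 67))          # 97 (high-efficiency AC)
print(payback_months(6_500, 67, 1_200))   # 79 (AC + rebate)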

Results:

  • Customer engagement: 14% → 31% (participation in efficiency programs)
  • Average energy savings: 18% per participating household
  • Customer satisfaction: +22 points (customers appreciate personalized advice)
  • Demand reduction: 340 MW (equivalent to avoiding one new power plant)

Key insight: AI agents enable mass personalization. Each customer gets tailored advice, not generic efficiency tips.

4. Electric Vehicle (EV) Managed Charging Agent (Southern Company)

What it does:

  • Coordinates charging for 45,000+ EVs on the grid
  • Optimizes charging times (off-peak, when renewable energy is abundant)
  • Prevents grid overload (too many EVs charging simultaneously)
  • Enables Vehicle-to-Grid (V2G) for grid stability

The EV challenge:

One charging EV ≈ 3 homes' worth of electricity demand

45,000 EVs charging simultaneously (6pm, everyone comes home):
  = 135,000 homes' worth of demand
  = Grid overload, brownouts

Solution: Stagger charging, incentivize off-peak

Managed charging workflow:

EV owner plugs in at 6pm
  ↓
Agent asks: "When do you need your car charged?"
Owner: "8am tomorrow"
  ↓
Agent calculates:
  - Battery level: 30% (needs ~8 hours to fully charge)
  - Available time: 14 hours (6pm → 8am)
  - Grid conditions: Peak 6-9pm, off-peak 11pm-6am
  - Renewable availability: High wind generation 1-5am
  ↓
Agent schedules:
  - Delay charging until 11pm
  - Charge during off-peak + high renewable hours
  - Finish by 7am (1 hour buffer)
  ↓
Owner saves money (off-peak rates)
Grid avoids overload
More renewable energy utilized
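
A simplified sketch of that scheduling decision, using the figures from the example above (~8-hour charge need, off-peak starting 11pm, 1-hour buffer). Function and parameter names are mine, not Southern Company's.

from datetime import datetime, timedelta

def schedule_charge(plug_in: datetime, deadline: datetime,
                    hours_needed: float, off_peak_start_hour: int = 23,
                    buffer_hours: float = 1.0) -> datetime:
    """Pick a start time inside the owner's window, preferring off-peak."""
    latest_start = deadline - timedelta(hours=hours_needed + buffer_hours)
    off_peak = plug_in.replace(hour=off_peak_start_hour, minute=0)
    if off_peak < plug_in:
        off_peak += timedelta(days=1)
    # Start at off-peak if the car still finishes in time, else earlier.
    return min(max(plug_in, off_peak), latest_start)

plug_in  = datetime(2025, 9, 15, 18, 0)   # 6pm
deadline = datetime(2025, 9, 16, 8, 0)    # 8am next day
print(schedule_charge(plug_in, deadline, hours_needed=8))
# 2025-09-15 23:00:00 -> charge 11pm-7am, off-peak, 1-hour buffer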

Results:

  • Peak demand reduction: 18% (spreading EV charging)
  • Customer savings: $42/month average (off-peak rates)
  • Renewable energy utilization: +12% (charging during high wind/solar)
  • Grid infrastructure avoided: $280M (no new substations needed)

Future: Vehicle-to-Grid (V2G)
EVs can discharge back to grid during peak demand:

Heat wave → grid stressed → brownout risk
  ↓
Agent asks EV owners: "Can we use 20% of your battery to stabilize the grid?
                       We'll pay you $15 and recharge overnight for free."
  ↓
500 EVs discharging 10 kWh each over one hour = 5 MW of grid support
  ↓
Crisis averted, EV owners earn money

Key insight: AI agents coordinate distributed resources (thousands of EVs) for grid-level benefits.

NERC CIP Compliance for Energy Sector AI

The energy sector has stringent cybersecurity requirements under NERC CIP (Critical Infrastructure Protection).

Key NERC CIP requirements:

CIP-005: Electronic Security Perimeter

  • AI agents accessing grid control systems must go through secure perimeter
  • Network segmentation (OT network isolated from IT network)
  • No direct internet access for critical systems

CIP-007: System Security Management

  • All agent actions logged (audit trail)
  • Malware protection on agent infrastructure
  • Security patching within 35 days

CIP-010: Configuration Change Management

  • Any agent configuration change requires approval
  • Baseline configuration documented
  • Testing required before production deployment

CIP-013: Supply Chain Risk Management

  • Salesforce as vendor must meet supply chain security requirements
  • Third-party LLMs (OpenAI, etc.) require security review
  • Vendor risk assessment updated annually

Penalties for non-compliance:

  • Fines of up to $1M per day per violation
  • FERC enforcement actions
  • Potential loss of grid operating authority

This is why energy sector AI deployments are slow and expensive.

Real-Time AI at Scale: The Technical Challenge

The energy grid operates at millisecond timescales, so AI agents must work in real time.

Latency requirements:

Traditional enterprise AI:
  - Response time: 1-5 seconds (acceptable)
  - Batch processing: hourly/daily updates (common)

Energy grid AI:
  - Response time: 10-100 milliseconds (required)
  - Real-time processing: continuous streaming data

PG&E’s architecture for real-time AI:

Edge Layer (substations, transformers):
  - Local anomaly detection (microsecond response)
  - Send alerts to cloud only when anomaly detected
  - No round-trip to cloud for routine operations

Cloud Layer (Agentforce):
  - Predictive modeling (2-week forecasts)
  - Work order generation and crew dispatch
  - Historical analysis and pattern recognition

Hybrid approach:
  - Edge handles real-time safety (circuit breakers, relays)
  - Cloud handles prediction and optimization
  - Edge + Cloud communicate via secure VPN

Why not 100% cloud?

  • Latency (round-trip to cloud takes 50-200ms)
  • Reliability (grid must operate even if cloud connection lost)
  • Bandwidth (millions of sensors generate terabytes/day)

Edge AI + Cloud AI = hybrid architecture for energy.
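
A minimal sketch of the edge-side gating logic: a rolling z-score check that trips local protection and only then alerts the cloud. The sensor feed, thresholds, and names are stand-ins, not PG&E's code.

import random, statistics

def sensor_stream(n: int = 1000):
    # Stand-in for the local 1 Hz oil-temperature feed (hypothetical).
    for _ in range(n):
        yield random.gauss(65.0, 1.5)
    yield 95.0  # injected fault

class EdgeAnomalyDetector:
    def __init__(self, window: int = 600, z_threshold: float = 4.0):
        self.readings: list[float] = []
        self.window, self.z = window, z_threshold

    def observe(self, reading: float) -> bool:
        """True if the reading deviates sharply from the recent window."""
        anomalous = False
        if len(self.readings) >= 30:
            mu = statistics.mean(self.readings)
            sd = statistics.pstdev(self.readings) or 1e-9
            anomalous = abs(reading - mu) / sd > self.z
        self.readings = (self.readings + [reading])[-self.window:]
        return anomalous

detector = EdgeAnomalyDetector()
for temp_c in sensor_stream():
    if detector.observe(temp_c):
        # The millisecond path stays local; only the alert goes upstream.
        print(f"edge protection trip + cloud alert: {temp_c:.1f} °C")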

Sustainability Reporting and Carbon Accounting Agents

Energy companies have net-zero commitments and need to track carbon emissions.

Duke Energy’s Carbon Accounting Agent:

What it does:

  • Tracks emissions from all generation sources (coal, gas, nuclear, renewables)
  • Calculates Scope 1, 2, 3 emissions (GHG Protocol standard)
  • Forecasts emissions trajectory toward 2050 net-zero goal
  • Recommends operational changes to reduce carbon

Example analysis:

Current emissions: 42.3 million metric tons CO2e/year

Breakdown:
  - Coal plants: 65% of emissions (being retired by 2035)
  - Natural gas: 28% of emissions (peaker plants)
  - Fugitive methane: 7% of emissions (pipeline leaks)

2050 Net-Zero Gap: -42.3 million metric tons

Pathway to net-zero:
  - Coal retirement: -27.5 million tons (by 2035)
  - Gas plant efficiency: -3.2 million tons (by 2030)
  - Methane leak reduction: -2.1 million tons (by 2028)
  - Renewable expansion: -6.8 million tons avoided (by 2040)
  - Carbon offsets: -2.7 million tons (forestry, CCS)

Agent recommendation: Accelerate coal retirement by 2 years → saves $180M in carbon costs
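
As a sanity check, the pathway items above do sum to the 42.3 Mt gap (figures copied from the example, in million metric tons CO2e):

pathway_mt = {
    "coal retirement (2035)":        27.5,
    "gas plant efficiency (2030)":    3.2,
    "methane leak reduction (2028)":  2.1,
    "renewable expansion (2040)":     6.8,
    "carbon offsets / CCS":           2.7,
}
gap_mt = 42.3
print(f"pathway total: {sum(pathway_mt.values()):.1f} Mt vs gap {gap_mt} Mt")
# pathway total: 42.3 Mt vs gap 42.3 Mt -> net-zero by 2050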

Regulatory drivers:

  • SEC climate disclosure rules (Scope 1, 2, 3 reporting)
  • State renewable portfolio standards (RPS)
  • Carbon pricing (some states/regions)
  • ESG investor pressure

Key insight: AI agents turn sustainability from compliance exercise to operational optimization.

Cost-Benefit Analysis for Energy Sector AI

PG&E’s ROI (24 months, predictive maintenance agents):

Investment:

  • Agentforce licenses (NERC CIP compliant): $1.8M/year (300 field engineers + operations center)
  • IoT sensor deployment: $45M (one-time, 150K devices)
  • Edge computing infrastructure: $12M (one-time)
  • Implementation and integration: $8M (one-time)
  • Ongoing operations: $2.4M/year
  • Year 1 total: $69.2M
  • Year 2+ total: $4.2M/year

Returns:

  • Avoided outage costs: $58M/year (lost revenue + penalties)
  • Maintenance optimization: $42M/year (crew efficiency)
  • Wildfire liability reduction: $120M/year (insurance + legal costs)
  • Asset life extension: $18M/year (equipment lasts longer)
  • Total annual benefit: $238M/year

ROI: 244% in Year 1
Payback: 3.5 months (insanely fast for infrastructure)

Why such high ROI?

  • Wildfire liability is massive in California (PG&E paid $13.5B in wildfire settlements 2017-2020)
  • Outage penalties can be $10M+ per incident
  • Grid infrastructure is expensive (transformers cost $500K-$2M each)

For PG&E, AI agents are existential (prevent bankruptcy from wildfire liability).

Integration with Operational Technology (OT) Systems

Energy AI doesn’t just integrate with IT systems (Salesforce, databases) - it integrates with physical control systems.

Common OT systems in energy:

SCADA (Supervisory Control and Data Acquisition):

  • Real-time monitoring of grid equipment
  • Remote control of switches, breakers, transformers
  • Historical data logging

DMS (Distribution Management System):

  • Models the electrical distribution network
  • Power flow analysis
  • Outage management

EMS (Energy Management System):

  • Transmission-level control
  • Economic dispatch (which generators to run)
  • Frequency regulation

DERMS (Distributed Energy Resource Management System):

  • Coordinates solar, wind, batteries, EVs
  • Virtual power plant aggregation

Agentforce integration pattern:

Agentforce Agent
    ↓
Salesforce Platform (IT environment)
    ↓
MuleSoft Integration Layer (DMZ)
    ↓ (one-way data flow for security)
OT Historian (read-only access to SCADA data)
    ↓ (separate write-only channel with approval)
DMS / SCADA (control commands)
    ↓
Physical equipment (transformers, switches)

Critical security principle:

  • Agents can read OT data freely (monitoring)
  • Agents can write control commands ONLY with human approval (safety)
  • Air-gap between IT and OT networks (prevent cyber attacks)

NextEra Energy learned this the hard way:
An early deployment had an agent directly controlling battery discharge. A software bug caused the agent to discharge all batteries simultaneously → grid frequency spike → emergency shutdown.

Now: Human-in-the-loop for all control actions.

Workforce Transition: Field Engineers + AI Agents

The energy sector has an aging workforce (average age 55+) and faces knowledge retention challenges.

Duke Energy’s approach:

  • Train field engineers to work with AI agents
  • Agent provides recommendations, engineer makes decisions
  • Capture expert knowledge in agent training data

Example: Substation maintenance

Veteran engineer (30 years experience):
  - Hears transformer humming → knows it's oil circulation issue
  - Smells ozone → knows it's corona discharge
  - Feels vibration → knows bearing is failing

AI agent:
  - Analyzes sensor data (temperature, vibration, oil quality)
  - Predicts failure probability
  - Recommends maintenance action

Human + AI:
  - Agent provides data-driven diagnosis
  - Engineer validates with physical inspection
  - Engineer's expertise improves agent over time

Change management:

  • 67% of field engineers were initially skeptical of AI
  • After 6 months working with agents: 82% say it improves their work
  • Key: Position AI as tool, not replacement

My Recommendation for Energy & Utilities

If you’re in the energy sector and considering Agentforce:

  1. Start with non-control use cases

    • Predictive maintenance (monitoring, not control)
    • Customer service (billing, efficiency advice)
    • Carbon reporting (data analysis)
  2. Don’t touch OT control until you have:

    • NERC CIP compliance validated
    • Cybersecurity maturity model (NIST Cybersecurity Framework)
    • Human-in-the-loop approval workflows
    • Comprehensive testing (tabletop exercises, simulations)
  3. Budget for OT integration complexity

    • 2-3x higher than standard enterprise Salesforce
    • Specialized OT security expertise required
    • Edge computing infrastructure
  4. Plan for 18-24 month deployment

    • NERC CIP compliance takes time
    • OT integration is complex
    • Workforce training and change management
  5. Focus on safety-critical ROI

    • Wildfire prevention (if applicable)
    • Outage reduction
    • Equipment failure prevention
    • These have massive financial impact

The opportunity is enormous (energy sector wastes $100B+/year on inefficiency), but safety and reliability are non-negotiable.

Questions for the Community

  1. For other energy/utility folks: How are you approaching OT/IT convergence for AI agents?

  2. For Priya (security): How would you design air-gap between IT agents and OT control systems?

  3. For Carlos (finance): How do you model ROI when benefits include “avoided wildfire liability” (probabilistic)?

  4. For critical infrastructure industries (water, transportation, manufacturing): Similar challenges with physical-digital AI?


I’m speaking at DistribuTECH 2026 on “AI Agents for Grid Modernization” - happy to share more detailed case studies offline.

David, fascinating energy sector overview. Let me address your question about designing an air-gap between IT agents and OT control systems. This is exactly the challenge we face in securing critical infrastructure AI.

Air-Gap Architecture for Energy Sector AI

The traditional “air-gap” (physical network separation) doesn’t work well with AI agents that need data from both IT and OT environments. We need a controlled bridge with strict security controls.

Three-Zone Security Model

┌─────────────────────────────────────────────────────────┐
│  ZONE 1: IT Environment (Internet-Connected)            │
│  - Salesforce / Agentforce                               │
│  - Corporate network                                     │
│  - Customer-facing systems                               │
│  - Business applications                                 │
│  Risk: High (exposed to internet attacks)                │
└────────────────┬────────────────────────────────────────┘
                 ↓ (one-way data flow)
         ┌───────────────────────────┐
         │  Security Gateway (DMZ)   │
         │  - Data diode (hardware)  │
         │  - Deep packet inspection │
         │  - Protocol whitelist     │
         │  - Anomaly detection      │
         └───────┬───────────────────┘
                 ↓ (read-only access)
┌─────────────────────────────────────────────────────────┐
│  ZONE 2: OT Data Zone (Historian)                       │
│  - Historical SCADA data                                 │
│  - Sensor readings (archived)                            │
│  - Maintenance logs                                      │
│  - Analytics and reporting                               │
│  Risk: Medium (isolated, but has IT connections)         │
└────────────────┬────────────────────────────────────────┘
                 ↓ (physically separate network)
         ┌───────────────────────────┐
         │  Hardware Air-Gap         │
         │  - No physical connection │
         │  - Manual data transfer   │
         │  - Removable media only   │
         └───────────────────────────┘
                 ↓ (control commands only, human-approved)
┌─────────────────────────────────────────────────────────┐
│  ZONE 3: OT Control Zone (SCADA/DMS)                    │
│  - Real-time grid control                                │
│  - Circuit breakers, switches                            │
│  - Substation automation                                 │
│  - No external connectivity                              │
│  Risk: Critical (controls physical infrastructure)       │
└─────────────────────────────────────────────────────────┘

Data Diode: Hardware-Enforced One-Way Flow

Traditional firewall: Software-based, can be misconfigured or compromised
Data diode: Hardware device that physically allows data flow in ONE direction only

OT Zone → Fiber optic cable → Receive-only port (IT Zone)
         (light signal sent)    (can only receive, cannot send)

IT Zone cannot send data back to OT Zone (physically impossible)

PG&E uses data diodes from Owl Cyber Defense:

  • Cost: $45K per data diode pair
  • 50 substations = $2.25M investment
  • BUT: Prevents 99.9% of cyber attack vectors

What flows through data diode (OT → IT):

  • Sensor readings (temperature, voltage, current)
  • Equipment status (online, offline, alarm)
  • Historical event logs
  • Maintenance records

What CANNOT flow (IT → OT):

  • Control commands
  • Configuration changes
  • Software updates
  • Malware (intentional or accidental)

Control Path: Separate, Human-Approved Channel

For AI agents to control OT equipment, we use a separate approval workflow:

1. Agent analyzes data (from historian via data diode)
   ↓
2. Agent recommends action: "Open circuit breaker at Substation 47B"
   ↓
3. Recommendation sent to human operator (grid control center)
   ↓
4. Operator reviews:
   - Agent confidence score
   - Current grid conditions
   - Safety implications
   - Regulatory compliance
   ↓
5. Operator approves/rejects via SCADA workstation (separate from IT network)
   ↓
6. If approved, operator manually sends control command via OT network
   ↓
7. Circuit breaker opens
   ↓
8. Result logged in both IT and OT audit trails

Agent never has direct control access. Human is always in the loop.
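
A hedged sketch of how the approval gate can be modeled on the IT side. Field names and statuses are illustrative; the essential property is that the agent only ever creates a pending record, and execution happens from a SCADA workstation outside this code path.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ControlRecommendation:
    action: str                 # e.g. "OPEN_BREAKER"
    target: str                 # e.g. "Substation 47B / breaker 3"
    confidence: float           # agent confidence score, 0..1
    rationale: str
    status: str = "PENDING"     # PENDING -> APPROVED | REJECTED
    audit: list = field(default_factory=list)

    def review(self, operator: str, approve: bool, note: str = "") -> None:
        self.status = "APPROVED" if approve else "REJECTED"
        self.audit.append((datetime.now(timezone.utc), operator,
                           self.status, note))
        # Even an APPROVED record grants no control-channel access; the
        # operator issues the actual command from the OT network.

rec = ControlRecommendation("OPEN_BREAKER", "Substation 47B / breaker 3",
                            0.93, "Fault-current signature on phase B")
rec.review("operator_jlee", approve=True, note="Confirmed on DMS display")
print(rec.status, rec.audit)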

Protocol Filtering at Security Gateway

Even with data diode, we filter what protocols are allowed:

Allowed protocols (OT → IT):

  • OPC UA (industrial automation standard)
  • Modbus TCP (energy sector standard)
  • DNP3 (SCADA protocol)
  • IEC 61850 (substation automation)
  • MQTT (IoT sensor data)

Blocked protocols:

  • HTTP/HTTPS (no web browsing from OT)
  • FTP (no file transfers)
  • SSH (no remote shell access)
  • RDP (no remote desktop)
  • Email protocols (no phishing vector)

Deep packet inspection: Every packet is analyzed for malicious payloads, even within allowed protocols.
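
A toy version of the allow-list check (in practice this is enforced in the gateway's hardware and firmware, with deep packet inspection per protocol):

ALLOWED_OT_TO_IT = {"opc-ua", "modbus-tcp", "dnp3", "iec-61850", "mqtt"}

def gateway_permits(protocol: str, direction: str) -> bool:
    """Deny-all default: only whitelisted OT-to-IT telemetry passes."""
    return direction == "ot_to_it" and protocol.lower() in ALLOWED_OT_TO_IT

assert gateway_permits("MQTT", "ot_to_it")        # sensor telemetry: allowed
assert not gateway_permits("https", "ot_to_it")   # blocked protocol
assert not gateway_permits("dnp3", "it_to_ot")    # wrong direction: blocked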

NERC CIP Compliance Implementation

CIP-005 (Electronic Security Perimeter) compliance:

Required controls:
✓ Network segmentation (3 zones: IT, OT Data, OT Control)
✓ Access control lists (explicit allow-list, deny-all default)
✓ Intrusion detection systems (IDS at zone boundaries)
✓ Security event monitoring (SIEM ingestion)
✓ Annual penetration testing (third-party audit)

Our implementation:

  • Palo Alto firewalls at zone boundaries ($180K/year)
  • Nozomi Networks IDS for OT traffic ($240K/year)
  • Splunk SIEM for unified monitoring ($120K/year)
  • Annual pen test by Idaho National Lab ($200K)

Total NERC CIP security budget: $740K/year (this is on top of standard IT security)

Anomaly Detection for OT Traffic

A compromised or misconfigured AI agent accessing OT data can exhibit malicious patterns. We monitor for:

Volume anomalies:

Normal: Agent queries 500 data points/minute from historian
Alert:  Agent queries 5,000 data points/minute (data exfiltration?)

Pattern anomalies:

Normal: Agent queries same substations daily (predictive maintenance)
Alert:  Agent suddenly queries all substations (reconnaissance?)

Temporal anomalies:

Normal: Agent active during business hours (8am-6pm)
Alert:  Agent active at 3am (compromised credentials?)

Geographic anomalies:

Normal: Agent accesses substations in service territory
Alert:  Agent accesses competitor's data (misconfiguration?)

Automatic response: Suspicious agent activity triggers circuit breaker (agent access suspended, security team alerted).
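
A sketch of the four checks feeding that circuit breaker. Thresholds and field names are illustrative, not our production values:

from dataclasses import dataclass

@dataclass
class AgentActivity:
    queries_per_min: int
    substations_touched: int
    hour_local: int
    in_service_territory: bool

def suspicious(a: AgentActivity) -> list[str]:
    flags = []
    if a.queries_per_min > 2_000:
        flags.append("volume: possible data exfiltration")
    if a.substations_touched > 100:
        flags.append("pattern: reconnaissance-like fan-out")
    if not 8 <= a.hour_local <= 18:
        flags.append("temporal: activity outside business hours")
    if not a.in_service_territory:
        flags.append("geographic: out-of-territory access")
    return flags

activity = AgentActivity(5_000, 140, 3, True)
if flags := suspicious(activity):
    print("SUSPEND AGENT + alert security:", "; ".join(flags))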

Physical Security for OT Infrastructure

Agents run on IT infrastructure, but OT systems have physical security:

Control centers:

  • Biometric access (fingerprint + badge)
  • 24/7 security guards
  • Video surveillance (retained 90 days)
  • No phones/USB devices allowed
  • Faraday cage (prevents RF eavesdropping)

Substations:

  • Perimeter fencing (8-foot with barbed wire)
  • Motion sensors and alarms
  • Locked equipment cabinets
  • Tamper-evident seals
  • Remote video monitoring

Why physical security matters for AI:

  • Attacker with physical access can bypass all network security
  • USB device inserted into SCADA workstation = malware spread
  • Physical damage to equipment defeats AI predictions

Cost: $1.2M/year for physical security across 50 critical substations.

Incident Response: OT Cyber Attack Scenarios

Scenario 1: Agent credentials compromised

1. Attacker steals Salesforce credentials
2. Impersonates agent, queries OT historian
3. IDS detects anomalous query pattern
4. Automatic response: Suspend agent, revoke credentials
5. Containment time: 2 minutes
6. Impact: Zero (data diode prevented control access)

Scenario 2: Malware in IT environment

1. Ransomware infects corporate network
2. Spreads to Salesforce/Agentforce servers
3. Attempts to spread to OT via data diode
4. Data diode blocks (one-way only)
5. OT systems unaffected, continue operating
6. Impact: IT systems down, OT grid operations continue

Scenario 3: Insider threat

1. Malicious employee tries to use agent to damage grid
2. Employee modifies agent to issue dangerous commands
3. Commands sent to human operator for approval
4. Operator sees unusual command, rejects
5. Security team alerted, employee access revoked
6. Impact: Zero (human-in-the-loop prevented damage)

Scenario 4: Nation-state APT (Advanced Persistent Threat)

1. Sophisticated attacker breaches IT perimeter
2. Establishes foothold, moves laterally
3. Attempts to pivot to OT network
4. Encounters data diode (hardware barrier)
5. Cannot bypass (would require physical access)
6. Limited to IT data exfiltration, no grid control
7. Impact: Medium (data loss), not Critical (no grid impact)

The air-gap + data diode architecture prevents worst-case scenarios.

Testing and Validation

We conduct quarterly “red team” exercises:

Red Team (attackers):

  • Attempt to compromise agent infrastructure
  • Try to send malicious commands to OT systems
  • Exploit misconfigured firewalls, data diodes
  • Social engineering (trick operators into approving bad commands)

Blue Team (defenders):

  • Monitor for intrusions
  • Respond to security events
  • Validate security controls working as designed

Recent exercise results (Q3 2025):

  • Red team successfully compromised IT environment (phishing)
  • Red team could NOT pivot to OT control zone (data diode effective)
  • Red team could NOT trick operators into approving malicious commands (training effective)
  • Overall: IT defenses need improvement, OT air-gap held strong

Cost-Benefit: Is Air-Gap Worth It?

Air-gap infrastructure costs (50 substations):

  • Data diodes: $2.25M (one-time)
  • Zone segmentation (firewalls, switches): $1.8M (one-time)
  • IDS/IPS for OT: $240K/year
  • SIEM monitoring: $120K/year
  • Security operations: $600K/year (2 FTEs)
  • Penetration testing: $200K/year
  • Total Year 1: $5.21M
  • Total Year 2+: $1.16M/year

Benefit: Avoided grid cyber attack

  • Ukraine 2015 cyber attack: 230,000 customers lost power for 6 hours
  • Economic impact: $200M+ (lost productivity, food spoilage, etc.)
  • Colonial Pipeline ransomware 2021: $4.4M ransom + $100M+ remediation
  • Average cost of critical infrastructure cyber attack: $500M - $1B

Expected value calculation:

Probability of successful cyber attack without air-gap: 15% over 5 years
Average attack cost: $500M
Expected loss: $500M × 15% = $75M

Cost of air-gap: $5.21M + ($1.16M × 4) = $9.85M over 5 years

Net benefit: $75M - $9.85M = $65.15M

ROI: 661% (air-gap is cheap compared to cyber attack cost)

For critical infrastructure, air-gap is a no-brainer investment.

Emerging Threat: AI-Powered Cyber Attacks

New concern: Attackers using AI to defeat our AI defenses

Adversarial AI scenarios:

  1. Prompt injection attacks: Trick agent into revealing sensitive data
  2. Model poisoning: Corrupt agent’s training data to cause failures
  3. Evasion attacks: Craft malicious traffic that evades IDS detection
  4. AI-generated phishing: Target grid operators with personalized attacks

Our counter-measures:

  • Adversarial training (expose agents to attack patterns)
  • Input validation (sanitize all agent inputs)
  • Output filtering (prevent agents from revealing sensitive data)
  • Continuous monitoring (detect model drift or corruption)

This is an arms race: AI attackers vs AI defenders.

Recommendations for Energy Sector OT/IT Security

For utilities deploying AI agents:

  1. Never allow direct agent control of OT systems

    • Human-in-the-loop for all control actions
    • Agent provides recommendations only
  2. Implement hardware data diodes

    • Software firewalls aren’t sufficient for critical infrastructure
    • Physical one-way flow prevents 99% of attacks
  3. Defense in depth: 3-zone model

    • IT zone (internet-connected, higher risk)
    • OT data zone (historian, read-only access)
    • OT control zone (air-gapped, no external access)
  4. Monitor agent behavior continuously

    • Volume, pattern, temporal, geographic anomalies
    • Automatic suspension on suspicious activity
  5. Physical security is as important as cyber

    • Control center access controls
    • Substation perimeter security
    • Tamper-evident seals
  6. Test your defenses quarterly

    • Red team / blue team exercises
    • Tabletop simulations
    • Penetration testing
  7. Budget appropriately

    • $5M+ for initial air-gap infrastructure
    • $1M+/year ongoing security operations
    • This is 2-3x higher than standard IT security, but necessary

The stakes are too high to cut corners on OT/IT security.


Question for David: How is your organization handling the human-in-the-loop approval workflow? Are operators getting “alert fatigue” from too many agent recommendations?

Question for the group: Anyone else working on critical infrastructure AI (water, transportation, manufacturing)? Would love to compare notes on OT/IT air-gap architectures.

David, great energy sector overview. The edge computing + cloud hybrid architecture you described is exactly what we’re implementing. Let me share the infrastructure perspective.

Edge Computing Infrastructure for Real-Time Energy AI

Energy sector AI has latency requirements that standard cloud computing can’t meet. We need edge infrastructure deployed at substations, wind farms, and solar installations.

Edge vs Cloud: The Latency Problem

Cloud-only architecture (doesn’t work for grid control):

Substation sensor → Internet → Cloud AI → Internet → Substation actuator
Latency: 50-200ms (too slow for grid protection)

Edge + Cloud hybrid (what we deployed):

Substation sensor → Edge AI (local) → Substation actuator
Latency: 5-20ms (fast enough for real-time control)

Edge → Cloud (periodic sync)
        ↓
   Cloud AI (predictive analytics, optimization)

Edge Computing Hardware at Substations

PG&E’s edge deployment (150 substations):

Hardware per substation:

  • NVIDIA Jetson AGX Orin (AI inference accelerator): $2,000
  • Industrial PC (ruggedized, fanless): $3,500
  • 4G/5G cellular backup (when fiber fails): $800
  • Uninterruptible Power Supply (UPS): $1,200
  • Environmental enclosure (weather-proof, NEMA 4X): $2,800
  • Total per substation: $10,300

150 substations × $10,300 = $1.545M hardware investment

Why specialized hardware:

  • Temperature range: -40°C to +85°C (outdoor installations)
  • Vibration resistant (substation equipment vibrates)
  • Fanless cooling (fans fail in dusty environments)
  • Low power (<50W, runs on battery backup)
  • 10-year lifespan (industrial grade)

Edge AI Workloads

What runs on edge vs cloud:

Edge (real-time, <100ms):

  • Anomaly detection (temperature spike, vibration)
  • Equipment protection (trip circuit breaker if fault detected)
  • Local optimization (adjust voltage regulator)
  • Sensor fusion (combine multiple sensor readings)

Cloud (predictive, minutes to hours):

  • Predictive maintenance (predict failure 2-6 weeks ahead)
  • Multi-substation optimization (grid-wide load balancing)
  • Historical analytics (trend analysis)
  • Agent training (model updates)

Data flow:

Edge generates: 10 TB/day (raw sensor data)
Edge sends to cloud: 50 GB/day (compressed, anomalies only)

Compression ratio: 200:1 (massive bandwidth savings)

Why not send everything to cloud?

  • Bandwidth cost: $0.08/GB → 10TB/day = $800/day = $292K/year (too expensive)
  • Latency: Round-trip to cloud takes 100ms+ (too slow)
  • Reliability: If internet fails, edge continues operating (resilience)

Edge-Cloud Synchronization

How edge and cloud stay in sync:

Edge → Cloud (continuous upload):

  • Anomaly events (triggered by threshold)
  • Hourly summaries (aggregated statistics)
  • Equipment status changes (online → offline)
  • Alarms and warnings

Cloud → Edge (periodic download):

  • Model updates (new AI model versions)
  • Configuration changes (threshold adjustments)
  • Software patches (security updates)

Synchronization protocol:

Edge buffer stores data locally (72 hours retention)
  ↓
Every 5 minutes: Attempt upload to cloud
  ↓
If internet available: Upload compressed data, clear buffer
If internet unavailable: Keep buffering, retry
  ↓
When internet restored: Backfill up to 72 hours

This handles intermittent connectivity gracefully.
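
A minimal sketch of that buffer-and-retry loop (illustrative; the real sync agent also compresses and batches by priority):

import collections, time

class SyncAgent:
    def __init__(self, retention_s: int = 72 * 3600):
        self.buffer = collections.deque()   # (timestamp, payload) pairs
        self.retention_s = retention_s

    def record(self, payload: dict) -> None:
        now = time.time()
        self.buffer.append((now, payload))
        # Enforce the 72-hour retention window.
        while self.buffer and now - self.buffer[0][0] > self.retention_s:
            self.buffer.popleft()

    def try_upload(self, cloud_reachable: bool) -> int:
        """Called every 5 minutes: flush the backlog if the link is up."""
        if not cloud_reachable:
            return 0                        # keep buffering, retry later
        sent = len(self.buffer)             # upload everything (backfill)
        self.buffer.clear()
        return sent

agent = SyncAgent()
agent.record({"substation": "47B", "event": "temp_anomaly"})
print(agent.try_upload(cloud_reachable=False))   # 0 -> still buffered
print(agent.try_upload(cloud_reachable=True))    # 1 -> backfilled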

Container Orchestration at the Edge

We use Kubernetes for edge deployments:

Standard Kubernetes (cloud):

  • 100-1000 nodes in a cluster
  • High availability (replicas across zones)
  • Centralized control plane

K3s (lightweight Kubernetes for edge):

  • Single-node clusters (per substation)
  • Low memory footprint (<512MB)
  • Works on ARM processors (Jetson)

Our edge stack:

K3s (Kubernetes distribution)
  ↓
Docker containers:
  - AI inference engine (TensorFlow Lite)
  - Time-series database (InfluxDB edge)
  - MQTT broker (local sensor ingestion)
  - OPC UA gateway (industrial protocol)
  - Sync agent (cloud communication)

Deployment automation:

GitOps workflow:
1. Engineer commits new model to Git repo
2. CI/CD pipeline builds container image
3. FluxCD deploys to all 150 edge nodes
4. Canary rollout (10% → 50% → 100%)
5. Automatic rollback if errors detected

This lets us update 150 substations in 30 minutes (vs manual updates taking weeks).

Edge AI Model Optimization

Cloud models don’t run efficiently on edge hardware.

Model optimization techniques:

1. Quantization (reduce model size):

Original model: 32-bit floating point, 250 MB
Quantized model: 8-bit integer, 65 MB

Model size reduction: 74%
Inference speed improvement: 3.2x
Accuracy loss: <1% (acceptable)

2. Pruning (remove unnecessary neurons):

Original model: 12 million parameters
Pruned model: 3.2 million parameters

Model size reduction: 73%
Inference speed improvement: 2.8x
Accuracy loss: <2%

3. Knowledge distillation (train small model to mimic large model):

Teacher model (cloud): 500 MB, 200ms inference
Student model (edge): 40 MB, 15ms inference

Student achieves 94% of teacher accuracy at 8% of the size

Result: Cloud model optimized for edge deployment

  • 250 MB → 40 MB (84% size reduction)
  • 180ms → 12ms inference (15x faster)
  • Runs on $2K Jetson instead of $10K GPU server
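
For the quantization step, TensorFlow Lite (the inference engine in our edge stack) supports post-training int8 conversion. A sketch under stated assumptions: the model path is a placeholder, and real calibration would use recorded sensor windows rather than random data.

import numpy as np
import tensorflow as tf

def representative_data():
    # Calibration samples; random stand-ins here, real sensor windows
    # in practice.
    for _ in range(100):
        yield [np.random.rand(1, 128).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("anomaly_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()       # float32 weights -> int8

with open("anomaly_model_int8.tflite", "wb") as f:
    f.write(tflite_model)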

Network Architecture for Edge Deployments

150 substations need reliable connectivity:

Primary connection: Fiber optic

  • 1 Gbps dedicated fiber to each substation
  • <10ms latency to regional data center
  • 99.95% uptime SLA
  • Cost: $800/month per substation = $120K/month = $1.44M/year

Secondary connection: 4G/5G cellular

  • Automatic failover if fiber fails
  • 10-50 Mbps (sufficient for critical data)
  • Higher latency (30-100ms) but acceptable for backup
  • Cost: $80/month per substation = $12K/month = $144K/year

Tertiary: Satellite (for remote substations):

  • Starlink for rural areas without fiber/cellular
  • 50-200 Mbps, 20-40ms latency
  • $110/month = $13.2K/year (for 10 remote substations)

Total network cost: $1.6M/year

Uptime achieved: 99.98% (only 1.7 hours downtime per year across all substations)

Edge Security Considerations

Edge devices are physically accessible (unlike cloud servers).

Security controls:

1. Secure boot (prevent unauthorized firmware):

  • UEFI Secure Boot enabled
  • Cryptographically signed kernel and bootloader
  • Tampering triggers alert + device lockdown

2. Full disk encryption:

  • All storage encrypted (AES-256)
  • Key stored in TPM (Trusted Platform Module)
  • Prevents data extraction if device stolen

3. Network segmentation:

  • Edge device has 2 network interfaces:
    • Interface 1: OT network (substation sensors/actuators)
    • Interface 2: IT network (cloud connectivity)
  • Firewall between interfaces (prevents lateral movement)

4. Tamper detection:

  • Physical tamper switch (opens if enclosure opened)
  • Sends alert to security operations center
  • Device locks down until security team validates

5. Remote wipe capability:

  • If device compromised, security team can remotely wipe
  • Device becomes unusable (prevents attacker access)
  • Requires physical replacement + re-provisioning

Monitoring and Observability for Edge Fleet

Managing 150 edge devices requires centralized monitoring:

Metrics we track:

  • CPU/memory/disk usage (performance)
  • Network latency (cloud connectivity)
  • Inference latency (AI model performance)
  • Sensor data quality (missing or corrupt data)
  • Container health (crashed processes)
  • Security events (login attempts, tamper alerts)

Our observability stack:

  • Prometheus (metrics collection): 150 edge nodes → cloud Prometheus
  • Grafana (dashboards): Real-time view of all 150 substations
  • Loki (log aggregation): Centralized logging for troubleshooting
  • Alertmanager (notifications): PagerDuty alerts for critical issues

Alert examples:

Critical: Edge device offline >10 minutes
  → Page on-call engineer

Warning: Edge device CPU >80% for >5 minutes
  → Slack notification, investigate within 1 hour

Info: New edge model deployed successfully
  → Log only, no action required

Dashboard KPIs:

  • % of edge devices online (target: 99.9%)
  • Average inference latency (target: <20ms)
  • Model accuracy (target: >95%)
  • Network bandwidth usage (track cost)
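
On the instrumentation side, each edge node can expose these metrics with the standard prometheus_client library for the cloud Prometheus to scrape. Metric names and values below are illustrative:

import random, time
from prometheus_client import Gauge, Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "edge_inference_latency_seconds",
    "AI model inference latency on this edge node")
SENSOR_GAPS = Gauge(
    "edge_sensor_missing_points",
    "Sensor readings missing in the last interval")

start_http_server(9100)   # cloud Prometheus scrapes this port

while True:
    with INFERENCE_LATENCY.time():                 # times one inference
        time.sleep(random.uniform(0.005, 0.020))   # stand-in inference call
    SENSOR_GAPS.set(random.choice([0, 0, 0, 2]))   # stand-in quality check
    time.sleep(1)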

Cost Analysis: Edge vs Cloud-Only

Let’s compare total cost of ownership:

Cloud-only approach (doesn’t meet latency requirements, but for comparison):

Cloud compute (GPU inference): $18K/month × 12 = $216K/year
Network bandwidth (10 TB/day): $292K/year
Storage (long-term retention): $48K/year
──────────────────────────────────────
Total: $556K/year

Edge + Cloud hybrid approach (our deployment):

Edge hardware (amortized over 5 years): $309K/year
Edge network connectivity: $1,600K/year
Edge management (1 FTE): $140K/year
Cloud compute (reduced workload): $72K/year
Cloud storage: $48K/year
──────────────────────────────────────
Total: $2,169K/year

Edge is 3.9x more expensive!

But wait - factor in the value:

Cloud-only:

  • Cannot meet <100ms latency requirement
  • Grid protection fails (equipment damage, outages)
  • Estimated cost of failures: $40M/year

Edge + Cloud:

  • Meets <20ms latency requirement
  • Grid protection works (prevents failures)
  • Estimated avoided cost: $40M/year

ROI of edge deployment:

Incremental cost: $2,169K - $556K = $1,613K/year
Value delivered: $40M/year (avoided outages)

ROI: ($40M - $1.6M) / $1.6M = 2,381%

Edge computing is expensive, but critical for energy sector.

Scaling Edge Deployments

We started with 10 substations (pilot), now at 150, targeting 500.

Scaling challenges:

1. Hardware procurement:

  • 350 additional edge devices needed
  • Supply chain delays (12-18 weeks lead time)
  • Cost: $3.6M (350 × $10,300)

2. Network connectivity:

  • Not all substations have fiber available
  • Rural areas: satellite or cellular only
  • Cost variability: $80-$800/month per substation

3. Installation and configuration:

  • Each substation requires 8 hours of on-site work by a 2-person crew
  • 350 substations × 8 hours × 2 technicians = 5,600 tech-hours
  • ≈ 700 eight-hour tech-days ≈ 3.2 person-years
  • Cost: $800K (contract installers)

4. Ongoing maintenance:

  • 500 edge devices × 2% annual failure rate = 10 device replacements/year
  • Site visits for troubleshooting: 4 hours average × $200/hour
  • Annual maintenance budget: $120K

Total 3-year scaling cost:

  • Hardware: $3.6M
  • Installation: $800K
  • Network (incremental): $3.4M (3 years)
  • Maintenance: $360K (3 years)
  • Total: $8.16M to scale from 150 → 500 substations

Future: Edge AI Model Training

Current: Models trained in cloud, deployed to edge
Future: Federated learning (models trained on edge)

Why federated learning for energy:

  • Data sovereignty (some states prohibit sensor data leaving state borders)
  • Privacy (competitive intelligence in wholesale markets)
  • Bandwidth (avoid uploading 10 TB/day per substation)

Federated learning workflow:

1. Cloud sends base model to all 500 edge devices
2. Each edge device trains on local data (substation-specific)
3. Edge devices send model updates (gradients) to cloud
   (NOT raw data, much smaller: 5 MB vs 10 TB)
4. Cloud aggregates 500 model updates → improved global model
5. Cloud sends updated model back to edge devices
6. Repeat weekly
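
Step 4 is typically federated averaging (FedAvg): weight each device's update by how much local data produced it. A minimal sketch with simulated updates:

import numpy as np

def fedavg(global_weights: np.ndarray,
           device_updates: list[np.ndarray],
           device_samples: list[int]) -> np.ndarray:
    """Aggregate updates, weighting each device by its sample count."""
    total = sum(device_samples)
    delta = sum(u * (n / total)
                for u, n in zip(device_updates, device_samples))
    return global_weights + delta

w = np.zeros(4)                          # stand-in global model weights
updates = [np.full(4, 0.10), np.full(4, -0.02), np.full(4, 0.05)]
samples = [8_000, 2_000, 5_000]          # per-substation data volumes
print(fedavg(w, updates, samples))       # one aggregated global step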

Benefits:

  • Data never leaves edge (privacy, compliance)
  • Bandwidth savings: 5 MB vs 10 TB = 2,000,000x reduction
  • Models personalized to each substation (better accuracy)

Challenges:

  • Edge hardware is sized for inference, not training (slow)
  • Coordinating 500 devices for synchronous updates
  • Detecting malicious edge devices (poisoning attack)

We’re piloting federated learning in Q1 2026 (10 substations).

Recommendations for Edge AI Deployments

For energy/utilities considering edge computing:

  1. Start with high-value use cases

    • Where latency matters (grid protection, real-time optimization)
    • Where bandwidth is prohibitive (uploading raw sensor data)
  2. Budget 3-4x cloud costs

    • Edge hardware, connectivity, installation, maintenance
    • But factor in value of low-latency processing
  3. Use lightweight Kubernetes (K3s)

    • Standard Kubernetes too heavyweight for edge
    • K3s designed for resource-constrained environments
  4. Optimize models for edge

    • Quantization, pruning, knowledge distillation
    • Target: 10x size reduction, 5x speed improvement
  5. Plan for offline operation

    • Edge must work when cloud disconnected
    • Local buffering and retry logic
  6. Secure the edge

    • Physical security (tamper detection)
    • Encryption at rest and in transit
    • Remote wipe capability
  7. Centralized monitoring

    • Prometheus + Grafana for metrics
    • Alerting for offline devices
    • Track edge fleet health

Edge computing is complex and expensive, but essential for real-time energy AI.


Question for David: What’s your experience with edge device failures in harsh environments (extreme heat, cold, vibration)? We’re seeing 2-3% annual failure rate - is that typical?

Question for Priya: How do you secure edge devices that are physically accessible (e.g., in unstaffed substations)? Tamper detection sufficient?