The New Slack: Dreamforce 2025's Vision of an Agentic Operating System

David, Luis - great product and architecture insights. Let me add the data engineering perspective. I attended the “Enterprise Search & Data Integration” workshop. Here’s what it takes to make this work with real enterprise data.

Data Integration: The Hard Part

Challenge: Data is Messy

Our reality at TianPan:

  • 15 data sources (Salesforce, Google Drive, GitHub, Jira, Notion, Confluence, internal DBs)
  • Different data formats (structured, unstructured, semi-structured)
  • Inconsistent naming conventions
  • Duplicate records
  • Stale data (some docs haven’t been updated in 2+ years)

Enterprise Search promise: “Just connect your tools and search everything!”

Reality: Garbage in, garbage out.

Data Quality Requirements for AI Agents

For agents to work well, data must be:

1. Accurate

  • No outdated information
  • Remove deprecated docs
  • Update “living documents” regularly

2. Complete

  • All relevant fields populated
  • No critical missing data
  • Proper relationships between objects

3. Consistent

  • Standardized terminology (“customer” vs “client” vs “account”)
  • Unified IDs (same entity across systems)
  • Consistent date formats, units, etc.

4. Accessible

  • Proper permissions (agents respect access controls)
  • Well-indexed (fast retrieval)
  • Metadata-rich (tags, categories, owners)

Dreamforce stat: “70% of enterprise search failures are due to poor data quality, not bad algorithms.”
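
To make these four requirements checkable rather than aspirational, we can run a small audit script over whatever the connectors extract before anything gets indexed. A rough sketch; the field names and thresholds are mine, not anything Salesforce prescribes:

from datetime import datetime, timezone

REQUIRED_FIELDS = ['id', 'title', 'content', 'modified', 'owner']
STALE_AFTER_DAYS = 730  # flag docs untouched for 2+ years

def audit_document(doc):
    issues = []

    # Completeness: all required fields populated
    for field in REQUIRED_FIELDS:
        if not doc.get(field):
            issues.append(f"missing field: {field}")

    # Accuracy proxy: flag stale documents
    if doc.get('modified'):
        modified = datetime.fromisoformat(doc['modified'].replace('Z', '+00:00'))
        age_days = (datetime.now(timezone.utc) - modified).days
        if age_days > STALE_AFTER_DAYS:
            issues.append(f"stale: last modified {age_days} days ago")

    # Accessibility: metadata needed for permission filtering and faceting
    if not doc.get('permissions'):
        issues.append("no permissions metadata")

    return issues

# Example: collect everything a connector extracted that fails the audit
# flagged = {d['id']: issues for d in documents if (issues := audit_document(d))}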

Enterprise Search Data Pipeline

Step 1: Data Extraction

Connectors we need to set up:

Google Drive (20,000+ files):

# Using the Google Drive API (google-api-python-client)
from googleapiclient.discovery import build

drive_service = build('drive', 'v3', credentials=creds)

# List all files, paging through results (20,000+ files won't fit in one page)
files = []
page_token = None
while True:
    results = drive_service.files().list(
        pageSize=1000,
        fields="nextPageToken, files(id, name, mimeType, modifiedTime, owners, permissions)",
        pageToken=page_token
    ).execute()
    files.extend(results.get('files', []))
    page_token = results.get('nextPageToken')
    if not page_token:
        break

for file in files:
    # Download file content (Google-native Docs/Sheets need export_media instead of get_media)
    content = drive_service.files().get_media(fileId=file['id']).execute()

    # Extract text (handle PDFs, Docs, Sheets differently)
    text = extract_text(content, file['mimeType'])

    # Send to Data 360
    index_document({
        'source': 'google_drive',
        'id': file['id'],
        'title': file['name'],
        'content': text,
        'modified': file['modifiedTime'],
        'permissions': file['permissions']
    })

GitHub (500+ repos):

# Index README, wiki, issues, PRs (PyGithub)
from github import Github, GithubException

g = Github(access_token)

for repo in g.get_user().get_repos():
    # Index README
    try:
        readme = repo.get_readme()
        index_document({
            'source': 'github',
            'type': 'readme',
            'repo': repo.full_name,
            'content': readme.decoded_content.decode(),
            'url': readme.html_url
        })
    except GithubException:
        pass  # repo has no README

    # Index issues (the issues feed also includes pull requests)
    for issue in repo.get_issues(state='all'):
        index_document({
            'source': 'github',
            'type': 'issue',
            'repo': repo.full_name,
            'title': issue.title,
            'content': issue.body,
            'url': issue.html_url,
            'created': issue.created_at
        })

Jira (8,000+ tickets):

from jira import JIRA

jira = JIRA(server='https://tianpan.atlassian.net', basic_auth=(email, token))

# Search all projects; maxResults=False tells the client to page through everything
# (Jira caps a single page of search results, so maxResults=10000 would silently truncate)
issues = jira.search_issues('project in (PROJ1, PROJ2, PROJ3)', maxResults=False)

for issue in issues:
    index_document({
        'source': 'jira',
        'id': issue.key,
        'title': issue.fields.summary,
        'description': issue.fields.description,
        'status': issue.fields.status.name,
        'assignee': issue.fields.assignee.displayName if issue.fields.assignee else None,
        'created': issue.fields.created
    })

Step 2: Text Extraction & Cleaning

Challenge: Different file types require different parsers.

File types we handle:

  • PDFs: PyPDF2, pdfminer
  • Word Docs: python-docx
  • Excel: openpyxl, pandas
  • Code: Language-specific parsers (AST for Python, etc.)
  • Images (OCR): Tesseract
  • HTML: BeautifulSoup

Example: PDF extraction:

import re
import PyPDF2

def extract_pdf_text(file_path):
    text = ""
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text += page.extract_text() or ""  # extract_text() can return None

    # Clean extracted text
    text = text.replace('\n\n', ' ')   # Collapse paragraph breaks
    text = re.sub(r'\s+', ' ', text)   # Normalize whitespace

    return text
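
Word documents follow the same pattern. A quick sketch with python-docx (which only handles .docx; legacy .doc files would need a separate converter):

import re
import docx  # python-docx

def extract_docx_text(file_path):
    document = docx.Document(file_path)

    # Paragraph text plus table cells, which python-docx stores separately
    parts = [p.text for p in document.paragraphs]
    for table in document.tables:
        for row in table.rows:
            parts.extend(cell.text for cell in row.cells)

    return re.sub(r'\s+', ' ', ' '.join(parts)).strip()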

Step 3: Embedding Generation

Convert text → vector representation for semantic search.

Model choice (from Dreamforce recommendations):

  • OpenAI text-embedding-ada-002: $0.0001 per 1K tokens, 1536 dimensions
  • Cohere embed-english-v3: $0.0001 per 1K tokens, 1024 dimensions
  • Open source (Sentence Transformers): Free, 768 dimensions

For 20,000 documents:

import openai  # legacy openai<1.0 SDK interface

def generate_embeddings(documents):
    embeddings = []

    for doc in documents:
        # Chunk large documents (max 8191 tokens for ada-002)
        chunks = chunk_text(doc['content'], max_tokens=8000)

        for i, chunk in enumerate(chunks):
            response = openai.Embedding.create(
                input=chunk,
                model="text-embedding-ada-002"
            )

            embedding = response['data'][0]['embedding']

            embeddings.append({
                'doc_id': doc['id'],
                'chunk_index': i,   # enumerate avoids a fragile chunks.index() lookup
                'vector': embedding,
                'text': chunk
            })

    return embeddings

# Cost estimate for 20K docs (avg 2K tokens each):
# 20,000 docs × 2,000 tokens = 40M tokens
# 40M tokens × $0.0001 per 1K = $4 initial indexing cost
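
The chunk_text helper above is doing real work and I haven't defined it. A naive token-based version using tiktoken, assuming we're fine splitting mid-paragraph (a production splitter should respect sentence boundaries):

import tiktoken

def chunk_text(text, max_tokens=8000, model="text-embedding-ada-002"):
    # Encode to tokens, slice into fixed-size windows, decode each window back to text
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    return [
        encoding.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]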

Step 4: Vector Storage

Store embeddings in vector database for fast similarity search.

Options:

  • Pinecone: Managed, $70/month for 10M vectors
  • Weaviate: Open source, self-hosted
  • Milvus: Open source, good for large scale
  • pgvector: PostgreSQL extension (if already using Postgres)

Pinecone example:

import pinecone  # legacy pinecone-client v2 interface

pinecone.init(api_key='...', environment='us-east-1-aws')

index = pinecone.Index('enterprise-search')

# Upsert vectors as (id, values, metadata) tuples
index.upsert(vectors=[
    ('doc_1_chunk_0', embedding_1, {'source': 'google_drive', 'title': 'Q4 Plan'}),
    ('doc_2_chunk_0', embedding_2, {'source': 'jira', 'title': 'Feature request'}),
    # ... 20,000 more
])

# Query
query_embedding = openai.Embedding.create(input="Q4 product roadmap", model="text-embedding-ada-002")
results = index.query(
    vector=query_embedding['data'][0]['embedding'],
    top_k=10,
    include_metadata=True  # needed so result['metadata'] is populated
)

for result in results['matches']:
    print(f"{result['metadata']['title']} - Score: {result['score']}")

Data Freshness & Incremental Updates

Problem: Data changes constantly. How do we keep search index up-to-date?

Strategy 1: Scheduled Full Re-Index

Frequency: Weekly
Approach: Re-index everything from scratch
Pros: Simple, ensures consistency
Cons: Expensive (API costs, compute time)

Not recommended for production.

Strategy 2: Incremental Updates

Frequency: Hourly or real-time
Approach: Only update changed documents

Implementation:

from datetime import datetime, timezone

# Track last update timestamp (Drive expects RFC 3339 / UTC)
last_update = get_last_update_time()  # e.g. '2025-09-01T00:00:00Z'

# Query only files modified since the last update
modified_files = drive_service.files().list(
    q=f"modifiedTime > '{last_update}'",
    fields="files(id, name, modifiedTime)"
).execute()

for file in modified_files['files']:
    # Re-index only this file
    update_index(file)

# Update timestamp
set_last_update_time(datetime.now(timezone.utc).isoformat())

Cost savings: ~95% reduction vs full re-index.

Strategy 3: Webhook-Based Real-Time Updates

Best for: Critical documents that change frequently

Example: Google Drive webhooks:

# Set up webhook (Drive watch channels expire, so they need periodic renewal)
drive_service.files().watch(
    fileId='12345',
    body={
        'id': 'unique-channel-id',
        'type': 'web_hook',
        'address': 'https://our-endpoint.com/webhooks/google-drive'
    }
).execute()

# Webhook handler (FastAPI)
from fastapi import Request

@app.post('/webhooks/google-drive')
async def handle_drive_webhook(request: Request):
    file_id = request.headers['X-Goog-Resource-ID']

    # File was modified, re-index it
    await re_index_file(file_id)

    return {'status': 'ok'}

Permission-Aware Search

Critical requirement: Users should only see search results they have access to.

Challenge: Permissions are defined differently across tools.

  • Google Drive: Explicit per-file sharing (a specific user's email "can view")
  • GitHub: Org membership + repo permissions
  • Jira: Project roles
  • Salesforce: Complex sharing rules, field-level security

Unified approach:

def search_with_permissions(query, user_email):
    # Generate query embedding
    query_vector = embed(query)

    # Vector search (get top 100 candidates)
    candidates = vector_db.query(query_vector, top_k=100)

    # Filter by permissions
    accessible_results = []
    for doc in candidates:
        if user_can_access(doc, user_email):
            accessible_results.append(doc)

        if len(accessible_results) >= 10:
            break  # Return top 10

    return accessible_results

def user_can_access(doc, user_email):
    source = doc['source']

    if source == 'google_drive':
        # Check Google Drive permissions (indexed as a list of permission objects)
        return any(p.get('emailAddress') == user_email for p in doc['permissions'])
    elif source == 'github':
        # Check GitHub org membership
        return user_in_github_org(user_email, doc['repo_org'])
    elif source == 'jira':
        # Check Jira project access
        return user_has_jira_project_access(user_email, doc['project'])
    # ... etc

Performance consideration: Permission checks add latency (100ms per check).

Optimization: Cache user permissions, refresh every 15 minutes.
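
A sketch of that cache, assuming the per-source lookups in user_can_access are the slow part; the cache key and TTL are my choices, not anything from the workshop:

import time

PERMISSION_CACHE_TTL = 15 * 60  # seconds
_permission_cache = {}  # (user_email, source, resource_key) -> (allowed, cached_at)

def user_can_access_cached(doc, user_email):
    key = (user_email, doc['source'], doc.get('repo_org') or doc.get('project') or doc['id'])
    cached = _permission_cache.get(key)

    if cached and time.time() - cached[1] < PERMISSION_CACHE_TTL:
        return cached[0]

    allowed = user_can_access(doc, user_email)  # the slow, per-source check above
    _permission_cache[key] = (allowed, time.time())
    return allowed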

Data Governance

Questions from security team:

1. Data residency: Where is indexed data stored?

  • Answer: Pinecone US East (AWS) for vectors, Salesforce multi-tenant for metadata

2. Data retention: How long do we keep old document versions?

  • Answer: 90 days (configurable)

3. PII detection: What if indexed documents contain SSNs, credit cards?

  • Answer: Salesforce Shield scans and redacts automatically

4. GDPR right to be forgotten: How do we remove user data?

  • Answer: Delete from source system, index auto-updates within 1 hour
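
On point 4, the "auto-update" still needs our side to drop the vectors. One way to do it, assuming we keep a doc_id → chunk IDs mapping at upsert time (that mapping store is hypothetical, not something Pinecone gives you):

def forget_document(doc_id):
    # Look up the chunk IDs we recorded at upsert time (hypothetical mapping store)
    chunk_ids = chunk_id_store.get(doc_id, [])

    if chunk_ids:
        # Remove the vectors from Pinecone by ID
        index.delete(ids=chunk_ids)
        chunk_id_store.pop(doc_id, None)

    # Also drop any cached metadata so the next incremental run can't resurrect it
    metadata_store.pop(doc_id, None)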

My Data Quality Recommendations

Before enabling Enterprise Search:

Phase 1: Data Audit (2 weeks)

  • Catalog all data sources (done: 15 sources)
  • Measure data quality per source (accuracy, completeness)
  • Identify deprecated/stale documents (candidates for deletion)
  • Document data ownership (who maintains each source)

Phase 2: Data Cleanup (1 month)

  • Archive or delete outdated docs (2+ years old, no recent views)
  • Standardize naming conventions (create style guide)
  • Enrich metadata (add tags, categories, owners)
  • Fix broken links and references

Phase 3: Ongoing Maintenance

  • Quarterly data quality reviews
  • Automated stale doc detection (flag if not updated in 6 months; see the sketch below)
  • User feedback loop (report irrelevant search results)
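
For the stale-doc detection item above, a minimal version that runs over the metadata we already capture during extraction; the 6-month threshold matches the maintenance rule:

from datetime import datetime, timedelta, timezone

STALE_THRESHOLD = timedelta(days=180)  # ~6 months

def find_stale_docs(indexed_docs):
    # indexed_docs: the metadata dicts produced during extraction (source, id, title, modified)
    cutoff = datetime.now(timezone.utc) - STALE_THRESHOLD
    stale = []

    for doc in indexed_docs:
        modified = datetime.fromisoformat(doc['modified'].replace('Z', '+00:00'))
        if modified < cutoff:
            stale.append({'id': doc['id'], 'title': doc['title'], 'last_modified': doc['modified']})

    return stale

# Feed the result into a weekly digest to the doc owners, for example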

Cost Analysis for TianPan

One-time setup:

  • Initial embedding generation: $4 (20K docs × 2K tokens avg)
  • Engineering effort: 80 hours × $100/hour = $8,000

Monthly ongoing:

  • Pinecone vector DB: $70/month
  • Incremental embedding updates: $0.50/month (500 updated docs/month)
  • API costs (Google Drive, GitHub, Jira): $20/month
  • Total: $90.50/month

Add to Luis’s infrastructure estimate: $330 + $90.50 = $420.50/month total

Plus Slack Enterprise Search license: $5/user/month × 450 = $2,250/month

Grand total: $2,670.50/month for infrastructure + licenses

Questions for Team

  1. Data quality: Who will own ongoing data cleanup and governance?

  2. Sources priority: Which 3 tools should we integrate first? (I suggest: Google Drive, GitHub, Jira)

  3. Permissions: Are our existing access controls sufficient, or do we need an audit?

  4. Monitoring: How do we track search relevance over time?

From a data perspective, Enterprise Search is achievable. But it requires upfront investment in data quality - which we’ve been deferring for years. This is the forcing function to finally clean up our data.

Rachel Martinez
Lead Data Engineer @ TianPan