David, Luis - great product and architecture insights. Let me add the data engineering perspective. I attended the “Enterprise Search & Data Integration” workshop. Here’s what it takes to make this work with real enterprise data.
Data Integration: The Hard Part
Challenge: Data is Messy
Our reality at TianPan:
- 15 data sources (Salesforce, Google Drive, GitHub, Jira, Notion, Confluence, internal DBs)
- Different data formats (structured, unstructured, semi-structured)
- Inconsistent naming conventions
- Duplicate records
- Stale data (some docs haven’t been updated in 2+ years)
Enterprise Search promise: “Just connect your tools and search everything!”
Reality: Garbage in, garbage out.
Data Quality Requirements for AI Agents
For agents to work well, data must be:
1. Accurate
- No outdated information
- Remove deprecated docs
- Update “living documents” regularly
2. Complete
- All relevant fields populated
- No critical missing data
- Proper relationships between objects
3. Consistent
- Standardized terminology (“customer” vs “client” vs “account”)
- Unified IDs (same entity across systems)
- Consistent date formats, units, etc.
4. Accessible
- Proper permissions (agents respect access controls)
- Well-indexed (fast retrieval)
- Metadata-rich (tags, categories, owners)
Dreamforce stat: “70% of enterprise search failures are due to poor data quality, not bad algorithms.”
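To make these requirements checkable before anything gets indexed, a small validation pass over each connector's output helps. A minimal sketch (the required fields and the terminology map are illustrative assumptions, not something prescribed at the workshop):
import re

# Illustrative minimums: which metadata every indexed document should carry,
# and how the "customer" vs "client" vs "account" variants map to one term.
REQUIRED_FIELDS = ['source', 'id', 'title', 'content', 'modified']
CANONICAL_TERMS = {'client': 'customer', 'account': 'customer'}

def check_completeness(doc):
    """Return the list of required metadata fields this document is missing."""
    return [field for field in REQUIRED_FIELDS if not doc.get(field)]

def normalize_terminology(text):
    """Rewrite synonym variants to the canonical term before embedding."""
    for term, canonical in CANONICAL_TERMS.items():
        text = re.sub(rf'\b{term}\b', canonical, text, flags=re.IGNORECASE)
    return text
Running the completeness check per source gives us a concrete quality score to track through the cleanup phases described later.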
Enterprise Search Data Pipeline
Step 1: Data Extraction
Connectors we need to set up:
Google Drive (20,000+ files):
# Using Google Drive API
from googleapiclient.discovery import build

drive_service = build('drive', 'v3', credentials=creds)

# List all files, paging through results (pageSize caps out at 1,000 per request)
page_token = None
while True:
    results = drive_service.files().list(
        pageSize=1000,
        fields="nextPageToken, files(id, name, mimeType, modifiedTime, owners, permissions)",
        pageToken=page_token
    ).execute()

    for file in results.get('files', []):
        # Download file content
        # (Google-native Docs/Sheets need files().export_media() instead of get_media())
        content = drive_service.files().get_media(fileId=file['id']).execute()

        # Extract text (handle PDFs, Docs, Sheets differently)
        text = extract_text(content, file['mimeType'])

        # Send to Data 360
        index_document({
            'source': 'google_drive',
            'id': file['id'],
            'title': file['name'],
            'content': text,
            'modified': file['modifiedTime'],
            'permissions': file['permissions']
        })

    page_token = results.get('nextPageToken')
    if page_token is None:
        break
GitHub (500+ repos):
# Index README, wiki, issues, PRs
from github import Github, GithubException

g = Github(access_token)

for repo in g.get_user().get_repos():
    # Index README (not every repo has one)
    try:
        readme = repo.get_readme()
        index_document({
            'source': 'github',
            'type': 'readme',
            'repo': repo.full_name,
            'content': readme.decoded_content.decode(),
            'url': readme.html_url
        })
    except GithubException:
        pass

    # Index issues
    for issue in repo.get_issues(state='all'):
        index_document({
            'source': 'github',
            'type': 'issue',
            'repo': repo.full_name,
            'title': issue.title,
            'content': issue.body,
            'url': issue.html_url,
            'created': issue.created_at
        })
Jira (8,000+ tickets):
from jira import JIRA

jira = JIRA(server='https://tianpan.atlassian.net', basic_auth=(email, token))

# Search all projects
issues = jira.search_issues('project in (PROJ1, PROJ2, PROJ3)', maxResults=10000)

for issue in issues:
    index_document({
        'source': 'jira',
        'id': issue.key,
        'title': issue.fields.summary,
        'description': issue.fields.description,
        'status': issue.fields.status.name,
        'assignee': issue.fields.assignee.displayName if issue.fields.assignee else None,
        'created': issue.fields.created
    })
Step 2: Text Extraction & Cleaning
Challenge: Different file types require different parsers.
File types we handle:
- PDFs: PyPDF2, pdfminer
- Word Docs: python-docx
- Excel: openpyxl, pandas
- Code: Language-specific parsers (AST for Python, etc.)
- Images (OCR): Tesseract
- HTML: BeautifulSoup
Example: PDF extraction:
import re

import PyPDF2

def extract_pdf_text(file_path):
    text = ""
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text += page.extract_text()

    # Clean extracted text
    text = text.replace('\n\n', ' ')   # Collapse blank lines
    text = re.sub(r'\s+', ' ', text)   # Normalize whitespace
    return text
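The `extract_text(content, mime_type)` helper used by the Drive connector above can dispatch on MIME type to the parsers listed earlier. A rough sketch (the MIME-type mapping is my own assumption and far from exhaustive):
import io

import PyPDF2
from bs4 import BeautifulSoup
from docx import Document

def extract_text(content, mime_type):
    """Route raw file bytes to the right parser based on MIME type."""
    if mime_type == 'application/pdf':
        reader = PyPDF2.PdfReader(io.BytesIO(content))
        return ' '.join(page.extract_text() or '' for page in reader.pages)
    if mime_type == 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
        doc = Document(io.BytesIO(content))  # python-docx
        return '\n'.join(p.text for p in doc.paragraphs)
    if mime_type == 'text/html':
        return BeautifulSoup(content, 'html.parser').get_text(separator=' ')
    if mime_type.startswith('text/'):
        return content.decode('utf-8', errors='ignore')
    return ''  # Unsupported type: skip rather than index garbage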
Step 3: Embedding Generation
Convert text → vector representation for semantic search.
Model choice (from Dreamforce recommendations):
- OpenAI text-embedding-ada-002: $0.0001 per 1K tokens, 1536 dimensions
- Cohere embed-english-v3: $0.0001 per 1K tokens, 1024 dimensions
- Open source (Sentence Transformers): Free, 768 dimensions
For 20,000 documents:
import openai  # pre-1.0 openai SDK interface

def generate_embeddings(documents):
    embeddings = []
    for doc in documents:
        # Chunk large documents (max 8,191 tokens for ada-002)
        chunks = chunk_text(doc['content'], max_tokens=8000)
        for i, chunk in enumerate(chunks):
            response = openai.Embedding.create(
                input=chunk,
                model="text-embedding-ada-002"
            )
            embedding = response['data'][0]['embedding']
            embeddings.append({
                'doc_id': doc['id'],
                'chunk_index': i,
                'vector': embedding,
                'text': chunk
            })
    return embeddings

# Cost estimate for 20K docs (avg 2K tokens each):
# 20,000 docs × 2,000 tokens = 40M tokens
# 40M tokens × $0.0001 per 1K = $4 initial indexing cost
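The `chunk_text` helper isn't shown above; a simple token-based splitter using tiktoken (OpenAI's published tokenizer library) might look like this. The overlap value is my assumption, not a workshop recommendation:
import tiktoken

def chunk_text(text, max_tokens=8000, overlap=200):
    """Split text into chunks of at most max_tokens tokens, with a small overlap."""
    encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
    tokens = encoding.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(encoding.decode(window))
        start += max_tokens - overlap  # overlap so context isn't cut mid-thought
    return chunks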
Step 4: Vector Storage
Store embeddings in vector database for fast similarity search.
Options:
- Pinecone: Managed, $70/month for 10M vectors
- Weaviate: Open source, self-hosted
- Milvus: Open source, good for large scale
- pgvector: PostgreSQL extension (if already using Postgres)
Pinecone example:
import pinecone

pinecone.init(api_key='...', environment='us-east-1-aws')
index = pinecone.Index('enterprise-search')

# Upsert vectors as (id, values, metadata) tuples
index.upsert(vectors=[
    ('doc_1_chunk_0', embedding_1, {'source': 'google_drive', 'title': 'Q4 Plan'}),
    ('doc_2_chunk_0', embedding_2, {'source': 'jira', 'title': 'Feature request'}),
    # ... 20,000 more
])

# Query
query_embedding = openai.Embedding.create(
    input="Q4 product roadmap",
    model="text-embedding-ada-002"
)
results = index.query(
    vector=query_embedding['data'][0]['embedding'],
    top_k=10,
    include_metadata=True  # needed so title/source come back with each match
)

for result in results['matches']:
    print(f"{result['metadata']['title']} - Score: {result['score']}")
Data Freshness & Incremental Updates
Problem: Data changes constantly. How do we keep search index up-to-date?
Strategy 1: Scheduled Full Re-Index
Frequency: Weekly
Approach: Re-index everything from scratch
Pros: Simple, ensures consistency
Cons: Expensive (API costs, compute time)
Not recommended for production.
Strategy 2: Incremental Updates
Frequency: Hourly or real-time
Approach: Only update changed documents
Implementation:
from datetime import datetime, timezone

# Track last update timestamp (Drive expects RFC 3339, e.g. '2024-01-15T00:00:00Z')
last_update = get_last_update_time()

# Query only files modified since the last run
modified_files = drive_service.files().list(
    q=f"modifiedTime > '{last_update}'",
    fields="files(id, name, modifiedTime)"
).execute()

for file in modified_files.get('files', []):
    # Re-index only this file
    update_index(file)

# Update timestamp
set_last_update_time(datetime.now(timezone.utc))
Cost savings: ~95% reduction vs full re-index.
Strategy 3: Webhook-Based Real-Time Updates
Best for: Critical documents that change frequently
Example: Google Drive webhooks:
# Set up webhook
drive_service.files().watch(
    fileId='12345',
    body={
        'id': 'unique-channel-id',
        'type': 'web_hook',
        'address': 'https://our-endpoint.com/webhooks/google-drive'
    }
).execute()

# Webhook handler (assuming a FastAPI app)
from fastapi import Request

@app.post('/webhooks/google-drive')
async def handle_drive_webhook(request: Request):
    file_id = request.headers['X-Goog-Resource-ID']
    # File was modified, re-index it
    await re_index_file(file_id)
    return {'status': 'ok'}
Permission-Aware Search
Critical requirement: Users should only see search results they have access to.
Challenge: Permissions are defined differently across tools.
- Google Drive: Explicit sharing (“[email protected] can view”)
- GitHub: Org membership + repo permissions
- Jira: Project roles
- Salesforce: Complex sharing rules, field-level security
Unified approach:
def search_with_permissions(query, user_email):
    # Generate query embedding
    query_vector = embed(query)

    # Vector search (get top 100 candidates)
    candidates = vector_db.query(query_vector, top_k=100)

    # Filter by permissions
    accessible_results = []
    for doc in candidates:
        if user_can_access(doc, user_email):
            accessible_results.append(doc)
            if len(accessible_results) >= 10:
                break  # Return top 10
    return accessible_results

def user_can_access(doc, user_email):
    source = doc['source']
    if source == 'google_drive':
        # Check Google Drive permissions
        return user_email in doc['permissions']
    elif source == 'github':
        # Check GitHub org membership
        return user_in_github_org(user_email, doc['repo_org'])
    elif source == 'jira':
        # Check Jira project access
        return user_has_jira_project_access(user_email, doc['project'])
    # ... etc
    return False  # Default deny for unknown sources
Performance consideration: Permission checks add latency (100ms per check).
Optimization: Cache user permissions, refresh every 15 minutes.
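One way to implement that cache is a simple in-process TTL cache keyed by user. This sketch assumes a `fetch_permissions(user_email)` helper that does the slow per-source lookups (both names are mine):
import time

PERMISSION_TTL = 15 * 60  # seconds, matching the 15-minute refresh above
_permission_cache = {}    # user_email -> (fetched_at, permissions)

def get_cached_permissions(user_email):
    """Return the user's permissions, hitting the source systems at most every 15 minutes."""
    now = time.time()
    cached = _permission_cache.get(user_email)
    if cached and now - cached[0] < PERMISSION_TTL:
        return cached[1]

    permissions = fetch_permissions(user_email)  # slow: queries Drive, GitHub, Jira, etc.
    _permission_cache[user_email] = (now, permissions)
    return permissions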
Data Governance
Questions from security team:
1. Data residency: Where is indexed data stored?
- Answer: Pinecone US East (AWS) for vectors, Salesforce multi-tenant for metadata
2. Data retention: How long do we keep old document versions?
- Answer: 90 days (configurable)
3. PII detection: What if indexed documents contain SSNs, credit cards?
- Answer: Salesforce Shield scans and redacts automatically
4. GDPR right to be forgotten: How do we remove user data?
- Answer: Delete from source system, index auto-updates within 1 hour
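For the deletion path, the index update is just removing every chunk we stored for that document. With the Pinecone client from the examples above, a sketch might be (the `doc_<id>_chunk_<n>` ID convention follows the upsert example; the chunk-count lookup is an assumption):
def remove_document_from_index(doc_id):
    """Delete all chunks for a document so it disappears from search results."""
    chunk_count = get_chunk_count(doc_id)  # assumed to be tracked at indexing time
    chunk_ids = [f"{doc_id}_chunk_{i}" for i in range(chunk_count)]
    index.delete(ids=chunk_ids)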
My Data Quality Recommendations
Before enabling Enterprise Search:
Phase 1: Data Audit (2 weeks)
- Catalog all data sources (done: 15 sources)
- Measure data quality per source (accuracy, completeness)
- Identify deprecated/stale documents (candidates for deletion)
- Document data ownership (who maintains each source)
Phase 2: Data Cleanup (1 month)
- Archive or delete outdated docs (2+ years old, no recent views)
- Standardize naming conventions (create style guide)
- Enrich metadata (add tags, categories, owners)
- Fix broken links and references
Phase 3: Ongoing Maintenance
- Quarterly data quality reviews
- Automated stale doc detection (flag if not updated in 6 months; see the sketch after this list)
- User feedback loop (report irrelevant search results)
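The automated stale-doc detection can run as a small scheduled job over the connector metadata. A sketch, assuming `modified` is the RFC 3339 timestamp the connectors capture:
from datetime import datetime, timedelta, timezone

STALE_THRESHOLD = timedelta(days=180)  # "not updated in 6 months"

def find_stale_documents(documents):
    """Flag documents whose last modification is older than the staleness threshold."""
    cutoff = datetime.now(timezone.utc) - STALE_THRESHOLD
    stale = []
    for doc in documents:
        # 'modified' is the RFC 3339 string from the connectors, e.g. '2023-01-15T10:30:00.000Z'
        modified = datetime.fromisoformat(doc['modified'].replace('Z', '+00:00'))
        if modified < cutoff:
            stale.append(doc)
    return stale

# Usage: run quarterly (or monthly), then notify each document's owner for review.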
Cost Analysis for TianPan
One-time setup:
- Initial embedding generation: $4 (20K docs × 2K tokens avg)
- Engineering effort: 80 hours × $100/hour = $8,000
Monthly ongoing:
- Pinecone vector DB: $70/month
- Incremental embedding updates: $0.50/month (500 updated docs/month)
- API costs (Google Drive, GitHub, Jira): $20/month
- Total: $90.50/month
Add to Luis’s infrastructure estimate: $330 + $90.50 = $420.50/month total
Plus Slack Enterprise Search license: $5/user/month × 450 = $2,250/month
Grand total: $2,670.50/month for infrastructure + licenses
Questions for Team
- Data quality: Who will own ongoing data cleanup and governance?
- Sources priority: Which 3 tools should we integrate first? (I suggest: Google Drive, GitHub, Jira)
- Permissions: Are our existing access controls sufficient, or do we need an audit?
- Monitoring: How do we track search relevance over time?
From a data perspective, Enterprise Search is achievable, but it requires an upfront investment in data quality that we’ve been deferring for years. This is the forcing function to finally clean up our data.
Rachel Martinez
Lead Data Engineer @ TianPan