David, Luis - great product and architecture insights. Let me add the data engineering perspective. I attended the “Enterprise Search & Data Integration” workshop. Here’s what it takes to make this work with real enterprise data.
Data Integration: The Hard Part
Challenge: Data is Messy
Our reality at TianPan:
- 15 data sources (Salesforce, Google Drive, GitHub, Jira, Notion, Confluence, internal DBs)
- Different data formats (structured, unstructured, semi-structured)
- Inconsistent naming conventions
- Duplicate records
- Stale data (some docs haven’t been updated in 2+ years)
Enterprise Search promise: “Just connect your tools and search everything!”
Reality: Garbage in, garbage out.
Data Quality Requirements for AI Agents
For agents to work well, data must be:
1. Accurate
- No outdated information
- Remove deprecated docs
- Update “living documents” regularly
2. Complete
- All relevant fields populated
- No critical missing data
- Proper relationships between objects
3. Consistent
- Standardized terminology (“customer” vs “client” vs “account”)
- Unified IDs (same entity across systems)
- Consistent date formats, units, etc.
4. Accessible
- Proper permissions (agents respect access controls)
- Well-indexed (fast retrieval)
- Metadata-rich (tags, categories, owners)
Dreamforce stat: “70% of enterprise search failures are due to poor data quality, not bad algorithms.”
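To make these requirements checkable before anything gets indexed, a small validation pass over each connector's output helps. A minimal sketch (the required fields and the terminology map are illustrative assumptions, not something prescribed at the workshop):
import re

# Illustrative minimums: which metadata every indexed document should carry,
# and how the "customer" vs "client" vs "account" variants map to one term.
REQUIRED_FIELDS = ['source', 'id', 'title', 'content', 'modified']
CANONICAL_TERMS = {'client': 'customer', 'account': 'customer'}

def check_completeness(doc):
    """Return the list of required metadata fields this document is missing."""
    return [field for field in REQUIRED_FIELDS if not doc.get(field)]

def normalize_terminology(text):
    """Rewrite synonym variants to the canonical term before embedding."""
    for term, canonical in CANONICAL_TERMS.items():
        text = re.sub(rf'\b{term}\b', canonical, text, flags=re.IGNORECASE)
    return text
Running the completeness check per source gives us a concrete quality score to track through the cleanup phases described later.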
Enterprise Search Data Pipeline
Step 1: Data Extraction
Connectors we need to set up:
Google Drive (20,000+ files):
# Using Google Drive API
from googleapiclient.discovery import build

drive_service = build('drive', 'v3', credentials=creds)

# List all files, paging through results (pageSize caps out at 1,000 per request)
page_token = None
while True:
    results = drive_service.files().list(
        pageSize=1000,
        fields="nextPageToken, files(id, name, mimeType, modifiedTime, owners, permissions)",
        pageToken=page_token
    ).execute()

    for file in results.get('files', []):
        # Download file content
        # (Google-native Docs/Sheets need files().export_media() instead of get_media())
        content = drive_service.files().get_media(fileId=file['id']).execute()

        # Extract text (handle PDFs, Docs, Sheets differently)
        text = extract_text(content, file['mimeType'])

        # Send to Data 360
        index_document({
            'source': 'google_drive',
            'id': file['id'],
            'title': file['name'],
            'content': text,
            'modified': file['modifiedTime'],
            'permissions': file['permissions']
        })

    page_token = results.get('nextPageToken')
    if page_token is None:
        break
GitHub (500+ repos):
# Index README, wiki, issues, PRs
from github import Github, GithubException

g = Github(access_token)

for repo in g.get_user().get_repos():
    # Index README (not every repo has one)
    try:
        readme = repo.get_readme()
        index_document({
            'source': 'github',
            'type': 'readme',
            'repo': repo.full_name,
            'content': readme.decoded_content.decode(),
            'url': readme.html_url
        })
    except GithubException:
        pass

    # Index issues
    for issue in repo.get_issues(state='all'):
        index_document({
            'source': 'github',
            'type': 'issue',
            'repo': repo.full_name,
            'title': issue.title,
            'content': issue.body,
            'url': issue.html_url,
            'created': issue.created_at
        })
Jira (8,000+ tickets):
from jira import JIRA

jira = JIRA(server='https://tianpan.atlassian.net', basic_auth=(email, token))

# Search all projects
issues = jira.search_issues('project in (PROJ1, PROJ2, PROJ3)', maxResults=10000)

for issue in issues:
    index_document({
        'source': 'jira',
        'id': issue.key,
        'title': issue.fields.summary,
        'description': issue.fields.description,
        'status': issue.fields.status.name,
        'assignee': issue.fields.assignee.displayName if issue.fields.assignee else None,
        'created': issue.fields.created
    })
Step 2: Text Extraction & Cleaning
Challenge: Different file types require different parsers.
File types we handle:
- PDFs: PyPDF2, pdfminer
- Word Docs: python-docx
- Excel: openpyxl, pandas
- Code: Language-specific parsers (AST for Python, etc.)
- Images (OCR): Tesseract
- HTML: BeautifulSoup
Example: PDF extraction:
import re

import PyPDF2

def extract_pdf_text(file_path):
    text = ""
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text += page.extract_text()

    # Clean extracted text
    text = text.replace('\n\n', ' ')   # Collapse blank lines
    text = re.sub(r'\s+', ' ', text)   # Normalize whitespace
    return text
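The `extract_text(content, mime_type)` helper used by the Drive connector above can dispatch on MIME type to the parsers listed earlier. A rough sketch (the MIME-type mapping is my own assumption and far from exhaustive):
import io

import PyPDF2
from bs4 import BeautifulSoup
from docx import Document

def extract_text(content, mime_type):
    """Route raw file bytes to the right parser based on MIME type."""
    if mime_type == 'application/pdf':
        reader = PyPDF2.PdfReader(io.BytesIO(content))
        return ' '.join(page.extract_text() or '' for page in reader.pages)
    if mime_type == 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
        doc = Document(io.BytesIO(content))  # python-docx
        return '\n'.join(p.text for p in doc.paragraphs)
    if mime_type == 'text/html':
        return BeautifulSoup(content, 'html.parser').get_text(separator=' ')
    if mime_type.startswith('text/'):
        return content.decode('utf-8', errors='ignore')
    return ''  # Unsupported type: skip rather than index garbage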
Step 3: Embedding Generation
Convert text → vector representation for semantic search.
Model choice (from Dreamforce recommendations):
- OpenAI text-embedding-ada-002: $0.0001 per 1K tokens, 1536 dimensions
- Cohere embed-english-v3: $0.0001 per 1K tokens, 1024 dimensions
- Open source (Sentence Transformers): Free, 768 dimensions
For 20,000 documents:
import openai  # pre-1.0 openai SDK interface

def generate_embeddings(documents):
    embeddings = []
    for doc in documents:
        # Chunk large documents (max 8,191 tokens for ada-002)
        chunks = chunk_text(doc['content'], max_tokens=8000)
        for i, chunk in enumerate(chunks):
            response = openai.Embedding.create(
                input=chunk,
                model="text-embedding-ada-002"
            )
            embedding = response['data'][0]['embedding']
            embeddings.append({
                'doc_id': doc['id'],
                'chunk_index': i,
                'vector': embedding,
                'text': chunk
            })
    return embeddings

# Cost estimate for 20K docs (avg 2K tokens each):
# 20,000 docs × 2,000 tokens = 40M tokens
# 40M tokens × $0.0001 per 1K = $4 initial indexing cost
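The `chunk_text` helper isn't shown above; a simple token-based splitter using tiktoken (OpenAI's published tokenizer library) might look like this. The overlap value is my assumption, not a workshop recommendation:
import tiktoken

def chunk_text(text, max_tokens=8000, overlap=200):
    """Split text into chunks of at most max_tokens tokens, with a small overlap."""
    encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
    tokens = encoding.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(encoding.decode(window))
        start += max_tokens - overlap  # overlap so context isn't cut mid-thought
    return chunks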
Step 4: Vector Storage
Store embeddings in vector database for fast similarity search.
Options:
- Pinecone: Managed, $70/month for 10M vectors
- Weaviate: Open source, self-hosted
- Milvus: Open source, good for large scale
- pgvector: PostgreSQL extension (if already using Postgres)
Pinecone example:
import pinecone

pinecone.init(api_key='...', environment='us-east-1-aws')
index = pinecone.Index('enterprise-search')

# Upsert vectors as (id, values, metadata) tuples
index.upsert(vectors=[
    ('doc_1_chunk_0', embedding_1, {'source': 'google_drive', 'title': 'Q4 Plan'}),
    ('doc_2_chunk_0', embedding_2, {'source': 'jira', 'title': 'Feature request'}),
    # ... 20,000 more
])

# Query
query_embedding = openai.Embedding.create(
    input="Q4 product roadmap",
    model="text-embedding-ada-002"
)
results = index.query(
    vector=query_embedding['data'][0]['embedding'],
    top_k=10,
    include_metadata=True  # needed so title/source come back with each match
)

for result in results['matches']:
    print(f"{result['metadata']['title']} - Score: {result['score']}")
Data Freshness & Incremental Updates
Problem: Data changes constantly. How do we keep search index up-to-date?
Strategy 1: Scheduled Full Re-Index
Frequency: Weekly
Approach: Re-index everything from scratch
Pros: Simple, ensures consistency
Cons: Expensive (API costs, compute time)
Not recommended for production.
Strategy 2: Incremental Updates
Frequency: Hourly or real-time
Approach: Only update changed documents
Implementation:
from datetime import datetime, timezone

# Track last update timestamp (Drive expects RFC 3339, e.g. '2024-01-15T00:00:00Z')
last_update = get_last_update_time()

# Query only files modified since the last run
modified_files = drive_service.files().list(
    q=f"modifiedTime > '{last_update}'",
    fields="files(id, name, modifiedTime)"
).execute()

for file in modified_files.get('files', []):
    # Re-index only this file
    update_index(file)

# Update timestamp
set_last_update_time(datetime.now(timezone.utc))
Cost savings: ~95% reduction vs full re-index.
Strategy 3: Webhook-Based Real-Time Updates
Best for: Critical documents that change frequently
Example: Google Drive webhooks:
# Set up webhook
drive_service.files().watch(
    fileId='12345',
    body={
        'id': 'unique-channel-id',
        'type': 'web_hook',
        'address': 'https://our-endpoint.com/webhooks/google-drive'
    }
).execute()

# Webhook handler (assuming a FastAPI app)
from fastapi import Request

@app.post('/webhooks/google-drive')
async def handle_drive_webhook(request: Request):
    file_id = request.headers['X-Goog-Resource-ID']
    # File was modified, re-index it
    await re_index_file(file_id)
    return {'status': 'ok'}
Permission-Aware Search
Critical requirement: Users should only see search results they have access to.
Challenge: Permissions are defined differently across tools.
- Google Drive: Explicit sharing (“[email protected] can view”)
- GitHub: Org membership + repo permissions
- Jira: Project roles
- Salesforce: Complex sharing rules, field-level security
Unified approach:
def search_with_permissions(query, user_email):
    # Generate query embedding
    query_vector = embed(query)

    # Vector search (get top 100 candidates)
    candidates = vector_db.query(query_vector, top_k=100)

    # Filter by permissions
    accessible_results = []
    for doc in candidates:
        if user_can_access(doc, user_email):
            accessible_results.append(doc)
            if len(accessible_results) >= 10:
                break  # Return top 10
    return accessible_results

def user_can_access(doc, user_email):
    source = doc['source']
    if source == 'google_drive':
        # Check Google Drive permissions
        return user_email in doc['permissions']
    elif source == 'github':
        # Check GitHub org membership
        return user_in_github_org(user_email, doc['repo_org'])
    elif source == 'jira':
        # Check Jira project access
        return user_has_jira_project_access(user_email, doc['project'])
    # ... etc
    return False  # Default deny for unknown sources
Performance consideration: Permission checks add latency (100ms per check).
Optimization: Cache user permissions, refresh every 15 minutes.
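One way to implement that cache is a simple in-process TTL cache keyed by user. This sketch assumes a `fetch_permissions(user_email)` helper that does the slow per-source lookups (both names are mine):
import time

PERMISSION_TTL = 15 * 60  # seconds, matching the 15-minute refresh above
_permission_cache = {}    # user_email -> (fetched_at, permissions)

def get_cached_permissions(user_email):
    """Return the user's permissions, hitting the source systems at most every 15 minutes."""
    now = time.time()
    cached = _permission_cache.get(user_email)
    if cached and now - cached[0] < PERMISSION_TTL:
        return cached[1]

    permissions = fetch_permissions(user_email)  # slow: queries Drive, GitHub, Jira, etc.
    _permission_cache[user_email] = (now, permissions)
    return permissions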
Data Governance
Questions from security team:
1. Data residency: Where is indexed data stored?
- Answer: Pinecone US East (AWS) for vectors, Salesforce multi-tenant for metadata
2. Data retention: How long do we keep old document versions?
- Answer: 90 days (configurable)
3. PII detection: What if indexed documents contain SSNs, credit cards?
- Answer: Salesforce Shield scans and redacts automatically
4. GDPR right to be forgotten: How do we remove user data?
- Answer: Delete from source system, index auto-updates within 1 hour
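For the deletion path, the index update is just removing every chunk we stored for that document. With the Pinecone client from the examples above, a sketch might be (the `doc_<id>_chunk_<n>` ID convention follows the upsert example; the chunk-count lookup is an assumption):
def remove_document_from_index(doc_id):
    """Delete all chunks for a document so it disappears from search results."""
    chunk_count = get_chunk_count(doc_id)  # assumed to be tracked at indexing time
    chunk_ids = [f"{doc_id}_chunk_{i}" for i in range(chunk_count)]
    index.delete(ids=chunk_ids)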
My Data Quality Recommendations
Before enabling Enterprise Search:
Phase 1: Data Audit (2 weeks)
- Catalog all data sources (done: 15 sources)
- Measure data quality per source (accuracy, completeness)
- Identify deprecated/stale documents (candidates for deletion)
- Document data ownership (who maintains each source)
Phase 2: Data Cleanup (1 month)
- Archive or delete outdated docs (2+ years old, no recent views)
- Standardize naming conventions (create style guide)
- Enrich metadata (add tags, categories, owners)
- Fix broken links and references
Phase 3: Ongoing Maintenance
- Quarterly data quality reviews
- Automated stale doc detection (flag if not updated in 6 months; see the sketch after this list)
- User feedback loop (report irrelevant search results)
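The automated stale-doc detection can run as a small scheduled job over the connector metadata. A sketch, assuming `modified` is the RFC 3339 timestamp the connectors capture:
from datetime import datetime, timedelta, timezone

STALE_THRESHOLD = timedelta(days=180)  # "not updated in 6 months"

def find_stale_documents(documents):
    """Flag documents whose last modification is older than the staleness threshold."""
    cutoff = datetime.now(timezone.utc) - STALE_THRESHOLD
    stale = []
    for doc in documents:
        # 'modified' is the RFC 3339 string from the connectors, e.g. '2023-01-15T10:30:00.000Z'
        modified = datetime.fromisoformat(doc['modified'].replace('Z', '+00:00'))
        if modified < cutoff:
            stale.append(doc)
    return stale

# Usage: run quarterly (or monthly), then notify each document's owner for review.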
Cost Analysis for TianPan
One-time setup:
- Initial embedding generation: $4 (20K docs × 2K tokens avg)
- Engineering effort: 80 hours × $100/hour = $8,000
Monthly ongoing:
- Pinecone vector DB: $70/month
- Incremental embedding updates: $0.50/month (500 updated docs/month)
- API costs (Google Drive, GitHub, Jira): $20/month
- Total: $90.50/month
Add to Luis’s infrastructure estimate: $330 + $90.50 = $420.50/month total
Plus Slack Enterprise Search license: $5/user/month × 450 = $2,250/month
Grand total: $2,670.50/month for infrastructure + licenses
Questions for Team
- Data quality: Who will own ongoing data cleanup and governance?
- Sources priority: Which 3 tools should we integrate first? (I suggest: Google Drive, GitHub, Jira)
- Permissions: Are our existing access controls sufficient, or do we need an audit?
- Monitoring: How do we track search relevance over time?
From a data perspective, Enterprise Search is achievable, but it requires an upfront investment in data quality that we’ve been deferring for years. This is the forcing function to finally clean up our data.
Rachel Martinez
Lead Data Engineer @ TianPan