Backend Utility Functions
Technical documentation for core utility functions in backend/my_utils.py that support AI processing, content formatting, and system integration.
Overview
The my_utils.py module provides essential utility functions used throughout the backend for AI content processing, file handling, and system integration. These utilities are core to the VAC Service Architecture and are used extensively by the Email Integration system.
Key Areas:
- AI content formatting and processing
- Thinking tag management for email compatibility
- File and MIME type handling
- Token counting for model limits
- Protocol buffer conversion utilities
- Timer and duration formatting
Core Functions
Content Processing
format_python_objects(msg: str) -> str
Transforms raw Python object representations into user-friendly, readable text.
Purpose: Converts technical Python output from tools and AI processing into human-readable content suitable for email responses and UI display.
Transformations:
Dictionary Formatting:
# Before
"{'search_results': 'found items', 'count': 5}"
# After
"search_results: found items | count: 5"
List Formatting:
# Before
"['item1', 'item2', 'item3']"
# After
"item1, item2, item3"
# Long lists (>3 items)
"5 items: item1, item2..."
Timer Messages:
# Before
"search_function [123s]"
# After
"⏱️ Search Function (2m 3s)"
Technical Term Replacement:
replacements = {
'func_name': 'function',
'datastore_id': 'search database',
'vertex_search': 'AI search',
'google_search_retrieval': 'web search',
'code_execution': 'code runner',
'extract_from_files': 'file analysis'
}
Integration: Used extensively in email responses to make AI output more accessible to non-technical users.
strip_thinking_tags(text: str) -> str
Removes internal AI reasoning content that shouldn’t be visible to users.
Purpose: Essential for email integration where thinking tags would confuse recipients and break email formatting.
Processing:
- Case-insensitive removal (
<thinking>,<THINKING>,<Thinking>) - Multiline content support with
re.DOTALL - Whitespace cleanup to prevent empty lines
- Preserves user-visible content structure
Example:
response = """<thinking>
Let me analyze this request carefully.
The user is asking about React components.
I should provide practical examples.
</thinking>
Here's how to create a React component:
1. Use functional components with hooks
2. Follow proper naming conventions
3. Implement TypeScript interfaces
<thinking>
I should also mention testing best practices.
</thinking>
4. Write comprehensive tests for your components"""
cleaned = strip_thinking_tags(response)
# Returns:
# """Here's how to create a React component:
#
# 1. Use functional components with hooks
# 2. Follow proper naming conventions
# 3. Implement TypeScript interfaces
#
# 4. Write comprehensive tests for your components"""
Critical for: Email responses, public API endpoints, and any user-facing content.
check_and_display_thinking(msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync)
Formats and streams thinking content for development and debugging.
Purpose: Provides visibility into AI reasoning during development while maintaining clean user experience in production.
Features:
- Escapes thinking tags to prevent UI conflicts (
<thinking>→<​thinking>) - Applies
format_python_objects()for readable output - Streams content via callback for real-time display
- Handles null callback gracefully with error logging
Example:
await check_and_display_thinking(
"Processing search query with vertex_search tool [45s]",
callback
)
# Streams: "⏱️ Processing Search Query With AI Search Tool (45s)"
File and Data Handling
sanitize_file(filename: str) -> str
Converts filenames to Gemini API-compatible format.
Gemini API Requirements:
- Only lowercase alphanumeric characters and dashes
- Cannot begin or end with dashes
- Maximum 40 character length
Process:
def sanitize_file(filename):
name, extension = os.path.splitext(filename)
name = name.lower()
# Replace non-alphanumeric with dashes
sanitized_name = re.sub(r'[^a-z0-9]', '-', name)
# Remove consecutive dashes
sanitized_name = re.sub(r'-+', '-', sanitized_name)
# Remove leading/trailing dashes
sanitized_name = sanitized_name.strip('-')
# Fallback for empty names
if not sanitized_name:
sanitized_name = 'file'
return sanitized_name[:40] # Length limit
Examples:
sanitize_file("My_File-Name!.pdf") # → "my-file-name"
sanitize_file("--special--chars--.txt") # → "special-chars"
sanitize_file("ä-ö-ü-émoji.doc") # → "file"
get_mime_type(file_name: str) -> str
Determines MIME type for files using intelligent detection.
Integration: Uses sunholo.utils.mime.guess_mime_type() for robust detection.
Logging: Includes detailed logging for debugging file type issues:
log.info(f"{file_name=} {mime_type=}")
Common Types:
{
"document.pdf": "application/pdf",
"script.py": "text/x-python",
"data.json": "application/json",
"image.png": "image/png",
"video.mp4": "video/mp4"
}
extract_bucket_name(document_url: str) -> Optional[str]
Extracts Firebase Storage bucket name from URLs.
URL Format: https://firebasestorage.googleapis.com/v0/b/BUCKET_NAME/...
Example:
url = "https://firebasestorage.googleapis.com/v0/b/aitana-prod/o/documents%2Ffile.pdf"
bucket = extract_bucket_name(url) # → "aitana-prod"
Error Handling: Returns None for invalid URLs with error logging.
AI Model Integration
count_tokens(contents) -> int
Counts tokens for model limit management.
Integration: Uses genai_client.models.count_tokens() for accurate counts.
Usage:
# Check before processing
token_count = await count_tokens(message_content)
if token_count > TOKEN_INPUT_LIMIT:
# Handle content truncation
await nice_errors("Content too long", callback)
return
Constants:
TOKEN_INPUT_LIMIT = 180000 * 5 # Standard limit
FAST_TOKEN_LIMIT = 180000 * 5 # Fast processing limit
nice_errors(error_msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync, type: str = "error")
Streams user-friendly error messages with proper formatting.
Features:
- HTML alert formatting for UI display
- Configurable error types (error, warning, info)
- Null callback protection with logging
Example:
await nice_errors(
"Rate limit exceeded. Please wait 30 seconds.",
callback,
type="warning"
)
# Streams: "<alert type='warning'>Rate limit exceeded. Please wait 30 seconds.</alert>"
Data Structure Conversion
convert_composite_to_native(value)
Recursively converts Google Protocol Buffer objects to native Python types.
Purpose: Handles proto.marshal.collections objects from Google AI APIs.
Supported Types:
MapComposite→dictRepeatedComposite→list- Primitives → unchanged
Example:
# Before (proto object)
result = MapComposite({
'data': RepeatedComposite([
MapComposite({'id': 1, 'name': 'Item 1'}),
MapComposite({'id': 2, 'name': 'Item 2'})
])
})
# After conversion
native_result = convert_composite_to_native(result)
# Returns:
# {
# 'data': [
# {'id': 1, 'name': 'Item 1'},
# {'id': 2, 'name': 'Item 2'}
# ]
# }
Time and Duration Utilities
timedelta_to_iso8601(td: timedelta) -> str
Converts Python timedelta objects to ISO 8601 duration format.
ISO 8601 Format: P[n]DT[n]H[n]M[n]S
Examples:
timedelta_to_iso8601(timedelta(days=2, hours=5, minutes=30, seconds=15))
# Returns: "P2DT5H30M15S"
timedelta_to_iso8601(timedelta(minutes=45))
# Returns: "P0DT0H45M0S"
timedelta_to_iso8601(timedelta(0))
# Returns: "P0DT0H0M0S"
Text Processing Utilities
format_human_chat_history(chat_history: List[Dict]) -> str
Formats chat history for AI context inclusion.
Format:
chat_history = [
{"name": "user", "content": "Hello"},
{"name": "assistant", "content": "Hi there"},
{"name": "user", "content": "How are you?"}
]
formatted = format_human_chat_history(chat_history)
# Returns:
# "user: Hello\nassistant: Hi there\nuser: How are you?"
strip_div_tags(text: str) -> str
Removes HTML div tags while preserving content.
Use Cases:
- Cleaning AI-generated content for text-only contexts
- Preparing content for plain text email fallbacks
- Sanitizing content for external APIs
Example:
html_content = '<div class="response">Important content</div>'
clean_text = strip_div_tags(html_content) # → "Important content"
Integration Patterns
Email Processing Integration
The utility functions are extensively used in the Email Integration system:
# Email content processing pipeline
class EmailProcessor:
def clean_assistant_response(self, response: str) -> str:
"""Clean response for email compatibility."""
return strip_thinking_tags(response)
def process_email_message(self, ...):
# Format message content
formatted_content = format_python_objects(email_content)
# Clean AI response for email
cleaned_response = strip_thinking_tags(ai_response)
# Create email-friendly version
return self._send_email_response(cleaned_response, ...)
VAC Service Integration
Core utilities support the VAC Service Architecture:
# Token management
content_tokens = await count_tokens(full_content)
if content_tokens > TOKEN_INPUT_LIMIT:
await nice_errors("Content exceeds token limit", callback)
return truncate_content(full_content)
# File processing
for document in documents:
sanitized_name = sanitize_file(document['name'])
mime_type = get_mime_type(document['name'])
# Process with sanitized filename
Streaming Response Processing
Real-time content formatting for user interfaces:
async def process_streaming_response(content_stream, callback):
for chunk in content_stream:
# Format raw AI output
formatted_chunk = format_python_objects(chunk)
# Handle thinking content separately
if '<thinking>' in formatted_chunk:
await check_and_display_thinking(formatted_chunk, callback)
else:
await callback.async_on_llm_new_token(formatted_chunk)
Performance Considerations
Regex Optimization
Thinking Tag Removal:
- Uses compiled regex patterns for efficiency
re.IGNORECASE | re.DOTALLflags for comprehensive matching- Single-pass processing for large content
Object Formatting:
- Cached regex patterns for repeated use
- Early returns for non-matching content
- Efficient string replacement strategies
Memory Management
Large Content Handling:
- Streaming processing for large files
- Incremental token counting
- Lazy evaluation for optional processing
Protocol Buffer Conversion:
- Recursive processing with depth limits
- Memory-efficient traversal
- Garbage collection friendly patterns
Error Handling
Robust Processing
File Operations:
def extract_bucket_name(document_url):
try:
parts = document_url.split('/v0/b/')
if len(parts) < 2:
return None
return parts[1].split('/')[0]
except Exception as e:
log.error(f"Error extracting bucket name: {e}")
return None
Content Processing:
def format_python_objects(msg: str) -> str:
try:
# Complex processing logic
return processed_msg
except Exception as e:
log.debug(f"Error formatting Python objects: {e}")
return msg # Return original on failure
Graceful Degradation
Callback Handling:
async def check_and_display_thinking(msg, callback):
if callback is None:
log.error(f"No callback for thinking message {msg}")
return "" # Fail silently
MIME Type Detection:
def get_mime_type(file_name):
if not file_name:
return None # Handle None gracefully
return guess_mime_type(file_name)
Testing
See Email Integration Testing Guide for comprehensive testing strategies including utility function tests.
Key Test Areas:
- Content formatting edge cases
- File sanitization boundary conditions
- Protocol buffer conversion accuracy
- Error handling and recovery
- Performance with large content
Related Documentation
- Backend Email API - Email system that uses these utilities extensively
- VAC Service Architecture - Core AI pipeline integration
- Email Integration Testing - Testing strategies and examples
- Tool Context - Tool system that relies on these utilities