Backend Utility Functions

Technical documentation for core utility functions in backend/my_utils.py that support AI processing, content formatting, and system integration.

Overview

The my_utils.py module provides essential utility functions used throughout the backend for AI content processing, file handling, and system integration. These utilities are core to the VAC Service Architecture and are used extensively by the Email Integration system.

Key Areas:

  • AI content formatting and processing
  • Thinking tag management for email compatibility
  • File and MIME type handling
  • Token counting for model limits
  • Protocol buffer conversion utilities
  • Timer and duration formatting

Core Functions

Content Processing

format_python_objects(msg: str) -> str

Transforms raw Python object representations into user-friendly, readable text.

Purpose: Converts technical Python output from tools and AI processing into human-readable content suitable for email responses and UI display.

Transformations:

Dictionary Formatting:

# Before
"{'search_results': 'found items', 'count': 5}"

# After  
"search_results: found items | count: 5"

List Formatting:

# Before
"['item1', 'item2', 'item3']"

# After
"item1, item2, item3"

# Long lists (>3 items)
"5 items: item1, item2..."

Timer Messages:

# Before
"search_function [123s]"

# After
"⏱️ Search Function (2m 3s)"

Technical Term Replacement:

replacements = {
    'func_name': 'function',
    'datastore_id': 'search database', 
    'vertex_search': 'AI search',
    'google_search_retrieval': 'web search',
    'code_execution': 'code runner',
    'extract_from_files': 'file analysis'
}

Integration: Used extensively in email responses to make AI output more accessible to non-technical users.

strip_thinking_tags(text: str) -> str

Removes internal AI reasoning content that shouldn’t be visible to users.

Purpose: Essential for email integration where thinking tags would confuse recipients and break email formatting.

Processing:

  • Case-insensitive removal (<thinking>, <THINKING>, <Thinking>)
  • Multiline content support with re.DOTALL
  • Whitespace cleanup to prevent empty lines
  • Preserves user-visible content structure

Example:

response = """<thinking>
Let me analyze this request carefully.
The user is asking about React components.
I should provide practical examples.
</thinking>

Here's how to create a React component:

1. Use functional components with hooks
2. Follow proper naming conventions
3. Implement TypeScript interfaces

<thinking>
I should also mention testing best practices.
</thinking>

4. Write comprehensive tests for your components"""

cleaned = strip_thinking_tags(response)
# Returns:
# """Here's how to create a React component:
# 
# 1. Use functional components with hooks
# 2. Follow proper naming conventions  
# 3. Implement TypeScript interfaces
# 
# 4. Write comprehensive tests for your components"""

Critical for: Email responses, public API endpoints, and any user-facing content.

check_and_display_thinking(msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync)

Formats and streams thinking content for development and debugging.

Purpose: Provides visibility into AI reasoning during development while maintaining clean user experience in production.

Features:

  • Escapes thinking tags to prevent UI conflicts (<thinking><&#8203thinking>)
  • Applies format_python_objects() for readable output
  • Streams content via callback for real-time display
  • Handles null callback gracefully with error logging

Example:

await check_and_display_thinking(
    "Processing search query with vertex_search tool [45s]",
    callback
)
# Streams: "⏱️ Processing Search Query With AI Search Tool (45s)"

File and Data Handling

sanitize_file(filename: str) -> str

Converts filenames to Gemini API-compatible format.

Gemini API Requirements:

  • Only lowercase alphanumeric characters and dashes
  • Cannot begin or end with dashes
  • Maximum 40 character length

Process:

def sanitize_file(filename):
    name, extension = os.path.splitext(filename)
    name = name.lower()
    
    # Replace non-alphanumeric with dashes
    sanitized_name = re.sub(r'[^a-z0-9]', '-', name)
    
    # Remove consecutive dashes  
    sanitized_name = re.sub(r'-+', '-', sanitized_name)
    
    # Remove leading/trailing dashes
    sanitized_name = sanitized_name.strip('-')
    
    # Fallback for empty names
    if not sanitized_name:
        sanitized_name = 'file'
        
    return sanitized_name[:40]  # Length limit

Examples:

sanitize_file("My_File-Name!.pdf")     # → "my-file-name"
sanitize_file("--special--chars--.txt") # → "special-chars"  
sanitize_file("ä-ö-ü-émoji.doc")       # → "file"

get_mime_type(file_name: str) -> str

Determines MIME type for files using intelligent detection.

Integration: Uses sunholo.utils.mime.guess_mime_type() for robust detection.

Logging: Includes detailed logging for debugging file type issues:

log.info(f"{file_name=} {mime_type=}")

Common Types:

{
    "document.pdf": "application/pdf",
    "script.py": "text/x-python", 
    "data.json": "application/json",
    "image.png": "image/png",
    "video.mp4": "video/mp4"
}

extract_bucket_name(document_url: str) -> Optional[str]

Extracts Firebase Storage bucket name from URLs.

URL Format: https://firebasestorage.googleapis.com/v0/b/BUCKET_NAME/...

Example:

url = "https://firebasestorage.googleapis.com/v0/b/aitana-prod/o/documents%2Ffile.pdf"
bucket = extract_bucket_name(url)  # → "aitana-prod"

Error Handling: Returns None for invalid URLs with error logging.

AI Model Integration

count_tokens(contents) -> int

Counts tokens for model limit management.

Integration: Uses genai_client.models.count_tokens() for accurate counts.

Usage:

# Check before processing
token_count = await count_tokens(message_content)

if token_count > TOKEN_INPUT_LIMIT:
    # Handle content truncation
    await nice_errors("Content too long", callback)
    return

Constants:

TOKEN_INPUT_LIMIT = 180000 * 5    # Standard limit
FAST_TOKEN_LIMIT = 180000 * 5     # Fast processing limit  

nice_errors(error_msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync, type: str = "error")

Streams user-friendly error messages with proper formatting.

Features:

  • HTML alert formatting for UI display
  • Configurable error types (error, warning, info)
  • Null callback protection with logging

Example:

await nice_errors(
    "Rate limit exceeded. Please wait 30 seconds.", 
    callback, 
    type="warning"
)
# Streams: "<alert type='warning'>Rate limit exceeded. Please wait 30 seconds.</alert>"

Data Structure Conversion

convert_composite_to_native(value)

Recursively converts Google Protocol Buffer objects to native Python types.

Purpose: Handles proto.marshal.collections objects from Google AI APIs.

Supported Types:

  • MapCompositedict
  • RepeatedCompositelist
  • Primitives → unchanged

Example:

# Before (proto object)
result = MapComposite({
    'data': RepeatedComposite([
        MapComposite({'id': 1, 'name': 'Item 1'}),
        MapComposite({'id': 2, 'name': 'Item 2'})
    ])
})

# After conversion
native_result = convert_composite_to_native(result)
# Returns:
# {
#     'data': [
#         {'id': 1, 'name': 'Item 1'},
#         {'id': 2, 'name': 'Item 2'}
#     ]
# }

Time and Duration Utilities

timedelta_to_iso8601(td: timedelta) -> str

Converts Python timedelta objects to ISO 8601 duration format.

ISO 8601 Format: P[n]DT[n]H[n]M[n]S

Examples:

timedelta_to_iso8601(timedelta(days=2, hours=5, minutes=30, seconds=15))
# Returns: "P2DT5H30M15S"

timedelta_to_iso8601(timedelta(minutes=45))  
# Returns: "P0DT0H45M0S"

timedelta_to_iso8601(timedelta(0))
# Returns: "P0DT0H0M0S"

Text Processing Utilities

format_human_chat_history(chat_history: List[Dict]) -> str

Formats chat history for AI context inclusion.

Format:

chat_history = [
    {"name": "user", "content": "Hello"},
    {"name": "assistant", "content": "Hi there"},
    {"name": "user", "content": "How are you?"}
]

formatted = format_human_chat_history(chat_history)
# Returns:
# "user: Hello\nassistant: Hi there\nuser: How are you?"

strip_div_tags(text: str) -> str

Removes HTML div tags while preserving content.

Use Cases:

  • Cleaning AI-generated content for text-only contexts
  • Preparing content for plain text email fallbacks
  • Sanitizing content for external APIs

Example:

html_content = '<div class="response">Important content</div>'
clean_text = strip_div_tags(html_content)  # → "Important content"

Integration Patterns

Email Processing Integration

The utility functions are extensively used in the Email Integration system:

# Email content processing pipeline
class EmailProcessor:
    def clean_assistant_response(self, response: str) -> str:
        """Clean response for email compatibility."""
        return strip_thinking_tags(response)
    
    def process_email_message(self, ...):
        # Format message content
        formatted_content = format_python_objects(email_content)
        
        # Clean AI response for email
        cleaned_response = strip_thinking_tags(ai_response)
        
        # Create email-friendly version
        return self._send_email_response(cleaned_response, ...)

VAC Service Integration

Core utilities support the VAC Service Architecture:

# Token management
content_tokens = await count_tokens(full_content)
if content_tokens > TOKEN_INPUT_LIMIT:
    await nice_errors("Content exceeds token limit", callback)
    return truncate_content(full_content)

# File processing
for document in documents:
    sanitized_name = sanitize_file(document['name'])
    mime_type = get_mime_type(document['name'])
    # Process with sanitized filename

Streaming Response Processing

Real-time content formatting for user interfaces:

async def process_streaming_response(content_stream, callback):
    for chunk in content_stream:
        # Format raw AI output
        formatted_chunk = format_python_objects(chunk)
        
        # Handle thinking content separately
        if '<thinking>' in formatted_chunk:
            await check_and_display_thinking(formatted_chunk, callback)
        else:
            await callback.async_on_llm_new_token(formatted_chunk)

Performance Considerations

Regex Optimization

Thinking Tag Removal:

  • Uses compiled regex patterns for efficiency
  • re.IGNORECASE | re.DOTALL flags for comprehensive matching
  • Single-pass processing for large content

Object Formatting:

  • Cached regex patterns for repeated use
  • Early returns for non-matching content
  • Efficient string replacement strategies

Memory Management

Large Content Handling:

  • Streaming processing for large files
  • Incremental token counting
  • Lazy evaluation for optional processing

Protocol Buffer Conversion:

  • Recursive processing with depth limits
  • Memory-efficient traversal
  • Garbage collection friendly patterns

Error Handling

Robust Processing

File Operations:

def extract_bucket_name(document_url):
    try:
        parts = document_url.split('/v0/b/')
        if len(parts) < 2:
            return None
        return parts[1].split('/')[0]
    except Exception as e:
        log.error(f"Error extracting bucket name: {e}")
        return None

Content Processing:

def format_python_objects(msg: str) -> str:
    try:
        # Complex processing logic
        return processed_msg
    except Exception as e:
        log.debug(f"Error formatting Python objects: {e}")
        return msg  # Return original on failure

Graceful Degradation

Callback Handling:

async def check_and_display_thinking(msg, callback):
    if callback is None:
        log.error(f"No callback for thinking message {msg}")
        return ""  # Fail silently

MIME Type Detection:

def get_mime_type(file_name):
    if not file_name:
        return None  # Handle None gracefully
    return guess_mime_type(file_name)

Testing

See Email Integration Testing Guide for comprehensive testing strategies including utility function tests.

Key Test Areas:

  • Content formatting edge cases
  • File sanitization boundary conditions
  • Protocol buffer conversion accuracy
  • Error handling and recovery
  • Performance with large content