Backend Utility Functions

Technical documentation for core utility functions in backend/my_utils.py that support AI processing, content formatting, and system integration.

Overview

The my_utils.py module provides essential utility functions used throughout the backend for AI content processing, file handling, and system integration. These utilities are core to the VAC Service Architecture and are used extensively by the Email Integration system.

Key Areas:

AI content formatting and processing
Thinking tag management for email compatibility
File and MIME type handling
Token counting for model limits
Protocol buffer conversion utilities
Timer and duration formatting

Core Functions

Content Processing

`format_python_objects(msg: str) -> str`

Transforms raw Python object representations into user-friendly, readable text.

Purpose: Converts technical Python output from tools and AI processing into human-readable content suitable for email responses and UI display.

Transformations:

Dictionary Formatting:

# Before
"{'search_results': 'found items', 'count': 5}"

# After  
"search_results: found items | count: 5"

List Formatting:

# Before
"['item1', 'item2', 'item3']"

# After
"item1, item2, item3"

# Long lists (>3 items)
"5 items: item1, item2..."

Timer Messages:

# Before
"search_function [123s]"

# After
"⏱️ Search Function (2m 3s)"

Technical Term Replacement:

replacements = {
    'func_name': 'function',
    'datastore_id': 'search database', 
    'vertex_search': 'AI search',
    'google_search_retrieval': 'web search',
    'code_execution': 'code runner',
    'extract_from_files': 'file analysis'
}

Integration: Used extensively in email responses to make AI output more accessible to non-technical users.

`strip_thinking_tags(text: str) -> str`

Removes internal AI reasoning content that shouldn’t be visible to users.

Purpose: Essential for email integration where thinking tags would confuse recipients and break email formatting.

Processing:

Case-insensitive removal (<thinking>, <THINKING>, <Thinking>)
Multiline content support with re.DOTALL
Whitespace cleanup to prevent empty lines
Preserves user-visible content structure

Example:

response = """<thinking>
Let me analyze this request carefully.
The user is asking about React components.
I should provide practical examples.
</thinking>

Here's how to create a React component:

1. Use functional components with hooks
2. Follow proper naming conventions
3. Implement TypeScript interfaces

<thinking>
I should also mention testing best practices.
</thinking>

4. Write comprehensive tests for your components"""

cleaned = strip_thinking_tags(response)
# Returns:
# """Here's how to create a React component:
# 
# 1. Use functional components with hooks
# 2. Follow proper naming conventions  
# 3. Implement TypeScript interfaces
# 
# 4. Write comprehensive tests for your components"""

Critical for: Email responses, public API endpoints, and any user-facing content.

`check_and_display_thinking(msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync)`

Formats and streams thinking content for development and debugging.

Purpose: Provides visibility into AI reasoning during development while maintaining clean user experience in production.

Features:

Escapes thinking tags to prevent UI conflicts (<thinking> → <&#8203thinking>)
Applies format_python_objects() for readable output
Streams content via callback for real-time display
Handles null callback gracefully with error logging

Example:

await check_and_display_thinking(
    "Processing search query with vertex_search tool [45s]",
    callback
)
# Streams: "⏱️ Processing Search Query With AI Search Tool (45s)"

File and Data Handling

`sanitize_file(filename: str) -> str`

Converts filenames to Gemini API-compatible format.

Gemini API Requirements:

Only lowercase alphanumeric characters and dashes
Cannot begin or end with dashes
Maximum 40 character length

Process:

def sanitize_file(filename):
    name, extension = os.path.splitext(filename)
    name = name.lower()
    
    # Replace non-alphanumeric with dashes
    sanitized_name = re.sub(r'[^a-z0-9]', '-', name)
    
    # Remove consecutive dashes  
    sanitized_name = re.sub(r'-+', '-', sanitized_name)
    
    # Remove leading/trailing dashes
    sanitized_name = sanitized_name.strip('-')
    
    # Fallback for empty names
    if not sanitized_name:
        sanitized_name = 'file'
        
    return sanitized_name[:40]  # Length limit

Examples:

sanitize_file("My_File-Name!.pdf")     # → "my-file-name"
sanitize_file("--special--chars--.txt") # → "special-chars"  
sanitize_file("ä-ö-ü-émoji.doc")       # → "file"

`get_mime_type(file_name: str) -> str`

Determines MIME type for files using intelligent detection.

Integration: Uses sunholo.utils.mime.guess_mime_type() for robust detection.

Logging: Includes detailed logging for debugging file type issues:

log.info(f"{file_name=} {mime_type=}")

Common Types:

{
    "document.pdf": "application/pdf",
    "script.py": "text/x-python", 
    "data.json": "application/json",
    "image.png": "image/png",
    "video.mp4": "video/mp4"
}

`extract_bucket_name(document_url: str) -> Optional[str]`

Extracts Firebase Storage bucket name from URLs.

URL Format: https://firebasestorage.googleapis.com/v0/b/BUCKET_NAME/...

Example:

url = "https://firebasestorage.googleapis.com/v0/b/aitana-prod/o/documents%2Ffile.pdf"
bucket = extract_bucket_name(url)  # → "aitana-prod"

Error Handling: Returns None for invalid URLs with error logging.

AI Model Integration

`count_tokens(contents) -> int`

Counts tokens for model limit management.

Integration: Uses genai_client.models.count_tokens() for accurate counts.

Usage:

# Check before processing
token_count = await count_tokens(message_content)

if token_count > TOKEN_INPUT_LIMIT:
    # Handle content truncation
    await nice_errors("Content too long", callback)
    return

Constants:

TOKEN_INPUT_LIMIT = 180000 * 5    # Standard limit
FAST_TOKEN_LIMIT = 180000 * 5     # Fast processing limit  

`nice_errors(error_msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync, type: str = "error")`

Streams user-friendly error messages with proper formatting.

Features:

HTML alert formatting for UI display
Configurable error types (error, warning, info)
Null callback protection with logging

Example:

await nice_errors(
    "Rate limit exceeded. Please wait 30 seconds.", 
    callback, 
    type="warning"
)
# Streams: "<alert type='warning'>Rate limit exceeded. Please wait 30 seconds.</alert>"

Data Structure Conversion

`convert_composite_to_native(value)`

Recursively converts Google Protocol Buffer objects to native Python types.

Purpose: Handles proto.marshal.collections objects from Google AI APIs.

Supported Types:

MapComposite → dict
RepeatedComposite → list
Primitives → unchanged

Example:

# Before (proto object)
result = MapComposite({
    'data': RepeatedComposite([
        MapComposite({'id': 1, 'name': 'Item 1'}),
        MapComposite({'id': 2, 'name': 'Item 2'})
    ])
})

# After conversion
native_result = convert_composite_to_native(result)
# Returns:
# {
#     'data': [
#         {'id': 1, 'name': 'Item 1'},
#         {'id': 2, 'name': 'Item 2'}
#     ]
# }

Time and Duration Utilities

`timedelta_to_iso8601(td: timedelta) -> str`

Converts Python timedelta objects to ISO 8601 duration format.

ISO 8601 Format: P[n]DT[n]H[n]M[n]S

Examples:

timedelta_to_iso8601(timedelta(days=2, hours=5, minutes=30, seconds=15))
# Returns: "P2DT5H30M15S"

timedelta_to_iso8601(timedelta(minutes=45))  
# Returns: "P0DT0H45M0S"

timedelta_to_iso8601(timedelta(0))
# Returns: "P0DT0H0M0S"

Text Processing Utilities

`format_human_chat_history(chat_history: List[Dict]) -> str`

Formats chat history for AI context inclusion.

Format:

chat_history = [
    {"name": "user", "content": "Hello"},
    {"name": "assistant", "content": "Hi there"},
    {"name": "user", "content": "How are you?"}
]

formatted = format_human_chat_history(chat_history)
# Returns:
# "user: Hello\nassistant: Hi there\nuser: How are you?"

`strip_div_tags(text: str) -> str`

Removes HTML div tags while preserving content.

Use Cases:

Cleaning AI-generated content for text-only contexts
Preparing content for plain text email fallbacks
Sanitizing content for external APIs

Example:

html_content = '<div class="response">Important content</div>'
clean_text = strip_div_tags(html_content)  # → "Important content"

Integration Patterns

Email Processing Integration

The utility functions are extensively used in the Email Integration system:

# Email content processing pipeline
class EmailProcessor:
    def clean_assistant_response(self, response: str) -> str:
        """Clean response for email compatibility."""
        return strip_thinking_tags(response)
    
    def process_email_message(self, ...):
        # Format message content
        formatted_content = format_python_objects(email_content)
        
        # Clean AI response for email
        cleaned_response = strip_thinking_tags(ai_response)
        
        # Create email-friendly version
        return self._send_email_response(cleaned_response, ...)

VAC Service Integration

Core utilities support the VAC Service Architecture:

# Token management
content_tokens = await count_tokens(full_content)
if content_tokens > TOKEN_INPUT_LIMIT:
    await nice_errors("Content exceeds token limit", callback)
    return truncate_content(full_content)

# File processing
for document in documents:
    sanitized_name = sanitize_file(document['name'])
    mime_type = get_mime_type(document['name'])
    # Process with sanitized filename

Streaming Response Processing

Real-time content formatting for user interfaces:

async def process_streaming_response(content_stream, callback):
    for chunk in content_stream:
        # Format raw AI output
        formatted_chunk = format_python_objects(chunk)
        
        # Handle thinking content separately
        if '<thinking>' in formatted_chunk:
            await check_and_display_thinking(formatted_chunk, callback)
        else:
            await callback.async_on_llm_new_token(formatted_chunk)

Performance Considerations

Regex Optimization

Thinking Tag Removal:

Uses compiled regex patterns for efficiency
re.IGNORECASE | re.DOTALL flags for comprehensive matching
Single-pass processing for large content

Object Formatting:

Cached regex patterns for repeated use
Early returns for non-matching content
Efficient string replacement strategies

Memory Management

Large Content Handling:

Streaming processing for large files
Incremental token counting
Lazy evaluation for optional processing

Protocol Buffer Conversion:

Recursive processing with depth limits
Memory-efficient traversal
Garbage collection friendly patterns

Error Handling

Robust Processing

File Operations:

def extract_bucket_name(document_url):
    try:
        parts = document_url.split('/v0/b/')
        if len(parts) < 2:
            return None
        return parts[1].split('/')[0]
    except Exception as e:
        log.error(f"Error extracting bucket name: {e}")
        return None

Content Processing:

def format_python_objects(msg: str) -> str:
    try:
        # Complex processing logic
        return processed_msg
    except Exception as e:
        log.debug(f"Error formatting Python objects: {e}")
        return msg  # Return original on failure

Graceful Degradation

Callback Handling:

async def check_and_display_thinking(msg, callback):
    if callback is None:
        log.error(f"No callback for thinking message {msg}")
        return ""  # Fail silently

MIME Type Detection:

def get_mime_type(file_name):
    if not file_name:
        return None  # Handle None gracefully
    return guess_mime_type(file_name)

Testing

See Email Integration Testing Guide for comprehensive testing strategies including utility function tests.

Key Test Areas:

Content formatting edge cases
File sanitization boundary conditions
Protocol buffer conversion accuracy
Error handling and recovery
Performance with large content

Backend Email API - Email system that uses these utilities extensively
VAC Service Architecture - Core AI pipeline integration
Email Integration Testing - Testing strategies and examples
Tool Context - Tool system that relies on these utilities

Backend Utility Functions

Overview

Core Functions

Content Processing

format_python_objects(msg: str) -> str

strip_thinking_tags(text: str) -> str

check_and_display_thinking(msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync)

File and Data Handling

sanitize_file(filename: str) -> str

get_mime_type(file_name: str) -> str

extract_bucket_name(document_url: str) -> Optional[str]

AI Model Integration

count_tokens(contents) -> int

nice_errors(error_msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync, type: str = "error")

Data Structure Conversion

convert_composite_to_native(value)

Time and Duration Utilities

timedelta_to_iso8601(td: timedelta) -> str

Text Processing Utilities

format_human_chat_history(chat_history: List[Dict]) -> str

strip_div_tags(text: str) -> str

Integration Patterns

Email Processing Integration

VAC Service Integration

Streaming Response Processing

Performance Considerations

Regex Optimization

Memory Management

Error Handling

Robust Processing

Graceful Degradation

Testing

Related Documentation

`format_python_objects(msg: str) -> str`

`strip_thinking_tags(text: str) -> str`

`check_and_display_thinking(msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync)`

`sanitize_file(filename: str) -> str`

`get_mime_type(file_name: str) -> str`

`extract_bucket_name(document_url: str) -> Optional[str]`

`count_tokens(contents) -> int`

`nice_errors(error_msg: str, callback: BufferStreamingStdOutCallbackHandlerAsync, type: str = "error")`

`convert_composite_to_native(value)`

`timedelta_to_iso8601(td: timedelta) -> str`

`format_human_chat_history(chat_history: List[Dict]) -> str`

`strip_div_tags(text: str) -> str`