Architecture

Design Philosophy

mcp-datahub follows these core principles:

Standalone First: Works as a complete MCP server out of the box
Library Second: Import into custom servers for composition
Island Architecture: No dependencies on other txn2 libraries
Direct API Integration: Calls DataHub GraphQL API directly
Domain Types Stay Local: All types defined within this library

Package Structure

flowchart TB
    subgraph root["github.com/txn2/mcp-datahub"]
        cmd["cmd/mcp-datahub<br/>CLI entry point"]
        internal["internal/server<br/>Server setup (not exported)"]
        subgraph pkg["pkg/"]
            client["client<br/>DataHub GraphQL client"]
            tools["tools<br/>MCP tool implementations"]
            types["types<br/>Domain types"]
            integration["integration<br/>Hook interfaces"]
            multiserver["multiserver<br/>Multi-server management"]
            extensions["extensions<br/>Config file, logging, metrics, errors"]
        end
    end

Component Diagram

flowchart TB
    subgraph server["Your MCP Server"]
        subgraph datahub["mcp-datahub"]
            tools["tools<br/>(Toolkit)"]
            client["client<br/>(Client)"]
            types["types<br/>(Entities)"]
            tools --> client
            client --> types
        end
    end
    client --> api["DataHub GraphQL<br/>API"]

Request Lifecycle

When a tool is called, the request flows through multiple layers:

sequenceDiagram
    participant AI as AI Assistant
    participant Server as MCP Server
    participant MW as Middleware Chain
    participant Handler as Tool Handler
    participant Client as DataHub Client
    participant API as DataHub API

    AI->>Server: CallTool(datahub_search, {query: "customer"})
    Server->>MW: Before hooks
    MW->>MW: Auth middleware
    MW->>MW: Rate limit middleware
    MW->>MW: Access filter
    MW->>Handler: Execute handler
    Handler->>Client: Search(ctx, "customer")
    Client->>API: GraphQL query
    API-->>Client: Response
    Client-->>Handler: SearchResult
    Handler-->>MW: CallToolResult
    MW->>MW: Metadata enricher
    MW->>MW: Access filter (results)
    MW->>MW: Audit logger
    MW-->>Server: Final result
    Server-->>AI: Tool response

Middleware Chain

Middleware wraps tool handlers to add cross-cutting concerns:

flowchart LR
    subgraph Before["Before Hooks (in order)"]
        B1[URN Resolver]
        B2[Access Filter]
        B3[User Middleware]
    end

    subgraph Handler
        H[Tool Handler]
    end

    subgraph After["After Hooks (reverse order)"]
        A1[User Middleware]
        A2[Metadata Enricher]
        A3[Access Filter]
        A4[Audit Logger]
    end

    B1 --> B2 --> B3 --> H --> A1 --> A2 --> A3 --> A4

Execution Order:

1. URN Resolver (Before): Translate external IDs to DataHub URNs
2. Access Filter (Before): Check if user can access the entity
3. User Middleware (Before): Custom pre-processing
4. Tool Handler: Execute the actual tool logic
5. User Middleware (After): Custom post-processing
6. Metadata Enricher (After): Add custom metadata to response
7. Access Filter (After): Filter results by access
8. Audit Logger (After): Log the tool invocation

Tools Only Design

This library exposes MCP Tools only. It does not expose Resources or Prompts.

Rationale:

Tools are the natural fit for DataHub operations (search, get, list)
Resources imply static content; DataHub content is dynamic and query-driven
Prompts are use-case specific; add them in your custom MCP servers

Client Architecture

The DataHub client handles all communication with the DataHub GraphQL API:

flowchart TB
    subgraph Client
        config[Configuration]
        http[HTTP Client]
        retry[Retry Logic]
        graphql[GraphQL Builder]
    end

    subgraph Operations
        search[Search]
        entity[GetEntity]
        schema[GetSchema]
        lineage[GetLineage]
        queries[GetQueries]
    end

    config --> http
    http --> retry
    retry --> graphql

    search --> graphql
    entity --> graphql
    schema --> graphql
    lineage --> graphql
    queries --> graphql

    graphql --> api[DataHub GraphQL API]

Client Features:

Feature	Description
Connection pooling	Reuses HTTP connections
Automatic retries	Retries failed requests with backoff
Timeout handling	Configurable request timeouts
Error wrapping	Wraps errors with context
Version compatibility	Handles DataHub version differences

Multi-Server Architecture

The multi-server component manages connections to multiple DataHub instances:

flowchart TB
    subgraph MultiServer
        manager[Connection Manager]
        config[Server Config]
        cache[Client Cache]
    end

    subgraph Connections
        prod[prod: DataHub Client]
        staging[staging: DataHub Client]
        dev[dev: DataHub Client]
    end

    manager --> config
    manager --> cache
    cache --> prod
    cache --> staging
    cache --> dev

    prod --> api1[Production API]
    staging --> api2[Staging API]
    dev --> api3[Dev API]
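
The manager, config, and cache relationship in the diagram could look roughly like the sketch below. All names here (`Manager`, `ServerConfig`, `Client`, `Get`) are hypothetical stand-ins rather than the `multiserver` package's real API; the point is only the lazy, mutex-guarded client cache keyed by server name.

```go
package main

import (
	"fmt"
	"sync"
)

// ServerConfig is a hypothetical per-server configuration.
type ServerConfig struct {
	URL string
}

// Client is a hypothetical stand-in for a DataHub client.
type Client struct {
	URL string
}

// Manager hands out one cached client per configured server name.
type Manager struct {
	mu      sync.Mutex
	configs map[string]ServerConfig
	clients map[string]*Client
}

// NewManager creates a manager with an empty client cache.
func NewManager(configs map[string]ServerConfig) *Manager {
	return &Manager{configs: configs, clients: make(map[string]*Client)}
}

// Get returns the cached client for name, creating it on first use.
func (m *Manager) Get(name string) (*Client, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if c, ok := m.clients[name]; ok {
		return c, nil
	}
	cfg, ok := m.configs[name]
	if !ok {
		return nil, fmt.Errorf("unknown server %q", name)
	}
	c := &Client{URL: cfg.URL}
	m.clients[name] = c
	return c, nil
}

func main() {
	m := NewManager(map[string]ServerConfig{
		"prod":    {URL: "https://datahub.example.com"},
		"staging": {URL: "https://datahub-staging.example.com"},
	})
	if c, err := m.Get("prod"); err == nil {
		fmt.Println("prod ->", c.URL)
	}
	if _, err := m.Get("qa"); err != nil {
		fmt.Println(err)
	}
}
```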

Integration Points

The library provides extension points for enterprise features:

flowchart TB
    subgraph Toolkit
        handler[Tool Handlers]
    end

    subgraph Integration
        qp[QueryProvider]
        af[AccessFilter]
        al[AuditLogger]
        ur[URNResolver]
        me[MetadataEnricher]
    end

    qp --> handler
    af --> handler
    al --> handler
    ur --> handler
    me --> handler

    qp --> trino[Trino Client]
    af --> authsvc[Auth Service]
    al --> auditdb[(Audit DB)]
    ur --> idmap[(ID Mapping)]
    me --> metasvc[Metadata Service]

Error Handling Strategy

Errors are handled consistently across the library:

flowchart TB
    subgraph Errors
        network[Network Error]
        auth[Auth Error]
        notfound[Not Found]
        validation[Validation Error]
        internal[Internal Error]
    end

    subgraph Handling
        retry[Retry with Backoff]
        reject[Reject Request]
        empty[Return Empty]
        wrap[Wrap and Return]
    end

    network --> retry
    auth --> reject
    notfound --> empty
    validation --> reject
    internal --> wrap

Error Policies:

Error Type	Policy	Retries
Network timeout	Retry with backoff	Up to 3
401 Unauthorized	Reject immediately	None
404 Not Found	Return empty result	None
400 Bad Request	Reject with details	None
500 Server Error	Retry with backoff	Up to 3
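
The HTTP rows of the policy table can be expressed as a small classification function. This is a sketch under assumptions: the `Policy` type and `PolicyFor` function are hypothetical, every 5xx status is treated like the 500 row, and network timeouts are left out because they surface as Go errors rather than status codes.

```go
package main

import (
	"fmt"
	"net/http"
)

// Policy is a hypothetical enumeration of the handling strategies above.
type Policy int

const (
	PolicyPass   Policy = iota // not an error: pass the response through
	PolicyRetry                // retry with backoff
	PolicyReject               // reject the request
	PolicyEmpty                // return an empty result
)

// String makes policies readable when printed.
func (p Policy) String() string {
	switch p {
	case PolicyRetry:
		return "retry with backoff"
	case PolicyReject:
		return "reject"
	case PolicyEmpty:
		return "return empty result"
	default:
		return "pass through"
	}
}

// PolicyFor maps an HTTP status code to a handling policy.
func PolicyFor(status int) Policy {
	switch {
	case status == http.StatusUnauthorized, status == http.StatusBadRequest:
		return PolicyReject
	case status == http.StatusNotFound:
		return PolicyEmpty
	case status >= 500: // assumption: all 5xx responses behave like 500
		return PolicyRetry
	default:
		return PolicyPass
	}
}

func main() {
	for _, status := range []int{200, 400, 401, 404, 500} {
		fmt.Printf("%d -> %s\n", status, PolicyFor(status))
	}
}
```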

Caching Strategy

The library does not implement caching by default. This is intentional:

DataHub data changes frequently
Cache invalidation is complex
Different use cases need different caching strategies

To add caching, wrap individual tools with your own middleware (cacheMiddleware below is user-supplied, not part of the library):

toolkit := tools.NewToolkit(client,
    tools.WithToolMiddleware(tools.ToolGetEntity, cacheMiddleware),
    tools.WithToolMiddleware(tools.ToolGetSchema, cacheMiddleware),
)

Thread Safety

All built-in components are safe for concurrent use:

Client uses connection pooling with proper synchronization
Toolkit can handle concurrent tool calls
Custom middleware is the exception: it must be stateless or properly synchronized

Integration Hooks

The library provides interfaces for extending functionality:

// URNResolver resolves external IDs to DataHub URNs
type URNResolver interface {
    ResolveToDataHubURN(ctx context.Context, externalID string) (string, error)
}

// AccessFilter controls entity access
type AccessFilter interface {
    CanAccess(ctx context.Context, urn string) (bool, error)
    FilterURNs(ctx context.Context, urns []string) ([]string, error)
}

// AuditLogger logs tool invocations
type AuditLogger interface {
    LogToolCall(ctx context.Context, tool string, params map[string]any, userID string) error
}

// MetadataEnricher adds custom metadata to responses
type MetadataEnricher interface {
    EnrichEntity(ctx context.Context, urn string, data map[string]any) (map[string]any, error)
}

// QueryProvider injects query execution context
type QueryProvider interface {
    Name() string
    ResolveTable(ctx context.Context, urn string) (*TableIdentifier, error)
    GetTableAvailability(ctx context.Context, urn string) (*TableAvailability, error)
    GetQueryExamples(ctx context.Context, urn string) ([]QueryExample, error)
    GetExecutionContext(ctx context.Context, urns []string) (*ExecutionContext, error)
    Close() error
}

Composability: How to compose with other toolkits
Quick Start: Get started using the library
API Reference: Complete API documentation