Prismatic · Enterprise AI Orchestration

#Observability Is Not Optional

The OTEL doctrine (Observability Telemetry Enforcement Layer) is one of the platform’s 18 enforcement pillars. It mandates that every GenServer, controller, LiveView, and external API call must emit telemetry events. Observability is not something you bolt on after a production incident – it is a design constraint from day one.

#The :telemetry Foundation

The :telemetry library provides a lightweight, in-VM event system for Elixir and Erlang applications. Each event has a name (a list of atoms), a map of measurements, and a map of metadata:

:telemetry.execute(
  [:prismatic, :osint, :adapter, :execute],
  %{duration: duration_ms, result_count: length(results)},
  %{adapter: adapter_name, query: sanitized_query}
)

The key principle is separation of emission and handling. The code that does work emits events. Completely separate code decides what to do with those events – log them, aggregate them, send them to Prometheus, or trigger alerts.
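The separation can be sketched as follows. This is a minimal illustration, not the platform's code: the `Sketch.AdapterLogger` module, the handler ID, and the sample values are hypothetical, and the `Mix.install` line is only there so the snippet runs as a standalone script.

```elixir
# Mix.install is only here so the sketch runs outside a Mix project.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Sketch.AdapterLogger do
  # Hypothetical handler: formats the event and hands it to the pid in its config.
  def handle_event([:prismatic, :osint, :adapter, :execute], measurements, metadata, config) do
    line =
      "#{metadata.adapter} returned #{measurements.result_count} results " <>
        "in #{measurements.duration}ms"

    send(config.notify, {:handled, line})
  end
end

# Handling side: attached once, typically at application start.
:ok =
  :telemetry.attach(
    "sketch-adapter-logger",
    [:prismatic, :osint, :adapter, :execute],
    &Sketch.AdapterLogger.handle_event/4,
    %{notify: self()}
  )

# Emitting side: knows nothing about who is listening.
:telemetry.execute(
  [:prismatic, :osint, :adapter, :execute],
  %{duration: 42, result_count: 3},
  %{adapter: :whois, query: "example.org"}
)

# Handlers run synchronously in the emitting process, so the message is already here.
line =
  receive do
    {:handled, text} -> text
  after
    0 -> "no handler ran"
  end

IO.puts(line)
```

Swapping the handler for one that pushes to a metrics reporter or an alerting system would not require touching the emitting code.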

#GenServer Telemetry

Every GenServer in the platform emits telemetry for three lifecycle events:

#init/1

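@impl true
def init(config) do
  start_time = System.monotonic_time()
  # ... initialization logic that builds `state` ...
  # Duration is in native time units; convert with System.convert_time_unit/3 if needed.
  duration = System.monotonic_time() - start_time

  :telemetry.execute(
    [:prismatic, :genserver, :init],
    %{duration: duration},
    # Map.keys/1 assumes `config` is a map; only the keys are recorded, not the values.
    %{module: __MODULE__, config_keys: Map.keys(config)}
  )

  {:ok, state}
end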

#handle_call/handle_cast

Every message handler emits duration, queue length, and result status. This data reveals which GenServers are bottlenecks and which message types are slowest.
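One way to emit these three values is to wrap each handler body in a small helper. The sketch below is an assumption about how this could look: the `[:prismatic, :genserver, :handle_call]` event name, the `Sketch.Counter` and `Sketch.Forward` modules, and the status mapping are all hypothetical.

```elixir
# Mix.install is only here so the sketch runs as a standalone script.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Sketch.Counter do
  use GenServer

  def start_link(initial), do: GenServer.start_link(__MODULE__, initial)
  def increment(pid), do: GenServer.call(pid, :increment)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:increment, _from, count) do
    instrument(:increment, fn -> {:reply, {:ok, count + 1}, count + 1} end)
  end

  # Wraps a handler body and emits duration, mailbox depth, and result status.
  defp instrument(message, fun) do
    start = System.monotonic_time()
    {:message_queue_len, queue_len} = Process.info(self(), :message_queue_len)
    reply = fun.()
    duration = System.monotonic_time() - start

    :telemetry.execute(
      [:prismatic, :genserver, :handle_call],
      %{duration: duration, queue_length: queue_len},
      %{module: __MODULE__, message: message, status: status(reply)}
    )

    reply
  end

  # Hypothetical mapping from the handler's reply to a coarse status.
  defp status({:reply, {:ok, _value}, _state}), do: :ok
  defp status({:reply, :ok, _state}), do: :ok
  defp status(_other), do: :error
end

defmodule Sketch.Forward do
  # Forwards every event to the pid passed as handler config.
  def handle_event(event, measurements, metadata, pid) do
    send(pid, {event, measurements, metadata})
  end
end

:ok =
  :telemetry.attach(
    "sketch-handle-call",
    [:prismatic, :genserver, :handle_call],
    &Sketch.Forward.handle_event/4,
    self()
  )

{:ok, pid} = Sketch.Counter.start_link(0)
{:ok, 1} = Sketch.Counter.increment(pid)

event =
  receive do
    {[:prismatic, :genserver, :handle_call], measurements, metadata} -> {measurements, metadata}
  after
    1_000 -> :no_event
  end

IO.inspect(event)
```

The queue length is sampled before the handler body runs, so it reflects the backlog the message waited behind rather than messages that arrived during handling.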

#terminate/2

Termination events capture the reason and the final state size. Unexpected terminations (anything other than :normal or :shutdown) trigger escalation alerts.
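A termination event along these lines might look like the sketch below. The event name, the `escalate` flag, and the use of `:erlang.external_size/1` as the size measure are assumptions for illustration; the source does not say how state size is computed.

```elixir
# Mix.install is only here so the sketch runs as a standalone script.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Sketch.Termination do
  # Per the text, only :normal and :shutdown count as expected reasons.
  def expected?(reason), do: reason in [:normal, :shutdown]

  # Intended to be called from a GenServer's terminate/2 callback.
  def emit(module, reason, state) do
    :telemetry.execute(
      [:prismatic, :genserver, :terminate],
      # Size of the state when serialized to the external term format, in bytes.
      %{state_size_bytes: :erlang.external_size(state)},
      %{module: module, reason: reason, escalate: not expected?(reason)}
    )
  end
end

IO.inspect(Sketch.Termination.expected?(:shutdown))
IO.inspect(Sketch.Termination.expected?({:badmatch, :error}))
```

Keep in mind that `terminate/2` is not guaranteed to run for every exit, so a handler built on this event sees only the terminations the callback actually observes.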

#Controller and LiveView Instrumentation

Phoenix already emits telemetry for HTTP requests via Plug.Telemetry. The platform extends this with:

#Request Context

:telemetry.execute(
  [:prismatic, :request, :complete],
  %{duration: duration_ms, status: status_code},
  %{
    path: conn.request_path,
    method: conn.method,
    user_id: get_user_id(conn),
    request_id: Logger.metadata()[:request_id]
  }
)
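One possible way to produce this event is to listen for the stop event that Plug.Telemetry emits and re-emit it with platform metadata. The sketch assumes the endpoint uses the conventional `[:phoenix, :endpoint]` event prefix and that the authenticated user lives under `conn.assigns[:user_id]`; the `Sketch.RequestTelemetry` module and handler ID are hypothetical.

```elixir
# Mix.install is only here so the sketch runs as a standalone script.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Sketch.RequestTelemetry do
  # Handles the stop event emitted by Plug.Telemetry for the endpoint.
  def handle_event([:phoenix, :endpoint, :stop], %{duration: native}, %{conn: conn}, _config) do
    :telemetry.execute(
      [:prismatic, :request, :complete],
      %{
        # Plug.Telemetry reports native time units; convert to milliseconds.
        duration: System.convert_time_unit(native, :native, :millisecond),
        status: conn.status
      },
      %{
        path: conn.request_path,
        method: conn.method,
        # Assumption: the user ID is stored in conn.assigns by an auth plug.
        user_id: conn.assigns[:user_id],
        request_id: Logger.metadata()[:request_id]
      }
    )
  end

  # Call once, for example from the application's start/2 callback.
  def attach do
    :telemetry.attach(
      "sketch-request-complete",
      [:phoenix, :endpoint, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end
end
```

Because the handler runs in the request process, `Logger.metadata/0` still holds the request ID set earlier in the plug pipeline.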

#LiveView Mount and Event Handling

LiveView mounts and event handlers emit timing data that feeds into the performance gate system. The PERF doctrine mandates:

LiveView mount: less than 150ms
Event handling: less than 100ms
Page load: less than 250ms

A violation is flagged in the telemetry dashboard. If it persists across multiple measurements, it blocks deployment.
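The gate logic can be sketched as a pure function. The budgets come from the thresholds above; the rule that three violating samples block a deployment is a hypothetical stand-in for "multiple measurements", and `Sketch.PerfGate` is not a real platform module.

```elixir
defmodule Sketch.PerfGate do
  # Budgets in milliseconds, taken from the PERF thresholds in the text.
  @budgets_ms %{liveview_mount: 150, event_handling: 100, page_load: 250}

  # Hypothetical rule: flag any violation, block once `min_violations` samples
  # meet or exceed the budget ("less than 150ms" means 150 is already too slow).
  def check(kind, samples_ms, min_violations \\ 3) do
    budget = Map.fetch!(@budgets_ms, kind)
    violations = Enum.count(samples_ms, &(&1 >= budget))

    cond do
      violations >= min_violations -> {:block, violations}
      violations > 0 -> {:flag, violations}
      true -> :ok
    end
  end
end

IO.inspect(Sketch.PerfGate.check(:liveview_mount, [120, 160, 140]))
# → {:flag, 1}
IO.inspect(Sketch.PerfGate.check(:event_handling, [130, 101, 99, 250]))
# → {:block, 3}
```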

#Span Creation for Distributed Tracing

For operations that span multiple processes or external calls, the platform creates trace spans:

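def execute_investigation(case_id) do
  span_id = generate_span_id()
  # Shared metadata for the start and stop events.
  metadata = %{case_id: case_id, span_id: span_id}

  :telemetry.span(
    [:prismatic, :investigation, :execute],
    metadata,
    fn ->
      result = do_investigation(case_id)
      # The stop event carries only the metadata returned here, so merge the IDs back in.
      {result, Map.put(metadata, :entity_count, length(result.entities))}
    end
  )
end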

:telemetry.span/3 automatically emits a start event and a stop event, or an exception event if the function raises. The stop event carries a duration measured on the monotonic clock in native time units. Spans can be nested and correlated using span IDs.
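To make the emitted events concrete, the sketch below attaches one handler to all three span events and runs a trivial span. The `Sketch.SpanObserver` module, the handler ID, and the sample entities are hypothetical.

```elixir
# Mix.install is only here so the sketch runs as a standalone script.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Sketch.SpanObserver do
  # One clause covers :start, :stop, and :exception because only the last atom differs.
  def handle_event([:prismatic, :investigation, :execute, phase], measurements, metadata, pid) do
    send(pid, {phase, measurements, metadata})
  end
end

prefix = [:prismatic, :investigation, :execute]

:ok =
  :telemetry.attach_many(
    "sketch-span-observer",
    for(phase <- [:start, :stop, :exception], do: prefix ++ [phase]),
    &Sketch.SpanObserver.handle_event/4,
    self()
  )

# span/3 returns the first element of the tuple returned by the function.
entities =
  :telemetry.span(prefix, %{case_id: "case-1"}, fn ->
    found = [:acme_ltd, :jane_doe]
    {found, %{case_id: "case-1", entity_count: length(found)}}
  end)

# Selective receive: skip the :start message and pick out the :stop event.
stop =
  receive do
    {:stop, measurements, metadata} -> {measurements, metadata}
  after
    1_000 -> :no_stop_event
  end

IO.inspect(entities)
IO.inspect(stop)
```

The stop measurements include `:duration` in native units; `System.convert_time_unit/3` turns it into milliseconds or microseconds for display.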

#Metric Collection and Aggregation

Raw telemetry events are ephemeral: they fire and are gone. The platform uses Telemetry.Metrics to define the aggregations it wants to keep, and a reporter computes and exports them:

| Metric Type | Example | Purpose |
| --- | --- | --- |
| Counter | `prismatic.osint.adapter.execute.count` | How many times each adapter runs |
| Sum | `prismatic.osint.adapter.execute.result_count` | Total results produced |
| Last Value | `prismatic.genserver.mailbox_length` | Current mailbox depth |
| Distribution | `prismatic.request.duration` | Latency percentiles (p50, p95, p99) |

These metrics are exported to Prometheus for long-term storage and visualized in Grafana.
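Definitions for the four metrics above might look like the following sketch. The tags and the `Sketch.Metrics` module are assumptions. Note that Telemetry.Metrics derives the event name from all but the last segment of the metric name, so the distribution needs an explicit `:event_name` to bind to the `[:prismatic, :request, :complete]` event shown earlier.

```elixir
# Mix.install is only here so the sketch runs as a standalone script.
Mix.install([{:telemetry_metrics, "~> 1.0"}])

defmodule Sketch.Metrics do
  import Telemetry.Metrics

  # Returns metric definitions only; a reporter does the actual aggregation.
  def metrics do
    [
      # Counts [:prismatic, :osint, :adapter, :execute] events, split by adapter.
      counter("prismatic.osint.adapter.execute.count", tags: [:adapter]),
      # Sums the :result_count measurement of the same event.
      sum("prismatic.osint.adapter.execute.result_count", tags: [:adapter]),
      # Assumes a [:prismatic, :genserver] event with a :mailbox_length measurement.
      last_value("prismatic.genserver.mailbox_length", tags: [:module]),
      # The metric name alone would point at [:prismatic, :request], so bind it explicitly.
      distribution("prismatic.request.duration",
        event_name: [:prismatic, :request, :complete],
        measurement: :duration,
        unit: :millisecond
      )
    ]
  end
end

Enum.each(Sketch.Metrics.metrics(), fn metric ->
  IO.inspect({metric.__struct__, metric.event_name, metric.measurement})
end)
```

The list is typically handed to a reporter in the supervision tree, for example one from the telemetry_metrics_prometheus package, which attaches the handlers and exposes the values for scraping.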

#Structured Logging

The platform uses structured logging exclusively. No unstructured string interpolation:

# Correct: structured metadata
Logger.info("Investigation completed",
  case_id: case_id,
  entity_count: length(entities),
  duration_ms: duration,
  source: :dd_engine
)

# Incorrect: unstructured string
# Logger.info("Investigation #{case_id} found #{length(entities)} entities in #{duration}ms")

Structured logs are machine-parseable, searchable, and can be correlated across services using request IDs and span IDs.
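Correlation works best when the IDs are set once per process rather than repeated in every call. The sketch below uses `Logger.metadata/1` for that; the `Sketch.Correlated` module and the ID values are hypothetical.

```elixir
require Logger

defmodule Sketch.Correlated do
  # Sets correlation IDs once; every later log call in this process carries them.
  def run(request_id, span_id, fun) do
    Logger.metadata(request_id: request_id, span_id: span_id)
    fun.()
  end
end

Sketch.Correlated.run("req-123", "span-abc", fn ->
  # Per-call metadata is merged with the process-level request_id and span_id.
  Logger.info("Investigation completed", case_id: "case-1", entity_count: 4)
end)

IO.inspect(Logger.metadata())
```

Whether the keys appear in the output depends on the formatter configuration, which must list the metadata keys to print.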

#The ErrorFeed Real-Time Dashboard

The ErrorFeed is a LiveView dashboard at /admin/error-feed that provides real-time visibility into platform errors:

#Features

Live stream: Errors appear in real-time via PubSub subscription to "error_patterns"
Pattern detection: The PatternTracker identifies recurring error signatures and groups them
Severity classification: Errors are classified as critical, warning, or info based on type and frequency
Root cause linking: Each error links to the telemetry span that produced it, enabling one-click drill-down
Trend analysis: Rolling 1-hour, 6-hour, and 24-hour error rate charts

#Architecture

Application Code
    |
    v
StreamLoggerBackend (captures all log levels)
    |
    v
PatternTracker (classifies and groups)
    |
    v
PubSub "error_patterns" topic
    |
    v
ErrorFeedLive (renders in browser)

The StreamLoggerBackend is a custom Logger backend that intercepts all log events at runtime. It filters for error-level events and forwards them to the PatternTracker, which maintains a sliding window of recent errors and identifies patterns.
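The sliding-window idea can be sketched with plain data structures. This is not the platform's PatternTracker: the 60-second window, the `{kind, source}` signature shape, and the `Sketch.PatternTracker` module are assumptions chosen for illustration.

```elixir
defmodule Sketch.PatternTracker do
  # Hypothetical window length; the source does not state one.
  @window_ms 60_000

  def new, do: []

  # Records an error signature and drops entries that have left the window.
  def record(entries, signature, now_ms) do
    [{signature, now_ms} | entries]
    |> Enum.filter(fn {_signature, at} -> now_ms - at < @window_ms end)
  end

  # Groups the current window by signature, most frequent first.
  def patterns(entries) do
    entries
    |> Enum.frequencies_by(fn {signature, _at} -> signature end)
    |> Enum.sort_by(fn {signature, count} -> {-count, signature} end)
  end
end

entries =
  Sketch.PatternTracker.new()
  |> Sketch.PatternTracker.record({:timeout, :whois_adapter}, 0)
  |> Sketch.PatternTracker.record({:timeout, :whois_adapter}, 30_000)
  |> Sketch.PatternTracker.record({:badarg, :parser}, 61_000)
  |> Sketch.PatternTracker.record({:timeout, :whois_adapter}, 62_000)

IO.inspect(Sketch.PatternTracker.patterns(entries))
# → [{{:timeout, :whois_adapter}, 2}, {{:badarg, :parser}, 1}]
```

The first timeout falls out of the window once the clock passes 60 seconds, which is why only two of the three timeouts are counted.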

#OTEL Doctrine Enforcement

The OTEL pillar is enforced at two levels:

Pre-commit: The doctrine checker scans modified files for GenServers without telemetry emissions, controllers without request logging, and rescue blocks without error logging.
CI/CD: mix check.doctrines --pillar otel runs a comprehensive audit of all modules, flagging any that lack the required telemetry integration.

Violations are advisory in pre-commit (warning) but blocking in CI (the build fails). This gives developers a chance to fix issues before pushing while ensuring nothing reaches production without proper observability.
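The two enforcement levels can be illustrated with a deliberately naive check. The real doctrine checker is not shown in the source; this sketch uses plain text matching instead of AST analysis, and the `Sketch.OtelCheck` module, rule names, and exit codes are hypothetical.

```elixir
defmodule Sketch.OtelCheck do
  # Naive text-based rules; a real checker would inspect the AST.
  def violations(source) when is_binary(source) do
    rules = [
      {String.contains?(source, "use GenServer") and
         not String.contains?(source, ":telemetry."), :genserver_without_telemetry},
      {source =~ ~r/^\s*rescue\b/m and not (source =~ ~r/Logger\.(error|warning)/),
       :rescue_without_error_logging}
    ]

    # Keep only the names of rules whose condition evaluated to true.
    for {true, name} <- rules, do: name
  end

  # Advisory in pre-commit, blocking in CI, as described in the text.
  def exit_code([], _stage), do: 0
  def exit_code(_violations, :pre_commit), do: 0
  def exit_code(_violations, :ci), do: 1
end

source = """
defmodule Worker do
  use GenServer
  def init(state), do: {:ok, state}
end
"""

found = Sketch.OtelCheck.violations(source)
IO.inspect(found)
IO.inspect({Sketch.OtelCheck.exit_code(found, :pre_commit), Sketch.OtelCheck.exit_code(found, :ci)})
```

The same list of violations yields a zero exit code in pre-commit and a non-zero one in CI, which mirrors the warning-versus-failure split.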

If you cannot observe it, you cannot improve it. If you cannot measure it, you cannot manage it.
