Sparkline: Building a Content Analysis SaaS in Elixir
Inside Sparkline's NLP pipeline: sentiment analysis, entity extraction, topic modeling, and API integration for content intelligence at scale.
Tomas Korcak (korczis)
Prismatic Platform
Content analysis is one of those domains where the gap between a demo and a production system is enormous. A demo can run sentiment analysis on a single paragraph in a Jupyter notebook. A production system needs to process thousands of documents per hour, handle multiple languages, extract structured entities, model topics across a corpus, and expose results through a stable API. Sparkline is Prismatic's content analysis subsystem, running as a standalone Phoenix application on port 4002.
The Pipeline
Every document entering Sparkline passes through a five-stage pipeline:
Ingestion → Preprocessing → Analysis → Enrichment → Storage/API
Each stage is implemented as a separate module with a consistent interface, allowing stages to be composed, replaced, or parallelized independently.
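The source doesn't show the stage contract itself, but the composability described above can be sketched as a reducer over stages that share a `{:ok, doc} | {:error, reason}` interface (the `Pipeline` module and stage functions below are illustrative, not Sparkline's actual code):

```elixir
defmodule Pipeline do
  # Runs each stage in order; any {:error, _} short-circuits the chain,
  # which is what lets stages be swapped or reordered independently.
  def run(doc, stages) do
    Enum.reduce_while(stages, {:ok, doc}, fn stage, {:ok, acc} ->
      case stage.(acc) do
        {:ok, next} -> {:cont, {:ok, next}}
        {:error, _} = err -> {:halt, err}
      end
    end)
  end
end

# Two toy stages standing in for real pipeline modules
upcase = fn doc -> {:ok, %{doc | content: String.upcase(doc.content)}} end
count = fn doc -> {:ok, Map.put(doc, :length, String.length(doc.content))} end

{:ok, out} = Pipeline.run(%{content: "hello"}, [upcase, count])
# out.content == "HELLO", out.length == 5
```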
Stage 1: Ingestion
Documents enter through either the REST API or internal PubSub messages. The ingestion stage normalizes input format and creates a processing context:
defmodule Sparkline.Pipeline.Ingestion do
  @moduledoc """
  Document ingestion with format normalization
  and processing context creation.
  """

  @spec ingest(map()) :: {:ok, Sparkline.Document.t()} | {:error, term()}
  def ingest(params) do
    with {:ok, content} <- extract_content(params),
         {:ok, language} <- detect_language(content),
         {:ok, doc} <- create_document(content, language, params) do
      {:ok, doc}
    end
  end

  defp extract_content(%{"text" => text}) when is_binary(text), do: {:ok, text}
  defp extract_content(%{"url" => url}) when is_binary(url), do: fetch_and_extract(url)
  defp extract_content(%{"html" => html}) when is_binary(html), do: {:ok, strip_html(html)}
  defp extract_content(_), do: {:error, :invalid_content}

  defp detect_language(content) do
    case Lingua.detect(content) do
      {:ok, lang} -> {:ok, lang}
      # Fall back to English when detection fails
      :error -> {:ok, :en}
    end
  end
end
Stage 2: Preprocessing
Preprocessing prepares the raw text for analysis. This includes tokenization, sentence splitting, and normalization:
defmodule Sparkline.Pipeline.Preprocessing do
  @moduledoc """
  Text preprocessing: tokenization, sentence splitting,
  stopword removal, and normalization.
  """

  @spec preprocess(Sparkline.Document.t()) :: {:ok, Sparkline.Document.t()}
  def preprocess(doc) do
    sentences = split_sentences(doc.content)
    tokens = Enum.flat_map(sentences, &tokenize/1)
    normalized = normalize_tokens(tokens, doc.language)

    {:ok,
     %{doc
       | sentences: sentences,
         tokens: tokens,
         normalized_tokens: normalized,
         stats: %{
           sentence_count: length(sentences),
           token_count: length(tokens),
           unique_tokens: tokens |> MapSet.new() |> MapSet.size()
         }}}
  end

  # The `u` modifier is required for \p{Lu} and \p{P} to match
  # non-ASCII letters and punctuation (e.g. in Czech text).
  defp split_sentences(text) do
    text
    |> String.split(~r/(?<=[.!?])\s+(?=[A-Z\p{Lu}])/u, trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
  end

  defp tokenize(sentence) do
    sentence
    |> String.downcase()
    |> String.split(~r/[\s\p{P}]+/u, trim: true)
  end
end
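To see what the splitting and tokenizing regexes actually produce, here is a standalone run of the same two expressions (with the unicode modifier) outside the module:

```elixir
# Same regexes as split_sentences/1 and tokenize/1, run directly
split = fn text ->
  text
  |> String.split(~r/(?<=[.!?])\s+(?=[A-Z\p{Lu}])/u, trim: true)
  |> Enum.map(&String.trim/1)
end

tokenize = fn s ->
  s |> String.downcase() |> String.split(~r/[\s\p{P}]+/u, trim: true)
end

sentences = split.("Prices rose. Analysts disagreed! Why?")
# => ["Prices rose.", "Analysts disagreed!", "Why?"]

tokens = Enum.flat_map(sentences, tokenize)
# => ["prices", "rose", "analysts", "disagreed", "why"]
```

The lookbehind/lookahead pair splits only between a sentence-ending punctuation mark and a following capital letter, so abbreviations mid-sentence are less likely to trigger a false split.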
Stage 3: Analysis
The analysis stage runs multiple analyzers in parallel using Task.async_stream:
defmodule Sparkline.Pipeline.Analysis do
  @moduledoc """
  Parallel analysis orchestrator. Runs sentiment, entity,
  and topic analyzers concurrently.
  """

  @analyzers [
    Sparkline.Analyzer.Sentiment,
    Sparkline.Analyzer.Entity,
    Sparkline.Analyzer.Topic,
    Sparkline.Analyzer.Readability
  ]

  @spec analyze(Sparkline.Document.t()) :: {:ok, Sparkline.Document.t()}
  def analyze(doc) do
    results =
      @analyzers
      |> Task.async_stream(
        fn analyzer -> {analyzer, analyzer.analyze(doc)} end,
        max_concurrency: 4,
        timeout: :timer.seconds(30),
        # Without :kill_task, a timeout exits the caller instead of
        # producing the {:exit, _} tuple handled below.
        on_timeout: :kill_task
      )
      |> Enum.reduce(%{}, fn
        {:ok, {analyzer, {:ok, result}}}, acc ->
          Map.put(acc, analyzer.key(), result)

        {:ok, {analyzer, {:error, reason}}}, acc ->
          Map.put(acc, analyzer.key(), %{error: reason})

        # A timed-out analyzer is dropped rather than failing the document
        {:exit, _reason}, acc ->
          acc
      end)

    {:ok, %{doc | analysis: results}}
  end
end
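The orchestrator assumes every analyzer module exposes `analyze/1` and `key/0`. The article doesn't show that contract, but one plausible shape is an Elixir behaviour, sketched here with a toy analyzer (both modules are illustrative):

```elixir
defmodule AnalyzerContract do
  @moduledoc "Sketch of the per-analyzer contract the orchestrator relies on."
  @callback analyze(doc :: map()) :: {:ok, map()} | {:error, term()}
  @callback key() :: atom()
end

defmodule EchoAnalyzer do
  @behaviour AnalyzerContract

  @impl true
  def analyze(doc), do: {:ok, %{token_count: length(doc.tokens)}}

  @impl true
  def key, do: :echo
end

{:ok, res} = EchoAnalyzer.analyze(%{tokens: ["a", "b"]})
# res == %{token_count: 2}
```

A behaviour lets the compiler warn when an analyzer misses a callback, which matters once analyzers are added or swapped independently.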
Sentiment Analysis
The sentiment analyzer operates at both document and sentence level. Document-level sentiment provides an overall tone, while sentence-level analysis reveals where opinions shift:
defmodule Sparkline.Analyzer.Sentiment do
  @moduledoc """
  Multi-level sentiment analysis with lexicon-based scoring
  and contextual adjustments.
  """

  @spec analyze(Sparkline.Document.t()) :: {:ok, map()}
  def analyze(doc) do
    sentence_sentiments =
      Enum.map(doc.sentences, fn sentence ->
        tokens = tokenize(sentence)
        raw_score = compute_lexicon_score(tokens, doc.language)
        adjusted = apply_modifiers(tokens, raw_score)

        %{
          text: sentence,
          score: adjusted,
          label: classify_score(adjusted),
          confidence: compute_confidence(tokens)
        }
      end)

    doc_score =
      sentence_sentiments
      |> Enum.map(& &1.score)
      |> weighted_average()

    {:ok,
     %{
       document: %{score: doc_score, label: classify_score(doc_score)},
       sentences: sentence_sentiments,
       distribution: compute_distribution(sentence_sentiments)
     }}
  end

  defp apply_modifiers(tokens, score) do
    negation_count = count_negations(tokens)
    intensifier_factor = compute_intensifier_factor(tokens)
    # An odd number of negations flips polarity ("not good")
    adjusted = if rem(negation_count, 2) == 1, do: -score, else: score
    adjusted * intensifier_factor
  end
end
The lexicon-based approach uses language-specific sentiment dictionaries with ~8,000 entries per language. We currently support Czech, English, German, and Slovak. The modifier system handles negations ("not good" = negative) and intensifiers ("very good" = more positive than "good").
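The helpers `compute_lexicon_score`, `count_negations`, and `compute_intensifier_factor` aren't shown in the excerpt. A minimal self-contained sketch of that scoring scheme, with a toy three-entry lexicon standing in for the real ~8,000-entry dictionaries, might look like this:

```elixir
# Toy lexicon and modifier lists; the real ones are language-specific
lexicon = %{"good" => 0.7, "bad" => -0.7, "excellent" => 0.9}
negations = MapSet.new(["not", "never"])
intensifiers = %{"very" => 1.5, "slightly" => 0.5}

score = fn tokens ->
  # Sum lexicon values for known tokens
  raw = tokens |> Enum.map(&Map.get(lexicon, &1, 0.0)) |> Enum.sum()

  # Odd negation count flips polarity; intensifiers scale magnitude
  neg = Enum.count(tokens, &MapSet.member?(negations, &1))
  factor = Enum.reduce(tokens, 1.0, fn t, acc -> acc * Map.get(intensifiers, t, 1.0) end)

  flipped = if rem(neg, 2) == 1, do: -raw, else: raw
  flipped * factor
end

score.(["not", "good"])  # => -0.7
score.(["very", "good"]) # => 1.05 (0.7 * 1.5)
```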
Entity Extraction
Entity extraction identifies named entities -- people, organizations, locations, dates, and monetary amounts -- within the text:
defmodule Sparkline.Analyzer.Entity do
  @moduledoc """
  Named entity recognition using pattern matching
  and contextual classification.
  """

  @entity_types [:person, :organization, :location, :date, :money, :ico]

  @spec analyze(Sparkline.Document.t()) :: {:ok, map()}
  def analyze(doc) do
    entities =
      doc.sentences
      |> Enum.flat_map(fn sentence ->
        extract_pattern_entities(sentence) ++
          extract_capitalized_sequences(sentence) ++
          extract_czech_business_ids(sentence)
      end)
      |> deduplicate()
      |> classify()

    {:ok,
     %{
       entities: entities,
       entity_count: length(entities),
       by_type: Enum.group_by(entities, & &1.type)
     }}
  end

  # Matches both "ICO" and the accented "IČO"; the `u` modifier is
  # needed for the regex to handle the non-ASCII Č.
  defp extract_czech_business_ids(text) do
    Regex.scan(~r/\bI[CČ]O[:\s]*(\d{8})\b/u, text)
    |> Enum.map(fn [_full, ico] ->
      %{text: ico, type: :ico, confidence: 0.95}
    end)
  end
end
Czech business IDs (ICO) are particularly important for our due diligence workflows. When Sparkline extracts an ICO from a document, it can automatically trigger an ARES lookup through the OSINT pipeline.
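The article doesn't show how that trigger is wired. One plausible sketch is publishing each extracted ICO as a message for the OSINT pipeline to consume; here the broadcast function is injected so the sketch stays testable, but in the app it would presumably wrap something like `Phoenix.PubSub.broadcast/3` (the topic name, payload shape, and module name below are all assumptions):

```elixir
defmodule EntityEvents do
  # Publishes one lookup request per extracted ICO entity.
  # broadcast_fun stands in for a real PubSub broadcast on an
  # assumed "osint:ares" topic.
  def publish_icos(doc_id, entities, broadcast_fun) do
    entities
    |> Enum.filter(&(&1.type == :ico))
    |> Enum.map(fn %{text: ico} ->
      broadcast_fun.({:ares_lookup, %{document_id: doc_id, ico: ico}})
    end)
  end
end

sent =
  EntityEvents.publish_icos(
    "doc_1",
    [%{text: "12345678", type: :ico}, %{text: "Praha", type: :location}],
    fn msg -> msg end
  )
# sent == [{:ares_lookup, %{document_id: "doc_1", ico: "12345678"}}]
```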
Topic Modeling
Topic modeling uses a TF-IDF approach to identify the dominant themes in a document or corpus:
defmodule Sparkline.Analyzer.Topic do
  @moduledoc """
  TF-IDF based topic extraction with corpus-aware weighting.
  """

  @spec analyze(Sparkline.Document.t()) :: {:ok, map()}
  def analyze(doc) do
    tf = compute_term_frequency(doc.normalized_tokens)
    idf = get_inverse_document_frequency()

    tf_idf =
      tf
      |> Enum.map(fn {term, freq} ->
        {term, freq * Map.get(idf, term, 1.0)}
      end)
      |> Enum.sort_by(fn {_term, score} -> score end, :desc)
      |> Enum.take(20)

    topics = cluster_terms(tf_idf)

    {:ok,
     %{
       top_terms: tf_idf,
       topics: topics,
       topic_count: length(topics)
     }}
  end
end
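The corpus-side IDF table that `get_inverse_document_frequency/0` reads isn't shown. A minimal sketch of computing it from a tokenized corpus follows; the add-one smoothing is an assumption, not necessarily what Sparkline uses:

```elixir
# corpus_tokens: one token list per document
compute_idf = fn corpus_tokens ->
  n = length(corpus_tokens)

  corpus_tokens
  # Count each term at most once per document (document frequency)
  |> Enum.flat_map(&Enum.uniq/1)
  |> Enum.frequencies()
  # Smoothed IDF: rarer terms get higher weight
  |> Map.new(fn {term, df} -> {term, :math.log((n + 1) / (df + 1))} end)
end

idf =
  compute_idf.([
    ["revenue", "czk"],
    ["revenue", "profit"],
    ["weather"]
  ])

# "revenue" appears in 2 of 3 docs -> log(4/3) ≈ 0.288
# "weather" appears in 1 of 3 docs -> log(4/2) ≈ 0.693
```

Terms that occur in nearly every document approach a weight of zero, which is what pushes boilerplate vocabulary out of the top-terms list.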
API Integration
Sparkline exposes a REST API on port 4002 for external consumers:
# POST /api/v1/analyze
%{
  "text" => "Navigara s.r.o., ICO 12345678, reported revenue of 50M CZK...",
  "analyzers" => ["sentiment", "entity", "topic"],
  "language" => "cs"
}

# Response
%{
  "document_id" => "doc_abc123",
  "language" => "cs",
  "analysis" => %{
    "sentiment" => %{"score" => -0.15, "label" => "slightly_negative"},
    "entities" => [
      %{"text" => "Navigara s.r.o.", "type" => "organization"},
      %{"text" => "12345678", "type" => "ico"},
      %{"text" => "50M CZK", "type" => "money"}
    ],
    "topics" => [
      %{"label" => "financial_performance", "terms" => ["revenue", "czk", "reported"]}
    ]
  },
  "processing_time_ms" => 145
}
The API supports both synchronous processing (for small documents) and asynchronous processing via webhooks (for large documents or batch operations). Async requests return immediately with a document_id and call the configured webhook URL when processing completes.
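The sync/async split can be sketched as a dispatcher that analyzes small documents inline and acknowledges large or webhook-bearing requests with a `document_id` for a later callback. The threshold, function names, and ID scheme below are all assumptions for illustration:

```elixir
defmodule Dispatch do
  # Assumed size threshold for inline processing
  @sync_limit 10_000

  def handle(params, analyze_fun, enqueue_fun) do
    text = Map.get(params, "text", "")

    if byte_size(text) <= @sync_limit and params["webhook_url"] == nil do
      # Small document, no webhook: analyze inline and return results
      {:sync, analyze_fun.(text)}
    else
      # Large or webhook request: acknowledge now, deliver via webhook later
      id = "doc_" <> Base.encode16(:crypto.strong_rand_bytes(4), case: :lower)
      enqueue_fun.(id, params)
      {:async, %{"document_id" => id}}
    end
  end
end
```

Injecting `analyze_fun` and `enqueue_fun` keeps the routing decision separate from the pipeline itself, so the same dispatcher serves both the REST controller and internal callers.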
Performance Characteristics
Sparkline is designed for throughput. Because the analyzers run concurrently, adding a new analyzer does not increase end-to-end latency unless it becomes the slowest stage: a document's latency is bounded by its slowest analyzer, not the sum of all of them.
The bottleneck is typically entity extraction for long documents, since the pattern matching scales with document length. We mitigate this by processing sentences in parallel within the entity analyzer.
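That per-sentence parallelism can be sketched with `Task.async_stream` over the sentence list; the single ICO regex here stands in for the analyzer's full pattern set:

```elixir
# Toy per-sentence extractor (real code runs many patterns per sentence)
extract = fn sentence ->
  Regex.scan(~r/\bI[CČ]O[:\s]*(\d{8})\b/u, sentence)
  |> Enum.map(fn [_full, ico] -> %{text: ico, type: :ico} end)
end

entities =
  ["IČO: 12345678 is listed.", "No entities here.", "ICO 87654321 too."]
  |> Task.async_stream(extract, max_concurrency: System.schedulers_online())
  |> Enum.flat_map(fn {:ok, found} -> found end)

# => [%{text: "12345678", type: :ico}, %{text: "87654321", type: :ico}]
```

`Task.async_stream` preserves input order, so sentence-level results can still be zipped back against the original sentence list when offsets matter.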
Sparkline runs as a standalone Phoenix application within the Prismatic umbrella. It can be deployed independently for teams that need content analysis without the full intelligence platform, or it integrates seamlessly with the OSINT and DD pipelines for automated document intelligence.