Autonomous web scraping agent runtime. Inspired by Claude Code's architecture, purpose-built for scraping at scale.
Humans set direction. Nexus does the labor.
Seven layers working together. Input flows down from Directors through the Pipeline Engine into the Agent Loop, which calls Tools. Knowledge Base informs every decision. Events flow out to notification sinks.
The heartbeat of Nexus. An async loop that calls LLMs, executes tools, and iterates until the job is done.
Receives directive → builds prompt with context → calls LLM → parses tool calls → executes tools → feeds results back → repeats until done or max iterations hit.
Unified Tool protocol with registry. Read-only tools run concurrently via asyncio.gather. Mutation tools run sequentially. Permission checks gate every execution.
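A minimal sketch of what that unified protocol and registry might look like (the exact field names are assumptions; `is_read_only` and the `registry[name]` lookup mirror how the router uses them):

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class ToolResult:
    name: str
    output: Any
    ok: bool = True


class Tool(Protocol):
    """Contract every tool satisfies (sketch; exact interface is an assumption)."""
    name: str
    is_read_only: bool  # Read-only tools may run concurrently

    async def run(self, **kwargs: Any) -> ToolResult: ...


class ToolRegistry:
    """Name -> Tool lookup used by the router and permission checks."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def __getitem__(self, name: str) -> Tool:
        return self._tools[name]
```

Because `Tool` is a structural `Protocol`, any existing class with a `name`, an `is_read_only` flag, and an async `run` already qualifies without inheriting anything.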
Provider-agnostic LLM abstraction. Testing: DeepSeek API (cheap, tool_use support). Production: self-hosted Gemma 4 via Ollama (zero cost at scale). One config change to swap.
Conversation state with token budget tracking. Auto-compacts when threshold is reached. Save/load for session resume across restarts.
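A sketch of that session object, assuming a simple keep-the-tail compaction policy (a real implementation would summarize the dropped history via the LLM; field names and the `keep_last` knob are illustrative):

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Session:
    """Conversation state with token-budget tracking and save/load for resume."""
    messages: list[dict] = field(default_factory=list)
    _token_costs: list[int] = field(default_factory=list)
    compaction_threshold: int = 80_000  # Mirrors [agent] compaction_threshold
    keep_last: int = 10                 # Messages surviving compaction (placeholder)

    @property
    def token_count(self) -> int:
        return sum(self._token_costs)

    def append(self, message: dict, tokens: int) -> None:
        self.messages.append(message)
        self._token_costs.append(tokens)
        if self.token_count >= self.compaction_threshold:
            self.compact()

    def compact(self) -> None:
        # Placeholder policy: keep only the most recent messages.
        self.messages = self.messages[-self.keep_last:]
        self._token_costs = self._token_costs[-self.keep_last:]

    def save(self, path: str) -> None:
        Path(path).write_text(json.dumps(
            {"messages": self.messages, "token_costs": self._token_costs}))

    @classmethod
    def load(cls, path: str) -> "Session":
        data = json.loads(Path(path).read_text())
        s = cls(messages=data["messages"])
        s._token_costs = data["token_costs"]
        return s
```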
```python
class AgentLoop:
    provider: ModelProvider
    tool_registry: ToolRegistry
    tool_router: ToolRouter
    permission_engine: Permissions
    session: Session

    async def run(self, directive: str) -> AgentResult:
        done, iteration = False, 0
        while not done and iteration < max_iterations:
            iteration += 1
            response = await self.provider.complete(self.session.messages)
            tool_calls = extract_tool_calls(response)
            if not tool_calls:
                break
            results = await self.tool_router.execute(tool_calls)
            self.session.append_tool_results(results)
        return AgentResult(...)
```
```python
class ToolRouter:
    async def execute(self, calls: list[ToolCall]) -> list[ToolResult]:
        read_only = [c for c in calls if self.registry[c.name].is_read_only]
        mutations = [c for c in calls if not self.registry[c.name].is_read_only]

        # Concurrent for read-only, sequential for mutations
        results = await asyncio.gather(*[self._run(c) for c in read_only])
        for c in mutations:
            results.append(await self._run(c))
        return results
```
Structured workflows that chain stages together. Each stage passes artifacts to the next. Inspired by OMX's planning-execution-verification model.
The Strategy stage matches recon findings against the KB and applies known rules first. The LLM only plans for gaps the KB can't cover. This saves tokens and produces more reliable results.
Recon sees web_fetch and browser_scrape. Strategy only sees kb_search. Scrape sees the full scraping toolkit. This prevents LLM confusion as the tool count grows beyond 20.
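Per-stage filtering can be as simple as an allowlist the engine applies before building each stage's prompt (the stage names follow the pipeline; the exact tool sets here are illustrative):

```python
# Per-stage tool allowlists; the engine filters the registry before each
# stage so the LLM only ever sees tools relevant to its current task.
STAGE_TOOLS: dict[str, set[str]] = {
    "recon":    {"web_fetch", "browser_scrape"},
    "strategy": {"kb_search"},
    "scrape":   {"curl_scrape", "browser_scrape", "proxy_manager",
                 "identity_manager", "captcha_solver"},
}


def tools_for_stage(stage: str, registry: dict[str, object]) -> dict[str, object]:
    """Return only the registry entries the given stage is allowed to use."""
    allowed = STAGE_TOOLS.get(stage, set())
    return {name: tool for name, tool in registry.items() if name in allowed}
```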
Max 5 retries. Each failure is classified: empty_data (rotate approach), partial_data (retry same), blocked (escalate tools), format_error (fix only), unrecoverable (stop). No infinite loops, no token burn.
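The retry loop above can be sketched with typed failures (the five classifications come from the text; the control flow around escalation is an assumption):

```python
from enum import Enum


class Failure(Enum):
    EMPTY_DATA = "rotate approach"
    PARTIAL_DATA = "retry same"
    BLOCKED = "escalate tools"
    FORMAT_ERROR = "fix only"
    UNRECOVERABLE = "stop"


MAX_RETRIES = 5


def run_with_retries(attempt, classify):
    """Run `attempt` up to MAX_RETRIES times, steering each retry by failure type.

    `attempt` produces a result; `classify` maps it to a Failure or None (success).
    """
    for _ in range(MAX_RETRIES):
        result = attempt()
        failure = classify(result)
        if failure is None:
            return result          # Verified: done
        if failure is Failure.UNRECOVERABLE:
            break                  # No point retrying
        # Otherwise: adjust the approach per failure.value and try again
    return None                    # Strict limit hit: give up
```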
| Stage | What it does |
|---|---|
| ReconStage | Visit target site, identify protections, map endpoints, check KB for known solutions |
| StrategyStage | Match recon findings against KB. Apply known rules first, identify gaps for LLM |
| PlanStage | LLM plans ONLY for gaps the strategy couldn't cover. Picks tools and anti-detection |
| ScrapeStage | Execute the plan. Build and run scraper using selected tools |
| VerifyStage | Hardened Ralph Loop — typed failures, strict 5-retry limit, escalation paths |
| ReportStage | Generate dev report in standardized format, update KB with new learnings |
| DeliverStage | Return results to user or store for Upwork proposal |
| Stage | What it does |
|---|---|
| TranscribeStage | Download YouTube video, extract transcript |
| ExtractStage | LLM extracts scraping techniques from transcript |
| FormatStage | Convert to standardized report format (site-spec JSON + report MD) |
| IndexStage | Tag and store in Knowledge Base |
A tagged wiki of web scraping solutions. Three-layer architecture: hot index → subtree indexes → leaf detail files. The agent looks up techniques like a developer searches Stack Overflow.
- `kb_search` — Search by tags: `["cloudflare", "bypass"]` → matching solutions with code snippets
- `kb_get_site_spec` — Look up known site profile by domain → returns protections, endpoints, rate limits, auth methods
- `kb_add_entry` — Store new technique or solution with tags, code snippets, and source attribution
- `kb_add_failure` — Record what didn't work and why — prevents the agent from repeating failed approaches
Every technique tracks success_count and fail_count. The strategy layer picks the highest-rated solution first.
Unused techniques lose confidence over time (0.95^months). Anti-bot measures evolve — old solutions shouldn't rank equally with proven recent ones.
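Combining the success/fail counts with the stated 0.95^months decay, an effective score might look like this (only the decay factor is given in the text; the success-ratio weighting is an assumption):

```python
def technique_score(success_count: int, fail_count: int,
                    months_unused: int) -> float:
    """Success ratio decayed by 0.95 per month of disuse.

    The multiplicative combination is illustrative; only the
    0.95**months decay is stated in the design.
    """
    total = success_count + fail_count
    base = success_count / total if total else 0.0
    return base * 0.95 ** months_unused
```

A technique with 9 successes and 1 failure scores 0.9 when fresh, but after a year unused it decays below 0.5, so a recently proven alternative can outrank it.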
20+ tools across 5 categories. All follow the same Tool protocol. Your existing APIs become tools via thin wrappers.
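One way such a thin wrapper could look: a generic adapter that turns any existing client method into a tool-shaped object (the `FunctionTool` name and interface are illustrative, not part of the design):

```python
class FunctionTool:
    """Wrap any plain callable (e.g. an existing API client method) as a tool."""

    def __init__(self, name: str, fn, is_read_only: bool = True):
        self.name = name
        self.fn = fn
        self.is_read_only = is_read_only

    async def run(self, **kwargs):
        # Delegate straight to the wrapped callable
        return self.fn(**kwargs)


# Example: a stand-in for an existing synchronous search client
def legacy_search(query: str) -> list[str]:
    return [f"result for {query}"]

web_search = FunctionTool("web_search", legacy_search)
```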
| Tool | Source | What it does |
|---|---|---|
| recon | New | Visit site, detect protections, map structure |
| curl_scrape | New | HTTP scraping via curl_cffi (Cloudflare bypass) |
| browser_scrape | New | Browser-based scraping via Playwright/nodriver |
| proxy_manager | New | Rotate/manage proxy pool |
| identity_manager | New | Full browser identity: cookies, sessions, fingerprints, TLS profile |
| captcha_solver | WatermarkAPI | Turnstile/CAPTCHA solving via browser pool |
| Tool | Source | What it does |
|---|---|---|
| kb_search | New | Search Knowledge Base |
| kb_add | New | Add to Knowledge Base |
| file_read | Claw Code | Read local files |
| file_write | Claw Code | Write local files |
| code_execute | NCA Toolkit | Run Python code (sandboxed) |
| Tool | Source | What it does |
|---|---|---|
| dewatermark | Dewatermark API | Remove watermarks from images |
| getty_download | GettyImagesW | Batch download Getty images |
| youtube_download | YouTube API | Download YouTube videos |
| youtube_transcript | New/existing | Get video transcripts |
| video_caption | SRT Tanker | Render captions onto video |
| media_convert | NCA Toolkit | Convert media formats |
| media_transcribe | NCA Toolkit | Speech-to-text |
| Tool | Source | What it does |
|---|---|---|
| upwork_search | Upwork API | Search Upwork jobs |
| upwork_job_detail | Upwork API | Get full job description |
| web_search | New | General web search |
| web_fetch | New | Fetch and parse any URL |
| Tool | Source | What it does |
|---|---|---|
| telegram_send | Existing | Send Telegram message |
| notify | New | Route event to configured sinks |
| video_analyze | New (future) | Extract frames, analyze visually via LLM |
| screenshot | NCA Toolkit | Screenshot a webpage |
Not just proxy rotation. A full browser identity manager — cookies, sessions, fingerprints, TLS profiles, geo-matched config. Each identity is consistent and trackable.
Each identity bundles proxy + fingerprint + TLS profile + cookies + user agent + timezone + language. Everything geo-matched to the proxy IP for consistency.
When an identity gets flagged on a domain, it's marked as blocked there but stays usable elsewhere. Least-recently-used rotation prevents overuse.
New identities auto-detect proxy geolocation and set matching timezone, language, and locale. No mismatches that trigger anti-bot detection.
```python
@dataclass
class Identity:
    proxy: ProxyConfig               # IP + port + auth
    fingerprint: BrowserFingerprint  # screen, fonts, WebGL, canvas
    tls_profile: str                 # "chrome_120" for curl_cffi
    cookies: dict[str, str]          # Persistent session cookies
    user_agent: str
    timezone: str                    # Matches proxy geo
    blocked_on: list[str]            # Domains where flagged


class IdentityManager:
    def get_identity(self, domain: str) -> Identity:
        # Get a clean identity not blocked on this domain,
        # using least-recently-used rotation
        ...

    def mark_blocked(self, identity: Identity, domain: str) -> None:
        # Flag identity as detected on this domain
        ...
```
Track everything on top of the event system. Success rates, retry counts, token costs, best techniques per protection type. Enables self-optimization over time.
Duration, token cost, retries, failure types, technique used, data rows scraped, overall success. Every job produces a metrics record.
success_rate(domain), avg_cost_per_job(), best_technique_for("cloudflare"), cost_trend(30) — data-driven decisions.
Metrics feed back into the Strategy layer. The system auto-selects the cheapest technique with the highest success rate for each protection type.
Three ways in. Automated Upwork feed, interactive chat, and KB ingestion. All produce the same Directive object that enters the pipeline.
Polls Upwork, scores each job (profit / difficulty / success chance) before any scraping. Only jobs above the score threshold enter the pipeline. Don't waste cycles on bad jobs.
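One illustrative reading of "profit / difficulty / success chance" as a single score (the formula, threshold, and scales are assumptions; the design only states that jobs are scored and thresholded before scraping):

```python
def score_job(profit: float, difficulty: float, success_chance: float) -> float:
    """Expected profit per unit of difficulty (illustrative formula)."""
    if difficulty <= 0:
        raise ValueError("difficulty must be positive")
    return profit * success_chance / difficulty


SCORE_THRESHOLD = 10.0  # Illustrative cutoff


def should_enter_pipeline(profit: float, difficulty: float,
                          success_chance: float) -> bool:
    return score_job(profit, difficulty, success_chance) >= SCORE_THRESHOLD
```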
You type, it scrapes. Direct commands to the agent loop. Full access to all tools. Your personal scraping interface.
Feed it YouTube channels, dev reports, tutorials. Extracts scraping techniques and indexes them into the Knowledge Base.
@dataclass class Directive: source: str # "upwork" | "human" | "ingest" type: str # "scrape_job" | "direct" | "kb_ingest" description: str # What to do target_url: str | None = None metadata: dict = field(default_factory=dict)
Lightweight async event system inspired by clawhip. Agents emit typed events, the router delivers them to configured sinks. Keeps notification logic outside the agent's context window.
| Event | Meaning |
|---|---|
| job.found | New Upwork job detected |
| job.scored | Job scored (profit/difficulty/success) |
| job.filtered | Job below score threshold |
| scrape.started | Scraping attempt begun |
| scrape.recon_complete | Site recon done |
| scrape.strategy_applied | KB rules matched |
| scrape.plan_ready | LLM planned for gaps |
| scrape.executing | Actively scraping |
| scrape.completed | Data collected |
| scrape.failed | Attempt failed |
| scrape.verified | Data quality confirmed |
| scrape.retry | Ralph loop retry (typed failure) |
| kb.entry_added | New knowledge indexed |
| kb.entry_decayed | Technique confidence dropped |
| kb.ingest_complete | YouTube/report processed |
| agent.error | Agent-level error |
| metrics.job_complete | Full job metrics recorded |

Telegram: real-time alerts for completed scrapes, failures, and new Upwork jobs. Compact and alert formats.
WebSocket push to the web UI. Live status of running jobs, agent iterations, and KB growth.
Append-only JSONL event log. Every event persisted for replay, debugging, and analytics.
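The glob-matched routing can be sketched with stdlib `fnmatch` (sinks here are plain callables; the real sinks would be Telegram, WebSocket, and file writers). Note that a specific pattern and the `"*"` catch-all can both match, so the file log receives every event:

```python
import fnmatch
from typing import Callable


class EventRouter:
    """Deliver events to sinks via glob-matched routes, mirroring the
    [events.routes] config table (sketch; sink interface is an assumption)."""

    def __init__(self, routes: dict[str, Callable[[str, dict], None]]):
        self.routes = routes  # pattern -> sink

    def emit(self, event: str, payload: dict) -> None:
        for pattern, sink in self.routes.items():
            if fnmatch.fnmatch(event, pattern):
                sink(event, payload)
```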
Everything on disk. Sessions are resumable. Jobs track their full lifecycle. Events are append-only.
Clean Python package layout. One module per concern. All tools follow the same protocol.
Single TOML file. Environment variable substitution for secrets. Glob-matched event routing.
```toml
[nexus]
name = "Nexus"
data_dir = "./nexus-data"
kb_dir = "./nexus-kb"

[provider]
default = "deepseek"  # Testing phase
# default = "ollama"  # Production: self-hosted Gemma 4

[provider.deepseek]
api_key = "${DEEPSEEK_API_KEY}"
base_url = "https://api.deepseek.com/v1"
model = "deepseek-chat"

[provider.ollama]
base_url = "http://gemma-server:11434"  # Dedicated Gemma 4 server
model = "gemma4"

[provider.anthropic]  # Optional fallback
api_key = "${ANTHROPIC_API_KEY}"

[agent]
max_iterations = 50  # Ralph loop safety limit
max_tokens_per_session = 100000
compaction_threshold = 80000

[upwork]
enabled = true
poll_interval_minutes = 30
keywords = ["web scraping", "data extraction", "crawler"]
min_budget = 50

[telegram]
enabled = true
bot_token = "${TELEGRAM_BOT_TOKEN}"
chat_id = "${TELEGRAM_CHAT_ID}"

# Glob-matched event routing
[events.routes]
"scrape.completed" = { sink = "telegram", format = "compact" }
"scrape.failed" = { sink = "telegram", format = "alert" }
"job.found" = { sink = "telegram", format = "compact" }
"*" = { sink = "file", format = "raw" }

[permissions]
mode = "auto"  # "auto" | "interactive" | "bypass"
```
Four phases from foundation to full autonomy. Each phase ends with a working milestone.
- `core/provider.py` — LLM abstraction (Anthropic first)
- `tools/base.py` — Tool protocol + registry + router (with context-based filtering)
- `core/agent_loop.py` — Basic loop (call LLM → execute tools → repeat)
- `core/session.py` — Message history + basic compaction
- `tools/data/file_ops.py` — Read/write files (first tools to test loop)
- `tools/intel/web_fetch.py` — Fetch URLs
- `directors/chat.py` — Interactive mode so you can talk to it
- `tools/scraping/recon.py` — Site reconnaissance
- `tools/scraping/curl_scrape.py` — HTTP scraping via curl_cffi
- `tools/scraping/browser_scrape.py` — Browser scraping via Playwright
- `tools/scraping/proxy_manager.py` — Proxy rotation
- `tools/scraping/identity_manager.py` — Full browser identity system
- `kb/store.py` + `kb/search.py` — Knowledge Base with quality scoring
- `tools/data/kb.py` — KB as agent tools
- `pipeline/stages/strategy.py` — Strategy layer (KB rules before LLM)
- `pipeline/engine.py` — Stage runner with context-based tool filtering
- `pipeline/stages/` — Recon → Strategy → Plan → Scrape → Verify → Report
- `pipeline/stages/verify.py` — Hardened Ralph Loop (typed failures, strict retries)
- `events/router.py` + `events/sinks/telegram.py` — Notifications
- `metrics/collector.py` + `metrics/store.py` — Job metrics tracking
- `directors/upwork.py` — Upwork job feed with pre-filtering (score before scrape)
- `state/job_store.py` — Job tracking
- `directors/ingest.py` — YouTube channel ingestion
- `tools/vision/video_analyze.py` — Frame extraction + visual analysis
- `api/server.py` — FastAPI dashboard with metrics views