Autonomous web scraping agent runtime. Inspired by Claude Code's architecture, purpose-built for scraping at scale.
Humans set direction. Nexus does the labor.
Seven layers working together. Input flows down from Directors through the Pipeline Engine into the Agent Loop, which calls Tools. Knowledge Base informs every decision. Events flow out to notification sinks.
The heartbeat of Nexus. An async loop that calls LLMs, executes tools, and iterates until the job is done.
Receives directive → builds prompt with context → calls LLM → parses tool calls → executes tools → feeds results back → repeats until done or max iterations hit.
Unified Tool protocol with registry. Read-only tools run concurrently via asyncio.gather. Mutation tools run sequentially. Permission checks gate every execution.
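A minimal sketch of what that unified protocol and registry might look like (the exact field names are assumptions; `is_read_only` and the `registry[name]` lookup mirror how the router uses them):

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class ToolResult:
    name: str
    output: Any
    ok: bool = True


class Tool(Protocol):
    """Contract every tool satisfies (sketch; exact interface is an assumption)."""
    name: str
    is_read_only: bool  # Read-only tools may run concurrently

    async def run(self, **kwargs: Any) -> ToolResult: ...


class ToolRegistry:
    """Name -> Tool lookup used by the router and permission checks."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def __getitem__(self, name: str) -> Tool:
        return self._tools[name]
```

Because `Tool` is a structural `Protocol`, any existing class with a `name`, an `is_read_only` flag, and an async `run` already qualifies without inheriting anything.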
Provider-agnostic LLM abstraction. Testing: DeepSeek API (cheap, tool_use support). Production: self-hosted Gemma 4 via Ollama (zero cost at scale). One config change to swap.
Conversation state with token budget tracking. Auto-compacts when threshold is reached. Save/load for session resume across restarts.
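A sketch of that session object, assuming a simple keep-the-tail compaction policy (a real implementation would summarize the dropped history via the LLM; field names and the `keep_last` knob are illustrative):

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Session:
    """Conversation state with token-budget tracking and save/load for resume."""
    messages: list[dict] = field(default_factory=list)
    _token_costs: list[int] = field(default_factory=list)
    compaction_threshold: int = 80_000  # Mirrors [agent] compaction_threshold
    keep_last: int = 10                 # Messages surviving compaction (placeholder)

    @property
    def token_count(self) -> int:
        return sum(self._token_costs)

    def append(self, message: dict, tokens: int) -> None:
        self.messages.append(message)
        self._token_costs.append(tokens)
        if self.token_count >= self.compaction_threshold:
            self.compact()

    def compact(self) -> None:
        # Placeholder policy: keep only the most recent messages.
        self.messages = self.messages[-self.keep_last:]
        self._token_costs = self._token_costs[-self.keep_last:]

    def save(self, path: str) -> None:
        Path(path).write_text(json.dumps(
            {"messages": self.messages, "token_costs": self._token_costs}))

    @classmethod
    def load(cls, path: str) -> "Session":
        data = json.loads(Path(path).read_text())
        s = cls(messages=data["messages"])
        s._token_costs = data["token_costs"]
        return s
```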
```python
class AgentLoop:
    provider: ModelProvider
    tool_registry: ToolRegistry
    tool_router: ToolRouter
    permission_engine: Permissions
    session: Session

    async def run(self, directive: str) -> AgentResult:
        done, iteration = False, 0
        while not done and iteration < max_iterations:
            iteration += 1
            response = await self.provider.complete(self.session.messages)
            tool_calls = extract_tool_calls(response)
            if not tool_calls:
                break
            results = await self.tool_router.execute(tool_calls)
            self.session.append_tool_results(results)
        return AgentResult(...)
```
```python
class ToolRouter:
    async def execute(self, calls: list[ToolCall]) -> list[ToolResult]:
        read_only = [c for c in calls if self.registry[c.name].is_read_only]
        mutations = [c for c in calls if not self.registry[c.name].is_read_only]

        # Concurrent for read-only, sequential for mutations
        results = await asyncio.gather(*[self._run(c) for c in read_only])
        for c in mutations:
            results.append(await self._run(c))
        return results
```
Structured workflows that chain stages together. Each stage passes artifacts to the next. Inspired by OMX's planning-execution-verification model.
The Strategy stage matches recon findings against the KB and applies known rules first. The LLM only plans for gaps the KB can't cover. This saves tokens and produces more reliable results.
Recon sees web_fetch and browser_scrape. Strategy only sees kb_search. Scrape sees the full scraping toolkit. This prevents LLM confusion as the tool count grows beyond 20.
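Per-stage filtering can be as simple as an allowlist the engine applies before building each stage's prompt (the stage names follow the pipeline; the exact tool sets here are illustrative):

```python
# Per-stage tool allowlists; the engine filters the registry before each
# stage so the LLM only ever sees tools relevant to its current task.
STAGE_TOOLS: dict[str, set[str]] = {
    "recon":    {"web_fetch", "browser_scrape"},
    "strategy": {"kb_search"},
    "scrape":   {"curl_scrape", "browser_scrape", "proxy_manager",
                 "identity_manager", "captcha_solver"},
}


def tools_for_stage(stage: str, registry: dict[str, object]) -> dict[str, object]:
    """Return only the registry entries the given stage is allowed to use."""
    allowed = STAGE_TOOLS.get(stage, set())
    return {name: tool for name, tool in registry.items() if name in allowed}
```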
Max 5 retries. Each failure is classified: empty_data (rotate approach), partial_data (retry same), blocked (escalate tools), format_error (fix only), unrecoverable (stop). No infinite loops, no token burn.
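The retry loop above can be sketched with typed failures (the five classifications come from the text; the control flow around escalation is an assumption):

```python
from enum import Enum


class Failure(Enum):
    EMPTY_DATA = "rotate approach"
    PARTIAL_DATA = "retry same"
    BLOCKED = "escalate tools"
    FORMAT_ERROR = "fix only"
    UNRECOVERABLE = "stop"


MAX_RETRIES = 5


def run_with_retries(attempt, classify):
    """Run `attempt` up to MAX_RETRIES times, steering each retry by failure type.

    `attempt` produces a result; `classify` maps it to a Failure or None (success).
    """
    for _ in range(MAX_RETRIES):
        result = attempt()
        failure = classify(result)
        if failure is None:
            return result          # Verified: done
        if failure is Failure.UNRECOVERABLE:
            break                  # No point retrying
        # Otherwise: adjust the approach per failure.value and try again
    return None                    # Strict limit hit: give up
```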
| Stage | What it does |
|---|---|
| ReconStage | Visit target site, identify protections, map endpoints, check KB for known solutions |
| StrategyStage | Match recon findings against KB. Apply known rules first, identify gaps for LLM |
| PlanStage | LLM plans ONLY for gaps the strategy couldn't cover. Picks tools and anti-detection |
| ScrapeStage | Execute the plan. Build and run scraper using selected tools |
| VerifyStage | Hardened Ralph Loop — typed failures, strict 5-retry limit, escalation paths |
| ReportStage | Generate dev report in standardized format, update KB with new learnings |
| DeliverStage | Return results to user or store for Upwork proposal |
| Stage | What it does |
|---|---|
| TranscribeStage | Download YouTube video, extract transcript |
| ExtractStage | LLM extracts scraping techniques from transcript |
| FormatStage | Convert to standardized report format (site-spec JSON + report MD) |
| IndexStage | Tag and store in Knowledge Base |
A tagged wiki of web scraping solutions. Three-layer architecture: hot index → subtree indexes → leaf detail files. The agent looks up techniques like a developer searches Stack Overflow.
- `kb_search` — Search by tags: `["cloudflare", "bypass"]` → matching solutions with code snippets
- `kb_get_site_spec` — Look up known site profile by domain → returns protections, endpoints, rate limits, auth methods
- `kb_add_entry` — Store new technique or solution with tags, code snippets, and source attribution
- `kb_add_failure` — Record what didn't work and why — prevents the agent from repeating failed approaches
Every technique tracks success_count and fail_count. The strategy layer picks the highest-rated solution first.
Unused techniques lose confidence over time (0.95^months). Anti-bot measures evolve — old solutions shouldn't rank equally with proven recent ones.
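Combining the success/fail counts with the stated 0.95^months decay, an effective score might look like this (only the decay factor is given in the text; the success-ratio weighting is an assumption):

```python
def technique_score(success_count: int, fail_count: int,
                    months_unused: int) -> float:
    """Success ratio decayed by 0.95 per month of disuse.

    The multiplicative combination is illustrative; only the
    0.95**months decay is stated in the design.
    """
    total = success_count + fail_count
    base = success_count / total if total else 0.0
    return base * 0.95 ** months_unused
```

A technique with 9 successes and 1 failure scores 0.9 when fresh, but after a year unused it decays below 0.5, so a recently proven alternative can outrank it.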
20+ tools across 5 categories. All follow the same Tool protocol. Your existing APIs become tools via thin wrappers.
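One way such a thin wrapper could look: a generic adapter that turns any existing client method into a tool-shaped object (the `FunctionTool` name and interface are illustrative, not part of the design):

```python
class FunctionTool:
    """Wrap any plain callable (e.g. an existing API client method) as a tool."""

    def __init__(self, name: str, fn, is_read_only: bool = True):
        self.name = name
        self.fn = fn
        self.is_read_only = is_read_only

    async def run(self, **kwargs):
        # Delegate straight to the wrapped callable
        return self.fn(**kwargs)


# Example: a stand-in for an existing synchronous search client
def legacy_search(query: str) -> list[str]:
    return [f"result for {query}"]

web_search = FunctionTool("web_search", legacy_search)
```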
| Tool | Source | What it does |
|---|---|---|
| recon | New | Visit site, detect protections, map structure |
| curl_scrape | New | HTTP scraping via curl_cffi (Cloudflare bypass) |
| browser_scrape | New | Browser-based scraping via Playwright/nodriver |
| proxy_manager | New | Rotate/manage proxy pool |
| identity_manager | New | Full browser identity: cookies, sessions, fingerprints, TLS profile |
| captcha_solver | WatermarkAPI | Turnstile/CAPTCHA solving via browser pool |
| Tool | Source | What it does |
|---|---|---|
| kb_search | New | Search Knowledge Base |
| kb_add | New | Add to Knowledge Base |
| file_read | Claw Code | Read local files |
| file_write | Claw Code | Write local files |
| code_execute | NCA Toolkit | Run Python code (sandboxed) |
| Tool | Source | What it does |
|---|---|---|
| dewatermark | Dewatermark API | Remove watermarks from images |
| getty_download | GettyImagesW | Batch download Getty images |
| youtube_download | YouTube API | Download YouTube videos |
| youtube_transcript | New/existing | Get video transcripts |
| video_caption | SRT Tanker | Render captions onto video |
| media_convert | NCA Toolkit | Convert media formats |
| media_transcribe | NCA Toolkit | Speech-to-text |
| Tool | Source | What it does |
|---|---|---|
| upwork_search | Upwork API | Search Upwork jobs |
| upwork_job_detail | Upwork API | Get full job description |
| web_search | New | General web search |
| web_fetch | New | Fetch and parse any URL |
| Tool | Source | What it does |
|---|---|---|
| telegram_send | Existing | Send Telegram message |
| notify | New | Route event to configured sinks |
| video_analyze | New (future) | Extract frames, analyze visually via LLM |
| screenshot | NCA Toolkit | Screenshot a webpage |
Not just proxy rotation. A full browser identity manager — cookies, sessions, fingerprints, TLS profiles, geo-matched config. Each identity is consistent and trackable.
Each identity bundles proxy + fingerprint + TLS profile + cookies + user agent + timezone + language. Everything geo-matched to the proxy IP for consistency.
When an identity gets flagged on a domain, it's marked as blocked there but stays usable elsewhere. Least-recently-used rotation prevents overuse.
New identities auto-detect proxy geolocation and set matching timezone, language, and locale. No mismatches that trigger anti-bot detection.
```python
@dataclass
class Identity:
    proxy: ProxyConfig               # IP + port + auth
    fingerprint: BrowserFingerprint  # screen, fonts, WebGL, canvas
    tls_profile: str                 # "chrome_120" for curl_cffi
    cookies: dict[str, str]          # Persistent session cookies
    user_agent: str
    timezone: str                    # Matches proxy geo
    blocked_on: list[str]            # Domains where flagged


class IdentityManager:
    def get_identity(self, domain: str) -> Identity:
        # Get a clean identity not blocked on this domain,
        # using least-recently-used rotation
        ...

    def mark_blocked(self, identity: Identity, domain: str) -> None:
        # Flag identity as detected on this domain
        ...
```
Track everything on top of the event system. Success rates, retry counts, token costs, best techniques per protection type. Enables self-optimization over time.
Duration, token cost, retries, failure types, technique used, data rows scraped, overall success. Every job produces a metrics record.
success_rate(domain), avg_cost_per_job(), best_technique_for("cloudflare"), cost_trend(30) — data-driven decisions.
Metrics feed back into the Strategy layer. The system auto-selects the cheapest technique with the highest success rate for each protection type.
Three ways in. Automated Upwork feed, interactive chat, and KB ingestion. All produce the same Directive object that enters the pipeline.
Polls Upwork, scores each job (profit / difficulty / success chance) before any scraping. Only jobs above the score threshold enter the pipeline. Don't waste cycles on bad jobs.
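One illustrative reading of "profit / difficulty / success chance" as a single score (the formula, threshold, and scales are assumptions; the design only states that jobs are scored and thresholded before scraping):

```python
def score_job(profit: float, difficulty: float, success_chance: float) -> float:
    """Expected profit per unit of difficulty (illustrative formula)."""
    if difficulty <= 0:
        raise ValueError("difficulty must be positive")
    return profit * success_chance / difficulty


SCORE_THRESHOLD = 10.0  # Illustrative cutoff


def should_enter_pipeline(profit: float, difficulty: float,
                          success_chance: float) -> bool:
    return score_job(profit, difficulty, success_chance) >= SCORE_THRESHOLD
```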
You type, it scrapes. Direct commands to the agent loop. Full access to all tools. Your personal scraping interface.
Feed it YouTube channels, dev reports, tutorials. Extracts scraping techniques and indexes them into the Knowledge Base.
@dataclass class Directive: source: str # "upwork" | "human" | "ingest" type: str # "scrape_job" | "direct" | "kb_ingest" description: str # What to do target_url: str | None = None metadata: dict = field(default_factory=dict)
Lightweight async event system inspired by clawhip. Agents emit typed events, the router delivers them to configured sinks. Keeps notification logic outside the agent's context window.
| Event | Meaning |
|---|---|
| job.found | New Upwork job detected |
| job.scored | Job scored (profit/difficulty/success) |
| job.filtered | Job below score threshold |
| scrape.started | Scraping attempt begun |
| scrape.recon_complete | Site recon done |
| scrape.strategy_applied | KB rules matched |
| scrape.plan_ready | LLM planned for gaps |
| scrape.executing | Actively scraping |
| scrape.completed | Data collected |
| scrape.failed | Attempt failed |
| scrape.verified | Data quality confirmed |
| scrape.retry | Ralph loop retry (typed failure) |
| kb.entry_added | New knowledge indexed |
| kb.entry_decayed | Technique confidence dropped |
| kb.ingest_complete | YouTube/report processed |
| agent.error | Agent-level error |
| metrics.job_complete | Full job metrics recorded |

Telegram: real-time alerts for completed scrapes, failures, and new Upwork jobs. Compact and alert formats.
WebSocket push to the web UI. Live status of running jobs, agent iterations, and KB growth.
Append-only JSONL event log. Every event persisted for replay, debugging, and analytics.
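The glob-matched routing can be sketched with stdlib `fnmatch` (sinks here are plain callables; the real sinks would be Telegram, WebSocket, and file writers). Note that a specific pattern and the `"*"` catch-all can both match, so the file log receives every event:

```python
import fnmatch
from typing import Callable


class EventRouter:
    """Deliver events to sinks via glob-matched routes, mirroring the
    [events.routes] config table (sketch; sink interface is an assumption)."""

    def __init__(self, routes: dict[str, Callable[[str, dict], None]]):
        self.routes = routes  # pattern -> sink

    def emit(self, event: str, payload: dict) -> None:
        for pattern, sink in self.routes.items():
            if fnmatch.fnmatch(event, pattern):
                sink(event, payload)
```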
Everything on disk. Sessions are resumable. Jobs track their full lifecycle. Events are append-only.
Clean Python package layout. One module per concern. All tools follow the same protocol.
Single TOML file. Environment variable substitution for secrets. Glob-matched event routing.
```toml
[nexus]
name = "Nexus"
data_dir = "./nexus-data"
kb_dir = "./nexus-kb"

[provider]
default = "deepseek"  # Testing phase
# default = "ollama"  # Production: self-hosted Gemma 4

[provider.deepseek]
api_key = "${DEEPSEEK_API_KEY}"
base_url = "https://api.deepseek.com/v1"
model = "deepseek-chat"

[provider.ollama]
base_url = "http://gemma-server:11434"  # Dedicated Gemma 4 server
model = "gemma4"

[provider.anthropic]  # Optional fallback
api_key = "${ANTHROPIC_API_KEY}"

[agent]
max_iterations = 50  # Ralph loop safety limit
max_tokens_per_session = 100000
compaction_threshold = 80000

[upwork]
enabled = true
poll_interval_minutes = 30
keywords = ["web scraping", "data extraction", "crawler"]
min_budget = 50

[telegram]
enabled = true
bot_token = "${TELEGRAM_BOT_TOKEN}"
chat_id = "${TELEGRAM_CHAT_ID}"

# Glob-matched event routing
[events.routes]
"scrape.completed" = { sink = "telegram", format = "compact" }
"scrape.failed" = { sink = "telegram", format = "alert" }
"job.found" = { sink = "telegram", format = "compact" }
"*" = { sink = "file", format = "raw" }

[permissions]
mode = "auto"  # "auto" | "interactive" | "bypass"
```
Four phases from foundation to full autonomy. Each phase ends with a working milestone.
- `core/provider.py` — LLM abstraction (Anthropic first)
- `tools/base.py` — Tool protocol + registry + router (with context-based filtering)
- `core/agent_loop.py` — Basic loop (call LLM → execute tools → repeat)
- `core/session.py` — Message history + basic compaction
- `tools/data/file_ops.py` — Read/write files (first tools to test loop)
- `tools/intel/web_fetch.py` — Fetch URLs
- `directors/chat.py` — Interactive mode so you can talk to it
- `tools/scraping/recon.py` — Site reconnaissance
- `tools/scraping/curl_scrape.py` — HTTP scraping via curl_cffi
- `tools/scraping/browser_scrape.py` — Browser scraping via Playwright
- `tools/scraping/proxy_manager.py` — Proxy rotation
- `tools/scraping/identity_manager.py` — Full browser identity system
- `kb/store.py` + `kb/search.py` — Knowledge Base with quality scoring
- `tools/data/kb.py` — KB as agent tools
- `pipeline/stages/strategy.py` — Strategy layer (KB rules before LLM)
- `pipeline/engine.py` — Stage runner with context-based tool filtering
- `pipeline/stages/` — Recon → Strategy → Plan → Scrape → Verify → Report
- `pipeline/stages/verify.py` — Hardened Ralph Loop (typed failures, strict retries)
- `events/router.py` + `events/sinks/telegram.py` — Notifications
- `metrics/collector.py` + `metrics/store.py` — Job metrics tracking
- `directors/upwork.py` — Upwork job feed with pre-filtering (score before scrape)
- `state/job_store.py` — Job tracking
- `directors/ingest.py` — YouTube channel ingestion
- `tools/vision/video_analyze.py` — Frame extraction + visual analysis
- `api/server.py` — FastAPI dashboard with metrics views