Nexus

Autonomous web scraping agent runtime. Inspired by Claude Code's architecture, purpose-built for scraping at scale.

Humans set direction. Nexus does the labor.

Architecture

Eight layers working together. Input flows down from Directors through the Pipeline Engine into the Agent Loop, which calls Tools. The Knowledge Base informs every decision. Events flow out to notification sinks.

- Input: Director (Upwork / Chat / Ingest)
- Workflow: Pipeline Engine (chainable stages)
- Execution: Agent Loop (LLM + tool cycle)
- Core: Tool Router (concurrent reads, sequential writes)
- Intelligence: Knowledge Base (tagged scraping wiki)
- LLM: Model Provider (Anthropic / OpenAI / Ollama)
- Output: Event Router (Telegram / Dashboard / Log)
- Persistence: State Manager (sessions / jobs / events on disk)

Core Runtime

The heartbeat of Nexus. An async loop that calls LLMs, executes tools, and iterates until the job is done.

Agent Loop

Receives directive → builds prompt with context → calls LLM → parses tool calls → executes tools → feeds results back → repeats until done or max iterations hit.

Tool System

Unified Tool protocol with registry. Read-only tools run concurrently via asyncio.gather. Mutation tools run sequentially. Permission checks gate every execution.

Model Provider

Provider-agnostic LLM abstraction. Testing: DeepSeek API (cheap, tool_use support). Production: self-hosted Gemma 4 via Ollama (zero cost at scale). One config change to swap.

Session & Compaction

Conversation state with token budget tracking. Auto-compacts when threshold is reached. Save/load for session resume across restarts.
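The compaction behavior above could be sketched like this. The `Session` fields, the 4-characters-per-token estimate, and the keep-head-summarize-middle strategy are illustrative assumptions, not the real implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    max_tokens: int = 100_000
    compaction_threshold: int = 80_000
    messages: list[dict] = field(default_factory=list)

    def _estimate_tokens(self) -> int:
        # Rough heuristic: ~4 characters per token
        return sum(len(m["content"]) for m in self.messages) // 4

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if self._estimate_tokens() > self.compaction_threshold:
            self.compact()

    def compact(self) -> None:
        # Keep the first message (system/directive), summarize the middle,
        # keep the most recent exchanges verbatim
        if len(self.messages) <= 5:
            return
        head, tail = self.messages[0], self.messages[-4:]
        summary = {"role": "system",
                   "content": f"[Compacted {len(self.messages) - 5} messages]"}
        self.messages = [head, summary, *tail]
```

Saving `messages` to disk as JSON is then enough for session resume across restarts.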

Agent Loop

Python
class AgentLoop:
    provider: ModelProvider
    tool_router: ToolRouter
    permission_engine: Permissions
    session: Session
    max_iterations: int = 50

    async def run(self, directive: str) -> AgentResult:
        self.session.append_user(directive)
        for _ in range(self.max_iterations):
            response = await self.provider.complete(self.session.messages)
            self.session.append_assistant(response)
            tool_calls = extract_tool_calls(response)
            if not tool_calls:
                break  # No tool calls left: the model produced its final answer
            results = await self.tool_router.execute(tool_calls)
            self.session.append_tool_results(results)
        return AgentResult(...)

Tool Router

Python
class ToolRouter:
    async def execute(self, calls: list[ToolCall]) -> list[ToolResult]:
        read_only = [c for c in calls if self.registry[c.name].is_read_only]
        mutations = [c for c in calls if not self.registry[c.name].is_read_only]
        # Concurrent for read-only, sequential for mutations
        results = await asyncio.gather(*[self._run(c) for c in read_only])
        for c in mutations:
            results.append(await self._run(c))
        return results

Pipeline Engine

Structured workflows that chain stages together. Each stage passes artifacts to the next. Inspired by OMX's planning-execution-verification model.
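The stage-chaining contract could be sketched as follows. The `Stage` protocol and the shared `artifacts` dict are illustrative assumptions about how stages pass results forward:

```python
from typing import Protocol

class Stage(Protocol):
    name: str
    def run(self, artifacts: dict) -> dict: ...

class PipelineEngine:
    def __init__(self, stages: list):
        self.stages = stages

    def run(self, directive: dict) -> dict:
        # Each stage sees every artifact produced so far and
        # returns new artifacts merged into the shared dict
        artifacts = {"directive": directive}
        for stage in self.stages:
            artifacts.update(stage.run(artifacts))
        return artifacts
```

A recon stage might return `{"protections": [...]}`, which the strategy stage then reads from `artifacts`.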

Scrape Pipeline (Autonomous)

Recon → Strategy → Plan → Scrape → Verify → Report

Direct Scrape Pipeline (Interactive)

Recon → Strategy → Plan → Scrape → Verify → Deliver

Strategy Layer (KB Rules First)

Don't let the LLM plan everything

The Strategy stage matches recon findings against the KB and applies known rules first. The LLM only plans for gaps the KB can't cover. This saves tokens and produces more reliable results.
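The KB-first split could look like this. The rule shapes and the function name `build_strategy` are assumptions for illustration:

```python
def build_strategy(recon_findings: list[str], kb_rules: dict[str, str]) -> dict:
    """Apply known KB rules first; leave only uncovered findings for the LLM."""
    applied = {f: kb_rules[f] for f in recon_findings if f in kb_rules}
    gaps = [f for f in recon_findings if f not in kb_rules]
    return {"applied_rules": applied, "llm_gaps": gaps}
```

Only `llm_gaps` is ever put in front of the model, so a fully-covered site costs almost no planning tokens.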

Context-Based Tool Exposure

Each stage only sees relevant tools

Recon sees web_fetch and browser_scrape. Strategy sees only kb_search. Scrape sees the full scraping toolkit. This prevents LLM confusion as the tool count grows past 20.
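Per-stage filtering amounts to a small allowlist over the registry. The `STAGE_TOOLS` mapping below is an assumed example, not the real table:

```python
STAGE_TOOLS = {
    # Assumed mapping; the real one comes from the stage definitions
    "recon":    {"web_fetch", "browser_scrape"},
    "strategy": {"kb_search"},
    "scrape":   {"curl_scrape", "browser_scrape", "proxy_manager", "captcha_solver"},
}

def tools_for_stage(stage: str, registry: dict) -> dict:
    # Expose only the tools relevant to the current stage
    allowed = STAGE_TOOLS.get(stage, set())
    return {name: tool for name, tool in registry.items() if name in allowed}
```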

Hardened Verify Loop (Ralph Loop)

Typed failures + strict retry limits

Max 5 retries. Each failure is classified: empty_data (rotate approach), partial_data (retry same), blocked (escalate tools), format_error (fix only), unrecoverable (stop). No infinite loops, no token burn.
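A minimal sketch of the typed-failure retry loop, assuming hypothetical `attempt` and `classify` helpers supplied by the caller:

```python
from enum import Enum

class Failure(Enum):
    EMPTY_DATA = "rotate_approach"
    PARTIAL_DATA = "retry_same"
    BLOCKED = "escalate_tools"
    FORMAT_ERROR = "fix_only"
    UNRECOVERABLE = "stop"

MAX_RETRIES = 5

def verify_loop(attempt, classify):
    """Retry with typed failures and a strict retry ceiling."""
    for retry in range(MAX_RETRIES):
        result = attempt(retry)
        failure = classify(result)
        if failure is None:
            return result          # Verified: done
        if failure is Failure.UNRECOVERABLE:
            break                  # No point retrying
    return None                    # Strict limit hit: fail the job
```

The `Failure` values name the escalation path each failure type takes; the real loop would dispatch on them before retrying.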

KB Ingest Pipeline (Learning)

Transcribe → Extract → Format → Index

Stage Reference

| Stage | What it does |
| --- | --- |
| ReconStage | Visit target site, identify protections, map endpoints, check KB for known solutions |
| StrategyStage | Match recon findings against KB. Apply known rules first, identify gaps for LLM |
| PlanStage | LLM plans ONLY for gaps the strategy couldn't cover. Picks tools and anti-detection |
| ScrapeStage | Execute the plan. Build and run scraper using selected tools |
| VerifyStage | Hardened Ralph Loop — typed failures, strict 5-retry limit, escalation paths |
| ReportStage | Generate dev report in standardized format, update KB with new learnings |
| DeliverStage | Return results to user or store for Upwork proposal |
| TranscribeStage | Download YouTube video, extract transcript |
| ExtractStage | LLM extracts scraping techniques from transcript |
| FormatStage | Convert to standardized report format (site-spec JSON + report MD) |
| IndexStage | Tag and store in Knowledge Base |

Knowledge Base

A tagged wiki of web scraping solutions. Three-layer architecture: hot index → subtree indexes → leaf detail files. The agent looks up techniques like a developer searches Stack Overflow.

nexus-kb/
  KB-INDEX.md                        # Hot index — tag cloud, recent additions
  techniques/
    README.md                        # Subtree index
    cloudflare-bypass.md             # Tagged solution
    turnstile-solving.md
    jwt-extraction.md
    curl-cffi-impersonation.md
    firebase-account-farming.md
    rate-limit-bypass.md
  site-specs/
    site-spec-upwork.json            # Machine-readable site profile
    site-spec-dewatermark.json
    report-upwork.md                 # Human-readable dev report
  failures/
    cloudflare-turnstile-click.md    # What didn't work and why
  tools/
    curl-cffi.md                     # Tool reference card
    nodriver.md
    playwright.md

KB Agent Tools

kb_search

Search by tags: ["cloudflare", "bypass"] → matching solutions with code snippets

kb_get_site_spec

Lookup known site profile by domain → returns protections, endpoints, rate limits, auth methods

kb_add_entry

Store new technique or solution with tags, code snippets, and source attribution

kb_add_failure

Record what didn't work and why — prevents the agent from repeating failed approaches
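The tag lookup behind kb_search could be as simple as a subset match ranked by success rate. The entry fields (`tags`, `success_rate`) are assumed, not the real schema:

```python
def kb_search(entries: list[dict], tags: list[str]) -> list[dict]:
    """Return entries matching ALL requested tags, best success rate first."""
    matches = [e for e in entries if set(tags) <= set(e["tags"])]
    return sorted(matches, key=lambda e: e.get("success_rate", 0.0), reverse=True)
```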

Quality Control (Auto-Ranking)

Success Rate Tracking

Every technique tracks success_count and fail_count. The strategy layer picks the highest-rated solution first.

Confidence Decay

Unused techniques lose confidence over time (0.95^months). Anti-bot measures evolve — old solutions shouldn't rank equally with proven recent ones.
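The 0.95^months factor from the text works out like this; the function name is illustrative:

```python
def decayed_confidence(base: float, months_unused: int) -> float:
    # Per the decay rule above: confidence shrinks by 5% per unused month,
    # so it roughly halves after ~14 months (0.95 ** 14 ≈ 0.49)
    return base * 0.95 ** months_unused
```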

Tools

20+ tools across 5 categories. All follow the same Tool protocol. Your existing APIs become tools via thin wrappers.

Scraping Tools (Core)

| Tool | Source | What it does |
| --- | --- | --- |
| recon | New | Visit site, detect protections, map structure |
| curl_scrape | New | HTTP scraping via curl_cffi (Cloudflare bypass) |
| browser_scrape | New | Browser-based scraping via Playwright/nodriver |
| proxy_manager | New | Rotate/manage proxy pool |
| identity_manager | New | Full browser identity: cookies, sessions, fingerprints, TLS profile |
| captcha_solver | WatermarkAPI | Turnstile/CAPTCHA solving via browser pool |

Data Tools (Read/Write)

| Tool | Source | What it does |
| --- | --- | --- |
| kb_search | New | Search Knowledge Base |
| kb_add | New | Add to Knowledge Base |
| file_read | Claw Code | Read local files |
| file_write | Claw Code | Write local files |
| code_execute | NCA Toolkit | Run Python code (sandboxed) |

Media Tools (Existing APIs)

| Tool | Source | What it does |
| --- | --- | --- |
| dewatermark | Dewatermark API | Remove watermarks from images |
| getty_download | GettyImagesW | Batch download Getty images |
| youtube_download | YouTube API | Download YouTube videos |
| youtube_transcript | New/existing | Get video transcripts |
| video_caption | SRT Tanker | Render captions onto video |
| media_convert | NCA Toolkit | Convert media formats |
| media_transcribe | NCA Toolkit | Speech-to-text |

Intel Tools (Discovery)

| Tool | Source | What it does |
| --- | --- | --- |
| upwork_search | Upwork API | Search Upwork jobs |
| upwork_job_detail | Upwork API | Get full job description |
| web_search | New | General web search |
| web_fetch | New | Fetch and parse any URL |

Notification & Vision Tools (Notify / Future)

| Tool | Source | What it does |
| --- | --- | --- |
| telegram_send | Existing | Send Telegram message |
| notify | New | Route event to configured sinks |
| video_analyze | New (future) | Extract frames, analyze visually via LLM |
| screenshot | NCA Toolkit | Screenshot a webpage |

Identity System

Not just proxy rotation. A full browser identity manager — cookies, sessions, fingerprints, TLS profiles, geo-matched config. Each identity is consistent and trackable.

Full Identity Profiles

Each identity bundles proxy + fingerprint + TLS profile + cookies + user agent + timezone + language. Everything geo-matched to the proxy IP for consistency.

Block Tracking

When an identity gets flagged on a domain, it's marked as blocked there but stays usable elsewhere. Least-recently-used rotation prevents overuse.

Geo-Aware Generation

New identities auto-detect proxy geolocation and set matching timezone, language, and locale. No mismatches that trigger anti-bot detection.

Python
@dataclass
class Identity:
    proxy: ProxyConfig                  # IP + port + auth
    fingerprint: BrowserFingerprint     # Screen, fonts, WebGL, canvas
    tls_profile: str                    # "chrome_120" for curl_cffi
    cookies: dict[str, str]             # Persistent session cookies
    user_agent: str
    timezone: str                       # Matches proxy geo
    blocked_on: list[str]               # Domains where flagged

class IdentityManager:
    def get_identity(self, domain: str) -> Identity:
        # Return a clean identity not blocked on this domain,
        # rotated least-recently-used
        ...

    def mark_blocked(self, identity: Identity, domain: str) -> None:
        # Flag the identity as detected on this domain
        ...

Metrics

Track everything on top of the event system. Success rates, retry counts, token costs, best techniques per protection type. Enables self-optimization over time.

Per-Job Tracking

Duration, token cost, retries, failure types, technique used, data rows scraped, overall success. Every job produces a metrics record.

Aggregate Queries

success_rate(domain), avg_cost_per_job(), best_technique_for("cloudflare"), cost_trend(30) — data-driven decisions.
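Two of the aggregate queries above could be sketched over a flat list of job records. The record fields (`domain`, `protection`, `technique`, `success`) are assumptions about the metrics schema:

```python
from collections import defaultdict

def success_rate(records: list[dict], domain: str) -> float:
    # Fraction of successful jobs for one domain
    jobs = [r for r in records if r["domain"] == domain]
    return sum(r["success"] for r in jobs) / len(jobs) if jobs else 0.0

def best_technique_for(records: list[dict], protection: str):
    # Technique with the highest success rate against this protection type
    stats = defaultdict(list)
    for r in records:
        if r["protection"] == protection:
            stats[r["technique"]].append(r["success"])
    if not stats:
        return None
    return max(stats, key=lambda t: sum(stats[t]) / len(stats[t]))
```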

Self-Optimization

Metrics feed back into the Strategy layer. The system auto-selects the cheapest technique with the highest success rate for each protection type.

Input Directors

Three ways in. Automated Upwork feed, interactive chat, and KB ingestion. All produce the same Directive object that enters the pipeline.

Upwork Director (Pre-Filtered)

Polls Upwork, scores each job (profit / difficulty / success chance) before any scraping. Only jobs above the score threshold enter the pipeline. Don't waste cycles on bad jobs.
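The pre-filter could be as simple as the sketch below. The scoring formula (expected profit discounted by difficulty) and both function names are assumptions, not the real heuristic:

```python
def score_job(profit: float, difficulty: float, success_chance: float) -> float:
    # Assumed formula: expected profit, discounted by difficulty
    return profit * success_chance / max(difficulty, 1.0)

def filter_jobs(jobs: list[dict], threshold: float) -> list[dict]:
    # Only jobs scoring above the threshold enter the pipeline
    scored = [(score_job(j["profit"], j["difficulty"], j["success_chance"]), j)
              for j in jobs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for s, j in scored if s >= threshold]
```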

Chat Director

You type, it scrapes. Direct commands to the agent loop. Full access to all tools. Your personal scraping interface.

Ingest Director

Feed it YouTube channels, dev reports, tutorials. Extracts scraping techniques and indexes them into the Knowledge Base.

Directive Format

Python
@dataclass
class Directive:
    source: str       # "upwork" | "human" | "ingest"
    type: str         # "scrape_job" | "direct" | "kb_ingest"
    description: str  # What to do
    target_url: str | None = None
    metadata: dict = field(default_factory=dict)

Event Router

Lightweight async event system inspired by clawhip. Agents emit typed events, the router delivers them to configured sinks. Keeps notification logic outside the agent's context window.

Event Types

| Event | Description |
| --- | --- |
| job.found | New Upwork job detected |
| job.scored | Job scored (profit/difficulty/success) |
| job.filtered | Job below score threshold |
| scrape.started | Scraping attempt begun |
| scrape.recon_complete | Site recon done |
| scrape.strategy_applied | KB rules matched |
| scrape.plan_ready | LLM planned for gaps |
| scrape.executing | Actively scraping |
| scrape.completed | Data collected |
| scrape.failed | Attempt failed |
| scrape.verified | Data quality confirmed |
| scrape.retry | Ralph loop retry (typed failure) |
| kb.entry_added | New knowledge indexed |
| kb.entry_decayed | Technique confidence dropped |
| kb.ingest_complete | YouTube/report processed |
| agent.error | Agent-level error |
| metrics.job_complete | Full job metrics recorded |
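Glob-matched routing of these events can be done with the standard library's fnmatch; the `ROUTES` table mirrors the [events.routes] config shown later, with first-match-wins ordering as an assumption:

```python
from fnmatch import fnmatch

ROUTES = {
    # Pattern -> sink; the catch-all "*" must come last
    "scrape.completed": "telegram",
    "scrape.failed": "telegram",
    "job.found": "telegram",
    "*": "file",
}

def route(event_type: str) -> str:
    for pattern, sink in ROUTES.items():
        if fnmatch(event_type, pattern):
            return sink
    return "file"
```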

Sinks

Telegram

Real-time alerts for completed scrapes, failures, and new Upwork jobs. Compact and alert formats.

Dashboard

WebSocket push to the web UI. Live status of running jobs, agent iterations, and KB growth.

File Log

Append-only JSONL event log. Every event persisted for replay, debugging, and analytics.

State & Persistence

Everything on disk. Sessions are resumable. Jobs track their full lifecycle. Events are append-only.

nexus-data/
  config.toml              # Global config
  sessions/
    {session_id}.json      # Conversation state (resumable)
  jobs/
    {job_id}/
      directive.json       # Original job
      recon.json           # Recon results
      plan.json            # Execution plan
      result.json          # Scraped data
      report.md            # Dev report (if generated)
  events/
    events.jsonl           # Event log (append-only)
  kb/                      # Knowledge Base (see Layer 3)

Project Structure

Clean Python package layout. One module per concern. All tools follow the same protocol.

nexus/
  nexus/
    core/
      agent_loop.py        # Main agent loop
      provider.py          # LLM provider abstraction
      session.py           # Session state + compaction
      permissions.py       # Permission engine
      tool_router.py       # Tool execution + concurrency
    pipeline/
      engine.py            # Pipeline stage runner
      stages/
        recon.py
        plan.py
        scrape.py
        verify.py
        report.py
        deliver.py
        ingest.py
      pipelines.py         # Pre-built pipeline definitions
    tools/
      base.py              # Tool protocol + registry
      scraping/
        recon.py
        curl_scrape.py
        browser_scrape.py
        proxy_manager.py
        captcha_solver.py
      data/
        kb.py
        file_ops.py
        code_execute.py
      media/
        dewatermark.py
        getty_download.py
        youtube.py
        srt_tanker.py
        media_convert.py
      intel/
        upwork.py
        web_search.py
        web_fetch.py
      notify/
        telegram.py
        notify.py
    directors/
      upwork.py
      chat.py
      ingest.py
    kb/
      store.py
      search.py
    events/
      router.py
      types.py
      sinks/
        telegram.py
        dashboard.py
        file.py
    state/
      session_store.py
      job_store.py
      event_log.py
    api/
      server.py            # FastAPI HTTP interface
      routes/
        chat.py
        jobs.py
        kb.py
    config.py
  nexus-kb/                # Knowledge Base files
  nexus-data/              # Runtime state
  config.toml
  Dockerfile
  requirements.txt

Config

Single TOML file. Environment variable substitution for secrets. Glob-matched event routing.

TOML
[nexus]
name = "Nexus"
data_dir = "./nexus-data"
kb_dir = "./nexus-kb"

[provider]
default = "deepseek"                 # Testing phase
# default = "ollama"                   # Production: self-hosted Gemma 4

[provider.deepseek]
api_key = "${DEEPSEEK_API_KEY}"
base_url = "https://api.deepseek.com/v1"
model = "deepseek-chat"

[provider.ollama]
base_url = "http://gemma-server:11434"  # Dedicated Gemma 4 server
model = "gemma4"

[provider.anthropic]               # Optional fallback
api_key = "${ANTHROPIC_API_KEY}"

[agent]
max_iterations = 50               # Ralph loop safety limit
max_tokens_per_session = 100000
compaction_threshold = 80000

[upwork]
enabled = true
poll_interval_minutes = 30
keywords = ["web scraping", "data extraction", "crawler"]
min_budget = 50

[telegram]
enabled = true
bot_token = "${TELEGRAM_BOT_TOKEN}"
chat_id = "${TELEGRAM_CHAT_ID}"

[events.routes]
# Glob-matched event routing
"scrape.completed" = { sink = "telegram", format = "compact" }
"scrape.failed" = { sink = "telegram", format = "alert" }
"job.found" = { sink = "telegram", format = "compact" }
"*" = { sink = "file", format = "raw" }

[permissions]
mode = "auto"                      # "auto" | "interactive" | "bypass"

Build Order

Four phases from foundation to full autonomy. Each phase ends with a working milestone.

Phase 0

Foundation

  1. core/provider.py — LLM abstraction (Anthropic first)
  2. tools/base.py — Tool protocol + registry + router (with context-based filtering)
  3. core/agent_loop.py — Basic loop (call LLM → execute tools → repeat)
  4. core/session.py — Message history + basic compaction
  5. tools/data/file_ops.py — Read/write files (first tools to test loop)
  6. tools/intel/web_fetch.py — Fetch URLs
  7. directors/chat.py — Interactive mode so you can talk to it
Milestone: You can chat with Nexus and it can read files + fetch URLs
Phase 1

Scraping Core + KB

  1. tools/scraping/recon.py — Site reconnaissance
  2. tools/scraping/curl_scrape.py — HTTP scraping via curl_cffi
  3. tools/scraping/browser_scrape.py — Browser scraping via Playwright
  4. tools/scraping/proxy_manager.py — Proxy rotation
  5. tools/scraping/identity_manager.py — Full browser identity system
  6. kb/store.py + kb/search.py — Knowledge Base with quality scoring
  7. tools/data/kb.py — KB as agent tools
  8. Ingest existing dev reports into KB (Upwork + Dewatermark)
  9. pipeline/stages/strategy.py — Strategy layer (KB rules before LLM)
Milestone: Nexus can recon a site, apply known techniques, and scrape
Phase 2

Pipelines, Multi-Agent & Automation

  1. pipeline/engine.py — Stage runner with context-based tool filtering
  2. pipeline/stages/ — Recon → Strategy → Plan → Scrape → Verify → Report
  3. pipeline/stages/verify.py — Hardened Ralph Loop (typed failures, strict retries)
  4. Multi-agent support — parallel scraping workers (scraping is parallel by nature)
  5. events/router.py + events/sinks/telegram.py — Notifications
  6. metrics/collector.py + metrics/store.py — Job metrics tracking
  7. directors/upwork.py — Upwork job feed with pre-filtering (score before scrape)
  8. state/job_store.py — Job tracking
Milestone: Nexus autonomously scrapes Upwork jobs in parallel, scores them, and notifies you
Phase 3

Media & API Tools

  1. Wrap existing APIs as tools (dewatermark, getty, youtube, srt tanker, NCA toolkit)
  2. directors/ingest.py — YouTube channel ingestion
  3. tools/vision/video_analyze.py — Frame extraction + visual analysis
Milestone: Full tool suite available, KB growing from YouTube content
Phase 4

Polish & Self-Optimization

  1. api/server.py — FastAPI dashboard with metrics views
  2. Session resume/persistence
  3. KB auto-decay (confidence drops on unused/failing techniques)
  4. Metrics-driven self-optimization (auto-select best technique per protection)
  5. Permission refinement
Milestone: Production-ready, self-optimizing autonomous scraping agent