Frequently Asked Questions
Common questions about S88 Runtime and Hub.
Sector88 is an AI inference platform for constrained hardware. We provide the operating layer between AI models and the systems they run on, handling the hard problems that arise when deploying large models on devices that weren't designed for them.
The platform has two components. S88 Runtime is the inference engine - it orchestrates memory across GPU VRAM, system RAM, and storage so models run reliably without crashing, even when they exceed the hardware's native capacity. It supports multiple inference backends (vLLM, llama.cpp), multiple hardware vendors (NVIDIA, AMD, Intel, CPU-only), and runs fully offline with zero external dependencies.
S88 Hub is the management and control plane. Real-time monitoring, performance analytics, fleet management across multiple nodes, and production telemetry with Prometheus integration. Together they form a complete system for deploying, managing, and scaling AI inference in air-gapped, edge, and sovereign environments.
S88 Runtime is the inference engine. It sits between the AI model and the hardware, orchestrating memory across VRAM, RAM, and SSD so models that would normally crash or require larger hardware run reliably on constrained systems. It coordinates multiple inference backends (vLLM, llama.cpp) and supports NVIDIA, AMD, Intel, and CPU-only hardware. It ships as a Docker container with an OpenAI-compatible API endpoint and has zero external dependencies - fully offline operation.
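Because the endpoint is OpenAI-compatible, existing client libraries work unchanged. A minimal sketch using the OpenAI Python SDK - the host, port, and model name are placeholders for illustration, not S88 defaults:

    # Point any OpenAI-compatible client at the local S88 Runtime endpoint.
    # The base_url and model name below are assumptions for illustration.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",   # placeholder local endpoint
        api_key="not-needed-offline",          # offline deployments typically ignore this
    )

    response = client.chat.completions.create(
        model="llama-3-8b-instruct",           # placeholder model name
        messages=[{"role": "user", "content": "Summarize the maintenance log."}],
    )
    print(response.choices[0].message.content)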
S88 Hub is the operational control plane. Real-time monitoring of VRAM, RAM, power, and thermal metrics. Comprehensive performance analysis with detailed reports and raw data exports. Fleet management for multi-node deployments. Web dashboard for visualization and control. Built-in Prometheus integration.
Sector88 works with existing inference engines, not against them. While vLLM, Triton, Ollama, and llama.cpp optimize for throughput, Sector88 focuses on the operational layer: safe defaults, telemetry, and reliability controls for constrained/regulated environments. Roadmap items include deeper memory orchestration and policy-driven optimization.
Sector88 offers tiered access based on your deployment needs, from single-node setups to enterprise fleet management with air-gapped support.
Contact us for current pricing and deployment options tailored to your environment.
A node is any machine running S88 Runtime - server, workstation, edge device, or VM. One license key per node.
Deploying to 10+ nodes? Contact us for volume licensing.
We partner with integrators, OEMs, hardware vendors, and enterprise resellers. Pilot programs include technical evaluation on your infrastructure, baseline performance analysis, deployment support, and access to S88 for testing. Contact partnerships@sector88.co to discuss opportunities.
OOM (Out Of Memory) crashes happen when GPU VRAM is exhausted during inference, killing the process mid-request. Service goes down, requests fail, deployments become unreliable.
S88 prevents this through intelligent memory orchestration and graceful degradation - the system stays operational under resource pressure instead of crashing.
S88 is hardware-agnostic. NVIDIA CUDA, AMD ROCm, Intel Gaudi/Xeon, Apple Silicon, and CPU-only servers. If it can run an inference engine, S88 runs on it.
Yes. S88 is production-ready and available now. Request access and we will get you set up.
Yes. S88 never terminates on OOM. Back-pressure queuing, context clipping, and intelligent eviction keep the service running under memory pressure.
The system stays responsive — zero downtime from memory constraints.
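As a generic illustration of back-pressure queuing (a sketch of the general technique, not S88 internals), an admission gate bounds in-flight requests so excess load waits in a queue instead of over-committing GPU memory:

    # Generic back-pressure sketch, illustrative only: a semaphore caps in-flight
    # requests; excess requests queue instead of exhausting memory and crashing.
    import asyncio

    MAX_IN_FLIGHT = 4  # assumed budget, e.g. derived from available VRAM

    gate = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def run_inference(prompt: str) -> str:
        await asyncio.sleep(0.1)               # stand-in for the real inference call
        return f"ok: {prompt}"

    async def handle_request(prompt: str) -> str:
        async with gate:                       # waits here under load (back-pressure)
            return await run_inference(prompt)

    async def main():
        results = await asyncio.gather(*(handle_request(f"req {i}") for i in range(16)))
        print(f"{len(results)} requests served, never more than {MAX_IN_FLIGHT} at once")

    asyncio.run(main())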
Yes. S88 is purpose-built for air-gapped and sovereign deployments. Zero external dependencies at runtime.
Models load from local filesystem. Telemetry stays local via Prometheus. All data remains on-premises. Designed for SCIF, classified, and offline environments.
Any LLM supported by vLLM or llama.cpp. This includes Llama, Mistral, Qwen, DeepSeek, and thousands of HuggingFace models. Both HuggingFace format and GGUF quantized models are supported.
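For offline nodes, model files can be staged on a connected machine and copied across. A sketch using huggingface_hub to pre-fetch a quantized GGUF - the repository, filename, and target directory are examples, not S88 requirements:

    # Pre-stage a quantized GGUF on a connected machine, then copy it to the
    # offline node's local model directory. Repo, file, and path are examples.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",   # example repository
        filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",    # example quantization
        local_dir="/opt/models",                            # example local model dir
    )
    print("Model staged at:", local_path)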
Community: Full S88 Runtime access with CPU inference and models up to 7B parameters.
Pro: GPU acceleration, unlimited model sizes, Hub dashboard, and automated benchmark reports.
Enterprise: Dedicated deployment support, infrastructure assessment, and custom SLAs.
Under a minute. Pull the Docker image, run it, and you have an inference API. Full production deployment with custom configuration typically takes hours, not weeks.
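Once the container is running, a quick readiness check confirms the API is serving - the endpoint below is an assumed local default, not a documented S88 value:

    # Readiness check against the (assumed) local OpenAI-compatible endpoint.
    import requests

    resp = requests.get("http://localhost:8000/v1/models", timeout=5)
    resp.raise_for_status()
    print("Models available:", [m["id"] for m in resp.json().get("data", [])])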
S88 is built for regulated environments. Zero prompt logging by default. Audit-ready telemetry without content exposure. Configurable retention policies. Complete offline operation.
Designed for HIPAA, PCI-DSS, FedRAMP, and defense compliance requirements.
S88 scales from single GPUs to distributed clusters. Multi-GPU tensor parallelism and multi-node deployments are supported. Fleet management capabilities available through S88 Hub.
Built-in Prometheus metrics, structured event logs, and performance tracking - tokens/sec, latency, time to first token (TTFT), VRAM, RAM, GPU temperature, power, and utilization.
S88 Hub provides real-time dashboards, SLO tracking, and professional scorecard generation with PDF and CSV export.
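The raw Prometheus metrics can also be inspected outside the Hub with any exposition-format parser. A sketch - the metrics URL and the metric name patterns are assumptions about the deployment, not documented values:

    # Fetch and parse the Prometheus exposition-format metrics directly.
    # The endpoint URL and the name filters below are assumptions.
    import requests
    from prometheus_client.parser import text_string_to_metric_families

    text = requests.get("http://localhost:8000/metrics", timeout=5).text

    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            # Surface throughput- and memory-related samples for a quick check.
            if any(key in sample.name for key in ("tokens", "vram", "gpu")):
                print(sample.name, sample.labels, sample.value)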
Yes. Request access and we will work with you to evaluate S88 on your hardware and use case. No commitment required.
Still have questions?
Contact Us