Bumblebee
Elixir library for running pre-trained HuggingFace neural network models — BERT, GPT-2, Whisper, CLIP, Stable Diffusion, and more — directly in Elixir without Python. Built on Nx (tensor operations) and Axon (neural network layers). Bumblebee downloads model weights from HuggingFace Hub and runs inference via EXLA (XLA GPU backend) or BinaryBackend (CPU). Enables LLM inference, text classification, NER, speech recognition, image classification, and text embedding in pure Elixir. Integrated with Livebook Smart Cells for notebook exploration.
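A minimal quick-start sketch of the flow described above (dependency versions and the model checkpoint are assumptions; any Bumblebee-supported model works the same way):

```elixir
# mix.exs deps (versions are illustrative):
#   {:bumblebee, "~> 0.5"}, {:nx, "~> 0.7"}, {:exla, "~> 0.7"}

# Download (and locally cache) weights and tokenizer from HuggingFace Hub.
{:ok, model_info} = Bumblebee.load_model({:hf, "bert-base-uncased"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

# Build a fill-mask serving and run a single inference.
serving = Bumblebee.Text.fill_mask(model_info, tokenizer)
Nx.Serving.run(serving, "Elixir runs on the [MASK] virtual machine.")
```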
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Local inference — no external API calls for model execution. HuggingFace downloads use HTTPS. HUGGING_FACE_HUB_TOKEN is supplied via an environment variable (not hardcoded). Model weights are stored locally; verify model provenance on HuggingFace Hub before loading third-party weights.
⚡ Reliability
Best When
You're building an Elixir application and need on-premise ML inference (NLP, speech, image classification) without Python microservices — Bumblebee keeps ML inference inside the BEAM.
Avoid When
You need model training, require models not yet supported by Bumblebee, need GPU-heavy inference in a non-Elixir stack, or are building a Python data science pipeline (use HuggingFace Transformers directly).
Use Cases
- Run local Whisper speech-to-text inference in Elixir agent backends — transcribe audio from agent interactions without Python or external API calls
- Generate text embeddings in Elixir for agent semantic search using Bumblebee's BERT/sentence-transformer models — embed user queries and match against an agent knowledge base
- Classify agent input text (sentiment, intent, toxicity) using fine-tuned BERT models via Bumblebee — run model inference in the same Elixir runtime as the agent logic
- Use Bumblebee with Nx.Serving for batched inference — multiple agent requests share the same model instance with automatic batching for GPU efficiency
- Explore HuggingFace models in Livebook using Bumblebee Smart Cells — test different models for agent tasks before integrating them into production Elixir code
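The embedding use case above can be sketched as follows; the sentence-transformer repository and compile options are assumptions to adapt to your model and hardware:

```elixir
repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

# text_embedding/3 builds an Nx.Serving that returns one embedding
# vector per input text.
serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    compile: [batch_size: 8, sequence_length: 128],
    defn_options: [compiler: EXLA]
  )

%{embedding: embedding} = Nx.Serving.run(serving, "Where is my order?")
```

The resulting vector can be compared against precomputed knowledge-base embeddings with a cosine-similarity search.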
Not For
- Teams expecting PyTorch/HuggingFace Transformers parity — Bumblebee supports a subset of popular models; exotic or newly released architectures may not be available; check the HexDocs supported-models list
- GPU-heavy model training — Bumblebee is inference-focused and does not support fine-tuning; train in Python/PyTorch, then export the weights and load them in Bumblebee
- Non-Elixir stacks — Python + HuggingFace Transformers is the standard for ML; Bumblebee is for Elixir teams who want to stay in the BEAM ecosystem
Interface
Authentication
A HuggingFace API token is required to download gated models (Llama, etc.) with Bumblebee.load_model/2. Public models download without authentication. Set the token via the HUGGING_FACE_HUB_TOKEN environment variable.
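As a sketch, a gated checkpoint can be loaded by passing the token in the repository spec (the repository name here is illustrative, and the token is read from the environment rather than hardcoded):

```elixir
# Fails loudly at startup if the variable is unset.
token = System.fetch_env!("HUGGING_FACE_HUB_TOKEN")

{:ok, model_info} =
  Bumblebee.load_model({:hf, "meta-llama/Llama-2-7b-hf", auth_token: token})
```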
Pricing
Bumblebee is Apache 2.0 licensed, maintained by Dashbit (José Valim). Free for all use. Model weights downloaded from HuggingFace Hub — public models free.
Agent Metadata
Known Gotchas
- ⚠ Model downloads on first use — Bumblebee.load_model/2 downloads weights from HuggingFace Hub on first call; large models (Whisper medium = 1.5GB) can take minutes; cache models in Docker images for production
- ⚠ EXLA backend requires XLA compilation on first run — first tensor operation with EXLA backend triggers JIT compilation taking 30-120 seconds; pre-warm in app startup via a dummy inference call
- ⚠ Nx.Serving is required for concurrent inference — running a serving inline with Nx.Serving.run/2 executes in (and blocks) the caller process; start a supervised Nx.Serving and call Nx.Serving.batched_run/2 for async, batched, concurrent inference in production
- ⚠ Not all HuggingFace models are supported — Bumblebee supports specific architectures (BERT, GPT-2, Whisper, CLIP, Stable Diffusion, Llama, Mistral); unsupported architectures require manual Axon model definition
- ⚠ Tokenizer max sequence length — most models have 512 or 1024 token limits; inputs longer than max_length are silently truncated; set truncate: :longest option and validate input lengths
- ⚠ Memory management for model weights — loading multiple large models simultaneously consumes significant RAM; a Llama-7B model needs roughly 28GB in float32 (about 14GB in bfloat16); load weights in half precision or use quantized models for memory efficiency
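Several of the gotchas above (supervised serving, automatic batching, startup warm-up) can be wired together roughly like this; module names, the model, and the options are illustrative assumptions, not a fixed recipe:

```elixir
defmodule MyApp.Application do
  use Application

  @repo {:hf, "distilbert-base-uncased-finetuned-sst-2-english"}

  def start(_type, _args) do
    {:ok, model_info} = Bumblebee.load_model(@repo)
    {:ok, tokenizer} = Bumblebee.load_tokenizer(@repo)

    serving =
      Bumblebee.Text.text_classification(model_info, tokenizer,
        compile: [batch_size: 4, sequence_length: 128],
        defn_options: [compiler: EXLA]
      )

    children = [
      # Supervised serving: callers use Nx.Serving.batched_run/2 and
      # concurrent requests are batched automatically.
      {Nx.Serving, serving: serving, name: MyApp.Classifier, batch_timeout: 50}
    ]

    {:ok, sup} =
      Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)

    # Warm-up: a throwaway request forces any remaining XLA compilation
    # before real traffic arrives.
    Task.start(fn -> Nx.Serving.batched_run(MyApp.Classifier, "warm-up") end)

    {:ok, sup}
  end
end
```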
Alternatives
Scores are editorial opinions as of 2026-03-07.