Which LLM to Use
Find the ‘right’ model for your use case.
Models
See all models →
AI pioneers train, fine-tune, and run frontier models on our GPU cloud platform.
Build with open-source and specialized multimodal models for chat, images, code, and more. Migrate from closed models with OpenAI-compatible APIs.
Chat
DeepSeek's latest open Mixture-of-Experts model challenging top AI models at much lower cost.
TRY THIS MODEL
Chat
Hybrid instruct + reasoning model (232Bx22B MoE) optimized for high-throughput, cost-efficient inference and distillation.
TRY THIS MODEL
Chat
State-of-the-art mixture-of-experts agentic intelligence model with 1T parameters, 128K context, and native tool use.
TRY THIS MODEL
Chat
SOTA 128-expert MoE powerhouse for multilingual image/text understanding, creative writing, and enterprise-scale applications.
TRY THIS MODEL
Transcribe
High-performance speech-to-text model delivering transcription 15x faster than OpenAI with support for 1GB+ files, 50+ languages, and production-ready infrastructure.
TRY THIS MODEL
Chat
Lightweight model with vision-language input, multilingual support, visual reasoning, and top-tier performance per size.
TRY THIS MODEL
Image
In-context image generation and editing model using both text and image inputs.
TRY THIS MODEL
Chat
Selective parameter activation delivers 2B/4B multimodal performance on low-resource devices, handling text, image, video, and audio.
TRY THIS MODEL
Chat
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction-tuned generative model with 70B parameters (text in/text out). The instruction-tuned, text-only Llama 3.3 model is optimized for multilingual dialogue use cases and outperforms many available open-source and closed chat models on common industry benchmarks.
TRY THIS MODEL
Chat
DeepSeek-R1 (Throughput) is a state-of-the-art reasoning model trained with reinforcement learning. It delivers strong performance on math, code, and logic tasks, comparable to OpenAI o1, and is especially good at tasks like code review, document analysis, planning, information extraction, and coding.
TRY THIS MODEL
Image
In-context image generation and editing model with enhanced prompt adherence.
TRY THIS MODEL
Chat
456B-parameter hybrid MoE reasoning model with 40K thinking budget, lightning attention, and 1M token context for efficient reasoning and problem-solving tasks.
TRY THIS MODEL
Chat
Powerful decoder-only models available in 7B and 72B variants, developed by Alibaba Cloud's Qwen team for advanced language processing.
TRY THIS MODEL
Chat
Upgraded DeepSeek-R1 with better reasoning, function calling, and coding, using 23K-token thinking to score 87.5% on AIME.
TRY THIS MODEL
Image
Free endpoint for the SOTA open-source image generation model by Black Forest Labs.
TRY THIS MODEL
Audio
Low-latency, ultra-realistic voice model, served in partnership with Cartesia.
TRY THIS MODEL
Chat
DeepSeek-R1 model post-trained by Perplexity AI to remove censorship and bias while preserving reasoning strength.
TRY THIS MODEL
Chat
NVIDIA NIM for GPU-accelerated Llama 3.1 Nemotron 70B Instruct inference through OpenAI-compatible APIs.
TRY THIS MODEL
Code
SOTA code LLM with advanced code generation, reasoning, fixing, and support for up to 128K tokens.
TRY THIS MODEL
Image
Premium image generation model by Black Forest Labs.
TRY THIS MODEL
Vision
Vision-language model with advanced visual reasoning, video understanding, structured outputs, and agentic capabilities.
TRY THIS MODEL
Chat
A 32B SLM optimized for managing complex tool-based interactions and API function calls. Its strength lies in precise execution, intelligent orchestration, and effective communication between systems – making it ideal for automation pipelines.
TRY THIS MODEL
Chat
SOTA 109B model with 17B active params & large context, excelling at multi-document analysis, codebase reasoning, and personalized tasks.
TRY THIS MODEL
Chat
A versatile and powerful 32B SLM, capable of handling varied tasks with precision and adaptability across multiple domains. Ideal for dynamic use cases that require significant computational power.
TRY THIS MODEL
Leverage pre-trained models, fine-tune them for your needs, or build custom models from scratch. Whatever your generative AI needs, Together AI offers a seamless continuum of AI compute solutions to support your entire journey.
Train
Models
Powered by the Together Inference Engine, combining research-driven innovation with deployment flexibility.
Transformer-optimized kernels: our researchers' custom FP8 inference kernels, 75%+ faster than base PyTorch
Quality-preserving quantization: accelerating inference while maintaining accuracy with advances such as QTIP
Speculative decoding: faster throughput, powered by novel algorithms and draft models trained on the RedPajama dataset (a toy sketch of the core idea follows below)
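Since speculative decoding is the least self-explanatory item above, here is a toy, greedy-only sketch of the core idea. It is a conceptual illustration, not the Together Inference Engine's implementation; the two toy "models" at the bottom stand in for a real draft/target pair.

# Toy sketch of speculative decoding (greedy variant), for intuition only.
# A cheap draft model proposes k tokens; the expensive target model verifies them
# and keeps the longest agreeing prefix, so several tokens can land per round.
# Real engines verify all k drafted positions in one batched target forward pass;
# this sketch calls the target per position purely for clarity.
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[str]], str],  # expensive model: context -> next token
    draft_next: Callable[[List[str]], str],   # cheap model: context -> next token
    prompt: List[str],
    k: int = 4,
    max_new_tokens: int = 10,
) -> List[str]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft k candidate tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify against the target, accepting the agreeing prefix.
        accepted, correction = [], None
        for i in range(k):
            expected = target_next(tokens + accepted)
            if draft[i] == expected:
                accepted.append(expected)
            else:
                correction = expected  # target overrides the first disagreement
                break
        tokens.extend(accepted)
        if correction is not None:
            tokens.append(correction)
    return tokens

# Toy stand-ins: the "target" continues the alphabet; the "draft" is usually right.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
target_next = lambda ctx: ALPHABET[len(ctx) % 26]
draft_next = lambda ctx: ALPHABET[len(ctx) % 26] if len(ctx) % 5 else "?"
print(speculative_decode(target_next, draft_next, prompt=["a"]))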
Turbo: Best performance without losing accuracy
Reference: Full precision, available for 100% accuracy
Lite: Optimized for fast performance at the lowest cost
Dedicated instances: fast, consistent performance, without rate limits, on your own single-tenant NVIDIA GPUs
Serverless API: quickly switch from closed LLMs to models like Llama, using our OpenAI-compatible APIs (see the sketch below)
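As a rough illustration of switching via the OpenAI-compatible Serverless API, the sketch below points the standard openai Python client at Together's endpoint. The environment variable name, model id, and prompt are placeholders chosen for the example; substitute any serverless chat model from the catalog above.

# Minimal sketch: calling a Together serverless model through the OpenAI-compatible API.
# Assumes the `openai` Python package is installed and an API key is available.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],   # placeholder env var holding your key
    base_url="https://api.together.xyz/v1",   # Together's OpenAI-compatible endpoint
)

# "meta-llama/Llama-3.3-70B-Instruct-Turbo" is an example model id; pick any serverless chat model.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize the benefits of open-source LLMs in two sentences."}],
)
print(response.choices[0].message.content)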
Fine-tune open-source models like Llama on your data and run them on Together Cloud or in a hyperscaler VPC. With no vendor lock-in, your AI remains fully under your control.
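The upload step below expects the training data as a JSONL file, one JSON object per line. A minimal sketch for producing such a file is shown here; the single "text" field per record is an assumption for illustration, so match the record shape to whatever format the model you are fine-tuning requires.

# Minimal sketch: writing a fine-tuning dataset as JSONL (one JSON object per line).
# The {"text": ...} record shape is an illustrative assumption; adjust it to the
# format expected by your target model.
import json

examples = [
    "Customer: How do I reset my password?\nAgent: Use the 'Forgot password' link on the sign-in page.",
    "Customer: Where can I find my invoices?\nAgent: They are listed under Billing in your account settings.",
]

with open("acme_corp_customer_support.jsonl", "w") as f:
    for text in examples:
        f.write(json.dumps({"text": text}) + "\n")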
together files upload acme_corp_customer_support.jsonl
{
  "filename": "acme_corp_customer_support.jsonl",
  "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
  "object": "file"
}
together finetune create --training-file file-aab9997e-bca8-4b7e-a720-e820e682a10a \
  --model togethercomputer/RedPajama-INCITE-7B-Chat
together finetune create --training-file $FILE_ID \
  --model $MODEL_NAME \
  --wandb-api-key $WANDB_API_KEY \
  --n-epochs 10 \
  --n-checkpoints 5 \
  --batch-size 8 \
  --learning-rate 0.0003
"training_file": "file-aab9997-bca8-4b7e-a720-e820e682a10a",
"model_output_name": "username/togethercomputer/llama-2-13b-chat",
"model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
"Suffix": "Llama-2-13b 1",
"model": "togethercomputer/llama-2-13b-chat",
"n_epochs": 4,
"batch_size": 128,
"learning_rate": 1e-06,
"checkpoint_steps": 2,
"created_at": 1687982945,
"updated_at": 1687982945,
"status": "pending",
"id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
"epochs_completed": 3,
"events": [
"object": "fine-tune-event",
"created_at": 1687982945,
"message": "Fine tune request created",
"type": "JOB_PENDING",
"queue_depth": 0,
"wandb_project_name": "Llama-2-13b Fine-tuned 1"
Forge the AI frontier.
Train on expert-built GPU clusters.
Built by AI researchers for AI innovators, Together GPU Clusters are powered by NVIDIA GB200, H200, and H100 GPUs, along with the Together Kernel Collection — delivering up to 24% faster training operations.
Robust Management Tools
Slurm and Kubernetes orchestrate dynamic AI workloads, optimizing training and inference seamlessly.
Training-ready clusters – Blackwell and Hopper
THE AI ACCELERATION CLOUD
BUILT ON LEADING AI RESEARCH.
Innovations
Our research team is behind breakthrough AI models, datasets, and optimizations.
Customer Stories
See how we support leading teams around the world. Our customers are creating innovative generative AI applications, faster.
Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. MosaicML comparison based on published numbers in the MosaicML blog. Detailed results and methodology published here.
Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference. Detailed results and methodology published here.
Based on published pricing November 8th, 2023, comparing OpenAI GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless Endpoints. Assumes an equal number of input and output tokens.
Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
Testing methodology and results published in this research paper.
Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps InfiniBand networking configuration.