Back to Blog

Control your MFA Gated HPC from Discord with AI, for freee!

rahit
14 min read

How to remotely control your university HPC cluster from anywhere using Discord, without spending a dime.


The Problem

If you’re a researcher using High Performance Computing (HPC) clusters, you’ve probably experienced this frustration:

  • You’re away from your desk and want to check if your 48-hour training job finished
  • You need to submit a quick job but don’t have your laptop
  • Your cluster requires multi-factor authentication (Duo/2FA) every single time you log in

What if you could just send a message from your phone—anywhere, anytime—and have an AI agent handle it for you?

You: "!fir check if my training job is done"
Bot: "Your job finished 2 hours ago! The accuracy reached 94.2%."

That’s what we’ll build in this guide. Zero monthly cost. Zero vendor lock-in. Zero compromises.


What We’re Building

An autonomous AI assistant that:

  • Responds to natural language commands via Discord
  • Executes tasks on your HPC cluster
  • Runs 24/7 without your intervention
  • Handles authentication (mostly) automatically
  • Monitors cluster access and alerts you when issues arise
  • Costs $0/month to operate

Tech Stack:

  • Discord: Free messaging interface
  • Oracle Cloud VPS: Free forever (4 CPU, 24GB RAM)
  • Google Gemini API: Free AI model available through API
  • Alliance Canada Fir Cluster: Or any SSH-accessible HPC cluster with MFA
  • Python: Discord bot + SSH automation

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│  You (Discord on phone/laptop)                              │
│  "!fir check my jobs"                                       │
└─────────────────────┬───────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────────────────┐
│  Discord Bot (on Oracle VPS)                                │
│  - Receives message                                         │
│  - Sends to Gemini AI                                       │
└─────────────────────┬───────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────────────────┐
│  Gemini AI (Google, free)                                   │
│  - Interprets: "check my jobs" → "squeue -u username"       │
│  - Returns bash command                                     │
└─────────────────────┬───────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────────────────┐
│  Reverse SSH Tunnel                                         │
│  VPS ←─────── HPC Cluster                                   │
│  Port 2222    (initiated from inside cluster)              │
└─────────────────────┬───────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────────────────┐
│  HPC Cluster (Fir)                                          │
│  - Bot SSHs in via tunnel                                   │
│  - Executes: squeue -u username                             │
│  - Returns output                                           │
└─────────────────────┬───────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────────────────┐
│  Results back to Discord                                    │
│  "You have 2 jobs running: job_12345 (87% done)..."        │
└─────────────────────────────────────────────────────────────┘

Key Strategy: The Reverse Tunnel

Traditional approach: VPS connects to cluster (requires cluster to allow incoming SSH)
Our approach: Cluster connects to VPS (works even with strict firewall rules)

This is why the tunnel is “reverse”—initiated from inside the cluster, bypassing the restriction.


Part 1: Setting Up the Always-On VPS (Oracle Cloud)

Why Oracle Cloud?

Most cloud providers’ free tiers expire after 12 months. Oracle Cloud’s “Always Free” tier is actually permanent:

  • 4 ARM CPUs (Ampere A1)
  • 24GB RAM
  • 200GB storage
  • Forever free (as long as you use it monthly)

This is more than enough to run our Discord bot 24/7.

Step 1.1: Create Oracle Cloud Account

  1. Go to https://www.oracle.com/cloud/free/
  2. Sign up (requires credit card but won’t charge)
  3. I used Toronto region (any region with Ampere availability will work)

Step 1.2: Launch the Instance

Configuration:

  • Shape: VM.Standard.A1.Flex
  • OCPU: 4
  • Memory: 24GB
  • Image: Ubuntu 22.04
  • CRITICAL: Paste your PC’s SSH public key (~/.ssh/id_ed25519.pub or id_rsa.pub)

If provisioning fails: Oracle’s free tier sometimes has capacity issues. Just retry every 15-30 minutes in the same availability domain. Usually provisions within a few hours. (I used a automated script. Took about 20hours to get the VPS)

Step 1.3: Note Your VPS IP

Once running, note the public IP address (e.g., 40.***.***.100). We’ll call this VPS “otterland” throughout.

Step 1.4: SSH Alias (Local Machine)

Add to your ~/.ssh/config:

Host otterland
    HostName 40.***.***.100
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519

Test: ssh otterland should connect.


Part 2: Setting Up Passwordless SSH to Your Cluster

The bot needs to SSH into your cluster without human intervention. But how do we handle MFA?

The Solution: Two-Layer Authentication

  1. SSH Key: Passwordless authentication (one-time setup)
  2. ControlMaster: Reuses one authenticated session for 30 days
  3. You: Approve Duo MFA once per month

Step 2.1: Generate a No-Passphrase SSH Key

On your local machine:

ssh-keygen -t ed25519 -C "cluster-automation" -f ~/.ssh/id_cluster_automation
# Press Enter when asked for passphrase (leave it empty)

Step 2.2: Copy Key to Cluster

ssh-copy-id -i ~/.ssh/id_cluster_automation.pub username@yourcluster.edu

Enter your password and approve Duo one last time.

Step 2.3: Configure ControlMaster

Edit ~/.ssh/config on your local machine:

Host mycluster
    HostName yourcluster.edu
    User your_username
    IdentityFile ~/.ssh/id_cluster_automation
    ControlMaster auto
    ControlPath ~/.ssh/cm_sockets/%r@%h:%p
    ControlPersist 720h  # 30 days!
    ServerAliveInterval 60
    ServerAliveCountMax 10

Create the socket directory:

mkdir -p ~/.ssh/cm_sockets

Step 2.4: Test It

# First connection - will ask for Duo
ssh mycluster "hostname"

# Second connection - silent!
ssh mycluster "hostname"

If the second connection doesn’t ask for Duo, ControlMaster is working. This is the magic that makes automation possible.


Part 3: The Reverse Tunnel (Bypassing Institutional Firewalls)

Here’s the clever part: your cluster can’t accept incoming SSH connections (firewall), but it CAN initiate outgoing connections.

Step 3.1: Copy Your Key to the VPS

From your local machine:

scp ~/.ssh/id_cluster_automation* otterland:~/.ssh/

Step 3.2: Set Up SSH Config on VPS

SSH into otterland and create ~/.ssh/config:

Host mycluster
    HostName localhost
    Port 2222
    User your_username
    IdentityFile ~/.ssh/id_cluster_automation
    ControlMaster auto
    ControlPath ~/.ssh/cm_sockets/%r@%h:%p
    ControlPersist 720h
    ServerAliveInterval 60
    ServerAliveCountMax 10
mkdir -p ~/.ssh/cm_sockets

Step 3.3: Start the Reverse Tunnel FROM the Cluster

SSH into your cluster:

ssh mycluster  # From your local machine, approve Duo

Once on the cluster, start the tunnel in tmux:

# Create persistent session
tmux new -s tunnel

# Start reverse tunnel
ssh -N -R 2222:localhost:22 \
    -o ServerAliveInterval=60 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    ubuntu@40.233.101.100

# Detach from tmux: Ctrl+b then d

What this does:

  • Opens port 2222 on otterland
  • Forwards it to port 22 on your cluster
  • Keeps the connection alive with heartbeats
  • Runs in background via tmux

Step 3.4: Verify the Tunnel

From otterland:

ssh otterland
ssh -p 2222 localhost "hostname"

Should return your cluster’s hostname without asking for Duo.

🎉 Congratulations! Your VPS can now reach your cluster anytime, silently.


Part 4: Creating the Discord Bot

Step 4.1: Create Discord Bot

  1. Go to https://discord.com/developers/applications
  2. Click New Application → name it (e.g., “HPC Agent”)
  3. Go to Bot tab → Add Bot
  4. Under Privileged Gateway Intents → enable Message Content Intent
  5. Copy the bot token (keep secret!)

Step 4.2: Invite Bot to Your Server

  1. Go to OAuth2URL Generator
  2. Check bot scope
  3. Check permissions:
    • View Channels
    • Send Messages
    • Read Message History
  4. Copy the generated URL and open in browser
  5. Add bot to your server

Step 4.3: Get Gemini API Key

  1. Go to https://aistudio.google.com/apikey
  2. Click Create API key
  3. Copy the key

Part 5: Installing the Bot on VPS

Step 5.1: Install Dependencies

SSH into otterland:

ssh otterland

# Install uv (modern Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.cargo/env

Step 5.2: Create Project Structure

mkdir -p ~/projects/discord_agent
cd ~/projects/discord_agent

# Initialize with uv
uv init
uv add discord.py google-generativeai python-dotenv

Step 5.3: Create Environment File

nano .env
DISCORD_TOKEN=your_discord_bot_token_here
GEMINI_API_KEY=your_gemini_api_key_here
ALERT_CHANNEL_ID=your_discord_channel_id  # Right-click channel → Copy ID

Step 5.4: Create the Bot Script

nano bot.py
import discord
import google.generativeai as genai
import subprocess
import os
from dotenv import load_dotenv

load_dotenv()

DISCORD_TOKEN = os.getenv("DISCORD_TOKEN")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

genai.configure(api_key=GEMINI_API_KEY)

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)

def run_on_cluster(command):
    """Execute command on cluster via SSH"""
    try:
        result = subprocess.run(
            ["ssh", "mycluster", command],
            capture_output=True,
            text=True,
            timeout=60
        )
        return result.stdout + result.stderr
    except Exception as e:
        return f"Error: {str(e)}"

# Define the tool for Gemini
tools = [
    genai.protos.Tool(
        function_declarations=[
            genai.protos.FunctionDeclaration(
                name="run_cluster_command",
                description="Execute a bash command on the HPC cluster",
                parameters=genai.protos.Schema(
                    type=genai.protos.Type.OBJECT,
                    properties={
                        "command": genai.protos.Schema(
                            type=genai.protos.Type.STRING,
                            description="The bash command to execute"
                        )
                    },
                    required=["command"]
                )
            )
        ]
    )
]

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    tools=tools
)

@client.event
async def on_ready():
    print(f'{client.user} is connected to Discord!')

@client.event
async def on_message(message):
    if message.author == client.user:
        return
    
    if message.content.startswith('!hpc'):
        prompt = message.content[5:].strip()
        
        await message.channel.send(f"🔄 Processing: {prompt[:100]}...")
        
        try:
            chat = model.start_chat()
            
            system_msg = """You are an HPC cluster assistant with SSH access.

When asked to check jobs, submit tasks, or interact with the cluster, use the run_cluster_command tool.

Common tasks:
- Check jobs: squeue -u username
- Submit job: sbatch script.sh
- List files: ls -lh ~/path
- Check output: cat slurm-*.out
- Cancel job: scancel <job_id>"""
            
            full_prompt = f"{system_msg}\n\nUser: {prompt}"
            response = chat.send_message(full_prompt)
            
            # Handle function calls
            while response.candidates[0].content.parts:
                part = response.candidates[0].content.parts[0]
                
                if hasattr(part, 'function_call') and part.function_call.name:
                    function_call = part.function_call
                    command = function_call.args["command"]
                    
                    # Execute on cluster
                    result = run_on_cluster(command)
                    
                    # Send result back to Gemini
                    response = chat.send_message(
                        genai.protos.Content(
                            parts=[
                                genai.protos.Part(
                                    function_response=genai.protos.FunctionResponse(
                                        name="run_cluster_command",
                                        response={"result": result}
                                    )
                                )
                            ]
                        )
                    )
                else:
                    break
            
            # Get final text response
            text_response = response.text
            
            # Handle long responses (Discord 2000 char limit)
            if len(text_response) > 2000:
                for i in range(0, len(text_response), 1900):
                    await message.channel.send(text_response[i:i+1900])
            else:
                await message.channel.send(text_response)
                
        except Exception as e:
            await message.channel.send(f"❌ Error: {str(e)}")

client.run(DISCORD_TOKEN)

Replace mycluster and username with your actual values.


Part 6: Running the Bot as a System Service

Step 6.1: Create systemd Service

sudo nano /etc/systemd/system/hpc-discord-bot.service
[Unit]
Description=HPC Discord Bot
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/projects/discord_agent
ExecStart=/home/ubuntu/.local/bin/uv run bot.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Step 6.2: Enable and Start

sudo systemctl enable hpc-discord-bot.service
sudo systemctl start hpc-discord-bot.service

Step 6.3: Verify It’s Running

sudo systemctl status hpc-discord-bot.service
sudo journalctl -u hpc-discord-bot.service -f

Part 7: Testing Your Bot

In Discord, try these commands:

!hpc check my running jobs
!hpc list files in my home directory  
!hpc what's the current load on the cluster
!hpc show me the last 10 lines of my latest slurm output

🎉 It should work!


Part 8: Automatic Health Monitoring

Let’s add automatic monitoring that alerts you when the cluster becomes unreachable.

Step 8.1: Create Monitor Script

cd ~/projects/discord_agent
nano monitor.py
import discord
import subprocess
import os
from dotenv import load_dotenv
from datetime import datetime

load_dotenv()

DISCORD_TOKEN = os.getenv("DISCORD_TOKEN")
ALERT_CHANNEL_ID = int(os.getenv("ALERT_CHANNEL_ID", "0"))

def check_cluster_access():
    """Check if cluster is accessible"""
    try:
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=10", "mycluster", "hostname"],
            capture_output=True,
            text=True,
            timeout=15
        )
        return result.returncode == 0, result.stdout + result.stderr
    except Exception as e:
        return False, str(e)

def send_alert(message):
    """Send alert to Discord"""
    if ALERT_CHANNEL_ID == 0:
        print(f"Alert: {message}")
        return
    
    intents = discord.Intents.default()
    client = discord.Client(intents=intents)
    
    @client.event
    async def on_ready():
        channel = client.get_channel(ALERT_CHANNEL_ID)
        if channel:
            await channel.send(f"🚨 **Cluster Access Alert** 🚨\n{message}")
        await client.close()
    
    client.run(DISCORD_TOKEN)

if __name__ == "__main__":
    is_accessible, output = check_cluster_access()
    
    if is_accessible:
        print(f"[{datetime.now()}] ✅ Cluster is accessible")
    else:
        alert_msg = f"""Cluster is not accessible from the VPS.

**Time:** {datetime.now()}
**Error:** {output}

**To fix:**
1. SSH into cluster: `ssh mycluster`
2. Check tunnel: `tmux attach -t tunnel`
3. If down, restart tunnel in tmux
4. Re-authenticate to reset ControlMaster"""
        
        print(f"[{datetime.now()}] ❌ Cluster NOT accessible")
        send_alert(alert_msg)

Step 8.2: Create Systemd Timer

Create the service:

sudo nano /etc/systemd/system/cluster-monitor.service
[Unit]
Description=Cluster Access Monitor
After=network.target

[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/home/ubuntu/projects/discord_agent
ExecStart=/home/ubuntu/.local/bin/uv run monitor.py

Create the timer:

sudo nano /etc/systemd/system/cluster-monitor.timer
[Unit]
Description=Run Cluster Monitor Every Hour

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl enable cluster-monitor.timer
sudo systemctl start cluster-monitor.timer

Verify:

systemctl list-timers cluster-monitor.timer

Troubleshooting Guide

Bot Not Responding

Check if running:

sudo systemctl status hpc-discord-bot.service

View logs:

sudo journalctl -u hpc-discord-bot.service -n 50

Common issues:

  • Gemini API key invalid → regenerate at Google AI Studio
  • Discord token expired → regenerate in Discord Developer Portal
  • Tunnel down → follow tunnel re-establishment below

“Duo MFA Required” Error

Your ControlMaster socket expired (happens after 30 days or cluster maintenance).

Fix:

From otterland:

# Remove old socket
rm ~/.ssh/cm_sockets/*

# Re-authenticate (approve Duo when prompted)
ssh mycluster "hostname"

# Test - should work silently now
ssh mycluster "hostname"

Tunnel Is Down

Symptoms: Bot can’t reach cluster, monitor sends alerts

Fix from your local machine:

# 1. SSH to cluster
ssh mycluster  # Approve Duo

# 2. Check if tmux session exists
tmux ls

# 3. Attach to session
tmux attach -t tunnel

# 4. If SSH process frozen, kill with Ctrl+C

# 5. Restart tunnel
ssh -N -R 2222:localhost:22 \
    -o ServerAliveInterval=60 \
    -o ServerAliveCountMax=3 \
    ubuntu@40.233.101.100

# 6. Detach: Ctrl+b then d

# 7. Verify from VPS
ssh otterland
ssh -p 2222 localhost "hostname"

Gemini API Rate Limits

Free tier: 60 requests/minute (very generous for personal use)

If you hit limits, wait a minute or upgrade to paid tier ($0.00015/1k tokens = nearly free).


Maintenance Schedule

Monthly (~5 minutes):

  • Approve Duo MFA when ControlMaster expires
  • Bot will send you alerts when this is needed

After Cluster Maintenance:

  • Re-establish tunnel (follow troubleshooting guide)
  • Takes 2 minutes

Otherwise:

  • Everything runs automatically
  • Monitor checks hourly
  • Bot auto-restarts on crashes

Security Considerations

Is this secure?

Yes, with caveats:

Pros:

  • SSH key authentication (more secure than passwords)
  • Tunnel is one-way (cluster → VPS, not exposed to internet)
  • No credentials stored in code (all in .env)
  • Bot only has your cluster permissions (can’t escalate)
  • MFA still required monthly

⚠️ Trade-offs:

  • SSH key has no passphrase (required for automation)
  • Anyone with VPS access can reach cluster (keep VPS secure)

Best practices:

  • Use a dedicated Discord server (don’t add bot to public servers)
  • Use .gitignore for .env files
  • Rotate API keys periodically
  • Monitor bot logs for suspicious activity

Limitations

Free tier limits:

  • Gemini: 60 requests/minute (plenty for personal use)
  • Oracle VPS: 4 CPUs, 24GB RAM (permanent)
  • Discord: No limits for personal bots. 2000 character limits/msg is taken care of in the code but chunking.

If you exceed limits:

  • Gemini paid tier: ~$0.15 per 1M tokens (nearly free)
  • Oracle Cloud: Can upgrade to paid, but won’t need to

Advanced Customizations

Adding More Commands

Edit bot.py system message to teach Gemini new tasks:

system_msg = """You are an HPC cluster assistant.

Common tasks:
- Check jobs: squeue -u username
- GPU usage: nvidia-smi
- Disk usage: du -sh ~/
- Module list: module list
- Queue info: sinfo
"""

Multiple Clusters

Add more SSH configs to ~/.ssh/config on VPS:

Host cluster1
    HostName localhost
    Port 2222
    User username1
    ...

Host cluster2
    HostName localhost
    Port 2223
    User username2
    ...

Set up separate tunnels on each cluster forwarding to different ports.

WhatsApp/Telegram Instead of Discord

Replace Discord client with Twilio (WhatsApp) or python-telegram-bot. Same architecture, different frontend.

File Upload/Download

Add bot commands to SCP files:

def download_from_cluster(remote_path, local_path):
    subprocess.run(["scp", f"mycluster:{remote_path}", local_path])
    return local_path

# In bot: send file to Discord
await message.channel.send(file=discord.File(local_path))

Real-World Usage Examples

Example 1: Job Monitoring

You: !hpc are my jobs still running?

Bot: You have 2 jobs running:
- job_456789: "protein_folding" - 87% complete (node gpu01)
- job_456790: "dataset_prep" - 23% complete (node cpu12)

Both look healthy!

Example 2: Quick File Check

You: !hpc show me the last 20 lines of my latest output

Bot: [Shows tail of slurm-456789.out]
Training epoch 47/50: loss=0.023, acc=0.942
Training epoch 48/50: loss=0.021, acc=0.948
...

Example 3: Emergency Job Cancellation

You: !hpc cancel all my jobs

Bot: Cancelled 2 jobs:
- Cancelled job 456789
- Cancelled job 456790

All jobs stopped.

Example 4: Checking Before Submission

You: !hpc what's the queue wait time?

Bot: Current queue status:
- GPU nodes: 3 pending jobs, ~45 min wait
- CPU nodes: 12 pending jobs, ~2 hour wait

GPU queue is relatively free!

Community Contributions Welcome

This setup works for Alliance Canada’s Fir cluster, but the architecture is universal. If you adapt this for other clusters (XSEDE, NERSC, AWS ParallelCluster, etc.), please share!

Potential improvements:

  • Support for SLURM arrays
  • Cost tracking integration
  • Jupyter notebook launching
  • Multi-user support (one bot, many users)
  • Web dashboard alongside Discord

GitHub repo (coming soon): Full code, setup scripts, and contributions welcome.


Conclusion

You now have a free, autonomous AI assistant that controls your HPC cluster via natural language from anywhere. No laptop required. No complex VPN setup. No monthly fees.

 

Questions? Issues? Drop them in the comments or reach out on Discord. I’m happy to help troubleshoot!

Found this useful? Share it with your lab mates! Every researcher fighting with SSH sessions and MFA prompts deserves better.


Written by a researcher tired of approving Duo notifications at 2 AM. Built with coffee, excitement, and Claude’s help.

Last updated: February 2026

Like
Like Love Haha Wow Sad Angry