
    From Red to Green: Building Self-Healing CI/CD Pipelines with ChatOps and LLMs

    Tired of flaky tests and cryptic pipeline failures? Learn how we cut manual triage by 20% and accelerated recovery times by integrating an LLM-powered assistant directly into our ChatOps workflow.

    August 1, 2025
    LLM · ChatOps · DevEx · AIOps · SRE

    The 2 AM Pipeline Failure Nightmare

    We’ve all been there. A critical deployment is blocked, the CI/CD pipeline is bleeding red, and the error logs are as cryptic as ancient hieroglyphs. The on-call engineer, jolted awake, spends the next hour digging through dashboards and re-running jobs, only to find the cause was a transient network hiccup—a "flaky" failure. This scenario was happening so often that it was burning out our team and eroding confidence in our delivery process.

    Manual triage was slow, inconsistent, and a massive drain on our most valuable resource: engineer time. We needed a system that could not just report failures, but understand them and kickstart the healing process automatically.

    The Old Way: Manual Triage Misery

    1. Alert Storm: A vague #ci-alerts notification fires in Slack.
    2. Human Triage: An engineer stops their work (or wakes up) to investigate.
    3. Log Diving: They sift through thousands of lines of logs, trying to spot the real error.
    4. Ownership Puzzle: Is it a code bug? An infrastructure problem? A flaky test? They spend time figuring out who to notify.
    5. Manual Recovery: Re-run the job, create a ticket, and hope it doesn't happen again.

    This reactive loop was killing our productivity.

    The New Way: Intelligent, Self-Healing Pipelines

    We built a system that combines the collaborative power of ChatOps with the pattern-recognition capabilities of a Large Language Model (LLM).

    Here’s how it works:

    graph TD
        A[Pipeline Fails] --> B{GitLab Webhook};
        B --> C[Failure Triage Lambda];
        C --> D{Fine-Tuned LLM};
        D -- Analyzes Logs --> C;
        C --> E{Categorize Failure};
        E -- Infra Flake --> F[Auto-Retry Job];
        E -- Code Defect --> G[Create Jira Ticket & Assign];
        E -- Security Issue --> H[Page Security On-Call];
        F --> I[Notify Slack: Retrying];
        G --> J[Notify Slack: Ticket Created];
        H --> K[Notify Slack: Paging Security];
    

    How We Built It

    Our approach had two core components: a fine-tuned LLM for classification and a Slack bot for communication.

    1. Fine-Tuning the LLM Classifier: We scraped historical data from thousands of past pipeline failures, including logs, error messages, and the resulting Jira tickets (which told us the root cause). We used this dataset to fine-tune a smaller, open-source LLM to be an expert at one thing: classifying pipeline failures into categories like infra_flake, unit_test_failure, security_scan_vuln, or build_error. A sketch of the training-data format follows this list.

    2. ChatOps Integration: A serverless function receives a webhook from our CI tool (GitLab) on any job failure. It sends the logs to our LLM for classification. Based on the response, it takes action.
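
    As a concrete illustration of step 1, here is a minimal Python sketch (not our production code) of flattening historical failures into prompt/completion pairs for fine-tuning. The field names, example records, and output file name are hypothetical; in practice the log excerpts came from CI history and the labels from the linked Jira tickets.

    import json

    # Historical failures joined with the root cause recorded in Jira.
    # The field names and records below are illustrative placeholders.
    historical_failures = [
        {
            "log_excerpt": "ERROR: 503 Service Unavailable from cloud provider API ...",
            "root_cause_label": "infra_flake",
        },
        {
            "log_excerpt": "FAILED tests/test_checkout.py::test_total - AssertionError ...",
            "root_cause_label": "unit_test_failure",
        },
    ]

    # Write prompt/completion pairs in the JSONL format that most
    # fine-tuning toolchains accept.
    with open("pipeline_failures.jsonl", "w") as f:
        for record in historical_failures:
            example = {
                "prompt": "Classify this CI failure log:\n" + record["log_excerpt"],
                "completion": record["root_cause_label"],
            }
            f.write(json.dumps(example) + "\n")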

    Here’s a simplified look at the JSON payload our LLM returns. This is the 'brain' that drives the automation:

    {
      "status": "failed",
      "failure_type": "infra_flake",
      "confidence_score": 0.92,
      "suggested_owner": "@platform-team",
      "suggested_action": "retry",
      "summary": "Failure detected in the terraform-apply job. Logs indicate a transient 503 error from the cloud provider's API. This is likely a temporary infrastructure issue."
    }
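
    Below is a minimal, hypothetical Python sketch of how the triage function can act on that payload, mirroring the routing in the diagram above. The helper functions are placeholders for the real GitLab, Jira, paging, and Slack integrations, and the 0.8 confidence gate is an illustrative assumption rather than something prescribed by our setup.

    CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff: below this, hand the failure to a human

    def notify_slack(message: str) -> None:
        print(f"[slack] {message}")  # placeholder for a Slack API call

    def retry_gitlab_job(job_id: str) -> None:
        print(f"[gitlab] retrying job {job_id}")  # placeholder for a job-retry call

    def create_jira_ticket(summary: str, assignee: str) -> str:
        print(f"[jira] ticket for {assignee}: {summary}")  # placeholder
        return "CI-1234"  # hypothetical ticket key

    def page_security_oncall(summary: str) -> None:
        print(f"[pager] paging security on-call: {summary}")  # placeholder

    def handle_failure(triage: dict, job_id: str) -> None:
        """Route a classified failure to the right action."""
        if triage["confidence_score"] < CONFIDENCE_THRESHOLD:
            notify_slack(f"Low-confidence triage for job {job_id}; needs a human: {triage['summary']}")
        elif triage["failure_type"] == "infra_flake" and triage["suggested_action"] == "retry":
            retry_gitlab_job(job_id)
            notify_slack(f"Retrying job {job_id}: {triage['summary']}")
        elif triage["failure_type"] == "security_scan_vuln":
            page_security_oncall(triage["summary"])
            notify_slack(f"Paging security on-call for job {job_id}")
        else:  # code defects (unit_test_failure, build_error) and anything unrecognized
            ticket = create_jira_ticket(triage["summary"], triage["suggested_owner"])
            notify_slack(f"Created {ticket} and assigned to {triage['suggested_owner']}")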
    

    The Impact: Giving Engineers Their Time Back

    By automating the initial triage, we achieved significant results:

    • 20% Fewer Human Interventions: The bot successfully retried and resolved flaky infrastructure failures without any human needing to intervene.
    • Faster Triage: For legitimate bugs, Jira tickets were created and assigned to the right team in seconds, not hours. Mean time to acknowledgment (MTTA) dropped significantly.
    • Happier Engineers: On-call became quieter, and developers could focus on building features instead of chasing pipeline ghosts. The system didn't just fix pipelines; it helped fix a broken process.