
Making Software & Platform Portfolios Intelligent with Machine Learning

Ajay Malik · Founder & CEO
July 17, 2018

An advisory session, not a keynote

Late in 2018 TCS invited me to a smaller, closed-door enterprise advisory session in Santa Clara. It was not a keynote with stage lighting. It was a horseshoe of tables, maybe thirty people, most of them running large software estates for banks, insurers, retailers, and a couple of telcos. These were the people who owned portfolios — not one application, but hundreds of them, plus the platforms underneath, accreted over fifteen or twenty years. The mandate for the afternoon was practical: where does machine learning actually earn its keep inside an enterprise like ours?

I had been thinking about this for a while, and I wanted to steer them away from the shiny answers. Everyone in 2018 wanted to talk about customer-facing models: churn prediction, recommendations, fraud scoring. Useful, yes. But I thought the bigger, more neglected opportunity was inward-facing: use supervised learning to make the software portfolio itself self-observing and, eventually, self-optimizing.

The argument: turn the portfolio into a labeled dataset

Here is the framing I opened with. A large software portfolio is already emitting an enormous amount of signal — logs, traces, deployment records, incident tickets, change requests, resource metrics, on-call pages. Most enterprises treat all of that as exhaust. They store it for compliance, occasionally grep it during a fire, and otherwise ignore it.

But look at it as a machine-learning practitioner and it is something else entirely: a continuously growing, richly labeled dataset. Every incident ticket is a label. Every rollback is a label. Every change that preceded an outage is a training example waiting to be used. The portfolio is telling you, in its own operational history, which patterns lead to trouble and which do not. Nobody was mining it that way.
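A minimal sketch of what "every incident ticket is a label" can look like in practice. The record shapes (`Change`, `Incident`), the 24-hour attribution window, and the rule that a rollback without a ticket counts as "degraded" are all assumptions made for illustration, not a description of any particular ticketing or deployment system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical shape of one deployment record.
    change_id: str
    service: str
    deployed_at: datetime
    rolled_back: bool = False


@dataclass
class Incident:
    # Hypothetical shape of one incident ticket.
    service: str
    opened_at: datetime


def label_change(change, incidents, window=timedelta(hours=24)):
    """Label a change 'incident', 'degraded', or 'clean' from what followed it.

    Assumption: an incident on the same service within `window` of the deploy
    is attributed to that change. Real attribution is usually messier.
    """
    deadline = change.deployed_at + window
    for inc in incidents:
        if inc.service == change.service and change.deployed_at <= inc.opened_at <= deadline:
            return "incident"
    return "degraded" if change.rolled_back else "clean"


# Made-up records, purely to show the join.
changes = [
    Change("c1", "payments", datetime(2018, 3, 2, 10, 0)),
    Change("c2", "search", datetime(2018, 3, 2, 11, 0), rolled_back=True),
    Change("c3", "catalog", datetime(2018, 3, 3, 9, 0)),
]
incidents = [Incident("payments", datetime(2018, 3, 2, 14, 30))]

labels = {c.change_id: label_change(c, incidents) for c in changes}
print(labels)
```

The point of the sketch is only that the labels already exist in the operational record; the work is in joining them to the changes that preceded them.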

So my argument was that you could stand up supervised models on top of that operational history to do three concrete things. First, predict which changes or releases carried elevated risk before they shipped, learned from the outcomes of every past change. Second, catch anomalies in behavior early — a service quietly drifting toward failure — instead of waiting for the pager. Third, and most ambitiously, close the loop: let the system recommend or take routine remediations it had seen work, under supervision, so the portfolio began to optimize itself.
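The second of those, catching a service that is drifting before anyone is paged, can start from something much simpler than a trained model. The sketch below is a plain statistical baseline (an exponentially weighted mean and variance), not the supervised approach itself. The smoothing factor, threshold, warm-up length, and latency numbers are illustrative assumptions.

```python
def ewma_alerts(samples, alpha=0.2, threshold=3.0, warmup=5):
    """Return indices of samples that sit far from the running EWMA baseline.

    A sample is flagged when it deviates from the exponentially weighted mean
    by more than `threshold` exponentially weighted standard deviations.
    The first `warmup` samples are never flagged.
    """
    mean = samples[0]
    var = 0.0
    alerts = []
    for i, x in enumerate(samples[1:], start=1):
        deviation = x - mean
        std = var ** 0.5
        if i >= warmup and std > 0 and abs(deviation) > threshold * std:
            alerts.append(i)
        # Update the running estimates only after the check, so a sample
        # is always judged against the baseline that preceded it.
        mean += alpha * deviation
        var = (1 - alpha) * (var + alpha * deviation ** 2)
    return alerts


# Made-up p95 latency readings in milliseconds; the last one jumps.
latency_ms = [100, 102, 99, 101, 100, 99, 101, 100, 118]
print(ewma_alerts(latency_ms))
```

A baseline like this flags deviations without knowing which ones matter. The supervised step is what lets the incident history separate harmless wobble from the patterns that preceded real outages.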

Self-observing first. Self-optimizing second. You have to earn the second one.

Figure: The portfolio as a labeled dataset. Operational signals (logs, traces, incidents, tickets, deploys, rollbacks, metrics, pages) feed supervised models trained on outcomes. The models predict change risk before release, detect anomalies before the pager, and recommend fixes under supervision. Outcomes become new labels, and the loop closes.

Why it mattered to enterprises then

The people around that horseshoe were not short on data or on smart engineers. What they were short on was leverage. Their estates had grown faster than their ability to reason about them. A change advisory board reviewing hundreds of releases a week was, in practice, guessing. Incidents were investigated after the fact by whoever happened to remember a similar one. Institutional knowledge lived in a handful of senior people who were one resignation away from a gap.

Supervised learning offered a way to capture that judgment in something durable. Instead of relying on the one architect who "just knew" a certain service was fragile on Friday deploys, you could learn that pattern from the record and apply it consistently to every change. That was the real pitch: not replacing engineers, but encoding the operational judgment the enterprise had already paid for through years of incidents, so it stopped walking out the door.

And I was honest about the hard part. In 2018 getting a model into production and keeping it trustworthy was still where most efforts died. Model interpretability was not a nicety here; it was the price of entry. No release manager was going to block a deployment because a black box said "risky." The model had to say why — which features, which historical incidents it resembled — or it would be ignored. MLOps, still a young idea then, was the difference between a clever prototype and something an operations team would actually run at three in the morning.

A concrete example

I gave them one worked example. Take change-risk prediction. You assemble a training set from your own history: every change over the past couple of years, described by features you already have — which service, how large the diff, time of day, author's recent track record, dependencies touched, tests run — and labeled with what happened after: clean, degraded, or incident.
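One way such rows could be assembled from change history is sketched below. The field names, the 08:00 to 18:00 definition of business hours, and the three-change lookback are hypothetical. The one design point worth noting is that the track-record feature is computed only from changes that came earlier, so a change's own outcome never leaks into its features.

```python
from collections import defaultdict, deque


def build_rows(history, lookback=3):
    """Turn change history (oldest first) into (features, label) rows.

    Each item in `history` is a dict with the hypothetical keys
    'service', 'diff_lines', 'hour', and 'outcome'.
    """
    # Per-service window of the most recent labels (1 = bad outcome).
    recent = defaultdict(lambda: deque(maxlen=lookback))
    rows = []
    for change in history:
        past = recent[change["service"]]
        features = {
            "diff_lines": change["diff_lines"],
            "off_hours": int(change["hour"] < 8 or change["hour"] >= 18),
            # Bad outcomes among the last `lookback` changes to this service.
            "recent_bad": sum(past),
        }
        label = int(change["outcome"] != "clean")  # degraded or incident -> 1
        rows.append((features, label))
        past.append(label)  # recorded only after the row is built
    return rows


# Made-up history, purely for illustration.
history = [
    {"service": "payments", "diff_lines": 120, "hour": 14, "outcome": "incident"},
    {"service": "payments", "diff_lines": 15, "hour": 10, "outcome": "clean"},
    {"service": "search", "diff_lines": 300, "hour": 22, "outcome": "clean"},
    {"service": "payments", "diff_lines": 640, "hour": 19, "outcome": "degraded"},
    {"service": "payments", "diff_lines": 50, "hour": 9, "outcome": "clean"},
]
rows = build_rows(history)
for features, label in rows:
    print(features, label)
```

Collapsing "degraded" and "incident" into a single positive label is a simplification made here to keep the sketch binary; a three-class label would work the same way.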

Train a gradient-boosted model on that. Now, before a release, it produces a risk score and, crucially, the top reasons: "elevated because this touches the payments gateway, the change is large, and two of the last three changes to this component caused incidents." The board stops guessing and starts triaging. High-risk changes get more eyes; low-risk ones flow through. As new changes ship and outcomes land, they become fresh labels, and the model retrains on a schedule. That is incremental improvement in the plainest possible form (periodic retraining on fresh labels rather than online learning in the strict sense), and it made the portfolio measurably more self-observing within a quarter, without anyone pretending it was magic.
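To show the shape of "a risk score plus the top reasons", here is a from-scratch sketch of gradient boosting on log-loss with one-split trees (stumps), using only the standard library. It is a teaching toy under stated assumptions, not the model from the session: a real system would use an established library, the three binary feature names are hypothetical, and the eight-row history is made up. Because every stump looks at a single feature, the score is a sum of per-feature contributions, which is one simple way to produce reasons.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the one-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - l_mean) ** 2 for r in left)
                   + sum((r - r_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, l_mean, r_mean)
    if best is None:
        raise ValueError("every feature is constant; nothing to split on")
    return best[1:]


def train(X, y, rounds=40, lr=0.5):
    """Gradient boosting on log-loss with stumps; returns (base score, stumps)."""
    base = sum(y) / len(y)
    f0 = math.log(base / (1 - base))  # start from the historical base rate
    scores = [f0] * len(X)
    stumps = []
    for _ in range(rounds):
        # Negative gradient of log-loss w.r.t. the score: label minus probability.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, l_mean, r_mean = fit_stump(X, residuals)
        l_val, r_val = lr * l_mean, lr * r_mean
        stumps.append((j, thr, l_val, r_val))
        scores = [s + (l_val if row[j] <= thr else r_val)
                  for row, s in zip(X, scores)]
    return f0, stumps


def score_with_reasons(model, x, names):
    """Return (risk, per-feature contributions sorted from most to least risky)."""
    f0, stumps = model
    contrib = dict.fromkeys(names, 0.0)
    for j, thr, l_val, r_val in stumps:
        contrib[names[j]] += l_val if x[j] <= thr else r_val
    risk = sigmoid(f0 + sum(contrib.values()))
    reasons = sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)
    return risk, reasons


# Hypothetical binary features and a made-up history: a change went badly
# whenever at least two of the three risk factors were present.
NAMES = ["touches_payments", "large_diff", "component_recently_failed"]
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [int(a + b + c >= 2) for a, b, c in X]

model = train(X, y)
risky, risky_reasons = score_with_reasons(model, [1, 1, 1], NAMES)
safe, _ = score_with_reasons(model, [0, 0, 0], NAMES)
print(f"risky change: {risky:.2f}", risky_reasons)
print(f"safe change:  {safe:.2f}")
```

The contributions are in log-odds units, so they can be ranked and turned into the kind of sentence a release manager reads. With deeper trees the score no longer splits cleanly by feature, and an attribution method would be needed instead.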

The bridge to today

The instinct in that room — mine the enterprise's own operational history, put a model in production, keep it learning, and insist it explain itself — aged well. Those are still the principles that separate real systems from demos. At StudioX we now operationalize them as autonomous AI workers and AI Missions: systems that observe the enterprise, act inside it, learn from the outcomes of their actions, and stay explainable to the people accountable for them. The ambition in 2018 was self-observing and self-optimizing portfolios; that is the work we carry forward, with better foundations underneath it.

