Dhalion: self-regulating stream processing in Heron (VLDB'17)

This paper appeared in VLDB'17, and is by Avrilia Floratou (Microsoft), Ashvin Agrawal (Microsoft), Bill Graham (Twitter), Sriram Rao (Microsoft), and Karthik Ramasamy (Streamlio).
 

Dhalion aims to further reduce the complexity of configuring and managing Heron streaming applications. It provides a self-tuning/self-correcting extension to Heron to monitor, diagnose, and correct problems with the deployment. (Dhalion is named after a mythical bird with interesting healing capabilities.)


It turns out there are only 4-5 category of things that can (is likely to) go wrong in a Heron deployment. (I had summarized Heron earlier here.) The detectors and diagnosers in Dhalion check for them. The resolver as a result tries one of these corresponding 4-5 correction actions. And the effectiveness of the correctors are then rated.

Yes, you heard it right, instead of guaranteed-to-work correctors, Dhalion rates the correctors' effectiveness and decides what to try next. "After every action is performed, Dhalion evaluates whether the action was able to resolve the problem or brought the system to a healthier state. If an action does not produce the expected outcome then it is blacklisted and it is not repeated again."

This is not a fool-proof method: what if the reason corrector is ineffective because another fault/condition started developing? Another problem I can see is interference among correctors causing problems. But I think Dhalion tries/evaluates correctors one by one, otherwise the interference issues obfuscates the evaluation of correctors.

So Dhalion takes a probabilistic approach to self-tuning and recovery. But maybe that should not be a cause for alarm.  The system is already faulty, why should the correctors be guaranteed to work? If the corrector framework can try/evaluate them quickly, and converge the system back, that is all that matters. (I am not sure if Dhalion can manage to do this fast enough for most scenarios.)

Comments

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom