Elasticsearch Catalogue

The Catalogue section explains Incident Types and how Hayei uses them to create Tickets.

Before explaining what an Incident Type is, we need to understand how the monitored system works. Let us assume we are monitoring an OpenStack cluster (OSc). An OSc might have multiple servers (a.k.a. hosts) acting as controller, compute, and storage nodes. Each host runs programs (a.k.a. loggers) that generate a considerable amount of logs. An ELK instance collects all the logs. When Hayei gets logs from ELK, it analyzes a fraction of them at a time, usually a time window of 10 to 60 seconds.

Once Hayei gets the logs, it aggregates them by host. For example, every host might have multiple logs from one or more loggers. A log is a piece of information about the state of the application. It can be a single line or a multiline traceback. In either case, Hayei creates a Log Archetype that helps group similar logs under the same category.
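The grouping described above can be sketched as follows. This is only an illustration and not Hayei's actual algorithm: the masking rules, the placeholder names, and the log fields (host, logger, message) are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical masking rules: replace values that vary between
# otherwise identical logs so that similar logs share one template.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_archetype(message):
    """Reduce a log message (one line or a traceback) to a template."""
    for pattern, placeholder in _MASKS:
        message = pattern.sub(placeholder, message)
    return message


def archetypes_by_host(logs):
    """Group archetypes as host -> logger -> set of archetypes."""
    grouped = defaultdict(lambda: defaultdict(set))
    for log in logs:
        grouped[log["host"]][log["logger"]].add(to_archetype(log["message"]))
    return grouped
```

With these assumed rules, two logs from the same logger that differ only in an instance number or an IP address fall under a single archetype.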

Within a fraction of logs, Hayei identifies all the Log Archetypes and creates an Incident Type from them. The Incident Type becomes a compressed way to see all the logs in that fraction. That fraction of logs is called a Ticket. The following section provides more information about Tickets.
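As a minimal sketch of this idea, an Incident Type can be modeled as the duplicate-free set of archetypes found in one time window. The function name and the log fields (timestamp, archetype) are hypothetical; Hayei's internal representation may differ.

```python
from datetime import datetime, timedelta


def incident_type_for_window(logs, start, seconds=60):
    """Return the set of archetypes seen in [start, start + seconds)."""
    end = start + timedelta(seconds=seconds)
    return frozenset(
        log["archetype"] for log in logs if start <= log["timestamp"] < end
    )
```

A frozenset is used here so that two windows containing the same archetypes compare equal, regardless of how many times each log appeared.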

The Catalogue lists all the Incident Types discovered in the Data Source.

An Incident Type shows only the set (without duplicates) of Log Archetypes seen in a time window of logs. In the same view, Hayei lets the user label the Incident Type, set its severity, and check its history of changes.

To label an Incident Type, click the current tag (the value is “Unknown error”).

The label must be different for each Incident Type to keep them unique. However, if one Incident Type is related to another, Hayei allows merging the two by assigning them the same label. As a result, one Incident Type disappears, and the other becomes the union of both sets of Log Archetypes.
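The merge behavior can be illustrated with a small sketch. It assumes the Catalogue is a dictionary from label to a set of Log Archetypes, which is a simplification made for this example rather than Hayei's real data model.

```python
def assign_label(catalogue, old_label, new_label):
    """Relabel an incident type; merge by union if new_label already exists."""
    archetypes = catalogue.pop(old_label)
    # If new_label is already taken, the two incident types collapse into one
    # whose archetypes are the union of both sets.
    catalogue[new_label] = catalogue.get(new_label, set()) | archetypes
    return catalogue
```

Assigning an unused label simply renames the entry, while assigning an existing label leaves a single entry holding the archetypes of both.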

Hayei also shows the Incident Type history, which allows the user to track changes over time.

Hayei also presents the severity of the Incident Type. Five levels indicate how much attention a Ticket demands.

For each Incident Type, Hayei shows possible solutions.

In Hayei, a user can propose a solution.

In Hayei, users can vote (thumbs up or down) for a solution or even edit the current one.

Hayei uses the Log Archetypes to search for solutions on the web (StackExchange, Launchpad, Google). In some cases, Hayei extracts pieces of code that might help solve the problem.

Hayei also provides a list of URLs that might discuss ways to overcome the current incident.

Hayei can trigger actions to solve a problem automatically when an Incident Type reappears over time. The purpose of actions is to improve the user experience and to avoid opening support tickets for an Incident Type that has already appeared in other environments.

Actions fall into two categories: API and SSH. API actions are calls that Hayei makes to fix a problem in the environment from which the Ticket was created. SSH actions are commands that run directly on a server. For this feature, Hayei needs specific permissions to make changes.

For now, Hayei supports only API calls via cURL.
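To make the idea of a cURL-based API action concrete, the sketch below assembles a curl command line as an argument list. The helper name, the URL, the token, and the payload are all hypothetical placeholders, and this is not how Hayei stores or runs its actions internally.

```python
import json
import shlex


def build_curl_action(url, token, payload, timeout=30):
    """Build a curl argument list for a JSON POST (nothing is executed here)."""
    return [
        "curl", "-sS", "--max-time", str(timeout),
        "-X", "POST",
        "-H", "Content-Type: application/json",
        "-H", f"X-Auth-Token: {token}",
        "-d", json.dumps(payload),
        url,
    ]


# Hypothetical example: ask a compute API to soft-reboot a server.
cmd = build_curl_action(
    "https://openstack.example.com/compute/v2.1/servers/SERVER_ID/action",
    "EXAMPLE_TOKEN",
    {"reboot": {"type": "SOFT"}},
)
print(shlex.join(cmd))  # shell-quoted form, convenient for review before running
```

Building the command as a list rather than a single string avoids shell-quoting problems if the action is later run with subprocess.run(cmd, capture_output=True, text=True).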

All the API actions are visible. 

Hayei allows testing each API action in the Manage section.

For SSH command execution, the logic is similar to the API calls.

For a quick view, Hayei also displays all SSH actions.

For traceability, Hayei keeps track of the execution history.

The last section is Alerts. Here a user can see and add alerts by Incident Type or Logger.

The process of setting a new alert is similar to the one explained earlier in the Elasticsearch Alerts section. The main difference is that a user can add an alert that fires when a Ticket is opened for a known Incident Type.

It is also possible to create an alert that fires when Hayei detects a given logger combination in a new Ticket. This is useful, for example, when a single issue might trigger logs from two different loggers (a.k.a. applications).
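A logger-combination condition can be sketched as a subset check. The function name and the logger field are assumptions for illustration only, and the logger names in the usage example are made up.

```python
def matches_logger_combination(ticket_logs, required_loggers):
    """Return True if every required logger appears in the ticket's logs."""
    seen = {log["logger"] for log in ticket_logs}
    return set(required_loggers) <= seen


ticket = [
    {"logger": "nova-compute", "message": "Instance failed to spawn"},
    {"logger": "neutron-server", "message": "Port binding failed"},
]
print(matches_logger_combination(ticket, ["nova-compute", "neutron-server"]))
```

The check ignores how many logs each logger produced, so the alert depends only on which loggers are present in the Ticket.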

Hayei keeps track of the alert in the alert history section.

Hayei also provides a view with details about the new Tickets that meet the conditions defined in the alert.