Incident Management
Written by Administrator   
Saturday, 05 December 2009 17:09

An IT manager may find himself or herself responsible for a maintenance service.

During this activity he or she may have to handle emergency situations, caused for example by wrong production releases, problems in the test environment, problems in the development environment or, in the worst cases, production problems.

Here I present a series of actions that may be useful when we need to manage such "nasty" situations.

As a first step: try to calm yourself and the people involved

The worst thing in such cases is to act on our emotions: there might be an angry customer, and we will be tempted to do anything to put things right. In these moments it is important to calm down and not act on impulse, to avoid making the situation even worse. Sometimes the best thing is to leave the communication with the customer to a senior manager or to someone less involved in the problem. If someone working for you is managing the situation, try to calm him or her down, and consider handling the communication with the client yourself.

Evaluate the problem in relation to the project and, if possible, classify it
The person who reported the problem may not have evaluated it correctly, or may have an exaggerated view of its impact. We should try to obtain as much information as possible and understand the impact of what happened. We need to set aside all the assumptions that people around us are making and go back to the facts. At this point it will be possible to define the priority of the problem (sometimes this is done automatically) and the causes behind it.
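
When the priority is assigned automatically, a common approach is to derive it from the impact and the urgency of the problem. The following is only a minimal sketch of that idea: the three-level scale, the P1..P4 labels and the function name classify_priority are assumptions made for illustration, not something prescribed by this article or by any specific tool.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify_priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to an illustrative priority label (P1 is highest)."""
    score = int(impact) + int(urgency)  # ranges from 2 to 6
    if score == 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


if __name__ == "__main__":
    # A production outage with a pressing deadline versus a cosmetic defect.
    print(classify_priority(Level.HIGH, Level.HIGH))
    print(classify_priority(Level.LOW, Level.MEDIUM))
```

The real thresholds should come from the service agreement with the client; the point is only that the classification is based on facts rather than on the mood of whoever reported the problem.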

If needed (and possible), apply a workaround to bring the situation back to a stable condition.
Apply this option only when a high-priority service needs to be brought back online. If possible, we will apply or suggest a workaround that restores the functionality, even if not at 100%. Careful: sometimes a workaround can cause even more damage than the original problem, so before applying it you should always discuss it with the service manager. As an example, if the problem was caused by a wrong software change, the normal change management tool can usually bring the software (but sometimes not the data!) back to its previous state.

Involve all the people who can help or who are affected by the problem.
If possible, bring everybody into the same room (or onto a conference call) for a meeting.
You should find the people impacted by the problem and the experts who may help in the situation. Depending on the urgency, you may decide to hold the meeting immediately or within a certain time span (usually one day, at most one week). Avoid gathering crowded groups (if needed, invite representatives), as the problem-solving power could decrease.
Once in the meeting, we need to reconsider what happened and review our hypotheses on the causes and on the possible solutions. If we cannot easily reach an agreement, we should use all the techniques we know to reach consensus, remembering that we usually have strict time constraints.
The following are good questions to get to the point:

  • What was the last moment without the problem? In better terms what was the last safe consistency point? Are we able to find a clean "image" of our application? Is it a hardware or software problem?
  • Can we isolate the problem?
  • Can we simply apply more resources to solve the problem? Do we need "brute force" actions?
  • Do we need to restore databases? Can we do it automatically?
  • Are there disaster recovery procedures? Can we apply them?

A preliminary brainstorming session may be needed, but it is really important that the people in the meeting are clear about the time constraints they are working under.

When you decide which actions to pursue, your decision will be one (or more, if there are several steps) of the following:

  • Do nothing. This is not the preferred option, but it is always an option to consider, at least in the short term.
  • If the problem was caused by a change in the software and there was no impact on data, then we should return to the situation before the software change and reproduce the error in the test environment.
  • Perform a manual workaround. If the situation was caused by a specific defect that will not happen again (for example after a data migration), this may be an economical option.
  • Involve the other people responsible and decide on common, concurrent modifications. Typically this happens when there are problems within interfaces or, in complex environments, when one application affects a second application.
  • Plan a solution and a schedule for the software modification.
  • If the extent of the problem is larger, in the sense that we are facing damage to many parts of the IT system (or even the whole IT system), it will be necessary to activate the normal disaster recovery procedures. In such cases you will have to follow all the steps defined in those procedures.
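
The options above can be read as a small set of rules applied to the facts collected in the meeting. The sketch below is one possible way to write them down; the IncidentFacts fields and the candidate_actions function are hypothetical names chosen for illustration, and the output is a list of options to discuss, not an automatic decision.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Facts established in the meeting (field names are illustrative)."""
    caused_by_software_change: bool = False
    data_impacted: bool = False
    one_off_defect: bool = False
    crosses_interfaces: bool = False
    widespread_damage: bool = False


def candidate_actions(facts: IncidentFacts) -> list[str]:
    """Return the options worth discussing, following the list above."""
    if facts.widespread_damage:
        # Many parts of the IT system are damaged: follow the DR procedures.
        return ["activate disaster recovery procedures"]

    actions = []
    if facts.caused_by_software_change and not facts.data_impacted:
        actions.append("roll back the change and reproduce the error in test")
    if facts.one_off_defect:
        actions.append("perform a manual workaround")
    if facts.crosses_interfaces:
        actions.append("involve the owners of the other applications")
    # A planned fix and doing nothing (for now) are always on the table.
    actions.append("plan and schedule a software modification")
    actions.append("do nothing for now")
    return actions


if __name__ == "__main__":
    facts = IncidentFacts(caused_by_software_change=True, crosses_interfaces=True)
    for action in candidate_actions(facts):
        print("-", action)
```

Note that the rollback option is offered only when the data was not impacted, which mirrors the earlier warning that a change management tool can usually restore the software but not always the data.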

In the meantime, we always need to remember to keep our client, and possibly a user representative, informed and involved in critical decisions.

Once we have agreed on the actions:

  • Define a simple and controllable action plan, so that we will be able to update the client periodically. The higher the urgency, the simpler the action plan (if necessary, we will replan everything once out of the "danger zone"). Once we have the plan, remember to get approval from the person responsible for the service.
  • Define who is responsible for each action and then agree on their milestones. Check that the plan has been well understood. There shouldn't be any assumptions about what to do.
  • Execute the plan.
  • Check that everything is done as agreed. Based on the priority of the problem, define specific checkpoints (every hour, every day, within a given number of days).
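
As an illustration of priority-based checkpoints, the sketch below generates the check times between the start of the plan and its deadline. The priority labels, the intervals in CHECK_INTERVALS and the function name checkpoints are assumptions for this example; the actual cadence should be agreed with the person responsible for the service.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from priority to checkpoint interval.
CHECK_INTERVALS = {
    "high": timedelta(hours=1),
    "medium": timedelta(days=1),
    "low": timedelta(days=3),
}


def checkpoints(start: datetime, deadline: datetime, priority: str) -> list:
    """Return the checkpoint times after start, up to and including deadline."""
    step = CHECK_INTERVALS[priority]
    times = []
    current = start + step
    while current <= deadline:
        times.append(current)
        current += step
    return times


if __name__ == "__main__":
    start = datetime(2009, 12, 5, 9, 0)
    # A high-priority problem that must be solved within three hours.
    for moment in checkpoints(start, start + timedelta(hours=3), "high"):
        print(moment.isoformat())
```

Each checkpoint is a natural moment to update the client and to verify that the agreed actions are actually progressing.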

When the emergency is over and the situation is back to normal operation, use what happened as the starting point for process improvement. You should understand what went wrong in our processes to cause the emergency. Was it possible to avoid the problem? If yes, how? Can we change our working processes to avoid the same situation in the future?

Even emergency processes should go through the same analysis, so that we build the habit of continuous improvement, which is typical of the best organizations.


Last Updated on Sunday, 07 February 2010 14:08
 