International Journal of Mathematics and Statistics Invention (IJMSI)
E-ISSN: 2321 – 4767 P-ISSN: 2321 - 4759
www.ijmsi.org Volume 5 Issue 3 || March. 2017 || PP-31-36
Job Failure Analysis in Mainframes Production Support
Ranjani K V1, R. Roseline Mary2
1(Department of Computer Science, Christ University, Bangalore, India)
2(Department of Computer Science, Christ University, Bangalore, India)
Abstract: A major part of batch processing on mainframe computers consists of several thousand batch jobs that run every day. This network of jobs runs every day to update the day-to-day transactions. Failures are frequent and can cause long delays in the batch and degrade the performance and efficiency of the application. A permanent solution to frequently occurring job failures avoids delays in the batch and improves the performance and efficiency of the application. In this paper, we analyze the frequently occurring batch job failures recorded in the Known Error Database (KEDB) over the past year, based on different categories. The frequently failed jobs are categorized by application, failure type, job runs and resolution. Results are obtained in the Weka tool for each of these categories. From the results it can be concluded that the frequent failures occur in the MSD application. Further analysis of these frequently failed jobs showed that data and network issues cause most of the job failures, and that most of the failed jobs were daily processing jobs. The failures were resolved by restarting the job with overrides or by restarting the job from the top.
Keywords: Batch jobs, Failure analysis, Known Error Database (KEDB), Resolution, Weka
I. INTRODUCTION
Batch processing on mainframe computers consists of several thousand batch jobs that run every day. This network of jobs reflects the day-to-day business transactions, which are updated during the night; the interrelations between jobs require scheduling and prioritization so that all batch jobs run in the scheduled order within the Service Level Agreement (SLA). The job scheduler identifies the times at which a job runs on specific days, and the dependencies of each batch job can also be seen in the scheduler. Running scheduled jobs on specific days and at different times ensures that updates happen on business holidays as well, without any loss of data. The execution of some jobs depends on other jobs because the output data of the first job is used as the input data of the second job. This data dependency is also associated with various upstream systems and vendors. The scheduler also shows the status of each job, i.e. under execution, waiting for other jobs to complete, waiting for data or a file from an upstream system or vendor, or in error if the job has failed. A job run is delayed or postponed when a dependent job is in the failed state, which further delays the application batch and the downstream jobs that depend on the failed job. The failed job is resolved by referring to the Known Error Database (KEDB). The KEDB contains the details of previously failed jobs, including the job name, application, return code, error message, resolution and the person who resolved it. If the job details are not present in the KEDB, it is a first-time failure for that particular job and a record is added for future reference. Based on errors that previously occurred in production, job failures are categorized by input arrival times and by the type of failure that occurred. Based on the failures stored or updated in the KEDB, each failure is categorized into types such as technical issue, network issue, contention, space issue, data issue, cancellation, new job, developer mistake or incorrect scheduling. Based on these categorizations, the analysis examines how many times a job failed with the same error and the resolution applied to fix the error.
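The KEDB lookup described above can be pictured as a simple keyed record store. The sketch below is a minimal Java illustration; the class and field names (KedbEntry, findResolution and so on) are hypothetical, chosen only to mirror the fields listed above, and are not taken from any actual KEDB tooling.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical shape of one KEDB record, mirroring the fields listed above:
// job name, application, return code, error message, resolution and resolver.
record KedbEntry(String jobName, String application, String returnCode,
                 String errorMessage, String resolution, String resolvedBy) {}

class KedbLookup {
    // Look up a failed job by name and return code. An empty result means a
    // first-time failure, for which a new record would be added to the KEDB.
    static Optional<KedbEntry> findResolution(List<KedbEntry> kedb,
                                              String jobName, String returnCode) {
        return kedb.stream()
                   .filter(e -> e.jobName().equals(jobName)
                             && e.returnCode().equals(returnCode))
                   .findFirst();
    }
}
```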
To analyze the pattern of the job failures, the KEDB records of jobs that failed during the past year were obtained. From these records, the frequent job failures are analyzed along the following categories:
 The type of job that failed (e.g. processing, transmission, database)
 The failure type (e.g. data issue, network issue, database issue, deadlock)
 The action taken to fix the job failure at the time (e.g. the job was restarted from the top, restarted from the failed step, or marked as complete)
 The job name
Based on this categorization, the analysis aims to improve performance and avoid recurring failures of the system by implementing or suggesting a permanent fix for the specific job. The analysis is carried out with the Weka tool. The job failure dataset for the past year is loaded into Weka, and the respective results are tabulated and shown in this research paper.
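Since the analysis is performed in Weka, the sketch below shows one minimal way such a one-year failure extract could be summarized: it loads a dataset and prints the value distribution of every nominal attribute. The file name kedb_failures.arff and the attribute layout are assumptions made for illustration, not the actual dataset used in this study.

```java
import weka.core.Attribute;
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FailureSummary {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the one-year KEDB failure records.
        Instances data = DataSource.read("kedb_failures.arff");

        // For each nominal attribute (e.g. application, failure type, resolution,
        // job type, job runs), print how often each value occurs.
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute attr = data.attribute(i);
            if (!attr.isNominal()) continue;
            AttributeStats stats = data.attributeStats(i);
            System.out.println(attr.name() + ":");
            for (int v = 0; v < attr.numValues(); v++) {
                double pct = 100.0 * stats.nominalCounts[v] / data.numInstances();
                System.out.printf("  %-35s %5d  (%.2f%%)%n",
                                  attr.value(v), stats.nominalCounts[v], pct);
            }
        }
    }
}
```

Run against an export of the KEDB records, this kind of summary yields the per-category percentages reported in Section V.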
II. PROBLEM DESCRIPTION
Batch jobs on a mainframe often run every day to update day-to-day transactions, and failures are frequent. These failures can cause long delays in the batch as well as degrade the performance and efficiency of the application. To avoid delays in the batch and the degradation of performance and efficiency, a permanent solution should be applied to frequently occurring job failures. This is done by analyzing the pattern of the job failures and the resolution steps taken to fix them. Once the frequency pattern of the failures and the resolution steps are analyzed, the failures can be avoided in the future by fixing them permanently, using the historical failures recorded by the support team as the record of failed jobs.
III. LITERATURE REVIEW
The paper "Job Failure Analysis and Its Implications in a Large-scale Production Grid" [1] presents an analysis of job failures in a large-scale, data-intensive grid. Job failures in large-scale heterogeneous environments have a variety of possible causes, including system problems due to node, disk or network issues. Errors also occur at different levels as the software stack grows more and more complex. The failed jobs are studied across three periods of production, characterizing the inter-arrival times and life spans of failed jobs. Different failure types are distinguished and analyzed. Based on the failure patterns, historical failures are taken into account in decision making; cooperation and accountability issues are briefly addressed, and the effectiveness and feasibility of the approach are evaluated.
In a COBOL banking application, critical outages were traced to common causes. "Towards Assuring Non-recurrence of Faults Leading to Transaction Outages – An Experiment with Stable Business Applications" [2] observes that reducing the cost and effort of maintaining a legacy business application is a challenge, and that capturing faults at an early stage of the software life cycle is known to prevent defects in production. These common causes were analyzed using structural and COBOL control flow analysis, and all occurrences of the faults that could potentially lead to multiple failures in production were detected automatically.
"Towards a Training-Oriented Adaptive Decision Guidance and Support System" [3] argues that strategic approaches are needed to troubleshoot system failures, first identifying the component causing the failure and then solving the problem. The paper addresses the domain of DB2 administration on z/OS mainframe systems. The proposed framework dynamically extracts knowledge from various correlated data sources containing system-related data and from the problem-solving procedures of human experts. It applies text and data mining techniques for knowledge extraction, a rule-based system for knowledge representation and problem categorization, and a case-based system for providing decision support.
Based on the categorization of error codes for job failures, "Getting Back Up: Understanding How Enterprise Data Backups Fail" [4] monitors the jobs that run on each system and checks whether they complete successfully. Error characteristics are derived for production, development and test systems, including the number of unique error codes and the most frequent error codes. Error causes include misconfigurations, system errors and unusual informational messages. These characterizations are performed on historical data, and the analysis is used for decision support.
IV. METHODS AND MATERIALS
To analyze the pattern of the batch job failures in mainframe computers, the KEDB for the previously failed
jobs for the past one year is obtained. The most frequent job failures were recorded and analyzed with the weka
tool upon these categorization:
 Application
 Failure-type
 Resolution
 Job Type
 Job runs
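Purely as an illustration, the five categories above could be declared as nominal attributes of a Weka dataset as sketched below. The attribute names and value lists are assumptions drawn from the classifications described in this section, not the schema of the actual dataset.

```java
import java.util.ArrayList;
import java.util.List;
import weka.core.Attribute;
import weka.core.Instances;

public class KedbDataset {
    // One nominal attribute per category analyzed in this paper. The value
    // lists follow the classifications described below; the two application
    // names are those discussed in the results (MSD and IMS).
    static Instances emptyDataset() {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("application", List.of("MSD", "IMS")));
        attrs.add(new Attribute("failure_type",
                List.of("data", "network", "delay", "deadlock")));
        attrs.add(new Attribute("resolution",
                List.of("restart_from_top", "restart_failed_step",
                        "mark_complete", "restart_with_overrides")));
        attrs.add(new Attribute("job_type",
                List.of("processing", "ftp", "ndm", "database")));
        attrs.add(new Attribute("job_runs", List.of("daily", "weekly", "monthly")));
        return new Instances("kedb_failures", attrs, 0);
    }
}
```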
1. Application: The failed jobs are categorized by application, based on the different servers on which the jobs run.
2. Failure-Type: The job failures are classified as Data, Network, Delay or Deadlock, as described below (a keyword-based categorization sketch follows item d).
a) Data: A failure is classified as a data issue if there is any discrepancy in the file received. Examples of data issues include the following:
 An incorrect file received from the upstream system or the vendor.
 Junk values in the file which have caused data inconsistency.
 The format of the file is not as expected; for example, the data or the values in the file are not in the expected format.
 Missing values in the file, or the file is empty.
b) Network: Batch processing interacts with other systems mainly through a network, transmitting or receiving files from one server to another or through DB2 tables. A failure is classified as a network issue when there is a problem in this interaction. Examples of network issues include the following:
 The server is unavailable while extracting or uploading a file during job execution.
 A resource is unavailable, for example because the file or table is being used by other jobs.
 The system or the application is down while the user is trying to access the data.
c) Delay: When there is a feed delay from the upstream system or the vendor, the batch jobs go into a failed status. The delay can happen for the following reasons:
 The file is very large (that is, it contains more records than expected).
 A scheduled release activity is in progress, either in our application or in the upstream system.
 The jobs go into contention, waiting for files which are being used by other jobs.
d) Deadlock: When the batch executes concurrently, a deadlock can happen when one job tries to access a file or database and waits for another job to release its lock on that file or database.
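As a rough illustration, the mapping from a free-text KEDB error message to one of the four failure types above could be approximated with simple keyword rules, as in the sketch below. The keywords are assumptions made for illustration, not the rules actually applied by the support team.

```java
// Illustrative keyword-based bucketing of a KEDB error message into the four
// failure types described above. Messages that match no rule fall back to
// manual triage.
class FailureTypeRules {
    static String classify(String errorMessage) {
        String msg = errorMessage.toLowerCase();
        if (msg.contains("deadlock")) return "deadlock";
        if (msg.contains("feed delayed") || msg.contains("file not received")) return "delay";
        if (msg.contains("connection") || msg.contains("server unavailable")
                || msg.contains("resource in use")) return "network";
        if (msg.contains("empty file") || msg.contains("invalid record")
                || msg.contains("format")) return "data";
        return "unclassified";
    }
}
```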
3. Job-Type: The job type specifies the kind of job that failed while the batch was running. The failed jobs recorded in the past year are categorized into the following:
a) Processing: The mainframe batch jobs that run during the night update the data or the transactions that occurred during the day. Processing jobs have failed for the following reasons:
 Insufficient storage while updating the day-to-day transactions into a dataset or DB2 tables.
 Null values, when an empty file or dataset is received or data is missing in a specific column.
 Incorrect data formats, when the data received is not in the usual format, for example wrong date formats or characters in place of numeric values.
 Load balancing, when jobs are automatically submitted on CPUs on which the specific application's jobs do not have access to run.
b) FTP Transmission: File Transfer Protocol (FTP) transmission is the process of transmitting files between servers.
c) NDM: Network Data Mover (NDM) transmission is the process of transmitting files between servers over a direct connection, by installing the server details at both ends before any data is transmitted. This type of transmission is faster and more secure compared with FTP transmission.
The FTP and NDM transmission jobs have failed for the following reasons:
 Connectivity/access issues, when the server details are not installed at both ends, so the jobs fail with access errors while accessing the servers.
 File unavailability, when the file is not yet available because file generation at the upstream system is still in progress.
d) Database: The database jobs append the latest data into the database and back those databases up into datasets for future reference. These jobs have failed for the following reasons:
 Resource/dataset unavailability, when the specific database or table is in use by other batch jobs.
 Space issues while updating the day-to-day transactions into a dataset or DB2 tables.
4. Job-Runs: The job runs are categorized as Daily, Weekly or Monthly, based on the pattern in which the job runs.
a) Daily: A job is categorized as daily when it runs every day, for example Monday to Friday or Tuesday to Saturday.
b) Weekly: A job is categorized as weekly when it runs on one day of the week.
c) Monthly: A job is categorized as monthly when it runs once a month.
5. Resolution: The resolution steps taken when a job failed are categorized as follows (an illustrative decision sketch follows item d):
a) Restarted-the-job-from-top: The failed jobs are restarted from the top when they failed before executing, either due to an access issue or a network connectivity issue.
b) Restarted-the-job-from-failed-step: The failed jobs are restarted from the failed step when the processing of the job was interrupted due to resource unavailability, a network issue, an unexpected return code from the job, etc.
c) Marked-the-job-as-complete: The failed jobs are marked as complete when all the processing has completed and the job has returned an acceptable return code.
d) Restarted-the-job-with-the-overrides: The failed jobs are restarted with overrides when the job failed due to syntax errors, space issues, a file not being available because of a delay, or any other reason where the job needs manual modification to complete successfully.
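The choice among these four resolutions can be summarized, again purely as an illustration, by the small decision sketch below. The predicate names are hypothetical; in practice the decision is made by the support analyst with the help of the KEDB.

```java
// Illustrative mapping from the failure situation to the restart action,
// following the criteria a)-d) above.
enum Resolution {
    RESTART_FROM_TOP, RESTART_FROM_FAILED_STEP, MARK_COMPLETE, RESTART_WITH_OVERRIDES
}

class ResolutionRules {
    static Resolution decide(boolean acceptableReturnCode,
                             boolean failedBeforeExecuting,
                             boolean needsManualOverride) {
        if (acceptableReturnCode)  return Resolution.MARK_COMPLETE;          // processing already finished
        if (failedBeforeExecuting) return Resolution.RESTART_FROM_TOP;       // access or connectivity issue
        if (needsManualOverride)   return Resolution.RESTART_WITH_OVERRIDES; // syntax, space or missing file
        return Resolution.RESTART_FROM_FAILED_STEP;                          // interrupted mid-processing
    }
}
```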
V. RESULTS AND DISCUSSION
The results obtained with the Weka tool for the above categorizations are shown below, with the respective observations.
1. Application: Fig. 1 shows the classification by application. The MSD application had more job failures than the IMS application over the past year: the jobs failed in the MSD application 78.61% of the time and in the IMS application 21.38% of the time.
Fig. 1: Application classification
2. Failure-Type: Fig. 2 shows the failure-type classification. The jobs failed due to data issues 47.48% of the time, due to network issues 43.08% of the time, due to delays 5.74% of the time and due to deadlocks 3.45% of the time.
Fig. 2: Failure-type classification
3. Resolution: Fig. 3 shows the resolution classification. The failed jobs were resolved by restarting the job from the top 30.81% of the time, by restarting the job from the failed step 32.38% of the time, by marking the job as complete 13.20% of the time and by restarting the job with overrides 23.58% of the time.
Fig. 3: Resolution classification
4. Job-Type: Fig. 4 shows the job-type classification. Processing jobs account for 59.74% of the failures, FTP transmission jobs for 29.87%, database jobs for 5.34% and NDM transmission jobs for 5.03%.
Fig. 4: Job-type classification
5. Job-Runs: Fig. 5 shows the job-run classification. Of the job failures recorded in the past year, 93.08% were daily jobs, 3.45% were weekly jobs and 3.45% were monthly jobs.
Fig. 5: Job-runs classification
VI. CONCLUSION AND FUTURE ENHANCEMENT
The patterns of the job failures recorded in the KEDB for the past year were analyzed across the different categories. The results obtained in the Weka tool for the most frequent job failures in each category are tabulated below:
Categories     Classifications                        Job Failure in %
Application    MSD                                    78.61
Failure-Type   Data Issue                             47.48
Failure-Type   Network Issue                          43.08
Resolution     Restarting the job from the top        30.81
Resolution     Restarting the job from failed step    32.38
Job-Type       Processing                             59.74
Job-Runs       Daily                                  93.08
The results above show the job failure behavior across the various classifications and the percentage of job failures with the same error. They help us understand the major failures that occur during night batch processing, the types of failure that occur and the resolutions performed to fix them.
As a further enhancement, the results could be analyzed with respect to specific jobs. Analyzing a specific job's failures with the same error can provide more accurate results for reducing the frequent job failures. These recurring failures can be permanently fixed by implementing changes in the source code so as to avoid future job failures for the same error. This would help increase the performance, efficiency and stability of the system.
REFERENCES
[1]. H. Li, D. Groep, L. Wolters and J. Templon, "Job Failure Analysis and Its Implications in a Large-scale Production Grid", Proceedings of the Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006.
[2]. A. Agrawal and R. Naik, "Towards Assuring Non-recurrence of Faults Leading to Transaction Outages – An Experiment with Stable Business Applications", ISEC'11, Thiruvananthapuram, Kerala, India, February 23–27, 2011.
[3]. F. Zulkernine, P. Martin, S. Soltani, W. Powley, S. Mankovskii and M. Addleman, "Towards a Training-Oriented Adaptive Decision Guidance and Support System", 2010.
[4]. G. Amvrosiadis and M. Bhadkamkar, "Getting Back Up: Understanding How Enterprise Data Backups Fail", 2009.
[5]. S. Kavulya, J. Tan, R. Gandhi and P. Narasimhan, "An Analysis of Traces from a Production MapReduce Cluster", 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010), Melbourne, Australia, May 17–20, 2010.
[6]. "Differences Between NDM and FTP", Difference Between, 2017. [Online]. Available: http://www.differencebetween.net/technology/differences-between-ndm-and-ftp/.
[7]. "7 Benefits of Using a Known Error Database (KEDB)", The ITSM Review, 2017. [Online]. Available: http://www.theitsmreview.com/2012/04/7-benefits-of-using-a-kedb/.