1
Cosequential Processing and the
Sorting of Large Files
2
Definition
• Cosequential operations involve the coordinated
processing of two or more sequential lists to
produce a single output list.
• This is useful for merging (or taking the union) of
the items on the two lists and for matching (or
taking the intersection) of the two lists.
• These kinds of operations are extremely useful in
file processing.
3
Overview
• Part 1:
– Development of a general model for doing
cosequential operations.
– Illustration of this model’s use for simple
matching and merging operations.
– Application of this model to a more complex
general ledger program
• Part 2:
– Multi-Way Merging
– External Sort-Merge
4
A Model for Implementing
Cosequential Processes: Matching I
Matching Names in Two Lists
List 1:
• Adams
• Carter
• Chin
• Davis
• Foster
• Garwick
• James
• Johnson
• Karns
• Lambert
• Miller
List 2:
• Adams
• Anderson
• Andrews
• Bech
• Burns
• Carter
• Davis
• Dempsey
• Gray
• James
• Johnson
• Katz
• Peters
5
A Model for Implementing
Cosequential Processes: Matching II
Matching names in two lists: Matters to Consider:
• Initializing: we need to arrange things so that the
procedure gets going properly.
• Getting and accessing the next list item: we need
simple methods to do so.
• Synchronizing: we have to make sure that the current
item from one list is never so far ahead of the current
item on the other that a match will be missed.
• Handling end-of-file conditions
• Recognizing Errors
• Matching the names efficiently ==> requires good synchronization
6
A Model for Implementing
Cosequential Processes: Matching III
Synchronization
• Let Item(1) be the current item from list 1 and
Item(2) be the current item from list 2.
• Rules:
– If Item(1) < Item(2), get the next item from list 1.
– If Item(1) > Item(2), get the next item from list 2.
– If Item(1) = Item(2), output the item and get the
next items from the two lists.
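The three synchronization rules can be sketched in Python as follows. This is a minimal illustration, assuming both lists are already sorted in ascending order and contain no repeated names; the function name `match` is chosen here for illustration only.

```python
def match(list1, list2):
    """Return the items that appear in both sorted lists (their intersection)."""
    out = []
    i = j = 0
    # Once either list is exhausted, no further match is possible.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            i += 1                  # Item(1) < Item(2): advance list 1
        elif list1[i] > list2[j]:
            j += 1                  # Item(1) > Item(2): advance list 2
        else:
            out.append(list1[i])    # Item(1) = Item(2): output, advance both
            i += 1
            j += 1
    return out


list1 = ["Adams", "Carter", "Chin", "Davis", "Foster", "Garwick",
         "James", "Johnson", "Karns", "Lambert", "Miller"]
list2 = ["Adams", "Anderson", "Andrews", "Bech", "Burns", "Carter", "Davis",
         "Dempsey", "Gray", "James", "Johnson", "Katz", "Peters"]
print(match(list1, list2))
# ['Adams', 'Carter', 'Davis', 'James', 'Johnson']
```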
7
A Model for Implementing
Cosequential Processes: Merging I
• The matching procedure can easily be
modified to handle merging of two lists.
• An important difference between matching
and merging is that with merging, we must
read completely through each of the lists.
• We have to recognize, however, when one
of the two lists has been completely read
and avoid reading again from it.
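A possible sketch of the merge in Python, assuming sorted input lists with no repeats inside a single list. Unlike matching, every item is output, and when one list ends the remainder of the other is copied without reading from the finished list again. The name `merge` is illustrative.

```python
def merge(list1, list2):
    """Return the union of two sorted lists, in sorted order."""
    out = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            out.append(list1[i])
            i += 1
        elif list1[i] > list2[j]:
            out.append(list2[j])
            j += 1
        else:
            # Same item on both lists: output it once, advance both.
            out.append(list1[i])
            i += 1
            j += 1
    # One list is exhausted; copy what is left of the other.
    out.extend(list1[i:])
    out.extend(list2[j:])
    return out


print(merge(["Adams", "Carter", "Chin"], ["Adams", "Anderson", "Davis"]))
# ['Adams', 'Anderson', 'Carter', 'Chin', 'Davis']
```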
8
Application of the Cosequential Model
to a General Ledger Program I
• The problem: To design a general ledger posting
program as part of an accounting system.
• The system contains:
– A journal file with the monthly transactions that
are ultimately to be posted to the ledger file.
– A ledger file containing month-by-month
summaries of the values associated with each of the
bookkeeping accounts.
• Posting involves associating each transaction with its
account in the ledger.
9
Application of the Cosequential Model
to a General Ledger Program II
• How is the posting process implemented?
• Solution 1: Build an index for the ledger organized
by account number. ==> 2 problems: 1) lots of
seeking back and forth; 2) the journal entries relating
to one account are not collected together.
• Solution 2: collect all the journal transactions that
relate to a given account by sorting the journal
transactions by account number and working through
the ledger and the sorted journal cosequentially.
10
Application of the Cosequential Model
to a General Ledger Program III
• Goal of our program: To produce a printed version
of the ledger that not only shows the beginning and
current balance for each account but also lists all the
journal transactions for the month.
• From the point of view of the ledger accounts, the
posting process is a merge (even unmatched ledger
accounts appear in the output). From the point of view
of the journal accounts, the posting process is a match.
• Our program must implement a combined merge/match
while simultaneously printing account title lines,
individual transactions and summary balances.
11
Application of the Cosequential Model
to a General Ledger Program IV
• Summary of the steps involved in processing the ledger
entries:
– Immediately after reading a new ledger object, print
the header line and initialize the balance for the next
month from the previous month’s balance.
– For each transaction object that matches, update the
account balance.
– After the last transaction for the account, print the
balance line.
12
Application of the Cosequential Model
to a General Ledger Program V
The posting process has three cases:
• If the ledger account number is less than the journal
transaction account number, then print the ledger account
balance and then read in the next ledger account and print
its title line if the account exists.
• If the account numbers match, then add the transaction
amount to the account balance, print the description of the
transaction, and read the next journal entry.
• If the journal account is less than the ledger account, then it
is an unmatched journal account. Print an error message and
continue with the next transaction.
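The three cases might be coded along these lines. This is only a sketch under stated assumptions: the record layouts (tuples of account number, title, balance for the ledger; account number, description, amount for the journal), the output line formats, and the name `post` are all invented for illustration, and an infinite account number stands in for end of file.

```python
import math


def post(ledger, journal):
    """ledger: sorted (account, title, balance) tuples.
    journal: sorted (account, description, amount) tuples.
    Returns the report as a list of text lines."""
    lines = []
    i = j = 0
    balance = 0
    if ledger:
        lines.append(f"{ledger[0][0]} {ledger[0][1]}")   # first title line
        balance = ledger[0][2]
    while i < len(ledger) or j < len(journal):
        # An exhausted file behaves as if its next account number were infinite.
        l_acct = ledger[i][0] if i < len(ledger) else math.inf
        j_acct = journal[j][0] if j < len(journal) else math.inf
        if l_acct < j_acct:
            # Case 1: no more transactions for this ledger account.
            lines.append(f"  balance: {balance}")
            i += 1
            if i < len(ledger):
                lines.append(f"{ledger[i][0]} {ledger[i][1]}")
                balance = ledger[i][2]
        elif l_acct == j_acct:
            # Case 2: the transaction belongs to the current account.
            _, desc, amount = journal[j]
            balance += amount
            lines.append(f"  {desc}: {amount}")
            j += 1
        else:
            # Case 3: journal entry with no ledger account.
            lines.append(f"ERROR: no ledger account {j_acct}")
            j += 1
    return lines


ledger = [(101, "Checking", 1000), (102, "Savings", 500), (505, "Rent", 0)]
journal = [(101, "Check 1271", -50), (101, "Deposit", 200),
           (300, "Unknown", 10), (505, "Rent paid", -400)]
for line in post(ledger, journal):
    print(line)
```

Note that unmatched ledger accounts (102 here) still appear in the report, while the unmatched journal entry (300) only produces an error line.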
13
A K-Way Merge Algorithm
Let there be two arrays:
• An array of k lists and
• An array of k index values corresponding to the current
element in each of the k lists, respectively.
Main loop of the K-Way Merge algorithm:
• Find the index of the minimum current item, minItem
• Process minItem (output it to the output list)
• For i = 0 to k-1 (in increments of 1)
– If the current item of list i is equal to minItem then
advance list i.
• Go back to the first step.
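A minimal Python sketch of this main loop, assuming each of the k lists is sorted and has no repeated items of its own. The minimum is found by a linear scan, which is the cost that motivates selection trees for large k. The name `k_way_merge` is illustrative.

```python
def k_way_merge(lists):
    """Merge k sorted lists; an item present in several lists is output once."""
    k = len(lists)
    index = [0] * k          # index[i] is the current position in lists[i]
    out = []
    while True:
        # Find the list holding the minimum current item (linear scan).
        min_list = -1
        for i in range(k):
            if index[i] < len(lists[i]):
                if min_list == -1 or lists[i][index[i]] < lists[min_list][index[min_list]]:
                    min_list = i
        if min_list == -1:   # every list is exhausted
            break
        min_item = lists[min_list][index[min_list]]
        out.append(min_item)
        # Advance every list whose current item equals min_item.
        for i in range(k):
            if index[i] < len(lists[i]) and lists[i][index[i]] == min_item:
                index[i] += 1
    return out


print(k_way_merge([[1, 4, 9], [2, 4, 6], [4, 5]]))
# [1, 2, 4, 5, 6, 9]
```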
14
A Selection Tree for Merging
a Large Number of Lists
• The K-Way Merging Algorithm just described works
well if k < 8. Otherwise, the number of comparisons
needed to find the minimum value each step of the way
is very large.
• Instead, it is better to use a selection tree, which lets
us determine the minimum key value more quickly.
• The cost of merging k lists with this method grows with log2(k)
(the depth of the selection tree) rather than with k.
• Updating selection trees is not easy ==> Keep a tree of
losers (Knuth, 73).
15
Keeping Trees of Losers rather
than Trees of Winners I
• Advantages of the Tree of Losers:
• When using a tree of winners, the records with which the
winner has to be compared--so as to find the next winner--
are located in different subtrees. Updating such a tree is
not very convenient.
• When using a tree of losers,
– The value of each leaf (apart from the smallest, the
winner) occurs only once in an internal node.
– All the records with which the winner has to be
compared lie on a path from the winner leaf to the root.
– As long as each node in the tree has a pointer to its
parent, then it is very easy to find the next winner.
16
Keeping Trees of Losers rather
than Trees of Winners II
• Algorithm for updating a selection tree of losers:
• T is a pointer to an internal node in the tree of losers
• topoftree is a flag indicating if updating has reached the
root
T <-- parent of Buffer[s]
topoftree <-- false
repeat
    if key(Buffer[loser(T)]) < key(Buffer[s])
        then interchange loser(T) and s
    if T = root
        then topoftree <-- true
        else T <-- parent of node pointed to by T
until topoftree
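The update step can be tried out with the following array-based Python sketch. It is an illustration under assumptions rather than the exact structure described above: keys are assumed to be numeric so that infinite sentinels can be used, parent links are computed from array positions instead of stored as pointers, and the names `loser_tree_merge` and `replay` are invented here. Node 0 holds the current winner and nodes 1..k-1 hold losers.

```python
import math


def loser_tree_merge(runs):
    """Merge k sorted runs of numbers using a tree of losers."""
    k = len(runs)
    if k == 0:
        return []
    pos = [0] * k
    # keys[i] is the current key of run i (infinity once the run is exhausted).
    # keys[k] is a minus-infinity sentinel used only while building the tree.
    keys = [run[0] if run else math.inf for run in runs] + [-math.inf]
    tree = [k] * k   # every node starts out holding the sentinel

    def replay(s):
        """Walk from leaf s to the root, leaving the loser at each node."""
        t = (s + k) // 2             # parent of leaf s
        while t > 0:
            if keys[tree[t]] < keys[s]:
                tree[t], s = s, tree[t]   # s loses here; the stored run moves up
            t //= 2
        tree[0] = s                  # overall winner

    for s in range(k - 1, -1, -1):   # build the tree one leaf at a time
        replay(s)

    out = []
    while keys[tree[0]] != math.inf:
        s = tree[0]
        out.append(keys[s])
        pos[s] += 1
        keys[s] = runs[s][pos[s]] if pos[s] < len(runs[s]) else math.inf
        replay(s)                    # only the path from leaf s to the root is touched
    return out


print(loser_tree_merge([[1, 4, 9], [2, 6], [3, 5, 7, 8], [], [0]]))
```

Each replacement costs one comparison per level on the path from the changed leaf to the root, which is where the log2(k) behaviour comes from.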
17
An Efficient Approach to Sorting
in Memory
• When we previously discussed sorting a file that is small
enough to fit in memory, we assumed that:
– We would read the entire file from disk into memory.
– We would sort the records using a standard sorting
procedure, such as shellsort.
– We would write the file back to disk.
• If the file is read and written as efficiently as possible and
if the best sorting algorithm is used, it seems that we
cannot improve the efficiency of this procedure.
• Nonetheless, we can improve it by doing things in
parallel: we can do the reading or writing at the same
time as the sorting.
18
Overlapping Processing and I/O:
Heapsort
• Heapsort can be combined with reading from the disk and
writing to the disk as follows:
– The heap can be built while reading the file.
– Sorting can be done while writing to the file.
• Heaps show certain similarities with selection trees, but they
have a somewhat looser structure.
• Heaps have three important properties:
– Each node has a single key and that key is greater than
or equal to the key at its parent node.
– A Heap is a complete binary tree.
– Storage can be allocated sequentially as an array with left
and right children of node i located at index 2i and 2i+1
respectively. ==> Pointers are unnecessary.
19
Building the Heap
Insert(NewKey) {
    if (NumElements == MaxElements) return false;  // heap is full
    NumElements++;
    HeapArray[NumElements] = NewKey;     // place the new key at the end
    int k = NumElements; int parent;
    while (k > 1) {                      // sift up until the root is reached
        parent = k / 2;
        if (Compare(k, parent) >= 0) break;  // heap order restored
        Exchange(k, parent);
        k = parent;
    }
    return true;
}
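For readers who want to run the insertion, here is a hedged Python rendering. It keeps the 1-based array layout (index 0 unused) so that the parent of position k is k // 2; the class name `MinHeap` and its attribute names are assumptions made for this sketch.

```python
class MinHeap:
    def __init__(self, max_elements):
        self.heap = [None] * (max_elements + 1)   # index 0 is unused
        self.num = 0
        self.max = max_elements

    def insert(self, key):
        """Add key to the heap; return False if the heap is full."""
        if self.num == self.max:
            return False
        self.num += 1
        self.heap[self.num] = key          # new key goes at the end
        k = self.num
        while k > 1:                       # sift up toward the root
            parent = k // 2
            if self.heap[k] >= self.heap[parent]:
                break                      # heap order holds
            self.heap[k], self.heap[parent] = self.heap[parent], self.heap[k]
            k = parent
        return True


h = MinHeap(5)
for key in [7, 3, 9, 1, 4]:
    h.insert(key)
print(h.heap[1])   # smallest key is at the root
# 1
```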
20
Building the Heap while Reading
the File I
• Rather than seeking every time we want a new record, we
read blocks of records at a time into a buffer and operate
on that block before moving to a new block.
• The input buffer for each new block of keys becomes part
of the memory area set up for the heap. Each time we read
a new block, we just append it to the end of the heap.
• The first new record is then at the end of the heap array, as
required by the insert function.
• Once a record is inserted, the next new record is at the end
of the heap array ready to be inserted as well.
21
Building the Heap while Reading
the File II
• Reading in blocks saves on seek time, but by itself it does not let us
build the heap while reading input.
• In order to do so, we need to use multiple buffers: as we
process the keys in one block from the file, we can
simultaneously read later blocks from the file.
• Question: How many buffers should be used and where
should we put them?
• Answer: the number of buffers is the number of blocks in
the file, and they are located in sequence in the array.
• Note: since building the heap can be faster than reading
blocks, there may be some delays in processing.
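The idea of processing one block while later blocks are being read can be imitated with a reader thread. The sketch below swaps in a different mechanism from the in-array buffers described above: a bounded `queue.Queue` plays the role of the buffers, lists of keys stand in for disk blocks, and `heapq` does the heap insertion. All names are illustrative.

```python
import heapq
import queue
import threading


def build_heap_while_reading(blocks, num_buffers=2):
    """blocks: iterable of lists of keys (a stand-in for blocks read from disk)."""
    buffers = queue.Queue(maxsize=num_buffers)
    done = object()                  # marks the end of the input

    def reader():
        for block in blocks:
            buffers.put(block)       # waits while all buffers are full
        buffers.put(done)

    thread = threading.Thread(target=reader)
    thread.start()
    heap = []
    while True:
        block = buffers.get()        # waits if the reader has fallen behind
        if block is done:
            break
        for key in block:
            heapq.heappush(heap, key)
    thread.join()
    return heap


heap = build_heap_while_reading([[5, 3], [8, 1], [9, 2]])
print(heap[0])
# 1
```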
22
Heap Sorting I
There are three repetitive steps involved in sorting the keys:
• Determine the value of the key in the first position of the
heap (i.e., the smallest value).
• Move the last element of the heap array (a large value, though
not necessarily the largest) into the first position, and decrease
the number of elements by one. At this point, the heap is out of order.
• Reorder the heap by exchanging the moved element with
the smaller of its children and moving down the tree to
the new position of that element until the heap is
back in order.
23
Heap Sorting II
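Remove() {
    val = HeapArray[1];                      // smallest key is at the root
    HeapArray[1] = HeapArray[NumElements];   // move the last key to the root
    NumElements--;
    int k = 1; int newK;
    while (2*k <= NumElements) {             // sift down while k has a child
        // pick the smaller child; the right child exists only if 2*k+1 <= NumElements
        if (2*k+1 > NumElements || Compare(2*k, 2*k+1) < 0) newK = 2*k;
        else newK = 2*k+1;
        if (Compare(k, newK) < 0) break;     // heap order restored
        Exchange(k, newK);
        k = newK;
    }
    return val;
}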
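A runnable Python sketch of the removal step, again with a 1-based array. The function name `remove_min` and the convention of passing and returning the element count are assumptions for this illustration. Removing repeatedly yields the keys in ascending order.

```python
def remove_min(heap, num):
    """heap is 1-based (heap[0] unused) and holds num keys in heap order.
    Returns (smallest key, new number of keys)."""
    val = heap[1]
    heap[1] = heap[num]              # move the last key to the root
    num -= 1
    k = 1
    while 2 * k <= num:              # sift down while k has a child
        new_k = 2 * k
        if new_k + 1 <= num and heap[new_k + 1] < heap[new_k]:
            new_k += 1               # right child exists and is smaller
        if heap[k] <= heap[new_k]:
            break                    # heap order restored
        heap[k], heap[new_k] = heap[new_k], heap[k]
        k = new_k
    return val, num


heap = [None, 1, 3, 2, 7, 4]         # a valid heap of five keys
num = 5
out = []
while num > 0:
    val, num = remove_min(heap, num)
    out.append(val)
print(out)
# [1, 2, 3, 4, 7]
```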
24
Heap Sorting while Writing to
the File
• The smallest record in the heap is known during the first
step of the sorting algorithm. Therefore, it can be buffered
until a whole block is known.
• While that block is written onto the disk a new block can be
processed and so on.
• Each time a block is written to disk, the heap shrinks
by one block, and the freed space can be used as a
buffer; i.e., we can have as many output buffers as there are
blocks in the file.
• Since all the I/O is sequential, this algorithm works equally well
with disks and tapes. In addition, only a minimum amount of
seeking is necessary and thus the procedure is efficient.
25
An Efficient way of Sorting Large
Files on Disks: MergeSort
• A solution for this problem was previously presented in the
form of the Keysort algorithm. However, Keysort has two
shortcomings:
– Once the keys were sorted, it was expensive to seek each
record in sorted order and then write them to the new,
sorted file.
– If the file contains many records, even the index is too
large to fit in memory.
• Solution: (1) Break the file into several sorted subfiles
(runs), using an internal sorting method; and (2) merge the
runs. ==> MergeSort
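The two phases can be sketched in Python on a text file with one record per line. This is an illustration under assumptions: records are compared as strings, `records_per_run` stands in for the amount of memory available, Python's built-in sort replaces the internal sorting method, and the standard-library `heapq.merge` performs the multiway merge. The function name is invented here.

```python
import heapq
import itertools
import os
import tempfile


def external_merge_sort(in_path, out_path, records_per_run):
    """Sort the lines of in_path into out_path; return the number of runs."""
    run_paths = []
    # Sort phase: read one memory load at a time, sort it, write it as a run.
    with open(in_path) as src:
        while True:
            chunk = [line.rstrip("\n") for line in itertools.islice(src, records_per_run)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(rec + "\n" for rec in chunk)
            run_paths.append(path)
    # Merge phase: read all runs sequentially and merge them into the output.
    runs = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as dst:
            streams = [(line.rstrip("\n") for line in f) for f in runs]
            dst.writelines(rec + "\n" for rec in heapq.merge(*streams))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
    return len(run_paths)


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "unsorted.txt")
dst_path = os.path.join(workdir, "sorted.txt")
words = ["pear", "apple", "fig", "kiwi", "date", "cherry", "banana"]
with open(src_path, "w") as f:
    f.write("\n".join(words) + "\n")
print(external_merge_sort(src_path, dst_path, 3))   # number of runs created
# 3
```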
26
MergeSort: Advantages
• It can be applied to files of any size.
• Reading of the input during the run-creation step is
sequential ==> Not much seeking.
• Reading through each run during merging and writing the
sorted records is also sequential. The only seeking necessary
is as we switch from run to run.
• If heapsort is used for the in-memory part of the merge, its
operation can be overlapped with I/O
• Since I/O is largely sequential, tapes can be used.
27
How much Time does a
MergeSort take?
Simplifying assumptions:
• Only one seek is required for any single sequential access.
• Only one rotational delay is required per access.
Expensive steps (i.e., those involving I/O) occurring in MergeSort:
• During the sort phase:
– Reading all records into memory for sorting and forming
runs.
– Writing sorted runs to disk
• During the merge phase:
– Reading sorted runs into memory for merging.
– Writing sorted file to disk.
28
What kinds of I/O take place during the
Sort and the Merge phases?
• Since, during the sort phase, the runs are created using
heapsort, I/O is sequential, so there is little room to improve
I/O performance in this phase.
• During the reading step of the merge phase, there are a lot
of random accesses (since the buffers containing the
different runs get loaded and reloaded at unpredictable
times). The number and size of the memory buffers holding
the runs determine the number of random accesses.
Performance improvements can be made in this step.
• The write step of the merge phase is not influenced by the
way in which we organize the runs.
29
The Cost of Increasing the File
Size
• In general, for a K-way merge of K runs where each run is as
large as the memory space available, the buffer size for each
of the runs is:
(1/K)* size of memory space = (1/K) * size of each run.
• So K seeks are required to read all of the records in each
individual run and since there are K runs altogether, the
merge operation requires K^2 seeks.
• Since K is directly proportional to N, the number of records,
MergeSort is an O(N^2) operation, measured in terms of seeks.
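The counting argument can be written out step by step. The symbols M (size of the memory space, equal to the size of each run) and R (number of records per run, held fixed as the file grows) are introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
\text{buffer size per run} &= \frac{M}{K} \\
\text{seeks to read one run} &= \frac{M}{M/K} = K \\
\text{seeks to read all } K \text{ runs} &= K \cdot K = K^2 \\
K = \frac{N}{R} \quad\Longrightarrow\quad \text{total seeks} &= \left(\frac{N}{R}\right)^2 = O(N^2)
\end{align*}
```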
30
What can be done to Improve
MergeSort Performance?
There are different ways in which MergeSort’s
efficiency can be improved:
• Allocate more Hardware such as disk drives,
memory, and I/O channels.
• Perform the merge in more than one step,
reducing the order of each merge and increasing
the buffer size for each run.
• Algorithmically increase the lengths of the initial
sorted runs.
• Find ways to overlap I/O Operations.
31
Hardware-Based Improvements
• Increasing the amount of memory: helps make the buffers
larger and thus reduce the numbers of seeks.
• Increasing the Number of Dedicated Disk Drives: If we
had one separate read/write head for every run, then no time
would be wasted seeking.
• Increasing the Number of I/O Channels: With a single I/O
Channel, no two transmissions can occur at the same time.
But if there is a separate I/O Channel for each disk drive,
then I/O can overlap completely.
• But what if hardware-based improvements are not possible?
32
Decreasing the Number of Seeks
Using Multiple-Step Merges
• The expensive part of the MergeSort algorithm is related to
all the seeking performed during the reading step of the
merge phase. A lot of seeks are involved because of the
large number of runs that get merged simultaneously.
• In multi-step merging, we do not try to merge all runs at one
time. Instead, we break the original set of runs into small
groups and merge the runs in these groups separately. More
buffer space is available for each run and, therefore, fewer
seeks are required per run.
• When all the smaller merges are completed, a second pass
merges the new set of merged runs.
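To see how much a second step can help, the read seeks can be counted under a simple model. The model and the figures are assumptions chosen for illustration: every initial run is as large as memory, a merge of order k gives each run a buffer of 1/k of memory, each buffer load costs one seek, and the runs divide evenly into groups. The function name `merge_read_seeks` is invented here.

```python
def merge_read_seeks(num_runs, step_orders):
    """Count the seeks needed to read the runs in a multi-step merge.
    step_orders lists the merge order used in each step."""
    run_size = 1          # size of each run, in units of the memory size
    runs = num_runs
    total = 0
    for order in step_orders:
        if runs % order != 0:
            raise ValueError("this sketch assumes runs divide evenly into groups")
        # Each buffer holds 1/order of memory, so a run of run_size memory
        # units needs run_size * order buffer loads, one seek each.
        total += runs * run_size * order
        runs //= order        # each group of `order` runs becomes one run
        run_size *= order     # and the new runs are `order` times longer
    return total


print(merge_read_seeks(800, [800]))      # one-step 800-way merge
# 640000
print(merge_read_seeks(800, [32, 25]))   # 25 merges of 32 runs, then one 25-way merge
# 45600
```

Under this model the two-step merge reads every record twice, yet the number of seeks drops by more than a factor of ten.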
33
Increasing Run Lengths Using
Replacement Selection
Replacement Selection Procedure:
1. Read a collection of records and sort them using heapsort. The
resulting heap is called the primary heap.
2. Instead of writing the entire primary heap in sorted order, write
only the record whose key has the lowest value.
3. Bring in a new record and compare the values of its key with that
of the key that has just been output.
– If the new key value is higher, insert the new record into its
proper place in the primary heap along with the other records
that are being selected for output.
– If the new record’s key value is lower, place the record in a
secondary heap of records with key values smaller than those
already written.
4. Repeat Step 3 as long as there are records left in the primary
heap and there are records to be read. When the primary heap is
empty, make the secondary heap into the primary heap and
repeat steps 2 and 3.
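The procedure can be sketched in Python with the standard-library `heapq` module. This is an illustration under assumptions: `p` is the number of records that fit in memory, the secondary heap is kept as a plain list until it is promoted, keys equal to the one just written are allowed to join the current run, and the name `replacement_selection` is invented here.

```python
import heapq
import itertools


def replacement_selection(records, p):
    """Split records into sorted runs using p memory locations."""
    it = iter(records)
    end = object()                       # marks the end of the input
    primary = list(itertools.islice(it, p))
    heapq.heapify(primary)
    secondary = []
    runs, current = [], []
    while primary:
        smallest = heapq.heappop(primary)
        current.append(smallest)         # write the lowest key to the run
        new = next(it, end)
        if new is not end:
            if new >= smallest:
                heapq.heappush(primary, new)   # can still join this run
            else:
                secondary.append(new)          # must wait for the next run
        if not primary:
            # Current run is complete; the secondary heap becomes primary.
            runs.append(current)
            current = []
            primary = secondary
            heapq.heapify(primary)
            secondary = []
    return runs


keys = [33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16]
for run in replacement_selection(keys, 3):
    print(run)
# [18, 24, 33, 58]
# [7, 14, 17, 21, 67]
# [5, 12, 16, 47]
```

With only three memory locations, the runs here hold four or five records each, longer than the three-record runs a plain in-memory sort would give.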
34
Analysis of Run Length Selection
• Question 1: Given P locations in memory, how long a run
can we expect replacement selection to produce on average?
• Answer 1: On average we can expect a run length of 2P.
• Question 2: What are the costs of using replacement
selection?
• Answer 2: Replacement Selection requires much more
seeking in order to form the runs. However, the reduction in
the number of seeks required to merge the runs usually
more than offsets that extra cost.
35
Replacement Selection +
MultiStep Merging
• In practice, Replacement Selection is not used
with a one-step merge procedure.
• Instead, it is usually used in a two-step merge
process.
• The reduction in total seek and rotational delay
time is most affected by the move from one-step
to two-step merges, but the use of Replacement
Selection is also somewhat useful.
36
Using Two Disk Drives with
Replacement Selection
• Replacement Selection offers an opportunity to save on
both transmission and seek times in ways that memory
sort methods do not.
• We could use one disk drive to do only input operations
and the other one to do only output operations.
• This means that:
– Input and Output can overlap ==> Transmission time
can be decreased by up to 50%.
– Seeking is virtually eliminated.
37
More Drives? More Processors?
• We can make the I/O process even faster by using more
than two disk drives.
• If I/O becomes faster than processing, then more
processors can be used. Different network architectures
can be used for that:
– Mainframe computers
– Vector and Array processors
– Massively parallel machines
– Very fast local area networks and communication
software.

Cosequential processing and the sorting of large files

  • 1. 1 Cosequential Processing and the Sorting of Large Files
  • 2. 2 Definition • Cosequential operations involve the coordinated processing of two or more sequential lists to produce a single output list. • This is useful for merging (or taking the union) of the items on the two lists and for matching (or taking the intersection) of the two lists. • These kinds of operations are extremely useful in file processing.
  • 3. 3 Overview • Part 1: – Development of a general model for doing co- sequential operations. – Illustration of this model’s use for simple matching and merging operations. – Application of this model to a more complex general ledger program • Part 2: – Multi-Way Merging – External Sort-Merge
  • 4. 4 A Model for Implementing Cosequential Processes: Matching I • Adams • Carter • Chin • Davis • Foster • Garwick • James • Johnson • Karns • Lambert • Miller • Adams • Anderson • Andrews • Bech • Burns • Carter • Davis • Dempsey • Gray • James • Johnson • Katz • Peters Matching Names in Two Lists
  • 5. 5 A Model for Implementing Cosequential Processes: Matching II Matching names in two lists: Matters to Consider: • Initializing: we need to arrange things so that the procedure gets going properly. • Getting and accessing the next list item: we need simple methods to do so. • Synchronizing: we have to make sure that the current item from one list is never so far ahead of the current item on the other that a match will be missed. • Handling end-of-file conditions • Recognizing Errors • Matching the names efficiently -->Good synchronization
  • 6. 6 A Model for Implementing Cosequential Processes: Matching III Synchronization • Let Item(1) be the current item from list 1 and Item(2) be the current item from list 2. • Rules: – If Item(1) < Item(2), get the next item from list 1. – If Item(1) > Item(2), get the next item from list 2. – If Item(1) = Item(2), output the item and get the next items from the two lists.
  • 7. 7 A Model for Implementing Cosequential Processes: Merging I • The matching procedure can easily be modified to handle merging of two lists. • An important difference between matching and merging is that with merging, we must read completely through each of the lists. • We have to recognize, however, when one of the two lists has been completely read and avoid reading again from it.
  • 8. 8 Application of the Cosequential Model to a General Ledger Program I • The problem: To design a general ledger posting program as part of an accounting system. • The system contains: – A journal file: with the monthly transactions that are ultimately to be posted to the ledger file. – A ledger file containing month-by-month summaries of the values associated with each of the bookkeeping accounts. • Posting involves associating each transaction with its account in the ledger.
  • 9. 9 Application of the Cosequential Model to a General Ledger Program II • How is the posting process implemented? • Solution 1: Build an index for the ledger organized by account number. ==> 2 problems: 1) lots of seeking back and forth; 2) the journal entries relating to one account are not collected together. • Solution 2: collect all the journal transactions that relate to a given account by sorting the journal transactions by account number and working through the ledger and the sorted journal cosequentially.
  • 10. 10 Application of the Cosequential Model to a General Ledger Program III • Goal of our program: To produce a printed version of the ledger that not only shows the beginning and current balance for each account but also lists all the journal transactions for the month. • From the point of view of the ledger accounts, the posting process is a merge (even unmatched ledger accounts appear in the output). From the point of view of the journal accounts, the posting process is a match. • Our program must implement a combined merge/match while simultaneously printing account title lines, individual transactions and summary balances.
  • 11. 11 Application of the Cosequential Model to a General Ledger Program IV • Summary of the steps involved in processing the ledger entries: – Immediately after reading a new ledger object, print the header line and initialize the balance for the next month from the previous month’s balance. – For each transaction object that matches, update the account balance. – After the last transaction for the account, print the balance line.
  • 12. 12 Application of the Cosequential Model to a General Ledger Program V The posting process has three cases: • If the ledger account number is less then the journal transaction account number, then print the ledger account balance and then read in the next ledger account and print its title line if the account exists. • If the account numbers match, then add the transaction amount to the account balance, print the description of the transaction, and read the next journal entry. • If the journal account is less than the ledger account, then it is an unmatched journal account. Print an error message and continue with the next transaction.
  • 13. 13 A K-Way Merge Algorithm Let there be two arrays: • An array of k lists and • An array of k index values corresponding to the current element in each of the k lists, respectively. Main loop of the K-Way Merge algorithm: • Find the index of the minimum current item, minItem • Process minItem(output it to the output list) • For i=0 until i=k-1 (in increments of 1) – If the current item of list i is equal to minItem then advance list i. • Go back to the first step.
  • 14. 14 A Selection Tree for Merging Large Number of Lists • The K-Way Merging Algorithm just described works well if k < 8. Otherwise, the number of comparisons needed to find the minimum value each step of the way is very large. • Instead, it is easier to use a selection tree which allows us to determine a minimum key value more quickly. • Merging k lists using this method is related to log2k (the depth of the selection tree) rather than to k. • Updating selection trees is not easy ==> Keep a tree of losers (Knuth, 73).
  • 15. 15 Keeping Trees of Losers rather than Trees of Winners I • Advantages of the Tree of Losers: • When using a tree of winners, the records with which the winner has to be compared--so as to find the next winner-- are located in different subtrees. Updating such a tree is not very convenient. • When using a tree of losers, – The value of each leaf (apart from the smallest, the winner) occurs only once in an internal node) – All the records with which the winner has to be compared lie on a path from the winner leaf to the root. – As long as each node in the tree has a pointer to its parent, then it is very easy to find the next winner.
  • 16. 16 Keeping Trees of Losers rather than Trees of Winners II • Algorithm for updating a selection tree of losers: • T is a pointer to an internal node in the tree of losers • topoftree is a flag indicating if updating has reached the root T <-- parent of Buffer[s] topoftree <-- false repeat if key(Buffer(loser(T))) < key(Buffer[s]) then interchange loser(T) and s if T = root then topoftree <-- true else T <-- parent of node pointed to by T until topoftree.
  • 17. 17 An Efficient Approach to Sorting in Memory • When we previously discussed sorting a file that is small enough to fit in memory, we assumed that: – We would read the entire file from disk into memory. – We would sort the records using a standard sorting procedure, such as shellsort. – We would write the file back to disk. • If the file is read and written as efficiently as possible and if the best sorting algorithm is used, it seems that we cannot improve the efficiency of this procedure. • Nonetheless, we can improve it by doing things in parallel: we can do the reading or writing at the same time as the sorting.
  • 18. 18 Overlapping Processing and I/O: Heapsort • Heapsort can be combined with reading from the disk and writing to the disk as follows: – The heap can be built while reading the file. – Sorting can be done while writing to the file. • Heaps show certain similarities with selection trees, but they have a somewhat looser structure. • Heaps have three important properties: – Each node has a single key and that key is greater than or equal to the key at its parent node. – A Heap is a complete binary tree. – Storage can be allocated sequentially as an array with left and right children of node i located at index 2i and 2i+1 respectively. ==> Pointers are unnecessary.
  • 19. 19 Building the Heap
      Insert(NewKey) {
          if (NumElements == MaxElements) return false;
          NumElements++;
          HeapArray[NumElements] = NewKey;          // add the new key at the end of the heap
          int k = NumElements; int parent;
          while (k > 1) {                           // k has a parent
              parent = k / 2;
              if (Compare(k, parent) >= 0) break;   // heap order restored
              else Exchange(k, parent);
              k = parent;
          }
          return true;
      }
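For readers who want to run the insertion step, here is a minimal Python sketch of the same idea. The class name `MinHeap` and its attribute names are assumptions chosen to mirror the slide; index 0 of the array is left unused so that the parent of node k is at k // 2, as in the slide.

```python
class MinHeap:
    """Array-based min-heap with 1-based indexing (index 0 is unused)."""

    def __init__(self, max_elements):
        self.max_elements = max_elements
        self.num_elements = 0
        self.heap_array = [None] * (max_elements + 1)

    def insert(self, new_key):
        if self.num_elements == self.max_elements:
            return False                      # heap is full
        self.num_elements += 1
        self.heap_array[self.num_elements] = new_key
        k = self.num_elements
        while k > 1:                          # sift the new key up toward the root
            parent = k // 2
            if self.heap_array[k] >= self.heap_array[parent]:
                break                         # heap order restored
            self.heap_array[k], self.heap_array[parent] = (
                self.heap_array[parent], self.heap_array[k])
            k = parent
        return True
```

Inserting 5, 3, 8, 1 in that order leaves the smallest key, 1, at `heap_array[1]`.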
  • 20. 20 Building the Heap while Reading the File I • Rather than seeking every time we want a new record, we read a block of records at a time into a buffer and operate on that block before moving to a new block. • The input buffer for each new block of keys becomes part of the memory area set up for the heap. Each time we read a new block, we just append it to the end of the heap. • The first new record is then at the end of the heap array, as required by the insert function. • Once a record is inserted, the next new record is at the end of the heap array, ready to be inserted as well.
  • 21. 21 Building the Heap while Reading the File II • Reading in blocks saves seek time, but by itself it does not let us build the heap while input is being read. • To do so, we need multiple buffers: as we process the keys in one block from the file, we can simultaneously read later blocks from the file. • Question: How many buffers should be used and where should we put them? • Answer: the number of buffers is the number of blocks in the file, and they are located in sequence in the heap array. • Note: since building the heap can be faster than reading blocks, processing may sometimes have to wait for the next block to arrive.
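The overlap of reading and heap building can be illustrated with a reader thread that fills buffers while the main thread inserts keys. This is only a rough sketch under stated assumptions: it uses a small bounded queue of buffers instead of placing the buffers inside the heap array as the slide describes, the "file" is simulated by a list of blocks, and the names `read_blocks` and `build_heap_while_reading` are hypothetical.

```python
import heapq
import queue
import threading


def read_blocks(blocks, buffers):
    """Reader thread: hand over one block at a time (stands in for disk reads)."""
    for block in blocks:
        buffers.put(block)       # waits here if every buffer is still in use
    buffers.put(None)            # end-of-file marker


def build_heap_while_reading(blocks, num_buffers=2):
    buffers = queue.Queue(maxsize=num_buffers)
    reader = threading.Thread(target=read_blocks, args=(blocks, buffers))
    reader.start()
    heap = []
    while True:
        block = buffers.get()    # waits here if the reader has fallen behind
        if block is None:
            break
        for key in block:
            heapq.heappush(heap, key)   # insert while later blocks are being read
    reader.join()
    return heap
```

The call to `buffers.get()` is where the delays mentioned in the note above would show up when heap building outpaces reading.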
  • 22. 22 Heap Sorting I There are three repetitive steps involved in sorting the keys: • Determine the value of the key in the first position of the heap (i.e., the smallest value). • Move the last element of the heap array (one of the larger values, though not necessarily the largest) into the first position, and decrease the number of elements by one. At this point, the heap is out of order. • Reorder the heap by exchanging the moved element with the smaller of its children and moving down the tree to the new position of that element until the heap is back in order.
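  • 23. 23 Heap Sorting II
      Remove() {
          val = HeapArray[1];                      // smallest key is at the root
          HeapArray[1] = HeapArray[NumElements];   // move the last element to the root
          NumElements--;
          int k = 1; int newK;
          while (2*k <= NumElements) {             // k has at least one child
              // pick the smaller child; the right child exists only if 2*k+1 <= NumElements
              if (2*k+1 > NumElements || Compare(2*k, 2*k+1) < 0) newK = 2*k;
              else newK = 2*k+1;
              if (Compare(k, newK) < 0) break;     // heap order restored
              Exchange(k, newK);
              k = newK;
          }
          return val;
      }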
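A runnable Python sketch of the removal step, and of sorting by repeated removal, is given below. The names `remove_min` and `heap_sort` are assumptions. To keep the example short, the initial heap is built with the standard library's `heapq.heapify` and then shifted by one position, which yields a valid 1-based heap; the removal itself follows the slide.

```python
import heapq


def remove_min(heap):
    """Remove and return the smallest key of a 1-based min-heap (heap[0] unused)."""
    val = heap[1]
    last = heap.pop()                    # last element of the heap array
    n = len(heap) - 1                    # number of elements that remain
    if n >= 1:
        heap[1] = last                   # move it to the root, then sift it down
        k = 1
        while 2 * k <= n:
            new_k = 2 * k                # left child
            if 2 * k + 1 <= n and heap[2 * k + 1] < heap[2 * k]:
                new_k = 2 * k + 1        # right child exists and is smaller
            if heap[k] <= heap[new_k]:
                break                    # heap order restored
            heap[k], heap[new_k] = heap[new_k], heap[k]
            k = new_k
    return val


def heap_sort(keys):
    h = list(keys)
    heapq.heapify(h)                     # 0-based heap: children at 2i+1, 2i+2
    heap = [None] + h                    # shifted by one: children at 2i, 2i+1
    output = []
    while len(heap) > 1:
        output.append(remove_min(heap))
    return output
```

Each call to `remove_min` yields the next record in sorted order, which is what makes it possible to overlap sorting with writing.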
  • 24. 24 Heap Sorting while Writing to the File • The smallest record in the heap is known during the first step of the sorting algorithm. Therefore, it can be buffered until a whole block is known. • While that block is written to the disk, a new block can be processed, and so on. • Each time a block is written to disk, the heap shrinks by one block, so the freed space can be used as an output buffer, i.e., we can have as many output buffers as there are blocks in the file. • Since all the I/O is sequential, this algorithm works with tapes as well as with disks. In addition, only a minimum amount of seeking is necessary, and thus the procedure is efficient.
  • 25. 25 An Efficient Way of Sorting Large Files on Disks: MergeSort • A solution for sorting large files was previously presented in the form of the Keysort algorithm. However, Keysort has two shortcomings: – Once the keys were sorted, it was expensive to seek each record in sorted order and then write them to the new, sorted file. – If the file contains many records, even the index is too large to fit in memory. • Solution: (1) Break the file into several sorted subfiles (runs), using an internal sorting method; and (2) merge the runs. ==> MergeSort
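The two steps of the solution can be sketched as follows. This is an illustrative outline only, with assumptions that are not in the slides: records are text lines (one record per line, compared as strings), the built-in list sort stands in for the internal sorting method, `heapq.merge` stands in for the K-way merge, and all function names are hypothetical.

```python
import heapq
import itertools
import os
import tempfile


def read_records(f):
    """Yield the records of an open text file, without line terminators."""
    for line in f:
        yield line.rstrip("\n")


def create_runs(input_path, records_per_run, tmp_dir):
    """Step 1: cut the input into sorted runs that each fit in memory."""
    run_paths = []
    with open(input_path) as src:
        records = read_records(src)
        while True:
            chunk = list(itertools.islice(records, records_per_run))
            if not chunk:
                break
            chunk.sort()                               # internal sort of one run
            path = os.path.join(tmp_dir, f"run{len(run_paths)}.txt")
            with open(path, "w") as run:
                run.writelines(rec + "\n" for rec in chunk)
            run_paths.append(path)
    return run_paths


def merge_runs(run_paths, output_path):
    """Step 2: merge all runs into one sorted output file."""
    files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            merged = heapq.merge(*(read_records(f) for f in files))
            out.writelines(rec + "\n" for rec in merged)
    finally:
        for f in files:
            f.close()


def merge_sort_file(input_path, output_path, records_per_run):
    with tempfile.TemporaryDirectory() as tmp_dir:
        merge_runs(create_runs(input_path, records_per_run, tmp_dir), output_path)
```

Only `records_per_run` records are held in memory during run creation, and the merge reads each run sequentially from start to finish.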
  • 26. 26 MergeSort: Advantages • It can be applied to files of any size. • Reading of the input during the run-creation step is sequential ==> Not much seeking. • Reading through each run during merging and writing the sorted records is also sequential. The only seeking necessary is as we switch from run to run. • If heapsort is used for the in-memory part of the merge, its operation can be overlapped with I/O. • Since I/O is largely sequential, tapes can be used.
  • 27. 27 How much Time does a MergeSort take? Simplifying assumptions: • Only one seek is required for any single sequential access. • Only one rotational delay is required per access. Expensive steps (i.e. involving I/O) occurring in MergeSort • During the sort phase: – Reading all records into memory for sorting and forming runs. – Writing sorted runs to disk • During the merge phase: – Reading sorted runs into memory for merging. – Writing sorted file to disk.
  • 28. 28 What kinds of I/O take place during the Sort and the Merge phases? • Since, during the sort phase, the runs are created using heapsort, I/O is sequential. No further reduction in seeking can be gained in this phase. • During the reading step of the merge phase, there are a lot of random accesses (since the buffers containing the different runs get loaded and reloaded at unpredictable times). The number and size of the memory buffers holding the runs determine the number of random accesses. Performance improvements can be made in this step. • The write step of the merge phase is not influenced by the way in which we organize the runs.
  • 29. 29 The Cost of Increasing the File Size • In general, for a K-way merge of K runs where each run is as large as the memory space available, the buffer size for each of the runs is: (1/K) * size of memory space = (1/K) * size of each run. • So K seeks are required to read all of the records in each individual run, and since there are K runs altogether, the merge operation requires K^2 seeks. • Since K is directly proportional to N, the number of records, MergeSort is an O(N^2) operation, measured in terms of seeks.
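The seek count above can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration (the name `merge_read_seeks` and the sizes used are assumptions, with sizes given in records): it charges one seek each time a run's buffer has to be refilled.

```python
def merge_read_seeks(num_runs, run_size, memory_size):
    """Seeks needed to read every run once during a one-step merge."""
    buffer_size = memory_size // num_runs        # memory is shared equally by the runs
    if buffer_size == 0:
        raise ValueError("memory is too small to give each run a buffer")
    seeks_per_run = -(-run_size // buffer_size)  # ceiling division: one seek per refill
    return num_runs * seeks_per_run
```

With runs as large as memory, 10 runs of 1000 records need 10 * 10 = 100 seeks, while 20 such runs need 20 * 20 = 400: doubling the file size quadruples the number of seeks.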
  • 30. 30 What can be done to Improve MergeSort Performance? There are different ways in which MergeSort’s efficiency can be improved: • Allocate more Hardware such as disk drives, memory, and I/O channels. • Perform the merge in more than one step, reducing the order of each merge and increasing the buffer size for each run. • Algorithmically increase the lengths of the initial sorted runs. • Find ways to overlap I/O Operations.
  • 31. 31 Hardware-Based Improvements • Increasing the amount of memory: helps make the buffers larger and thus reduces the number of seeks. • Increasing the Number of Dedicated Disk Drives: If we had one separate read/write head for every run, then no time would be wasted seeking. • Increasing the Number of I/O Channels: With a single I/O Channel, no two transmissions can occur at the same time. But if there is a separate I/O Channel for each disk drive, then I/O can overlap completely. • But what if hardware-based improvements are not possible?
  • 32. 32 Decreasing the Number of Seeks Using Multiple-Step Merges • The expensive part of the MergeSort algorithm is related to all the seeking performed during the reading step of the merge phase. A lot of seeks are involved because of the large number of runs that get merged simultaneously. • In multi-step merging, we do not try to merge all runs at one time. Instead, we break the original set of runs into small groups and merge the runs in these groups separately. More buffer space is available for each run, and, therefore, fewer seeks are required per run. • When all the smaller merges are completed, a second pass merges the new set of merged runs.
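To see why the extra pass can pay off, the seek counts of a one-step and a two-step merge can be compared directly. The sketch below is a back-of-the-envelope model under stated assumptions: every initial run is as large as memory, the runs divide evenly into groups, one seek is charged per buffer refill, and the figures 800 and 25 are illustrative values only. The function names are hypothetical.

```python
def one_step_seeks(num_runs):
    """K runs, each read through a buffer of 1/K of memory: K seeks per run."""
    return num_runs * num_runs


def two_step_seeks(num_runs, num_groups):
    """Merge the runs in groups first, then merge the resulting longer runs."""
    if num_runs % num_groups != 0:
        raise ValueError("this sketch assumes the runs divide evenly into groups")
    runs_per_group = num_runs // num_groups
    # Step 1: each group is a small merge costing runs_per_group ** 2 seeks.
    first_step = num_groups * runs_per_group ** 2
    # Step 2: each merged run is runs_per_group memory loads long and is read
    # through a buffer of 1/num_groups of memory, i.e. num_runs seeks per run.
    second_step = num_groups * num_runs
    return first_step + second_step
```

For 800 runs, a one-step merge needs 640,000 seeks, while merging 25 groups of 32 runs and then the 25 results needs 25,600 + 20,000 = 45,600 seeks. The price is that every record is read and written twice, so transmission time goes up while seek time goes down.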
  • 33. 33 Increasing Run Lengths Using Replacement Selection Replacement Selection Procedure: (1) Read a collection of records and sort them using heapsort. The resulting heap is called the primary heap. (2) Instead of writing the entire primary heap in sorted order, write only the record whose key has the lowest value. (3) Bring in a new record and compare the value of its key with that of the key that has just been output. – If the new key value is higher, insert the new record into its proper place in the primary heap along with the other records that are being selected for output. – If the new record's key value is lower, place the record in a secondary heap of records with key values smaller than those already written. (4) Repeat step 3 as long as there are records left in the primary heap and there are records to be read. When the primary heap is empty, make the secondary heap into the primary heap and repeat steps 2 and 3.
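The procedure can be sketched in Python as follows. This is a minimal illustration, not the slides' implementation: it uses the standard `heapq` module for the primary heap, keeps records in memory as a plain sequence of keys, treats a new key equal to the one just written as belonging to the current run, and returns the runs as lists instead of writing them to disk. The name `replacement_selection` and its parameters are assumptions.

```python
import heapq
import itertools

_END = object()   # marks the end of the input


def replacement_selection(records, memory_size):
    """Split records into sorted runs using memory_size record slots."""
    source = iter(records)
    primary = list(itertools.islice(source, memory_size))
    heapq.heapify(primary)               # step 1: build the primary heap
    secondary = []                       # records held back for the next run
    runs, current_run = [], []
    while primary:
        smallest = heapq.heappop(primary)
        current_run.append(smallest)     # step 2: write the lowest key
        new_record = next(source, _END)  # step 3: bring in a new record
        if new_record is not _END:
            if new_record >= smallest:
                heapq.heappush(primary, new_record)   # still fits the current run
            else:
                secondary.append(new_record)          # must wait for the next run
        if not primary:                  # current run is finished
            runs.append(current_run)
            current_run = []
            primary, secondary = secondary, []
            heapq.heapify(primary)       # the secondary heap becomes the primary
    return runs
```

With `memory_size=3`, the input `[5, 3, 8, 1, 9, 2, 7]` produces the runs `[3, 5, 8, 9]` and `[1, 2, 7]`: the first run is longer than the three records that fit in memory.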
  • 34. 34 Analysis of Run Length Selection • Question 1: Given P locations in memory, how long a run can we expect replacement selection to produce on average? • Answer 1: On average, assuming the input records arrive in random order, we can expect a run length of 2P. • Question 2: What are the costs of using replacement selection? • Answer 2: Replacement Selection requires much more seeking in order to form the runs. However, the reduction in the number of seeks required to merge the runs usually more than offsets that extra cost.
  • 35. 35 Replacement Selection + MultiStep Merging • In practice, Replacement Selection is not used with a one-step merge procedure. • Instead, it is usually used in a two-step merge process. • Most of the reduction in total seek and rotational delay time comes from the move from one-step to two-step merges, but the use of Replacement Selection also helps somewhat.
  • 36. 36 Using Two Disk Drives with Replacement Selection • Replacement Selection offers an opportunity to save on both transmission and seek times in ways that in-memory sort methods do not. • We could use one disk drive to do only input operations and the other one to do only output operations. • This means that: – Input and Output can overlap ==> Transmission time can be decreased by up to 50%. – Seeking is virtually eliminated.
  • 37. 37 More Drives? More Processors? • We can make the I/O process even faster by using more than two disk drives. • If I/O becomes faster than processing, then more processors can be used. Different architectures can be used for that: – Mainframe computers – Vector and Array processors – Massively parallel machines – Very fast local area networks and communication software.