11 April 1994
As with other facets of call processing, image taking presents challenges that can be overcome only through understanding and foresight. This document hopes to explain the problem as it is currently understood, and to present possible solutions to the reader.
Image Taking Today
An image is a memory dump to offline storage; a snapshot of the current state of the system. It normally takes from 45 to 90 minutes to create an image, and call processing is expected to be up and running all the while. When call processing updates protected memory while an image is being taken, the information is buffered until the dump is completed. If this buffer overflows, the process traps and aborts the image.
MCPS will have several call processing instances, which share data. This implies that data will be updated more often than in a single call processor system. If there are X writes to protected store with one call processor, then there may be N*X writes to protected store with N call processors under MCPS. Consequently, the probability of buffer overflow during image taking (and therefore of trapping) increases. It is important, however, that the probability of trapping does not increase.
Another problem is the possibility of creating an image on several call processors simultaneously. There may be a huge data bottleneck in this case. If the time to take simultaneous images is increased to more than a few hours, then the image taking time may well spill over into prime call processing time. This is truly undesirable.
Facts
- Typically, images are taken during low-demand hours: from 1:00 AM to around 3:00 to 4:00 AM.
- Images take 45 to 90 minutes to create.
- Protected memory can be overwritten to a limited degree during the image taking process (256 writes maximum).
- Updated data is time critical. It must be distributed ASAP.
- In MCPS, any call processing instance may have its image taken.
- In MCPS, multiple call processing instances may have their images taken simultaneously.
- Of course, the Customer does not want traps, and does not want any image to abort.
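To put the 256-write limit in perspective, here is a back-of-the-envelope sketch in Python. It assumes that writes are spread evenly over the dump and that every call processor writes to the same protected store at the same steady rate. Neither assumption comes from the facts above, and the function name is made up for illustration.

```python
MAX_WRITES_PER_DUMP = 256  # limit stated in the facts above


def per_processor_write_budget(dump_minutes: float, processors: int) -> float:
    """Average protected writes per minute that each processor may issue
    without exceeding the 256-write limit during one dump.

    Assumption (illustrative only): all processors write at the same
    steady rate into the same protected store.
    """
    if dump_minutes <= 0 or processors <= 0:
        raise ValueError("dump_minutes and processors must be positive")
    return MAX_WRITES_PER_DUMP / (dump_minutes * processors)


if __name__ == "__main__":
    for minutes in (45, 90):
        for n in (1, 4):
            rate = per_processor_write_budget(minutes, n)
            print(f"{minutes} min dump, {n} processor(s): {rate:.2f} writes/min each")
```

Under these assumptions a single processor gets fewer than six writes per minute during a 45-minute dump, and four processors sharing a 90-minute dump get less than one write per minute each.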
Questions
- What kind of bottleneck does image taking cause?
- What kind of bottleneck would simultaneous multiple image taking cause?
Function Calls
Several functions are provided to accommodate protected writes. Their impact ranges in politeness from kind to brutal.
UNPROTECTDS and PROTECTDS
These routines may be used to remove data store protection, but only in certain circumstances. Anyone can use them in restart code, but only CIs may use them after restarts. This prevents applications from changing protected data during a dump. So what do you do if you are not a CI and want to change protected store in non-restart code?
COND_UNPROTECTDS
This routine will unprotect DS if there is not a dump in progress. A boolean value is returned to indicate whether protection removal was possible [successful]. But what if I need to write to protected store even if a dump is in progress?
WRITE_PROTECTED_STORE
This will keep the dump in a consistent state. However, only a limited number of calls (256) to this routine can be made during a dump so use this sparingly. Implication: this is not an acceptable call for our MCPS data updates!
UNSAFE_WRITE_PROTECTED_STORE
This is the same as WRITE_PROTECTED_STORE but does not try to keep the dump consistent. This is for people who are currently using WRITE_PROTECTED_STORE for things that do not need to be dumped. NOTE: This is not for the general user!
INFORM_DSPROT_UPDATE
(We won't actually call this one; it's just FYI). This procedure is called by WRITE_PROTECTED_STORE. This is used to inform the dump program that a piece of store is about to be changed. The dumper can remember what was at this location so that consistent dumps can be made.
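The behaviour described above can be modelled in a few lines. The following Python sketch is a toy model only: it is not the real routines, their signatures, or their implementation language. The class and method names are hypothetical, and the sketch assumes the 256 limit counts calls made during a single dump.

```python
class DumpTrap(Exception):
    """Raised when the toy dumper runs out of room for protected writes."""


class ToyDumper:
    MAX_WRITES_DURING_DUMP = 256  # limit stated in the memo

    def __init__(self):
        self.dump_in_progress = False
        self.store = {}   # address -> value, a stand-in for protected store
        self.saved = {}   # pre-dump values remembered for a consistent dump
        self.writes_during_dump = 0

    def cond_unprotect_write(self, addr, value):
        """Models COND_UNPROTECTDS followed by a write.

        Returns True if the write happened, False if a dump blocked it.
        """
        if self.dump_in_progress:
            return False
        self.store[addr] = value
        return True

    def write_protected_store(self, addr, value):
        """Models WRITE_PROTECTED_STORE: always writes, keeps the dump
        consistent, and traps once the limit is exceeded (assumed to be
        counted per call)."""
        if self.dump_in_progress:
            if self.writes_during_dump >= self.MAX_WRITES_DURING_DUMP:
                raise DumpTrap("write limit exceeded; image aborted")
            # Models INFORM_DSPROT_UPDATE: remember the old value once.
            self.saved.setdefault(addr, self.store.get(addr))
            self.writes_during_dump += 1
        self.store[addr] = value


def update(dumper, addr, value):
    """Try the polite call first; fall back to the limited one."""
    if not dumper.cond_unprotect_write(addr, value):
        dumper.write_protected_store(addr, value)
```

In this model the 257th fallback write during one dump raises DumpTrap, which is why WRITE_PROTECTED_STORE on its own does not scale to frequent MCPS data updates.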
Solution 1: Local Buffering
When a call processor distributes updated data, it holds it for a while. (If it is having its image taken, it could hold this data until the image is completed.) If other processors explain that they're busy getting their images taken, then the data is held until it has been requested by all those processors who asked for it to be held. If nobody responds, the data times out and is deallocated.
When a processor is having its image taken and receives updated information, the processor responds that it can't accept the data yet. This processor then sets a flag to query the sender for the data when the image taking is over.
Advantage: data is locally buffered; memory usage is minimized.
Disadvantage: We might not be allowed to message that much.
Disadvantage: Local memory is limited.
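A minimal Python sketch of the Local Buffering exchange may help make the message flow concrete. All names are hypothetical, messages are modelled as direct method calls, and the sketch leaves out both the timeout and the case where the sender itself is having its image taken.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.imaging = False
        self.data = {}     # updates already applied locally
        self.held = {}     # update_id -> (key, value, names still waiting)
        self.pending = []  # (sender, update_id) pairs to fetch after imaging

    def distribute(self, update_id, key, value, peers):
        """Send an update; hold it for every peer that says it is busy."""
        waiting = {p.name for p in peers
                   if not p.receive(self, update_id, key, value)}
        if waiting:
            self.held[update_id] = (key, value, waiting)

    def receive(self, sender, update_id, key, value):
        """Apply the update, or refuse it and remember to ask again later."""
        if self.imaging:
            self.pending.append((sender, update_id))
            return False
        self.data[key] = value
        return True

    def fetch(self, requester, update_id):
        """Hand over a held update; free it once every waiter has it."""
        key, value, waiting = self.held[update_id]
        waiting.discard(requester.name)
        if not waiting:
            del self.held[update_id]
        return key, value

    def image_done(self):
        """Image finished: query each sender for the updates we refused."""
        self.imaging = False
        for sender, update_id in self.pending:
            key, value = sender.fetch(self, update_id)
            self.data[key] = value
        self.pending.clear()
```

Note that each refused update costs one refusal plus one query and reply, which is where the messaging disadvantage comes from.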
Solution 2: Global Buffering
This is a variant of Local Buffering. If messaging is more of a bottleneck than local RAM, then each processor buffers up any updates it receives, and performs its own update when image taking is complete. In fact, since 256 writes are "free" before the dumper traps, it may be possible to go ahead and write the first N updates (N<256) and buffer the remainder.
Advantage: This does not rely on ACKing: messaging is minimized.
Disadvantage: This is memory inefficient. Instead of buffering the memory in one place, it now is buffered in potentially several places.
Disadvantage: Local memory is limited.
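The receiver side of Global Buffering might look something like the following Python sketch. The class name, method names, and the default of 200 direct writes are all assumptions chosen for illustration; the only figure taken from the text is the 256-write ceiling.

```python
from collections import deque

FREE_WRITES = 256  # writes the dumper tolerates before trapping (from the text)


class BufferingReceiver:
    def __init__(self, direct_write_limit=200):
        # direct_write_limit is the N in "N<256"; 200 is an arbitrary choice.
        if not 0 <= direct_write_limit < FREE_WRITES:
            raise ValueError("direct_write_limit must be in the range 0..255")
        self.direct_write_limit = direct_write_limit
        self.imaging = False
        self.direct_writes = 0
        self.store = {}          # stand-in for protected store
        self.backlog = deque()   # updates deferred until the image is done

    def start_image(self):
        self.imaging = True
        self.direct_writes = 0

    def on_update(self, key, value):
        """Apply the update now if allowed, otherwise queue it."""
        if not self.imaging:
            self.store[key] = value
        elif self.direct_writes < self.direct_write_limit:
            self.store[key] = value       # one of the "free" writes
            self.direct_writes += 1
        else:
            self.backlog.append((key, value))

    def finish_image(self):
        """Replay the queued updates in arrival order."""
        self.imaging = False
        while self.backlog:
            key, value = self.backlog.popleft()
            self.store[key] = value
```

Because the direct-write counter only grows during an image, every queued update arrived after every directly written one, so replaying the backlog in order preserves the original update order.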
Solution 3: Pipe Dream
A hybrid between Local and Global Buffering is where each processor has access to a "status word" containing the state of all processors. In this way a processor would know who to distribute data to, and who to wait on.
Advantage: Data is locally buffered.
Advantage: ACKing is minimized.
Disadvantage: The "status word" may not be feasible to implement, or may not exist.
Disadvantage: Local memory is limited.
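If such a "status word" did exist, one plausible encoding is a bitmask with one bit per processor, set while that processor is having its image taken. The Python sketch below assumes exactly that encoding; the bit layout and function names are hypothetical, and how the word would be shared between processors is left open.

```python
def set_imaging(status_word: int, processor: int, imaging: bool) -> int:
    """Return a new status word with the processor's bit set or cleared."""
    mask = 1 << processor
    return (status_word | mask) if imaging else (status_word & ~mask)


def split_targets(status_word: int, targets):
    """Partition targets into (send now, wait on) using the status word."""
    send_now = [p for p in targets if not status_word & (1 << p)]
    wait_on = [p for p in targets if status_word & (1 << p)]
    return send_now, wait_on
```

A sender would distribute immediately to the first list and hold the data locally for the second, without having to poll anyone first.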
Wow! This must be from the Tiger Team.
I wish I still had the code we wrote to move table data from the CM to an AP!