Microsystems adheres to Australian Standard AS/NZS 4360:1999. This Standard
provides a generic guide for the establishment and implementation of
the risk management process, involving the identification, analysis,
evaluation, treatment and ongoing monitoring of risks.
Context
It is the standard policy of Microsystems to ensure that there are three
copies of all data at any one time.
Data falls into two categories:
Microsystems Data
Client Data
Microsystems Data
Microsystems Data consists of all data related to the company itself.
The two vital areas for Microsystems data backup are:
Project Data
Office Administration Data
Currently, Microsystems data is backed up at the following locations:
Locally on site, to another server in case of critical failure: two copies (one hot swappable)
Off site, at the Microsystems web server
Ensuring that three current copies are always present greatly reduces
data loss should an unexpected network failure occur.
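The three-copy policy lends itself to an automated check. The following is a minimal sketch only, assuming each copy is reachable as a directory path; the function names and any paths passed in are hypothetical and do not describe Microsystems' actual systems.

```python
import hashlib
from pathlib import Path

REQUIRED_COPIES = 3  # policy stated in this document: three copies of all data


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def count_current_copies(relative_name: str, copy_roots: list[Path]) -> int:
    """Count the copies that exist and match the content under the first root."""
    reference = copy_roots[0] / relative_name
    if not reference.is_file():
        return 0
    expected = file_digest(reference)
    return sum(
        1
        for root in copy_roots
        if (root / relative_name).is_file()
        and file_digest(root / relative_name) == expected
    )


def meets_policy(relative_name: str, copy_roots: list[Path]) -> bool:
    """True when at least REQUIRED_COPIES identical copies are present."""
    return count_current_copies(relative_name, copy_roots) >= REQUIRED_COPIES
```

The first root is treated as the reference copy, which is an assumption of this sketch; a copy that is present but stale is not counted as current.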
Client Data
Client data consists of all images captured and all data relevant to
those images. This data is uploaded nightly from the source machine,
simultaneously to our web server and to the local hot swappable machine,
hence the three copies. Any subsequent changes made to the data on the
web server by clients are then replicated back to Microsystems servers.
This ensures that there are three current copies of client data at all
times. Backing up this way has two major advantages:
The backup is off-site, giving Microsystems added protection against a local disaster, such as fire or theft, in either building (Microsystems premises and ISP premises).
The backup is on a RAID 5 array, which adds protection through data redundancy.
The process starts by making an FTP (File Transfer Protocol) connection
to the Web Server. This is done by entering a specific address (which may
be obtained upon written request to Microsystems) and logging in with the
administrator password (or an alternative password with rights to write
to the directory). The administrator password, which is changed regularly,
is known only to select people within Microsystems. No person outside of
Microsystems is privy to this information.
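The FTP step can be sketched with Python's standard ftplib module. This is an illustration under stated assumptions, not Microsystems' actual upload script: the host, user name, password and file list are placeholders supplied by the caller, and the function names are hypothetical.

```python
from ftplib import FTP
from pathlib import Path


def upload_file(ftp, local_path: Path, remote_name: str) -> None:
    """Send one file over an already-connected FTP session in binary mode."""
    with local_path.open("rb") as fh:
        ftp.storbinary(f"STOR {remote_name}", fh)


def upload_nightly(host: str, user: str, password: str, files: list[Path]) -> None:
    """Connect, log in, and upload each file under its own name.

    All arguments are placeholders; the real address and credentials
    are not published in this document.
    """
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        for path in files:
            upload_file(ftp, path, path.name)
```

Plain FTP transmits the password unencrypted. Where the server supports it, ftplib.FTP_TLS is the standard-library alternative with the same upload calls.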
New data is loaded onto the Microsystems server and uploaded onto the
Microsystems server based at the ISP. The ISP server in turn updates the
Microsystems server with the relevant changes, and this server is in turn
backed up to hard disk once a day and to CD-ROM once a week. As a result,
in the unlikely event of a complete system failure, loss of client data
is minimized.
By backing up this way, Microsystems has several major advantages:
The backup is off-site, giving Microsystems added protection against a local disaster in the building, such as fire or theft.
The backup is on a RAID 5 array, giving added protection through data redundancy.
Microsystems can switch servers within a reasonable time frame.
Security settings are carried over to each site.
Disaster Recovery
Microsystems has built redundancy into its standard solutions
architecture to protect and maximise the availability of data to its
clients. In addition, Microsystems performs regular disaster recovery
testing. As part of this redundancy approach, Microsystems employs:
Uninterruptible power supplies.
Redundancy with the ISP, communication links and servers.
RAID servers.
As part of its continuing effort to improve its service and data security,
Microsystems plans to move its head office and data centre by mid 2002.
The new site is being developed to house the primary client servers in a
specially designed underground (and flood-proof) server bunker to provide
additional physical security. To ensure data integrity between these
systems, Microsystems validates the synchronization of data between
servers via a weekly software process. In addition, the primary
Microsystems and ISP servers are both hot swappable RAID 5 and, where
possible, Microsystems employs standard technologies and hardware for
interoperability in the event of a disaster.
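The weekly synchronization check could take roughly the following form. This is a hedged sketch, not the software process Microsystems actually runs: it assumes both servers' data is visible as directory trees, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Whole-file read keeps the sketch short; large files would
            # normally be hashed in chunks.
            digests[path.relative_to(root).as_posix()] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return digests


def sync_differences(primary: Path, replica: Path) -> dict[str, list[str]]:
    """Report files missing on either side and files whose contents differ."""
    a, b = tree_digests(primary), tree_digests(replica)
    return {
        "missing_on_replica": sorted(a.keys() - b.keys()),
        "missing_on_primary": sorted(b.keys() - a.keys()),
        "content_differs": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Three empty lists in the result indicate that the two trees hold the same files with the same contents.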
Identify Risks
What can happen?
Should an unexpected disaster occur, the major outcome could be serious
data loss, affecting Microsystems' normal operations and affecting
clients of Microsystems whose data is invaluable to their organisations.
How can it happen?
This can occur in various ways. A major fire, flood, theft or collapse
of the building could render all data irretrievable, hence the need
to ensure that there is a current backup of data off-site.
Hardware failure is the most realistic disaster that could occur. A
failure of one of Microsystems' file servers could result in complete
loss of data, regardless of the type of failure (e.g. hard-drive crash,
short circuit on motherboard etc.). The on-site backups ensure that the
time to restore operations to normal is minimised.
Analyse Risks
What measures do we have to prevent occurrence?
As previously explained, Microsystems ensures that there are always three
current backups of data. This prevents loss of data should a failure or
disaster occur. Naturally, a disaster cannot be predicted or prevented,
but implementing a disaster recovery plan minimises its impact on the
company and clients.
What are the chances of this happening?
A disaster can never be predicted. Although fire, theft etc. are rare,
they can and do happen. With three backups, Microsystems has ensured
that the effect on clients is minimal.
Hardware failure is a different issue. Whilst the majority of failures
can be prevented, wear and tear does occur, and consistently monitoring
the current status of hardware within the organisation provides
the opportunity to replace suspect hardware before a major failure occurs.
Again, the three backups provide redundancy should an unexpected hardware
failure occur.
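One simple way to flag suspect hardware is by age. The sketch below is an assumption about how such monitoring might be done, not a description of Microsystems' practice: the thresholds are the lower bounds of the typical wear-out ranges given for three components in the component table in this document, and the names are hypothetical.

```python
# Lower bound, in years, of the typical time to wear out for a few
# components, taken from the component table in this document.
WEAR_OUT_STARTS_AT_YEARS = {
    "power_supply": 3,
    "motherboard": 4,
    "hard_disk_drive": 3,
}


def suspect_components(ages_in_years: dict[str, float]) -> list[str]:
    """Return components whose age has reached the start of the wear-out range.

    Components without a known threshold are ignored rather than guessed at.
    """
    return sorted(
        name
        for name, age in ages_in_years.items()
        if name in WEAR_OUT_STARTS_AT_YEARS
        and age >= WEAR_OUT_STARTS_AT_YEARS[name]
    )
```

Age alone is a coarse signal; in practice it would be combined with whatever status monitoring the hardware itself provides.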
What are the consequences?
Should a disaster/failure occur, not only is there the inconvenience
to all clients but Microsystems could be liable for legal action from
clients should the disruption affect their operations.
Other consequences that may affect Microsystems are a loss of business
and a loss of confidence/trust by clients.
What is the level of Risk?
There are two issues to consider here: hardware and disaster prevention.
From a hardware perspective, Microsystems is continually examining the
latest technologies and implementing appropriate solutions, thus
substantially reducing the risk of hardware failure.
For disaster prevention, the level of risk is harder to determine, as a
disaster cannot be predicted.
However, measures have been taken to ensure that Microsystems minimises
the impact of a disaster. These measures include ensuring that the data
is kept behind a fireproof door, is not subject to flooding and meets
all the necessary quality assurance requirements of ISO 9002:1994.
Risk Consequences
Procedure in event of failure
In the event of a hardware failure, the process is as follows:
Remove/replace the affected hardware and test to see if the problem is resolved (for a hard-drive crash, go straight to the next step). If the problem is resolved, normal operations should resume.
Substitute the failed hardware with a temporary server.
Copy data to the substitute server.
Test the substitute server.
Ensure normal operations can be resumed.
Rectify the affected hardware.
Restore the repaired hardware and data, and remove the substitute server.
In the event of a disaster such as fire, the following process applies:
Organise replacement hardware/software.
Install required software.
Copy data to replacement hardware (server).
Test server.
Ensure normal operations can be resumed.
Identifying the problem
A disaster will destroy most, if not all, of the hardware and software
held on-site at Microsystems. In that situation, identifying the problem
is irrelevant.
Identifying a hardware problem is considerably more complex.
PC component failures fall into three main periods, chronologically:
Infancy: Many components fail very soon after they are put into service.
How long this takes depends on the component; for example, processors
sometimes fail as soon as they are first put into a system. Many other
parts fail within a week or a month of being put into use. Failures within
this period are caused by defects and poor design that make an item
genuinely bad. These are called infant mortality failures, and the
failure rate in this period is relatively high.
Normal Operating Life: If a component does not fail within its infancy,
it will generally remain trouble-free over its operating lifetime. The
failure rate during this period is typically quite low.
Wear out: After a component reaches a certain age, it enters the period
where it begins to wear out, and failures start to increase. When this
occurs is a matter of luck and also of how well the PC is cared for. For
example, processors tend to last years longer if they are operated in a
cool environment as opposed to a warm one. The period where failures
start to increase is called the wear out phase of component life.
The process of elimination used to determine the hardware fault follows
a set order. This order is reflected in the table below:
Component | Infant Mortality Rate | Typical Time to Wear out (years) | Likelihood of Failure Before Wear out | Likelihood of Obsolescence Before Wear out
Power Supply | Low | 3-6 | Moderate | Very Low
Motherboard | Moderate | 4-7 | Low | High
Processor | Low | 7+ | Very Low | Very High
System Memory | Moderate to High | 7+ | Very Low | High
Video Card | Low to Moderate | 5-7 | Low | High
Monitor | Low to Moderate | 5-7+ | Moderate to High | Very Low
Hard Disk Drive | Moderate to High | 3-5 | Moderate to High | Moderate
Floppy Disk Drive | Low | 7+ | Low | Low
CD-ROM Drive | Moderate | 3-5 | Moderate | High
Modem | Low | 5-7+ | Low | High
Keyboard | Very Low | 3-5 | Moderate | Low
Mouse | Very Low | 1-4 | Moderate to High | Very Low
Using the table to restore normal operations eliminates certain components.
Components that will not affect normal operations are the mouse, keyboard,
modem, CD-ROM drive, floppy disk drive, monitor and video card.
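The elimination described above can be sketched in code. The ratings and the list of non-essential components are copied from this document; ranking the remaining components by their likelihood of failure before wear out is an assumption of this sketch, since the text says only that the order is reflected in the table.

```python
# Likelihood of failure before wear out, copied from the table.
FAILURE_LIKELIHOOD = {
    "Power Supply": "Moderate",
    "Motherboard": "Low",
    "Processor": "Very Low",
    "System Memory": "Very Low",
    "Video Card": "Low",
    "Monitor": "Moderate to High",
    "Hard Disk Drive": "Moderate to High",
    "Floppy Disk Drive": "Low",
    "CD-ROM Drive": "Moderate",
    "Modem": "Low",
    "Keyboard": "Moderate",
    "Mouse": "Moderate to High",
}

# Components the text says do not affect normal operations.
NON_ESSENTIAL = {
    "Mouse", "Keyboard", "Modem", "CD-ROM Drive",
    "Floppy Disk Drive", "Monitor", "Video Card",
}

# Assumed ordering of the qualitative ratings, most likely to fail first.
RATING_ORDER = ["Moderate to High", "Moderate", "Low", "Very Low"]


def components_to_check() -> list[str]:
    """Essential components, most failure-prone first (ties keep table order)."""
    essential = [c for c in FAILURE_LIKELIHOOD if c not in NON_ESSENTIAL]
    return sorted(
        essential, key=lambda c: RATING_ORDER.index(FAILURE_LIKELIHOOD[c])
    )
```

Under these assumptions the hard disk drive is checked first, followed by the power supply, motherboard, processor and system memory.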
Resolving the problem
The process of elimination described above assists in resolving
the problem.
Typically, the process of resolving a problem depends on the particular
hardware. Listed below is a short description of the process for each
hardware component.
Power Supply: Some problems with power supplies can be repaired, but in
practice they rarely are. The main reason is economics: power supplies
are cheap, and they take only a few minutes to swap.
Motherboard: Motherboards are complicated multi-layer circuit boards and
cannot usually be repaired. Some simple problems can be fixed by the
manufacturer of the board; this usually means swapping a chip or other
component on the board for a replacement, but this is not often done.
Processor: A failed processor cannot be repaired; it needs to be replaced.
In the real world, an actual failure of a processor is extremely rare
unless it is abused, typically by insufficient cooling over a long period
of time.
System Memory: Memory chips cannot be repaired. Memory modules can be
repaired by a company with the right equipment, by diagnosing which chip
is flawed (assuming a failure of the memory and not the module circuit
board) and replacing it with a good chip.
Hard Disk Drive: Hard disk problems have very few solutions that are
available to anyone but the original manufacturer or specialized data
recovery firms.
The course of action taken depends on which component has failed. For
future reference, all failed components will be analysed to see what
preventative measures can be implemented.
What we do for the customer
In the event of failure/disaster, Microsystems will endeavour to advise
all clients of the estimated downtime and will update clients on the
operational status regularly until the problem is resolved.
If the failure/disaster relates to the Web Server, client access to it
will be affected. If the failure/disaster occurs on-site at Microsystems,
clients will not be visibly affected but may experience delays in data
updates.