ROOT CAUSE FAILURE ANALYSIS of ELECTRIC SUBMERSIBLE PUMPING SYSTEMS

ESP applications have tremendous potential, but are often blocked by a fixed view of how things should be done. Changing such a culture is not easy! “Regional Reliability Engineers” help ESP users as well as ESP manufacturers isolate the exact mental barriers that are causing these applications to fail prematurely. We point out the ways to keep equipment up and running by eliminating chronic failures and avoiding sporadic failures. We raise sensitivity to operating conditions, operating habits, and manufacturing procedures that can result in short runs, lost production, expensive warranties, and a drain on profits for both users and manufacturers. This type of change is upsetting to some and uncomfortable to most.

The objective of this paper is not to duplicate the work already done by Reliability Centers but to provide modifications to Root Cause Failure Analysis (RCFA) methods, specific to ESP applications.


The first and main objective of this paper is to help users and manufacturers comprehend the importance of, and the difference between, failure mode and cause of failure. The cause of failure is the underlying, or root, reason responsible for the actual failure. The second objective is to emphasize the importance of proper data collection.

Two field cases will be included representing examples of the following types of ESP failure modes and their corresponding Root Causes of Failure:

A. mechanical failure due to chemical treatments

B. electrical failure due to elastomer – well fluid incompatibility


Root Cause Failure Analysis - reason for lack of results

Over the past few years, Root Cause Failure Analysis (RCFA) has become an “antidote” to all problems. It has the potential to significantly improve run time between failures but, as with any new tool, there are also many new traps. The concept of RCFA, originally developed for manufacturing, is a simple three-stage process: select a failure that is important, analyze the failure, and then implement recommendations. While the procedure sounds easy, it also appears too good to be true. There are several reasons why the results sought utilizing RCFA in ESP applications are not achieved. These can be classified into five main categories: 

1. “Finger Pointing”

2. Lack of Efficient Utilization of Available Technical Expertise

3. Time Constraints

4. Lack of Proper Data Collection

5. Lack of Proper Result Implementation


The first and most important category is the lack of understanding that finding the real Root Cause of Failure is in the best interest of both the manufacturer and the user. When a failure occurs, often the user will opt to change the manufacturer instead of trying to work with the manufacturer on the problematic installation. On the other hand, the manufacturer prefers, or is eager, to supply equipment which is available off the shelf instead of customizing it to meet the specific needs of the application. Each party “blames” the other for the failure. This is a no-win situation for both parties. In most cases, the reason for the “finger pointing” is that there is an unclear answer as to why the ESP failed.


The second category involves the efficient utilization of available and/or required technical expertise from ESP users and manufacturers. In some cases the RCFA team requires expertise from several fields, such as equipment manufacturing, application design, production engineering, reservoir engineering, and operations. With the exception of manufacturing faults, the Root Cause of Failure cannot be delivered without the support and cooperation of both the equipment manufacturer and the equipment user within the RCFA team.


The third category, "Time Constraints", is a direct result of the urgency of unit replacement and installation. In most cases new equipment is expected to be back downhole in less than 48 hours. This often leaves too little time to perform proper RCFA, and hence the new units are installed without the modifications required to correct the problem. This defeats the main purpose of RCFA, which is to prevent future failures.


The fourth category is the lack of proper data collection. For whatever reason, there tends to be a lack of discipline in the approach taken to data collection. It is amazing how quickly failure data can disappear after a field failure has occurred. Generally, following such an incident, there is a lot of confusion. Nobody is sure what to do to preserve failure data or what data is required, and often nobody even considers preserving information. The main drive is to get the unit back online and into production. Often in the chaotic activities that follow, a lot of failure data can be destroyed or altered. Drive history is overwritten when attempts are made to restart equipment, drive settings are changed for many reasons, lubricants and other fluids are drained out at the rig floor, etc. Along with the loss of all the above-mentioned data, the chance of uncovering the true root causes of the incident disappears.

As the failure analyst cannot be present in all production fields at the same time, provisions have to be made to train operators and field service people to be failure data collectors. Getting to the failure data before it becomes corrupted, or lost, is the key to effective Root Cause Failure Analysis. The main problem is that as time passes following an incident, the raw sensory data that was taken in by the people who were at or around the failure scene starts to become distorted. People start to draw conclusions. If something they sensed doesn’t fit their mental models of what the scene should contain, they may discount it and only inform the failure analyst of their conclusions about what happened as opposed to providing the failure analyst with the raw data. Doing a good job of "PRESERVING FAILURE DATA" is a key step in conducting Root Cause Failure Analysis. There is no better way to avoid future incidents than to learn from past mistakes. 


The fifth and last category, and potentially the most dangerous, is the lack of a consistent reporting mechanism to inform decision-makers of our findings so that we can effect a real change in ESP manufacturing or application. It is at this stage that many failure analyses fail, either because the analysis team was unable to articulate its findings and recommendations to the decision-makers clearly and persuasively, or because the new ESP was already installed downhole.


Root Cause Failure Analysis - as it looks today in the ESP industry

Identifying the root cause of failures for ESP systems is a difficult, error-prone and time-consuming task. Each integral element of the ESP system can fail to perform as expected in many different ways, and each failure can, in turn, influence other elements of the ESP system, resulting in final system failure. The reservoir, surface facilities, or power grid can also fail to perform as expected and contribute to, or cause, the failure of the ESP system. Currently, more than 80% of a failure analyst’s time is spent trying to collect enough data, whereas less than 20% is spent determining the Root Cause of Failure (RCF). A single drive problem, for example, can give rise to many symptoms, some of which can propagate far from the source. The root cause shares many symptoms with other possible causes: production to surface can be lost, the electrical readings can be unbalanced, etc. The root cause is not obvious from looking at individual symptoms and may, in fact, be unobservable. Access to the reservoir does not exist; all that is observable are the symptoms that the problem has indirectly caused, which must be accurately correlated to reach the correct conclusion. Once the root cause has been identified, all the events caused by it are explained and need no further analysis. For example, if a failure has been identified as being the result of a chemical treatment, it can be inferred that other ESPs in this field that have been similarly treated may also fail in the same manner.


There are many types of information required to support root cause analysis. Two fundamental requirements for Root Cause Failure Analysis include sufficient knowledge of:

1. the particular problems that can cause each event, and 

2. the sequence of events that leads to each particular problem


This knowledge is difficult to acquire and maintain, as it involves two distinct areas of expertise:

1. A deep understanding of the failure modes and of the effects of the reservoir, production, operating procedure, application design and ESP component design, and

2. Knowledge of how these components are inter-related in each specific production system at each point in time.


From this, it is obvious that failure analysis cannot be properly completed without constant interaction between the ESP manufacturer and the ESP user. Each side has expertise in different areas of the ESP production system, and the two complement each other.


Root Cause Failure Analysis - proper approach to failure analysis

RCFA is a disciplined problem-solving methodology, used to determine root causes of specific failure events. The following process is necessary to implement a successful RCFA:

• Determine the failure mode. This is commonly mistaken for the root cause of failure. Mode of failure is how the failure surfaces, not why the failure happened.  It is very common that people accept the statement “motor burnt” without asking what caused it.

• Determine the failure cause. The Root Cause of Failure answers the question "Why?" and at the same time explains “how” it happened.

• Estimate the extent of the damage and the likelihood of additional failures. It is important to search for other potential ESP failures caused by the same Root Cause of Failure.

• Design and implement the appropriate corrective action; and

• Follow up to ensure that the corrective action is implemented, and then verify its effectiveness in preventing another failure.
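
The distinction between failure mode and root cause drawn in the steps above can be made explicit in the way a failure is recorded. The following Python sketch is illustrative only: the `FailureRecord` class, its field names, the well labels, and the `needs_root_cause` helper are assumptions introduced here, not part of any existing reporting system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FailureRecord:
    """Hypothetical record that keeps failure mode and root cause apart."""
    well: str
    failure_mode: str                         # how the failure surfaced
    root_cause: Optional[str] = None          # why it happened
    corrective_action: Optional[str] = None
    action_verified: bool = False             # follow-up step completed?
    similar_units_at_risk: List[str] = field(default_factory=list)


def needs_root_cause(record: FailureRecord) -> bool:
    """Return True when the record stops at the failure mode."""
    if record.root_cause is None or not record.root_cause.strip():
        return True
    # A "root cause" that merely repeats the mode explains nothing.
    return record.root_cause.strip().lower() == record.failure_mode.strip().lower()


# Hypothetical examples; the second mirrors field case A in this paper.
incomplete = FailureRecord(well="Well-A", failure_mode="motor burnt")
complete = FailureRecord(
    well="Well-B",
    failure_mode="pump locked",
    root_cause="bearing elastomer swollen by amine-based water treatment chemicals",
    corrective_action="replace rubber bearings with Ni-resist bushings",
)

print(needs_root_cause(incomplete))
print(needs_root_cause(complete))
```

A record such as "motor burnt" with no stated cause is flagged, which is the situation the first bullet warns against.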

The most critical part of Root Cause Failure Analysis is to determine the Root Cause of Failure. All too often, maximizing run life is not accomplished because ESP failures are not properly identified. The first flaw discovered in the failure of an ESP system is often given full responsibility for the failure. This method of analysis can result in a much shorter average run life in a given well and/or field. Maximizing the run life of ESPs can only be accomplished through proper analysis of failure modes and investigation of all aspects of the ESP system. It is important to note that when investigating a single failure, the entire field operation and procedures, along with the complete history of ESP performance in that field, must also be taken into account. Therefore, in order to implement a successful RCFA, the following steps must be undertaken:

• Shop personnel involved in manufacturing, field service, rig crew and operators must be aware of the importance of RCFA.

• Review previous Root Causes of Failure (RCF) for the given well. In many cases, a failure trend has already been identified that is specific to, for example, the well conditions, the field service technician involved in the equipment installation, or the operating procedure.

• Review previous pull history. It is often possible that some damage from previous installations was not caught during equipment testing and the current failure is actually the result of damage caused from the previous installation. A good example is motor failure due to insulation fatigue resulting from overheating due to a plugged pump. 

• Collect the production data for the first few weeks after installation, as well as for the last few weeks before the failure. It is extremely important that somebody closely monitors the ESP unit following start-up until the well has stabilized; the drive settings then need to be verified. It is also good practice to test to the separator as often as possible to get an accurate accounting of production. Well production calculated from field data is often misleading, especially in new and dynamic fields. Remember, there can never be too much good-quality production data. The production data from the initial weeks is used to verify the well information used for application design. Data from the weeks immediately preceding the failure can provide insight into potential changes in well performance or pump wear.

• Apply collected production data to original design and verify ESP operating conditions. This type of well maintenance can verify the reservoir data used for the original application design. Monitoring the application in this manner can often allow early detection of pump problems, or pending failure, and thereby prevent costly motor failures. Data collected should include operating parameters such as tubing and casing pressure, pumped fluid volume and composition, amps, drive output volts, operating speed, fluid level or BHP, BHT.

• Compare the drive start-up documentation attached to the installation report with the drive settings before failure. In many cases, especially with gassy wells, it has been observed that the underload (UL) settings on the drive were changed, which prevented shutdown of the drive when the pump gas locked. Another reason to review the documentation is the common practice of using an oversized motor in low-flow, hot wells to prevent motor overheating. Unfortunately, in these cases the motor operating current can be very close to the idling current, and thus the UL settings will not protect the equipment in case of pump plugging. In these cases, it is recommended to use a flow switch at the wellhead to improve equipment protection.

• Examine the drive repair history. Faulty drives or low-quality incoming power can be the instigators of electrical failures; often the drive at the end of the power grid is exposed to the worst operating conditions. Most often, following examination, incoming power problems become very obvious, and isolation of these problems can greatly assist in maximizing ESP run life.

• Perform drive amp-chart analysis, or analysis of data available from data collectors or directly from the drive itself. It is important that operators include all required data on the amp-charts; otherwise the use of the amp-charts becomes very limited. Information that should, ideally, be recorded on amp-charts includes: mode of VSC operation, operating speed, recorded amps from all three phases, downhole pressure, drive output voltage, transformer ratio, casing and tubing pressure. Many amp-charts are missing the most important information: location, when the chart was put on, and when it was taken off!

• Download the drive history from drive memory before attempting to restart the ESP. Often when a unit fails, the first impulse is to attempt a restart, and more often than not, multiple restarts are attempted. If these attempts occur before the drive history has been downloaded, important information can be lost, especially if the shutdown is actually the reason for the pull. Downloading drive history should be a part of operating procedures.

• Evaluate the information available from the pull report (all fluid should be drained on the rig floor in vertical position if the motor-seal assembly cannot be sent to shop for air testing).

• Verify compatibility between the well treatment chemicals used and the materials used within the ESP, if this was not done before well treatment.

• Verify compatibility between downhole conditions and materials used within ESP.  Examples of incompatibility include Aflas with condensates; Viton with amine-based well treatments; most stainless steels in a sour environment, etc.

• Test, disassemble and inspect failed equipment as required. The dismantle inspection is a critical part of the evidence-gathering process necessary to support a successful failure analysis. All evidence should be documented, regardless of its relative importance to the inspector at the time. No interpretation of the evidence should be made during the dismantle inspection, so as not to bias the evidence-gathering process toward a specific cause of failure. Therefore, it is best not to begin the failure analysis until the evidence-gathering process is complete.

• Have the RCFA team deliver the Root Cause of Failure.

• Look for failure trends within manufacturing, well conditions, or operating methods.
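
One of the checks in the list above, the comparison of underload (UL) settings with motor current, lends itself to a small worked example. The sketch below is a simplified illustration, not a drive manufacturer's algorithm. It assumes that motor current falls toward the idling current when the pump gas locks or plugs, and that the drive trips when current drops below the UL setting. The function name, its parameters, and the 10% margin threshold are hypothetical values chosen for illustration.

```python
def underload_protection_status(idle_amps, running_amps, underload_setting_amps,
                                min_margin_fraction=0.10):
    """Classify whether an underload (UL) setting can detect loss of flow.

    Assumptions (illustrative only): on gas lock or a plugged pump the motor
    current falls toward idle_amps, and the drive trips when current drops
    below underload_setting_amps. min_margin_fraction is not a standard value.
    """
    if not (0 < idle_amps < running_amps):
        raise ValueError("expected 0 < idle_amps < running_amps")
    if underload_setting_amps <= idle_amps:
        # Current can never fall below the setting, so the drive never trips.
        return "unprotected: setting is at or below idling current"
    if underload_setting_amps >= running_amps:
        return "nuisance trips: setting is at or above normal running current"
    # Relative gap between normal running current and idling current.
    margin = (running_amps - idle_amps) / running_amps
    if margin < min_margin_fraction:
        return "marginal: running current is close to idling current; consider a flow switch"
    return "protected"


# Hypothetical readings in amps.
print(underload_protection_status(idle_amps=20, running_amps=40, underload_setting_amps=32))
print(underload_protection_status(idle_amps=20, running_amps=40, underload_setting_amps=18))
print(underload_protection_status(idle_amps=38, running_amps=40, underload_setting_amps=39))
```

The third call represents the oversized-motor situation described above: the setting sits between idling and running current, but the gap is so small that the protection cannot be relied upon.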


Root Cause Failure Analysis and its Relationship to ESP Preventive Maintenance

We live in working environments where it is difficult, if not career limiting, to say "no" to job assignments from bosses and colleagues. Pulling a unit before catastrophic failure is normally not done, except in rare instances. Units are rarely pulled until total electrical failure occurs and the unit will not restart. Many times these restart attempts, made after the system has gone to ground, destroy the evidence needed to determine the reason for failure. Determining that reason becomes extremely difficult when there is more than one burn location in the system. These restart attempts also degrade the integrity of the cable and can promote future short-run failures when the same equipment is re-installed.

Proper diagnostics of a unit that shuts down on overload could, in many instances, reduce repair costs, provided the unit is pulled before a restart is attempted. However, this is not the general procedure in the field. Many lease operators do not have the equipment or training to troubleshoot ESP systems after they have gone down and, being pressured for production, will automatically attempt a restart as soon as they discover the failure.

Producing wells can be monitored, and corrections to operating conditions such as incoming power, wellhead pressure, casing pressure, etc., can reduce the stresses on an ESP system and increase its life. Also, use of proper monitoring techniques can aid in determining when equipment needs replacing and can reduce repair costs if units are pulled and/or resized prior to catastrophic failure. The operator is the most important determining factor in run life and is probably the least trained in the operating and design limitations of ESP systems.

ESP Preventive Maintenance should include:

• Collection of production data and operating parameters such as tubing and casing pressure, pumped fluid volume and composition, amps, drive output volts, operating speed, fluid level, BHT as often as possible (at least biweekly)

• Application of collected production data and operating conditions to original design to verify ESP operating conditions

• Verification of compatibility between chemicals proposed for well treatments and materials used within ESP. Published data should be sufficient in most cases but it is recommended to make use of test coupons for evaluation, as there are many grades of a particular elastomer, for example, and they can react differently.

• Verification of compatibility between downhole conditions and materials used within ESP prior to installation. It is not unusual for the reservoir conditions to change during the life of the field. When starting enhanced recovery techniques, the ESP system must be monitored frequently because of the potential for causing the unit to operate outside the design parameters.

• Consultation between Customer/User and ESP manufacturer when well optimization is the desired goal.
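
The compatibility checks in the list above can be supported by a simple screening table. The Python sketch below is a minimal illustration that uses only material and fluid pairs named in this paper. The table name, the function name, and the label strings are assumptions made here. An empty result does not prove compatibility, since grades of the same elastomer can react differently and coupon testing is still recommended.

```python
# Illustrative only: pairs this paper names as incompatible
# (Aflas with condensates, Viton with amine-based treatments,
# and the Nitrile bearings attacked in field case A).
KNOWN_INCOMPATIBLE = {
    ("aflas", "condensate"),
    ("viton", "amine-based treatment"),
    ("nitrile", "amine-based treatment"),
}


def flag_incompatibilities(materials, exposures):
    """Return sorted (material, exposure) pairs found in the screening table.

    An empty list means "nothing flagged", not "proven compatible".
    """
    mats = {m.strip().lower() for m in materials}
    exps = {e.strip().lower() for e in exposures}
    return sorted(
        pair for pair in KNOWN_INCOMPATIBLE
        if pair[0] in mats and pair[1] in exps
    )


# Hypothetical pump bill of materials screened before a planned treatment.
flags = flag_incompatibilities(
    materials=["Nitrile", "Ni-resist"],
    exposures=["Amine-based treatment"],
)
print(flags)
```

Running such a screen before a well treatment, rather than after a failure, is the preventive use of the same information that RCFA gathers after the fact.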

Thus, many of the methods used during RCFA can be applied as Preventive Maintenance in the ESP field.


Root Cause Failure Analysis - benefits of alliance with ESP manufacturer

Why do producing companies decide to align with a single ESP manufacturer? Below are a few possible reasons why producers choose to create a ‘marriage’ with an ESP manufacturer.

• Lower Up-front Pricing (probably viewed as number one)

• Extended Running Guarantee

• Single Source Equipment

• Inventory Management


The above reasons for an alliance are very valid, but there is one area that may be overlooked when the terms of the alliance are agreed upon. This is a relationship between two companies, and too often each company leaves all the responsibility to the other party. The producer expects the ESP manufacturer to look after everything, or the ESP manufacturer just sells equipment to the producer. The result? ESPs come out, ESPs go in, probably at an increasing frequency. Who is looking into why the system failed and what is going to be done to prevent it from happening again? The lack of such follow-up can be detrimental to the ultimate goal of reducing operating costs to extend the producing life of oil-producing properties. If operating costs are not reduced, operators may have to abandon fields prematurely, and the ESP manufacturers won’t be selling equipment any more.


As pointed out above, both parties are required to work together, and this requires resources dedicated by both the ESP manufacturer and the producer. Things just don’t happen by themselves. There have to be people dedicated to, and held accountable for, making the alliance work effectively. RCFA can require a lot of information, and the journey that this information makes from the source to the result is a very winding road with many obstacles and detours.


Root Cause Failure Analysis - field cases

A) Mechanical Failure Caused by Chemical Treatments

The unit was taken off line for a new drive installation. Production was apparently good, with no problems noted before shutdown. Field Service could not restart the unit after replacement of the drive. The pump was pulled and was found to be "locked" in the field. However, 5 hours after the pull, when the pump was checked in the shop, it was found to rotate freely. It appeared that as soon as fluid came in contact with the pump, it seized and the motor could not start the unit. It was suspected that some new chemicals that the customer was using to treat their water might have been causing swelling of the rubber bearings in the pump. Amine-based materials were found in the water treatment chemicals. The rubber bearings used in the pump were made from Nitrile elastomers. When the chemical treatment was applied, the rubber swelled on contact with the pumped fluid. Replacement of the marine-type rubber bearings with Ni-resist bushings resulted in no further problems with this unit. While theoretical analysis of the compatibility between the rubber and the water treatment chemicals supported this theory, field experiments, performed at room temperature and pressure, did not. The experiment was repeated in a laboratory environment using production fluid at operating temperature and pressure. As expected, the fluid used by the customer for well treatment attacked the bearing elastomer, causing swelling and the formation of a tacky film. This explained the seizure of the pump after contact with the pumped fluid. It was further confirmed that this reaction starts at approximately 159°F, the temperature observed at the pump intake, thereby explaining the failure of the field experiment. Proving the compatibility problem was not a simple task, as it was observed that as the temperature drops, the tacky film resulting from the elastomer-fluid reaction becomes soluble in water.


B) Electrical Failure Caused by Elastomer – Well Fluid Incompatibility

There were two units operating in the same environment. The first unit ran for 258.3 hours. It had eight starts and failed 28 minutes after the last restart. The second unit ran for 532.6 hours. It had 12 restarts and failed 90 minutes after the last start. The oil is very light, with a gravity of 52° API. Electrical shorting at the pin connection between the tandem motors was the mode of failure in both cases. The seal design incorporated six bags in three sections, with two parallel bags in each section.

During equipment disassembly it was found that all bags had been subjected to chemical attack. All were permanently deformed and extended in length by about one inch. The swelling (and extruding) around the clamp caused additional shear stress in the bags at the clamped edge. The material tear resistance (shear strength) was also significantly reduced. Furthermore, it was observed that the relief part of the check valves had been blocked by the swollen bag. The deterioration of the material properties of the bag and a high internal pressure buildup from motor heating during operation jointly contributed to the bag rupture. During a lab test with live production fluid, it was observed that the specific chemical formulation of the elastomer utilized was not compatible with the light crude in this well.



Root Cause Failure Analysis - conclusion

While a vast amount of information can exist, most of the time we are still working primarily with results obtained from equipment disassembly. In 35% of the cases, this might be sufficient. In 20% of the remaining cases, the RCF can be based on additional information collected by the ESP manufacturer. The key to proper RCFA, however, most often lies in information available directly from the ESP users. Unfortunately, in many cases, detailed data is either not easily available or does not exist at all. Further improvement in RCFA in the ESP field is undoubtedly in the hands of the ESP users. More manufacturer-user RCFA teams working together are becoming necessary to extend ESP run times. In order to succeed in the ever-growing global marketplace, ESP manufacturers must shift from being simple equipment providers to being more service oriented. Teamwork is the key to a successful future.