Robotic Process Automation (RPA)
Targeting Reliable UI Elements with Robust Selectors and OCR
Discover advanced techniques for element identification using anchor-based selectors, computer vision, and OCR to handle dynamic or non-standard interfaces.
The Selector Fragility Problem in Modern Automation
Traditional automation relies on stable identifiers like IDs, class names, or XPath expressions to interact with interface elements. In a perfect world, these attributes remain constant across software updates and user sessions. However, modern web frameworks and legacy desktop applications often generate dynamic attributes that change every time the page refreshes or the application restarts.
When an automation script fails because a button ID changed from submit-001 to submit-002, we encounter the selector fragility problem. This instability creates high maintenance costs for developers who must constantly update scripts to match the evolving UI. Relying solely on static attributes is no longer a viable strategy for enterprise-grade robotic process automation.
To build resilient bots, we must shift our mental model from finding an element by its name to finding it by its context. This approach mirrors how humans navigate interfaces by looking for visual landmarks and relative positions. By understanding the underlying structure and visual layout, we can implement identification strategies that survive many UI changes that would break attribute-based selectors.
The robustness of an automation suite is inversely proportional to its reliance on absolute attributes that the developer does not control.
Why Static Selectors Fail in Legacy and Web Environments
Legacy applications often lack a structured accessibility tree, meaning they present themselves to the operating system as a single canvas or a collection of generic containers. In these scenarios, traditional inspection tools cannot see individual buttons or text fields, leaving developers with no attributes to target. This is particularly common in mainframe emulators or older Delphi and VB6 applications.
Modern web applications introduce a different challenge through CSS-in-JS libraries and obfuscated class names. These tools generate hashed class names to scope styles, which helps developers at build time but breaks automation scripts at runtime. If your selector targets a hash that changes with every deployment, your bot will break in the next CI/CD cycle.
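One defensive habit is to skip class names that look generated when building a selector. The sketch below is heuristic: the regex (any hyphen- or underscore-separated segment of five or more characters that mixes letters and digits counts as a hash) is an assumption, not a rule any framework guarantees, and `looks_generated` and `stable_css_selector` are hypothetical helpers. The preference for `data-testid`, `aria-label`, and `name` is likewise a judgment call.

```python
import re

# Assumed heuristic: a segment like "1x9ab2" (letters + digits, 5+ chars) is a build hash.
_HASH_SEGMENT = re.compile(r"^(?=.*\d)(?=.*[a-zA-Z])[a-zA-Z0-9]{5,}$")


def looks_generated(class_name: str) -> bool:
    """Return True if any hyphen/underscore-separated segment looks like a hash."""
    segments = re.split(r"[-_]", class_name)
    return any(_HASH_SEGMENT.match(seg) for seg in segments)


def stable_css_selector(tag: str, attrs: dict) -> str:
    """Build a CSS selector from the most stable attributes available."""
    # Prefer attributes that teams usually set deliberately for testing or accessibility.
    # (Values containing double quotes are not escaped in this sketch.)
    for key in ("data-testid", "aria-label", "name"):
        if attrs.get(key):
            return f'{tag}[{key}="{attrs[key]}"]'
    # Otherwise keep only class tokens that do not look machine-generated.
    classes = [c for c in attrs.get("class", "").split() if not looks_generated(c)]
    if classes:
        return tag + "".join(f".{c}" for c in classes)
    # Nothing stable: the caller should fall back to an anchor or another strategy.
    return tag
```

For example, `stable_css_selector("button", {"class": "css-1x9ab2 btn-primary"})` keeps only the hand-written `btn-primary` class.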
Relational Positioning with Anchor Selectors
Anchor-based selection is a technique where you identify a stable element and use it as a reference point to find a nearby target. This is highly effective for forms where labels are static but input fields have dynamic properties. Instead of searching for the input field directly, the bot searches for the label text and then looks for the nearest editable box.
This method creates a semantic link between the label and the field, which mimics the way a human user processes a form. Even if the internal ID of the text box changes, its physical relationship to the label usually remains the same. This spatial awareness allows the bot to adapt to layout shifts that might move the entire form block without breaking the internal logic.
```python
from selenium.webdriver.common.by import By


def find_input_by_label(driver, label_text):
    # Locate the stable anchor element using its visible text
    # (label_text must not contain a single quote, or the XPath literal breaks)
    anchor = driver.find_element(
        By.XPATH, f"//label[contains(normalize-space(.), '{label_text}')]"
    )

    # Search relative to the anchor: the first <input> sibling that follows the label.
    # This avoids relying on dynamic IDs or changing name attributes.
    target_input = anchor.find_element(By.XPATH, "./following-sibling::input[1]")

    return target_input


# Usage in a realistic procurement system scenario
# (assumes `browser` is an already-initialized Selenium WebDriver)
account_field = find_input_by_label(browser, 'Billing Account Number')
account_field.send_keys('99887766')
```

When implementing anchors, define both the search radius and the direction of the relationship. Most RPA tools let you specify whether the target lies to the right, left, top, or bottom of the anchor. Refining these parameters prevents the bot from interacting with the wrong field when multiple inputs are grouped closely together.
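To make direction and radius concrete, here is a tool-agnostic sketch that works on bounding boxes alone. The `Box` type and `find_relative` function are hypothetical; in a real bot the boxes would come from the RPA tool's UI tree or from OCR output. "Right" here means the candidate's center is to the right and its vertical extent overlaps the anchor's; "below" uses horizontal overlap.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float

    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def _overlaps(a0, a1, b0, b1):
    """True if the intervals [a0, a1] and [b0, b1] intersect."""
    return a0 <= b1 and b0 <= a1


def find_relative(anchor: Box, candidates: List[Box], direction: str,
                  radius: float) -> Optional[Box]:
    """Return the closest candidate in `direction` of `anchor` within `radius` px."""
    ax, ay = anchor.center()
    best, best_dist = None, math.inf
    for c in candidates:
        cx, cy = c.center()
        dx, dy = cx - ax, cy - ay
        same_row = _overlaps(anchor.y, anchor.y + anchor.h, c.y, c.y + c.h)
        same_col = _overlaps(anchor.x, anchor.x + anchor.w, c.x, c.x + c.w)
        # An unknown direction raises KeyError, which surfaces configuration typos early.
        in_direction = {
            "right": dx > 0 and same_row,
            "left": dx < 0 and same_row,
            "below": dy > 0 and same_col,
            "above": dy < 0 and same_col,
        }[direction]
        dist = math.hypot(dx, dy)  # center-to-center distance
        if in_direction and dist <= radius and dist < best_dist:
            best, best_dist = c, dist
    return best
```

Tightening `radius` is what stops the bot from drifting to a neighboring field: a candidate that is in the right direction but too far away is simply rejected.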
Multi-Anchor Strategies for Complex Tables
Data grids and tables represent one of the hardest UI patterns to automate because they often reuse identical elements across hundreds of rows. Using a single anchor might not be enough to pinpoint a specific cell, especially in infinite-scrolling lists. In these cases, you can use a dual-anchor strategy by intersecting a row header and a column header.
By locating the unique ID in the first column and the target column's header, the bot can compute where the row and column intersect: the header supplies the horizontal position and the row ID supplies the vertical position. Because both anchors are looked up by their text on every run, this logic remains valid even if columns are reordered or new rows are inserted dynamically. It reliably identifies the correct cell regardless of table size, provided the row ID and header text are unique and the row has been scrolled into view.
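The intersection step can be sketched in a few lines, assuming the bot has already collected the visible column headers and row keys with their bounding boxes (from the UI tree or OCR). `cell_center` and the `(left, top, width, height)` tuple layout are hypothetical choices for illustration.

```python
from typing import Dict, Tuple

BoxT = Tuple[float, float, float, float]  # (left, top, width, height)


def cell_center(column_headers: Dict[str, BoxT], row_keys: Dict[str, BoxT],
                column: str, row: str) -> Tuple[float, float]:
    """Intersect a column header (x) with a row anchor (y) to get a cell's center."""
    if column not in column_headers:
        raise LookupError(f"column header {column!r} not visible")
    if row not in row_keys:
        raise LookupError(f"row key {row!r} not visible; scroll it into view first")
    hx, _, hw, _ = column_headers[column]  # x comes from the column header
    _, ry, _, rh = row_keys[row]           # y comes from the row's unique ID
    return (hx + hw / 2, ry + rh / 2)
```

Because the header positions are re-read each time, swapping the order of two columns changes only the x-coordinate the function returns, not the logic.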
Optical Character Recognition for Dynamic Text
Optical Character Recognition (OCR) bridges the gap between raw pixels and structured data by extracting text from screen images at runtime. This is essential for automating legacy systems that display critical information inside unsearchable bitmaps or protected PDF viewers. By treating the screen as a source of text, bots can make decisions based on what is actually displayed on the interface.
OCR engines like Tesseract or cloud-based services from AWS and Azure offer different trade-offs between accuracy and speed. Local engines avoid network round trips and keep privacy-sensitive data on the machine, while cloud engines typically offer better accuracy on handwriting or distorted text. Choosing the right engine depends on the latency requirements of your automation and the complexity of the visual data.
```python
import pyautogui
import pytesseract
from pytesseract import Output


def click_text_on_screen(target_text):
    # Capture the current screen as a PIL image
    screenshot = pyautogui.screenshot()

    # Run Tesseract (via pytesseract; the tesseract binary must be installed)
    # to get word-level text together with bounding boxes
    data = pytesseract.image_to_data(screenshot, output_type=Output.DICT)

    for i, word in enumerate(data['text']):
        # Tesseract reports one entry per word, so this matches single words only
        if target_text in word:
            # Calculate the center of the word's bounding box.
            # On scaled (HiDPI) displays, screenshot pixels may not equal click coordinates.
            x = data['left'][i] + data['width'][i] // 2
            y = data['top'][i] + data['height'][i] // 2
            pyautogui.click(x, y)
            return True

    return False  # Text not found on current screen
```

The primary drawback of OCR is its high computational cost compared to selector-based identification. Running an OCR scan on every frame can significantly slow down a bot, especially on low-powered virtual machines. To optimize performance, limit the OCR search area to a specific region of interest rather than scanning the entire desktop.
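When OCR runs on a cropped region, the engine reports coordinates relative to the crop, so they must be shifted back to screen space before clicking. This is a minimal sketch of that translation; `Region`, `OcrWord`, and `locate_in_region` are hypothetical names, and the word list stands in for whatever the chosen OCR engine returns.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    left: int  # screen coordinates of the region of interest
    top: int
    width: int
    height: int


@dataclass
class OcrWord:
    text: str
    left: int  # coordinates relative to the cropped region, as OCR reports them
    top: int
    width: int
    height: int


def locate_in_region(words: List[OcrWord], region: Region,
                     target: str) -> Optional[Tuple[int, int]]:
    """Return the screen-space center of the first word containing `target`."""
    for w in words:
        if target in w.text:
            # Center of the word inside the crop, shifted by the crop's screen offset
            x = region.left + w.left + w.width // 2
            y = region.top + w.top + w.height // 2
            return (x, y)
    return None
```

With pyautogui, the crop itself can be captured as `pyautogui.screenshot(region=(r.left, r.top, r.width, r.height))`, which hands the OCR engine a much smaller image than a full-desktop capture.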
Handling OCR Inaccuracies
OCR is rarely 100 percent accurate, as it can easily confuse characters like the number zero and the letter O. To mitigate this, developers should use fuzzy matching algorithms when searching for specific keywords. Instead of looking for an exact string match, the bot can check if the extracted text is within a certain Levenshtein distance of the target word.
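A minimal pure-Python sketch of this fuzzy matching: the standard dynamic-programming Levenshtein distance plus a helper that picks the closest OCR token within a tolerance. The `max_distance=1` default and the case-insensitive comparison are assumptions; in production, a maintained library such as `rapidfuzz` is typically faster.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb)  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_find(target: str, words: Iterable[str], max_distance: int = 1) -> Optional[str]:
    """Return the OCR token closest to `target` within `max_distance` edits, or None."""
    best, best_d = None, max_distance + 1
    for w in words:
        d = levenshtein(target.lower(), w.lower())
        if d < best_d:
            best, best_d = w, d
    return best
```

A token misread as `INV0ICE` (zero instead of the letter O) is one substitution away from `INVOICE`, so it still matches with the default tolerance.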
Another strategy involves pre-processing the image to increase contrast or remove background noise before sending it to the OCR engine. Converting the target area to grayscale and applying a threshold filter can significantly improve character recognition rates. These small image adjustments often make the difference between a failing bot and a reliable production automation.
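To show what grayscale conversion and thresholding actually do, here is a dependency-free sketch that operates on an image stored as rows of `(R, G, B)` tuples. The 0.299/0.587/0.114 luma weights are the ITU-R 601 weights Pillow documents for `Image.convert("L")`; using the mean gray level as the default threshold is an assumption, a simple stand-in for adaptive methods such as Otsu's.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB rows to 0-255 gray levels using ITU-R 601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in image]


def binarize(gray: List[List[int]], threshold: Optional[float] = None) -> List[List[int]]:
    """Map each pixel to pure black (0) or white (255) to strip background noise."""
    pixels = [v for row in gray for v in row]
    if not pixels:
        return []
    if threshold is None:
        threshold = sum(pixels) / len(pixels)  # assumed default: global mean
    return [[255 if v >= threshold else 0 for v in row] for row in gray]
```

With Pillow installed, the equivalent one-liner is `img.convert("L").point(lambda p: 255 if p >= t else 0)`, which keeps dark text black on a white background, the input most OCR engines handle best.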
Building Resilient Multi-Layered Identification Logic
The most sophisticated RPA solutions do not rely on a single identification method but instead use a tiered fallback system. This architectural pattern attempts to find an element using the fastest and most precise method first, such as a CSS selector. If that fails, it automatically falls back to more expensive or broader methods like anchor-based selectors or computer vision.
Implementing a fallback hierarchy keeps the bot functional even when one layer of the UI changes. For example, if a developer changes a button's class but keeps its icon the same, the computer vision (CV) layer can still match the element after the selector layer fails. This self-healing behavior reduces the need for manual intervention and increases the overall uptime of the automation pipeline.
When designing these systems, developers must also consider the timeout and retry logic for each layer. If every layer waits thirty seconds before failing, the cumulative delay for a single missing element could become minutes. Balancing the search duration against the likelihood of success is a key skill for intermediate RPA engineers.
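One way to cap that cumulative delay, sketched under stated assumptions: each layer polls with its own timeout, but never beyond what remains of a shared budget for the whole lookup. The helper names `poll` and `find_with_budget` and the 0.25-second polling interval are hypothetical choices for illustration.

```python
import time
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def poll(find: Callable[[], Optional[T]], timeout: float,
         interval: float = 0.25) -> Optional[T]:
    """Call `find` repeatedly until it returns non-None or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        time.sleep(min(interval, remaining))


def find_with_budget(layers: Sequence[Tuple[Callable[[], Optional[T]], float]],
                     total_budget: float) -> Optional[T]:
    """Try each (finder, per-layer timeout) in order without exceeding total_budget."""
    start = time.monotonic()
    for find, layer_timeout in layers:
        remaining = total_budget - (time.monotonic() - start)
        if remaining <= 0:
            break  # the shared budget is spent; skip the remaining layers
        result = poll(find, min(layer_timeout, remaining))
        if result is not None:
            return result
    return None
```

With three layers at thirty seconds each, an unbounded chain could wait ninety seconds; a `total_budget` of, say, twenty seconds guarantees a missing element is reported in about twenty seconds regardless of how many layers exist.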
Resilience in RPA is not about finding a perfect selector; it is about building a system that knows how to find its way home when the map changes.
Implementing a Fallback Identification Wrapper
A common pattern is to wrap the element finding logic in a utility function that handles the various strategies internally. This keeps the main business logic of the bot clean and readable, as it only needs to call a single find_element function. The utility then manages the complexity of trying different selectors and logging the results for the developer.
This centralized approach also makes it easier to update the identification strategy across the entire project. If a new, more efficient OCR engine becomes available, you only need to update the wrapper function rather than touching hundreds of individual automation steps. This modular design is a hallmark of professional software engineering applied to the world of automation.
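A minimal sketch of such a wrapper, assuming each strategy is a zero-argument callable that returns the element or `None`. The strategy names, this `find_element` signature, and raising `LookupError` when every layer fails are illustrative assumptions, not any particular RPA tool's API.

```python
import logging
from typing import Any, Callable, List, Optional, Tuple

logger = logging.getLogger("rpa.locator")

Strategy = Tuple[str, Callable[[], Optional[Any]]]


def find_element(strategies: List[Strategy]) -> Any:
    """Try strategies in order (cheapest first) and log which one succeeded."""
    for name, strategy in strategies:
        try:
            element = strategy()
        except Exception as exc:  # deliberate: one broken layer must not stop the chain
            logger.warning("strategy %s raised %r; falling back", name, exc)
            continue
        if element is not None:
            logger.info("element found with strategy %s", name)
            return element
        logger.info("strategy %s found nothing; falling back", name)
    tried = ", ".join(name for name, _ in strategies)
    raise LookupError(f"element not found by any strategy: {tried}")
```

Business logic then reads as a single call such as `find_element([("css", ...), ("anchor", ...), ("ocr", ...)])`, and replacing the OCR engine only touches the callable registered under `"ocr"`.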
