OpenSearch Isnotnull() Bug: Aggregation With Text Fields

Alex Johnson
-
OpenSearch Isnotnull() Bug: Aggregation With Text Fields

Hey everyone, let's dive into a head-scratcher of a bug in OpenSearch that pops up when you're dealing with isnotnull() filters and aggregations on text fields. This one's a bit technical, but we'll break it down so it's easy to understand. If you're using OpenSearch and working with text data, especially with the keyword subfield, this might be something to keep an eye on.

The Core Issue: Where isnotnull() Goes Wrong

At the heart of the problem is how OpenSearch handles isnotnull() filters in queries that also involve aggregations, particularly when the field is a text type with a keyword subfield. Here's the deal: When you set up a text field in OpenSearch and then create a keyword subfield, you might also use ignore_above to set a limit on the length of the text that the keyword subfield will index. If you have a document where the description field is a text type, and the keyword subfield has ignore_above: 50, you will encounter the following situation: If you've got a description that is longer than 50 characters, the keyword field won't be populated for that document. When you run a PPL (Piped Processing Language) query like source=test-isnotnull-bug | where isnotnull(description) | stats count() by description, the result will contain null values in the aggregation output, even though the isnotnull(description) filter should exclude them. In short, OpenSearch incorrectly returns null values in aggregation results even when isnotnull() filter is applied to the field. This behavior messes up the results you see, especially when you're trying to count or analyze data. The query's expected behavior is to exclude the documents with null values, but the bug causes these null values to appear in the results. This can throw off the accuracy of your data analysis.

Digging into the Problem: The Technical Breakdown

Let's get a little more technical. When you use isnotnull(description), OpenSearch checks if the description field exists in the document. If the text field exists, the filter passes. However, the aggregation, which operates on the description.keyword subfield, sees null because the ignore_above setting caused the keyword subfield not to be populated for longer descriptions. This mismatch is what causes the bug. Specifically, the aggregation looks at the .keyword subfield, which is where the nulls are, not the main description field, which the isnotnull() filter uses. The missing_bucket: true setting in the aggregation further complicates things, as it includes a bucket for the missing values (which is null in this case). To really hammer this home, let's consider an example. Suppose we have a text field named description and the keyword subfield has ignore_above set to 50. Then let's have a few records with the following values:

  • Short description 1
  • This is a very long description that definitely exceeds the 50 character limit

When running the PPL query mentioned previously on this dataset, the expected outcome is that Short description 1 would be counted once, while This is a very long description that definitely exceeds the 50 character limit is excluded, because its keyword is null. However, the bug leads to null values appearing in your aggregated results, which is not what you want.

Reproducing the Bug: Step-by-Step

Reproducing this bug is pretty straightforward. Here's how you can do it:

  1. Create an Index: Start by creating an index with a text field that has a keyword subfield, setting ignore_above. For instance:

    {
      "mappings": {
        "properties": {
          "description": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 50
              }
            }
          }
        }
      }
    }
    
  2. Insert Data: Populate the index with data that includes descriptions longer than the ignore_above limit.

    {
      "description": "Short description 1",
      "value": 1
    }
    {
      "description": "This is a very long description that definitely exceeds the 50 character limit",
      "value": 100
    }
    
  3. Run the PPL Query: Execute the PPL query to trigger the bug.

    source=test-isnotnull-bug | where isnotnull(description) and description != '' | stats count() by description
    
  4. Observe the Results: You'll see null values in your aggregation results, despite the isnotnull() filter. The isnotnull() filter is correctly identifying the existence of a value in the main field. However, the subsequent aggregation is done on the subfield .keyword, where the value is null. Thus, when aggregating by this field, the missing values appear in the results.

Diving Deeper: The Root Cause

The root cause of this problem lies in how OpenSearch processes the isnotnull() filter and aggregations. The isnotnull() filter checks if the field exists, but the aggregation operates on the keyword subfield. Because the two operations reference different underlying data, the filter does not properly exclude null values in the final result. The system doesn't recognize that the isnotnull() condition should influence how aggregations handle missing values. Specifically, the OpenSearch code, in the file AggregateAnalyzer.java (lines 207-215), sets a bucketNullable flag that, by default, allows for the inclusion of missing buckets. This leads to missingBucket(true) in the CompositeAggregationBuilder.java (lines 72-74), causing null values to appear in the aggregated results. To fix it, the system needs to analyze the filters and adjust the bucketNullable setting to exclude missing buckets when there are isnotnull() filters on the aggregated field. This is the missing link: the system isn't smart enough to realize that the isnotnull() filter should tell the aggregation to ignore the missing values.

Possible Solutions: How to Fix It

While there's a bug to address in OpenSearch's code, here are some workarounds you can use in the meantime:

  • Increase ignore_above: Adjust the ignore_above setting to accommodate the longest possible description you expect to encounter. This way, the keyword subfield is always populated. But, this may not be the best option if you need to keep ignore_above small for performance reasons.
  • Use the Keyword Field Directly: If text analysis isn't essential, use the keyword field directly in your queries and mappings. This will avoid the need for the text field altogether. This is a great option if you do not need text analysis and it simplifies your queries.
  • Post-Filter Null Values: Filter out the null values in your application code after receiving the results from OpenSearch. This is a quick fix, but it requires additional processing. This is one way to avoid the bug but requires more processing in your code.
  • Remove ignore_above: Consider removing the ignore_above setting altogether if you need all values indexed. But this will negatively impact your performance if the documents have large text values.

The best solution is to enhance OpenSearch so it recognizes the isnotnull() filters and behaves correctly during aggregation. This will likely involve code changes to ensure the aggregation process respects the filter. OpenSearch's aggregation analyzer needs to extract fields with isnotnull() predicates and use this information to set the missingBucket option accordingly. This fix should make sure that the aggregated results don't include null values when filtered with isnotnull().

Conclusion: Staying Ahead

This bug highlights an important aspect of working with OpenSearch: understanding how your data is indexed and how filters interact with aggregations. Keep in mind the interplay between text fields, keyword subfields, and settings like ignore_above. Knowing about this issue will help you avoid unexpected results and ensure your data analysis is accurate. We're all hoping for a fix, but in the meantime, keep these workarounds in mind. Stay vigilant, and always double-check your results when dealing with potentially tricky combinations of filters and aggregations.

If you found this helpful, be sure to check out the OpenSearch documentation and related community forums for more information on best practices and troubleshooting tips. You can also check out the OpenSearch GitHub repository to keep up with the latest updates and bug fixes. Remember to test your queries thoroughly and keep an eye out for these types of issues. Happy searching!

For more information and related discussions, check out the official documentation:

You may also like