Using the Azure AI Search REST API to check whether a file has been indexed by its file name and extension

As a follow-up to my previous post:

Configuring a Logic App to run an Azure AI Search Service Indexer to index new documents for RAG


In that post I demonstrated a Logic App with actions that check the status of an AI Search Service indexer, using whether it is still running to decide whether a recently submitted file has been indexed. The design was based on some assumptions about how files are uploaded, and I acknowledged that this method likely isn’t the best way to handle scenarios where multiple files may be uploaded. Here is the relevant passage from that post:

With the copying of the blob from source to target configured, the next step is to trigger the AI Search Service Indexer to start indexing the storage account so the new document can be indexed. Depending on the purpose of the documents, the indexer may be configured to run on an hourly, daily, weekly, or some other schedule that has been communicated to the users, and if that’s the case, then we won’t need to trigger the indexer immediately. For the purpose of this example, we’re going to assume that document uploads are rare and the indexer does not run on a schedule, so new documents need to be indexed and searchable immediately. Let’s also assume that only one person uploads the documents, because if multiple users upload at the same time then several indexer runs can be requested and it would be difficult to identify which run completed for which Logic App execution (I’ve tried to see if there was a unique ID for the indexer execution that I could use but there did not appear to be one).

This stuck with me for a while, so I took a bit of time over the weekend to see if there were other ways to improve this. What I found was that the Search Documents API for Azure AI Search allows me to check whether a specific file has been indexed. The documentation can be found here:

Search Documents (Azure AI Search REST API)
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents

One of the POST requests described in the documentation lets you specify a filter in the request body to determine whether a file with a given file name has been indexed by the indexer. The POST request format is as follows:

POST https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=[api-version]

Content-Type: application/json
api-key: [admin or query key]

To provide an example, take the following index and its field names configured for the AI Search Service:

The title field in this index holds the full file name and extension of the file that has been indexed. To test this with Postman, we would configure a POST call with a filter where title equals the file name of the document:

https://dev-aisearch.search.windows.net/indexes/vector-policy/docs/search?api-version=2024-07-01

{ "search": "*", "filter": "title eq 'Traffic-Manager-Description.txt'" }
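For anyone who prefers a script over Postman, the same request can be sketched in Python with only the standard library. This is a minimal sketch under a few assumptions: the helper names (`odata_quote`, `build_search_request`, `is_indexed`, `check_file_indexed`) are my own, the service name, index name and API version are the ones from the example above, and the `title` field is assumed to be filterable. File names containing a single quote need that quote doubled inside the OData filter, which the first helper handles.

```python
import json
import urllib.request


def odata_quote(value: str) -> str:
    """Wrap a value as an OData string literal; single quotes are escaped by doubling."""
    return "'" + value.replace("'", "''") + "'"


def build_search_request(service: str, index: str, file_name: str,
                         api_key: str, api_version: str = "2024-07-01") -> urllib.request.Request:
    """Build (but do not send) the POST request that filters on the title field."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/docs/search?api-version={api_version}")
    body = {"search": "*", "filter": f"title eq {odata_quote(file_name)}"}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def is_indexed(response_body: dict) -> bool:
    """Treat the file as indexed when the search returns at least one document."""
    return len(response_body.get("value", [])) > 0


def check_file_indexed(service: str, index: str, file_name: str, api_key: str) -> bool:
    """Send the request and interpret the JSON response (performs a network call)."""
    request = build_search_request(service, index, file_name, api_key)
    with urllib.request.urlopen(request) as response:
        return is_indexed(json.load(response))
```

Calling `check_file_indexed("dev-aisearch", "vector-policy", "Traffic-Manager-Description.txt", "<query key>")` would send the same body as the Postman example, with the key being a placeholder you would replace with your own.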
If we were to search for a document name that is not present in the Azure Storage account and was not indexed, we would receive the following:
Having this functionality means that we can incorporate it into the Logic App to check 2 conditions:
  1. The Indexer status is not running
  2. Searching for the file name returns the confirmation that the file has been indexed

Putting this all together will result in the following Logic App workflow:

Here are the activities beginning from the Do – Until loop.
This action retrieves the status of the indexer:
This action retrieves the details of the specified file to determine whether it has been indexed (note that I had hardcoded the file name for testing in the HTTP API call in the screenshot below):
This should be updated to use a dynamic value that inserts the file name and extension, as follows:
This action, as demonstrated in my previous post, retrieves and stores the status of the indexer with the function:
body('HTTP_-_Get_AI_Search_Service')?['lastResult']?['status']
This action is where we retrieve and store the value of the title with the function:
body('HTTP_-_Get_AI_Search_Service_-_Search_Document')?['value']?[0]?['title']
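To make the intent of the two expressions above easier to see, here is a rough Python analogue of the lookups. It is only a sketch: the function names are my own, the input dictionaries are assumed to have the same shape as the two API responses, and returning an empty string for a missing value is my choice to match what the Logic App variables end up holding.

```python
from typing import Optional


def indexer_last_status(status_body: Optional[dict]) -> str:
    """Mirror ?['lastResult']?['status'], returning '' when any level is missing."""
    last_result = (status_body or {}).get("lastResult") or {}
    return last_result.get("status") or ""


def first_title(search_body: Optional[dict]) -> str:
    """Mirror ?['value']?[0]?['title'], returning '' when no document matched."""
    hits = (search_body or {}).get("value") or []
    if not hits:
        return ""  # no matching document in the index
    return hits[0].get("title") or ""
```

With these helpers, a search response whose `value` array is empty gives `''` for the title, which is the case the loop has to keep waiting on.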
Next we determine whether the API call has returned results for the file. If we’ve only just triggered the indexer for the uploaded file, it may not have finished yet, so this action checks whether a result was returned (note the condition specified) and, if not, waits 5 minutes (this can be adjusted based on how many files are uploaded and how frequently), then loops again to check whether the file has been indexed.
————————————————————————————————————————————————————–
Note that the value returned when a document is not found is an empty string and not the word null. Here is a sample output of the run:
The raw output can be a bit misleading because it shows null, but if you check the value, it is actually an empty string:
————————————————————————————————————————————————————–
The Until condition uses a function to keep looping until both of the following are true:
  1. DocSearchResult is not equal to an empty string, denoted with ''
  2. IndexerStatus is equal to success

The function for this is:

and(not(equals(variables('DocSearchResult'), '')), equals(variables('IndexerStatus'), 'success'))
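Outside of a Logic App, the same Do-Until logic could be sketched in Python as below. This is an illustration rather than a drop-in replacement: `should_exit` and `wait_until_indexed` are hypothetical names, the two callables stand in for the HTTP actions that fetch the indexer status and the document title, and the retry cap is my own addition so the loop cannot run forever. It also simplifies the workflow slightly by waiting whenever either condition is unmet.

```python
import time
from typing import Callable


def should_exit(doc_search_result: str, indexer_status: str) -> bool:
    """Python form of the Until expression: title found AND last indexer run succeeded."""
    return doc_search_result != "" and indexer_status == "success"


def wait_until_indexed(get_indexer_status: Callable[[], str],
                       get_doc_title: Callable[[], str],
                       wait_seconds: float = 300,
                       max_attempts: int = 12,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until both conditions hold; return False if max_attempts is exhausted."""
    for attempt in range(max_attempts):
        status = get_indexer_status()
        title = get_doc_title()
        if should_exit(title, status):
            return True
        if attempt < max_attempts - 1:
            sleep(wait_seconds)  # wait before the next check (5 minutes by default)
    return False
```

Passing `sleep` as a parameter is just a convenience so the loop can be exercised without actually waiting.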

Note that the action Set variable – Exit loop or not is just one I put in for troubleshooting and is not necessary.

I hope this provides more information on how to check whether a file has been indexed by the AI Search Service. I would like to acknowledge that this does not handle cases where a file with the same name already exists in the index; that scenario would need more logic (perhaps by retrieving the timestamps of the file in the storage account).
