Saturday, 6 October 2018

Using Azure Functions, Cognitive Services and Flow for classifying Office 365 SharePoint Word Documents - Part I

This article series works through a specific use case: extracting the content of Word documents uploaded to Office 365 SharePoint libraries and then analyzing that content using Azure Cognitive Services.

A previous post covered extracting tags and metadata properties from image files in Office 365 SharePoint using Microsoft Flow and Azure Cognitive Services.

Microsoft Flow has a Get file content action, but it does not help with extracting the content of Word documents. In a straightforward way, it only supports plain text files (such as those created in Notepad). Since Microsoft Flow does not provide an option to read Word document content, we will use an Azure Function to extract it. Once we have the content, we will use Azure Cognitive Services to get tags for the extracted text. Microsoft Flow handles the trigger and the subsequent actions. The overall approach is as follows.

High level architecture for classifying SharePoint Word Documents

  1. A Word document is uploaded to an Office 365 SharePoint library.
  2. Microsoft Flow listens to the library for document uploads, using the "When a file is created" trigger.
    1. Extract the properties of the document (such as File Path, Id, etc.)
  3. Call the Azure Function to read the document content.
  4. In the Azure Function, read the content of the document using the file path, with the help of the client context.
  5. Send the document content back to Microsoft Flow.
  6. Analyze the text to extract key phrases using Azure Cognitive Services.
  7. Update the SharePoint item (document) properties with the tags extracted using Cognitive Services.
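
Step 6 can be made more concrete with a small sketch. In the Flow itself the Cognitive Services connector makes this call, so the Python below is only an illustration of the data that travels in each direction. It assumes the request and response shapes of the Text Analytics v2 key phrases endpoint; the function names, the `MAX_CHARS` limit, and the sample response used for testing are assumptions made for this example, not part of the original solution.

```python
import json

# Assumed per-document character limit for the key phrases endpoint.
MAX_CHARS = 5120


def build_key_phrases_request(text, language="en", doc_id="1"):
    """Build the JSON body for a key phrases call (assumed v2 shape)."""
    document = {"id": doc_id, "language": language, "text": text[:MAX_CHARS]}
    return json.dumps({"documents": [document]})


def tags_from_response(response_body):
    """Collect the key phrases of every document in the response as tags."""
    payload = json.loads(response_body)
    tags = []
    for document in payload.get("documents", []):
        tags.extend(document.get("keyPhrases", []))
    return tags
```

The list returned by `tags_from_response` is what step 7 would write back to the document's properties in SharePoint.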

This part covers extracting the content of the document from its file path, using an Azure Function and the SharePoint client context.

Extracting Content Using Azure Function

The custom service that reads the content of the Word document is hosted as an Azure Function. It is built from the C# HTTP trigger template and uses the SharePoint PnP libraries to extract the content of files uploaded to SharePoint libraries.

This Azure Function requires a file path in order to read the file content. Let us first see how the content is extracted by the Azure Function.

Using the file path, the file content is retrieved from SharePoint and read with the help of the Open XML SDK. The following is the XML structure of a Word document. It shows only the document content elements and omits style elements, since they are not required for this POC.

The following snippet helps you retrieve the content of the file by parsing the XML.
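
To make the XML parsing concrete, here is a minimal Python sketch. It is not the C# Open XML SDK code used in the Azure Function; it only illustrates the same idea with the standard library. A .docx file is a ZIP archive whose main content lives in `word/document.xml`, where paragraphs (`w:p`) contain runs (`w:r`) that contain text (`w:t`). The sample XML and the function names are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative content-only structure: document > body > paragraph > run > text.
SAMPLE_DOCUMENT_XML = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
      <w:r><w:t>world</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraphs_from_document_xml(xml_data):
    """Return the text of each non-empty paragraph in document.xml."""
    root = ET.fromstring(xml_data)
    paragraphs = []
    # Note: paragraphs nested inside text boxes would be visited twice.
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def text_from_docx(docx_bytes):
    """Open the .docx archive in memory and join its paragraphs with newlines."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_data = archive.read("word/document.xml")
    return "\n".join(paragraphs_from_document_xml(xml_data))


def build_sample_docx():
    """Build a stripped-down archive holding only document.xml, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", SAMPLE_DOCUMENT_XML)
    return buffer.getvalue()


if __name__ == "__main__":
    print(text_from_docx(build_sample_docx()))
```

Style information lives in separate elements and parts of the archive, so reading only the `w:t` elements is enough when the goal is plain text for tagging.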

The following snippet shows how the data is retrieved from SharePoint by extracting the document content.
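
As a rough illustration of fetching a file by its path, the Python sketch below uses the SharePoint REST endpoint `GetFileByServerRelativeUrl(...)/$value` instead of the C# client context that the Azure Function relies on. The function names and example URLs are assumptions, and obtaining the access token is outside the scope of this sketch.

```python
import urllib.request
from urllib.parse import quote


def file_content_url(site_url, server_relative_path):
    """Build the REST URL that returns the raw bytes of a file."""
    # Single quotes inside the path are doubled, as in OData string literals.
    escaped = server_relative_path.replace("'", "''")
    encoded = quote(escaped, safe="/")
    base = site_url.rstrip("/")
    return f"{base}/_api/web/GetFileByServerRelativeUrl('{encoded}')/$value"


def download_file(site_url, server_relative_path, access_token):
    """Download the file bytes; access_token is a hypothetical bearer token."""
    request = urllib.request.Request(
        file_content_url(site_url, server_relative_path),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

The bytes returned by `download_file` are the .docx archive itself, so they still need to be unpacked and parsed before any text is available.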

In the next post, we will look at how to host this function and integrate it into Microsoft Flow for extracting and updating tags.