How-to Read Office Documents with PowerShell

October 22 2020

PowerShell

Modern Microsoft Office documents use the Open XML format. Word, Excel, and PowerPoint files are mostly saved in this open, widely documented format, and an SDK is available for working with it. There are a couple of ways to install the SDK, and on Windows 10 the process is even easier.

Installing the Open XML SDK

Note: Run all PowerShell commands from an elevated (administrative) prompt.

We will install the latest version of the SDK from NuGet. If you have not used NuGet before, you first need to install it as a package provider and register its package source. You can do so with the following commands:

Find-PackageProvider -Name NuGet | Install-PackageProvider -Force

Register-PackageSource -Name nuget.org -Location https://www.nuget.org/api/v2 -ProviderName NuGet

Once NuGet is installed and registered, we can pull the latest Open XML SDK version by running the following command:

Install-Package -Name Open-XML-SDK -SkipDependencies

Note: The -SkipDependencies flag is set because the command otherwise fails with a dependency loop.

You can visit the following link to get additional information: NuGet Open XML SDK

It is possible to download the Open XML SDK from Microsoft, but that download is limited to version 2.5, which has not been updated since 2012.

Reading Document Contents

This post was born out of necessity. A customer was looking to remove third-party document classification from Office documents, and a PowerShell script seemed like the natural answer. This process will not work for documents in the older formats, but it works fine for xlsx, docx, pptx, and many of their variants.

The Open XML SDK is well documented, though the examples are not written for PowerShell, so you will have to infer some things and get a little hacky for certain behaviors. Open XML SDK Documentation

Adding the Assembly DLL from our NuGet Package

Before we can leverage the SDK, we must load the types from the SDK DLL. If you are not sure where the package was downloaded, you can run:

$Packageinfo = Get-Package Open-XML-SDK

$Packageinfo.Source

In my case, navigating to C:\Program Files\PackageManagement\NuGet\Packages\Open-XML-SDK.2.9.1\ reveals the NuGet package, a signature file, and a lib directory. Open the lib directory and use the DLL in the folder for the most recent .NET Framework version (net46 in my case).

To add types from a DLL, you can use the Add-Type cmdlet:

Add-Type -Path "C:\Program Files\PackageManagement\NuGet\Packages\Open-XML-SDK.2.9.1\lib\net46\DocumentFormat.OpenXml.dll"
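The path above hard-codes the package version and framework folder. As a rough sketch, the DLL can instead be located from the package folder. Find-OpenXmlDll is a hypothetical helper name, and the usage comments assume that the Source property points at the .nupkg file inside the package folder and that a lib\net46 folder exists, as in the 2.9.1 layout described above.

```powershell
# Hypothetical helper: finds DocumentFormat.OpenXml.dll under a package folder.
# Assumes the layout <PackageRoot>\lib\<Framework>\DocumentFormat.OpenXml.dll.
function Find-OpenXmlDll {
    param(
        [Parameter(Mandatory)] [string] $PackageRoot,
        [string] $Framework = 'net46'
    )
    $libPath = Join-Path -Path (Join-Path -Path $PackageRoot -ChildPath 'lib') -ChildPath $Framework
    # Returns the full path of the DLL, or nothing if it is not there.
    Get-ChildItem -Path $libPath -Filter 'DocumentFormat.OpenXml.dll' -File -ErrorAction SilentlyContinue |
        Select-Object -First 1 -ExpandProperty FullName
}

# Usage, assuming the package was installed as described above:
# $package = Get-Package -Name Open-XML-SDK
# $dll = Find-OpenXmlDll -PackageRoot (Split-Path -Path $package.Source -Parent)
# Add-Type -Path $dll
```

If the helper returns nothing, check which framework folders actually exist under lib and pass one of them as -Framework.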

Opening the Document

Next, you will want to grab a modern Office document to test with. We will store the path in a variable.

Note: In Windows, you can hold Shift and right-click a document to “Copy as Path”.

$docpath = "C:\Documents\TestDoc.docx"

 

The next step is a bit more involved: we open the specified file as a file stream. The file is opened for read access only, since we just want to inspect the document.

FileStream Open Documentation

[System.IO.FileStream]$fileStream = [System.IO.File]::Open($docpath, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::Read)

 

Now all that is left is to open the document. The process varies by document type, and there are a few different ways to open documents. For this demo, we open the document from the file stream we created, and it will not be editable.

Open Method Wordprocessing Document

$DOCU = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument]::Open($fileStream, $false)

You should now be able to inspect the $DOCU variable and see that your document is opened for reading and inspection! The document contents can be a little tricky to navigate, but you can find more information here: https://docs.microsoft.com/en-us/office/open-xml/how-to-open-a-word-processing-document-for-read-only-access#basic-document-structure
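As one example of that navigation, the sketch below pulls the text of the top-level paragraphs from an opened Word document. Get-WordParagraphText is a hypothetical helper name. It relies on the SDK's MainDocumentPart.Document.Body property chain and on the LocalName and InnerText properties of its elements, and it deliberately skips paragraphs nested inside tables.

```powershell
# Hypothetical helper: returns the text of each top-level paragraph
# in an already opened WordprocessingDocument.
function Get-WordParagraphText {
    param([Parameter(Mandatory)] $Document)

    # The body is the root of the visible document content.
    $body = $Document.MainDocumentPart.Document.Body

    # Paragraph elements have the local XML name 'p'; InnerText
    # concatenates the text of everything inside the element.
    $body.ChildElements |
        Where-Object { $_.LocalName -eq 'p' } |
        ForEach-Object { $_.InnerText }
}

# Usage with the document opened above:
# Get-WordParagraphText -Document $DOCU
```

For a quick look at everything at once, $DOCU.MainDocumentPart.Document.Body.InnerText returns the whole body as a single string.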

Closing the Document

When you are finished inspecting the document, be sure to close the document and then the stream it reads from:

$DOCU.Dispose()

$fileStream.Dispose()
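If an error occurs between opening and closing, the Dispose calls never run and the file stays locked for the rest of the session. A minimal sketch of one way to guard against that is to wrap the same steps in try/finally. Invoke-WithWordDocument is a hypothetical helper name, and the sketch assumes the SDK assembly has already been loaded with Add-Type.

```powershell
# Hypothetical helper: opens a Word document read-only, runs $Action
# against it, and always releases the document and the stream.
function Invoke-WithWordDocument {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [scriptblock] $Action
    )
    $stream = $null
    $document = $null
    try {
        $stream = [System.IO.File]::Open($Path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::Read)
        $document = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument]::Open($stream, $false)
        # Pass the opened document to the caller's script block.
        & $Action $document
    }
    finally {
        # Runs on success and on failure; dispose the document before its stream.
        if ($null -ne $document) { $document.Dispose() }
        if ($null -ne $stream)   { $stream.Dispose() }
    }
}

# Usage:
# Invoke-WithWordDocument -Path "C:\Documents\TestDoc.docx" -Action {
#     param($doc)
#     $doc.MainDocumentPart.Document.Body.InnerText
# }
```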

Opening Multiple Document Types

Each document type has a different method for opening and structuring data, but we can handle them all! The following switch statement will help navigate them on the fly.

Switch -Regex ([System.IO.Path]::GetExtension($docpath)) {

    'docx|dotm|dotx' {

        [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] $DOCU = $null

        # Arguments: the open file stream, then bool isEditable ($false = read-only)

        $DOCU = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument]::Open($fileStream, $false)

    }

    'xlsx|xlsm|xltx|xltm' {           

        [DocumentFormat.OpenXml.Packaging.SpreadsheetDocument] $DOCU = $null

        $DOCU = [DocumentFormat.OpenXml.Packaging.SpreadsheetDocument]::Open($fileStream, $false)

    }

    'pptx|pptm|potx|potm|ppsx' {

        [DocumentFormat.OpenXml.Packaging.PresentationDocument] $DOCU = $null

        $DOCU = [DocumentFormat.OpenXml.Packaging.PresentationDocument]::Open($fileStream, $false)

    }

}
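The patterns in the switch are unanchored, so they match anywhere in the extension string, and a file type that matches none of them leaves $DOCU unset without any warning. The sketch below shows one possible variation that anchors each pattern and reports unsupported extensions explicitly. Get-OpenXmlDocumentKind is a hypothetical helper name, and the docm and ppsm entries are additions that are not in the original list.

```powershell
# Hypothetical helper: maps a file path to the kind of Open XML document
# it should be opened as, based only on its extension.
function Get-OpenXmlDocumentKind {
    param([Parameter(Mandatory)] [string] $Path)

    # GetExtension includes the leading dot, for example ".docx".
    $extension = [System.IO.Path]::GetExtension($Path)

    # switch -Regex is case-insensitive by default.
    switch -Regex ($extension) {
        '^\.(docx|docm|dotx|dotm)$'           { return 'Word' }
        '^\.(xlsx|xlsm|xltx|xltm)$'           { return 'Excel' }
        '^\.(pptx|pptm|potx|potm|ppsx|ppsm)$' { return 'PowerPoint' }
        default                               { return 'Unsupported' }
    }
}

# Usage:
# Get-OpenXmlDocumentKind -Path "C:\Documents\TestDoc.docx"
```

The returned string can then drive which of the three Open calls to use, with 'Unsupported' handled as an early exit before any stream is opened.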

Putting it all Together

Add-Type -Path "C:\Program Files\PackageManagement\NuGet\Packages\Open-XML-SDK.2.9.1\lib\net46\DocumentFormat.OpenXml.dll"

 

$docpath = "C:\Documents\TestDoc.docx"

 

[System.IO.FileStream]$fileStream = [System.IO.File]::Open($docpath, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::Read)

 

Switch -Regex ([System.IO.Path]::GetExtension($docpath)) {

    'docx|dotm|dotx' {

        [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] $DOCU = $null

        $DOCU = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument]::Open($fileStream, $false)

    }

    'xlsx|xlsm|xltx|xltm' {           

        [DocumentFormat.OpenXml.Packaging.SpreadsheetDocument] $DOCU = $null

        $DOCU = [DocumentFormat.OpenXml.Packaging.SpreadsheetDocument]::Open($fileStream, $false)

    }

    'pptx|pptm|potx|potm|ppsx' {

        [DocumentFormat.OpenXml.Packaging.PresentationDocument] $DOCU = $null

        $DOCU = [DocumentFormat.OpenXml.Packaging.PresentationDocument]::Open($fileStream, $false)

    }

}

 

$DOCU.Dispose()

$fileStream.Dispose()

KiZAN is a Microsoft National Solutions Provider with numerous gold and silver Microsoft competencies, including gold security and gold enterprise mobility management. Our primary offices are located in Louisville, KY, and Cincinnati, OH, with additional sales offices located in Tennessee, Indiana, Michigan, Pennsylvania, Florida, North Carolina, South Carolina, and Georgia.

Posted by Nate Berrier

Topics: PowerShell