Client Overview

The client has over 15,000 employees and is active across 150 countries, offering expertise in Health, Tax and Accounting, Governance, Risk and Compliance.

Business Need

Their existing process for classifying 10-K forms was manual, error-prone and not scalable. With 10M documents and 36 target categories, they wanted an intelligent classification model.

Key Features

  • Parser (Apache Tika + custom): converts XML input files to text output files
  • Classifiers: evaluated DL4J, Naïve Bayes and TensorFlow, running each model against a test set
  • Reviewer: shows classifier results, including audit logs and parsed documents, and supports reviewing outliers
  • Custom Reviewer: captures the accuracy of each classification iteration
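
To illustrate the classifier stage, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written in plain Python. It is not the model built for the client: the class name, the tokenizer, the two category names and the sample sentences are all assumptions made for illustration, and the real system worked on text parsed from 10-K filings across 36 categories.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocabulary = set()                  # every token seen in training

    def train(self, labeled_docs):
        for text, category in labeled_docs:
            self.doc_counts[category] += 1
            for token in tokenize(text):
                self.word_counts[category][token] += 1
                self.vocabulary.add(token)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_tokens = sum(self.word_counts[category].values())
            for token in tokens:
                count = self.word_counts[category][token]
                score += math.log((count + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training snippets and category names, for illustration only.
training_docs = [
    ("the company faces litigation risk and regulatory uncertainty", "risk_factors"),
    ("market risk and competition may harm results", "risk_factors"),
    ("total revenue and net income increased during the fiscal year", "financial_statements"),
    ("balance sheet shows assets liabilities and equity", "financial_statements"),
]

model = NaiveBayesClassifier()
model.train(training_docs)
print(model.predict("regulatory risk and litigation"))
print(model.predict("net income and total assets"))
```

Working in log space avoids numeric underflow on long documents, and the smoothing term keeps a single unseen word from zeroing out a category.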


  • An additional ~1,800 documents were included in the POC, on top of the original ~2,000, to validate accuracy
  • Naïve Bayes provided the highest accuracy (~95%); accuracy could be improved further by including an external feature set
  • Scalability can be achieved by using or building on Big Data frameworks for distributed computing
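
As a rough sketch of how accuracy per classification iteration might be recorded, along with the documents flagged for review, the snippet below compares predicted labels with expected labels. The `IterationReport` structure, the `review_iteration` function, the document ids and the labels are hypothetical; the numbers it prints are toy values, not results from the POC.

```python
from dataclasses import dataclass, field


@dataclass
class IterationReport:
    """Outcome of scoring one classification iteration (hypothetical structure)."""
    name: str
    total: int
    correct: int
    # Each entry is (doc_id, expected_label, predicted_label).
    misclassified: list = field(default_factory=list)

    @property
    def accuracy(self):
        return self.correct / self.total if self.total else 0.0


def review_iteration(name, predictions, expected):
    """Compare predicted labels with expected labels, keyed by document id."""
    report = IterationReport(name=name, total=len(expected), correct=0)
    for doc_id, expected_label in expected.items():
        predicted_label = predictions.get(doc_id)  # None if the document was not classified
        if predicted_label == expected_label:
            report.correct += 1
        else:
            report.misclassified.append((doc_id, expected_label, predicted_label))
    return report


# Toy labels only; these are not results from the POC.
expected = {"doc-1": "A", "doc-2": "B", "doc-3": "A", "doc-4": "C"}
predictions = {"doc-1": "A", "doc-2": "B", "doc-3": "C", "doc-4": "C"}

report = review_iteration("iteration-1", predictions, expected)
print(f"{report.name}: accuracy={report.accuracy:.2%}, to review={report.misclassified}")
```

Keeping the misclassified documents alongside the accuracy figure gives a reviewer a concrete list to inspect after each iteration, rather than a single number.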