Client Overview

The client has over 15,000 employees and is active across 150 countries, offering expertise in Health, Tax and Accounting, and Governance, Risk and Compliance.

Business Need

Their current process for classifying 10-K forms was manual, error-prone, and not scalable. With 10M documents and 36 target categories, they wanted an intelligent classification model.

Key Features

  • Input: XML files / Output: text files, using a parser – Apache Tika + custom
  • Evaluated DL4J, Naïve Bayes and TensorFlow as classifiers, training models and running them against a test set
  • Reviewer – results of the classifier, including audit logs and documents parsed, with review of outliers
  • Custom Reviewer to capture results of classification iterations in terms of accuracy
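
To make the parse, classify, and review flow above concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the standard-library XML parser stands in for the Apache Tika-based parser, the classifier is a small hand-written multinomial Naïve Bayes with add-one smoothing, and the function names, category labels ("financials", "risk") and sample documents are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def xml_to_text(xml_string):
    """Flatten an XML document to plain text (stand-in for the Tika-based parser)."""
    root = ET.fromstring(xml_string)
    return " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / self.total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted category matches the label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Hypothetical toy data: two categories instead of the 36 in the project.
train_xml = [
    ("<doc><p>revenue income tax expense</p></doc>", "financials"),
    ("<doc><p>litigation risk regulatory compliance</p></doc>", "risk"),
    ("<doc><p>net income revenue growth</p></doc>", "financials"),
    ("<doc><p>risk factors litigation exposure</p></doc>", "risk"),
]
train_docs = [xml_to_text(x) for x, _ in train_xml]
train_labels = [y for _, y in train_xml]
model = NaiveBayes().fit(train_docs, train_labels)

test_docs = ["tax revenue", "litigation exposure"]
test_labels = ["financials", "risk"]
print("accuracy:", accuracy(model, test_docs, test_labels))
```

Recording the value returned by `accuracy` after each training run is one simple way to track classification iterations, which is the role the custom reviewer plays in the list above.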

Results

  • An additional ~1,800 documents were included in the POC, on top of the original ~2,000, to validate the accuracy.
  • Naïve Bayes provided the highest level of accuracy (~95%). Accuracy can be further enhanced by including an external feature set.
  • Scalability can be achieved by using or building on Big Data frameworks for distributed computing.