Global Information Service Provider and Publishing Company with over 15,000 employees, active across 150 countries, offering expertise in Health, Tax, Accounting, Governance, Risk and Compliance.
Their document classification process was manual, error-prone and not scalable. With 10M documents and 36 target categories, they wanted an intelligent classification model.
Parser (Apache Tika + Custom)
- Input: XML files / Output: Text files
- Content detection and analysis framework
- Customized to process ‘Header’ section
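The parsing step above (XML in, plain text out, with the header section handled separately) can be sketched as follows. This is only an illustration: it uses Python's standard-library XML parser rather than Apache Tika, and the tag name "header", the sample document and the function name xml_to_text are assumptions, not details from the project.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string, header_tag="header"):
    """Return (header_text, full_text) for one XML document.

    header_tag is a hypothetical element name; the real schema is not
    given in the source.
    """
    root = ET.fromstring(xml_string)

    def flatten(element):
        # Join all text nodes and collapse runs of whitespace.
        return " ".join(" ".join(element.itertext()).split())

    header = root.find(f".//{header_tag}")
    header_text = flatten(header) if header is not None else ""
    return header_text, flatten(root)


# Made-up sample document, for illustration only.
sample = """<document>
  <header><title>Tax  Update</title><topic>VAT</topic></header>
  <body><p>New VAT rules apply from January.</p></body>
</document>"""

header_text, full_text = xml_to_text(sample)
print(header_text)
print(full_text)
```

Keeping the header text separate from the full text lets a later step weight or featurize it differently, which is one plausible reason for customizing the header handling.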
Classifier (DL4J, Naïve Bayes and TensorFlow)
- Evaluated machine learning and neural network tools
- Created a model and ran a test set (with different training : test split ratios)
- Audit logs, docs parsed and outliers (enrich feature set, training set, classify outliers)
- Feedback loop back to the Classifier
- Capture the results of classification iterations in terms of accuracy
- Review unclassified documents (Outliers)
- Enhance external feature set and enrich training set
- Review accuracy trends, performance etc.
- High level of accuracy using Naïve Bayes (>95%). Accuracy to be further enhanced using an external feature set
- Scalable, accurate solution design with high performance
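
The classification and feedback steps above can be illustrated with a minimal multinomial Naïve Bayes sketch. It is a from-scratch toy, not the project's DL4J or TensorFlow code. The six sample documents, the two labels (the real system had 36 categories), the 0.9 confidence threshold and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, Laplace-smoothed."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(docs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = len(labels)
        return self

    def predict_proba(self, text):
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / self.total)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        # Convert log scores to probabilities (log-sum-exp for stability).
        top = max(scores.values())
        exp = {k: math.exp(s - top) for k, s in scores.items()}
        z = sum(exp.values())
        return {k: e / z for k, e in exp.items()}

    def predict(self, text, min_confidence=0.0):
        """Return the best label, or None when confidence is too low (outlier)."""
        probs = self.predict_proba(text)
        label = max(probs, key=probs.get)
        return label if probs[label] >= min_confidence else None


def evaluate(docs, labels, train_ratio, seed=0):
    """Shuffle, split by train_ratio, train, and return test-set accuracy."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_ratio)
    train, test = idx[:cut], idx[cut:]
    model = NaiveBayes().fit([docs[i] for i in train], [labels[i] for i in train])
    correct = sum(model.predict(docs[i]) == labels[i] for i in test)
    return correct / len(test)


# Made-up toy corpus; labels borrow two of the subject areas named above.
docs = [
    "tax vat invoice filing", "vat return tax audit", "income tax filing deadline",
    "patient clinic drug trial", "drug dosage patient care", "clinic nurse patient record",
]
labels = ["Tax", "Tax", "Tax", "Health", "Health", "Health"]

# Try several training : test ratios, as in the evaluation step.
for ratio in (0.5, 0.67, 0.8):
    print(ratio, evaluate(docs, labels, ratio))

# Feedback loop: low-confidence documents are set aside for manual review.
model = NaiveBayes().fit(docs, labels)
confident = model.predict("vat tax audit", min_confidence=0.9)
outlier = model.predict("quarterly shareholder meeting", min_confidence=0.9)
print(confident, outlier)
```

In this sketch a document whose best class probability falls below the threshold comes back as None. Such documents would be the outliers to review, label and add to the training set before the next iteration.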