We have been cataloging and indexing documents and texts since the dawn of the Roman Empire. In the information era, the volume of documents processed across enterprises is hard to manage. Enterprises often have a team manually go through each document, painstakingly reading and categorizing it into the document management system. This process is inefficient and time-consuming, and the final set is often riddled with inaccuracies.

When we categorize these documents, a number of parameters act as the deciding factors behind the categorization. Once these parameters are defined, enterprises can use machine learning platforms to scrape, process and categorize documents automatically. At CIGNEX Datamatics, we have developed an end-to-end solution (from document acquisition to classification) that leverages leading open source technologies, including machine learning tools, to create a model that automatically classifies documents into a growing number of categories. The solution scales to support millions of documents and distributes the classification work for better performance.


  • Legal Documents
  • Government Archives
  • Production-related documents in Manufacturing
  • Patent Offices
  • Income-related forms, pension applications, credit claims, etc. in Banking, Financial Services and Insurance

Featured Case Study

Achieving High Accuracy in Reduced Time with Machine Learning

This case study covers replacing manual classification of documents with an intelligent automated classification model. The solution includes a parser (Apache Tika + custom) for content analysis and detection, a classifier (DL4J, Naïve Bayes) to create the model and run the test set, and a reviewer to audit logs, parsed documents and outliers, and to review unclassified documents. The solution achieved high accuracy (>95%) while maintaining high performance, and completed the work in less time than the traditional classification process.
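
To make the classifier stage more concrete, here is a minimal sketch of Naïve Bayes text classification in plain Python. It is an illustration only, not the implementation used in the case study: the class name `NaiveBayesClassifier`, the `tokenize` helper, the category labels and the sample texts are all hypothetical, and it assumes the parser stage has already turned each document into plain text.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical illustration; not the case study's actual model.
    """

    def __init__(self):
        self.doc_counts = Counter()              # training documents per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, category):
        """Add one labeled document to the model."""
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def log_scores(self, text):
        """Return log P(category) + sum of log P(word | category)."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen in training carry no signal, so skip them.
        words = [w for w in tokenize(text) if w in self.vocabulary]
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[category] = score
        return scores

    def classify(self, text):
        """Return the category with the highest log score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Tiny made-up training set, one document per category.
clf = NaiveBayesClassifier()
clf.train("patent claim invention prior art filing", "patent")
clf.train("pension application retirement benefit income", "pension")
clf.train("court ruling plaintiff defendant contract", "legal")

print(clf.classify("application for retirement pension benefit"))  # prints "pension"
```

In a real deployment the training set would hold many labeled documents per category, and documents whose top two scores are close could be routed to the reviewer rather than classified automatically.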


Key Features

Self-Learning

Given proper inputs, the platform is self-learning: it learns from its mistakes and routinely improves its training and test sets to classify with better accuracy.

Seamless classification

Ability to classify documents using keywords, metadata or content filters, and to categorize them in alignment with your document or records management platform
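
As a rough sketch of how keyword, metadata and content filters might be combined, the Python example below applies an ordered list of rules and falls back to an "unclassified" bucket for manual review. The `Document` and `Rule` types, the `categorize` function, and all rule values are assumptions made for illustration; they do not describe the platform's actual configuration or API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Rule:
    """Hypothetical rule: every criterion that is set must match."""
    category: str
    keywords: tuple = ()                           # any one keyword must appear
    metadata: dict = field(default_factory=dict)   # all key/value pairs must match
    pattern: str = ""                              # regex content filter

    def matches(self, doc):
        words = set(re.findall(r"[a-z]+", doc.text.lower()))
        if self.keywords and not words.intersection(self.keywords):
            return False
        if any(doc.metadata.get(k) != v for k, v in self.metadata.items()):
            return False
        if self.pattern and not re.search(self.pattern, doc.text, re.IGNORECASE):
            return False
        return True


def categorize(doc, rules, default="unclassified"):
    """Return the category of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(doc):
            return rule.category
    return default


# Made-up rules: one keyword rule, one metadata rule, one content filter.
rules = [
    Rule("patent", keywords=("patent", "invention")),
    Rule("pension", metadata={"department": "benefits"}),
    Rule("credit-claim", pattern=r"claim\s+no\.?\s*\d+"),
]

print(categorize(Document("An invention for sorting mail"), rules))
print(categorize(Document("Form attached", {"department": "benefits"}), rules))
print(categorize(Document("Re: Claim No. 4821 for card refund"), rules))
print(categorize(Document("Quarterly newsletter"), rules))
```

Rule order matters in this sketch because the first match wins, so more specific rules would normally be listed before broader ones.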

Scale that powers information search

Ability to organize and manage millions of documents while offering an interface that enables finding relevant information

Key Benefits

  • Saves the time and cost of discovering online documents
  • Compliance and alignment with regulations
  • Accuracy, Productivity and User satisfaction