CIGNEX and Relevance Lab have joined forces. Learn more at Relevance Lab


Text Classification using Machine Learning

18 Sep 2017

Consider these recent headlines:

– 13th Aug 2017 – Elon Musk-backed OpenAI’s bot defeats champion in Defense of the Ancients 2 (Dota 2)
– 15th Mar 2016 – Google’s AlphaGo AI beats world champion in Go

These headlines point to the massive proliferation of Artificial Intelligence in general and Machine Learning in particular in our everyday world.

What is Machine Learning?

At a very high level, Machine Learning is the ability of computers to learn from data without being explicitly programmed for each task. They do so by using algorithms that analyze large amounts of data, find patterns in it, and use those patterns to make predictions, including predictions about human behavior. Everyday examples include the spam filters that keep junk mail out of your inbox, Amazon’s product recommendations based on your interests and past browsing history, and speech recognition in Google Now.

The work on adding human-like intelligence to machines goes back to the 1940s, with Alan Turing’s wartime code-breaking machines and his 1948 report “Intelligent Machinery”. Progress was uneven in the decades that followed. By the mid-2010s, however, the availability of fast processors, falling bandwidth costs, better storage, and the explosion in big data meant that the ground was ripe for Artificial Intelligence to flourish.

Deep learning within machine learning

Deep learning is an interesting subset of the machine learning ecosystem. It typically requires very large volumes of data and a great deal of processing power to deliver results. In many cases it is used in conjunction with other machine learning techniques. An example is reducing the number of false positives in a security breach detection program, which can use both.

Deep learning uses multi-layered neural networks for its computations, which extends the practical value of AI to a wider range of applications.

Use cases of Machine Learning in Action – Text (Document) Classification

Company profile

The client is a leading global information services and publishing company. It is more than 180 years old and employs 15,000 people globally.

The problem

The company’s growing volume of publications prompted it to contact CIGNEX to automate the classification of all publications within the organization. The solution needed to be scalable for future demand, accurate, and able to keep delivering high performance over a long time. The company wanted a self-learning system that could reach a classification accuracy of at least 80%. The solution also had to integrate with the existing systems and be offered end to end.

The solution construct

a) Execution flow

Data acquisition: The first step was acquiring the right set of data, using external RBS filings and extraction of XML files.
Parsing: The next step was preparing the data by adding meta information, for example by parsing filter tags and metadata. This step also did basic pre-processing to get the data ready (noise word removal, tokenization, and stemming).
External feature: The third step was introducing the external feature set, i.e. business rules and business analysis, so that the solution performs as stakeholders expect.
Hypothesis/modeling: Once the above three steps are done, the model is built and tested.
Unlabeled doc classification: This step gives insight into ways to correctly classify unlabeled documents.
Evaluation: In this stage of hypothesis model testing, the accuracy, or the difference against the existing classification, is ascertained.
Optimization: With the above in place, it becomes easier to see whether the model is delivering the expected outcomes or needs further fine-tuning. This helps to verify the accuracy of the model selected in the hypothesis/modeling step above.
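
The pre-processing named in the parsing step (noise word removal, tokenization, and stemming) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the noise-word list and the suffix-stripping stemmer are hypothetical stand-ins, and a real system would more likely use a full stop-word list and an established stemmer such as Porter’s.

```python
import re

# Hypothetical noise-word list; a real project would use a fuller, domain-tuned list.
NOISE_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "was", "were"}

# Suffixes for a deliberately naive stemmer (a stand-in for an established stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop noise words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in NOISE_WORDS]


tokens = preprocess("The banks reported rising profits in the filings")
# tokens == ["bank", "report", "ris", "profit", "filing"]
```

Note how the naive stemmer turns “rising” into “ris”: crude stems like this are usually acceptable for classification as long as the same rule is applied to training and incoming documents alike.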

b) Technology stack and components used

Custom coding and Apache Tika used for:

Identification of documents
Conversion of XML data into text
Processing of header data
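
To make the conversion step concrete, here is a minimal sketch of turning an XML document into header data plus plain text. It uses Python’s standard `xml.etree.ElementTree` module as a stand-in, not Apache Tika, and the `<document>`, `<header>`, and `<body>` element names are assumptions made up for the example; the actual schema of the client’s files is not described here.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string):
    """Return (header, body_text) extracted from one XML document."""
    root = ET.fromstring(xml_string.strip())
    # Header data: each child of the hypothetical <header> element, as a dict.
    header = {}
    header_node = root.find("header")
    if header_node is not None:
        header = {child.tag: (child.text or "").strip() for child in header_node}
    # Body text: all text under the hypothetical <body> element.
    # Fragments are joined with spaces, then whitespace is normalized.
    body = ""
    body_node = root.find("body")
    if body_node is not None:
        body = " ".join(" ".join(body_node.itertext()).split())
    return header, body


sample = """
<document>
  <header><title>Annual Report</title><category>Finance</category></header>
  <body><p>Revenue grew.</p><p>Costs <b>fell</b> sharply.</p></body>
</document>
"""
header, text = xml_to_text(sample)
# header == {"title": "Annual Report", "category": "Finance"}
# text == "Revenue grew. Costs fell sharply."
```

Joining fragments with a space keeps adjacent paragraphs from running together, at the cost of splitting a word that has inline markup in the middle of it.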

DL4J, Naïve Bayes, TensorFlow, and Decision Tree used for:

Classify documents using neural networks
Provide a training set
Create a model and test it for relevance and accuracy
Tune the model with available parameters
Validate the output with the model, fine-tune, and iterate further
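
Of the techniques listed, Naïve Bayes is the simplest to show end to end. The sketch below is a small multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written in plain Python for illustration. It is not the project’s implementation, and the class name, the two categories, and the tiny training set are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        """documents: list of token lists; labels: one class label per document."""
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(token | label), with add-one smoothing.
        score = math.log(self.class_doc_counts[label] / self.total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for token in tokens:
            if token in self.vocabulary:  # ignore words never seen in training
                score += math.log((self.word_counts[label][token] + 1) / denom)
        return score

    def predict(self, tokens):
        """Return the label with the highest log score."""
        return max(self.class_doc_counts, key=lambda label: self._log_score(tokens, label))


# Hypothetical, already pre-processed training set.
training_docs = [
    ["revenue", "profit", "quarter"],
    ["profit", "dividend", "shares"],
    ["court", "ruling", "appeal"],
    ["court", "judge", "verdict"],
]
training_labels = ["finance", "finance", "legal", "legal"]

model = NaiveBayesTextClassifier().fit(training_docs, training_labels)
prediction = model.predict(["profit", "shares"])  # "finance"
```

Working in log space avoids numeric underflow on long documents, and the smoothing keeps a single unseen word-class pair from driving a class probability to zero.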

Reviewer and custom programming used for:

Capture and assess the accuracy of the results obtained from the model
Review outliers and tune the model
Classify the unclassified documents through external feature set
Enhance the training set
Review the performance trend
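
The review loop described above can be sketched as two small functions: one measures how often the model agrees with the existing classification, and the other collects the disagreements for a human reviewer. The function names and sample labels are hypothetical; the 80% target is the minimum accuracy the client asked for.

```python
def accuracy(predicted, existing):
    """Fraction of documents where the model agrees with the existing label."""
    if len(predicted) != len(existing):
        raise ValueError("predicted and existing must have the same length")
    if not existing:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, existing))
    return matches / len(existing)


def review_queue(doc_ids, predicted, existing):
    """Return (doc_id, predicted, existing) for every disagreement."""
    return [(d, p, e) for d, p, e in zip(doc_ids, predicted, existing) if p != e]


TARGET_ACCURACY = 0.80  # minimum accuracy stated in the problem description

# Hypothetical sample: model output versus the existing classification.
doc_ids = ["d1", "d2", "d3", "d4", "d5"]
predicted = ["finance", "legal", "finance", "legal", "finance"]
existing = ["finance", "legal", "legal", "legal", "finance"]

score = accuracy(predicted, existing)                   # 0.8
meets_target = score >= TARGET_ACCURACY                 # True
to_review = review_queue(doc_ids, predicted, existing)  # [("d3", "finance", "legal")]
```

Documents that land in the review queue are natural candidates for correcting and adding back to the training set, and tracking the accuracy score across iterations gives the performance trend.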

Key takeaways

Applicability of the use case to Machine Learning is crucial
An in-depth understanding of the data set is needed
When building a Machine Learning solution, it is important to keep the larger picture of the overall solution in focus
Accuracy of the training data set is very important for building a good model
Optimal model performance can be observed at a 60:40 or 70:30 training set to test set ratio
Pre-processing and rigorous cleaning of data are necessary
Scalability needs to be part of the solution architecture for the long-term efficacy of the solution
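
The training set to test set ratio mentioned in the takeaways is easy to try out. Below is a minimal sketch of a shuffled 70:30 split in plain Python; the function name, the fixed seed, and the ten placeholder documents are assumptions for illustration only.

```python
import random


def split_train_test(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of items and split it into (train, test)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


documents = [f"doc{i}" for i in range(10)]  # placeholder document IDs
train, test = split_train_test(documents, train_fraction=0.7)
# len(train) == 7 and len(test) == 3
```

Passing `train_fraction=0.6` gives the 60:40 split instead. Shuffling before the cut matters when the source data is ordered, for example by date or category, because an unshuffled split would then train and test on different kinds of documents.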

Interested in learning more about how big data analytics and machine learning from CIGNEX can enhance your operational efficiency? Connect with us for a quick consultation.

CIGNEX is a global consulting company offering solutions, services and platforms on Open Source, Cloud and Automation technologies. Since 2000, CIGNEX has been delivering enterprise class solutions, which are built using leading platforms & can easily be integrated with existing systems to achieve unparalleled results. By leveraging multiple delivery models, we help organizations around the world to increase revenue, achieve business goals, gain competitive advantage, and maximize customer satisfaction while significantly reducing the cost of doing business.

As a leading System Integrator, CIGNEX provides end-to-end services on Liferay | UiPath | Kafka – Confluent | Sitecore | Red Hat | Appian | Salesforce | ServiceNow | MongoDB | Drupal

For any questions, RFP or to get in touch, you can email us at info@cignex.com
