Cognitum Techblog: Mixing Text Mining with Semantic Technologies

Monday, 19 January 2015

Mixing Text Mining with Semantic Technologies - sample application.

Natural language processing is a very broad and currently very active field. In many cases a regular text mining approach is not adequate for the problems we face. Text mining methods are therefore mixed with Natural Language Processing (NLP) methods and with semantic technologies, which gives better results. One such problem is determining whether two sentences are semantically equivalent.

A solution to this problem could be used in many fields. One of them is the detection of abusive clauses in a contract. Sometimes it is really hard, even for specialists, to understand the exact meaning of a clause in a contract. For the sake of presentation I have developed a simple application prototype that attempts to solve this problem. The application was developed in C# and uses the Ontorion SDK.

Input

Before running the application we need three files:

1. File with contract in which we will attempt to detect abusive clauses.
2. File with abusive clauses.
3. File with ontology.

Files 1 and 2 can be in MS Word or txt format:

Figure 1

The ontology file was created in Fluent Editor. It contains knowledge extracted from files 1 and 2. In this sample the ontology is very small and simple:

Figure 2

The above ontology was exported to Ontorion so that I could easily explore it from within the application, using the Ontorion SDK.

In the application we just load the file with the contract and the file with abusive clauses, then specify the name of the Ontorion database in which the ontology resides, and then we can start the matching process.

Figure 3

Algorithm

The sentence-matching algorithm performs the following steps:

1. Sentence detection in files 1 and 2.
2. Tokenization of the sentences from files 1 and 2.
3. Stemming of the words of each sentence in files 1 and 2.
4. Obtaining knowledge from the ontology using the Ontorion SDK: the relations between concepts/instances, along with their annotations.
5. Calculating the improved semantic cosine similarity measure between every sentence of file 1 and every sentence of file 2.
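The purely textual steps (sentence detection, tokenization, stemming and the pairwise cosine comparison) can be sketched roughly as follows. This is a minimal Python illustration and not the code of the C# prototype: the function names are made up for this sketch, the sentence splitter is a naive regular expression, and `crude_stem` is a toy suffix stripper standing in for a real stemmer. The ontology step is left out here, so the measure shown is the plain cosine similarity rather than the improved semantic one.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence detection: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", sentence.lower())


def crude_stem(word):
    """Toy suffix stripper (an assumption of this sketch, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_stems(sentence):
    """Count the stems of a sentence; word order is discarded on purpose."""
    return Counter(crude_stem(t) for t in tokenize(sentence))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(contract_text, abusive_text):
    """Score every contract sentence against every abusive clause."""
    scores = []
    for c in split_sentences(contract_text):
        for a in split_sentences(abusive_text):
            score = cosine_similarity(bag_of_stems(c), bag_of_stems(a))
            scores.append((c, a, score))
    return scores
```

Because each sentence is reduced to a bag of stems, two sentences that contain the same words in a different order get the same vector and therefore a similarity of 1.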

This algorithm does not take word order into account, which in this particular case is an advantage. The ontology plays the most important part in the whole process: the richer the ontology, the more accurate the results.

Example

As an example I have prepared one contract clause and one abusive clause. The ontology used is the same as in Figure 2. We will start from the end, that is, from the result of the comparison, and then describe it:

Figure 4

So the contract clause is as follows:

Payment for the training: after the charge, money is not refundable.

and the abusive clause is:

Payment for the service: after the payment the money is not returnable.

From the relations described in the ontology we know that:

Every training is a service.

Thus the words "training" and "service" can be considered semantically equal here. In our ontology we also defined annotations for each instance/concept, which are treated as synonyms. For the "payment" concept we defined, among others, a "charge" annotation, and for the "refundable" concept a "returnable" annotation. So despite the different words used in the two sentences, their meaning is the same, which is why the semantic similarity score equals 1.
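The effect of the ontology on this example can be reproduced with a small Python sketch. It is only one possible reading of the approach: the post does not spell out the improved measure, so here each word is simply mapped to a canonical concept before an ordinary cosine similarity is computed. The dictionaries `SUBSUMPTIONS` and `SYNONYMS` are hypothetical stand-ins for what the application obtains through the Ontorion SDK, the stop-word list is an assumption, and stemming is skipped for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for knowledge read from the ontology.
SUBSUMPTIONS = {"training": "service"}  # "Every training is a service."
SYNONYMS = {"charge": "payment", "returnable": "refundable"}  # annotations
# Assumed stop-word list; the post does not say how such words are handled.
STOP_WORDS = {"for", "the", "after", "is"}


def tokenize(sentence):
    """Lowercase the sentence and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", sentence.lower())


def normalize(token):
    """Map a word to its canonical concept: synonym first, then parent concept."""
    token = SYNONYMS.get(token, token)
    return SUBSUMPTIONS.get(token, token)


def semantic_vector(sentence, use_ontology=True):
    """Build a term-count vector, optionally rewriting words via the ontology."""
    tokens = [t for t in tokenize(sentence) if t not in STOP_WORDS]
    if use_ontology:
        tokens = [normalize(t) for t in tokens]
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


contract = "Payment for the training: after the charge, money is not refundable."
abusive = "Payment for the service: after the payment the money is not returnable."

plain = cosine_similarity(semantic_vector(contract, False), semantic_vector(abusive, False))
semantic = cosine_similarity(semantic_vector(contract), semantic_vector(abusive))
print(f"without ontology: {plain:.2f}")
print(f"with ontology:    {semantic:.2f}")
```

Under these assumptions the two clauses score about 0.58 when compared word for word and 1.00 once "training", "charge" and "returnable" are rewritten to "service", "payment" and "refundable". Treating a subclass as equal to its parent concept is a simplification that follows the reasoning in the example above.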

You can see this sample application in action in the video below: