The Idea by Woobensky
An AI model that reads a text document and can answer any question whose answer is in the
document, explicitly or implicitly.
How I started
Having this idea in mind, I used the Python library NLTK to create a pipeline that transforms the text
of the user's document into structured data. I use chunking to find the parts of each sentence and
their function (e.g. subject, verb, complement, circumstantial), then I create SQL tables to store the
parts of each sentence in a structured way. At this point the user can upload their document.
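The storage step could be sketched like this. This is a minimal sketch using SQLite instead of MySQL; the table layout and the pre-chunked sentence parts are hypothetical, written by hand since the real chunk grammar isn't shown here:

```python
import sqlite3

# One row per sentence, one column per chunk role (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences (id INTEGER PRIMARY KEY, "
    "subject TEXT, verb TEXT, complement TEXT)"
)

# In the real pipeline these parts would come from the NLTK chunker;
# here they are written by hand for illustration.
chunked = [
    ("Haiti", "is", "a country"),
    ("The population of Haiti", "is", "11.1 million"),
]
conn.executemany(
    "INSERT INTO sentences (subject, verb, complement) VALUES (?, ?, ?)",
    chunked,
)
conn.commit()
```

Storing one column per grammatical role is what later lets a question be answered by selecting a single column.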
Now for the Q&A part (what, where, when… questions). For this part I did something very similar:
I parse the question by chunking it into parts and by determining what information the question
is looking for (e.g. "What is Haiti?" is looking for the complement of the sentence "Haiti is…"). After
that I create an SQL query to find the corresponding answer in the structured database.
Once the corresponding answer is found in the database, I use it to generate a
grammatical answer (because the database is not structured grammatically).
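The retrieval and answer-generation steps might look like this. This is a sketch under the same hypothetical table layout; the question "parsing" is reduced to a hand-written string pattern, standing in for the real chunk-based parser:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (subject TEXT, verb TEXT, complement TEXT)")
conn.execute("INSERT INTO sentences VALUES ('Haiti', 'is', 'a country')")

def answer(question):
    # "What is X?" -> we are looking for the complement of "X is ...".
    # A real parser would chunk the question instead of string-matching.
    if question.lower().startswith("what is ") and question.endswith("?"):
        subject = question[len("what is "):-1].strip()
        row = conn.execute(
            "SELECT subject, verb, complement FROM sentences WHERE subject = ?",
            (subject,),
        ).fetchone()
        if row:
            # Reassemble a grammatical answer from the stored parts.
            return f"{row[0]} {row[1]} {row[2]}."
    return "I don't know."

print(answer("What is Haiti?"))  # -> Haiti is a country.
```

Note how strict this is: the subject in the question must match the stored subject exactly, which is precisely the limitation described below.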
How it worked
I was happy to see how effective it was for explicit questions (e.g. the document says "Haiti is a
country" and the user asks "What is Haiti?"). But it did not work well at all for implicit questions
(e.g. the document says "The population of Haiti is 11.1 million" and the user asks "How many
people live in Haiti?"), and the user could not even replace a word with a synonym.
Everything had to be very precise: the user had to know how the author structured the sentences in
order to ask the question the right way.
Furthermore, it was slow, both because I was using MySQL and because I had to chunk every
sentence before storing it. As a result, it took about a minute to read a six-sentence document and
about 15 seconds to answer a question.
Chunking was also not the most effective way to parse the syntax of a sentence: it is not scalable,
because you have to predict every possibility and write the patterns by hand (I used a regular-
expression chunker, not a trained one).
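The kind of hand-written chunk grammar I mean looks like this. This sketch has a single noun-phrase rule; a real grammar needs a pattern for every construction you want to catch, which is why it doesn't scale:

```python
import nltk

# One hand-written rule: a noun phrase is an optional determiner,
# any adjectives, then a noun. Every other construction needs its own rule.
grammar = r"NP: {<DT>?<JJ>*<NN.*>}"
parser = nltk.RegexpParser(grammar)

# Pre-tagged (word, part-of-speech) tokens, so no tagger model is needed here.
tagged = [("Haiti", "NNP"), ("is", "VBZ"), ("a", "DT"), ("country", "NN")]
tree = parser.parse(tagged)
print(tree)
```

The parser finds "Haiti" and "a country" as noun phrases, but any sentence shape not anticipated by a rule simply goes unparsed.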
Further improvement
For it to work better I have to change a lot of things. First of all, I need to change the answer-
retrieval algorithm (the way I look for the corresponding answer) to make it less strict and more
open to implicitness. For speed, I need a DBMS that is much faster than MySQL and that allows
more expressive queries (e.g. select rows where the subject is a synonym of "Path"). I think
Elasticsearch might be a good option. For syntax parsing I intend to use dependency parsing, which
is far more scalable than chunking.
I might use word vectors (with Gensim) and sentence similarity to retrieve answers for implicit
questions.
I could also use spaCy instead of NLTK, since it has a labeled dependency parser and is, I think,
faster than NLTK.
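The idea behind the word-vector approach can be sketched without Gensim. The toy 3-dimensional vectors below are written by hand purely for illustration; real ones come from a trained model (e.g. Gensim's Word2Vec) and have hundreds of dimensions:

```python
from math import sqrt

# Toy vectors, hand-written for illustration only; a trained model
# would supply real ones.
vectors = {
    "population": [0.9, 0.1, 0.2],
    "people":     [0.8, 0.2, 0.3],
    "island":     [0.1, 0.9, 0.1],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, near 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# "people" is close to "population", so a question using either word
# could be matched to the same stored sentence.
print(cosine(vectors["population"], vectors["people"]) >
      cosine(vectors["population"], vectors["island"]))  # -> True
```

This is what would loosen the exact-match requirement: instead of demanding the user's word, the retrieval step accepts any word whose vector is close enough.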
The ineluctable obstacle
Even if I make all these improvements, I will still run into big limits with implicit questions. There
are things about a text that just understanding the words of a sentence and its structure cannot tell
you, and these are most often the questions users will want answered.
For example, in a document that says Haiti and the DR are countries that share one island, if the
user asks what country Haiti has a border with, there is no way for the system to know that when
two countries share an island they inevitably have a border. Even a human being could not find this
answer without knowing something about geography.
And this idea brings me to my next point: "background knowledge".
The situation
In every text document there is a lot of information that the author does not write but that is still
there, and that we humans fill in by ourselves (missing information).
For example, take a text that says: "Yesterday Mary was very sick. She had to go to the hospital, but
when she arrived, although the doctors did all they could, they could not save her." There are two
questions that a model like the one I described cannot answer: did Mary die, and if so, what killed
her?
Only a human being knows that when an author tells you the doctors could not save a patient, it
means the patient died. And we also know that if a patient dies because the doctors could not save
her, it means the sickness killed her.
We humans have background knowledge that allows us to understand what we read even when the
author does not say everything. And we know what happens when we read about a subject we are
not used to (for me, biology): we understand every word we read, but we cannot understand the
whole thing.
The solution
If we can find a way to give computers background knowledge, they will be able, like us, to fill in
the things the author does not say. They will be able to make inferences and apply logic to what
the author writes, which will let them answer any question the user might ask, no matter how
implicit it is.
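The kind of inference I mean can be sketched as a tiny forward-chaining rule system. The rules and facts below are hand-written for the Mary example; the whole point of the project is that this background knowledge would eventually be learned rather than hard-coded:

```python
# Hand-written background-knowledge rules: if all premises hold, add the
# conclusion. In the envisioned system these would be learned, not coded.
rules = [
    ({"sick(mary)", "doctors_could_not_save(mary)"}, "died(mary)"),
    ({"died(mary)", "sick(mary)"}, "killed_by(sickness, mary)"),
]

# Facts stated explicitly in the document.
facts = {"sick(mary)", "doctors_could_not_save(mary)"}

# Forward chaining: keep applying rules until nothing new is added.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print("died(mary)" in facts)                 # -> True
print("killed_by(sickness, mary)" in facts)  # -> True
```

Both implicit answers ("Mary died", "the sickness killed her") fall out of chaining the two rules, even though neither fact appears in the text itself.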
How would this work?
It will use the same system as the current version of Bensky, storing information in a database. But
it will also have a reference knowledge base where it can look up facts about the real world and
background information, allowing it to understand the document on a deeper level.
I will use machine learning and deep learning to create an algorithm that lets it reason deeply
about those facts.
It will need a huge amount of text about a given field in order to understand documents in that
field. For example, to understand a book about deep learning and answer questions about it, you
need a lot of background knowledge in computer science.
So that's the goal: a system that knows things about our world, which allows it to understand
natural-language texts.
What I need to do
For now I'm not ready to implement these ideas, since I don't know a lot about machine learning.
What I need to do is gain a great amount of skill in machine learning, deep learning and math.
I will start with TensorFlow to get a basic understanding of the field and the algorithms, then I will
go deeper as I start to see which parts matter most for my goal.
Then I will implement this solution, and I believe this technology will change the way we interact
with computers. It will change the way we build chatbots and even the way search engines work.
We will at last be able to have real conversations with computers, and that will allow developers to
create technologies that make the best use of the power computers have.