TextBrain:- Building an AI startup using Natural Language Processing
Originally Written on:- 28th May, 2019.
Overview
This is the blog post for TextBrain, a tool that automatically grades and validates texts. In order to validate texts, it uses the Copyleaks API to check for plagiarism. It also uses a modified version of GPT-2 to detect the likelihood that the text is real or fake, and then outputs a validation score using these two scores. In order to grade the text, it uses a neural network model trained on the automatic essay/text grading dataset on Kaggle.
Steps in this Tutorial:-
Step 1:- Download and run GPT-2, hopefully already wrapped as a Flask app.
Step 2:- Analyse its structure.
Step 3:- Re-design the page and add your own text.
Step 4:- Design login and sign-up functionality.
Step 5:- Integrate the Copyleaks API.
Step 6:- Integrate Tensorflow.js.
Step 7:- Train and transfer a scikit-learn model on the automatic essay/text grading dataset.
Step 8:- Display the scores.
Step 9:- Implement payment functionality.
Step 10:- Upload to the web.
What are language models and how do they generate text? (Analyzing the structure)
In recent years, the natural language processing community has seen the development of increasingly large language models.
A
language model is a machine learning model that is trained to predict the next
word given an input context. As such, a model can generate text by generating
one word at a time. These predictions can even, to some extent, be constrained
by human-provided input to control what the model writes about. Due to their
modeling power, large language models have the potential to generate textual
output that is indistinguishable from human-written text to a non-expert
reader.
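To make this concrete, here is a minimal sketch of one step of next-word prediction. It assumes the Hugging Face transformers library purely for illustration; the project itself uses OpenAI's GPT-2 code wrapped in Flask.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the small (117M) GPT-2 model; "gpt2" is the transformers model id.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The quick brown fox"
input_ids = tokenizer.encode(context, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits          # shape: (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next word

next_id = int(torch.argmax(probs))            # greedy: pick the most likely word
print(context + tokenizer.decode([next_id]))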
Language models achieve this with incredibly accurate distributional estimates of what words may follow in a given context. If a generation system uses a language model and predicts very likely next words, the generation will look similar to what a human would have picked in a similar situation, despite the system not having much knowledge about the context itself. This opens up paths for malicious actors to use these tools to generate fake reviews, comments or news articles to influence public opinion.
To prevent this from happening, we need to develop forensic techniques to detect automatically generated text. We make the assumption that computer-generated text fools humans by sticking to the most likely words at each position. In contrast, natural writing more frequently selects unpredictable words that still make sense in the domain. That means we can detect whether a text looks too predictable to have come from a human writer!
GLTR is a visual forensic tool to detect text that was automatically generated by large language models.
Testing the Giant Language Model
The aim of GLTR is to use the same models that are used to generate fake text as a tool for detection. GLTR has access to the GPT-2 117M language model from OpenAI, one of the largest publicly available models. It can take any textual input and analyze what GPT-2 would have predicted at each position. Since the output is a ranking of all of the words that the model knows, we can compute how the observed following word ranks. We use this positional information to overlay a colored mask over the text that corresponds to the position in the ranking. A word that ranks within the most likely words is highlighted in green (top 10), yellow (top 100) or red (top 1,000), and the rest of the words in purple. Thus, we get a direct visual indication of how likely each word was under the model.
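Here is a sketch of how that rank-to-color mapping could be computed. Again, the transformers library stands in for GLTR's own code, purely as an assumption for illustration.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def color_of_next_word(context, next_word):
    # Rank the observed next word among all words the model knows, then
    # map the rank to GLTR's buckets: green (top 10), yellow (top 100),
    # red (top 1,000), purple otherwise.
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    observed_id = tokenizer.encode(next_word)[0]   # first sub-token only
    rank = int((logits > logits[observed_id]).sum()) + 1
    if rank <= 10:
        return "green"
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "purple"

print(color_of_next_word("The quick brown fox jumps over the", " lazy"))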
While
it is possible to paste any text into the tool, we provided some examples of
fake and real texts. Notice that the fraction of red and purple words, i.e.
unlikely predictions, increases when you move to the real texts. Moreover, we
found that the informative snippets within a text almost always appear in red
or purple since these "surprising" terms carry the message of the
text.
When you hover over a word in the display, a small box presents the top 5 predicted words, their associated probabilities, as well as the position of the following word. It is a fun exercise to look into what a model would have predicted.
Finally, the tool shows three different histograms that aggregate the information over the whole text. The first one demonstrates how many words of each category appear in the text. The second one illustrates the ratio between the probabilities of the top predicted word and the following word. The last histogram shows the distribution over the entropies of the predictions. A low entropy implies that the model was very confident of each prediction, whereas a high entropy implies that the model was unsure. You can observe that for the academic text input, the uncertainty is generally higher than for samples from the model.
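For reference, the entropy in that last histogram is the standard Shannon entropy of each predicted next-word distribution; a quick sketch:

import math

def entropy(probs):
    # Shannon entropy: H(p) = -sum(p * log p). Low entropy means the
    # model concentrated its probability mass, i.e. it was confident.
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.97, 0.01, 0.01, 0.01]))  # confident prediction -> low entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uncertain prediction -> high entropy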
Download and run the project from the repository below: https://github.com/HendrikStrobelt/detecting-fake-text/ - this is the base repository that I am using.
If you wish to replicate my project directly, you can clone my repository from here in your IPython Notebook:
!git clone https://github.com/soumyadip1995/TextBrain---Building-an-AI-start-up-using-NLP && cd TextBrain---Building-an-AI-start-up-using-NLP
Then, install the dependencies for Python >= 3.6:
!pip install -r requirements.txt
Run the server for gpt-2-small:
python server.py
In the original project repo, under client/dist, we have a few files: index.html and fun.html. We will be using those, and I would suggest you open them up.
Flask
Flask is a micro web
framework written in Python. It is classified as a microframework because it
does not require particular tools or libraries. It has no database abstraction
layer, form validation, or any other components where pre-existing third-party
libraries provide common functions. However, Flask supports extensions that can
add application features as if they were implemented in Flask itself. Extensions
exist for object-relational mappers, form validation, upload handling, various
open authentication technologies and several common framework-related tools.
Extensions are updated far more regularly than the core Flask program.
Applications that use the
Flask framework include Pinterest, LinkedIn and the community web page for
Flask itself.
Firebase
Firebase is Google's
mobile platform that helps you quickly develop high-quality apps and grow your
business.
Flask + Firebase Integration (Re-design)
Google Firebase integration
for Flask.
The extension works in two modes: development and production. In development, there is no communication with the Firebase system; accounts sign in with a simple email form.
In this project, we are using a Flask and Firebase integration in Python in order to create a non-logic route, and we then add authentication for that route in the server.py file. We do this by calling a Flask function from JavaScript: we use the name of the route and perform a POST action to it from index.html. Here is some sample code: https://github.com/klokantech/flask-firebase. The logic code in this repo can help you add a login and sign-in/sign-up buttons and re-skin the whole page to suit your needs. We then add an on-click function to it from the browser as well; a sketch of such a route is shown below.
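Here is a minimal sketch of a Flask route that a front-end button could POST to. The route name, payload and handler are placeholders of my own, not the project's actual code:

from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder route: index.html would POST the sign-in form data here.
@app.route("/login", methods=["POST"])
def login():
    data = request.get_json()          # e.g. {"email": "user@example.com"}
    # In production, verify the Firebase ID token here before proceeding.
    return jsonify({"status": "ok", "email": data.get("email")})

if __name__ == "__main__":
    app.run(debug=True)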
We can re-skin the HTML files in order to suit our needs.
Copyleaks
The Copyleaks API is a powerful yet simple tool to integrate
within your platform and allow you to add content authentication capabilities
in just a few minutes. https://copyleaks.com/
Here you can find all the needed documentation for a seamless
integration including SDKs with code examples, methods documentation, technical
specifications and more.
Copyleaks Python SDK
Copyleaks SDK is a simple framework that allows you to scan
textual content for plagiarism and trace content distribution online, using the
Copyleaks plagiarism checker cloud.
We are using Copyleaks for any kind of plagiarism check: it checks for similarity on the web, not for generated text. Create an account to obtain an API key. For more info on how to use the SDK, you can check out this video, or visit this repo: https://github.com/Copyleaks/Python-Plagiarism-Checker
Usage
Log in to your Copyleaks account using your email, API key and the product that you would like to use:
from copyleaks.copyleakscloud import CopyleaksCloud
from copyleaks.product import Product
from copyleaks.processoptions import ProcessOptions

cloud = CopyleaksCloud(Product.Education, 'YOUR_EMAIL_HERE', 'YOUR_API_KEY_HERE')  # You can change the product.
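From there, a scan could look like the sketch below. This is a hypothetical continuation modeled on the older Copyleaks SDK examples; the exact method names may differ in current versions of the SDK.

# Hypothetical continuation, following the older Copyleaks SDK examples;
# method names may differ in newer SDK versions.
options = ProcessOptions()
options.setSandboxMode(True)   # test mode, so no credits are consumed

# Submit text for a plagiarism scan and poll until it finishes.
process = cloud.createByText('The text you want to check...', options)
iscompleted = False
while not iscompleted:
    iscompleted, percents = process.isCompleted()
results = process.getResults()  # the list of matched sources
print(results)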
Tensorflow.js
The link below is the website which I referred to: https://codelabs.developers.google.com/codelabs/tensorflowjs-teachablemachine-codelab/index.html#0
You will first load and run a popular pre-trained model called MobileNet for image classification in the browser. You will then use a technique called "transfer learning", which bootstraps training with the pre-trained MobileNet model and customizes it for your application. We are not using any kind of image classification in this scenario, but the pattern is the same: the codelab sets up MobileNet by creating an asynchronous JavaScript function that takes an image, performs transfer learning on it using the downloaded MobileNet model, and makes a prediction. So, what we did in the project was essentially create a script that pulls TensorFlow.js from the web, create a grading button, and then copy and paste the code into index.html.
We then create an on-click function so that whenever the user clicks on the grading button, it asynchronously loads up the MobileNet model, classifies the image that we got, and then displays the result in the console. This is just an example to see if we could integrate TensorFlow.js into our file.
Train and transfer a scikit-learn model on the automatic essay/text grading dataset.
Feed Forward Neural Network
Using neural networks to predict essay grades from vectorized essays.
We use neural networks to predict the grade of the essay by training on 90% of the data and testing on 10% of the data. The network uses three layers: one input layer, two hidden layers of neurons, and one output node.
Let's look at the architecture, initialization and cost. We would use this model, train it in the cloud, and try to load it into our project. The link below is the model that we will be using.
So, what this person has done is basically use a feedforward neural network to predict essay/text grades from vectorized essays, trained on 90% of the data and tested on 10%, using the three-layer architecture described above. The result is shown below.
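As a rough reconstruction, the training setup could look like the sketch below. The vectorizer, layer sizes and placeholder data are my own assumptions, not the original author's exact configuration.

from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Placeholder data; in practice, load the Kaggle essay grading dataset here.
essays = ["First placeholder essay...", "Second placeholder essay...",
          "Third placeholder essay...", "Fourth placeholder essay..."]
grades = [8.0, 6.5, 7.0, 5.5]

# Vectorize the essays, then train on 90% of the data and test on 10%.
X = TfidfVectorizer(max_features=5000).fit_transform(essays)
X_train, X_test, y_train, y_test = train_test_split(X, grades, test_size=0.1)

# Two hidden layers of neurons and one output node.
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500)
model.fit(X_train, y_train)

rho, _ = spearmanr(y_test, model.predict(X_test))  # Spearman rank correlation
print(rho)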
Results
Having tested the neural network on 10% of the data and trained on 90% of the data, we get a highest Spearman score of 0.9369, with fairly low computation time. These are fantastic results, meaning that our model is highly accurate.
Automatic Essay/Text Grading
We need to grade our texts, and the trained model is often in another format, like HDF5. We need to convert it into a JSON file format, i.e. convert the entire model into TensorFlow.js, so that it can be used, since the front end is in JavaScript. We can also use ONNX: https://onnx.ai/
ONNX is an AI ecosystem that can be used to convert a model from one library to another. In this case we would convert the scikit-learn model and load it into TensorFlow.js.
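For the HDF5 route, the tensorflowjs Python package can perform the conversion. The sketch below assumes a Keras model saved as grader.h5 (a hypothetical file name); a scikit-learn model would need the ONNX route instead:

import tensorflowjs as tfjs
from tensorflow import keras

# Load the HDF5 model and write out TensorFlow.js's model.json + weight shards.
model = keras.models.load_model("grader.h5")   # hypothetical file name
tfjs.converters.save_keras_model(model, "web_model")
# In the browser: tf.loadLayersModel('web_model/model.json')

The tensorflowjs_converter command-line tool does the same job.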
Display scores and Payment Functionality.
Result
The validity is actually an average. Let me explain: suppose the GPT-2 score is score1, a scalar value, and the value from the plagiarism API is the second score, score2. We take the average of these two scores,
(score1 + score2) / 2
which is the validity. This is how we calculate validity.
That is the validity you see in the picture. Then we create a grading score, and the way to do that is to use the scikit-learn model mentioned above. It also outputs a scalar value, and that scalar value is our text score. It is a float value, but we convert it to an integer and then display it as HTML to the user. Once we have that, we have a score for both validity and the grade. We also want to add, very crucially, payment functionality. That is easily done with Stripe, and we create a button for it in index.html as well. Stripe will take care of our payment, card and banking details.
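Put together, the scoring step amounts to the small sketch below (the function names are mine, for illustration only):

def validity(score1, score2):
    # Average of the GPT-2 realness score and the plagiarism score.
    return (score1 + score2) / 2

def grade(model_output):
    # The scikit-learn model's float output, shown to the user as an integer.
    return int(model_output)

print(validity(0.8, 0.6))  # 0.7
print(grade(7.8))          # 7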
A sample is shown below. Copy and paste this into index.html, using your own API key:
var handler = StripeCheckout.configure({
  key: 'XXX',  // your publishable Stripe API key
  image: 'https://stripe.com/img/documentation/checkout/marketplace.png',
  locale: 'auto',
  token: function(token) {
    // Send token.id to the server to create the charge.
  }
});
We then run it in the browser to check our results.
Conclusion
We did not create the technology; somebody else made the scientific discovery. But we implemented the engineering solution, and we can now serve it to people in a way that makes this into a sustainable business, where you can pay yourself and hire people to improve the product. We can then build continuous training pipelines.
We could obviously have a better UI design, and we could have more Flask routing; this was kind of like jumbling JavaScript, HTML and CSS all together. I used the model in Python and served it with JavaScript, but ideally all of that happens in the same language, so this is something we can improve. The code is open source; use it as you'd like.
Credits/Citations for this Post
1) Stripe - GitHub
2) Firebase - GitHub
3) Tensorflow.js - GitHub
4) https://codelabs.developers.google.com/codelabs/tensorflowjs-teachablemachine-codelab/index.html#0
5) The GLTR team
6) Siraj Raval