TextBrain:- Building an AI startup using Natural Language Processing

Originally Written on:- 28th May, 2019.




Overview

This is the blog post for TextBrain, a tool that automatically grades and validates texts. To validate a text, it uses the Copyleaks API to check for plagiarism, and a modified version of GPT-2 to estimate the likelihood that the text is real or machine-generated; it then combines these two scores into a validation score. To grade the text, it uses a neural network model trained on the automatic essay/text grading dataset on Kaggle found here.




Steps in this Tutorial:-


Step 1:- Download and run GPT-2, ideally already wrapped as a Flask app.
Step 2:- Analyse its structure.
Step 3:- Re-design the interface and add some text.
Step 4:- Design login and sign-up functionality.
Step 5:- Integrate the Copyleaks API.
Step 6:- Integrate Tensorflow.js.
Step 7:- Train and transfer a scikit-learn model on the automatic essay/text grading dataset.
Step 8:- Display scores.
Step 9:- Implement payment functionality.
Step 10:- Deploy to the web.


The full Jupyter notebook is here.

What are language models and how do they generate text? (Analyzing the structure)
In recent years, the natural language processing community has seen the development of increasingly large language models.

A language model is a machine learning model that is trained to predict the next word given an input context. As such, a model can generate text by generating one word at a time. These predictions can even, to some extent, be constrained by human-provided input to control what the model writes about. Due to their modeling power, large language models have the potential to generate textual output that is indistinguishable from human-written text to a non-expert reader.
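To make that concrete, here is a minimal sketch of one-word-at-a-time (greedy) generation, assuming the HuggingFace transformers package and the small public GPT-2. This is an illustration only, not the exact code used later in this project:

# Sketch: generating text one word at a time with GPT-2 (greedy decoding).
# Assumes: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("The meaning of life is", return_tensors="pt")
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab)
    next_id = logits[0, -1].argmax()          # pick the most likely next word
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))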
Language models achieve this with remarkably accurate distributional estimates of which words may follow in a given context. If a generation system uses a language model and picks a very likely next word at each step, the output will look similar to what a human would have written in a similar situation, despite the model not having much knowledge of the context itself. This opens up paths for malicious actors to use these tools to generate fake reviews, comments or news articles to influence public opinion.
To prevent this from happening, we need to develop forensic techniques to detect automatically generated text. We make the assumption that computer-generated text fools humans by sticking to the most likely words at each position. In contrast, natural writing more frequently selects unpredictable words that nevertheless make sense in the domain. That means we can flag a text as suspicious when it looks too predictable to have come from a human writer!
GLTR is a visual forensic tool for detecting text that was automatically generated by large language models.




Testing the Giant Language Model


The aim of GLTR is to use the same models that are used to generate fake text as a tool for detection. GLTR has access to the GPT-2 117M language model from OpenAI, one of the largest publicly available models. It can take any text input and analyze what GPT-2 would have predicted at each position. Since the output is a ranking of all of the words that the model knows, we can compute where the observed following word ranks. We use this positional information to overlay a colored mask over the text that corresponds to the position in the ranking. A word that ranks within the most likely words is highlighted in green (top 10), yellow (top 100) or red (top 1,000); all remaining words appear in purple. Thus, we get a direct visual indication of how likely each word was under the model.
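As a sketch of that ranking idea (my own illustration using the transformers package, not the repo's exact code), we can compute the rank of each observed word under the model and bucket it into the four colors:

# Sketch: GLTR-style rank buckets for each observed next word.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # GPT-2 117M ("small")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("The quick brown fox jumps over the lazy dog.",
                       return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits                      # (1, seq_len, vocab)

for pos in range(ids.shape[1] - 1):
    actual = ids[0, pos + 1]
    order = logits[0, pos].argsort(descending=True)
    rank = (order == actual).nonzero().item()       # 0 = most likely word
    color = ("green" if rank < 10 else "yellow" if rank < 100
             else "red" if rank < 1000 else "purple")
    print(f"{tokenizer.decode([actual]):>10}  rank={rank:<6} {color}")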

While it is possible to paste any text into the tool, we provided some examples of fake and real texts. Notice that the fraction of red and purple words, i.e. unlikely predictions, increases when you move to the real texts. Moreover, we found that the informative snippets within a text almost always appear in red or purple since these "surprising" terms carry the message of the text.



By hovering over a word in the display, a small box presents the top 5 predicted words, their associated probabilities, as well as the position of the following word. It is a fun exercise to look into what a model would have predicted.




Finally, the tool shows three histograms that aggregate the information over the whole text. The first demonstrates how many words of each category appear in the text. The second illustrates the ratio between the probability of the top predicted word and that of the word that actually follows. The last shows the distribution over the entropies of the predictions: low entropy implies that the model was very confident of each prediction, whereas high entropy implies that the model was uncertain. You can observe that for the academic text input, the uncertainty is generally higher than for samples from the model.
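For illustration, here is how those three aggregate statistics could be computed. This is a sketch under the assumption that probs holds the model's next-word distributions and next_ids holds the tokens that actually followed; the arrays below are toy stand-ins:

# Sketch: the three per-text statistics GLTR aggregates.
import numpy as np

def gltr_histogram_stats(probs, next_ids):
    # probs: (seq_len, vocab) next-word distributions
    # next_ids: (seq_len,) the tokens that actually followed
    order = np.argsort(-probs, axis=1)
    ranks = (order == next_ids[:, None]).argmax(axis=1)
    counts = np.bincount(np.digitize(ranks, [10, 100, 1000]), minlength=4)  # words per color
    p_actual = probs[np.arange(len(next_ids)), next_ids]
    top_ratio = probs.max(axis=1) / p_actual                 # top word vs. following word
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)   # prediction uncertainty
    return counts, top_ratio, entropy

probs = np.full((5, 10), 0.1)            # toy uniform distributions
next_ids = np.array([0, 3, 5, 7, 9])
print(gltr_histogram_stats(probs, next_ids))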


Check out the live demo of GLTR here.




Let's demo a version of GPT-2 with a few lines of code here.


Download and run the project from the repository below:- https://github.com/HendrikStrobelt/detecting-fake-text/ <- this is the base repository that I am using.
If you wish to replicate my project directly, you can clone my repository in your IPython notebook:

!git clone https://github.com/soumyadip1995/TextBrain---Building-an-AI-start-up-using-NLP && cd TextBrain---Building-an-AI-start-up-using-NLP

Then install the dependencies (Python >3.6):

!pip install -r requirements.txt

Run the server for gpt-2-small:

python server.py

In the original project repo, under client/dist, there are a few files we will be using: index.html and fun.html. I would suggest opening them up.



Flask


Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions. However, Flask supports extensions that can add application features as if they were implemented in Flask itself. Extensions exist for object-relational mappers, form validation, upload handling, various open authentication technologies and several common framework-related tools. Extensions are updated far more regularly than the core Flask program.

Applications that use the Flask framework include Pinterest, LinkedIn and the community web page for Flask itself.
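If you have not used Flask before, here is a minimal app for orientation. This is a sketch, not the project's server.py:

# Minimal Flask app: one page route and one POST endpoint.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/')
def index():
    return 'TextBrain demo'

@app.route('/analyze', methods=['POST'])
def analyze():
    # Echo back the length of the submitted text.
    text = request.form.get('text', '')
    return jsonify({'length': len(text)})

if __name__ == '__main__':
    app.run(debug=True)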


Firebase



Firebase is Google's mobile platform that helps you quickly develop high-quality apps and grow your business.

Flask + Firebase Integration (Re-design)

Google Firebase integration for Flask.
The extension works in two modes: development and production. In development, there is no communication with the Firebase system; accounts sign in with a simple email form.
In this project, we use the Flask + Firebase integration in Python to create a non-logic route, and then add authentication for that route in the server.py file. We do this by calling a Flask function from JavaScript: using the name of the route, we perform a POST action to it from index.html, and we also attach a click handler to it in the browser (see the sketch below). Here is some sample code:- https://github.com/klokantech/flask-firebase. The logic code in this repo can help you add login and sign-in/sign-up buttons and re-skin the whole page to suit your needs.
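As a sketch of that idea (the route name, session check and field names are my assumptions, not the exact code in server.py):

# A Flask route that index.html can POST to after Firebase sign-in.
from flask import Flask, jsonify, request, session

app = Flask(__name__)
app.secret_key = 'change-me'            # needed for sessions

@app.route('/grade', methods=['POST'])
def grade():
    if not session.get('user'):         # set after a successful sign-in
        return jsonify({'error': 'not signed in'}), 401
    text = request.form.get('text', '')
    return jsonify({'received': len(text)})

In index.html, the click handler would POST the text to /grade and render the JSON response.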

We can re-skin the HTML files to suit our needs.


Copyleaks



The Copyleaks API is a powerful yet simple tool to integrate into your platform, allowing you to add content authentication capabilities in just a few minutes. https://copyleaks.com/
Here you can find all the needed documentation for a seamless integration including SDKs with code examples, methods documentation, technical specifications and more.
Copyleaks Python SDK
Copyleaks SDK is a simple framework that allows you to scan textual content for plagiarism and trace content distribution online, using the Copyleaks plagiarism checker cloud.
We are using Copyleaks for plagiarism of any kind; note that it checks for similarity on the web, not for generated text. Create an account to obtain an API key. For more info on how to use the SDKs, you can check out this video, or visit this repo -> https://github.com/Copyleaks/Python-Plagiarism-Checker




Usage



Log in to your Copyleaks account using your email, API key and the product that you would like to use.
# via StackOverflow
from copyleaks.copyleakscloud import CopyleaksCloud
from copyleaks.product import Product
from copyleaks.processoptions import ProcessOptions

cloud = CopyleaksCloud(Product.Education, 'YOUR_EMAIL_HERE', 'YOUR_API_KEY_HERE')  # You can change the product.


Tensorflow.js




You would first load and run a popular pre-trained model called MobileNet for image classification in the browser, and then use a technique called "transfer learning", which bootstraps training with the pre-trained MobileNet model and customizes it for your application. We are not doing any image classification in this scenario, but the exercise shows the pattern: a script pulls Tensorflow.js from the web, and an asynchronous JavaScript function takes an image, performs transfer learning on it using the downloaded MobileNet model, and makes a prediction. In the project, we created a grading button and pasted such a script into index.html.
We then create an on-click function, so that whenever the user clicks the grading button it asynchronously loads the MobileNet model, classifies the image, and displays the result in the console. This was just an exercise to confirm that we could integrate Tensorflow.js into our file.


Train and transfer a scikit-learn model on the automatic essay/text grading dataset
Feedforward Neural Network
Using a neural network to predict essay grades from vectorized essays
We use a neural network to predict the grade of an essay, training on 90% of the data and testing on 10% of the data. The network uses three layers: one input layer, two layers of neurons, and one output node.
Let's look at the architecture, initialization and cost. We will use this model, train it in the cloud, and try to load it into our project. The link below is the model that we will be using.

So, what this person has done is use a feedforward neural network to predict essay/text grades from vectorized essays/texts, training on 90% of the data and testing on 10%. The result is shown below.
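A hedged sketch of that setup, using scikit-learn's MLPRegressor with two hidden layers and synthetic stand-in data (the real feature sizes and hyperparameters are my assumptions):

# Feedforward network for grade prediction: 90/10 train/test split,
# evaluated with the Spearman rank correlation as in the results below.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))        # stand-in for vectorized essays
y = rng.uniform(0, 12, size=1000)       # stand-in grades

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)

model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500)
model.fit(X_train, y_train)
rho, _ = spearmanr(y_test, model.predict(X_test))
print(rho)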



Results

Having trained the neural network on 90% of the data and tested it on the remaining 10%, we get a highest Spearman score of 0.9369, with fairly low computation time. These are strong results, indicating that our model is highly accurate.

Automatic Essay/Text Grading

We need to grade our texts in the browser, but the trained model often comes in another form, like the HDF5 format. We need to convert it into a JSON file format understood by Tensorflow.js so that it can be used, since the front end is written in JavaScript. We can also use ONNX- https://onnx.ai/
ONNX is an AI ecosystem that can be used to convert models from one library to another. In that case we would convert the scikit-learn model and load it into Tensorflow.js.
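As a sketch of the conversion step (assuming the grader was retrained in Keras and saved as HDF5, and that the tensorflowjs pip package is installed; the file names are hypothetical):

# Convert a Keras HDF5 model into the Tensorflow.js JSON format.
import tensorflowjs as tfjs
from tensorflow import keras

model = keras.models.load_model('grader.h5')            # hypothetical file
tfjs.converters.save_keras_model(model, 'tfjs_grader')
# Produces tfjs_grader/model.json plus binary weight shards, which the
# browser can load with tf.loadLayersModel('tfjs_grader/model.json').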

Display scores and Payment Functionality.

Result


The validity is actually an average. Let me explain:-
The GPT-2 score is, say, score1, a scalar value, and the value from the plagiarism API is the second score, say score2. We take the average of these two scores,
(score1 + score2) / 2,
which is the validity you see in the picture. Next we create a grading score, using the scikit-learn model mentioned above; it also outputs a scalar value, which is our text score. It is a float value, but we convert it to an integer and display it as HTML to the user. Once we have that, we have both a validity score and a grade. Finally, we want to add the crucial payment functionality, which is easily done with Stripe; we create a button for it in index.html as well. Stripe will take care of our payment, card and banking details.
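In code, the validity calculation is as simple as it sounds (a trivial sketch; the names are mine):

# Validity = plain average of the GPT-2 score and the plagiarism score,
# both assumed to be scalars in [0, 1].
def validity(score1, score2):
    return (score1 + score2) / 2

text_score = int(7.6)                  # the model's float grade, truncated for display
print(validity(0.8, 0.6), text_score)  # -> 0.7 7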
Stripe Official Website:- https://stripe.com/




A sample is shown below; copy and paste it into index.html, using your own API key:
// via StackOverflow
var handler = StripeCheckout.configure({
  key: 'XXX', // your publishable API key
  image: 'https://stripe.com/img/documentation/checkout/marketplace.png',
  locale: 'auto',
  token: function(token) {
    // Send token.id to your server to create the actual charge.
  }
});
We then run it in the browser to check our results.

Conclusion

We did not create the technology; somebody else made the scientific discovery, and we implemented the engineering solution. But we can now serve it to people, and in a way that can become a sustainable business, where you can pay yourself and hire people to improve the product. We can then build continuous training pipelines.
We could obviously have better UI design and cleaner Flask routing; this was kind of like jumbling JavaScript, HTML and CSS all together. I used the model in Python and served it with JavaScript, but ideally all of that happens in the same language. That is something we can improve, and this code is open source, so use it as you'd like.



Credits/Citations for this Post

1) Stripe- GitHub
2) Firebase- GitHub
3) Tensorflow.js- GitHub
4) The GLTR team
5) Siraj Raval
