This was an assignment from Mrs Ugwuishiwu for a course called Artificial Intelligence, so the first thing I did was to source a dataset, which I found on Kaggle.
I looked for the raw CSV links on GitHub, then headed over to Google Colab.
Import the dataset using pandas
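Here's a minimal sketch of the import step, assuming the dataset ships as two CSVs (one for fake news, one for real news); the GitHub URLs below are placeholders for your own raw links:

import pandas as pd

# placeholder URLs; replace with the raw GitHub links to your Kaggle CSVs
fake_df = pd.read_csv('https://raw.githubusercontent.com/<user>/<repo>/main/Fake.csv')
true_df = pd.read_csv('https://raw.githubusercontent.com/<user>/<repo>/main/True.csv')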
Now I had to combine these datasets and label them, such that fake news is assigned 0 while real news is assigned 1.
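Something like this does the labelling and combining, assuming the two DataFrames from the previous step:

fake_df['label'] = 0  # fake news
true_df['label'] = 1  # real news

df = pd.concat([fake_df, true_df], ignore_index=True)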
The next step is to create a content column that combines the title and text of our dataset.
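Assuming the dataset has title and text columns (as the Kaggle fake/real news dataset does), this is a one-liner:

df['content'] = df['title'] + ' ' + df['text']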
From the above, you can see that the content column has been added.
Now we can clean the content column of anything that's not a space or a letter, and then convert every letter in the content to lowercase.
import re

def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # remove everything that's not a letter or a space
    text = text.lower()                      # convert all text to lowercase
    return text

df['content'] = df['content'].apply(clean_text)  # apply our clean_text function to the content column
The next step is to remove common words from the content, i.e. words like so, the, on, and that, which do not carry much meaning on their own. These are known as stopwords.
We first have to download a set of these commonly used words from NLTK, the Natural Language Toolkit, which provides these word sets along with other utilities.
After the above import, we then download the stopwords corpus using nltk, as shown below.
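nltk.download fetches the corpus into the Colab environment; a minimal version of this step:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # English stopwords like 'so', 'the', 'on'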
Now we can proceed to write a function that strips the content column of every stopword in it.
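A sketch of such a function, using the stop_words set from the previous step (remove_stopwords is just an illustrative name):

def remove_stopwords(text):
    # keep only the words that are not in the stopword set
    return ' '.join(word for word in text.split() if word not in stop_words)

df['content'] = df['content'].apply(remove_stopwords)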
The next step is to reduce every word in the content column to its root form, e.g. running → run, singing → sing. This is referred to as lemmatization or stemming.
NLTK has a class that handles this operation for us: WordNetLemmatizer from the nltk.stem module.
We can now write a function that reduces words to their root form.
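A sketch of the lemmatization step; note that WordNetLemmatizer needs the wordnet corpus downloaded first (lemmatize_text is an illustrative name):

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer relies on the WordNet corpus
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # reduce each word to its root form, e.g. 'cars' -> 'car'
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

df['content'] = df['content'].apply(lemmatize_text)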
The next step is to split our dataset into features (X) and labels (y): X will be the content column, while y will be the label column.
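In code, that's simply:

X = df['content']  # features: the cleaned text
y = df['label']    # labels: 0 for fake, 1 for real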
Next, we split the dataset into training and test sets for our model. We'll do this using the train_test_split function from scikit-learn.
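A typical split, assuming an 80/20 ratio (the test_size and random_state values are choices, not requirements):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)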
Remember that computers only understand numbers, not text, which means we have to convert the textual content column into its numerical equivalent.
We'll be using a feature extraction class from scikit-learn called TfidfVectorizer.
Then we proceed to convert the text to numeric values.
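A sketch of the vectorization step; the vectorizer is fitted on the training data only, and the same learned vocabulary is then used to transform the test data:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary + transform train set
X_test_tfidf = vectorizer.transform(X_test)        # reuse the same vocabulary on the test set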
Now we've converted the text to its numerical equivalent.
It’s time to train our model
First, we'll use Logistic Regression.
Next, we train the model using our training data that has been transformed into its numeric form.
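A minimal training-and-evaluation sketch (max_iter=1000 is just a safety margin so the solver converges):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

# evaluate on the held-out test set
predictions = model.predict(X_test_tfidf)
print('Accuracy:', accuracy_score(y_test, predictions))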
You can extend the project by using Naive Bayes or Random Forest, following the same steps as above.
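Either model can be dropped in place of LogisticRegression with the rest of the pipeline unchanged; for example:

from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

nb_model = MultinomialNB().fit(X_train_tfidf, y_train)           # pairs well with TF-IDF features
rf_model = RandomForestClassifier().fit(X_train_tfidf, y_train)  # an ensemble of decision trees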
You can also test the model on a random article: clean and vectorize the article, then have the model predict whether it is real or fake.
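A sketch of what that could look like, reusing the preprocessing helpers from earlier (the article text here is a made-up example):

article = "Breaking news: scientists make a shocking discovery..."  # any article you want to test

# run the article through the same cleaning steps used in training
processed = lemmatize_text(remove_stopwords(clean_text(article)))
vector = vectorizer.transform([processed])

print('Real' if model.predict(vector)[0] == 1 else 'Fake')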