We all have at least one or two subjects (or maybe all of them) that we haven’t prepared for, with exams coming up in a few days. In that case, we prepare the important topics or questions given by our guide or friends. But what if you don’t want to depend on someone else’s opinion for the important topics, and would rather figure them out on your own?
With the help of old question papers for that subject, we can figure out the important questions ourselves, manually.
Let’s try to automate that manual task.
For my exams, I downloaded 4–5 years of old exam papers online. I have 6 PDFs of old papers for my subject, “Big Data”, stored in a directory for that subject.
So the logic here is to:
1. Extract text from the PDFs, and find the questions using the grep command.
2. Filter out grammar words and other words we don’t need.
3. Count the most repeated words and sort them accordingly.
4. Use the fzf tool to find the questions containing those repeating words.
Let’s see how to do it and also check if it worked for my exam.
1. Collecting questions from the PDFs
First, the PDFs need to be converted to text; I’m using pdfgrep for this.
pdfgrep is a command-line utility to search text in PDF files.
sudo apt install pdfgrep
My question paper has the following pattern.
Each question has 3 sub-questions (a, b, c), where each line ends with the marks for that sub-question (03, 04, or 07).
It’s usually good to start with the higher-mark questions, as they repeat across papers more often than the lower-mark ones.
Grep the questions worth 07 marks with the following command:
pdfgrep "" bda/*.pdf | grep -E "07$" > 7.txt
pdfgrep "" bda/*.pdf prints all the text present in the PDFs, which is piped to grep as standard input.
grep -E "07$" matches the lines ending with 07, and the output is stored in 7.txt.
Words to be ignored
There will be many words that are very common but of no use to us, like “is”, “what”, “the”, “which”, “are”, etc.
I’m using a Grammar.pdf and an exam_paper.pdf from some field unrelated to I.T. and my subject (Chemistry, say) to collect the words to ignore into
ignore_words.txt, as those files won’t contain the technical words related to my subject, “Big Data”.
pdfgrep --no-filename "" grammer.pdf chemestry.pdf >ignore_words.txt
2. On to Python
The following steps get us the in-scope words:
1. Read all the words from the question-paper text (7.txt in this case) into a list, and the words to be ignored (ignore_words.txt) into another.
2. Remove the words we need to ignore (the grammar words).
3. Finally, count and sort the filtered words, and store the result as a text file.
The function below returns all the words from a text file, using a regex pattern that matches runs of the characters A-Z, a-z, and 0-9.
This function will be used as follows,
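The original code isn’t embedded here, so this is a minimal sketch of such a function; the name get_words and the lowercasing are my assumptions:

```python
import re

def get_words(filename):
    """Return all words from a text file, lowercased.

    The regex keeps runs of letters and digits and drops everything else
    (punctuation, whitespace).
    """
    with open(filename) as f:
        text = f.read().lower()
    return re.findall(r"[a-z0-9]+", text)
```

For example, a line like "What is HDFS? 07" would yield the words ["what", "is", "hdfs", "07"].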
Next, remove the words we need to ignore and count the most common words. In order to filter the in-scope words, here is the code.
Here a list named final contains the words from 7.txt, filtered against the words present in
ignore_words.txt; the most common words in final are then counted with the Counter module, which gives us each word and the number of times it’s repeated.
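As a rough sketch of that step (the list name final comes from this article; the toy word lists below are stand-ins for what would be read from 7.txt and ignore_words.txt):

```python
from collections import Counter

# Toy stand-ins for the word lists read from 7.txt and ignore_words.txt
question_words = ["what", "is", "hdfs", "explain", "hdfs", "architecture", "hdfs"]
ignore = {"what", "is", "explain", "the"}

# Keep only the words that are not present in the ignore list
final = [w for w in question_words if w not in ignore]

# Count how many times each remaining word is repeated, most common first
counts = Counter(final).most_common()
print(counts)  # [('hdfs', 3), ('architecture', 1)]
```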
hdfs has been repeated 17 times.
"07" has occurred 60 times, as it’s present on every question, so 07 can be ignored by writing "07" into ignore_words.txt.
The final Python code looks as follows; the output is stored in final.txt through the main function.
# using the above script
# output stored at final.txt
main(
    question_papers="7.txt",
    ignore_words="ignore_words.txt"
)
Running the Python script gives the filtered words in final.txt.
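Since the full script isn’t embedded here, this is a self-contained sketch of what it might look like; everything except main, question_papers, and ignore_words is my naming, and the regex and lowercasing are assumptions too:

```python
import re
from collections import Counter

def get_words(filename):
    # Return lowercased runs of letters/digits from a text file
    with open(filename) as f:
        return re.findall(r"[a-z0-9]+", f.read().lower())

def main(question_papers, ignore_words, output="final.txt"):
    # Drop every word that also appears in the ignore file, then
    # write "word count" lines to the output, most frequent first
    ignore = set(get_words(ignore_words))
    final = [w for w in get_words(question_papers) if w not in ignore]
    with open(output, "w") as out:
        for word, count in Counter(final).most_common():
            out.write(f"{word} {count}\n")
```

Calling main(question_papers="7.txt", ignore_words="ignore_words.txt") then writes the sorted word counts to final.txt.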
3. Finding questions containing common words using Linux tools
Now that we have the words we need to find in our question papers through
final.txt, I’m using the fzf tool for that.
In final.txt, the word “hdfs” has come up “17” times in the questions.
The same steps can be done for the other words present in
final.txt, e.g. Hadoop, Spark.
Using the fzf tool to find lines containing certain words
fzf is a general-purpose command-line fuzzy finder.
With grep, we would have to run a command every time we search for a word; fzf provides an interactive mode that makes this easy.
cat 7.txt | fzf --exact
From this, we get all the questions related to HDFS, and many of them are similar too. We can note down these questions and prepare them first.
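If you’d rather not use the interactive mode, the same lookup can be scripted with plain grep. This is an alternative sketch, not the method above; the demo files stand in for 7.txt and final.txt:

```shell
# Demo stand-ins for 7.txt (questions) and final.txt ("word count" lines)
printf 'What is HDFS 07\nExplain MapReduce 07\n' > questions_demo.txt
printf 'hdfs 17\n' > words_demo.txt

# For each frequent word, print the question lines that contain it
while read -r word count; do
    echo "== $word ($count) =="
    grep -i "$word" questions_demo.txt
done < words_demo.txt
```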
How it actually worked out!
Well, it was good, not great, but it did the job of getting topics to learn to score an average mark: from final.txt, I prepared the questions whose word count was not below 3.
Here I’m showing the 7-mark questions asked during my exams; there were 8 questions in total, of which any 4 had to be attempted.
At the bottom right are the questions present in the old papers (the 7-mark questions, based on the words in final.txt).