I found the important questions for my exam by writing a script

We all have at least 1 or 2 subjects, or maybe all of them, that we haven’t prepared for with exams only days away. In that case, we prepare the important topics or questions given by our guide or friends. But what if you don’t want to depend on someone else’s opinion and would rather figure out the important topics on your own?

With the help of old question papers for a subject, we can figure out the important questions ourselves, manually.

Let’s try to automate the above manual task.

For my exams, I downloaded the exam papers from the last 4–5 years online. I have 6 PDFs of old papers for my subject “Big Data”, stored in a directory for that subject called bda.

downloaded old papers in folder Big Data Analysis

So the logic here is to:

1. Extract text from the PDFs and find the questions using the grep command.
2. Filter out grammar words and other words that aren’t required.
3. Count the most repeated words and sort them accordingly.
4. Use the fzf tool to find the questions containing those repeated words.

Let’s see how to do it and also check if it worked for my exam.

1. Collecting questions from the PDFs

First, the PDFs need to be converted to text, for which I’m using pdfgrep. pdfgrep is a command-line utility to search text in PDF files.

Ubuntu/Debian: sudo apt install pdfgrep

My question paper has the following pattern.

question pattern

Each question has 3 sub-questions (a, b, c), and the last line of each one ends with its marks (03, 04 or 07).

It’s usually good to start with the higher-mark questions, as they tend to be more common than the lower-mark ones.

The 07-mark questions can be grepped with the following command:

pdfgrep "" bda/*.pdf | grep -E "07$"  > 7.txt

Here, pdfgrep "" bda/*.pdf prints all the text present in the PDFs, which is piped as standard input to grep. grep -E "07$" then keeps only the lines ending with 07, and the output is stored in 7.txt.
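If you want the 03, 04 and 07 questions in one pass, the same filter could be sketched in Python. This is only an illustration: group_by_marks is a hypothetical helper name, it expects the lines of the pdfgrep output, and unlike grep -E "07$" it also tolerates trailing spaces after the marks.

```python
import re
from collections import defaultdict

# Assumption: the marks are one of 03, 04, 07 and appear as the last token on a line.
MARKS_AT_END = re.compile(r"\b(03|04|07)\s*$")


def group_by_marks(lines):
    """Group question lines by the marks they end with (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        match = MARKS_AT_END.search(line)
        if match:
            # Key is the marks string, value is the list of matching lines.
            groups[match.group(1)].append(line.strip())
    return dict(groups)
```

The lines under the "07" key would then play the role of 7.txt in the rest of this post.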

pdfgrep output

Words to be ignored

There will be many words that are very common but of no use to us, like “is, what, the, which, are” etc.

I’m using a grammar PDF and an exam paper from a field unrelated to I.T. and my subject (Chemistry, for example), and storing their text in ignore_words.txt. The unique words from it become the ignore list, as it won’t contain most of the technical words related to my subject “Big Data”.

pdfgrep --no-filename "" grammer.pdf chemestry.pdf >ignore_words.txt

2. On to Python

These are the steps to get the in-scope words:

1: Collect all the words from the question paper text (7.txt in this case) into a list, and all the words to be ignored (ignore_words.txt) into a set.
2: Remove the words we need to ignore (grammar words).
3: Finally, count and sort the filtered output, and store it as a text file.

The function below returns all the words from a text file, using a regex pattern that matches runs of the characters A-Z, a-z, 0-9, - and _.

This function will be used as follows,

words from questions 7.txt
set words to be ignored
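A minimal sketch of such a function and the two calls could look like this. The names get_words and load_inputs, the exact pattern, and the lowercasing are assumptions made for illustration, not necessarily what the original script used.

```python
import re

# Assumed pattern: runs of letters, digits, hyphens and underscores.
WORD_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def get_words(path):
    """Return every word in the text file at `path`, lowercased (an assumption)."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return WORD_PATTERN.findall(f.read().lower())


def load_inputs(questions_path, ignore_path):
    """Return (list of question words, set of words to ignore)."""
    words = get_words(questions_path)      # list: duplicates are kept for counting
    ignore = set(get_words(ignore_path))   # set: unique words, fast membership tests
    return words, ignore
```

For the files in this post, the call would be words, ignore = load_inputs("7.txt", "ignore_words.txt").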

Removing the words we need to ignore and counting the most common words.

In order to filter the in-scope words, here is the code.
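A minimal sketch of this filtering and counting step, assuming the question words are already in a list and the ignored words in a set. The function name count_in_scope is hypothetical.

```python
from collections import Counter


def count_in_scope(words, ignore):
    """Drop ignored words, then return (word, count) pairs, most common first."""
    # Keep only the words that are not in the ignore set.
    final = [word for word in words if word not in ignore]
    # Counter tallies each word; most_common() sorts by descending count.
    return Counter(final).most_common()
```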

final output containing filtered words and sorted according to most common words

Here, a list named final contains the words from 7.txt that are not present in ignore_words.txt. The most common words in final are counted using Counter from the collections module, which gives each word and the number of times it is repeated in 7.txt. E.g. the word hdfs has been repeated 17 times.

"07" has occurred 60 times because it is present on every question line, so it can be ignored by adding "07" to ignore_words.txt.

The final Python code looks as follows; the main function stores the output in final.txt.

# using the above script
# output is stored in final.txt
main( question_papers="7.txt", ignore_words="ignore_words.txt" )
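Putting the steps together, a main function with that signature could be sketched like this. The output parameter, the "word count" line format and the lowercasing are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed pattern: runs of letters, digits, hyphens and underscores.
WORD_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def get_words(path):
    """Return every word in the text file at `path`, lowercased (an assumption)."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return WORD_PATTERN.findall(f.read().lower())


def main(question_papers, ignore_words, output="final.txt"):
    """Count the in-scope words and write them to `output`, most common first."""
    words = get_words(question_papers)
    ignore = set(get_words(ignore_words))
    counts = Counter(word for word in words if word not in ignore)
    with open(output, "w", encoding="utf-8") as out:
        for word, count in counts.most_common():
            # Assumed output format: one "word count" pair per line.
            out.write(f"{word} {count}\n")
    return counts
```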

running python script, gives filtered words, at final.txt

final in scope, most common words

3. Finding questions containing common words using Linux tools

Now that final.txt gives us the words to look for in our question papers, I’m using the fzf tool to find them.

In final.txt, the word hdfs has come up 17 times in the questions

The same steps can also be repeated for other words present in final.txt, e.g. Hadoop, spark.

Using fzf tool to find lines containing certain words

fzf is a general-purpose command-line fuzzy finder.

With grep, we would have to run a command every time we search for a word, whereas fzf provides an interactive mode that makes this easy.

Command: cat 7.txt | fzf --exact

output: cat 7.txt | fzf -e

So from this, we get all the questions related to HDFS, and many of them are similar too. I can note these questions down and prepare for them first.
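If you would rather dump the matching questions for a word without the interactive prompt, a rough Python equivalent could look like this. It is a sketch only: questions_containing is a hypothetical helper, and it does a plain case-insensitive substring match rather than reproducing fzf’s matching rules.

```python
def questions_containing(lines, word):
    """Return the lines that contain `word`, ignoring case (hypothetical helper)."""
    needle = word.lower()
    return [line.strip() for line in lines if needle in line.lower()]
```

For example, calling it with the lines of 7.txt and the word hdfs would list every 07-mark line that mentions HDFS.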

How it actually worked out!

Well, it was good, not great, but it does the job of finding topics to study for an average score. From final.txt, I prepared for the questions containing words whose count was not below 3.

Here are the 7-mark questions asked in my actual exam. There were 8 questions in total, of which any 4 had to be attempted.

On the bottom right are the 7-mark questions present in old papers, found using the words in final.txt.

1 question on NoSQL types (present in old papers)
2 questions on MongoDB (present in old papers)
0 questions on HDFS :(
