Scala Hackathon

Big Data    On Thursday 22nd of June 2017 12:10:06 PM By Suraz Ghimire
Assignment: Tokenize the words.

Objective: In India, many roads are named after dignitaries such as Gandhi, Ambedkar and Nehru. Such a road name must not carry any offensive or sensitive word alongside the dignitary's name.
Eg: Gandhi road is fine, but Mad Gandhi road is not.
So we want to list every token that appears alongside a dignitary's name, so that the harmful ones can be spotted and removed from the road names.

Dataset: You should prepare the dataset on your own.
Use Google to search for place names that contain these words (gandhi, ambedkar, kalam, Rajiv, Nehru, Tagore, Patel, Teresa).
Find a minimum of 5 places for each of these names.

Also include some other names which do not contain any such words.
Eg: Deendayal road, swami nagar, gachibowli fly over road etc.

Assume the collected data is present in a file called data.txt. This file contains only 1 column in each row.
Sample data.txt
gandhi road
bapu gandhi road 3
gandhinagar 4th lane
Deendayal road
swami nagar
gachibowli fly over road

Write a Scala project that:

1. Reads the data from the file and inserts it into a database table. You can use any database of your choice.
     Your code should create the database and the table, and insert the data from the file.

2. Once the data is loaded successfully, queries the database so that you get all the records which contain such dignitary names.
      eg  Gandhi road,
             bapu gandhi road

3. Tokenizes the selected records completely. The above sample data.txt will be tokenized as shown below.
      Sample data.txt (only 3 records will be selected, as the other 3 do not contain the dignitary names)
gandhi road
bapu gandhi road 3
gandhinagar 4th lane

gandhi, road, bapu, gandhi, road, 3, gandhinagar, 4th, lane

4. From the above tokenized data (tokenized_data.txt), remove only plain numbers and ordinal words like 1st, 2nd, 3rd, 100th etc.
     So now the data will be:
     gandhi, road, bapu, gandhi, road, gandhinagar, lane              // 3 and 4th removed.

5. Now also remove all the dignitary words from the above list.
    road, bapu, road, nagar, lane     // Every gandhi is removed. Please observe that gandhinagar is changed to just nagar.

6. Do a word count now. (Make sure to change all the words to lowercase, so that you don't count Gandhi and gandhi as 2 different words.)
    road 2
    bapu 1
    nagar 1
    lane 1

7. Write the output to a file result.txt. (Write the data in descending order of the count. The most frequent word should appear at the top.)
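
Steps 3 to 7 can be sketched in plain Scala as below. This is only one possible approach, not a reference solution: the object name RoadTokens and its method names are made up for illustration, the dignitary list is simply the one from the Dataset section in lowercase, and the input is assumed to be the records already returned by the query in step 2.

```scala
object RoadTokens {
  // Dignitary names from the Dataset section, lowercased (illustrative constant).
  val dignitaries: Seq[String] =
    Seq("gandhi", "ambedkar", "kalam", "rajiv", "nehru", "tagore", "patel", "teresa")

  // Step 3: lowercase each record and split it on whitespace or commas.
  def tokenize(records: Seq[String]): Seq[String] =
    records.flatMap(_.toLowerCase.split("[\\s,]+").toSeq).filter(_.nonEmpty)

  // Step 4: drop plain numbers and ordinals such as 1st, 2nd, 3rd, 100th.
  def dropNumbers(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(_.matches("\\d+(st|nd|rd|th)?"))

  // Step 5: strip every dignitary name out of each token, so that
  // "gandhinagar" becomes "nagar"; tokens that end up empty are dropped.
  def dropDignitaries(tokens: Seq[String]): Seq[String] =
    tokens
      .map(t => dignitaries.foldLeft(t)((acc, name) => acc.replace(name, "")))
      .filter(_.nonEmpty)

  // Steps 6 and 7: count each word and order by descending count.
  // sortBy is stable, so words with equal counts stay in first-seen order.
  def wordCount(tokens: Seq[String]): Seq[(String, Int)] = {
    val counts = tokens.groupBy(identity).map { case (w, ws) => w -> ws.size }
    tokens.distinct.map(w => w -> counts(w)).sortBy { case (_, n) => -n }
  }

  // One "word count" line per entry, ready to be written to result.txt.
  def render(counts: Seq[(String, Int)]): Seq[String] =
    counts.map { case (w, n) => s"$w $n" }
}
```

For the three selected sample records, chaining tokenize, dropNumbers, dropDignitaries and wordCount gives road 2, bapu 1, nagar 1, lane 1. Removing the names as substrings is a design choice that follows the gandhinagar example in step 5; it would also trim a name that happens to sit inside an unrelated word.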

Your final product should be a Jar file, which will have a main method that accepts 2 parameters.
1. Input path - A file or folder.
     If it is a file, read that file into the database, find all the tokens other than the dignitary words and get the count.
     If it is a folder, get the list of all the files present in that directory, read them into the database, find all the tokens other than the dignitary words and get the count.
2. Output path - The file that will have the word and the count. The words are tokens other than the dignitary words. This file is sorted in descending order of count.
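
The input handling and the database part could look roughly like the sketch below. Everything named here is an assumption made for illustration: the object RoadLoader, the table roads with a single name column, and the in-memory H2 URL jdbc:h2:mem:roads (which needs the H2 driver on the classpath). Any other JDBC database would work the same way with its own URL. The tokenize and count steps are left out on purpose.

```scala
import java.io.File
import java.sql.{Connection, DriverManager}
import scala.io.Source

object RoadLoader {
  // Dignitary names from the Dataset section, lowercased (illustrative constant).
  val dignitaries: Seq[String] =
    Seq("gandhi", "ambedkar", "kalam", "rajiv", "nehru", "tagore", "patel", "teresa")

  // A file yields itself; a folder yields the regular files directly inside it.
  def inputFiles(path: File): Seq[File] =
    if (path.isDirectory)
      Option(path.listFiles).map(_.toSeq).getOrElse(Seq.empty)
        .filter(_.isFile).sortBy(_.getName)
    else Seq(path)

  // SELECT with one LIKE placeholder per dignitary name.
  def selectSql(table: String, names: Seq[String]): String =
    names.map(_ => "LOWER(name) LIKE ?")
      .mkString(s"SELECT name FROM $table WHERE ", " OR ", "")

  // Step 1: create the table (hypothetical name "roads") and insert every non-empty line.
  def load(conn: Connection, files: Seq[File]): Unit = {
    val st = conn.createStatement()
    st.executeUpdate("CREATE TABLE IF NOT EXISTS roads (name VARCHAR(255))")
    st.close()
    val insert = conn.prepareStatement("INSERT INTO roads (name) VALUES (?)")
    for (f <- files) {
      val src = Source.fromFile(f)
      try src.getLines().map(_.trim).filter(_.nonEmpty).foreach { line =>
        insert.setString(1, line)
        insert.executeUpdate()
      } finally src.close()
    }
    insert.close()
  }

  // Step 2: fetch the records that contain any dignitary name.
  def selectDignitaryRoads(conn: Connection): Seq[String] = {
    val ps = conn.prepareStatement(selectSql("roads", dignitaries))
    dignitaries.zipWithIndex.foreach { case (n, i) => ps.setString(i + 1, s"%$n%") }
    val rs = ps.executeQuery()
    val rows = Iterator.continually(rs).takeWhile(_.next()).map(_.getString(1)).toList
    ps.close()
    rows
  }

  def main(args: Array[String]): Unit = {
    require(args.length == 2, "usage: <input path> <output path>")
    // Assumed JDBC URL; replace it with the URL of the database you chose.
    val conn = DriverManager.getConnection("jdbc:h2:mem:roads")
    try {
      load(conn, inputFiles(new File(args(0))))
      val rows = selectDignitaryRoads(conn)
      // The tokenize, clean-up and word-count steps would run on `rows` here,
      // and their result would be written to args(1).
      rows.foreach(println)
    } finally conn.close()
  }
}
```

Using a prepared statement with placeholders keeps the dignitary names out of the SQL text itself, so the same query works whichever names are put in the list.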

Note: The jar should be a fat/uber jar, and you should be able to run it without any additional dependencies.
You should solve this project in 6 hours.

Thank you.
