This MapReduce job builds an inverted index for search. You can use it to search any large body of text.
I run it on AWS EMR, but it is portable to any Hadoop platform.
To find a book in a library, there are two methods:
- Method 1: Forward Index => each book maps to the words it contains, so finding a keyword means scanning every book; too slow
- Method 2: Inverted Index => each keyword maps to the list of book IDs that contain it
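
The difference between the two methods can be sketched in plain Java. This is only an illustration with made-up sample data; the class name `IndexLookupSketch`, the book IDs, and the word lists are all hypothetical and not taken from this repo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IndexLookupSketch {
    // Forward index: book id -> words in that book (hypothetical sample data).
    static final Map<String, List<String>> FORWARD = new TreeMap<>(Map.of(
        "book1", List.of("hadoop", "cluster"),
        "book2", List.of("search", "hadoop"),
        "book3", List.of("library", "search")));

    // Forward lookup: every book has to be scanned for the keyword.
    static List<String> findWithForwardIndex(String keyword) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : FORWARD.entrySet()) {
            if (e.getValue().contains(keyword)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Invert the forward index: word -> sorted set of book ids.
    static Map<String, Set<String>> invert(Map<String, List<String>> forward) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        forward.forEach((book, words) -> {
            for (String w : words) {
                inverted.computeIfAbsent(w, k -> new TreeSet<>()).add(book);
            }
        });
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> inverted = invert(FORWARD);
        System.out.println(findWithForwardIndex("hadoop")); // scans all books
        System.out.println(inverted.get("hadoop"));          // one map lookup
    }
}
```

Both calls return the same books; the inverted index simply pays the scanning cost once, when the index is built, instead of on every search.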
This project uses the MapReduce framework to create an inverted index for a given text.
The mapper splits the text into words and emits a < word, location > pair for each word.
The reducer merges all the pairs that share the same word.
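
A minimal in-memory sketch of the two phases is shown here. It deliberately avoids the Hadoop API so it can run on its own, and it is not the code in this repo: the class name `MapReduceIndexSketch`, the lowercasing, and the split on non-word characters are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class MapReduceIndexSketch {
    // Map phase: emit one <word, location> pair per word in the text.
    // Lowercasing and splitting on non-word characters are assumptions.
    static List<Map.Entry<String, String>> map(String location, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, location));
            }
        }
        return pairs;
    }

    // Reduce phase: merge all pairs that share a word into one sorted set of locations.
    static Map<String, TreeSet<String>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            index.computeIfAbsent(p.getKey(), k -> new TreeSet<>()).add(p.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.addAll(map("doc1.txt", "Hadoop runs MapReduce jobs"));
        pairs.addAll(map("doc2.txt", "MapReduce builds the index"));
        // Prints each word followed by the files it appears in.
        System.out.println(reduce(pairs));
    }
}
```

In a real Hadoop job the framework performs the grouping by key between the two phases; here the `reduce` method does that grouping itself to keep the sketch self-contained.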
Ignore words are the words you choose to exclude from the search results. Supplying this list is optional.
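
As a rough sketch of how such a filter could work, the following plain Java example drops ignored words before they reach the index. The format of the ignore-words file is not documented here, so whitespace-separated words are an assumption, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IgnoreWordsSketch {
    // Parse the ignore-words file contents.
    // Assumption: words are separated by whitespace (for example one per line).
    static Set<String> parseIgnoreWords(String fileContents) {
        Set<String> ignore = new HashSet<>();
        for (String word : fileContents.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                ignore.add(word);
            }
        }
        return ignore;
    }

    // Keep only the words that are not in the ignore set.
    // An empty set models the case where no ignore-words file is supplied.
    static List<String> keepSearchable(String text, Set<String> ignore) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !ignore.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> ignore = parseIgnoreWords("the\na\nof\n");
        System.out.println(keepSearchable("The index of a book", ignore));
    }
}
```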
- Clone this repo
- Run mvn clean, then mvn install
- Create an EC2 key pair PEM file to use for EMR
- Create an S3 bucket
- Upload the jar in this repo to your S3 bucket (you can also make changes and compile your own)
- Upload the input files to the same S3 bucket
- Create an EMR cluster and choose an EMR version that uses Hadoop 2.7.3 (to use a different Hadoop version, change the pom.xml)
- After EMR provisioning finishes, add a step for EMR to read the files from S3 (see https://aws.amazon.com/premiumsupport/knowledge-center/copy-s3-hdfs-emr/)
- Add a step for the custom jar
- For the JAR location, point to the jar in the S3 bucket
- For the arguments, use the following if you wish to supply ignore words:
s3://<your-bucket>/<your-input-folder> s3://<your-bucket>/out s3://<your-bucket>/ignorewords.txt
Use the following if not:
s3://<your-bucket>/<your-input-folder> s3://<your-bucket>/out
- Check the output in the S3 folder
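
The two argument forms above suggest that the job takes an input path, an output path, and an optional ignore-words path. A hypothetical sketch of reading those arguments follows; the class `JobArgsSketch`, its field names, and the usage message are invented for illustration and may not match the driver in this repo.

```java
import java.util.Optional;

class JobArgsSketch {
    final String inputPath;
    final String outputPath;
    final Optional<String> ignoreWordsPath;

    JobArgsSketch(String inputPath, String outputPath, Optional<String> ignoreWordsPath) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        this.ignoreWordsPath = ignoreWordsPath;
    }

    // Accept two arguments (input, output) or three (input, output, ignore words).
    static JobArgsSketch parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: <input> <output> [ignore-words-file]");
        }
        Optional<String> ignore = args.length == 3 ? Optional.of(args[2]) : Optional.empty();
        return new JobArgsSketch(args[0], args[1], ignore);
    }

    public static void main(String[] args) {
        JobArgsSketch parsed = parse(args);
        System.out.println("input:  " + parsed.inputPath);
        System.out.println("output: " + parsed.outputPath);
        System.out.println("ignore: " + parsed.ignoreWordsPath.orElse("(none)"));
    }
}
```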