Design a Lucene Search Engine

tl;dr

This MapReduce job builds an inverted index for search. You can use it to search any large body of text. I run it on AWS EMR, but it is portable to any Hadoop platform.

What is an inverted index

To find a book in a library, there are two methods:

Method 1: Forward Index => scan the index of every book to see whether it contains the keyword; too slow

Method 2: Inverted Index => look up the keyword to get the list of book IDs that contain it
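
As a rough illustration of the difference between the two methods, the sketch below searches a tiny in-memory library both ways. The class name, method names, and sample books are made up for this example and are not part of this repo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from this repo.
class LibraryLookupSketch {

    // Forward index: book id -> words in that book.
    // Finding a keyword means checking every book.
    static List<String> searchForward(Map<String, Set<String>> forward, String keyword) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> book : forward.entrySet()) {
            if (book.getValue().contains(keyword)) {
                hits.add(book.getKey());
            }
        }
        return hits;
    }

    // Inverted index: keyword -> book ids. A single map lookup.
    static List<String> searchInverted(Map<String, List<String>> inverted, String keyword) {
        List<String> hits = inverted.get(keyword);
        return hits == null ? Collections.<String>emptyList() : hits;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> forward = new TreeMap<>();
        forward.put("book1", new HashSet<>(Arrays.asList("hadoop", "search")));
        forward.put("book2", new HashSet<>(Arrays.asList("lucene", "search")));

        Map<String, List<String>> inverted = new TreeMap<>();
        inverted.put("hadoop", Arrays.asList("book1"));
        inverted.put("lucene", Arrays.asList("book2"));
        inverted.put("search", Arrays.asList("book1", "book2"));

        System.out.println(searchForward(forward, "search"));
        System.out.println(searchInverted(inverted, "search"));
    }
}
```

Both calls return the same books; the forward search touches every book, while the inverted search is one lookup.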

This project uses the MapReduce framework to create an inverted index for a given text.

The mapper splits the text into words and emits a <word, location> pair for each word.

The reducer merges all the pairs that share the same word.
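
The two phases can be sketched in plain Java without Hadoop. This is a minimal in-memory illustration, not the code in src/main/java: the class and method names are hypothetical, the location is assumed to be the file name, and words are assumed to be lowercased and split on non-letter characters.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory sketch; the real job uses Hadoop Mapper/Reducer classes.
class InvertedIndexSketch {

    // Map phase: emit one <word, location> pair per word in the text.
    // Assumption: location is the file name; words are lowercased ASCII letters.
    static List<Map.Entry<String, String>> map(String location, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, location));
            }
        }
        return pairs;
    }

    // Reduce phase: merge pairs that share a word into one sorted,
    // de-duplicated set of locations.
    static Map<String, TreeSet<String>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            index.computeIfAbsent(pair.getKey(), k -> new TreeSet<>()).add(pair.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.addAll(map("a.txt", "Hadoop runs MapReduce"));
        pairs.addAll(map("b.txt", "MapReduce builds the index"));
        System.out.println(reduce(pairs));
    }
}
```

In this example the word "mapreduce" ends up with both a.txt and b.txt in its list, while every other word points to a single file.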

What is ignorewords.txt

This is the list of words that you choose to exclude from the search results. Supplying it is optional.
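
One way to picture the effect of the ignore list is the sketch below. It assumes the file holds one word per line and that matching is case-insensitive; the README does not state the file format, so treat both as assumptions. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the file format (one word per line) is an assumption.
class IgnoreWordsSketch {

    // Build the ignore set from the lines of the file, skipping blank lines.
    static Set<String> loadIgnoreWords(List<String> lines) {
        Set<String> ignore = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                ignore.add(word);
            }
        }
        return ignore;
    }

    // Keep only the words that are not in the ignore set.
    static List<String> filter(List<String> words, Set<String> ignore) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!ignore.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> ignore = loadIgnoreWords(Arrays.asList("the", "a", ""));
        System.out.println(filter(Arrays.asList("the", "index", "a", "search"), ignore));
    }
}
```

When no list is supplied, passing an empty set leaves every word in place.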

Pre-req

Compile the jar

Clone this repo
Build the jar with Maven: mvn clean install

Method 1: Run on AWS EMR

Create an EC2 key pair (PEM file) to use with EMR
Create an S3 bucket
Upload the jar in this repo to your S3 bucket (you can also make changes and compile your own)
Upload the input files to the same S3 bucket
Create an EMR cluster and choose an EMR release that uses Hadoop 2.7.3 (to use a different Hadoop version, change the pom.xml)
After the EMR cluster finishes provisioning, add a step for EMR to read the files from S3 (see https://aws.amazon.com/premiumsupport/knowledge-center/copy-s3-hdfs-emr/)
Add a step for a custom jar

For the JAR location, point to the jar in the S3 bucket
For the arguments, use the following if you wish to supply ignorewords.txt:

s3://<your-bucket>/<your-input-folder> s3://<your-bucket>/out s3://<your-bucket>/ignorewords.txt

Use the following if not:

s3://<your-bucket>/<your-input-folder> s3://<your-bucket>/out

Check the output in the out folder of your S3 bucket
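
The two argument forms in the custom jar step suggest that the third argument is optional. A hypothetical sketch of how a driver could interpret them follows; the class, field, and method names are assumptions and may not match the driver in src/main/java.

```java
// Hypothetical sketch of argument handling; not taken from this repo.
class JobArgsSketch {
    final String inputPath;
    final String outputPath;
    final String ignoreWordsPath; // null when no ignore list is supplied

    JobArgsSketch(String inputPath, String outputPath, String ignoreWordsPath) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        this.ignoreWordsPath = ignoreWordsPath;
    }

    // Accept either <input> <output> or <input> <output> <ignorewords>.
    static JobArgsSketch parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: <input> <output> [ignorewords]");
        }
        String ignore = args.length == 3 ? args[2] : null;
        return new JobArgsSketch(args[0], args[1], ignore);
    }

    public static void main(String[] args) {
        // Placeholder bucket name, used only for this example.
        JobArgsSketch parsed = parse(new String[] {"s3://my-bucket/in", "s3://my-bucket/out"});
        System.out.println(parsed.ignoreWordsPath == null
                ? "no ignore list supplied"
                : "ignore list: " + parsed.ignoreWordsPath);
    }
}
```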
