Chaitin: search engine from scratch

INSTRUCTIONS

Prerequisites

Linux / Mac OS X
2 GB RAM allocatable to java
internet connection (to download scala-library if not present in gradle cache)
Java 8 JDK
- java -version should work
- javac -version should work
Included External Dependencies
- gradle build tool
- scala-library:2.7.11
- specs2 test framework

Build & Run

$ tar -xzvf chaitin.tgz
$ cd chaitin
$ ./gradlew dist
$ cd build/distributions
$ tar -xzvf chaitin-0.1.0.tgz
$ cd chaitin-0.1.0
$ ./chaitin.sh <path/to/simplewiki.tsv>

Sample Output

[INFO ] pid file: /tmp/chaitin.pid
indexing: simplewiki.tsv
....................................................................................................
indexed [100990] docs in [44138] msecs
search> black hole sun
102357	Soundgarden
258424	Black Hole Sun
131337	List of Soundgarden awards
131726	Fell on Black Days
3017	Star
3506	Black hole
131375	Songs from the Superunknown
102485	Superunknown
131328	Soundgarden discography
4570	A Brief History of Time
ranked [26] docs in [126] msecs

search> ^C

RUNTIME PERFORMANCE

indexing
- text tokenization / analysis
  - regex tokenizer
    - O(k) space and time in the size of the analyzed text
  - lowercase token filter
    - O(k) space and time
    - can be done in-place
- creation of in-memory inverted / forward indexes
  - O(N * k * log(k)) time complexity, O(N * k) space complexity (k: dictionary terms in the corpus; N: documents in the corpus)
- practical: 100990 documents indexed in 40 seconds
search
- text tokenization / analysis / Query formation
  - same as during indexing for a field
- query-document matching
  - O(N * log(k)) time complexity (N: documents in corpus; k: terms in query)
  - O(N) space complexity, can be reduced to O(1)
- scoring
  - O(N * j^2) time complexity (all pairs positional distance)
- practical: median query time ~ 40 milliseconds

RANKING STRATEGY

Matching: All word match (Conjunction)
- no fuzzy matching
- no sophisticated analysis/normalization
- no criteria relaxation (quorum, etc.)
Ranking: TF-IDF + proximity (phrase permutation matches are best, graceful degradation)
- O(N^2) Algorithm
- No query time boosts for Multi-field match ranking
- no differentiation between exact phrase and out-of-order phrase matches

POSSIBLE IMPROVEMENTS

In order of priority:

unit tests (completely lacking)
extraneous use of memory for indexing, search. little use of primitive arrays/compression encodings
too much copying of data, boxing/unboxing, may lead to GC pressure
No thread safety between indexing and search, no multi-threading
on-disk index representation for faster reloading
more analyzers
more queries (disjunction/quorum, fuzzy, span)
better dictionary structure for prefix/fuzzy matching
document deletion
segmented indexes, segment merging
no skip-lists while traversing posting lists
code comments
IOC / Configuration / Dependency Injection

CREDITS

scala programming language
my memory of lucene 2.x/3.x code

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
gradle/wrapper		gradle/wrapper
src		src
.gitignore		.gitignore
README.md		README.md
build.gradle		build.gradle
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chaitin: search engine from scratch

About

Releases

Packages

Languages

adichad/chaitin

Folders and files

Latest commit

History

Repository files navigation

Chaitin: search engine from scratch

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages