Design your own MongoDB clone in Java

This article talks about creating your own MongoDB kind of database.

Before we delve into the implementation details let's take a peek on how MongoDB is defined:

As defined in Mongo documentation at https://www.mongodb.com/what-is-mongodb:

MongoDB is a document database with the scalability and flexibility that you want 
with the querying and indexing that you need

* MongoDB stores data in flexible, JSON-like documents, meaning fields can vary 
from document to document and data structure can be changed over time

* The document model maps to the objects in your application code, making data 
easy to work with

* Ad hoc queries, indexing, and real time aggregation provide powerful ways to
access and analyze your data

* MongoDB is a distributed database at its core, so high availability, horizontal 
scaling, and geographic distribution are built in and easy to use

Assumptions

For the simplicity of this article we will not talk about connection handling or the MongoDB REST interface.
We will not touch upon storage of documents to disk
We will not build our own indexing to keep it simple

Document

The primary objective of MongoDB is to store documents in a collections. A group of collections is known as a database.

As we understand a document is a free-form JSON style object containing multiple attributes, composed child objects which in turn have their own attributes. Every MongoDB document has a unique ID, _id, associated with it. Let's define an interface to work with this ID attribute:

public interface AbstractMongoDocument {

    public String getID();
    
    public void setID(String id);
}

The ID above is unique in a given collection and is guranteed to return the document it was assigned to. Different documents in a different collections, in same or different databases, can have the same ID.

For example:

{
  "_id" : "some-unique-id",
  "name" : "Sandeep",
  "title" : "Developer",
  "social" : {
      "github" : "sangupta",
      "twitter" : "sangupta"
  }
}

For simplicity, we will assume and assign a UUID to a document that already does not have a key. In actual scenario, MongoDB generates time-stamped machine-specific sequential values. Read more at https://docs.mongodb.com/manual/reference/method/ObjectId/

A simple implementation could be:

public class AbstractMongoDocument implements AbstractMongoDocument {

    protected String id;
    
    protected Map<String, Object>;
    
    // methods to handle document ID generation
}

Collection

We can now employ a Map to store all these documents based on this ID as the primary key. This gives us a O(1) complexity in terms of access over primary key. Now, given a document we can store it in the DB as:

public final Map<String, AbstractMongoDocument> collection = new HashMap<>();

As we will need methods to work over this collection, let's encapsulate the DS inside a class: MongoCollection

public class MongoCollection {

    protected final Map<String, MongoDocument> collection = new HashMap<>();

    public void storeDocument(MongoDocument doc) {
      if(doc == null) {
        return;
      }
      
      String id = doc.getID();
      if(id == null) {
        id = UUID.randomUUID().toString();
        doc.setID(id);
      }

      this.collection.put(id, doc);
    }
    
}

Database

A database is nothing but a list of named collections. Each database has multiple collections that are identified by name. Thus, we can again utilize a Map structure to store the same.

Map<String, MongoCollection> database = new HashMap<>();

// replacing the MongoCollection with exact datastructure:
Map<String, Map<String, MongoDocument>> database = new HashMap<>();

Convertng the DS to MongoDatabase class:

public class MongoDatabase {

    private final Map<String, MongoCollection> database = new HashMap<>();

    public MongoCollection getCollection(String name) {
        if(name == null || name.equals("")) {
            return null;
        }
        
        return this.database.get(name);
    }
}

Querying

As all documents live as natural Java objects inside the memory, we can run a query over these objects using reflection to make sure the property exists and we can also check child object attributes using the dot attribute.

// somewhere in our code
// find all objects that have the property `firstName` with value `Sandeep`
myDatabase.myCollection.find("firstName", "Sandeep");

// iterate over each document
public List<MongoDocument> find(String name, String value) {
    List<MongoDocument> documents = collection.values();
    
    List<MongoDocument> resultSet = new Arraylist<>();
    for(MongoDocument doc : documents) {
        // check if doc has a property with given name
        // we can use Java reflection for the same
        Field field = ReflectionUtils.getField(doc, name);
        if(field == null) {
            // no such field
            continue;
        }
        
        if(ReflectionUtils.isEqualValue(doc, field, value)) {
            resultSet.add(doc);
        }
    }
    
    return resultSet;
}

I previously developed Gather a simple Java library to run SQL like queries over Java collections. The same can be employed to run these queries.

Mongo GridFS

Per Mongo documentation at https://docs.mongodb.com/manual/core/gridfs/:

GridFS is a specification for storing and retrieving files that exceed the BSON-document size 
limit of 16 MB.

Documents that exceed in size can be stored as a byte[] (byte-array). They do not have child operations over them and thus, the files are either created, read, or deleted. They are not modified in place. We could build something like:

public class GridFile {

    public String name;
    
    public byte[] bytes;

}

public class MongoGridFS {
    
    private final Map<String, GridFile> files = new HashMap<>();
    
    public GridFile getFile(String name) {
        return this.files.get(name);
    }
    
    public void saveGridFile(String name, GridFile file) {
        return this.files.put(name, file);
    }
}

Indexing

Indexing can be implemented in various ways:

Use an embedded Apache lucene to index text fields.
For numeric fields, we could either rely on lucene, or build an MultiMap where the key is field value and the list contains document IDs.

Consistency

We know that MongoDB is eventually-consistent per the CAP theorem. However, for our purpose we can make it consistent by either of the following 2 ways:

Make all incoming operations on the database sequential
Use write locks on MongoDocument object when modifying to prevent concurrent writes to same field
During reads, we clone the MongoDocument object so the next immediate write does not yield us partially updated documents

Resources

FakeMongo source code is a good read on the topic

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

build-mongodb.md

build-mongodb.md

Design your own MongoDB clone in Java

Assumptions

Document

Collection

Database

Querying

Mongo GridFS

Indexing

Consistency

Resources

Files

build-mongodb.md

Latest commit

History

build-mongodb.md

File metadata and controls

Design your own MongoDB clone in Java

Assumptions

Document

Collection

Database

Querying

Mongo GridFS

Indexing

Consistency

Resources