Roadmap / MongoDB Deep Dive
A comprehensive MongoDB roadmap covering documents, CRUD, query operators, schema design, indexing, the aggregation pipeline, relationships, transactions, Mongoose ODM, change streams, security, performance tuning, replica sets, sharding, and production on Atlas.
Step 1 • Setup
MongoDB stores data as BSON documents in collections (not rows in tables). Get started three ways: MongoDB Atlas (fully managed cloud — free M0 cluster is generous for learning), local installation with Docker (docker run -d -p 27017:27017 mongo), or MongoDB Community Server. Compass is the official GUI — explore collections, run queries, view indexes visually, explain query plans. mongosh (MongoDB Shell) is the interactive REPL for running commands. Understand the hierarchy: cluster → database → collection → document. One Atlas free tier account covers all your learning.
Step 2 • Foundations
MongoDB documents are JSON-like objects — flexible, no fixed schema by default. Learn BSON types (String, Int32, Int64, Double, Boolean, Date, ObjectId, Array, null, Binary). CRUD: insertOne/insertMany, findOne/find, updateOne/updateMany ($set, $unset, $push, $pull, $inc), deleteOne/deleteMany, and replaceOne. Projections in find() — include/exclude fields ({name: 1, _id: 0}). Sort, limit, skip for pagination. The _id field is auto-generated as an ObjectId (12-byte unique identifier). Upsert pattern: updateOne({filter}, {$set: ...}, {upsert: true}).
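As a rough sketch of the CRUD calls above, the snippet below assumes the official Node.js driver and a hypothetical "users" collection; the field names (name, email, logins, tags) are made up for illustration. The filter and update documents are plain objects, so the same shapes can be pasted into mongosh.

```javascript
// Hypothetical document for an assumed "users" collection.
const newUser = { name: 'Ada', email: 'ada@example.com', logins: 0, tags: ['admin'] };

// Filter plus cursor options: projection, sort, and skip/limit pagination.
const filter = { tags: 'admin' };
const findOptions = { projection: { name: 1, _id: 0 }, sort: { name: 1 }, skip: 20, limit: 10 };

// Upsert: update the matching document, or insert one if nothing matches.
const upsertFilter = { email: 'ada@example.com' };
const upsertUpdate = { $set: { name: 'Ada' }, $inc: { logins: 1 } };

// Runs the operations against a driver Collection supplied by the caller.
async function crudDemo(users) {
  await users.insertOne(newUser);
  const page = await users.find(filter, findOptions).toArray();
  const result = await users.updateOne(upsertFilter, upsertUpdate, { upsert: true });
  await users.deleteOne({ email: 'ada@example.com' });
  return { page, upserted: result.upsertedCount };
}
```

Calling crudDemo(client.db('app').collection('users')) would run all four operations in order; the database name is likewise an assumption.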
Step 3 • Foundations
MongoDB's query language is rich. Comparison: $eq, $ne, $gt, $gte, $lt, $lte, $in (match any value in a given array), $nin. Logical: $and, $or, $nor, $not. Element: $exists (field presence), $type (BSON type check). Array: $all (contains all elements), $elemMatch (a single element matches all conditions), $size (exact array length). Evaluation: $regex (pattern match), $expr (use aggregation expressions in queries), $where (JS expression; avoid it for performance). Text: $text with a text index for full-text search. Use dot notation with a quoted key to query nested fields ({'address.city': 'London'}).
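A few of these operators, written as filter documents. This is only a sketch: every field name (age, status, email, orders, address, spent, budget) is hypothetical.

```javascript
const filters = {
  // Comparison plus $in: age in [18, 65) and status is one of two values.
  adultsActive: { age: { $gte: 18, $lt: 65 }, status: { $in: ['active', 'trial'] } },
  // Logical $or combined with an element check.
  missingOrEmptyEmail: { $or: [{ email: { $exists: false } }, { email: '' }] },
  // $elemMatch: one single array element must satisfy both conditions.
  hasBigPaidOrder: { orders: { $elemMatch: { total: { $gt: 100 }, paid: true } } },
  // Dot notation: the key is quoted because it contains a dot.
  inLondon: { 'address.city': 'London' },
  // $expr compares two fields of the same document.
  overBudget: { $expr: { $gt: ['$spent', '$budget'] } },
};
```

Each value can be passed as the first argument to find() or countDocuments().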
Step 4 • Foundations
Update operators modify documents in place without replacing them. Field operators: $set (add/update field), $unset (remove field), $rename (rename field), $inc (increment/decrement number), $mul (multiply), $min/$max (update only if the given value is lower/higher than the current one). Array operators: $push (add element), $pull (remove by condition), $addToSet (add only if absent, which prevents duplicates), $pop (remove first/last), and $push with modifiers ($each + $slice + $sort for maintaining sorted, capped arrays). The $, $[] and $[identifier] positional operators update specific array elements. Every update to a single document is atomic.
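The update documents below sketch three of these patterns. The field names (scores, tags, visits, items, sku) are assumptions chosen for the example.

```javascript
// $push with modifiers: add a score, sort descending, keep only the top 5.
const pushTopScore = {
  $push: { scores: { $each: [{ value: 91 }], $sort: { value: -1 }, $slice: 5 } },
};

// Several field operators can be combined in one atomic update.
const tagAndCount = {
  $addToSet: { tags: 'vip' },
  $inc: { visits: 1 },
  $set: { lastSeen: new Date() },
};

// $[item] updates only the array elements selected by arrayFilters.
const markShipped = {
  update: { $set: { 'items.$[item].status': 'shipped' } },
  options: { arrayFilters: [{ 'item.sku': 'A-100' }] },
};
```

With the Node.js driver, the last one would be applied as orders.updateOne(filter, markShipped.update, markShipped.options).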
Step 5 • Data Modeling
Schemaless doesn't mean schema-free in production; it means flexible schema. MongoDB schema design principle: model for how your application queries data, not for how a normalized design would store it. The central choice: embed (store related data in the same document) vs reference (store the other document's _id, much like a SQL foreign key). Embed when data is always queried together, owned by a single parent, and not reused elsewhere. Reference when data is large (documents are capped at 16MB), shared across entities, or updated independently. Anti-patterns: arrays that grow without bound, storing all user activity in one document, and treating MongoDB as a relational database.
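To make the embed vs reference choice concrete, here is a sketch with invented documents. String ids are used for brevity; real collections would normally use ObjectId values.

```javascript
// Embedded: addresses belong to one user and are always read with it.
const userEmbedded = {
  _id: 'u1',
  name: 'Ada',
  addresses: [{ label: 'home', city: 'London' }],
};

// Referenced: posts grow without bound and are updated independently,
// so each post stores the author's _id instead of living inside the user.
const author = { _id: 'u1', name: 'Ada' };
const posts = [
  { _id: 'p1', authorId: 'u1', title: 'Indexes' },
  { _id: 'p2', authorId: 'u2', title: 'Sharding' },
];

// In-memory stand-in for the second query a reference requires.
function postsBy(user) {
  return posts.filter((post) => post.authorId === user._id);
}
```

The embedded shape needs one read; the referenced shape needs a second query (or a $lookup) but keeps the user document small.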
Step 6 • Performance
Without a supporting index, MongoDB falls back to a full collection scan; only the default _id index exists out of the box. Index types: single field, compound (order and direction matter for sort queries), multikey (for arrays, one entry per array element), text (for full-text search), 2dsphere (for geospatial), hashed (for sharding), partial (index only documents matching a filter), TTL (auto-delete documents after N seconds). Covered queries (the index satisfies the query without touching documents) are fastest. Use explain('executionStats') to see if a query uses an index (IXSCAN) or scans the collection (COLLSCAN). The ESR rule for compound indexes: Equality fields first, Sort fields second, Range fields last.
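A sketch of the ESR rule and two special index types, assuming a hypothetical "orders" collection with status, createdAt, total, and couponCode fields.

```javascript
// Query shape: equality on status, sort on createdAt, range on total.
const query = { status: 'paid', total: { $gt: 100 } };
const sort = { createdAt: -1 };

// ESR: Equality (status), then Sort (createdAt), then Range (total).
const esrIndex = { status: 1, createdAt: -1, total: 1 };

// TTL: remove documents one hour after their createdAt timestamp.
const ttlIndex = { keys: { createdAt: 1 }, options: { expireAfterSeconds: 3600 } };

// Partial: index only documents that actually have a coupon code.
const partialIndex = {
  keys: { couponCode: 1 },
  options: { partialFilterExpression: { couponCode: { $exists: true } } },
};

// Creates one index on the Collection supplied by the caller. A field can
// carry only one single-field index, so pick either ttlIndex or a plain one.
async function createOrderIndex(orders, spec) {
  return orders.createIndex(spec.keys, spec.options);
}
```

orders.createIndex(esrIndex) would build the compound index; running the query with explain('executionStats') is the way to confirm it is used.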
Step 7 • Analytics
The aggregation pipeline processes documents through a sequence of stages — each stage transforms the data. Essential stages: $match (filter — put early to use indexes), $project (reshape documents — include/exclude/compute fields), $group (aggregate — _id is the group key, accumulator operators like $sum, $avg, $min, $max, $push, $addToSet), $sort, $limit, $skip, $lookup (JOIN from another collection), $unwind (explode array into separate documents), $addFields, $count, $facet (multiple pipelines in one), $bucket (histogram). The aggregation pipeline replaces complex SQL GROUP BY + JOIN queries.
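The stages above can be sketched as one pipeline. The collection shape (status, createdAt, customerId, and an items array with price and qty) is an assumption made for the example.

```javascript
// Top 10 customers by revenue from paid orders since the start of 2024.
const revenueByCustomer = [
  // Filter first so an index on status/createdAt can be used.
  { $match: { status: 'paid', createdAt: { $gte: new Date('2024-01-01') } } },
  // One output document per line item.
  { $unwind: '$items' },
  // Group by customer, summing price * qty and collecting distinct order ids.
  {
    $group: {
      _id: '$customerId',
      revenue: { $sum: { $multiply: ['$items.price', '$items.qty'] } },
      orders: { $addToSet: '$_id' },
    },
  },
  // Reshape the result and compute the number of distinct orders.
  { $project: { _id: 0, customerId: '$_id', revenue: 1, orderCount: { $size: '$orders' } } },
  { $sort: { revenue: -1 } },
  { $limit: 10 },
];
```

It would be run as orders.aggregate(revenueByCustomer).toArray() with the Node.js driver, or db.orders.aggregate(revenueByCustomer) in mongosh.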
Step 8 • Data Modeling
$lookup performs a left outer join between collections — { from: 'users', localField: 'authorId', foreignField: '_id', as: 'author' }. $lookup with a pipeline allows complex join conditions and sub-aggregations. Understand the trade-off: $lookup is more expensive than embedded documents — use sparingly in hot paths. DBRefs ($ref, $id) are a convention for cross-collection references (rarely used in modern apps — store plain ObjectId instead). Patterns: Outlier Pattern (handle rare large documents differently), Bucket Pattern (group time-series data), Polymorphic Pattern (different document shapes in one collection).
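Both $lookup forms, sketched for a hypothetical "posts" collection joined to "users" and "comments". The postId and createdAt fields on comments are assumptions.

```javascript
// Simple form: equality match between posts.authorId and users._id.
const withAuthor = [
  { $lookup: { from: 'users', localField: 'authorId', foreignField: '_id', as: 'author' } },
  // $lookup always yields an array; flatten it but keep posts with no author.
  { $unwind: { path: '$author', preserveNullAndEmptyArrays: true } },
];

// Pipeline form: join condition plus a sub-aggregation (latest 3 comments).
const withRecentComments = [
  {
    $lookup: {
      from: 'comments',
      let: { postId: '$_id' },
      pipeline: [
        { $match: { $expr: { $eq: ['$postId', '$$postId'] } } },
        { $sort: { createdAt: -1 } },
        { $limit: 3 },
      ],
      as: 'recentComments',
    },
  },
];
```

An index on comments.postId matters here, since the inner pipeline runs once per post.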
Step 9 • Reliability
MongoDB 4.0+ supports multi-document ACID transactions on replica sets, and 4.2+ extends them to sharded clusters. They are useful when atomicity across multiple collections or documents is required (e.g., debit account A + credit account B must both succeed or both fail). Transactions use sessions: call startSession(), then session.startTransaction(), and finish with session.commitTransaction() or session.abortTransaction(). Understand the trade-off: transactions add overhead (latency, retries on write conflicts), so prefer single-document atomicity (embedded documents + update operators) when possible. Don't use transactions as a substitute for good schema design.
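A minimal sketch of the debit/credit example, assuming the Node.js driver. It uses the driver's session.withTransaction() helper, which wraps the start/commit/abort calls and retries on transient errors; the "bank" database, "accounts" collection, and balance field are hypothetical.

```javascript
// Moves `amount` between two accounts atomically, or not at all.
async function transfer(client, fromId, toId, amount) {
  const accounts = client.db('bank').collection('accounts');
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      // Debit only if the balance covers the amount.
      const debit = await accounts.updateOne(
        { _id: fromId, balance: { $gte: amount } },
        { $inc: { balance: -amount } },
        { session }
      );
      // Throwing inside the callback aborts the transaction.
      if (debit.modifiedCount !== 1) throw new Error('insufficient funds');
      await accounts.updateOne({ _id: toId }, { $inc: { balance: amount } }, { session });
    });
  } finally {
    await session.endSession();
  }
}
```

Every operation must receive the session option; a call without it runs outside the transaction.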
Step 10 • Node.js Integration
Mongoose is the most popular MongoDB ODM for Node.js. Define Schemas (field types, required, default, min/max, enum, validate), compile them into Models, and use models for CRUD (Model.find(), Model.create(), doc.save(), doc.deleteOne(); the older doc.remove() was removed in Mongoose 7). Schema.pre/post hooks act as middleware (hash passwords before save, post-process results after find). Virtual fields are derived properties not stored in MongoDB. Population (populate('authorId')) fetches referenced documents automatically. Lean queries (.lean()) return plain JS objects, which is faster and uses less memory. TypeScript: Mongoose ships its own type definitions, and Typegoose is an alternative that defines models with classes and decorators.
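A sketch of a schema with validation, a virtual, and a pre-save hook. The mongoose module is passed in as a parameter so the snippet stays self-contained; the User fields are invented for the example.

```javascript
// Builds a hypothetical User model from the mongoose module passed in.
function defineUserModel(mongoose) {
  const userSchema = new mongoose.Schema({
    email: { type: String, required: true, unique: true, lowercase: true },
    role: { type: String, enum: ['user', 'admin'], default: 'user' },
    age: { type: Number, min: 0, max: 150 },
    firstName: String,
    lastName: String,
  });

  // Virtual: computed on access, never stored in MongoDB.
  userSchema.virtual('fullName').get(function () {
    return `${this.firstName} ${this.lastName}`;
  });

  // Middleware: runs before every save() on a document.
  userSchema.pre('save', async function () {
    if (this.isModified('email')) this.email = this.email.trim();
  });

  return mongoose.model('User', userSchema);
}
```

Typical usage would be const User = defineUserModel(require('mongoose')), followed by User.find({ role: 'admin' }).lean() for read-only queries.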
Step 11 • Real-Time
Change streams allow applications to subscribe to data changes in real time: insertions, updates, deletions, and replacements. They use MongoDB's oplog (operation log) under the hood and require a replica set (Atlas clusters are always replica sets). Open a change stream on a collection, database, or entire cluster. Filter changes with a pipeline ($match on operationType, namespace). Each event carries a resume token; pass the last one you processed to resumeAfter to reconnect without missing events. Common uses: cache invalidation, audit logs, real-time notifications, syncing to Elasticsearch, and event sourcing. Change streams deliver only majority-committed changes, so consumers never see a write that is later rolled back.
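A sketch of opening a filtered, resumable change stream with the Node.js driver. The function and parameter names are hypothetical; persisting the resume token between restarts is left to the caller.

```javascript
// Only forward data-changing events.
const pipeline = [
  { $match: { operationType: { $in: ['insert', 'update', 'replace', 'delete'] } } },
];

// Opens a change stream on `collection`, resuming after `lastToken` if given.
// `onChange` receives each event plus its resume token (the event's _id).
function watchCollection(collection, lastToken, onChange) {
  const options = lastToken ? { resumeAfter: lastToken } : {};
  const stream = collection.watch(pipeline, options);
  stream.on('change', (change) => {
    onChange(change, change._id);
  });
  return stream;
}
```

Storing the token only after an event has been fully processed gives at-least-once handling across reconnects.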
Step 12 • Security
MongoDB Atlas manages most security, but you need to understand the settings. Authentication: SCRAM (username/password), X.509 certificates, LDAP/AD integration. Authorization: built-in roles (read, readWrite, dbAdmin, clusterAdmin) and custom roles with specific privilege actions. Network access: IP allowlist and VPC peering (never open 0.0.0.0/0). TLS/SSL for in-transit encryption (enabled by default on Atlas). Encryption at rest with KMS (Atlas Encryption at Rest). Audit logging for compliance. For self-hosted: always enable auth, disable the test database, bind to specific IPs, never expose MongoDB to the public internet.
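For a self-hosted deployment, a custom role and a user bound to it can be expressed as database commands. This is a sketch: the "shop" database, the role name, and the user name are hypothetical, and on Atlas users and roles are managed through Atlas itself rather than with these commands.

```javascript
// Custom role: may only run find on shop.orders.
const createReportingRole = {
  createRole: 'reportReader',
  privileges: [{ resource: { db: 'shop', collection: 'orders' }, actions: ['find'] }],
  roles: [],
};

// Builds the createUser command; the password comes from the caller,
// never from source code.
function buildCreateUser(password) {
  return {
    createUser: 'reporting',
    pwd: password,
    roles: [{ role: 'reportReader', db: 'shop' }],
  };
}
```

Both documents would be run by an administrator with db.command(...) against the "shop" database.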
Step 13 • Performance
Diagnose slow queries with the database profiler (db.setProfilingLevel(1, { slowms: 100 }) logs operations slower than 100ms to system.profile). Use explain('executionStats') to read execution plans: look for COLLSCAN stages and for totalDocsExamined or totalKeysExamined counts far higher than nReturned. Optimize: add missing indexes, rewrite aggregation pipelines ($match as early as possible), use projections to return fewer fields, avoid $where (JS in queries), prefer $in over multiple $or conditions on the same field. Use connection pooling and set maxPoolSize to cap the number of connections. Monitor with Atlas Performance Advisor, which suggests indexes automatically. Use Atlas Query Insights for long-term trend analysis.
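Reading an explain result can be partly automated. The helper below is a sketch that assumes the classic output shape (queryPlanner.winningPlan with nested inputStage or inputStages, plus executionStats); newer server versions may nest the plan differently, so treat it as a starting point.

```javascript
// Walks a plan tree and collects every stage name.
function collectStages(plan, out = []) {
  if (!plan) return out;
  if (plan.stage) out.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, out);
  (plan.inputStages || []).forEach((child) => collectStages(child, out));
  return out;
}

// Summarizes an explain('executionStats') result.
function summarizeExplain(explain) {
  const stats = explain.executionStats;
  const stages = collectStages(explain.queryPlanner.winningPlan);
  return {
    stages,
    collectionScan: stages.includes('COLLSCAN'),
    // Documents examined per document returned; close to 1 is ideal.
    docsPerResult: stats.nReturned === 0
      ? stats.totalDocsExamined
      : stats.totalDocsExamined / stats.nReturned,
  };
}
```

A docsPerResult in the hundreds usually means the index is not selective enough or is missing a field from the filter.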
Step 14 • Infrastructure
A replica set is a group of MongoDB instances that maintain the same data — one primary (accepts writes) and one or more secondaries (replicate from primary, can serve reads). Automatic failover: if the primary becomes unavailable, the remaining members elect a new primary within seconds. Write concerns control durability (w:1 — primary only, w:'majority' — waits for majority of members to acknowledge). Read preferences control where reads go (primary, primaryPreferred, secondary, nearest). Understand the oplog (operations log — the replication mechanism). Atlas clusters are always replica sets.
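Write concern and read preference are often set once in the connection string. The sketch below builds such a URI; the host names, database name, and replica set name are placeholders.

```javascript
// Builds a standard (non-SRV) replica set connection string.
function buildReplicaSetUri(hosts, dbName, options) {
  const query = Object.entries(options)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join('&');
  return `mongodb://${hosts.join(',')}/${dbName}?${query}`;
}

const uri = buildReplicaSetUri(
  ['db1.example.net:27017', 'db2.example.net:27017', 'db3.example.net:27017'],
  'app',
  {
    replicaSet: 'rs0',
    w: 'majority', // wait for a majority of members to acknowledge writes
    readPreference: 'primaryPreferred', // fall back to a secondary if no primary
    maxPoolSize: 50,
  }
);
```

Listing several members lets the driver discover the current primary even when one host is down.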
Step 15 • Advanced
Scale MongoDB horizontally with sharded clusters: mongos query routers, config server replica sets, and shard replica sets. Choose shard keys based on cardinality, write distribution, and query patterns — hashed sharding for even distribution, range-based for targeted queries. Understand chunk migration, zone sharding for data locality (GDPR), and scatter-gather query overhead.
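The two shard key strategies, sketched as admin commands. The namespaces and key fields are hypothetical; the commands would be sent through a mongos router.

```javascript
// Hashed key: spreads writes evenly, but range queries on userId
// are broadcast to every shard.
const shardEventsHashed = {
  shardCollection: 'app.events',
  key: { userId: 'hashed' },
};

// Ranged compound key: queries that filter on region (and createdAt)
// can be routed to the shards owning that range.
const shardOrdersRanged = {
  shardCollection: 'app.orders',
  key: { region: 1, createdAt: 1 },
};
```

With the Node.js driver each document would be passed to client.db('admin').command(...); mongosh offers sh.shardCollection() for the same purpose.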
Step 16 • Operations
Data loss recovery is a critical operations skill. mongodump exports a BSON snapshot (mongodump --uri <connection string> --archive --gzip), and mongorestore imports it. For replica sets, dump from a secondary to avoid loading the primary. A plain mongodump is not consistent under concurrent writes: on a replica set, add --oplog to capture writes made during the dump; on a sharded cluster it cannot produce a consistent snapshot without additional coordination. Atlas Continuous Backup offers point-in-time recovery (PITR); it is a paid option on dedicated clusters but worth knowing. Backup strategies: full nightly dump + oplog (operation log) for incremental changes. Test your recovery procedure quarterly, because an untested backup is not a backup. Verify document counts and spot-check data after restore. For large datasets, use mongodump with --numParallelCollections. Understand the role of the oplog in both replication and recovery.
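The post-restore count check can be scripted. This sketch assumes two Node.js driver Db handles (source and restored); the function name and the shape of the result are made up for the example.

```javascript
// Compares per-collection document counts between two databases and
// returns the collections whose counts differ.
async function compareCounts(sourceDb, restoredDb, collectionNames) {
  const mismatches = [];
  for (const name of collectionNames) {
    const [expected, actual] = await Promise.all([
      sourceDb.collection(name).countDocuments({}),
      restoredDb.collection(name).countDocuments({}),
    ]);
    if (expected !== actual) mismatches.push({ name, expected, actual });
  }
  return mismatches;
}
```

An empty result means counts match; it does not prove the contents are identical, so a spot-check of a few documents is still worthwhile.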
Step 17 • Production
Atlas is more than a hosted MongoDB — it's a developer data platform. Atlas Search (Lucene-powered full-text search — eliminate Elasticsearch for most use cases), Atlas Vector Search (for AI/ML embeddings), Atlas Triggers (serverless functions that react to database events — like change streams but managed), Atlas App Services (hosted backend), Charts (no-code dashboards), Atlas Data Federation (query S3 and Atlas in one query). For production: enable Atlas Backup (continuous point-in-time recovery), set up alerts, use Atlas CLI for automation, configure connection string correctly (SRV format, connection pool settings), and test disaster recovery.
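Atlas Search is queried through an aggregation stage. The pipeline below is a sketch that assumes a search index named "default" on a hypothetical "products" collection with title and description fields.

```javascript
const searchPipeline = [
  // $search must be the first stage of the pipeline.
  {
    $search: {
      index: 'default',
      text: { query: 'wireless headphones', path: ['title', 'description'] },
    },
  },
  { $limit: 10 },
  // Expose the relevance score next to the title.
  { $project: { title: 1, score: { $meta: 'searchScore' } } },
];
```

It runs like any other pipeline, for example products.aggregate(searchPipeline), but only against an Atlas cluster where the search index exists.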