// Principle: Design for query patterns, not normalized form
// Application query: "show a blog post with all comments"
// BAD (relational thinking): separate comments collection
// Requires join on every post load
// GOOD: embed comments in post (single document read)
{
  "_id": ObjectId("..."),
  "title": "MongoDB Schema Design",
  "body": "...",
  "author": "Mayur",
  "tags": ["mongodb", "schema"],
  "comments": [
    { "user": "Alice", "text": "Great post!", "date": ISODate("...") },
    { "user": "Bob", "text": "Very helpful", "date": ISODate("...") }
  ]
}
// ↑ One query fetches post + all comments — fast!
// But if comments are 10,000+, document becomes huge
// → Reference for high-volume data (see Relationships topic)
Why it matters: Schema design is the most impactful decision in a MongoDB application. A poor schema that forces joins or scans entire arrays for every query cannot be fixed with indexes alone.
Real applications: Netflix's video catalog, Airbnb's listing details, and Amazon's product pages all use embedding for data that is always shown together but never queried independently.
Common mistakes: Directly translating your SQL ER diagram to MongoDB without considering access patterns — this leads to unnecessary lookups and poor performance.
// EMBED (one-to-few, always accessed together)
// User with address (1-3 addresses)
{
  "_id": ObjectId("..."),
  "name": "Mayur",
  "addresses": [
    { "type": "home", "city": "Mumbai", "pincode": "400001" },
    { "type": "office", "city": "Pune", "pincode": "411001" }
  ]
}
// REFERENCE (one-to-many with shared or large data)
// Author with books (books queried independently, shared between authors)
// authors collection
{ "_id": ObjectId("aaa"), "name": "Martin Fowler", "bio": "..." }
// books collection
{
  "_id": ObjectId("bbb"),
  "title": "Patterns of Enterprise Application Architecture",
  "authorId": ObjectId("aaa"), // reference to author
  "isbn": "978-0321127426"
}
// Hybrid: partial embed for display + reference for detail
// Order item (embed summary, reference full product)
{
  "items": [{
    "productId": ObjectId("..."), // reference
    "name": "Laptop", // embedded for display (avoid lookup)
    "price": 75000, // price at time of purchase (snapshot)
    "qty": 1
  }]
}
Why it matters: This is the core schema design question in every MongoDB interview. The ability to articulate the trade-offs clearly shows design maturity and production experience.
Real applications: User profiles with addresses (embed), orders with product references (hybrid — embed price snapshot, reference product), many-to-many tags (reference).
Common mistakes: Embedding high-volume arrays (comments, logs) that grow without bounds — the document balloons past the 16 MB limit and every read includes the entire dataset even when showing just the post title.
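A back-of-the-envelope check makes the 16 MB ceiling concrete. The sketch below (helper names are illustrative, not a MongoDB API) estimates whether an embedded array still fits:

```javascript
// BSON documents are hard-capped at 16 MB; estimate array growth up front.
const BSON_DOC_LIMIT = 16 * 1024 * 1024;

function estimateDocBytes(baseBytes, elementCount, avgElementBytes) {
  // base document size + embedded array payload
  return baseBytes + elementCount * avgElementBytes;
}

function fitsInOneDocument(baseBytes, elementCount, avgElementBytes) {
  return estimateDocBytes(baseBytes, elementCount, avgElementBytes) < BSON_DOC_LIMIT;
}
```

With ~200-byte comments, 10,000 comments is roughly 2 MB (under the limit, but already heavy to ship on every read); 100,000 comments blows past 16 MB, so they must be referenced.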
The Subset Pattern embeds only the most frequently accessed items (e.g. the latest 10 reviews) in the main document and stores the full dataset in a separate reviews collection. This keeps the working set (data in RAM) small while still enabling fast access to the common case (show latest reviews), and full access to the complete dataset (pagination through all reviews) via a separate query.
// Problem: Product with 5000 reviews — document is huge
// Full document loaded even when you only show 10 reviews
// SUBSET PATTERN:
// products collection (main document, stays small)
{
  "_id": ObjectId("p1"),
  "name": "MacBook Pro",
  "price": 200000,
  "avgRating": 4.7,
  "reviewCount": 5000,
  "recentReviews": [ // subset: only the last 10
    { "user": "Alice", "rating": 5, "text": "Amazing!", "date": ... },
    { "user": "Bob", "rating": 4, "text": "Great value", "date": ... }
    // ... 8 more
  ]
}
// reviews collection (full dataset)
{
  "_id": ObjectId("r1"),
  "productId": ObjectId("p1"),
  "user": "Alice",
  "rating": 5,
  "text": "Amazing!",
  "date": ISODate("...")
}
// Query: product page load (one fast read, small document)
db.products.findOne({ _id: productId });
// Query: "Load all reviews" (separate query, paginated)
db.reviews.find({ productId: productId }).sort({ date: -1 }).limit(20);
// Update: add new review (update subset + insert full)
db.products.updateOne(
  { _id: productId },
  {
    // sort newest first, then keep the first 10 (a positive $slice keeps the head)
    $push: { recentReviews: { $each: [newReview], $sort: { date: -1 }, $slice: 10 } },
    $inc: { reviewCount: 1 },
    $set: { avgRating: newAvg }
  }
);
Why it matters: The Subset Pattern is a key MongoDB design pattern tested in advanced interviews. It directly addresses the working set optimization problem — keeping frequently accessed data small enough to fit in RAM.
Real applications: Product pages (latest 10 reviews), social media feeds (last 20 posts), order history (last 5 orders on profile page), message threads (last 50 messages).
Common mistakes: Not keeping the subset synchronized on write — when adding a review, you must update both the product's recentReviews array AND insert into the reviews collection.
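The synchronization logic is easier to reason about in isolation. This sketch mirrors the $push/$sort/$slice semantics in plain JavaScript (an illustrative helper, assuming numeric date fields):

```javascript
// Append a review, keep the array sorted newest-first, cap it at `limit` —
// the in-memory equivalent of $push with $each/$sort/$slice.
function pushToSubset(subset, review, limit = 10) {
  const next = [...subset, review];
  next.sort((a, b) => b.date - a.date); // like $sort: { date: -1 }
  return next.slice(0, limit); // like a positive $slice: keep the newest `limit`
}
```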
// WITHOUT bucket pattern: 1 document per measurement
// 1 reading/minute = 1,440 documents per sensor per day ≈ 525K per year!
{ sensorId: "temp001", timestamp: ISODate("2026-04-05T10:00:00"), value: 23.5 }
{ sensorId: "temp001", timestamp: ISODate("2026-04-05T10:01:00"), value: 23.7 }
// ... one document for every single reading
// WITH bucket pattern: 1 document per hour per sensor
{
  "_id": { sensorId: "temp001", hour: ISODate("2026-04-05T10:00:00") },
  "sensorId": "temp001",
  "hour": ISODate("2026-04-05T10:00:00"),
  "count": 60,
  "min": 22.1, "max": 24.3, "avg": 23.4,
  "measurements": [
    { "t": ISODate("2026-04-05T10:00:00"), "v": 23.5 },
    { "t": ISODate("2026-04-05T10:01:00"), "v": 23.7 }
    // ... 58 more readings
  ]
}
// ↑ 1,440 documents/sensor/day → 24 (one per hour)
// = 8,760 documents/sensor/year instead of ~525K!
// Inserting new measurement into existing bucket
db.sensor_data.updateOne(
  { sensorId: "temp001", hour: currentHour, count: { $lt: 60 } }, // cap = bucket capacity
  {
    $push: { measurements: { t: now, v: 23.8 } },
    $inc: { count: 1 },
    $min: { min: 23.8 },
    $max: { max: 23.8 },
    $setOnInsert: { _id: { sensorId: "temp001", hour: currentHour } } // keep compound _id on insert
  },
  { upsert: true } // create the hour's bucket on its first measurement
);
Why it matters: The Bucket Pattern is critical for time-series and IoT applications. It's frequently asked in interviews for data-intensive system design roles involving high write throughput.
Real applications: IoT sensor networks, stock market tick data, server metrics collection (CPU, memory per minute), application performance monitoring.
Common mistakes: Fixed bucket sizes with no overflow handling — if a bucket fills up, the upsert logic must handle creating a new bucket atomically.
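Bucket boundaries come from truncating the measurement timestamp. A minimal sketch (the helper name is illustrative):

```javascript
// Truncate a timestamp to the start of its hour — this value becomes the
// bucket key used in the upsert filter ({ sensorId, hour: hourBucket(now) }).
function hourBucket(ts) {
  const HOUR_MS = 60 * 60 * 1000;
  return new Date(Math.floor(ts.getTime() / HOUR_MS) * HOUR_MS);
}
```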
The Computed Pattern pre-computes frequently read values (counts, averages, totals) at write time, maintaining them with $inc, $push + $set, or via a background job for batch recomputation. This pattern avoids expensive aggregation queries on hot read paths.
// WITHOUT computed pattern: expensive aggregation per request
// GET /posts/:id/stats
db.comments.aggregate([
  { $match: { postId: postId } },
  { $group: { _id: null, count: { $sum: 1 }, avgRating: { $avg: "$rating" } } }
]);
// Runs on every request — slow, resource-heavy!
// WITH computed pattern: pre-computed in post document
{
  "_id": ObjectId("..."),
  "title": "MongoDB Schema Design",
  "body": "...",
  // Pre-computed values — updated on every comment add/remove
  "commentCount": 245,
  "avgRating": 4.7,
  "viewCount": 15234,
  "likeCount": 892
}
// When a comment is added — update stored values atomically
db.posts.updateOne(
  { _id: postId },
  {
    $inc: { commentCount: 1 },
    $set: { avgRating: newAvg } // recomputed in application
  }
);
// Also useful: store frequently used aggregation results
// Daily stats document (computed nightly by batch job)
{
  "_id": "2026-04-05",
  "totalOrders": 1250,
  "totalRevenue": 4750000,
  "newUsers": 320,
  "topProduct": "MacBook Pro",
  "computedAt": ISODate("2026-04-05T23:59:00")
}
Why it matters: The Computed Pattern is essential for high-read systems like dashboards and social media. It demonstrates awareness of the classic read vs write trade-off in database optimization.
Real applications: Reddit/YouTube view counts (computed on write), e-commerce product ratings (updated on each review), social media follower counts (updated on follow/unfollow events).
Common mistakes: Letting computed values drift out of sync — always update the computed field in the same atomic operation as the data change, not in a separate query.
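The newAvg in the update above has to be derived somewhere. One option is the running-average formula, shown here as a sketch (floating-point drift over many updates is one reason to also recompute in a periodic batch job):

```javascript
// Fold one new rating into a stored average without re-aggregating:
// newAvg = (avg * count + rating) / (count + 1)
function nextAverage(avg, count, newRating) {
  return (avg * count + newRating) / (count + 1);
}
```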
// Single "vehicles" collection with polymorphic documents
// Shared fields: type, make, year, price
// Type-specific fields vary
// Car document
{
  "_id": ObjectId("..."),
  "type": "car", // discriminator field
  "make": "Toyota",
  "model": "Camry",
  "year": 2024,
  "price": 2500000,
  "doors": 4, // car-specific
  "transmission": "automatic",
  "fuelType": "petrol"
}
// Motorcycle document (same collection, different shape)
{
  "_id": ObjectId("..."),
  "type": "motorcycle",
  "make": "Royal Enfield",
  "model": "Classic 350",
  "year": 2024,
  "price": 190000,
  "engineCC": 349, // motorcycle-specific
  "seatHeight": 805
}
// Electric vehicle (additional type)
{
  "_id": ObjectId("..."),
  "type": "electric",
  "make": "Tata",
  "model": "Nexon EV",
  "year": 2024,
  "price": 1450000,
  "batteryKWh": 40.5, // EV-specific
  "rangeKm": 465,
  "chargingTime": 8.6
}
// Query across all types (polymorphic search)
db.vehicles.find({ year: 2024, price: { $lt: 2000000 } });
// Type-specific query
db.vehicles.find({ type: "electric", rangeKm: { $gte: 300 } });
// Index on common fields + discriminator
db.vehicles.createIndex({ type: 1, price: 1, year: 1 });
Why it matters: Tests understanding of MongoDB's schema flexibility advantage. Knowing this pattern shows you can design elegant, maintainable schemas for complex domain models.
Real applications: Product catalogs (electronics, clothing, food), notification systems (email, SMS, push, in-app), social content feeds (post, photo, video, story).
Common mistakes: Creating separate collections for each type when documents share 80% of their fields — this forces multiple queries to build a unified feed/list.
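On the read side, application code typically branches on the discriminator field. A hypothetical display helper over the documents above:

```javascript
// Render a one-line summary per vehicle type; unknown types fall back
// to the shared fields only.
function vehicleSummary(doc) {
  switch (doc.type) {
    case "car":
      return `${doc.make} ${doc.model} (${doc.doors} doors)`;
    case "motorcycle":
      return `${doc.make} ${doc.model} (${doc.engineCC} cc)`;
    case "electric":
      return `${doc.make} ${doc.model} (${doc.rangeKm} km range)`;
    default:
      return `${doc.make} ${doc.model}`;
  }
}
```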
// Problem: Every order display requires a user lookup
// orders collection with just userId
{ "_id": ObjectId("ord1"), "userId": ObjectId("u1"), "items": [...], "total": 5000 }
// Every list view: JOIN with users to get user.name, user.email
// EXTENDED REFERENCE PATTERN — embed stable user fields
{
  "_id": ObjectId("ord1"),
  "userId": ObjectId("u1"), // reference for full lookup
  "user": { // extended reference (embedded subset)
    "name": "Mayur Badodiya", // rarely changes
    "email": "mayur@example.com" // rarely changes
    // NOT: address (changes often), password, etc.
  },
  "items": [
    {
      "productId": ObjectId("p1"), // reference
      "name": "Laptop", // embedded (product name stable)
      "price": 75000, // MUST snapshot at order time!
      "qty": 1
    }
  ],
  "total": 75000,
  "status": "pending",
  "createdAt": ISODate("...")
}
// Order list query — no JOIN needed!
db.orders.find({ "user.email": "mayur@example.com" }).sort({ createdAt: -1 });
// When user changes name — update existing orders? (usually NO for historical accuracy)
// But for active displays, update with scheduled job
db.orders.updateMany(
  { userId: userId, status: { $in: ["pending", "processing"] } },
  { $set: { "user.name": newName } }
);
Why it matters: Extended Reference is one of the most practical MongoDB patterns. It demonstrates understanding of the read/write trade-off and data freshness vs performance.
Real applications: Order display with buyer name, social feed with poster's profile picture and username, audit logs with actor's name and role at time of action.
Common mistakes: Embedding fields that change frequently (like user's address) — these must be updated in all referencing documents whenever they change, creating write fan-out.
// Problem: Movies collection — most movies have 100-500 reviews
// But "Avatar" has 50,000+ reviews — breaks embedding
// OUTLIER PATTERN:
// Normal movie document (embedded reviews up to limit)
{
  "_id": ObjectId("m1"),
  "title": "Inception",
  "reviews": [
    { "user": "Alice", "rating": 5, "text": "Mind-blowing!" }
    // ... up to 100 reviews embedded
  ],
  "has_extras": false // no overflow
}
// Outlier movie document (flag = true)
{
  "_id": ObjectId("m2"),
  "title": "Avatar",
  "reviews": [
    // only the first 100 reviews embedded
  ],
  "has_extras": true, // overflow flag!
  "reviewCount": 52000
}
// Overflow collection for the outlier's excess reviews
{
  "_id": ObjectId("overflow1"),
  "movieId": ObjectId("m2"),
  "reviews": [
    // reviews 101 onward stored here
  ]
}
// Application logic
const movie = await db.movies.findOne({ _id: movieId });
if (movie.has_extras) {
  const overflow = await db.movie_overflows.find({ movieId }).toArray();
  // merge overflow.reviews into the displayed list
}
// 99% of movies: single fast document read
// 1% outliers: extra query to overflow collection
Why it matters: Knowledge of the Outlier Pattern shows understanding of designing for the common case without penalizing the majority for edge cases. This is expert-level MongoDB design.
Real applications: Viral posts with millions of comments, celebrity profiles with millions of followers stored in embedded arrays, high-transaction accounts in banking.
Common mistakes: Designing the entire schema to accommodate the outlier case — this makes ALL documents larger and slower just to handle 0.1% of edge cases.
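The merge step in the application logic above can be factored out. A sketch (field names follow the example; the helper itself is illustrative):

```javascript
// Combine the embedded subset with any overflow documents for display.
function allReviews(movie, overflowDocs) {
  if (!movie.has_extras) return movie.reviews; // common case: no extra query needed
  return movie.reviews.concat(...overflowDocs.map((d) => d.reviews));
}
```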
MongoDB schemas evolve without downtime by versioning documents, handling old and new shapes in application code, and migrating in the background with updateMany() with pipeline.
// STRATEGY 1: Add schema version field
{ _id: 1, name: "Mayur", schema_version: 1 } // old
{ _id: 2, name: "Alice", schema_version: 2 } // new with address embedded
// Application handles both versions
function normalizeUser(doc) {
  if (doc.schema_version < 2) {
    doc.address = {}; // provide a safe default for pre-v2 docs
  }
  return doc;
}
// STRATEGY 2: Batch migration with updateMany + pipeline
// Add 'isActive' field with default 'true' to all docs missing it
db.users.updateMany(
  { isActive: { $exists: false } }, // find old docs
  [{ $set: { isActive: true } }] // add the field via an update pipeline
);
// STRATEGY 3: Rename a field safely
db.users.updateMany(
  { fullname: { $exists: true } },
  [{ $set: { name: "$fullname" } }, { $unset: "fullname" }] // one operator per pipeline stage
);
// STRATEGY 4: Change field type (string phone → object)
db.users.updateMany(
  { phone: { $type: "string" } },
  [{
    $set: {
      phone: {
        number: "$phone",
        country: "+91",
        type: "mobile"
      }
    }
  }]
);
// Monitor migration progress
db.users.countDocuments({ isActive: { $exists: false } }); // remaining
Why it matters: Schema migrations in NoSQL are often misunderstood. Interviewers expect nuanced answers about zero-downtime migration strategies and handling both old and new schema formats simultaneously.
Real applications: Adding a new required field to a user collection (100M docs), changing phone storage from string to object, migrating from embedded addresses to a separate addresses collection.
Common mistakes: Running a mass migration in one shot without batching — a single updateMany() on 100M documents locks the collection and degrades production performance for hours.
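Batching boils down to slicing the key space into bounded chunks. A minimal sketch (pure helper; each chunk would drive one updateMany with a filter like { _id: { $in: chunk } }, ideally with a short pause between batches):

```javascript
// Yield fixed-size slices of a list of _ids so each migration step
// touches a bounded number of documents.
function* chunks(ids, size) {
  for (let i = 0; i < ids.length; i += size) {
    yield ids.slice(i, i + size);
  }
}
```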
// APPROACH 1: Current + History Collections
// users collection (current state only)
{
  "_id": ObjectId("u1"),
  "name": "Mayur",
  "email": "mayur@example.com",
  "role": "admin",
  "version": 3,
  "updatedAt": ISODate("2026-04-05")
}
// users_history collection (all previous versions)
{
  "_id": ObjectId("uh1"),
  "docId": ObjectId("u1"), // reference to the current doc
  "version": 2, // version number
  "name": "Mayur Badodiya", // old state
  "email": "old@email.com",
  "role": "user",
  "archivedAt": ISODate("2026-03-01") // when this became historical
}
// Atomic update: history + current (in a transaction)
const session = client.startSession();
await session.withTransaction(async () => {
  // Read the current version inside the transaction (note the session)
  const { _id, ...current } = await db.users.findOne({ _id: userId }, { session });
  // Archive it to history; _id is dropped so the history doc gets its own
  await db.users_history.insertOne({ docId: userId, ...current, archivedAt: new Date() }, { session });
  // Update current with new data
  await db.users.updateOne(
    { _id: userId },
    { $set: { role: "admin", updatedAt: new Date() }, $inc: { version: 1 } },
    { session }
  );
});
// Query: get version history for a user
db.users_history.find({ docId: userId }).sort({ version: -1 });
// Rollback: restore version 2 (strip history-only fields first)
const { _id, docId, archivedAt, ...v2 } = await db.users_history.findOne({ docId: userId, version: 2 });
await db.users.replaceOne({ _id: userId }, { ...v2, version: latestVersion + 1 });
Why it matters: Audit trails are required by regulations (GDPR, SOX, HIPAA). Explaining this pattern demonstrates compliance awareness and advanced transaction usage.
Real applications: Medical records requiring audit trail, financial contract amendments, CMS content versioning (draft → published → archived).
Common mistakes: Not using transactions when archiving to history — a server crash between archiving old and writing new leaves data in an inconsistent state.
Unbounded embedded arrays cause five problems: (1) documents grow toward the 16 MB BSON limit, (2) every read fetches the entire array even when only a few elements are needed, (3) array queries ($elemMatch) slow down as array size grows, (4) write operations that push to arrays require rewriting the entire document, (5) working set bloat reducing effective cache. The fix is to reference instead of embed when arrays can grow unboundedly, or apply the Subset or Bucket patterns.
// ❌ ANTI-PATTERN: Massive embedded array
{
  "_id": ObjectId("u1"),
  "username": "mayur",
  "followers": [ // could have 1M+ entries for a celebrity!
    ObjectId("u2"),
    ObjectId("u3")
    // ... 999,998 more ObjectIds
  ]
}
// Problems:
// 1. Document size: 1M ObjectIds × 12 bytes = 12 MB!
// 2. Every profile load fetches 12 MB even to show just follower count
// 3. $push on every follow action rewrites 12 MB document
// ✅ SOLUTION: Separate relationship collection
// users collection (clean, small)
{ "_id": ObjectId("u1"), "username": "mayur", "followerCount": 1000000 }
// follows collection (the relationship)
{
  "_id": ObjectId("..."),
  "followerId": ObjectId("u2"), // the follower
  "followeeId": ObjectId("u1"), // who is followed
  "createdAt": ISODate("...")
}
// Compound index: db.follows.createIndex({ followeeId: 1, followerId: 1 })
// Compound index: db.follows.createIndex({ followerId: 1, followeeId: 1 })
// Efficient queries
db.follows.countDocuments({ followeeId: userId }); // follower count (or read the denormalized followerCount)
db.follows.findOne({ followerId: myId, followeeId: targetId }); // am I following?
db.follows.find({ followerId: myId }).sort({ createdAt: -1 }).limit(20); // my follows list
Why it matters: Massive arrays are the #1 MongoDB performance killer in production. Identifying and fixing this anti-pattern demonstrates production schema design experience.
Real applications: Twitter follower graphs, YouTube subscriber lists, Facebook friend networks — all use separate relationship tables/collections, not embedded arrays.
Common mistakes: Not anticipating array growth during initial design. Early-stage apps with 100 users seem fine with embedded arrays, then fail catastrophically when users reach 10K+ followers.
// 1. PARENT REFERENCE (most common)
{ _id: "mongodb", name: "MongoDB", parent: "databases" }
{ _id: "databases", name: "Databases", parent: "technology" }
{ _id: "technology", name: "Technology", parent: null }
// Find children of "databases"
db.categories.find({ parent: "databases" });
// Find ancestors (requires multiple queries or $graphLookup)
db.categories.aggregate([
  { $match: { _id: "mongodb" } }, // start from a single node
  {
    $graphLookup: {
      from: "categories", startWith: "$parent",
      connectFromField: "parent", connectToField: "_id",
      as: "ancestors"
    }
  }
]);
// 2. ARRAY OF ANCESTORS (best for breadcrumb/path)
{ _id: "mongodb", name: "MongoDB", ancestors: ["technology", "databases"] }
// Find all descendants of "databases" in one query:
db.categories.find({ ancestors: "databases" });
// 3. MATERIALIZED PATH (subtree queries with regex)
{ _id: "mongodb", name: "MongoDB", path: ",technology,databases,mongodb," }
// Find all under "databases" (unanchored regex — cannot use an index, scans):
db.categories.find({ path: /,databases,/ });
// A left-anchored prefix regex CAN use the index:
db.categories.find({ path: /^,technology,databases,/ });
db.categories.createIndex({ path: 1 });
// 4. NESTED SETS (best read, worst write performance)
{ _id: "mongodb", name: "MongoDB", left: 7, right: 8 }
// Find all descendants:
db.categories.find({ left: { $gt: parentLeft }, right: { $lt: parentRight } });
Why it matters: Hierarchical data is common (categories, org charts, menus, file systems). Interviewers assess ability to choose the right tree storage strategy for specific query patterns.
Real applications: E-commerce category trees, organizational charts, file system hierarchies, comment thread nesting, menu structures.
Common mistakes: Using Parent Reference for deep subtree fetches without $graphLookup — requires N recursive queries for N levels, extremely slow for deep trees.
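What $graphLookup does for descendants can be pictured as a breadth-first walk over Parent Reference documents. A client-side sketch (assumes an acyclic tree; helper name is illustrative):

```javascript
// Collect all descendants of rootId by repeatedly expanding the frontier
// of documents whose `parent` is in the current frontier.
function descendants(docs, rootId) {
  const result = [];
  let frontier = [rootId];
  while (frontier.length > 0) {
    const children = docs.filter((d) => frontier.includes(d.parent));
    result.push(...children);
    frontier = children.map((d) => d._id);
  }
  return result;
}
```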
Every index speeds reads but costs write throughput and RAM, so audit actual usage with $indexStats, and drop unused indexes.
// Find unused indexes (zero or low usage since last restart)
db.orders.aggregate([{ $indexStats: {} }]);
// Look for: "accesses.ops": 0 or very low number
// ❌ ANTI-PATTERN: Index everything
db.users.createIndex({ name: 1 });
db.users.createIndex({ age: 1 });
db.users.createIndex({ gender: 1 });
db.users.createIndex({ isActive: 1 });
db.users.createIndex({ createdAt: 1 });
db.users.createIndex({ country: 1 });
// 6 secondary indexes (plus _id) on a write-heavy collection — massive write overhead!
// ✅ GOOD: Index based on query patterns
// If queries are: find by email, list by createdAt, filter active users
db.users.createIndex({ email: 1 }, { unique: true }); // login
db.users.createIndex({ createdAt: -1 }); // listing
db.users.createIndex({ isActive: 1, createdAt: -1 }, {
partialFilterExpression: { isActive: true } // only active users
});
// Audit and clean up unused indexes
// 1. Enable profiling to capture slow queries
db.setProfilingLevel(1, { slowms: 100 });
// 2. Review system.profile collection
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 });
// 3. Drop confirmed unused indexes
db.users.dropIndex({ gender: 1 });
Why it matters: Understanding index management is a sign of production experience. Interviewers ask this to confirm you know that indexes have costs, not just benefits.
Real applications: A high-throughput write application (10K inserts/second) with 10 unnecessary indexes degrades to 2K inserts/second due to index maintenance overhead.
Common mistakes: Creating indexes during initial development "to be safe" and never auditing them — production systems accumulate index bloat over time, degrading write performance progressively.