Schema Design

Top 15 MongoDB schema design interview questions. Master embedding vs referencing, bucket pattern, computed pattern, schema versions and design anti-patterns.

MongoDB

13 Questions

MongoDB schema design is fundamentally different from relational database design. The core principle is: "Design your schema based on how your application queries data", not based on relationships between entities. Key principles: (1) Embed related data that is always accessed together, (2) Reference (normalize) data that is accessed independently or shared across many documents, (3) Avoid application-level joins — MongoDB is optimized for single-document reads, (4) Optimize for the most common read queries, (5) Prefer richer documents over many small ones. The trade-off is always between read performance vs write complexity and data consistency vs query efficiency.

// Principle: Design for query patterns, not normalized form

// Application query: "show a blog post with all comments"
// BAD (relational thinking): separate comments collection
// Requires join on every post load

// GOOD: embed comments in post (single document read)
{
  "_id": ObjectId("..."),
  "title": "MongoDB Schema Design",
  "body": "...",
  "author": "Mayur",
  "tags": ["mongodb", "schema"],
  "comments": [
    { "user": "Alice", "text": "Great post!", "date": ISODate("...") },
    { "user": "Bob",   "text": "Very helpful", "date": ISODate("...") }
  ]
}
// ↑ One query fetches post + all comments — fast!

// But if comments are 10,000+, document becomes huge
// → Reference for high-volume data (see Relationships topic)

Why it matters: Schema design is the most impactful decision in a MongoDB application. A poor schema that forces joins or scans entire arrays for every query cannot be fixed with indexes alone.

Real applications: Netflix's video catalog, Airbnb's listing details, and Amazon's product pages all use embedding for data that is always shown together but never queried independently.

Common mistakes: Directly translating your SQL ER diagram to MongoDB without considering access patterns — this leads to unnecessary lookups and poor performance.

The Embedding vs Referencing decision is the most fundamental MongoDB schema design question. Embed when: (1) data is accessed together, (2) child data only makes sense in context of parent, (3) one-to-few relationship (< 100 items), (4) data doesn't need to be accessed independently, (5) atomic updates are needed. Reference when: (1) data is large or growing unboundedly, (2) data is shared across many parents, (3) data is queried independently, (4) many-to-many relationships, (5) data exceeds 16 MB risk. The 16 MB document limit forces referencing for unbounded growing arrays.

// EMBED (one-to-few, always accessed together)
// User with address (1-3 addresses)
{
  "_id": ObjectId("..."),
  "name": "Mayur",
  "addresses": [
    { "type": "home", "city": "Mumbai", "pincode": "400001" },
    { "type": "office", "city": "Pune",  "pincode": "411001" }
  ]
}

// REFERENCE (one-to-many with shared or large data)
// Author with books (books queried independently, shared between authors)
// authors collection
{ "_id": ObjectId("aaa"), "name": "Martin Fowler", "bio": "..." }

// books collection
{
  "_id": ObjectId("bbb"),
  "title": "Patterns of Enterprise Application Architecture",
  "authorId": ObjectId("aaa"),  // reference to author
  "isbn": "978-0321127426"
}

// Hybrid: partial embed for display + reference for detail
// Order item (embed summary, reference full product)
{
  "items": [{
    "productId": ObjectId("..."), // reference
    "name": "Laptop",             // embedded for display (avoid lookup)
    "price": 75000,               // price at time of purchase (snapshot)
    "qty": 1
  }]
}

Why it matters: This is the core schema design question in every MongoDB interview. The ability to articulate the trade-offs clearly shows design maturity and production experience.

Real applications: User profiles with addresses (embed), orders with product references (hybrid — embed price snapshot, reference product), many-to-many tags (reference).

Common mistakes: Embedding high-volume arrays (comments, logs) that grow without bounds — the document balloons past the 16 MB limit and every read includes the entire dataset even when showing just the post title.

The Subset Pattern solves the problem of documents with large arrays by storing only the most frequently accessed subset of data in the main document and keeping the full dataset in a separate collection. For example, a product page shows only the last 10 reviews — not all 5,000. Store the last 10 reviews in the product document and all reviews in a separate reviews collection. This keeps the working set (data in RAM) small while still enabling fast access to the common case (show latest reviews), and full access to the complete dataset (pagination through all reviews) via a separate query.

// Problem: Product with 5000 reviews — document is huge
// Full document loaded even when you only show 10 reviews

// SUBSET PATTERN:
// products collection (main document, stays small)
{
  "_id": ObjectId("p1"),
  "name": "MacBook Pro",
  "price": 200000,
  "avgRating": 4.7,
  "reviewCount": 5000,
  "recentReviews": [             // subset: only last 10
    { "user": "Alice", "rating": 5, "text": "Amazing!", "date": ... },
    { "user": "Bob",   "rating": 4, "text": "Great value", "date": ... }
    // ... 8 more
  ]
}

// reviews collection (full dataset)
{
  "_id": ObjectId("r1"),
  "productId": ObjectId("p1"),
  "user": "Alice",
  "rating": 5,
  "text": "Amazing!",
  "date": ISODate("...")
}

// Query: product page load (one fast read, small document)
db.products.findOne({ _id: productId });

// Query: "Load all reviews" (separate query, paginated)
db.reviews.find({ productId: productId }).sort({ date: -1 }).limit(20);

// Update: add new review (update subset + insert full)
db.products.updateOne(
  { _id: productId },
  {
    $push: { recentReviews: { $each: [newReview], $slice: -10, $sort: { date: -1 } } },
    $inc: { reviewCount: 1 },
    $set: { avgRating: newAvg }
  }
);

Why it matters: The Subset Pattern is a key MongoDB design pattern tested in advanced interviews. It directly addresses the working set optimization problem — keeping frequently accessed data small enough to fit in RAM.

Real applications: Product pages (latest 10 reviews), social media feeds (last 20 posts), order history (last 5 orders on profile page), message threads (last 50 messages).

Common mistakes: Not keeping the subset synchronized on write — when adding a review, you must update both the product's recentReviews array AND insert into the reviews collection.

The Bucket Pattern groups related data (typically time-series measurements) into bucket documents rather than creating one document per measurement. Instead of millions of individual sensor reading documents, group readings by time window (hour, day) or count (100 readings per bucket) into fewer, larger documents. This dramatically reduces the number of documents (and index entries) while keeping related data together. Benefits: fewer documents to index, better compression, reduced index overhead, and more efficient range scans over time windows. Used extensively for IoT, metrics, financial ticks, and logging.

// WITHOUT bucket pattern: 1 document per measurement
// 1M readings per sensor per day = 365M documents per year!
{ sensorId: "temp001", timestamp: ISODate("2026-04-05T10:00:00"), value: 23.5 }
{ sensorId: "temp001", timestamp: ISODate("2026-04-05T10:01:00"), value: 23.7 }
// ... 1M more documents

// WITH bucket pattern: 1 document per hour per sensor
{
  "_id": { sensorId: "temp001", hour: ISODate("2026-04-05T10:00:00") },
  "sensorId": "temp001",
  "hour": ISODate("2026-04-05T10:00:00"),
  "count": 60,
  "min": 22.1, "max": 24.3, "avg": 23.4,
  "measurements": [
    { "t": ISODate("2026-04-05T10:00:00"), "v": 23.5 },
    { "t": ISODate("2026-04-05T10:01:00"), "v": 23.7 },
    // ... 58 more readings
  ]
}
// ↑ 1M → 24 documents/sensor/day (one per hour)
// = 8,760 documents/sensor/year!

// Inserting new measurement into existing bucket
db.sensor_data.updateOne(
  { sensorId: "temp001", hour: currentHour, count: { $lt: 200 } },
  {
    $push: { measurements: { t: now, v: 23.8 } },
    $inc: { count: 1 },
    $min: { min: 23.8 },
    $max: { max: 23.8 }
  },
  { upsert: true }  // create new bucket if current is full
);

Why it matters: The Bucket Pattern is critical for time-series and IoT applications. It's frequently asked in interviews for data-intensive system design roles involving high write throughput.

Real applications: IoT sensor networks, stock market tick data, server metrics collection (CPU, memory per minute), application performance monitoring.

Common mistakes: Fixed bucket sizes with no overflow handling — if a bucket fills up, the upsert logic must handle creating a new bucket atomically.

The Computed Pattern pre-computes and stores the result of expensive calculations in the document itself, updating the computed value whenever the source data changes. This trades write complexity for read performance — instead of computing totals/averages on every read, compute them once on write and store. Common computations: total order amount, average rating, follower count, comment count, cumulative revenue. The computation can happen atomically on update using $inc, $push + $set, or via a background job for batch recomputation. This pattern avoids expensive aggregation queries on hot read paths.

// WITHOUT computed pattern: expensive aggregation per request
// GET /posts/:id/stats
db.comments.aggregate([
  { $match: { postId: postId } },
  { $group: { _id: null, count: { $sum: 1 }, avgRating: { $avg: "$rating" } } }
]);
// Runs on every request — slow, resource-heavy!

// WITH computed pattern: pre-computed in post document
{
  "_id": ObjectId("..."),
  "title": "MongoDB Schema Design",
  "body": "...",
  // Pre-computed values — updated on every comment add/remove
  "commentCount": 245,
  "avgRating": 4.7,
  "viewCount": 15234,
  "likeCount": 892
}

// When a comment is added — update stored values atomically
db.posts.updateOne(
  { _id: postId },
  {
    $inc: { commentCount: 1 },
    $set: { avgRating: newAvg }  // recomputed in application
  }
);

// Also useful: store frequently used aggregation results
// Daily stats document (computed nightly by batch job)
{
  "_id": "2026-04-05",
  "totalOrders": 1250,
  "totalRevenue": 4750000,
  "newUsers": 320,
  "topProduct": "MacBook Pro",
  "computedAt": ISODate("2026-04-05T23:59:00")
}

Why it matters: The Computed Pattern is essential for high-read systems like dashboards and social media. It demonstrates awareness of the classic read vs write trade-off in database optimization.

Real applications: Reddit/YouTube view counts (computed on write), e-commerce product ratings (updated on each review), social media follower counts (updated on follow/unfollow events).

Common mistakes: Letting computed values drift out of sync — always update the computed field in the same atomic operation as the data change, not in a separate query.

The Polymorphic Pattern stores different types of documents with varying schemas in a single collection, using a type discriminator field to distinguish them. This is natural in MongoDB due to its flexible schema and is used when objects share common attributes but have type-specific additional fields. MongoDB's schema-less nature makes this simple — no special handling needed at the database layer. Use this pattern for: inheritance hierarchies in OOP, multi-type search results, mixed content feeds, and product catalogs with varied attributes.

// Single "vehicles" collection with polymorphic documents
// Shared fields: type, make, year, price
// Type-specific fields vary

// Car document
{
  "_id": ObjectId("..."),
  "type": "car",          // discriminator field
  "make": "Toyota",
  "model": "Camry",
  "year": 2024,
  "price": 2500000,
  "doors": 4,             // car-specific
  "transmission": "automatic",
  "fuelType": "petrol"
}

// Motorcycle document (same collection, different shape)
{
  "_id": ObjectId("..."),
  "type": "motorcycle",
  "make": "Royal Enfield",
  "model": "Classic 350",
  "year": 2024,
  "price": 190000,
  "engineCC": 349,        // motorcycle-specific
  "seatHeight": 805
}

// Electric vehicle (additional type)
{
  "_id": ObjectId("..."),
  "type": "electric",
  "make": "Tata",
  "model": "Nexon EV",
  "year": 2024,
  "price": 1450000,
  "batteryKWh": 40.5,     // EV-specific
  "rangeKm": 465,
  "chargingTime": 8.6
}

// Query across all types (polymorphic search)
db.vehicles.find({ year: 2024, price: { $lt: 2000000 } });

// Type-specific query
db.vehicles.find({ type: "electric", rangeKm: { $gte: 300 } });

// Index on common fields + discriminator
db.vehicles.createIndex({ type: 1, price: 1, year: 1 });

Why it matters: Tests understanding of MongoDB's schema flexibility advantage. Knowing this pattern shows you can design elegant, maintainable schemas for complex domain models.

Real applications: Product catalogs (electronics, clothing, food), notification systems (email, SMS, push, in-app), social content feeds (post, photo, video, story).

Common mistakes: Creating separate collections for each type when documents share 80% of their fields — this forces multiple queries to build a unified feed/list.

The Extended Reference Pattern embeds a subset of frequently needed fields from a referenced document directly into the referencing document, creating a denormalized copy of the most-accessed data. This avoids the lookup/join for common display scenarios while maintaining the full reference for when complete data is needed. The key trade-off: the embedded subset may become stale if the source document is updated — you must decide which fields change infrequently enough to be safely embedded.

// Problem: Every order display requires a user lookup
// orders collection with just userId
{ "_id": ord1, "userId": ObjectId("u1"), "items": [...], "total": 5000 }
// Every list view: JOIN with users to get user.name, user.email

// EXTENDED REFERENCE PATTERN — embed stable user fields
{
  "_id": ObjectId("ord1"),
  "userId": ObjectId("u1"),        // reference for full lookup
  "user": {                        // extended reference (embedded subset)
    "name": "Mayur Badodiya",      // rarely changes
    "email": "mayur@example.com"   // rarely changes
    // NOT: address (changes often), password, etc.
  },
  "items": [
    {
      "productId": ObjectId("p1"), // reference
      "name": "Laptop",            // embedded (product name stable)
      "price": 75000,              // MUST snapshot at order time!
      "qty": 1
    }
  ],
  "total": 75000,
  "status": "pending",
  "createdAt": ISODate("...")
}

// Order list query — no JOIN needed!
db.orders.find({ "user.email": "mayur@example.com" }).sort({ createdAt: -1 });

// When user changes name — update existing orders? (usually NO for historical accuracy)
// But for active displays, update with scheduled job
db.orders.updateMany(
  { "userId": userId, status: { $in: ["pending","processing"] } },
  { $set: { "user.name": newName } }
);

Why it matters: Extended Reference is one of the most practical MongoDB patterns. It demonstrates understanding of the read/write trade-off and data freshness vs performance.

Real applications: Order display with buyer name, social feed with poster's profile picture and username, audit logs with actor's name and role at time of action.

Common mistakes: Embedding fields that change frequently (like user's address) — these must be updated in all referencing documents whenever they change, creating write fan-out.

The Outlier Pattern handles documents that are statistical exceptions — they have field values (especially arrays) much larger than typical documents. Instead of designing the schema for the outlier case that affects performance for the majority, you add an overflow flag and store excess data in separate overflow documents. The majority of documents stay small and fast. Only outliers require an additional query to load overflow data.

// Problem: Movies collection — most movies have 100-500 reviews
// But "Avatar" has 50,000+ reviews — breaks embedding

// OUTLIER PATTERN:
// Normal movie document (embedded reviews up to limit)
{
  "_id": ObjectId("m1"),
  "title": "Inception",
  "reviews": [
    { "user": "Alice", "rating": 5, "text": "Mind-blowing!" },
    // ... up to 100 reviews embedded
  ],
  "has_extras": false  // no overflow
}

// Outlier movie document (flag = true)
{
  "_id": ObjectId("m2"),
  "title": "Avatar",
  "reviews": [
    // only first 100 reviews embedded
  ],
  "has_extras": true,       // overflow flag!
  "reviewCount": 52000
}

// Overflow collection for outlier's excess reviews
{
  "_id": ObjectId("overflow1"),
  "movieId": ObjectId("m2"),
  "reviews": [
    // reviews 101-50000 stored here
  ]
}

// Application logic
const movie = await db.movies.findOne({ _id: movieId });
if (movie.has_extras) {
  const overflow = await db.movie_overflows.find({ movieId }).toArray();
  // merge overflow.reviews into display
}

// 99% of movies: single fast document read
// 1% outliers: extra query to overflow collection

Why it matters: Knowledge of the Outlier Pattern shows understanding of designing for the common case without penalizing the majority for edge cases. This is expert-level MongoDB design.

Real applications: Viral posts with millions of comments, celebrity profiles with millions of followers stored in embedded arrays, high-transaction accounts in banking.

Common mistakes: Designing the entire schema to accommodate the outlier case — this makes ALL documents larger and slower just to handle 0.1% of edge cases.

MongoDB's flexible schema means you can add fields to new documents without immediately updating old ones — this is called schema versioning. However, systematic schema migrations are still needed when: old documents lack required fields, a field type needs to change, or a field is renamed. Strategies: (1) Lazy migration — update documents on read access, (2) Batch migration — background job updates all documents, (3) Schema versioning — add a version field and handle each version in application code, (4) In-place migration via updateMany() with pipeline.

// STRATEGY 1: Add schema version field
{ _id: 1, name: "Mayur", schema_version: 1 } // old
{ _id: 2, name: "Alice", schema_version: 2 } // new with address embedded

// Application handles both versions
function normalizeUser(doc) {
  if (doc.schema_version < 2) {
    doc.address = {}; // provide safe default
  }
  return doc;
}

// STRATEGY 2: Batch migration with updateMany + pipeline
// Add 'isActive' field with default 'true' to all docs missing it
db.users.updateMany(
  { isActive: { $exists: false } },     // find old docs
  [{ $set: { isActive: true } }]         // add field
);

// STRATEGY 3: Rename a field safely
db.users.updateMany(
  { "fullname": { $exists: true } },
  [{ $set: { name: "$fullname" }, $unset: "fullname" }]
);

// STRATEGY 4: Change field type (string phone → object)
db.users.updateMany(
  { phone: { $type: "string" } },
  [{
    $set: {
      phone: {
        number: "$phone",
        country: "+91",
        type: "mobile"
      }
    }
  }]
);

// Monitor migration progress
db.users.countDocuments({ isActive: { $exists: false } }); // remaining

Why it matters: Schema migrations in NoSQL are often misunderstood. Interviewers expect nuanced answers about zero-downtime migration strategies and handling both old and new schema formats simultaneously.

Real applications: Adding a new required field to a user collection (100M docs), changing phone storage from string to object, migrating from embedded addresses to a separate addresses collection.

Common mistakes: Running a mass migration in one shot without batching — a single updateMany() on 100M documents locks the collection and degrades production performance for hours.

The Document Versioning Pattern maintains a history of document changes over time, enabling audit trails, rollbacks, and point-in-time data access. Two main approaches: (1) Current + History split — keep the latest document in the main collection and all previous versions in a history collection, (2) Embed version array — store version history inside the document itself (only viable for limited history). This pattern is critical for compliance, audit logging, and rollback capabilities. MongoDB transactions help ensure atomic updates across both collections.

// APPROACH 1: Current + History Collections

// users collection (current state only)
{
  "_id": ObjectId("u1"),
  "name": "Mayur",
  "email": "mayur@example.com",
  "role": "admin",
  "version": 3,
  "updatedAt": ISODate("2026-04-05")
}

// users_history collection (all previous versions)
{
  "_id": ObjectId("uh1"),
  "docId": ObjectId("u1"),    // reference to current doc
  "version": 2,               // version number
  "name": "Mayur Badodiya",   // old state
  "email": "old@email.com",
  "role": "user",
  "archivedAt": ISODate("2026-03-01") // when this became historical
}

// Atomic update: history + current (in a transaction)
const session = client.startSession();
await session.withTransaction(async () => {
  // Archive current version to history
  const current = await db.users.findOne({ _id: userId });
  await db.users_history.insertOne({ docId: userId, ...current, archivedAt: new Date() }, { session });
  // Update current with new data
  await db.users.updateOne(
    { _id: userId },
    { $set: { role: "admin", updatedAt: new Date() }, $inc: { version: 1 } },
    { session }
  );
});

// Query: get version history for a user
db.users_history.find({ docId: userId }).sort({ version: -1 });

// Rollback: restore version 2
const v2 = await db.users_history.findOne({ docId: userId, version: 2 });
await db.users.replaceOne({ _id: userId }, { ...v2, version: latestVersion + 1 });

Why it matters: Audit trails are required by regulations (GDPR, SOX, HIPAA). Explaining this pattern demonstrates compliance awareness and advanced transaction usage.

Real applications: Medical records requiring audit trail, financial contract amendments, CMS content versioning (draft → published → archived).

Common mistakes: Not using transactions when archiving to history — a server crash between archiving old and writing new leaves data in an inconsistent state.

Massive arrays are one of the most common MongoDB anti-patterns. Embedding arrays that grow without bounds creates: (1) documents approaching the 16 MB limit, (2) entire array loaded on every document read even when you need only one element, (3) queries on array elements ($elemMatch) slow down as array size grows, (4) write operations that push to arrays require rewriting the entire document, (5) working set bloat reducing effective cache. The fix is to reference instead of embed when arrays can grow unboundedly, or apply the Subset or Bucket patterns.

// ❌ ANTI-PATTERN: Massive embedded array
{
  "_id": ObjectId("u1"),
  "username": "mayur",
  "followers": [  // could have 1M+ entries for a celebrity!
    ObjectId("u2"),
    ObjectId("u3"),
    // ... 999,998 more ObjectIds
  ]
}
// Problems:
// 1. Document size: 1M ObjectIds × 12 bytes = 12 MB!
// 2. Every profile load fetches 12 MB even to show just follower count
// 3. $push on every follow action rewrites 12 MB document

// ✅ SOLUTION: Separate relationship collection
// users collection (clean, small)
{ "_id": ObjectId("u1"), "username": "mayur", "followerCount": 1000000 }

// follows collection (the relationship)
{
  "_id": ObjectId("..."),
  "followerId": ObjectId("u2"),   // the follower
  "followeeId": ObjectId("u1"),   // who is followed
  "createdAt": ISODate("...")
}
// Compound index: db.follows.createIndex({ followeeId: 1, followerId: 1 })
// Compound index: db.follows.createIndex({ followerId: 1, followeeId: 1 })

// Efficient queries
db.follows.find({ followeeId: userId }).count();  // follower count
db.follows.find({ followerId: myId, followeeId: targetId }); // am I following?
db.follows.find({ followerId: myId }).limit(20).sort({ createdAt: -1 }); // my follows list

Why it matters: Massive arrays are the #1 MongoDB performance killer in production. Identifying and fixing this anti-pattern demonstrates production schema design experience.

Real applications: Twitter follower graphs, YouTube subscriber lists, Facebook friend networks — all use separate relationship tables/collections, not embedded arrays.

Common mistakes: Not anticipating array growth during initial design. Early-stage apps with 100 users seem fine with embedded arrays, then fail catastrophically when users reach 10K+ followers.

MongoDB supports multiple ways to store hierarchical (tree) data, each optimized for different query patterns. Key approaches: (1) Parent Reference — each node stores its parent's _id (simple, good for finding children), (2) Child Reference — parent stores array of child _ids (good for finding children immediately), (3) Array of Ancestors — each node stores all ancestors (good for breadcrumb/path queries), (4) Materialized Paths — store path as string (good for subtree queries with regex), (5) Nested Sets — store left/right bounds (best read performance for subtrees). Choose based on which tree operations dominate.

// 1. PARENT REFERENCE (most common)
{ _id: "mongodb",     name: "MongoDB",     parent: "databases" }
{ _id: "databases",   name: "Databases",   parent: "technology" }
{ _id: "technology",  name: "Technology",  parent: null }

// Find children of "databases"
db.categories.find({ parent: "databases" });

// Find ancestors (requires multiple queries or $graphLookup)
db.categories.aggregate([{
  $graphLookup: {
    from: "categories", startWith: "$parent",
    connectFromField: "parent", connectToField: "_id",
    as: "ancestors"
  }
}]);

// 2. ARRAY OF ANCESTORS (best for breadcrumb/path)
{ _id: "mongodb", name: "MongoDB", ancestors: ["technology", "databases"] }
// Find all descendants of "databases" in one query:
db.categories.find({ ancestors: "databases" });

// 3. MATERIALIZED PATH (subtree queries with regex)
{ _id: "mongodb", name: "MongoDB", path: ",technology,databases,mongodb," }
// Find all under "databases":
db.categories.find({ path: /,databases,/ });
// Create index for efficient regex on anchored paths
db.categories.createIndex({ path: 1 });

// 4. NESTED SETS (best read, worst write performance)
{ _id: "mongodb", name: "MongoDB", left: 7, right: 8 }
// Find all descendants:
db.categories.find({ left: { $gt: parentLeft }, right: { $lt: parentRight } });

Why it matters: Hierarchical data is common (categories, org charts, menus, file systems). Interviewers assess ability to choose the right tree storage strategy for specific query patterns.

Real applications: E-commerce category trees, organizational charts, file system hierarchies, comment thread nesting, menu structures.

Common mistakes: Using Parent Reference for deep subtree fetches without $graphLookup — requires N recursive queries for N levels, extremely slow for deep trees.

The Unnecessary Indexes anti-pattern occurs when developers create indexes on every field "just in case" rather than based on actual query patterns. Each index: (1) uses RAM and disk space (index data structure stored in WiredTiger), (2) increases write latency (every insert/update/delete must update all indexes), (3) adds overhead to the query planner (more candidate plans to evaluate), (4) inflates replication oplog size. The MongoDB recommendation is to index based on actual query patterns, regularly audit with $indexStats, and drop unused indexes.

// Find unused indexes (zero or low usage since last restart)
db.orders.aggregate([{ $indexStats: {} }]);
// Look for: "accesses.ops": 0 or very low number

// ❌ ANTI-PATTERN: Index everything
db.users.createIndex({ name: 1 });
db.users.createIndex({ age: 1 });
db.users.createIndex({ gender: 1 });
db.users.createIndex({ isActive: 1 });
db.users.createIndex({ createdAt: 1 });
db.users.createIndex({ country: 1 });
// 7 indexes on a write-heavy collection — massive write overhead!

// ✅ GOOD: Index based on query patterns
// If queries are: find by email, list by createdAt, filter active users
db.users.createIndex({ email: 1 }, { unique: true }); // login
db.users.createIndex({ createdAt: -1 });              // listing
db.users.createIndex({ isActive: 1, createdAt: -1 }, {
  partialFilterExpression: { isActive: true }         // only active users
});

// Audit and clean up unused indexes
// 1. Enable profiling to capture slow queries
db.setProfilingLevel(1, { slowms: 100 });
// 2. Review system.profile collection
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 });
// 3. Drop confirmed unused indexes
db.users.dropIndex({ gender: 1 });

Why it matters: Understanding index management is a sign of production experience. Interviewers ask this to confirm you know that indexes have costs, not just benefits.

Real applications: A high-throughput write application (10K inserts/second) with 10 unnecessary indexes degrades to 2K inserts/second due to index maintenance overhead.

Common mistakes: Creating indexes during initial development "to be safe" and never auditing them — production systems accumulate index bloat over time, degrading write performance progressively.

Prev Indexes & Performance Next Relationships

Schema Design

1What are the key principles of MongoDB schema design?

2When should you embed documents vs reference (normalize) them?

3What is the Subset pattern in MongoDB schema design?

4What is the Bucket Pattern and when is it used?

5What is the Computed Pattern in schema design?

6What is the Polymorphic Pattern?

7What is the Extended Reference Pattern?

8What is the Outlier Pattern and why is it used?

9How do you handle schema migrations in MongoDB?

10What is the Document Versioning Pattern?

11What is schema anti-pattern: Massive Arrays?

12What is the Tree structure pattern for hierarchical data?

13What is the Schema Anti-Pattern: Unnecessary Indexes?