MySQL

MySQL Subqueries & CTEs Interview Questions

8 Questions

Scalar subqueries return single value. Table subqueries return result set. Scalar subqueries work in SELECT, WHERE, HAVING. Table subqueries typically in FROM or WHERE IN.

-- Scalar subquery (returns one value)
SELECT name, salary, (SELECT AVG(salary) FROM employees) as avg_salary
FROM employees;  -- Allowed in SELECT

-- Subquery in WHERE
SELECT * FROM employees 
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Table subquery
SELECT * FROM (SELECT id, name FROM employees WHERE salary > 50000) as e;

-- IN with table subquery
SELECT * FROM orders 
WHERE customer_id IN (SELECT id FROM customers WHERE country = 'USA');

-- EXISTS (alternative to IN)
SELECT * FROM customers c 
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

Why it matters: Different patterns for different filtering needs.

Real applications: Average comparisons, EXISTS checks, IN filters.

Common mistakes: Scalar subquery returning multiple rows, efficiency of IN vs EXISTS.

Correlated subqueries reference outer query. Execute once per outer row. Useful for row-by-row comparisons but can be slow on large datasets.

-- Correlated subquery
SELECT id, name, salary FROM employees e1
WHERE salary > (SELECT AVG(salary) FROM employees e2 WHERE e2.department = e1.department);

-- Per-row comparison
SELECT order_id, amount FROM orders o
WHERE amount > (SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id);

-- EXISTS with correlation
SELECT c.id, c.name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.amount > 100);

-- Performance issue: Executes subquery for each outer row
SELECT * FROM employees e1
WHERE salary = (SELECT MAX(salary) FROM employees e2 WHERE e2.department_id = e1.department_id);

-- Better: Use window function
SELECT * FROM (
    SELECT *, MAX(salary) OVER (PARTITION BY department_id) as max_dept_salary
    FROM employees
) t WHERE salary = max_dept_salary;

Why it matters: Performance trade-offs with correlation.

Real applications: Above/below average, row-by-row filtering.

Common mistakes: Performance on large tables, using when JOINs are more efficient.

IN typically requires value matching. EXISTS stops after finding first match. EXISTS often more efficient for large result sets, especially with NOT variations.

-- IN: Collects all matching values
SELECT * FROM customers 
WHERE customer_id IN (SELECT customer_id FROM orders WHERE amount > 100);

-- EXISTS: Stops after finding match
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.amount > 100);

-- NOT IN: Problematic with NULLs
SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders);
-- If customer_id has NULLs, returns empty result!

-- NOT EXISTS: Better than NOT IN
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id);

-- IN with subquery - can be optimized to JOIN
SELECT * FROM orders WHERE customer_id IN (SELECT id FROM customers WHERE country = 'USA');
-- Equivalent to:
SELECT DISTINCT o.* FROM orders o 
JOIN customers c ON o.customer_id = c.id WHERE c.country = 'USA';

Why it matters: Query performance and correctness.

Real applications: Filtering by related records, excluding records.

Common mistakes: NOT IN with NULLs, not considering DISTINCT.

CTEs (WITH clause) create named temporary result sets. Improve readability and enable recursion. Available MySQL 8.0+. Can be referenced multiple times in main query.

-- Basic CTE
WITH high_earners AS (
    SELECT id, name, salary FROM employees WHERE salary > 100000
)
SELECT * FROM high_earners;

-- Multiple CTEs
WITH recent_orders AS (
    SELECT * FROM orders WHERE order_date >= CURDATE() - INTERVAL 30 DAY
),
large_orders AS (
    SELECT * FROM recent_orders WHERE amount > 1000
)
SELECT * FROM large_orders;

-- CTE in aggregate
WITH customer_spending AS (
    SELECT customer_id, SUM(amount) as total_spent FROM orders GROUP BY customer_id
)
SELECT * FROM customer_spending WHERE total_spent > 500;

-- Recursive CTE (hierarchy)
WITH RECURSIVE hierarchy AS (
    SELECT id, parent_id, name, 1 as level FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.parent_id, c.name, h.level + 1 
    FROM categories c JOIN hierarchy h ON c.parent_id = h.id
)
SELECT * FROM hierarchy;

Why it matters: Readability and recursion support.

Real applications: Hierarchies, temporary aggregations, complex queries.

Common mistakes: MySQL < 8.0 not supported, recursive limits.

Subqueries vs JOINs vs CTEs: All achieve similar results but with different readability/performance. JOINs often more efficient. CTEs improve readability. Choice depends on query complexity and optimizer behavior.

-- Subquery approach
SELECT * FROM customers WHERE id IN (SELECT customer_id FROM orders WHERE amount > 100);

-- JOIN approach
SELECT DISTINCT c.* FROM customers c 
JOIN orders o ON c.id = o.customer_id WHERE o.amount > 100;

-- CTE approach
WITH qualified_customers AS (
    SELECT DISTINCT customer_id FROM orders WHERE amount > 100
)
SELECT c.* FROM customers c JOIN qualified_customers q ON c.id = q.customer_id;

-- Performance depends on:
-- - Query optimizer (different in each MySQL version)
-- - Index availability
-- - Data volume
-- - Join selectivity

-- EXPLAIN to compare
EXPLAIN SELECT * FROM customers c WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
EXPLAIN SELECT DISTINCT c.* FROM customers c JOIN orders o ON c.id = o.customer_id;

-- Best practice: Use to maximize code readability, then optimize if needed

Why it matters: Query performance and code clarity.

Real applications: Complex filtering, multi-step transformations.

Common mistakes: Premature optimization, unclear intent.

Recursive CTEs reference themselves. Used for hierarchies, trees, organizational structures. Contain anchor (base) and recursive parts.

-- Basic recursive CTE
WITH RECURSIVE numbers AS (
    SELECT 1 as n  -- Anchor member
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 10  -- Recursive member
)
SELECT * FROM numbers;

-- Organization hierarchy
WITH RECURSIVE org_hierarchy AS (
    -- Anchor: Get top-level managers
    SELECT id, name, manager_id, 1 as level FROM employees WHERE manager_id IS NULL
    UNION ALL
    -- Recursive: Get direct reports
    SELECT e.id, e.name, e.manager_id, oh.level + 1 
    FROM employees e 
    JOIN org_hierarchy oh ON e.manager_id = oh.id
    WHERE oh.level < 10  -- Prevent infinite loops
)
SELECT * FROM org_hierarchy ORDER BY level, name;

-- Generate date series
WITH RECURSIVE date_series AS (
    SELECT CURDATE() as date
    UNION ALL
    SELECT DATE_ADD(date, INTERVAL 1 DAY) FROM date_series WHERE date < DATE_ADD(CURDATE(), INTERVAL 30 DAY)
)
SELECT * FROM date_series;

Why it matters: Handle hierarchical data naturally.

Real applications: Org charts, category hierarchies, bill-of-materials.

Common mistakes: Infinite loops (missing WHERE), stack limits, MySQL < 8.0.

Materialized CTEs execute once and store results. Non-materialized CTEs might re-execute. MySQL handles materialization automatically. Can optimize complex queries but consume memory.

-- CTE used multiple times (materialized)
WITH order_stats AS (
    SELECT customer_id, COUNT(*) as order_count, AVG(amount) as avg_amount
    FROM orders GROUP BY customer_id
)
SELECT * FROM order_stats WHERE order_count > 5;

-- Reference multiple times - usually materialized once
SELECT * FROM order_stats WHERE avg_amount > 100;

-- CTE for aggregation then filtering
WITH sales_by_region AS (
    SELECT region, SUM(amount) as total_sales FROM orders GROUP BY region
)
SELECT * FROM sales_by_region WHERE total_sales > 50000;

-- Performance tuning
-- 1. CTE results are materialized unless simple enough to inline
-- 2. Materialized CTEs stored in temporary tables/memory
-- 3. EXPLAIN shows CTE materilization

-- Avoid: CTE with all data then filter heavily
WITH all_orders AS (SELECT * FROM orders)  -- Loads everything
SELECT * FROM all_orders WHERE id = 1;  -- Then filters

-- Better: Filter in CTE
WITH order_detail AS (SELECT * FROM orders WHERE id = 1)
SELECT * FROM order_detail;

Why it matters: Memory and performance with CTEs.

Real applications: Complex multi-step queries, hierarchies.

Common mistakes: Large CTEs materialized to memory, redundant CTEs.

Best practices: Use CTEs for readability and recursion. Use subqueries for simple filters. Prefer JOINs for performance when possible. Test with EXPLAIN. Avoid unnecessary nesting.

-- ✅ Good: CTE for complex logic
WITH inactive_customers AS (
    SELECT customer_id FROM orders 
    WHERE order_date < DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
    GROUP BY customer_id
),
active_customers AS (
    SELECT customer_id FROM orders 
    WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
    GROUP BY customer_id
)
SELECT a.customer_id, a.total_spent FROM active_customers a
WHERE a.customer_id NOT IN (SELECT customer_id FROM inactive_customers);

-- ❌ Bad: Overly nested subqueries
SELECT * FROM (
    SELECT * FROM (
        SELECT * FROM (SELECT * FROM orders) o1
    ) o2
) o3;

-- ✅ Good: Simple subquery for value
SELECT * FROM employees 
WHERE salary > (SELECT AVG(salary) FROM employees);

-- ✅ Good: Join for clarity
SELECT c.name, COUNT(o.id) as order_count FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id;

-- Use EXPLAIN
EXPLAIN SELECT * FROM customers c WHERE id IN (SELECT customer_id FROM orders WHERE amount > 100);

Why it matters: Code quality and performance.

Real applications: Complex reports, data transformations.

Common mistakes: Unnecessary nesting, not testing performance.