Query optimization is the process of improving SQL query performance by writing efficient queries and designing database structures. It involves analyzing execution plans, using indexes effectively, and minimizing resource consumption. Optimized queries reduce server load, improve response times, and enable applications to scale effectively.
-- Unoptimized query - scans entire table
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- Optimized query - uses index range scan
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Verify optimization with EXPLAIN
EXPLAIN SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
Why it matters: Query optimization directly impacts application performance, user experience, and infrastructure costs.
Real applications: E-commerce sites handling millions of queries daily need optimization to stay responsive. Reports running overnight benefit from efficient queries.
Common mistakes: Optimizing queries randomly without analysis, not considering data growth impact on query time.
Slow Query Log captures queries exceeding a threshold (default 10 seconds), helping identify performance bottlenecks. EXPLAIN reveals execution plans showing whether indexes are used. Performance Schema provides detailed metrics. Monitoring tools log and analyze query patterns to find optimization opportunities.
-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2; -- Log queries longer than 2 seconds
-- Check slow query log file location
SHOW VARIABLES LIKE 'slow_query_log_file';
-- Analyze slow query log using mysqldumpslow
-- mysqldumpslow -s t /path/to/mysql-slow.log | head -10
-- Use EXPLAIN to analyze specific query
EXPLAIN SELECT * FROM orders WHERE customer_id IN
(SELECT customer_id FROM customers WHERE country = 'USA');
-- Use EXPLAIN FORMAT=JSON for detailed analysis
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 123;
-- Check Performance Schema
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;
Why it matters: Identifying slow queries is the first step in systematic performance improvement.
Real applications: Production monitoring, debugging performance issues, capacity planning.
Common mistakes: Not enabling slow query log in production, ignoring opportunities to optimize batch queries.
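The server-side slow query log only captures what crosses the threshold on the server; an application-side timer is a useful complement during development. A minimal sketch in Python, using sqlite3 as a stand-in database — the `timed_query` helper and the threshold constant are illustrative, mirroring the `long_query_time = 2` setting above:

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("slow_query")

SLOW_THRESHOLD = 2.0  # seconds, mirroring long_query_time = 2

def timed_query(conn, sql, params=()):
    """Run a query and log it at WARNING level if it exceeds the threshold."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if elapsed > SLOW_THRESHOLD:
        log.warning("slow query (%.2fs): %s", elapsed, sql)
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
rows = timed_query(conn, "SELECT * FROM orders WHERE status = ?", ("pending",))
print(rows)  # [] - empty table, query well under the threshold
```

In production the same idea usually lives in the database driver or ORM as a query hook, so every statement is timed without touching call sites.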
JOIN optimization involves using appropriate indexes on join columns, joining in the correct order (usually smallest result sets first), and avoiding unnecessary joins. Different join types (INNER, LEFT, RIGHT) have different performance characteristics. Using indexes on foreign keys is crucial for join performance.
-- Inefficient join - joins products even though no product columns are used
SELECT o.*, c.name FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
LEFT JOIN products p ON p.id = o.product_id
WHERE o.status = 'completed';
-- Optimized - unused products join removed, extra filter added
SELECT o.*, c.name FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'completed' AND o.product_id = 5;
-- Index needed: orders(status, product_id), customers(id)
-- Join order matters - reduce the result set before joining
-- Bad: join the full orders table, then filter
-- Good: let the WHERE clause shrink orders first, then join
SELECT * FROM orders
JOIN customers ON customers.id = orders.customer_id
WHERE orders.order_date > '2024-01-01';
Why it matters: JOIN optimization significantly impacts query performance, especially with large datasets.
Real applications: Reports joining multiple tables, multi-table searches, analytics queries.
Common mistakes: Unnecessary JOINs, not using indexes on join columns, wrong join order.
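The effect of an index on a join/foreign-key column can be observed directly from the execution plan. A runnable sketch using Python's sqlite3 as a stand-in for MySQL (EXPLAIN QUERY PLAN is SQLite's analogue of EXPLAIN; table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail text
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# A lookup by the foreign-key column, as a join would perform per outer row
lookup = "SELECT * FROM orders WHERE customer_id = 5"

before = plan(lookup)  # full table scan: no index on the join column
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
after = plan(lookup)   # now an index search on customer_id

print("SCAN" in before, "USING INDEX idx_orders_customer" in after)  # True True
```

The same before/after comparison with MySQL's EXPLAIN would show the `type` column move from ALL to ref.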
Subqueries can be optimized by converting them to JOINs, choosing between IN and EXISTS case by case, or materializing results with CTEs. Common Table Expressions (CTEs) improve readability and can be more efficient than repeated subqueries by avoiding recomputation. Proper indexing and filtering at each subquery level are crucial.
-- Inefficient correlated subquery
SELECT c.name,
(SELECT COUNT(*) FROM orders WHERE customer_id = c.id) as order_count
FROM customers c;
-- Optimized with LEFT JOIN
SELECT c.name, COUNT(o.id) as order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;
-- Using CTE for clarity
WITH customer_orders AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
)
SELECT c.name, co.order_count
FROM customers c
LEFT JOIN customer_orders co ON co.customer_id = c.id;
-- IN vs EXISTS optimization
SELECT * FROM products
WHERE category_id IN (SELECT id FROM categories WHERE active = 1);
SELECT * FROM products p
WHERE EXISTS (SELECT 1 FROM categories c
WHERE p.category_id = c.id AND c.active = 1);
Why it matters: Subqueries and CTEs can significantly impact query performance, requiring careful optimization.
Real applications: Complex reports, analytics queries, data validation queries.
Common mistakes: Correlated subqueries in SELECT clause, not materializing CTEs, unnecessary nesting.
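The correlated-subquery and LEFT JOIN forms above are equivalent in result, which is easy to verify on sample data. A sketch using Python's sqlite3 as a stand-in (table contents invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2);
""")

# Correlated subquery: one COUNT evaluated per customer row
correlated = conn.execute("""
    SELECT c.name,
           (SELECT COUNT(*) FROM orders WHERE customer_id = c.id) AS order_count
    FROM customers c ORDER BY c.id
""").fetchall()

# LEFT JOIN + GROUP BY: a single aggregation pass over orders
joined = conn.execute("""
    SELECT c.name, COUNT(o.id) AS order_count
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()

print(correlated == joined, joined)  # True [('Ana', 2), ('Bo', 1), ('Cy', 0)]
```

Note COUNT(o.id), not COUNT(*), in the JOIN version: it keeps customers with zero orders at 0 instead of 1.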
WHERE clause optimization involves writing conditions that can use indexes, avoiding function calls on indexed columns, and using appropriate operators. Functions like YEAR() or DATE_FORMAT() prevent index usage, while direct range comparisons are index-friendly. Sargable conditions (Search ARGument ABLE) allow the optimizer to use indexes effectively.
-- Non-sargable condition - prevents index use
SELECT * FROM employees WHERE YEAR(hire_date) = 2020;
SELECT * FROM products WHERE LOWER(name) = 'laptop';
SELECT * FROM orders WHERE total * quantity > 1000;
-- Sargable conditions - enables index use
SELECT * FROM employees
WHERE hire_date >= '2020-01-01' AND hire_date < '2021-01-01';
SELECT * FROM products WHERE name = 'Laptop';
SELECT * FROM orders WHERE total > 1000; -- bare column; an expression like total * quantity needs an indexed generated column to stay sargable
-- Complex conditions with proper parentheses
SELECT * FROM orders WHERE status = 'completed'
AND (total > 100 OR quantity > 50)
AND order_date > DATE_SUB(NOW(), INTERVAL 30 DAY);
-- Use BETWEEN for range queries
SELECT * FROM sales WHERE amount BETWEEN 100 AND 500;
Why it matters: Proper WHERE clause design determines whether indexes can be used, dramatically affecting query speed.
Real applications: Filtering operations in web applications, report generation, data exports.
Common mistakes: Using functions on indexed columns, OR conditions preventing index usage, complex expression evaluation.
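The sargable/non-sargable distinction shows up directly in execution plans. A runnable sketch using Python's sqlite3 as a stand-in for MySQL — `strftime('%Y', ...)` plays the role of YEAR(), and the schema is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, hire_date TEXT)")
conn.execute("CREATE INDEX idx_hire_date ON employees(hire_date)")

def plan(sql):
    # Collapse EXPLAIN QUERY PLAN detail strings for easy inspection
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function wrapped around the column: the index cannot be used
non_sargable = plan(
    "SELECT * FROM employees WHERE strftime('%Y', hire_date) = '2020'")

# Bare column compared against a range: index range scan
sargable = plan(
    "SELECT * FROM employees WHERE hire_date >= '2020-01-01' "
    "AND hire_date < '2021-01-01'")

print("SCAN" in non_sargable)               # True - full table scan
print("USING INDEX idx_hire_date" in sargable)  # True - range search
```

The same experiment against MySQL with EXPLAIN would show `key = NULL` for the function-wrapped predicate and `type = range` for the rewrite.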
UNION removes duplicates, which requires a deduplication pass (typically a temporary table or sort) and is slower; UNION ALL simply concatenates the results. Use UNION only when duplicate elimination is necessary for correctness; otherwise prefer UNION ALL for performance.
-- UNION removes duplicates (slower)
SELECT name FROM customers WHERE country = 'USA'
UNION
SELECT name FROM customers WHERE state = 'California';
-- Eliminates duplicate names, adds deduplication overhead
-- UNION ALL keeps duplicates (faster)
SELECT name FROM customers WHERE country = 'USA'
UNION ALL
SELECT name FROM customers WHERE state = 'California';
-- Keeps all names, no deduplication pass
-- When duplicates matter in result
SELECT product_id FROM orders WHERE status = 'completed'
UNION -- Only distinct products
SELECT product_id FROM orders WHERE status = 'pending';
-- When all results are needed
SELECT customer_id FROM orders
UNION ALL
SELECT customer_id FROM invoices;
Why it matters: Choosing UNION vs UNION ALL significantly impacts query performance based on requirements.
Real applications: Multi-source data queries, report consolidation, API endpoint responses.
Common mistakes: Using UNION when UNION ALL is sufficient, not understanding deduplication cost.
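The behavioral difference is easy to demonstrate on a row that matches both branches. A sketch using Python's sqlite3 (sample rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, country TEXT, state TEXT);
    INSERT INTO customers VALUES
        ('Ana', 'USA', 'California'),
        ('Bo',  'USA', 'Texas'),
        ('Cy',  'Canada', NULL);
""")

union_rows = conn.execute("""
    SELECT name FROM customers WHERE country = 'USA'
    UNION
    SELECT name FROM customers WHERE state = 'California'
""").fetchall()

union_all_rows = conn.execute("""
    SELECT name FROM customers WHERE country = 'USA'
    UNION ALL
    SELECT name FROM customers WHERE state = 'California'
""").fetchall()

# 'Ana' matches both branches: UNION collapses her to one row, UNION ALL keeps both
print(len(union_rows), len(union_all_rows))  # 2 3
```

If the two branches are known to be disjoint (for example, partitioned by date range), UNION ALL gives the same result as UNION for free.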
GROUP BY optimization involves filtering rows before grouping (WHERE instead of HAVING), using indexes on grouped columns, and limiting GROUP BY output. HAVING filters grouped results after aggregation and is slower than WHERE which filters before grouping. Pre-aggregated tables or materialized views improve repeated group calculations.
-- Inefficient - HAVING after grouping all rows
SELECT customer_id, COUNT(*) as orders
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5;
-- Optimized - WHERE before grouping
SELECT customer_id, COUNT(*) as orders
FROM orders
WHERE order_date > DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY customer_id
HAVING COUNT(*) > 5;
-- Multiple aggregates - calculate once
SELECT customer_id,
COUNT(*) as total_orders,
SUM(amount) as total_spent,
AVG(amount) as avg_order,
MAX(order_date) as last_order
FROM orders
GROUP BY customer_id
HAVING total_orders > 10;
-- Use aggregate functions efficiently
SELECT category, COUNT(*) as count
FROM products
WHERE active = 1
GROUP BY category
ORDER BY count DESC;
Why it matters: GROUP BY operations aggregate data and poor optimization can process unnecessary rows.
Real applications: Summarization reports, dashboard calculations, sales analysis.
Common mistakes: Using HAVING for conditions better served by WHERE, not filtering before grouping.
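The division of labor — WHERE for row-level predicates before grouping, HAVING for aggregate predicates after — can be seen on a small dataset. A sketch using Python's sqlite3 (sample rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO orders (customer_id, amount) VALUES
        (1, 50), (1, 200), (1, 300), (2, 20), (2, 30), (3, 500);
""")

# Row-level filter belongs in WHERE: applied before grouping, fewer rows aggregated
where_version = conn.execute("""
    SELECT customer_id, COUNT(*) FROM orders
    WHERE amount > 100
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

# HAVING is for conditions on the aggregate itself, which WHERE cannot express
having_version = conn.execute("""
    SELECT customer_id, COUNT(*) FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) >= 2
    ORDER BY customer_id
""").fetchall()

print(where_version)   # [(1, 2), (3, 1)] - orders over 100, per customer
print(having_version)  # [(1, 3), (2, 2)] - customers with 2+ orders
```

Putting a row-level condition like `amount > 100` in HAVING would still produce an answer, but only after every row had been grouped.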
ORDER BY optimization uses indexes to avoid sorting when possible, filters data before sorting to reduce sorting volume, and uses LIMIT to fetch only needed rows. Scanning an index in sorted order is much faster than sorting all rows. Filesort (sorting without index) becomes expensive with large result sets.
-- Inefficient without an index on name - sorts every row to return 10
SELECT * FROM customers ORDER BY name DESC LIMIT 10;
-- Optimized - uses index to sort, stops at 10
-- Index on (status, name DESC) recommended (DESC index order requires MySQL 8.0)
SELECT * FROM customers
WHERE status = 'active'
ORDER BY name DESC LIMIT 10;
-- Using pagination efficiently
SELECT * FROM products
WHERE price > 100
ORDER BY created_date DESC
LIMIT 20 OFFSET 0; -- First page
SELECT * FROM products
WHERE price > 100 AND created_date < '2024-06-01' -- seek method: last created_date from the previous page (placeholder value); better than large offsets
ORDER BY created_date DESC
LIMIT 20;
-- Check if index is used for ORDER BY
EXPLAIN SELECT * FROM products
ORDER BY category, price DESC LIMIT 100;
-- Absence of 'Using filesort' in Extra confirms the sort came from the index
Why it matters: Sorting without index support (filesort) is expensive; an index matching ORDER BY lets MySQL stop after LIMIT rows.
Real applications: Pagination, leaderboards, sorted product listings, recent item displays.
Common mistakes: Large OFFSET values with sorting, not indexing ORDER BY columns.
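Whether a sort is satisfied by an index is visible in the plan. A sketch using Python's sqlite3 as a stand-in — SQLite reports an explicit "USE TEMP B-TREE FOR ORDER BY" step, its equivalent of MySQL's filesort (schema and index name invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

def plan(sql):
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

sql = "SELECT id, name FROM products ORDER BY name LIMIT 10"

before = plan(sql)  # explicit sort step: USE TEMP B-TREE FOR ORDER BY
conn.execute("CREATE INDEX idx_name ON products(name)")
after = plan(sql)   # index scan delivers rows already in name order

print("TEMP B-TREE" in before, "TEMP B-TREE" in after)  # True False
```

With the index in place, the engine can walk the index in order and stop after 10 rows instead of sorting the whole table first.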
EXPLAIN reveals query execution plans showing table access order, index usage, join types, and row count estimates. Key columns include: type (join type), possible_keys (indexes that could be used), key (index actually used), and rows (estimated rows examined). Understanding these helps identify optimization opportunities.
-- EXPLAIN output columns
EXPLAIN SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'completed';
-- Output interpretation:
-- id: Query execution order (lower = executed first)
-- select_type: SIMPLE, PRIMARY, SUBQUERY, DERIVED, etc.
-- table: Table being accessed
-- type: access type - const, eq_ref, ref, range, index, ALL (worst)
-- possible_keys: Indexes that could be used
-- key: Index actually used (NULL = no index)
-- key_len: Index length (number of bytes)
-- ref: Columns compared in join
-- rows: Estimated rows examined (not returned!)
-- Extra: Additional info - Using where, Using index, Using filesort, etc.
-- Use EXPLAIN FORMAT=JSON for detailed analysis
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 123;
-- EXPLAIN EXTENDED is deprecated (removed in MySQL 8.0); plain EXPLAIN
-- followed by SHOW WARNINGS shows the optimizer's rewritten query
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
SHOW WARNINGS;
Why it matters: EXPLAIN is the primary tool for understanding and optimizing query performance.
Real applications: Query optimization, performance debugging, query review before deployment.
Common mistakes: Focusing only on table order, ignoring 'rows' column which shows work done, not result count.
Query caching stores query results to avoid repeated execution, while batching combines multiple single operations into one to reduce round trips. Note: the MySQL query cache was deprecated in MySQL 5.7 and removed in MySQL 8.0, so application-level caching with Redis or Memcached is the practical approach. Batching also applies to bulk inserts and updates.
-- Application-level query caching (use Redis or Memcached)
-- Pseudo-code
function getTopProducts() {
if (cache.exists('top_products')) {
return cache.get('top_products');
}
result = SELECT * FROM products ORDER BY sales DESC LIMIT 10;
cache.set('top_products', result, 3600); // Cache for 1 hour
return result;
}
-- Inefficient - multiple single inserts (N round trips)
INSERT INTO logs (user_id, action) VALUES (1, 'login');
INSERT INTO logs (user_id, action) VALUES (2, 'logout');
INSERT INTO logs (user_id, action) VALUES (3, 'update');
-- Optimized - batch insert (1 round trip)
INSERT INTO logs (user_id, action) VALUES
(1, 'login'), (2, 'logout'), (3, 'update');
-- Batch update example
UPDATE products SET sales = sales + 1
WHERE id IN (SELECT product_id FROM order_items
WHERE order_id IN (1,2,3,4,5));
-- Combine multiple queries into a single operation
SELECT * FROM products WHERE id IN (1, 2, 3);
-- One round trip instead of 3 separate queries
Why it matters: Caching and batching reduce database load and improve application response time.
Real applications: Web application performance, data pipeline optimization, API optimization.
Common mistakes: Not considering cache invalidation, overly aggressive caching causing stale data.
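Building one multi-row INSERT instead of N single-row statements can be sketched in application code. A Python example using sqlite3 as a stand-in — against a client-server database the benefit is fewer round trips, and most drivers also offer a batch API for this (table and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user_id INTEGER, action TEXT)")

events = [(1, "login"), (2, "logout"), (3, "update")]

# Build a single multi-row VALUES clause with one placeholder pair per row
placeholders = ", ".join(["(?, ?)"] * len(events))
flat_params = [value for row in events for value in row]
conn.execute(
    f"INSERT INTO logs (user_id, action) VALUES {placeholders}", flat_params)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(count)  # 3 - all rows inserted in one statement
```

Parameter placeholders keep the batch safe from SQL injection; only the shape of the VALUES clause is built from string formatting, never the data.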
The N+1 query problem occurs when fetching N records requires 1 initial query plus N additional queries for related data. This is common in ORMs. Solutions include eager loading (JOIN in single query), batch loading (fetch all IDs then query related data), or architectural changes like denormalization.
-- N+1 Problem: 1 query for customers + 1 per customer for orders = 11 queries
customers = SELECT * FROM customers LIMIT 10;
foreach customer in customers:
orders = SELECT * FROM orders WHERE customer_id = customer.id;
-- Solution 1: JOIN (eager loading)
SELECT c.*, o.* FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
LIMIT 10;
-- Solution 2: Batch loading
customer_ids = SELECT id FROM customers LIMIT 10;
orders = SELECT * FROM orders WHERE customer_id IN (customer_ids);
foreach customer in customers:
customer.orders = orders.filter(order => order.customer_id == customer.id);
-- Solution 3: Subquery approach
SELECT c.*,
COALESCE(order_count.cnt, 0) as order_count
FROM customers c
LEFT JOIN (
SELECT customer_id, COUNT(*) as cnt
FROM orders
GROUP BY customer_id
) order_count ON c.id = order_count.customer_id
LIMIT 10;
Why it matters: N+1 queries can cause dramatic performance degradation as data grows.
Real applications: ORM query optimization, API endpoint optimization, dashboard performance.
Common mistakes: Not detecting N+1 patterns during development, lazy loading without awareness of impact.
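Solution 2 (batch loading) can be sketched end to end in application code. A Python example using sqlite3 as a stand-in — two queries total regardless of how many customers there are (schema and sample rows invented):

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2);
""")

# Query 1: the parent rows
customers = conn.execute("SELECT id, name FROM customers").fetchall()

# Query 2: all related rows at once, instead of one query per customer
ids = [cid for cid, _ in customers]
placeholders = ",".join("?" * len(ids))
order_rows = conn.execute(
    f"SELECT id, customer_id FROM orders WHERE customer_id IN ({placeholders})",
    ids).fetchall()

# Stitch children onto parents in memory
orders_by_customer = defaultdict(list)
for order_id, cid in order_rows:
    orders_by_customer[cid].append(order_id)

result = {name: orders_by_customer[cid] for cid, name in customers}
print(result)  # {'Ana': [1, 2], 'Bo': [3]}
```

This is essentially what ORM eager-loading features (e.g. a "select in" load strategy) do under the hood.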
DISTINCT removes duplicate rows but requires scanning and comparing all values, which is expensive. Optimize by filtering data before applying DISTINCT, using GROUP BY when only specific columns are needed, or reconsidering whether DISTINCT is necessary for correctness at all.
-- Inefficient - DISTINCT on large result set
SELECT DISTINCT customer_id FROM orders;
-- GROUP BY achieves the same result (often executed identically; occasionally faster on older MySQL versions)
SELECT customer_id FROM orders GROUP BY customer_id;
-- Filter before distinct for smaller result
SELECT DISTINCT customer_id FROM orders
WHERE order_date > '2024-01-01';
-- Use indexes on columns used for deduplication
CREATE INDEX idx_order_date ON orders(order_date);
SELECT DISTINCT customer_id FROM orders
WHERE order_date > DATE_SUB(NOW(), INTERVAL 30 DAY);
-- UNION deduplicates implicitly when combining sources, so no explicit DISTINCT is needed
SELECT customer_id FROM orders WHERE status = 'completed'
UNION
SELECT customer_id FROM customers WHERE customer_type = 'vip';
Why it matters: DISTINCT can be expensive; understanding alternatives improves query performance.
Real applications: Unique customer counts, category listings, deduplication in ETL processes.
Common mistakes: Using DISTINCT without filtering, not understanding GROUP BY as alternative.
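That DISTINCT and GROUP BY on the same column produce identical results is easy to confirm on sample data. A sketch using Python's sqlite3 (rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, 3);
""")

distinct_rows = conn.execute(
    "SELECT DISTINCT customer_id FROM orders ORDER BY customer_id").fetchall()
grouped_rows = conn.execute(
    "SELECT customer_id FROM orders GROUP BY customer_id ORDER BY customer_id"
).fetchall()

print(distinct_rows == grouped_rows, distinct_rows)  # True [(1,), (2,), (3,)]
```

Since the outputs match, the choice between the two forms is purely about which plan the optimizer produces — worth checking with EXPLAIN on the real table.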
Pagination with large OFFSET values causes MySQL to scan and discard many rows, which is slow. Keyset pagination (the seek method) uses the last row's ID or sort value to fetch the next page, avoiding the offset overhead entirely. For interactive queries over large result sets, keyset pagination is far superior to traditional LIMIT OFFSET.
-- Inefficient pagination - scans 500K rows to skip them
SELECT * FROM products ORDER BY id LIMIT 500000, 20;
-- Keyset pagination - efficient for large datasets
-- First page
SELECT * FROM products WHERE id > 0 ORDER BY id LIMIT 20;
-- Second page (assuming last page had id=500)
SELECT * FROM products WHERE id > 500 ORDER BY id LIMIT 20;
-- Multiple field keyset pagination
SELECT * FROM orders ORDER BY created_date DESC, id DESC LIMIT 20;
-- Next page (assuming last item: created_date='2024-01-01', id=1000)
SELECT * FROM orders
WHERE (created_date, id) < ('2024-01-01', 1000)
ORDER BY created_date DESC, id DESC LIMIT 20;
-- Note: MySQL may not use an index efficiently for row-constructor comparisons;
-- the expanded OR form shown below is a safer alternative
-- For complex queries, store cursor information
-- Cursor = encoded(last_created_date, last_id)
SELECT * FROM orders
WHERE created_date > DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND (created_date < @cursor_date OR (created_date = @cursor_date AND id < @cursor_id))
ORDER BY created_date DESC, id DESC
LIMIT 20;
Why it matters: Pagination efficiency is critical for user experience in large datasets.
Real applications: Product listings, search results, infinite scroll, API pagination.
Common mistakes: Using OFFSET for large page numbers, not understanding performance impact.
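That keyset pagination returns exactly the same page as OFFSET (just without scanning the skipped rows) can be verified directly. A sketch using Python's sqlite3 with 100 generated rows (schema and page size invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO products (id, name) VALUES (?, ?)",
                 [(i, f"p{i}") for i in range(1, 101)])

PAGE = 10

def offset_page(n):
    # Page n (0-based): the engine must scan and discard n * PAGE rows first
    return conn.execute(
        "SELECT id FROM products ORDER BY id LIMIT ? OFFSET ?",
        (PAGE, n * PAGE)).fetchall()

def keyset_page(last_id):
    # Seek: resume directly after the last id seen on the previous page
    return conn.execute(
        "SELECT id FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, PAGE)).fetchall()

page3_offset = offset_page(3)   # skips 30 rows, returns ids 31..40
page3_keyset = keyset_page(30)  # jumps straight to id > 30 via the primary key
print(page3_offset == page3_keyset)  # True
```

The offset version's cost grows with the page number; the keyset version's cost is constant per page, which is why it suits infinite scroll and APIs.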
Covering indexes include all columns needed by a query, allowing MySQL to satisfy the query entirely from the index without touching the table. 'Using index' in the Extra column of EXPLAIN output indicates a covering index is being used. Eliminating the table lookups dramatically improves performance for frequently run queries.
-- Query without covering index - must access table
SELECT first_name, last_name, email FROM employees WHERE department = 'sales';
-- Index on department, but first_name, last_name, email are in table
-- Create covering index
CREATE INDEX idx_covering ON employees(department, first_name, last_name, email);
-- Same query now uses covering index - much faster
SELECT first_name, last_name, email FROM employees WHERE department = 'sales';
-- EXPLAIN shows 'Using index' confirming covering index
EXPLAIN SELECT first_name, last_name, email FROM employees
WHERE department = 'sales';
-- Covering index trade-offs
-- Pros: Faster queries, no table access
-- Cons: Larger index size, slower inserts/updates, more memory needed
Why it matters: Covering indexes can significantly improve performance for read-heavy workloads.
Real applications: High-traffic API endpoints, dashboard queries, frequently run reports.
Common mistakes: Creating overly large covering indexes, not monitoring index size impact.
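Whether an index covers a query shows up explicitly in the plan. A sketch using Python's sqlite3 as a stand-in — SQLite reports "USING COVERING INDEX", its equivalent of MySQL's 'Using index' in Extra (schema mirrors the example above, with an extra salary column that is deliberately not indexed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    id INTEGER PRIMARY KEY, department TEXT,
    first_name TEXT, last_name TEXT, email TEXT, salary REAL)""")
conn.execute("""CREATE INDEX idx_covering
    ON employees(department, first_name, last_name, email)""")

def plan(sql):
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Every selected column lives in the index: no table access needed
covered = plan("""SELECT first_name, last_name, email
                  FROM employees WHERE department = 'sales'""")

# salary is not in the index: the index finds rows, the table supplies salary
not_covered = plan("""SELECT first_name, salary
                      FROM employees WHERE department = 'sales'""")

print("COVERING INDEX" in covered, "COVERING INDEX" in not_covered)  # True False
```

One extra column in the SELECT list is enough to lose the covering property, which is why covering indexes pair naturally with narrow, frequently run queries.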
The IN operator works best with small literal lists or indexed columns, while EXISTS is usually more efficient for correlated existence checks because it stops at the first matching row. IN materializes the subquery result as a set; EXISTS probes per outer row. Modern MySQL often rewrites both forms into the same semijoin plan, so verify with EXPLAIN rather than assuming.
-- IN with small list - efficient
SELECT * FROM products
WHERE category_id IN (1, 2, 3);
-- IN with subquery - depends on cardinality
SELECT * FROM orders WHERE customer_id IN
(SELECT id FROM customers WHERE country = 'USA');
-- EXISTS - efficient for existence checking
SELECT * FROM customers c
WHERE EXISTS
(SELECT 1 FROM orders o WHERE o.customer_id = c.id);
-- NOT IN vs NOT EXISTS considerations
-- NULL in NOT IN result set causes issues
SELECT * FROM employees
WHERE department_id NOT IN (10, 20, NULL); -- Returns no rows!
-- NOT EXISTS handles NULL correctly
SELECT * FROM employees e
WHERE NOT EXISTS
(SELECT 1 FROM departments d
WHERE d.id = e.department_id);
-- Equivalent queries - check EXPLAIN to see which form the optimizer prefers
SELECT * FROM products
WHERE id IN (SELECT product_id FROM orders WHERE total > 1000);
-- vs
SELECT * FROM products p
WHERE EXISTS
(SELECT 1 FROM orders WHERE product_id = p.id AND total > 1000);
Why it matters: Choosing between IN and EXISTS affects query performance and correctness.
Real applications: Filter queries, existence checks, multi-condition searches.
Common mistakes: Using NOT IN with possible NULL values, not considering NULL handling in subqueries.
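The NOT IN / NULL trap is worth seeing run. A sketch using Python's sqlite3 (sample rows invented; employee 2 references a missing department, employee 3 has no department at all):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, department_id INTEGER);
    CREATE TABLE departments (id INTEGER PRIMARY KEY);
    INSERT INTO departments VALUES (10), (20);
    INSERT INTO employees VALUES (1, 10), (2, 30), (3, NULL);
""")

# NOT IN with a NULL in the list: every non-matching comparison evaluates to
# UNKNOWN under three-valued logic, so no row ever qualifies
not_in = conn.execute(
    "SELECT id FROM employees WHERE department_id NOT IN (10, 20, NULL)"
).fetchall()

# NOT EXISTS checks each row for a match and is not poisoned by NULLs
not_exists = conn.execute("""
    SELECT id FROM employees e
    WHERE NOT EXISTS
        (SELECT 1 FROM departments d WHERE d.id = e.department_id)
""").fetchall()

print(not_in)      # [] - the NULL silently empties the result
print(not_exists)  # [(2,), (3,)] - employees without a matching department
```

The same surprise occurs when the NULL comes from a subquery (`NOT IN (SELECT ...)` over a nullable column), which is harder to spot than a literal NULL.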