MySQL

MySQL Replication & HA Interview Questions

8 Questions

Replication copies data from a master to one or more slaves. The master records every change in its binary log; each slave pulls those events into a relay log and applies them. This provides read scaling, non-intrusive backups, and failover candidates.

-- Master configuration (my.cnf)
[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW

-- Create replication user on master
CREATE USER 'repl'@'slave-ip' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'slave-ip';

-- Get master status
SHOW MASTER STATUS;
-- File: mysql-bin.000001, Position: 154

-- Slave configuration (my.cnf)
[mysqld]
server-id = 2
relay-log = mysql-relay-bin

-- Configure slave replication
CHANGE MASTER TO 
    MASTER_HOST='master-ip',
    MASTER_USER='repl',
    MASTER_PASSWORD='password',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=154;

-- Start slave
START SLAVE;
SHOW SLAVE STATUS\G

Why it matters: Horizontal scaling, data redundancy, disaster recovery.

Real applications: Load balancing reads, backup slaves, geographic redundancy.

Common mistakes: Same server-id on multiple servers, not securing replication user.
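Once a slave is started, the first thing to verify is that both replication threads are running. A minimal POSIX-shell sketch (the function name check_slave_health and the sample status output are illustrative, not from MySQL itself) that parses captured SHOW SLAVE STATUS\G output:

```shell
#!/bin/sh
# check_slave_health: parse `SHOW SLAVE STATUS\G` output (passed as a file)
# and report whether both replication threads are running.
check_slave_health() {
    io=$(awk -F': ' '/Slave_IO_Running:/ {print $2}' "$1")
    sql=$(awk -F': ' '/Slave_SQL_Running:/ {print $2}' "$1")
    if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
        echo "OK"
    else
        echo "BROKEN: io=$io sql=$sql"
    fi
}

# Demo against a captured sample; in production you would feed it
# the output of `mysql -e "SHOW SLAVE STATUS\G"` instead.
cat > /tmp/slave_status.txt <<'EOF'
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0
EOF
check_slave_health /tmp/slave_status.txt
```

Both threads must say Yes: the IO thread pulls binlog events from the master, the SQL thread applies them; either one can fail independently.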

ROW logs the actual row changes: safe, but produces larger logs. STATEMENT logs the SQL text: smaller, but unsafe with non-deterministic statements. MIXED logs statements by default and switches to row events for unsafe cases.

-- Configure binlog format
SET GLOBAL binlog_format = 'ROW';  -- Recommended for replication

-- ROW format (safest)
binlog_format = ROW  -- Logs actual data changes
-- Advantages: Safe with functions, triggers, UDFs
-- Disadvantages: Larger binary log size

-- STATEMENT format (space-efficient)
binlog_format = STATEMENT  -- Logs SQL statements
-- Advantages: Smaller log size
-- Disadvantages: Unsafe with UUID(), SYSDATE(), other non-deterministic functions

-- MIXED format (compromise)
binlog_format = MIXED  -- Default to STATEMENT, use ROW when needed
-- Balanced between size and safety

-- Check current format
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- Check which events were logged as ROW (when MIXED)
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 10;
-- Row-logged statements appear as Table_map/Write_rows events instead of Query

-- Statements that are unsafe under STATEMENT
-- LIMIT without ORDER BY: DELETE FROM t LIMIT 1;  -- May affect a different row on each slave
-- Non-deterministic functions: UPDATE t SET id = UUID();  -- Different value on each slave

Why it matters: Data consistency on replicas.

Real applications: High volume replication, complex queries.

Common mistakes: Using STATEMENT with unsafe functions, large ROW logs.
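Because STATEMENT format is the risky choice, it is worth catching in config review. A small shell sketch (check_binlog_format and the sample my.cnf are illustrative) that scans a config file and warns:

```shell
#!/bin/sh
# check_binlog_format: scan a my.cnf-style file and warn when the
# binlog format is set to STATEMENT.
check_binlog_format() {
    fmt=$(awk -F'=' '/^ *binlog_format/ {gsub(/ /, "", $2); print $2}' "$1")
    case "$fmt" in
        STATEMENT) echo "WARN: STATEMENT is unsafe with non-deterministic SQL" ;;
        "")        echo "OK: not set (server default)" ;;
        *)         echo "OK: $fmt" ;;
    esac
}

# Demo against a sample config fragment
cat > /tmp/my_test.cnf <<'EOF'
[mysqld]
server-id = 2
binlog_format = STATEMENT
EOF
check_binlog_format /tmp/my_test.cnf
```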

Replication lag is the delay between a write committing on the master and being applied on the slave. Caused by slow queries on the slave, large transactions, single-threaded apply, and network delays. Monitor with SHOW SLAVE STATUS.

-- Check replication lag
SHOW SLAVE STATUS\G
-- Seconds_Behind_Master: Lag in seconds
-- Relay_Master_Log_File: Which master binlog being processed
-- Exec_Master_Log_Pos: Position in master log

-- Example output
-- Seconds_Behind_Master: 5  -- 5 seconds behind

-- Monitor lag with queries
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) as replication_lag_seconds 
FROM replication_heartbeat LIMIT 1;  -- Your heartbeat table

-- High lag causes
-- 1. Slow queries on slave
-- 2. Large transactions on master
-- 3. Single-threaded replication (MySQL < 5.7)
-- 4. Network issues

-- Enable parallel replication (MySQL 5.7+); requires a stopped SQL thread
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 4;
START SLAVE SQL_THREAD;

-- Check lag in scripts
#!/bin/bash
mysql -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master" | awk '{print $2}'

Why it matters: Data freshness on slaves, failover readiness.

Real applications: Monitoring dashboards, alerting systems.

Common mistakes: Ignoring lag, not monitoring during high load.
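The one-liner above only prints the number; alerting needs a threshold check, and must treat NULL (SQL thread stopped) as critical rather than zero. A hedged sketch (lag_alert and the sample output are illustrative):

```shell
#!/bin/sh
# lag_alert: read Seconds_Behind_Master from `SHOW SLAVE STATUS\G` output
# and compare against a threshold. NULL means replication is not running.
lag_alert() {
    # $1: file with status output, $2: threshold in seconds
    lag=$(awk -F': ' '/Seconds_Behind_Master:/ {print $2}' "$1")
    if [ "$lag" = "NULL" ] || [ -z "$lag" ]; then
        echo "CRITICAL: replication not running"
    elif [ "$lag" -gt "$2" ]; then
        echo "WARN: ${lag}s behind (threshold ${2}s)"
    else
        echo "OK: ${lag}s behind"
    fi
}

# Demo with a captured sample line and a 30-second threshold
printf '        Seconds_Behind_Master: 5\n' > /tmp/lag_sample.txt
lag_alert /tmp/lag_sample.txt 30
```

The NULL branch matters: a stopped slave reports NULL, not a large number, so naive numeric checks silently miss total replication failure.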

Common issues: Duplicate key errors, replication stops, data inconsistency, lag growth. Use binlog position tracking, checksums, and monitoring.

-- Issue 1: Duplicate key error on slave
ERROR 1062: Duplicate entry
-- Solution: Skip the single offending event
-- (binlog-position replication only; with GTIDs, inject an empty transaction instead)
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

-- Issue 2: Replication stopped at error
SHOW SLAVE STATUS\G
-- Last_Error: error message

-- Fix: Fix data issue then sync
STOP SLAVE;
-- Fix data inconsistency manually
[Fix the data]
START SLAVE;

-- Issue 3: Verify data consistency
-- Use pt-table-checksum (Percona)
pt-table-checksum --replicate=percona.checksums h=master-ip

-- Issue 4: Large lag after restart
-- Use backup restore + incremental binlog
mysqldump --all-databases --single-transaction --master-data=2 > backup.sql
mysql < backup.sql

-- Issue 5: Master-slave data divergence
-- Use Percona XtraBackup for consistent snapshot
xtrabackup --backup --target-dir=/backup
[Copy to slave]
xtrabackup --prepare --target-dir=/backup

Why it matters: Maintain data consistency across replicas.

Real applications: Production troubleshooting, failover procedures.

Common mistakes: Forcing replication through errors, not fixing root cause.
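Troubleshooting usually starts with extracting the exact error from SHOW SLAVE STATUS\G. A shell sketch (report_slave_error and the sample output are illustrative) that pulls out Last_Errno and Last_Error:

```shell
#!/bin/sh
# report_slave_error: extract Last_Errno / Last_Error from
# `SHOW SLAVE STATUS\G` output to see why replication stopped.
report_slave_error() {
    errno=$(sed -n 's/^ *Last_Errno: //p' "$1")
    errmsg=$(sed -n 's/^ *Last_Error: //p' "$1")
    if [ -z "$errno" ] || [ "$errno" = "0" ]; then
        echo "no replication error"
    else
        echo "error $errno: $errmsg"
    fi
}

# Demo against a captured sample (errno 1062 = duplicate key)
cat > /tmp/err_sample.txt <<'EOF'
                Last_Errno: 1062
                Last_Error: Duplicate entry '42' for key 'PRIMARY'
EOF
report_slave_error /tmp/err_sample.txt
```

Feeding this into alerting catches a stopped slave with the root-cause message attached, instead of just a lag number.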

Master-master replication enables writes on both servers. Useful for geographic distribution. Challenges: Conflict resolution, infinite loops, coordination.

-- Master A configuration
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
auto_increment_increment = 2
auto_increment_offset = 1

-- Master B configuration
server-id = 2
log_bin = mysql-bin
binlog_format = ROW
auto_increment_increment = 2
auto_increment_offset = 2

-- Both are masters and slaves of each other
-- Master A slave setup
CHANGE MASTER TO MASTER_HOST='master-b-ip', MASTER_USER='repl', ...;

-- Master B slave setup
CHANGE MASTER TO MASTER_HOST='master-a-ip', MASTER_USER='repl', ...;

-- AUTO_INCREMENT handling
-- Master A uses IDs: 1, 3, 5, 7, ...
-- Master B uses IDs: 2, 4, 6, 8, ...
-- Prevents duplicate key conflicts

-- Challenges
-- 1. Write conflicts if same row modified on both
-- 2. Application logic must handle eventual consistency
-- 3. Failover is complex

-- Monitoring both directions
SHOW SLAVE STATUS\G  -- Check replication to other master
SHOW MASTER STATUS;  -- Check outgoing replication

Why it matters: Active-active replication, geographic redundancy.

Real applications: Multi-region deployments, symmetric failover.

Common mistakes: Same auto_increment offsets, writing same data on both.
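The ID interleaving above can be sanity-checked with a tiny shell sketch (ai_ids is a hypothetical helper that mirrors how auto_increment_offset and auto_increment_increment determine the sequence each server hands out):

```shell
#!/bin/sh
# ai_ids: the first N AUTO_INCREMENT values a server generates, given its
# auto_increment_offset ($1), auto_increment_increment ($2), and count ($3).
ai_ids() {
    i=$1 n=0 out=""
    while [ "$n" -lt "$3" ]; do
        out="$out $i"
        i=$((i + $2)); n=$((n + 1))
    done
    echo "${out# }"
}

echo "Master A: $(ai_ids 1 2 4)"   # offset 1, increment 2
echo "Master B: $(ai_ids 2 2 4)"   # offset 2, increment 2
```

With increment 2 the two sequences never overlap, which is exactly why the offsets must differ: identical offsets put both masters on the same sequence and guarantee duplicate-key conflicts.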

Percona XtraDB Cluster (PXC) provides virtually synchronous, quorum-based replication. MySQL Group Replication (5.7+) offers built-in group clustering in single-primary or multi-primary mode. Both ensure strong consistency at the cost of write latency.

-- Percona XtraDB Cluster
-- Multi-master, synchronous replication, quorum-based
-- Configuration
pxc_strict_mode = ENFORCED

-- Join cluster
systemctl start mysql

-- Check cluster status
SHOW STATUS WHERE variable_name LIKE 'wsrep%';
-- wsrep_connected: ON (connected to cluster)
-- wsrep_cluster_size: 3 (cluster members)
-- wsrep_local_state_uuid: Cluster ID

-- MySQL Group Replication
-- Initialize group replication plugin
INSTALL PLUGIN group_replication SONAME 'group_replication.so';

-- Configure host
SET GLOBAL group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
SET GLOBAL group_replication_start_on_boot=ON;

-- Bootstrap primary
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;

-- Join secondary
START GROUP_REPLICATION;

-- Monitor group
SELECT * FROM performance_schema.replication_group_members;

Why it matters: Synchronous consistency, easier failover.

Real applications: Critical systems requiring strong consistency.

Common mistakes: Network partitions, quorum loss, latency impact.
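Quorum loss is just arithmetic: a partition keeps accepting writes only if it holds a majority, floor(N/2)+1 nodes. A small sketch (function names illustrative) that makes the tradeoff concrete:

```shell
#!/bin/sh
# Majority quorum for an N-node cluster: floor(N/2) + 1 members are
# needed to keep accepting writes after a partition.
quorum()    { echo $(( $1 / 2 + 1 )); }
# Node failures the cluster survives while retaining quorum.
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
    echo "nodes=$n quorum=$(quorum $n) tolerated_failures=$(tolerated $n)"
done
```

Note that 4 nodes tolerate no more failures than 3, which is why quorum-based clusters are normally sized with an odd node count.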

Backup strategies: back up from a slave (non-intrusive to the master), use Percona tools, keep binary logs for point-in-time recovery, and optionally run a dedicated binlog server as a central backup point.

-- Backup from slave (non-intrusive to the master)
STOP SLAVE;
mysqldump --all-databases --single-transaction > backup.sql
START SLAVE;  -- Lag accumulates during the backup, then the slave catches up

-- Better: compress the dump on the fly (parallel replication helps catch up)
STOP SLAVE;
mysqldump --all-databases --single-transaction | gzip > backup.sql.gz
START SLAVE;

-- Percona XtraBackup (faster, non-blocking)
xtrabackup --backup --target-dir=/backup --parallel=4
-- Slave continues running, xtrabackup reads consistent snapshot

-- Binlog server (central backup point)
-- Slave collects all binlogs from master
-- Separate server for backup operations
[Master] --> [Slave with binlog_server] --> [Backup jobs]

-- Point-in-time recovery
SHOW BINARY LOGS;  -- List all binary logs
mysqlbinlog --stop-datetime='YYYY-MM-DD HH:MM:SS' mysql-bin.000001 mysql-bin.000002 | mysql  -- Replay up to a specific point

-- Set binlog retention (MySQL 8.0+; on 5.7 use expire_logs_days)
SET GLOBAL binlog_expire_logs_seconds = 604800;  -- 7 days

-- Backup strategy example
-- Day 1: Full backup
mysqldump --all-databases > /backup/full_backup_day1.sql

-- Day 2-6: Incremental (save binlogs)
-- Just copy binlog files: mysql-bin.000002, mysql-bin.000003, ...

-- Recovery
mysql < /backup/full_backup_day1.sql
mysqlbinlog /backup/mysql-bin.000002 /backup/mysql-bin.000003 | mysql

Why it matters: Fast backups without blocking production.

Real applications: Disaster recovery, compliance, point-in-time restore.

Common mistakes: Blocking backups during peak hours, not testing recovery.
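The recovery step above is easy to get wrong under pressure, so it helps to assemble the command before running it. A sketch (pitr_cmd is a hypothetical helper; the timestamp is illustrative) that builds the replay pipeline, stopping just before the bad statement:

```shell
#!/bin/sh
# pitr_cmd: assemble (but not execute) the point-in-time recovery pipeline
# that replays binlogs up to, but not including, the given timestamp.
pitr_cmd() {
    stop="$1"; shift
    echo "mysqlbinlog --stop-datetime='$stop' $* | mysql"
}

# Demo: bad DROP TABLE ran at 10:00, so replay up to 09:59
pitr_cmd '2024-01-15 09:59:00' mysql-bin.000002 mysql-bin.000003
```

Printing the command first lets you review the binlog list and cutoff time, then pipe it to sh once verified, which is cheap insurance against replaying the destructive statement itself.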

Best practices: monitor lag, test failover regularly, maintain a tested backup strategy, run multiple slaves for different roles, use replication filters with caution, and create a dedicated replication user.

-- ✅ Good: Monitor replication with alerts
-- Alert when Seconds_Behind_Master exceeds a threshold or is NULL (replication stopped)
SHOW SLAVE STATUS\G

-- ✅ Good: Multiple slaves for different purposes
-- Slave 1: Read scaling
-- Slave 2: Backups  
-- Slave 3: Analytics (allow 1-2 min lag)

-- ✅ Good: Test failover regularly
-- Run failover simulation in dev/staging
-- Document runbook
-- Train team

-- ✅ Good: Use dedicated replication user
CREATE USER 'repl'@'%.domain.com' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'%.domain.com';

-- ❌ Bad: Replication filters (can cause data loss)
-- replicate-ignore-db = temp  -- Risky!
-- Breaks slave if master has data in ignored db

-- ✅ Good: Set read-only on slaves
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;  -- Even super users can't write

-- Monitor replication with queries
SELECT * FROM performance_schema.replication_connection_status;
SELECT * FROM performance_schema.replication_applier_status;

Why it matters: Production reliability and data safety.

Real applications: High availability setup, disaster recovery.

Common mistakes: Ignoring lag, not testing failover, poor monitoring.