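From 502 to Resolution: Case Studies of Common Nginx Failures
As an operations engineer, have you ever been woken in the middle of the night by an alert call about 502 errors? Have you ever been worn down by a baffling Nginx failure? This article walks through real cases to show how Nginx troubleshooting is done in practice, so you can grow from an ops novice into a capable troubleshooter.
Introduction: The Nginx Pitfalls We Have All Fallen Into
In operations work at an internet company, Nginx failures are among the most common and most frustrating problems. From simple configuration mistakes to complex performance bottlenecks, from sporadic 502s to sustained high latency, every failure has its own cause and its own fix.
As an engineer with 8 years of operations experience, I have been through countless late-night incidents and distilled a troubleshooting methodology that works. In this article I use 10 real cases to show, step by step, how to locate and resolve common Nginx failures quickly.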
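Case 1: The Classic 502 Error - Upstream Service Unreachable
Symptoms
During a sales promotion, an e-commerce site suddenly began returning large numbers of 502 errors. Users could not place orders, and the business impact was severe.
Troubleshooting Process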
Step 1: Check the Nginx error log
# Follow the latest error log entries
tail -f /var/log/nginx/error.log
# Typical 502 error log entry
2024/09/15 1425 [error] 12345#0: *67890 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.100, server: shop.example.com, request: "POST /api/order HTTP/1.1", upstream: "http://192.168.1.200:8080/api/order", host: "shop.example.com"
Step 2: Check the upstream service status
# Check whether the backend service is running
netstat -tulpn | grep 8080
ps aux | grep java
# Test connectivity to the upstream service
curl -I http://192.168.1.200:8080/health
telnet 192.168.1.200 8080
Step 3: Review the Nginx configuration
upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 192.168.1.201:8080 weight=1 max_fails=3 fail_timeout=30s backup;
}
server {
    listen 80;
    server_name shop.example.com;
    location /api/ {
        proxy_pass http://backend_servers;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;
    }
}
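Root Cause Analysis
The investigation showed that the primary server 192.168.1.200 was overloaded, which crashed the Java application, and that the backup server was misconfigured and did not take over traffic in time.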
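When several backends sit behind one upstream block, it can help to tally which backend is actually refusing connections before restarting anything. The following is a minimal sketch, not part of the original procedure: `upstream_failures` is a hypothetical helper name, and it assumes error-log entries have the same shape as the sample 502 entry in Step 1 (a `connect() failed` message with an `upstream: "http://host:port/..."` field).

```shell
#!/usr/bin/env bash
# Sketch: count upstream connection failures per backend.
# Reads Nginx error-log lines on stdin and prints "<count> <host:port>",
# highest count first. Lines without "connect() failed" are ignored.
upstream_failures() {
  grep 'connect() failed' \
    | sed -n 's|.*upstream: *"http://\([^/"]*\).*|\1|p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Example invocation (log path as used earlier in this case):
#   upstream_failures < /var/log/nginx/error.log
```

A backend that dominates this list is the first candidate to check with `curl` or `telnet` as in Step 2.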
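Solution
# 1. Restart the application on the failed server
systemctl restart tomcat
# 2. Fix the backup server configuration
# Remove the backup parameter so that both servers handle requests at the same time
upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
    server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}
# 3. Test and reload the Nginx configuration
nginx -t && nginx -s reload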
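Preventive Measures
• Configure a health check mechanism
• Choose an appropriate load balancing strategy
• Build a thorough monitoring and alerting system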
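Case 2: Service Outage Caused by an Expired SSL Certificate
Symptoms
Customers of a financial website reported that they could not reach the site; their browsers showed a "Your connection is not private" error.
Troubleshooting Process
Check the SSL certificate status
# Check the certificate validity dates
openssl x509 -in /etc/nginx/ssl/domain.crt -noout -dates
# Check the certificate served online with openssl
echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates
# Inspect the Nginx SSL configuration
nginx -T | grep -A 10 -B 5 ssl_certificate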
Example Nginx SSL Configuration
server {
    listen 443 ssl http2;
    server_name finance.example.com;
    ssl_certificate /etc/nginx/ssl/domain.crt;
    ssl_certificate_key /etc/nginx/ssl/domain.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    # HSTS setting
    add_header Strict-Transport-Security "max-age=31536000" always;
}
Solution
# 1. Issue a new SSL certificate (using Let's Encrypt as an example)
certbot --nginx -d finance.example.com
# 2. Update the certificate paths in the Nginx configuration manually
ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem;
# 3. Test and reload the configuration
nginx -t && nginx -s reload
# 4. Verify the SSL certificate
curl -I https://finance.example.com
Automated Solution
# Create the certificate renewal cron job (entries in /etc/cron.d need a user field)
cat > /etc/cron.d/certbot << 'EOF'
0 12 * * * root /usr/bin/certbot renew --quiet --post-hook "nginx -s reload"
EOF
# Add a certificate monitoring script
cat > /usr/local/bin/ssl_check.sh << 'EOF'
#!/bin/bash
DOMAIN="finance.example.com"
DAYS=30
EXPIRY_DATE=$(echo | openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))
if [ "$DAYS_LEFT" -lt "$DAYS" ]; then
    echo "SSL certificate for $DOMAIN expires in $DAYS_LEFT days!"
    # Send an alert
fi
EOF
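Case 3: Performance Bottleneck Under High Concurrency
Symptoms
A video site responded slowly during the evening peak, and some users reported that videos failed to load.
Performance Analysis Tools
# Check the Nginx connection status
curl http://localhost/nginx_status
# Check system load with htop
htop
# Count network connections
ss -tuln | wc -l
netstat -an | grep :80 | wc -l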
Nginx Status Page Configuration
server {
    listen 80;
    server_name localhost;
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
Performance Tuning Configuration
# Main configuration tuning
worker_processes auto;
worker_rlimit_nofile 65535;
events {
    use epoll;
    multi_accept on;
    # worker_connections is only valid inside the events block
    worker_connections 65535;
}
http {
    # Enable gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_types text/plain text/css application/json application/javascript;
    # File cache tuning
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
    # Connection tuning
    keepalive_timeout 65;
    keepalive_requests 100;
    # Buffer tuning
    client_body_buffer_size 128k;
    client_max_body_size 50m;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
}
System-Level Tuning
# Tune kernel parameters
cat >> /etc/sysctl.conf << 'EOF'
# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000
# File descriptor tuning
fs.file-max = 1000000
EOF
# Apply the settings
sysctl -p
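To sanity-check the worker_processes, worker_connections, and worker_rlimit_nofile values chosen above, a rough capacity estimate is often useful. The sketch below encodes a common rule of thumb rather than an exact limit: maximum clients is roughly worker_processes times worker_connections, halved when Nginx proxies (each request then holds a client connection and an upstream connection). The function names and the "two descriptors per connection" headroom check are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# max_clients WORKERS CONNECTIONS [MODE]
# Rule-of-thumb ceiling on concurrent clients. MODE "proxy" halves the result
# because a proxied request uses one client and one upstream connection.
max_clients() {
  local workers=$1 connections=$2 mode=${3:-static}
  local total=$(( workers * connections ))
  if [ "$mode" = "proxy" ]; then
    total=$(( total / 2 ))
  fi
  echo "$total"
}

# fd_limit_ok NOFILE CONNECTIONS
# Succeeds when worker_rlimit_nofile allows two descriptors per connection
# (an assumed safety margin, not an Nginx requirement).
fd_limit_ok() {
  local nofile=$1 connections=$2
  [ "$nofile" -ge $(( connections * 2 )) ]
}

# Example: 4 workers with 10240 connections each, acting as a reverse proxy
#   max_clients 4 10240 proxy
#   fd_limit_ok 100000 10240 && echo "descriptor limit leaves headroom"
```

Under this heuristic, setting worker_rlimit_nofile equal to worker_connections leaves little descriptor headroom on a proxying server, which is worth checking against `ulimit -n` and actual traffic.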
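Case 4: Problems Caused by Incorrect Cache Configuration
Symptoms
After a news site published updated content, users still saw the old version, and the problem persisted even after they cleared their browser cache.
Cache Configuration Analysis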
server {
    listen 80;
    server_name news.example.com;
    # Static asset caching
    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        add_header Pragma public;
    }
    # Dynamic content
    location / {
        proxy_pass http://backend;
        # Incorrect cache configuration
        proxy_cache_valid 200 302 10m;
        proxy_cache_valid 404 1m;
        add_header X-Cache-Status $upstream_cache_status;
    }
}
Troubleshooting
# Inspect the cache directory
ls -la /var/cache/nginx/
# Review the cache configuration
nginx -T | grep -A 20 proxy_cache
# Test the cache status
curl -I http://news.example.com/article/123 | grep X-Cache-Status
Correct Cache Configuration
http {
    # Cache path configuration
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;
    server {
        listen 80;
        server_name news.example.com;
        # Do not cache API endpoints
        location /api/ {
            proxy_pass http://backend;
            proxy_cache off;
            add_header Cache-Control "no-cache, no-store, must-revalidate";
        }
        # Cache news content
        location /article/ {
            proxy_pass http://backend;
            proxy_cache my_cache;
            proxy_cache_valid 200 5m;
            proxy_cache_use_stale error timeout updating;
            add_header X-Cache-Status $upstream_cache_status;
        }
        # Long-lived caching for static assets
        location ~* \.(jpg|jpeg|png|gif|ico)$ {
            expires 1y;
            add_header Cache-Control "public, immutable";
        }
        location ~* \.(css|js)$ {
            expires 1d;
            add_header Cache-Control "public";
        }
    }
}
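Cache Management Tools
# Purge the cache for a specific URL
# (needs purge support such as the third-party ngx_cache_purge module; stock open-source Nginx does not handle PURGE)
curl -X PURGE http://news.example.com/article/123
# Bulk-remove cache files older than 7 days
# (Nginx names cache files by key hash, with no .cache extension)
find /var/cache/nginx -type f -mtime +7 -delete
# Cache statistics script
cat > /usr/local/bin/cache_stats.sh << 'EOF'
#!/bin/bash
CACHE_DIR="/var/cache/nginx"
echo "Cache directory size: $(du -sh "$CACHE_DIR")"
echo "Cache files count: $(find "$CACHE_DIR" -type f | wc -l)"
echo "Cache HIT count: $(grep -c HIT /var/log/nginx/access.log)"
EOF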
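Counting HIT lines with `grep -c` gives a raw count, not a hit ratio. A ratio needs the number of cacheable requests as the denominator. The sketch below shows one way to compute it, under a stated assumption that is not in the original configuration: the access log format has been extended so that `$upstream_cache_status` is the last field of every line. `cache_hit_ratio` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# cache_hit_ratio: read access-log lines on stdin and print the cache hit
# percentage with one decimal place, or "n/a" when no cacheable request is seen.
# Assumes $upstream_cache_status is the LAST field of each log line.
# Requests that bypassed the cache entirely are logged as "-" and are skipped.
cache_hit_ratio() {
  awk '
    $NF == "HIT" { hit++ }
    $NF ~ /^(HIT|MISS|EXPIRED|STALE|UPDATING|REVALIDATED|BYPASS)$/ { total++ }
    END {
      if (total > 0) printf "%.1f\n", hit * 100 / total
      else print "n/a"
    }
  '
}

# Example invocation:
#   cache_hit_ratio < /var/log/nginx/access.log
```

A ratio that stays low for `/article/` URLs suggests the cache key or validity settings deserve a closer look.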
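Case 5: Disk Space Exhausted by Broken Log Rotation
Symptoms
A server suddenly stopped responding. Inspection showed the disk was 100% full, mostly because the Nginx log files had grown too large.
Diagnosis
# Check disk space
df -h
# Find the large files
du -h /var/log/nginx/ | sort -hr
# Check the log rotation configuration
cat /etc/logrotate.d/nginx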
Fix and Optimization
# Emergency measure: truncate the current logs
> /var/log/nginx/access.log
> /var/log/nginx/error.log
# Signal nginx to reopen its log files
nginx -s reopen
Improved Log Rotation Configuration
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 640 nginx nginx
    sharedscripts
    postrotate
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 $(cat /var/run/nginx.pid)
        fi
    endscript
}
Logging Configuration Tuning
http {
    # Custom log format
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';
    # Conditional logging
    map $status $loggable {
        ~^[23] 0;
        default 1;
    }
    server {
        # Log only failed requests (any status other than 2xx/3xx)
        access_log /var/log/nginx/access.log main if=$loggable;
        # Do not log static asset requests
        location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
            access_log off;
            expires 1y;
        }
    }
}
Monitoring Script
# Disk space monitoring
cat > /usr/local/bin/disk_monitor.sh << 'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage is ${USAGE}%, exceeding threshold of ${THRESHOLD}%"
    # Automatically clean up old rotated logs
    find /var/log/nginx -name "*.log.*" -mtime +7 -delete
    # Send an alert
fi
EOF
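Case 6: Load Balancing Misconfiguration
Symptoms
A service ran on several backend servers, but traffic was distributed unevenly: some servers were overloaded while others sat idle.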
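Before changing the balancing strategy, it helps to measure how requests are actually spread across backends. The following is a minimal sketch under an assumption that is not part of the original setup: the access log format includes a field of the form `upstream=$upstream_addr` (for example `upstream=192.168.1.10:8080`). `upstream_share` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# upstream_share: read access-log lines on stdin and print, per backend,
# "<addr> <requests> <share>%", sorted by address.
# Assumes each proxied request logs a field like upstream=192.168.1.10:8080.
upstream_share() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^upstream=/) {
          addr = substr($i, 10)   # strip the 9-character "upstream=" prefix
          count[addr]++
          total++
        }
      }
    }
    END {
      for (addr in count) {
        printf "%s %d %.1f%%\n", addr, count[addr], count[addr] * 100 / total
      }
    }
  ' | sort
}

# Example invocation:
#   upstream_share < /var/log/nginx/access.log
```

A heavily skewed share under plain round-robin usually points at weights, `ip_hash` with a few dominant client addresses, or backends being marked as failed.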
Load Balancing Strategies Compared
# Round robin (default)
upstream backend_round_robin {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
# Weighted round robin
upstream backend_weighted {
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=2;
    server 192.168.1.12:8080 weight=1;
}
# IP hash
upstream backend_ip_hash {
    ip_hash;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
# Least connections
upstream backend_least_conn {
    least_conn;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
Health Check Configuration
upstream backend_with_health {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;
    # keepalive connection pool
    keepalive 32;
}
server {
    location / {
        proxy_pass http://backend_with_health;
        # Passive health check and failover settings
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;
        # Connection reuse
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
Monitoring Script
# Backend server health check script
cat > /usr/local/bin/backend_health_check.sh << 'EOF'
#!/bin/bash
SERVERS=("192.168.1.10:8080" "192.168.1.11:8080" "192.168.1.12:8080")
for server in "${SERVERS[@]}"; do
    if curl -sf "http://$server/health" > /dev/null; then
        echo "$server: OK"
    else
        echo "$server: FAILED"
        # Send an alert
    fi
done
EOF
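Case 7: Security Configuration Vulnerabilities
Symptoms
The site was hit by malicious scanning that exposed several security weaknesses, so the Nginx security configuration needed hardening.
Security Hardening Configuration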
# Rate-limit zones: limit_req_zone must be declared in the http context, outside any server block
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=1r/s;
server {
    listen 80;
    server_name secure.example.com;
    # Hide version information
    server_tokens off;
    # more_set_headers requires the third-party headers-more module
    more_set_headers "Server: WebServer";
    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "no-referrer-when-downgrade" always;
    add_header Content-Security-Policy "default-src 'self' http: https: data: blob: 'unsafe-inline'" always;
    # Restrict request methods
    if ($request_method !~ ^(GET|HEAD|POST)$) {
        return 405;
    }
    # Block access to hidden files and directories
    location ~ /\. {
        deny all;
        access_log off;
        log_not_found off;
    }
    # Limit upload size
    client_max_body_size 10M;
    # Limit request rate
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://backend;
    }
    location /login {
        limit_req zone=login burst=5 nodelay;
        proxy_pass http://backend;
    }
}
Protection Script
# fail2ban configuration example
cat > /etc/fail2ban/filter.d/nginx-4xx.conf << 'EOF'
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*" (404|403|400) .*$
ignoreregex =
EOF
cat > /etc/fail2ban/jail.local << 'EOF'
[nginx-4xx]
enabled = true
port = http,https
filter = nginx-4xx
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 60
EOF
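Case 8: Reverse Proxy Configuration Problems
Symptoms
With Nginx acting as a reverse proxy, the real client IP was lost and the backend service could not obtain correct client information.
Analysis and Solution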
server {
    listen 80;
    server_name api.example.com;
    location / {
        proxy_pass http://backend;
        # Pass the client IP through correctly
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Redirect handling
        proxy_redirect off;
        # Timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
        # Buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
        proxy_busy_buffers_size 8k;
    }
}
WebSocket Support
map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}
server {
    listen 80;
    server_name ws.example.com;
    location /websocket {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        # WebSocket-specific setting: keep long-lived connections open
        proxy_read_timeout 86400;
    }
}
Case 9: Conflicting URL Rewrite Rules
Symptoms
The site's URL rewrite rules had grown complex, producing redirect loops and 404 errors.
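A redirect loop can be confirmed from the command line by following the chain of Location headers and looking for a URL that appears twice. The sketch below is one way to do that; `first_repeated_url` is a hypothetical helper name, and the `curl` pipeline in the comment is only an illustration of how its input might be produced.

```shell
#!/usr/bin/env bash
# first_repeated_url: read one URL per line on stdin, print the first URL that
# appears a second time, and return 0. Returns 1 if no URL repeats.
first_repeated_url() {
  awk 'seen[$0]++ { print; found = 1; exit } END { exit !found }'
}

# Example: collect the Location headers of up to 10 redirects and check them.
#   curl -sIL --max-redirs 10 http://example.com/some/path \
#     | tr -d '\r' \
#     | awk 'tolower($1) == "location:" { print $2 }' \
#     | first_repeated_url && echo "redirect loop detected"
```

The repeated URL identifies which rule sends requests back to an earlier hop, which narrows down the rewrite or return directive to inspect.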
Rewrite Rule Optimization
server {
    listen 80;
    server_name example.com www.example.com;
    # Force a redirect to the primary domain
    if ($host != 'example.com') {
        return 301 https://example.com$request_uri;
    }
    # SEO-friendly URL rewriting
    location / {
        try_files $uri $uri/ @rewrites;
    }
    location @rewrites {
        rewrite ^/product/([0-9]+)$ /product.php?id=$1 last;
        rewrite ^/category/([a-zA-Z0-9-]+)$ /category.php?name=$1 last;
        rewrite ^/user/([a-zA-Z0-9]+)$ /profile.php?username=$1 last;
        return 404;
    }
    # Prevent redirect loops
    location ~ \.php$ {
        try_files $uri =404;
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_index index.php;
        include fastcgi_params;
    }
}
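Debugging Rewrite Rules
# Enable rewrite logging
error_log /var/log/nginx/rewrite.log notice;
rewrite_log on;
# Test a rewrite rule
location /test {
    # No break flag here: break would stop rewrite-module processing and skip the return below
    rewrite ^/test/(.*)$ /debug?param=$1;
    return 200 "Rewrite test: $args\n";
}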
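Case 10: Performance Monitoring and Tuning
Symptoms
A thorough Nginx performance monitoring system was needed so that performance problems could be detected and resolved promptly.
Monitoring Script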
# Nginx performance monitoring script
cat > /usr/local/bin/nginx_monitor.sh << 'EOF'
#!/bin/bash
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"
# Fetch the status page
STATUS=$(curl -s "$NGINX_STATUS_URL")
# stub_status layout: line 1 active connections, line 2 column titles,
# line 3 accepts/handled/requests counters, line 4 reading/writing/waiting
ACTIVE_CONN=$(echo "$STATUS" | grep "Active connections" | awk '{print $3}')
ACCEPTS=$(echo "$STATUS" | awk 'NR==3 {print $1}')
HANDLED=$(echo "$STATUS" | awk 'NR==3 {print $2}')
REQUESTS=$(echo "$STATUS" | awk 'NR==3 {print $3}')
READING=$(echo "$STATUS" | awk 'NR==4 {print $2}')
WRITING=$(echo "$STATUS" | awk 'NR==4 {print $4}')
WAITING=$(echo "$STATUS" | awk 'NR==4 {print $6}')
# Write to the log
echo "$(date): Active: $ACTIVE_CONN, Reading: $READING, Writing: $WRITING, Waiting: $WAITING" >> "$LOG_FILE"
# Alerting logic
if [ "$ACTIVE_CONN" -gt 1000 ]; then
    echo "High connection count: $ACTIVE_CONN" | logger -t nginx_monitor
fi
EOF
Comprehensive Tuning Configuration
# Consolidated tuning configuration
worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile 100000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    use epoll;
    worker_connections 10240;
    multi_accept on;
    accept_mutex off;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    # Log format
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" $request_time $upstream_response_time';
    # Performance tuning
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 1000;
    # Compression tuning
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
    # File cache tuning
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
    # Security tuning
    server_tokens off;
    client_header_timeout 10;
    client_body_timeout 10;
    reset_timedout_connection on;
    send_timeout 10;
    # Rate limiting
    limit_req_zone $binary_remote_addr zone=global:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=addr:10m;
    include /etc/nginx/conf.d/*.conf;
}
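Troubleshooting Methodology Summary
1. Standardized troubleshooting process
1. Gather incident information: confirm the symptoms, scope of impact, and time of occurrence
2. Review log files: error.log, access.log, and system logs
3. Check configuration files: syntax checks and logic checks
4. Verify network connectivity: port status and connectivity tests
5. Analyze performance metrics: CPU, memory, network, and disk
6. Determine the root cause: dig deeper until the real cause is found
7. Apply the fix: a temporary workaround first, then a permanent solution
8. Verify the fix: functional tests and performance tests
9. Record lessons learned: documentation and process improvements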
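2. Common troubleshooting tools
• Log analysis: tail, grep, awk, sed
• Network tools: curl, wget, telnet, netstat, ss
• Performance monitoring: htop, iotop, iftop, the Nginx stub_status page
• System diagnostics: strace, lsof, tcpdump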
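As an illustration of the log-analysis tools listed above, a first pass over the access log during an incident is often a breakdown of responses by status class. The sketch below assumes the log starts with the usual `remote_addr - remote_user [time] "request" status ...` layout, so the status code is the first token after the quoted request line. `status_summary` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# status_summary: read access-log lines on stdin and print the number of
# responses per status class (2xx, 3xx, 4xx, 5xx), sorted by class.
# Splits on double quotes so the status is the first token after the request.
status_summary() {
  awk -F'"' '
    {
      split($3, rest, " ")
      if (rest[1] ~ /^[1-5][0-9][0-9]$/) {
        class = substr(rest[1], 1, 1) "xx"
        count[class]++
      }
    }
    END {
      for (class in count) print class, count[class]
    }
  ' | sort
}

# Example: summarize the most recent 10000 requests
#   tail -n 10000 /var/log/nginx/access.log | status_summary
```

A sudden rise in the 5xx count relative to 2xx is usually the cue to move on to error.log and the upstream checks.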
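3. Preventive measures
• Build a thorough monitoring and alerting system
• Back up configuration files regularly
• Adopt automated operations tooling
• Define standardized operating procedures
• Run failure drills regularly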
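Conclusion
Nginx troubleshooting is a core skill for every operations engineer, and it requires a solid theoretical foundation and extensive experience.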
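Original title: 從502到排障:Nginx常見故障分析案例 (From 502 to Resolution: Case Studies of Common Nginx Failures)
Source: WeChat ID magedu-Linux, WeChat official account 馬哥Linux運維. Please credit the source when reposting.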