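How ethtool works, and an approach to troubleshooting NIC packet loss

I have previously written about handling packet loss in softirq processing caused by excessive traffic load on LVS NICs, and about RPS and RFS multi-queue NIC performance tuning in practice. For most people, whose systems are not under much pressure, the chance of running into those issues is actually low. This time I want to share an approach to troubleshooting something more common: packet loss on server NICs. If you are looking for a way to resolve point-to-point packet loss, the scope may be considerably wider, so you may want to read my earlier article on diagnosing network problems with MTR first. On Linux, the usual tool for analyzing NIC packet loss is naturally ethtool.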

Update history

June 22, 2020 - First draft

Read the original - https://wsgzao.github.io/post/ethtool/

Introduction to ethtool

ethtool - utility for controlling network drivers and hardware

ethtool is the standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices. It can be used to:

Get identification and diagnostic information
Get extended device statistics
Control speed, duplex, autonegotiation and flow control for Ethernet devices
Control checksum offload and other hardware offload features
Control DMA ring sizes and interrupt moderation
Control receive queue selection for multiqueue devices
Upgrade firmware in flash memory

Most features are dependent on support in the specific driver. See the manual page for full information.

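ethtool is used to view and modify the driver parameters and hardware settings of network devices, especially wired Ethernet devices. You can change Ethernet card parameters as needed, including autonegotiation, speed, duplex, and Wake-on-LAN. With the card configured properly, your computer can communicate effectively over the network. The tool also provides a great deal of information about the Ethernet devices attached to your Linux system.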

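Understanding how packets are received

This part is excerpted from an analysis by the Meituan technical team, with thanks.

Receiving a packet is a complex process involving many low-level technical details, but it roughly takes the following steps:

The NIC receives the packet.
The packet is moved from the NIC's hardware buffer into server memory.
The kernel is notified to process it.
The packet is processed layer by layer by the TCP/IP stack.
The application reads the data from the socket buffer via read().

Moving received packets from the NIC into host memory (NIC and driver interaction)

After the NIC receives a packet, it first needs to hand the data over to the kernel, and the bridge between the two is the rx ring buffer. It is a region shared by the NIC and the driver. In fact, the rx ring buffer does not store the actual packet data but descriptors, each pointing to the address where the packet is really stored. The flow is as follows: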

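(Flow diagram not preserved: the driver allocates sk_buffer and places its descriptor in the rx ring buffer; the NIC takes the descriptor from the rx ring buffer and writes the packet into sk_buffer via DMA.)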

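When the driver cannot keep up with the rate at which the NIC receives packets, it does not allocate buffers in time, packets received by the NIC cannot be written to sk_buffer promptly, and they pile up. Once the NIC's internal buffer is full, some of the data is discarded, causing packet loss. This loss is counted as rx_fifo_errors; it shows up as growth in the fifo field in /proc/net/dev and in the overruns counter in ifconfig.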
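
The fifo counter mentioned above can also be read straight from /proc/net/dev without ifconfig. Below is a minimal sketch, assuming the usual two-line header and the receive-side column order bytes, packets, errs, drop, fifo; the function name rx_fifo_count is made up for this example.

```shell
# rx_fifo_count IFACE [FILE]
# Print the receive-side fifo counter of IFACE from a /proc/net/dev style file.
# (Hypothetical helper; FILE defaults to /proc/net/dev.)
rx_fifo_count() {
    awk -v ifc="$1" '
        { sub(/:/, " ") }                      # "eth0:123" becomes "eth0 123"
        $1 == ifc { print $6; found = 1 }      # $6 is the RX fifo column
        END { exit (found ? 0 : 1) }           # non-zero exit if IFACE is absent
    ' "${2:-/proc/net/dev}"
}

# Example: rx_fifo_count eth0
```

Sampling this value a few seconds apart shows whether the count is still growing.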

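Notifying the kernel (driver and Linux kernel interaction)

At this point, the packet has been moved into sk_buffer (sk_buff in the kernel source). As mentioned above, this is a buffer allocated in memory by the driver, and it is written via DMA, which puts data directly into memory without involving the CPU. This means the kernel does not actually know that new data has arrived in memory. So how does the kernel find out? The answer is interrupts: an interrupt tells the kernel that new data has arrived and needs further processing.

Interrupts come in two kinds, hard and soft, so a brief look at the difference is needed first:

Hard interrupt: generated by the hardware itself and random in timing. After the CPU receives a hard interrupt, it runs the interrupt handler. The handler only does the critical work that can be finished in a short time; the remaining, longer-running work is deferred until after the interrupt and completed by a soft interrupt. The hard interrupt is also called the top half.
Soft interrupt: raised by the interrupt handler of the corresponding hard interrupt. It is usually implemented in code in advance and is not random. (There are also soft interrupts triggered by applications, which are unrelated to the NIC receive path discussed here.) It is also called the bottom half.

Once the NIC has copied the packet into the kernel buffer sk_buffer via DMA, it immediately raises a hardware interrupt. When the CPU receives it, it first enters the top half; the interrupt handler for the NIC interrupt is part of the NIC driver. The handler then raises a soft interrupt, entering the bottom half, which starts consuming the data in sk_buffer and hands it to the kernel protocol stack.

Interrupts allow a fast, timely response to NIC data requests, but with a high data volume they generate a flood of interrupt requests, and the CPU spends most of its time handling interrupts, which is very inefficient. To solve this, current kernels and drivers process data using a mechanism called NAPI (New API). It can be understood simply as interrupt plus polling: under heavy traffic, after one interrupt the driver polls to receive a certain number of packets before returning, which avoids raising many interrupts.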

Interpreting ifconfig output

[root@localhost ~]# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.135 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::20c:29ff:fe9b:52d3 prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:9b:52:d3 txqueuelen 1000 (Ethernet)
RX packets 833 bytes 61846 (60.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 122 bytes 9028 (8.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

(1) RX errors

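The total number of receive errors, including too-long-frames errors, Ring Buffer overflow errors, CRC errors, frame synchronization errors, fifo overruns, missed packets, and so on.

(2) RX dropped

The packet made it into the Ring Buffer but was dropped while being copied into memory, because of system-level causes such as insufficient memory.

(3) RX overruns

The number of fifo overruns, caused by the IO arriving at the Ring Buffer (aka Driver Queue) exceeding the IO the kernel can handle. The Ring Buffer here is the buffer that sits in front of the IRQ request. Clearly, a growing overruns count means packets are being dropped by the NIC's physical layer before they even reach the Ring Buffer, and the CPU failing to service interrupts in time is one reason the Ring Buffer fills up. The problem machine discussed here was losing packets because its interrupts were unevenly distributed (all landing on core0) and no affinity had been set.

(4) RX frame

The number of misaligned frames.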
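
The uneven interrupt distribution described under RX overruns can be checked from /proc/interrupts. The following is a rough sketch, assuming the NIC's IRQ lines end with a name that starts with the interface name (for example eth0-TxRx-0, which varies by driver); irq_per_cpu is a hypothetical helper name.

```shell
# irq_per_cpu IFACE [FILE]
# Sum the per-CPU interrupt counts of every IRQ line whose name starts with IFACE.
# (Hypothetical helper; FILE defaults to /proc/interrupts.)
irq_per_cpu() {
    awk -v ifc="$1" '
        NR == 1 { ncpu = NF; next }            # header row: one column per CPU
        index($NF, ifc) == 1 {                 # last field is the IRQ name
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++) printf "CPU%d %.0f\n", i - 1, total[i]
        }
    ' "${2:-/proc/interrupts}"
}

# Example: irq_per_cpu eth0
```

If nearly all of the counts sit on CPU0, that matches the situation described above, where no IRQ affinity had been configured.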

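How a NIC works

If the receive flow above is not detailed enough, here is a plain-text explanation.

Receiving packets on the NIC

A packet on the wire is first picked up by the NIC. The NIC checks the packet's CRC to ensure integrity, then strips the packet header to obtain the frame. It then checks the destination MAC address in the frame and discards the frame if it differs from the NIC's own MAC address (except in promiscuous mode).

The NIC copies the frame into its internal FIFO buffer and raises a hardware interrupt. (On NICs with a ring buffer, it seems the frame can first be stored in the ring buffer before a software interrupt is raised; the next article will explain in detail the path a frame takes in Linux. The ring buffer is shared between the NIC and the driver. It is memory used by the device but visible to the operating system: in the Linux kernel source, NIC drivers allocate this space with kcalloc, so a ring buffer generally has an upper limit. Also, the ring buffer size should denote the number of frames it can hold, not a size in bytes. On some systems the ethtool command cannot change the ring parameters to set the ring buffer size; I do not know why yet, possibly the driver does not support it.)

Through its hard interrupt handler, the NIC driver builds an sk_buff, copies the frame from the NIC FIFO into the skb in memory, and then hands it to the kernel. (A NAPI-capable NIC should place frames directly in the ring buffer; after the initial hard interrupt, further interrupts are held off and the soft interrupt polls the ring buffer, copying the data and passing it straight to the upper layers. In one round of soft interrupt processing, each NIC can handle up to weight frames.)

Along the way, the NIC chip applies MAC filtering to frames to reduce system load (except in promiscuous mode).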

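Sending packets on the NIC

The NIC driver prepends a 14-byte MAC header to the IP packet to form a frame (without CRC yet). The frame contains the sender's and receiver's MAC addresses. Because the driver creates the MAC header, any address can be filled in, which also makes host spoofing possible.

The driver copies the frame (still without CRC) into the NIC chip's internal buffer for the NIC to process.

The NIC chip wraps the incomplete frame (missing its CRC) into a packet ready for transmission, that is, it adds the synchronization header and the CRC, then puts it on the wire. That completes the sending of one IP packet, and every NIC attached to the wire can see the packet.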

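NIC interrupt handler

Every device that generates interrupts has a corresponding interrupt handler, which is part of its device driver. Each NIC has an interrupt handler, which tells the NIC that the interrupt has been received and copies packets from the NIC buffer into memory.

When the NIC receives a packet from the network, it needs to notify the kernel that the packet has arrived, so it raises an interrupt immediately. The kernel responds by running the interrupt handler the NIC registered. The handler starts running, acknowledges the hardware, copies the latest network packets into memory, and then reads further packets from the NIC.

All of this is important, urgent, hardware-specific work. The kernel usually needs to copy network packets into system memory quickly, because the NIC's receive buffer has a fixed size and is far smaller than system memory. If the copy is delayed, the NIC's FIFO buffer is bound to overflow: incoming packets fill the NIC's buffer and subsequent packets can only be dropped. This is most likely where the overruns count in ifconfig comes from.

Once the network packets have been copied into system memory, the interrupt's job is done and control returns to the program that was running before the interrupt.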

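Buffer access

The NIC's kernel buffer resides in the PC's memory and is controlled by the kernel, while the NIC itself has a FIFO buffer or a ring buffer; the two should be kept distinct. The FIFO is fairly small, so whenever it holds data, the NIC tries to move that data into the kernel buffer as soon as possible.

The buffer on the NIC belongs to neither kernel space nor user space. It is a hardware buffer that provides a cushion between the NIC and the operating system.

The kernel buffer lives in kernel space, in memory, and is used by kernel code as a buffer for data read from or written to the hardware.

The user buffer lives in user space, in memory, and is used by user programs as a buffer for data read from or written to the hardware.

In addition, to speed up data exchange, the kernel buffer can be mapped into user space so that kernel code and user programs can access the same region at once.

For NICs with a ring buffer, the ring buffer is shared between the driver and the NIC, so the kernel can access it directly. Typically the kernel copies frames into its own kernel space for processing and delivers them to the upper-layer protocols. From then on each skb is passed along by pointer until the user receives the data. So for ring buffer NICs, the bulk of the copying happens when frames move from the ring buffer into kernel-controlled memory.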

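Approach to troubleshooting packet loss

The NIC works at the data link layer, which performs some checks and encapsulates data into frames. We can look at whether those checks report errors to determine whether there is a transmission problem, and then look at the software side to see whether packets are being dropped because a buffer is too small.

Check the hardware first

One machine kept triggering packet loss alerts, so first check whether anything is wrong at the lowest layer:

(1) Check whether the link mode is normal

[root@localhost ~]# ethtool eth0 | grep -E 'Speed|Duplex'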
Speed: 1000Mb/s
Duplex: Full

(2) Check whether the CRC check is normal

[root@localhost ~]# ethtool -S eth0 | grep crc
rx_crc_errors: 0

Speed, Duplex, and CRC all look fine, so interference at the physical layer can basically be ruled out.
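
To avoid scanning the full ethtool -S output by eye, the counters can be filtered down to the ones that are non-zero. This is a small sketch: counter names are driver-specific, so the err|drop|miss|fifo pattern is only an assumption about what is worth watching, and nonzero_errors is a made-up name.

```shell
# nonzero_errors
# Read "ethtool -S" output on stdin and print error-like counters that are above zero.
# (Hypothetical helper; the name pattern below is an assumption, adjust per driver.)
nonzero_errors() {
    awk -F': *' '
        { name = $1; sub(/^[ \t]+/, "", name) }          # strip the leading indent
        name ~ /err|drop|miss|fifo/ && ($2 + 0) > 0 { print name, $2 }
    '
}

# Example: ethtool -S eth0 | nonzero_errors
```

An empty result means none of the matched counters has incremented since the driver was loaded.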

overruns and buffer size

# Use ifconfig to check whether overruns keeps growing
for i in $(seq 1 100); do ifconfig eth2 | grep RX | grep overruns; sleep 1; done
# Here it keeps increasing
RX packets:346547657 errors:0 dropped:0 overruns:35345 frame:0

# ethtool can change the NIC's buffer size, provided the NIC supports it. My server has an Intel 1000M NIC; here is what the ethtool manual says
# -g   --show-ring   Queries the specified ethernet device for rx/tx ring parameter information.
# -G   --set-ring    Changes the rx/tx ring parameters of the specified ethernet device.
# Show the NIC's current buffer size
ethtool -g eth0

[root@localhost ~]# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256

# Change the buffer size
ethtool -G eth0 rx 2048
ethtool -G eth0 tx 2048

[root@localhost ~]# ethtool -G eth0 rx 2048
[root@localhost ~]# ethtool -G eth0 tx 2048
[root@localhost ~]# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048
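
Before raising the ring size it helps to know how far the current value is from the hardware maximum. Here is a minimal sketch that reads ethtool -g output on stdin, assuming the section headings shown above (Pre-set maximums and Current hardware settings); rx_ring_headroom is a hypothetical name.

```shell
# rx_ring_headroom
# Read "ethtool -g" output on stdin and print the current and maximum RX ring sizes.
# (Hypothetical helper; exits non-zero if either value is missing.)
rx_ring_headroom() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        $1 == "RX:" { value[section] = $2 }    # skips "RX Mini:" and "RX Jumbo:"
        END {
            if (value["max"] == "" || value["cur"] == "") exit 1
            printf "current=%d max=%d\n", value["cur"], value["max"]
        }
    '
}

# Example: ethtool -g eth0 | rx_ring_headroom
```

When current already equals max, ethtool -G cannot enlarge the ring any further and the loss has to be addressed elsewhere.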

Red Hat's official resolution

Issue

Why is rx_crc_errors incrementing in the receive counters of the ethtool -S output?

$ ethtool -S <Interface_name> | grep -i error
     rx_error_bytes: 0
     tx_error_bytes: 0
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 9244
     rx_align_errors: 0

Resolution

Change the cable.
Check switch configuration.
Change the network interface card.

Root Cause

Most of the time, an incrementing rx_crc_errors value means the problem is in Layer 1 of the networking model.
When a packet is received at the interface, it goes through a data integrity check called the cyclic redundancy check. If the packet fails that check, it is counted in rx_crc_errors.
The switch was forcing the NIC to operate in half-duplex mode. Fixing the switch so that it tells the NIC to operate in full-duplex mode resolved the issue.

Diagnostic Steps

Check the ethtool -S output and find where the drops and errors are.

$ ethtool -S <Interface_name> | grep -i error
     rx_error_bytes: 0
     tx_error_bytes: 0
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 9244  >>>>>>
     rx_align_errors: 0

Check the numbers corresponding to rx_crc_errors.

Common ethtool commands

ethtool p1p1

Settings for p1p1:
	Supported ports: [ FIBRE ]
	Supported link modes:   10000baseT/Full
	Supported pause frame use: Symmetric
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  10000baseT/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Port: FIBRE
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: off
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000007 (7)
			       drv probe link
	Link detected: yes

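This shows the interface type, link modes, speed, and other information for p1p1, as well as whether a cable is currently connected (for twisted pair, Supported ports shows TP; for fiber it shows FIBRE). Three key fields are worth listing: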

Supported ports: [ FIBRE ]

Speed: 10000Mb/s

Link detected: yes
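
The three fields above can be pulled out of the full ethtool output in one pass. This is a minimal sketch that reads the output on stdin, assuming the "Key: value" layout shown above; link_summary is a made-up helper name.

```shell
# link_summary
# Read "ethtool IFACE" output on stdin and print the three key fields as key=value.
# (Hypothetical helper; relies on the "Key: value" layout of the ethtool output.)
link_summary() {
    awk -F': *' '
        { key = $1; sub(/^[ \t]+/, "", key) }            # strip the leading indent
        key == "Supported ports" || key == "Speed" || key == "Link detected" {
            print key "=" $2
        }
    '
}

# Example: ethtool p1p1 | link_summary
```

This is handy when checking many interfaces in a loop, since each one reduces to three short lines.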

# -S shows NIC- and driver-specific statistics, such as bytes received/sent and broadcast packets received/sent.
ethtool -S p1p1 | grep -i error
     rx_errors: 0
     tx_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     rx_length_errors: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_csum_offload_errors: 0


# -p helps tell which physical NIC each ethX corresponds to; it usually works by making the LED on the NIC port blink continuously
ethtool -p <Interface_name>
ethtool -p eth0

# -i shows NIC driver information, such as the driver name and version
ethtool -i p1p1

driver: ixgbe
version: 5.1.0-k-rh7.6
firmware-version: 0x80000960, 18.3.6
expansion-rom-version:
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

# ethtool -s ethX [speed 10|100|1000] [duplex half|full] [autoneg on|off]
# Set the port speed to 10/100/1000M, set half or full duplex, and turn autonegotiation on or off
ethtool -s eth0 speed 100