Linux
About
Talks
- Amazing talk - must watch!
Kernel Boot-Parameter
Different distributions ship different boot-parameters. Look them up via:
man 7 bootparam
Here are some important kernel command line parameters that should not be forgotten.
GRUB_CMDLINE_LINUX_DEFAULT="quiet zswap.enabled=1 cgroup_enable=memory swapaccount=1 scsi_mod.use_blk_mq=1 nomodeset"
Networking
IPv6
Source of the hint: FreeIPA Deployment Recommendations
DO NOT use ipv6.disable=1 on the kernel commandline: It disables the whole IPv6 stack and breaks Samba.
If necessary, adding ipv6.disable_ipv6=1 will keep the IPv6 stack functional but will not assign IPv6 addresses to any of your network devices. This is the recommended approach for cases when you don't use IPv6 networking.
You may also disable "all" or very specific interfaces.
/etc/sysctl.d/ipv6.conf
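The contents of this file are not shown above. A minimal sketch, assuming the intent is to disable IPv6 addressing through the standard `net.ipv6.conf.*.disable_ipv6` sysctl keys; the helper name `ipv6_conf` and the interface `eth0` are placeholders, not taken from the original notes:

```shell
# Print a sysctl drop-in that disables IPv6 addressing (sketch).
# "all" switches every existing interface, "default" is inherited by
# interfaces that are created later.
ipv6_conf() {
    cat <<'EOF'
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
# Alternatively, disable IPv6 on one interface only (eth0 is a placeholder):
#net.ipv6.conf.eth0.disable_ipv6 = 1
EOF
}

# Usage (as root):
#   ipv6_conf > /etc/sysctl.d/ipv6.conf && sysctl -p /etc/sysctl.d/ipv6.conf
```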
Explicit Congestion Notification (ECN)
Please compare to
https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html
IETF RFC3168 The Addition of Explicit Congestion Notification (ECN) to IP
tcp_ecn - INTEGER

Control use of Explicit Congestion Notification (ECN) by TCP.
ECN is used only when both ends of the TCP connection indicate support for it.
This feature is useful in avoiding losses due to congestion
by allowing supporting routers to signal congestion
before having to drop packets.

Possible values are:

0 Disable ECN. Neither initiate nor accept ECN.
1 Enable ECN when requested by incoming connections and
  also request ECN on outgoing connection attempts.
2 Enable ECN when requested by incoming connections
  but do not request ECN on outgoing connections.

Default: 2
path: /proc/sys/net/ipv4/tcp_ecn
sysctl key: net.ipv4.tcp_ecn
- default: 2
- configuration:
at boottime via sysctl
at runtime via sysctl or procfs
tcp_ecn_fallback - BOOLEAN

If the kernel detects that ECN connection misbehaves,
enable fall back to non-ECN.
Currently, this knob implements the fallback from RFC3168, section 6.1.1.1.,
but we reserve that in future,
additional detection mechanisms could be implemented under this knob.
The value is not used,
if tcp_ecn or per route (or congestion control) ECN settings are disabled.

Default: 1 (fallback enabled)
path: /proc/sys/net/ipv4/tcp_ecn_fallback
sysctl key: net.ipv4.tcp_ecn_fallback
- default: 1
- configuration:
at boottime via sysctl
at runtime via sysctl or procfs
/etc/sysctl.d/net.conf
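The ECN settings for this file are missing here. A sketch, assuming the goal is to request ECN on outgoing connections as well (value 1 in the list above) while keeping the RFC 3168 fallback; `ecn_conf` is a hypothetical helper name:

```shell
# Print the ECN-related sysctl settings (sketch, values are an assumption).
ecn_conf() {
    cat <<'EOF'
# 1 = accept ECN on incoming and request it on outgoing connections
net.ipv4.tcp_ecn = 1
# 1 = fall back to non-ECN if the connection misbehaves (kernel default)
net.ipv4.tcp_ecn_fallback = 1
EOF
}

# Usage: append to the drop-in, then reload all sysctl files
#   ecn_conf | sudo tee -a /etc/sysctl.d/net.conf && sudo sysctl --system
```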
Enable IP forwarding
In Linux there are only a few switches for enabling global forwarding, which are not specific to an interface.
net.ipv4.ip_forward
Please compare to
https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html
kernel.org doc/html/latest /proc/sys/net/ipv4/* Variables and search for ip_forward
ip_forward - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
Forward Packets between interfaces.
This variable is special, its change resets all configuration parameters to their default state (RFC1122 for hosts, RFC1812 for routers)
path: /proc/sys/net/ipv4/ip_forward
sysctl key: net.ipv4.ip_forward
- default: 0
- configuration:
at boottime via sysctl
at runtime via procfs
/proc/sys/net/ipv4/ip_forward
/etc/sysctl.d/net.conf
net.ipv4.ip_forward = 1
net.ipv4.conf.all.forwarding
kernel.org doc/html/latest IP Variables
and search for
forwarding - BOOLEAN
Enable IP forwarding on this interface. This controls whether packets received _on_ this interface can be forwarded.
net.ipv6.conf.all.forwarding
Please compare to kernel.org doc/html/latest /proc/sys/net/ipv6/* Variables
And search for
conf/all/forwarding - BOOLEAN
Enable global IPv6 forwarding between all interfaces.
IPv4 and IPv6 work differently here; e.g. netfilter must be used to control which interfaces may forward packets and which may not.
This also sets all interfaces’ Host/Router setting ‘forwarding’ to the specified value. See below for details.
This is referred to as global forwarding.
SEMANTICS HAVE CHANGED FROM IPv4 TO IPv6
Don't fall prey, as I did, to the assumption that the interface-specific forwarding switches for IPv6 enable forwarding selectively per interface. They only change details of the interface's behavior, such as SLAAC, the acceptance of router advertisements, the IsRouter flag in neighbour advertisements and the honoring of redirects.
https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html#proc-sys-net-ipv6-variables
And search for
forwarding - INTEGER
Configure interface-specific Host/Router behavior.
Note:
It is recommended to have the same setting on all interfaces; mixed router/host scenarios are rather uncommon.
Possible values are:
- 0 Forwarding disabled
- 1 Forwarding enabled
FALSE (0):
By default, Host behaviour is assumed. This means:
- IsRouter flag is not set in Neighbour Advertisements.
- If accept_ra is TRUE (default), transmit Router Solicitations.
- If accept_ra is TRUE (default), accept Router Advertisements (and do autoconfiguration).
- If accept_redirects is TRUE (default), accept Redirects.
TRUE (1):
If local forwarding is enabled, Router behaviour is assumed. This means exactly the reverse from the above:
- IsRouter flag is set in Neighbour Advertisements.
- Router Solicitations are not sent unless accept_ra is 2.
- Router Advertisements are ignored unless accept_ra is 2.
- Redirects are ignored.
Default: 0 (disabled) if global forwarding is disabled (default), otherwise 1 (enabled).
/etc/sysctl.d/ip_forwarding.conf
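The contents of this file are not shown above. A sketch, assuming global forwarding should be switched on for both address families with the keys discussed in this section; `ip_forwarding_conf` is a hypothetical helper name:

```shell
# Print a sysctl drop-in that enables global IPv4 and IPv6 forwarding (sketch).
ip_forwarding_conf() {
    cat <<'EOF'
# Global IPv4 forwarding
net.ipv4.ip_forward = 1
# Global IPv6 forwarding; restrict individual interfaces with netfilter
net.ipv6.conf.all.forwarding = 1
EOF
}

# Usage (as root):
#   ip_forwarding_conf > /etc/sysctl.d/ip_forwarding.conf
```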
Enable the configuration
sysctl -p /etc/sysctl.d/ip_forwarding.conf
Check the configuration
sysctl -a | grep -E 'net\.ipv[46]\.conf\.[^.]+\.?forwarding'
Virtual Memory
- You should use the newer sysfs interface, while the procfs-interface is kept for backwards compatibility.
Take a look at the information about virtual memory management that the kernel exports through its pseudo-filesystems.
Swappiness
path: /proc/sys/vm/swappiness
sysctl key: vm.swappiness
- default: 60
- configuration:
at boottime via sysctl
at runtime via procfs
/proc/sys/vm/swappiness
/etc/sysctl.d/vm.conf
vm.swappiness = 5
Apply configuration via sysctl.
# sysctl --system
* Applying /etc/sysctl.d/30-baloo-inotify-limit.conf ...
fs.inotify.max_user_watches = 524288
* Applying /etc/sysctl.d/30-postgresql-shm.conf ...
* Applying /etc/sysctl.d/30-tracker.conf ...
fs.inotify.max_user_watches = 65536
* Applying /usr/lib/sysctl.d/50-coredump.conf ...
kernel.core_pattern = |/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %e
* Applying /etc/sysctl.d/99-sysctl.conf ...
* Applying /etc/sysctl.d/vm.conf ...
vm.swappiness = 5
vm.dirty_background_ratio = 8
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 32
vm.dirty_writeback_centisecs = 500
* Applying /etc/sysctl.conf ...
Inotify
/etc/sysctl.d/inotify.conf
fs.inotify.max_user_watches=100000
Find and count inotified files (as root)
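The original command is missing here. One possible approach, sketched under the assumption that counting the `inotify` lines in `/proc/<pid>/fdinfo/*` is good enough (each watch appears as one such line); the function name `count_inotify_watches` is made up for this example:

```shell
# Count inotify watches per process by reading /proc/<pid>/fdinfo (sketch).
# Prints "<watches> <pid>" per line, busiest process first.
# The optional argument replaces /proc, which is handy for dry runs.
count_inotify_watches() {
    proc="${1:-/proc}"
    for dir in "$proc"/[0-9]*/fdinfo; do
        [ -d "$dir" ] || continue
        pid="${dir%/fdinfo}"
        pid="${pid##*/}"
        # Each watch shows up as one line starting with "inotify" in fdinfo.
        n=$(cat "$dir"/* 2>/dev/null | grep -c '^inotify' || true)
        if [ "$n" -gt 0 ]; then
            printf '%s %s\n' "$n" "$pid"
        fi
    done | sort -rn
}

# Usage (as root, otherwise only your own processes are readable):
#   count_inotify_watches | head
```

The per-process counts can be compared against fs.inotify.max_user_watches to see which process is close to the limit.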
Memory over-commitment
https://www.kernel.org/doc/html/latest/vm/overcommit-accounting.html
sysctl key: vm.overcommit_memory
path: /proc/sys/vm/overcommit_memory
- default: 0
- configuration:
at boottime via sysctl
at runtime via procfs
- usage:
- redis demands it with
WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
/etc/sysctl.d/vm.conf
vm.overcommit_memory = 1
Hugepages
Hugepages are an optimization of memory management that targets the Translation Lookaside Buffer (TLB), a fast but small cache in the CPU that maps virtual addresses to physical addresses. With bigger pages, fewer TLB entries are needed to cover the same amount of memory, which means fewer TLB misses at runtime.
Transparent hugepages
https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html
khugepaged scans the memory at intervals, defragments it and collapses large areas into hugepages.
Currently THP only works for anonymous memory mappings and tmpfs/shmem. But in the future it can expand to other filesystems. See also tmpfs with systemd
path: /sys/kernel/mm/transparent_hugepage
- default: 0
- configuration:
at boottime via sysfsutils
at runtime via sysfs
- usage:
- redis demands it with
WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
/etc/sysfs.d/transparent_hugepage.conf
kernel/mm/transparent_hugepage/enabled = madvise
Explicit hugepages
https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetlbpage.html
path: /sys/kernel/mm/hugepages/
- default: 0
- configuration:
at boottime via sysctl
at runtime via sysfs
at runtime via procfs /proc/sys/vm/*huge*
This example configures 2048 hugepages, each 2 MiB in size, which may be allocated dynamically on top of the fixed number of hugepages (0).
/etc/sysctl.d/hugepages.conf
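The contents of this file are not shown. A sketch that is consistent with the description and with the sysfs values listed in this section, assuming the `vm.nr_hugepages` and `vm.nr_overcommit_hugepages` sysctl keys; the helper names are hypothetical:

```shell
# Print a sysctl drop-in for an on-demand hugepage pool (sketch).
hugepages_conf() {
    cat <<'EOF'
# No hugepages reserved up front
vm.nr_hugepages = 0
# Up to 2048 surplus hugepages may be allocated on demand
vm.nr_overcommit_hugepages = 2048
EOF
}

# Upper bound of the pool in MiB: number of pages times page size in KiB.
hugepage_pool_mib() {
    pages="$1"
    page_kib="$2"
    echo $(( pages * page_kib / 1024 ))
}

# Usage (as root):
#   hugepages_conf > /etc/sysctl.d/hugepages.conf
#   hugepage_pool_mib 2048 2048
```

With 2048 pages of 2048 KiB each, the dynamic pool can grow to 4096 MiB.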
grep -rH "" /sys/kernel/mm/hugepages
/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages:0
/sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages:0
/sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy:0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages:2048
/sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages:0
/sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages:0
/sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages:0
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages_mempolicy:0
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages:0
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_overcommit_hugepages:0
zswap
https://www.kernel.org/doc/html/latest/admin-guide/mm/zswap.html
https://www.kernel.org/doc/html/next/admin-guide/mm/zswap.html
Zswap is a lightweight compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles for potentially reduced swap I/O. This trade-off can also result in a significant performance improvement if reads from the compressed cache are faster than reads from a swap device.
path: /sys/module/zswap/parameters
- default: N
- configuration:
at boottime via sysfsutils
at runtime via sysfs
grep -R . /sys/module/zswap/parameters
/etc/sysfs.d/zswap.conf
### DEFAULTS
#module/zswap/parameters/same_filled_pages_enabled = Y
#module/zswap/parameters/enabled = N
#module/zswap/parameters/max_pool_percent = 20
#module/zswap/parameters/compressor = lzo
#module/zswap/parameters/non_same_filled_pages_enabled = Y
#module/zswap/parameters/zpool = zbud
#module/zswap/parameters/accept_threshold_percent = 90

module/zswap/parameters/enabled = 1
module/zswap/parameters/compressor = zstd
module/zswap/parameters/max_pool_percent = 30
zram
systemd-zram-generator
/usr/share/doc/systemd-zram-generator/zram-generator.conf.example
man zram-generator.conf
man zram-generator
grep -rH "" /sys/block/zram0
Install systemd integration
apt install systemd-zram-generator
Available compression algorithms
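The command is missing here. The kernel lists the supported algorithms in `/sys/block/zram0/comp_algorithm` and wraps the active one in square brackets. A small sketch that extracts the active entry; the function name `active_zram_algorithm` is an assumption made for this example:

```shell
# Print the compression algorithm currently selected for a zram device (sketch).
# The sysfs file looks like "lzo lzo-rle lz4 [zstd]"; brackets mark the
# active algorithm. The optional argument replaces the default sysfs path.
active_zram_algorithm() {
    file="${1:-/sys/block/zram0/comp_algorithm}"
    tr -s ' ' '\n' < "$file" | sed -n 's/^\[\(.*\)\]$/\1/p'
}

# List everything the kernel offers:
#   cat /sys/block/zram0/comp_algorithm
```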
ZRAM devices may be deleted/reset using zramctl from package util-linux
zramctl -r /dev/zram0
Show info about ZRAM devices
Active swap devices and their metadata, like the priority, can be summarised using
swapon -s
/etc/systemd/zram-generator.conf
# This config file enables a /dev/zram0 swap device with the following
# properties:
# * size: 50% of available RAM or 4GiB, whichever is less
# * compression-algorithm: kernel default
#
# This device's properties can be modified by adding options under the
# `[zram0]` section, or disabled by removing the section header.
# Additional zram devices can be created by appending new `[zramX]`
# sections and setting the appropriate options for each device.
#
# See /usr/share/doc/systemd-zram-generator/zram-generator.conf.example
# and/or zram-generator.conf(5) for a list of available options
[zram0]

# The size of the zram device, as a function of MemTotal, both in MB.
# For example, if the machine has 1 GiB, and zram-size=ram/4,
# then the zram device will have 256 MiB.
# Fractions in the range 0.1–0.5 are recommended.
#
# The default is "min(ram / 2, 4096)".
zram-size = max(min(ram / 2, 4096), ram / 4 - (ram / 4) % 1024)

# The compression algorithm to use for the zram device,
# or leave unspecified to keep the kernel default.
compression-algorithm = zstd

# By default, file systems and swap areas are trimmed on-the-go
# by setting "discard".
# Setting this to the empty string clears the option.
options =

# Write incompressible pages to this device,
# as there's no gain from keeping them in RAM
#writeback-device = /dev/zvol/tarta-zoot/swap-writeback

#swap-priority=100
Generators are run on a reload of the init daemon and the zram devices are initialized immediately, but they are not yet activated as swap.
systemctl daemon-reload
Alter parameters of the zram device at runtime
systemctl restart systemd-zram-setup@zram0.service
Activate the zram swap device
systemctl start dev-zram0.swap
Check swap devices
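The command is missing here; `swapon -s` or `cat /proc/swaps` would show all active swap areas. A sketch that filters `/proc/swaps` for zram-backed entries and prints device and priority; `list_zram_swaps` is a hypothetical helper name:

```shell
# Print "<device> <priority>" for every active zram swap area (sketch).
# /proc/swaps has the columns: Filename Type Size Used Priority.
# The optional argument replaces /proc/swaps for dry runs.
list_zram_swaps() {
    swaps="${1:-/proc/swaps}"
    # Skip the header line and keep only /dev/zram* devices.
    awk 'NR > 1 && $1 ~ /^\/dev\/zram/ { print $1, $5 }' "$swaps"
}

# Usage:
#   list_zram_swaps
```

No output means that no zram device is in use as swap yet.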
zram-tools
Nowadays ZRAM should be integrated in init via systemd-zram-generator
Install zram-tools and systemd integration
apt install zram-tools
Adjust configuration to your needs
/etc/default/zramswap
# Compression algorithm selection
# speed: lz4 > zstd > lzo
# compression: zstd > lzo > lz4
# This is not inclusive of all that is available in latest kernels
# See /sys/block/zram0/comp_algorithm (when zram module is loaded) to see
# what is currently set and available for your kernel[1]
# [1] https://github.com/torvalds/linux/blob/master/Documentation/blockdev/zram.txt#L86
#ALGO=lz4
ALGO=zstd

# Specifies the amount of RAM that should be used for zram
# based on a percentage the total amount of available memory
# This takes precedence and overrides SIZE below
PERCENT=25

# Specifies a static amount of RAM that should be used for
# the ZRAM devices, this is in MiB
#SIZE=256

# Specifies the priority for the swap devices, see swapon(2)
# for more details. Higher number = higher priority
# This should probably be higher than hdd/ssd swaps.
#PRIORITY=1000
Enable the zramswap service
CPU scaling governor
Available scaling governors
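The command is missing here. The governors offered by the cpufreq driver are listed in `scaling_available_governors`. A sketch that checks whether a governor can be selected before writing it; `governor_available` is a made-up helper name and cpu0 is assumed to be representative for all cores:

```shell
# Succeed if the given governor is offered by the cpufreq driver (sketch).
# The sysfs file holds a space-separated list such as
# "conservative ondemand userspace powersave performance schedutil".
governor_available() {
    wanted="$1"
    file="${2:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors}"
    for g in $(cat "$file"); do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# Usage:
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
#   governor_available performance && echo "performance can be selected"
```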
Default is ondemand
Set governor
echo "performance" \
|sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo "powersave" \
|sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo "ondemand" \
|sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
### ALTERNATIVELY WITH find -exec
### (redirection is a shell feature …)
find /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor \
-exec bash -c 'echo powersave > "$0"' {} \;
### ALTERNATIVELY WITH find|xargs
find /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor \
|xargs -I{} -- bash -c 'echo powersave > {}'
Finetune CPU frequency with
# grep -H "" /sys/devices/system/cpu/cpu*/cpufreq/scaling_*_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:600000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:1500000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:600000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:600000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:1500000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq:600000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:600000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:1500000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq:600000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:600000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:1500000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq:600000
Task delay accounting
https://www.kernel.org/doc/Documentation/accounting/delay-accounting.rst
iotop complains that:
CONFIG_TASK_DELAY_ACCT not enabled in kernel, cannot determine SWAPIN and IO %
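So is CONFIG_TASK_DELAY_ACCT not configured? Usually it is compiled in, but since Linux 5.14 delay accounting is switched off by default and has to be enabled explicitly.
You may add the option delayacct to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.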
Enable it at runtime
For persistence
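The commands for both steps are missing above. A sketch, assuming a kernel of version 5.14 or later, which provides the `kernel.task_delayacct` sysctl; the helper name `delayacct_conf` and the file name `delayacct.conf` are placeholders:

```shell
# Runtime (as root); only tasks started afterwards are accounted:
#   sysctl kernel.task_delayacct=1

# Persistence: print a sysctl drop-in to be stored under /etc/sysctl.d/ (sketch).
delayacct_conf() {
    cat <<'EOF'
kernel.task_delayacct = 1
EOF
}

# Usage (as root):
#   delayacct_conf > /etc/sysctl.d/delayacct.conf && sysctl --system
```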
IO-Scheduler
On a hypervisor the scheduler bfq seems to be a reasonable choice.
On a VM with no disk or controller pass-through, none should be used. This avoids optimizing the queues twice, which is inefficient and counterproductive. The hypervisor will optimize the I/O requests anyway.
Make alternative schedulers available
BLK-MQ is nowadays broadly available and enabled in distributions. Using multiple queues on multicore systems with fast storage promises some performance gains.
But when I took a look at the available schedulers, only "mq-deadline" and "none" were available.
This is because the other schedulers are shipped as kernel modules and need to be loaded into the kernel via modprobe first.
Modules may be loaded manually:
Modules may also be loaded automatically at boot-time via /etc/modules.
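The commands for both variants are missing above. A sketch, assuming the module names `bfq` and `kyber-iosched` used by mainline kernels; `sda` is a placeholder device and the helper names are made up for this example:

```shell
# Load the additional multi-queue schedulers right away (as root):
#   modprobe bfq
#   modprobe kyber-iosched

# Lines to append to /etc/modules so the modules are loaded at boot (sketch).
iosched_modules() {
    printf '%s\n' bfq kyber-iosched
}

# Succeed if a scheduler is offered for a disk. The sysfs file looks like
# "[mq-deadline] kyber bfq none"; the brackets mark the active scheduler.
scheduler_available() {
    name="$1"
    file="${2:-/sys/block/sda/queue/scheduler}"
    sed 's/[][]//g' "$file" | tr -s ' ' '\n' | grep -qxF -- "$name"
}

# Usage:
#   iosched_modules | sudo tee -a /etc/modules
#   scheduler_available bfq && echo "bfq is available on sda"
```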
Set IO-Scheduler permanently
kernel-cmdline
This method no longer seems to work.
- Service affecting.
/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=$SCHEDULER"
Refresh grub config and reboot.
udev-rule
Works at run- and at boot-time!
- More selective because disks may be filtered with a match pattern (shell-style glob).
/etc/udev/rules.d/60-persistent-storage-scheduler.rules
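The rule itself is missing here. A sketch of what such a rule might look like, assuming bfq should be selected for whole SATA/SCSI and NVMe disks; the device patterns and the helper name `scheduler_rule` are assumptions, and udev matches `KERNEL==` against shell-style glob patterns:

```shell
# Print a udev rule that selects bfq for whole disks (sketch).
# ENV{DEVTYPE}=="disk" skips partitions, which have no queue/scheduler file.
scheduler_rule() {
    cat <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*|nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="bfq"
EOF
}

# Usage (as root):
#   scheduler_rule > /etc/udev/rules.d/60-persistent-storage-scheduler.rules
```

The bfq module has to be loaded before the rule is applied, otherwise the write to queue/scheduler fails.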
Reload udev-rules
Reload will probably happen automatically but the "trigger" is necessary.
udevadm control --reload-rules && udevadm trigger
Drop FS Cache
echo 3 | tee /proc/sys/vm/drop_caches
Hardening
Disable TCP Timestamping
hping3 -S -p 22 --tcp-timestamp $DESTINATION

root@libertas /home/tobias/Downloads # hping3 -S -p 22 --tcp-timestamp www.rockstable.it
HPING www.rockstable.it (bridge 178.63.149.226): S set, 40 headers + 0 data bytes
len=56 ip=178.63.149.226 ttl=53 DF id=0 sport=22 flags=SA seq=0 win=65160 rtt=24.2 ms
TCP timestamp: tcpts=2031225761

len=56 ip=178.63.149.226 ttl=53 DF id=0 sport=22 flags=SA seq=1 win=65160 rtt=19.8 ms
TCP timestamp: tcpts=2031226761
HZ seems hz=1000
System uptime seems: 23 days, 12 hours, 13 minutes, 46 seconds
Disable temporarily
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
Disable persistently
/etc/sysctl.d/tcp.conf
net.ipv4.tcp_timestamps=0