summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-07-29scsi: ibmvfc: Fix command state accounting and stale response detectionTyrel Datwyler
Prior to commit 1f4a4a19508d ("scsi: ibmvfc: Complete commands outside the host/queue lock") responses to commands were completed sequentially with the host lock held such that a command had a basic binary state of active or free. It was therefore a simple affair of ensuring the assocaiated ibmvfc_event to a VIOS response was valid by testing that it was not already free. The lock relexation work to complete commands outside the lock inadverdently made it a trinary command state such that a command is either in flight, received and being completed, or completed and now free. This breaks the stale command detection logic as a command may be still marked active and been placed on the delayed completion list when a second stale response for the same command arrives. This can lead to double completions and list corruption. This issue was exposed by a recent VIOS regression were a missing memory barrier could occasionally result in the ibmvfc client receiving a duplicate response for the same command. Fix the issue by introducing the atomic ibmvfc_event.active to track the trinary state of a command. The state is explicitly set to 1 when a command is successfully sent. The CRQ response handlers use atomic_dec_if_positive() to test for stale responses and correctly transition to the completion state when a active command is received. Finally, atomic_dec_and_test() is used to sanity check transistions when commands are freed as a result of a completion, or moved to the purge list as a result of error handling or adapter reset. Link: https://lore.kernel.org/r/20210716205220.1101150-1-tyreld@linux.ibm.com Fixes: 1f4a4a19508d ("scsi: ibmvfc: Complete commands outside the host/queue lock") Cc: stable@vger.kernel.org Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-29scsi: core: Avoid printing an error if target_alloc() returns -ENXIOSreekanth Reddy
Avoid printing a 'target allocation failed' error if the driver target_alloc() callback function returns -ENXIO. This return value indicates that the corresponding H:C:T:L entry is empty. Removing this error reduces the scan time if the user issues SCAN_WILD_CARD scan operation through sysfs parameter on a host with a lot of empty H:C:T:L entries. Avoiding the printk on -ENXIO matches the behavior of the other callback functions during scanning. Link: https://lore.kernel.org/r/20210726115402.1936-1-sreekanth.reddy@broadcom.com Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-29scsi: scsi_dh_rdac: Avoid crash during rdac_bus_attach()Ye Bin
The following BUG_ON() was observed during RDAC scan: [595952.944297] kernel BUG at drivers/scsi/device_handler/scsi_dh_rdac.c:427! [595952.951143] Internal error: Oops - BUG: 0 [#1] SMP ...... [595953.251065] Call trace: [595953.259054] check_ownership+0xb0/0x118 [595953.269794] rdac_bus_attach+0x1f0/0x4b0 [595953.273787] scsi_dh_handler_attach+0x3c/0xe8 [595953.278211] scsi_dh_add_device+0xc4/0xe8 [595953.282291] scsi_sysfs_add_sdev+0x8c/0x2a8 [595953.286544] scsi_probe_and_add_lun+0x9fc/0xd00 [595953.291142] __scsi_scan_target+0x598/0x630 [595953.295395] scsi_scan_target+0x120/0x130 [595953.299481] fc_user_scan+0x1a0/0x1c0 [scsi_transport_fc] [595953.304944] store_scan+0xb0/0x108 [595953.308420] dev_attr_store+0x44/0x60 [595953.312160] sysfs_kf_write+0x58/0x80 [595953.315893] kernfs_fop_write+0xe8/0x1f0 [595953.319888] __vfs_write+0x60/0x190 [595953.323448] vfs_write+0xac/0x1c0 [595953.326836] ksys_write+0x74/0xf0 [595953.330221] __arm64_sys_write+0x24/0x30 Code is in check_ownership: list_for_each_entry_rcu(tmp, &h->ctlr->dh_list, node) { /* h->sdev should always be valid */ BUG_ON(!tmp->sdev); tmp->sdev->access_state = access_state; } rdac_bus_attach initialize_controller list_add_rcu(&h->node, &h->ctlr->dh_list); h->sdev = sdev; rdac_bus_detach list_del_rcu(&h->node); h->sdev = NULL; Fix the race between rdac_bus_attach() and rdac_bus_detach() where h->sdev is NULL when processing the RDAC attach. Link: https://lore.kernel.org/r/20210113063103.2698953-1-yebin10@huawei.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-29Merge branch 'libbpf: rename btf__get_from_id() and btf__load() APIs, ↵Andrii Nakryiko
support split BTF' Quentin Monnet says: ==================== As part of the effort to move towards a v1.0 for libbpf [0], this set improves some confusing function names related to BTF loading from and to the kernel: - btf__load() becomes btf__load_into_kernel(). - btf__get_from_id becomes btf__load_from_kernel_by_id(). - A new version btf__load_from_kernel_by_id_split() extends the former to add support for split BTF. The last patch is a trivial change to bpftool to add support for dumping split BTF objects by referencing them by their id (and not only by their BTF path). [0] https://github.com/libbpf/libbpf/wiki/Libbpf:-the-road-to-v1.0#btfh-apis v3: - Use libbpf_err_ptr() in btf__load_from_kernel_by_id(), ERR_PTR() in bpftool's get_map_kv_btf(). - Move the definition of btf__load_from_kernel_by_id() closer to the btf__parse() group in btf.h (move the legacy function with it). - Fix a bug on the return value in libbpf_find_prog_btf_id(), as a new patch. - Move the btf__free() fixes to their own patch. - Add "Fixes:" tags to relevant patches. v2: - Remove deprecation marking of legacy functions (patch 4/6 from v1). - Make btf__load_from_kernel_by_id{,_split}() return the btf struct, adjust surrounding code and call btf__free() when missing. - Add new functions to v0.5.0 API (and not v0.6.0). ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2021-07-29tools: bpftool: Support dumping split BTF by idQuentin Monnet
Split BTF objects are typically BTF objects for kernel modules, which are incrementally built on top of kernel BTF instead of redefining all kernel symbols they need. We can use bpftool with its -B command-line option to dump split BTF objects. It works well when the handle provided for the BTF object to dump is a "path" to the BTF object, typically under /sys/kernel/btf, because bpftool internally calls btf__parse_split() which can take a "base_btf" pointer and resolve the BTF reconstruction (although in that case, the "-B" option is unnecessary because bpftool performs autodetection). However, it did not work so far when passing the BTF object through its id, because bpftool would call btf__get_from_id() which did not provide a way to pass a "base_btf" pointer. In other words, the following works: # bpftool btf dump file /sys/kernel/btf/i2c_smbus -B /sys/kernel/btf/vmlinux But this was not possible: # bpftool btf dump id 6 -B /sys/kernel/btf/vmlinux The libbpf API has recently changed, and btf__get_from_id() has been deprecated in favour of btf__load_from_kernel_by_id() and its version with support for split BTF, btf__load_from_kernel_by_id_split(). Let's update bpftool to make it able to dump the BTF object in the second case as well. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210729162028.29512-9-quentin@isovalent.com
2021-07-29libbpf: Add split BTF support for btf__load_from_kernel_by_id()Quentin Monnet
Add a new API function btf__load_from_kernel_by_id_split(), which takes a pointer to a base BTF object in order to support split BTF objects when retrieving BTF information from the kernel. Reference: https://github.com/libbpf/libbpf/issues/314 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210729162028.29512-8-quentin@isovalent.com
2021-07-29tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()Quentin Monnet
Replace the calls to function btf__get_from_id(), which we plan to deprecate before the library reaches v1.0, with calls to btf__load_from_kernel_by_id() in tools/ (bpftool, perf, selftests). Update the surrounding code accordingly (instead of passing a pointer to the btf struct, get it as a return value from the function). Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210729162028.29512-6-quentin@isovalent.com
2021-07-29tools: Free BTF objects at various locationsQuentin Monnet
Make sure to call btf__free() (and not simply free(), which does not free all pointers stored in the struct) on pointers to struct btf objects retrieved at various locations. These were found while updating the calls to btf__get_from_id(). Fixes: 999d82cbc044 ("tools/bpf: enhance test_btf file testing to test func info") Fixes: 254471e57a86 ("tools/bpf: bpftool: add support for func types") Fixes: 7b612e291a5a ("perf tools: Synthesize PERF_RECORD_* for loaded BPF programs") Fixes: d56354dc4909 ("perf tools: Save bpf_prog_info and BTF of new BPF programs") Fixes: 47c09d6a9f67 ("bpftool: Introduce "prog profile" command") Fixes: fa853c4b839e ("perf stat: Enable counting events for BPF programs") Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210729162028.29512-5-quentin@isovalent.com
2021-07-29libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()Quentin Monnet
Rename function btf__get_from_id() as btf__load_from_kernel_by_id() to better indicate what the function does. Change the new function so that, instead of requiring a pointer to the pointer to update and returning with an error code, it takes a single argument (the id of the BTF object) and returns the corresponding pointer. This is more in line with the existing constructors. The other tools calling the (soon-to-be) deprecated btf__get_from_id() function will be updated in a future commit. References: - https://github.com/libbpf/libbpf/issues/278 - https://github.com/libbpf/libbpf/wiki/Libbpf:-the-road-to-v1.0#btfh-apis Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210729162028.29512-4-quentin@isovalent.com
2021-07-29libbpf: Rename btf__load() as btf__load_into_kernel()Quentin Monnet
As part of the effort to move towards a v1.0 for libbpf, rename btf__load() function, used to "upload" BTF information into the kernel, as btf__load_into_kernel(). This new name better reflects what the function does. References: - https://github.com/libbpf/libbpf/issues/278 - https://github.com/libbpf/libbpf/wiki/Libbpf:-the-road-to-v1.0#btfh-apis Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210729162028.29512-3-quentin@isovalent.com
2021-07-29libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()Quentin Monnet
Variable "err" is initialised to -EINVAL so that this error code is returned when something goes wrong in libbpf_find_prog_btf_id(). However, a recent change in the function made use of the variable in such a way that it is set to 0 if retrieving linear information on the program is successful, and this 0 value remains if we error out on failures at later stages. Let's fix this by setting err to -EINVAL later in the function. Fixes: e9fc3ce99b34 ("libbpf: Streamline error reporting for high-level APIs") Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210729162028.29512-2-quentin@isovalent.com
2021-07-29bpf: Emit better log message if bpf_iter ctx arg btf_id == 0Yonghong Song
To avoid kernel build failure due to some missing .BTF-ids referenced functions/types, the patch ([1]) tries to fill btf_id 0 for these types. In bpf verifier, for percpu variable and helper returning btf_id cases, verifier already emitted proper warning with something like verbose(env, "Helper has invalid btf_id in R%d\n", regno); verbose(env, "invalid return type %d of func %s#%d\n", fn->ret_type, func_id_name(func_id), func_id); But this is not the case for bpf_iter context arguments. I hacked resolve_btfids to encode btf_id 0 for struct task_struct. With `./test_progs -n 7/5`, I got, 0: (79) r2 = *(u64 *)(r1 +0) func 'bpf_iter_task' arg0 has btf_id 29739 type STRUCT 'bpf_iter_meta' ; struct seq_file *seq = ctx->meta->seq; 1: (79) r6 = *(u64 *)(r2 +0) ; struct task_struct *task = ctx->task; 2: (79) r7 = *(u64 *)(r1 +8) ; if (task == (void *)0) { 3: (55) if r7 != 0x0 goto pc+11 ... ; BPF_SEQ_PRINTF(seq, "%8d %8d\n", task->tgid, task->pid); 26: (61) r1 = *(u32 *)(r7 +1372) Type '(anon)' is not a struct Basically, verifier will return btf_id 0 for task_struct. Later on, when the code tries to access task->tgid, the verifier correctly complains the type is '(anon)' and it is not a struct. Users still need to backtrace to find out what is going on. Let us catch the invalid btf_id 0 earlier and provide better message indicating btf_id is wrong. The new error message looks like below: R1 type=ctx expected=fp ; struct seq_file *seq = ctx->meta->seq; 0: (79) r2 = *(u64 *)(r1 +0) func 'bpf_iter_task' arg0 has btf_id 29739 type STRUCT 'bpf_iter_meta' ; struct seq_file *seq = ctx->meta->seq; 1: (79) r6 = *(u64 *)(r2 +0) ; struct task_struct *task = ctx->task; 2: (79) r7 = *(u64 *)(r1 +8) invalid btf_id for context argument offset 8 invalid bpf_context access off=8 size=8 [1] https://lore.kernel.org/bpf/20210727132532.2473636-1-hengqi.chen@gmail.com/ Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210728183025.1461750-1-yhs@fb.com
2021-07-29tools/resolve_btfids: Emit warnings and patch zero id for missing symbolsHengqi Chen
Kernel functions referenced by .BTF_ids may be changed from global to static and get inlined or get renamed/removed, and thus disappears from BTF. This causes kernel build failure when resolve_btfids do id patch for symbols in .BTF_ids in vmlinux. Update resolve_btfids to emit warning messages and patch zero id for missing symbols instead of aborting kernel build process. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210727132532.2473636-2-hengqi.chen@gmail.com
2021-07-29bcm63xx_enet: delete a redundant assignmentTang Bin
In the function bcm_enetsw_probe(), 'ret' will be assigned by bcm_enet_change_mtu(), so 'ret = 0' make no sense. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29net: dsa: don't set skb->offload_fwd_mark when not offloading the bridgeVladimir Oltean
DSA has gained the recent ability to deal gracefully with upper interfaces it cannot offload, such as the bridge, bonding or team drivers. When such uppers exist, the ports are still in standalone mode as far as the hardware is concerned. But when we deliver packets to the software bridge in order for that to do the forwarding, there is an unpleasant surprise in that the bridge will refuse to forward them. This is because we unconditionally set skb->offload_fwd_mark = true, meaning that the bridge thinks the frames were already forwarded in hardware by us. Since dp->bridge_dev is populated only when there is hardware offload for it, but not in the software fallback case, let's introduce a new helper that can be called from the tagger data path which sets the skb->offload_fwd_mark accordingly to zero when there is no hardware offload for bridging. This lets the bridge forward packets back to other interfaces of our switch, if needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29ipvlan: Add handling of NETDEV_UP eventsDi Zhu
When an ipvlan device is created on a bond device, the link state of the ipvlan device may be abnormal. This is because bonding device allows to add physical network card device in the down state and so NETDEV_CHANGE event will not be notified to other listeners, so ipvlan has no chance to update its link status. The following steps can cause such problems: 1) bond0 is down 2) ip link add link bond0 name ipvlan type ipvlan mode l2 3) echo +enp2s7 >/sys/class/net/bond0/bonding/slaves 4) ip link set bond0 up After these steps, use ip link command, we found ipvlan has NO-CARRIER: ipvlan@bond0: <NO-CARRIER, BROADCAST,MULTICAST,UP,M-DOWN> mtu ...> We can deal with this problem like VLAN: Add handling of NETDEV_UP events. If we receive NETDEV_UP event, we will update the link status of the ipvlan. Signed-off-by: Di Zhu <zhudi21@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29net/sched: store the last executed chain also for clsact egressDavide Caratti
currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain id' in the skb extension. However, userspace programs (like ovs) are able to setup egress rules, and datapath gets confused in case it doesn't find the 'chain id' for a packet that's "recirculated" by tc. Change tcf_classify() to have the same semantic as tcf_classify_ingress() so that a single function can be called in ingress / egress, using the tc ingress / egress block respectively. Suggested-by: Alaa Hleilel <alaa@nvidia.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29Merge branch 'dpaa2-switch-add-mirroring-support'David S. Miller
Ioana Ciornei says: ==================== dpaa2-switch: add mirroring support This patch set adds per port and per VLAN mirroring in dpaa2-switch. The first 4 patches are just cosmetic changes. We renamed the dpaa2_switch_acl_tbl structure into dpaa2_switch_filter_block so that we can reuse it for filters that do not use the ACL table and reorganized the addition of trap, redirect and drop filters into a separate function. All this just to make for a more streamlined addition of the support for mirroring. The next 4 patches are actually adding the advertised support. Mirroring rules can be added in shared blocks, the driver will replicate the same configuration on all the switch ports part of the same block. The last patch documents the feature, presents its behavior and limitations and gives a couple of examples. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29docs: networking: dpaa2: document mirroring support on the switchIoana Ciornei
Document the mirroring capabilities of the dpaa2-switch driver, any restrictions that are imposed and some example commands. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: offload shared block mirror filters when binding to a portIoana Ciornei
When mirroring rules are added in shared filter blocks, the same mirroring rule has to be configured on all the switch ports that are part of the same block. In case a switch port joins a shared block after mirroring filters have been already added to it, then all the mirror rules should be offloaded to the port. The reverse, removal of mirroring rules, has to be done at block unbind. For this purpose, the dpaa2_switch_block_offload_mirror() and dpaa2_switch_block_unoffload_mirror() functions are added and called upon binding and unbinding a switch port to/from a block. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: add VLAN based mirroringIoana Ciornei
Using the infrastructure added in the previous patch, extend tc-flower support with FLOW_ACTION_MIRRED based on VLAN. Tested with: tc qdisc add dev eth8 ingress_block 1 clsact tc filter add block 1 ingress protocol 802.1q flower skip_sw \ vlan_id 100 action mirred egress mirror dev eth6 tc filter del block 1 ingress pref 49152 Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: add support for port mirroringIoana Ciornei
Add support for per port mirroring for the DPAA2 switch. We support only single mirror port, therefore we allow mirroring rules only as long as the destination port is always the same. Unlike all the actions (drop, redirect, trap) already supported by the dpaa2-switch driver, adding mirroring filters in shared blocks is not achieved by a singular ACL entry added in a table shared by the ports. This is why, when a new mirror filter is added in a block we have to got through all the switch ports sharing it and configure the filter individually on all. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: add API for setting up mirroringIoana Ciornei
Add the necessary MC API for setting up and configuring the mirroring feature on the DPSW DPAA2 object. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: reorganize dpaa2_switch_cls_matchall_replaceIoana Ciornei
Extract the necessary steps to offload a filter by using the ACL table in a separate function - dpaa2_switch_cls_matchall_replace_acl(). This is intended to help with the code readability when the mirroring support is added. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: reorganize dpaa2_switch_cls_flower_replaceIoana Ciornei
Extract the necessary steps to offload a filter by using the ACL table in a separate function - dpaa2_switch_cls_flower_replace_acl(). This is intended to help with the code readability when the mirroring support is added. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: rename dpaa2_switch_acl_tbl into filter_blockIoana Ciornei
Until now, shared filter blocks were implemented only by ACL tables shared between ports. Going forward, when the mirroring support will be added, this will not be true anymore. Rename the dpaa2_switch_acl_tbl into dpaa2_switch_filter_block so that we make it clear that the structure is used not only for filters that use the ACL table but will be used for all the filters that are added in a block. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29dpaa2-switch: rename dpaa2_switch_tc_parse_action to specify the ACLIoana Ciornei
Until now, the dpaa2_switch_tc_parse_action() function was used for all the supported tc actions since all of them were implemented by adding ACL table entries. In the next commits, the dpaa2-switch driver will gain mirroring support which is not using the same HW feature. Make sure that we specify the ACL in the function name so that we make it clear that it's only used for specific actions. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29drm/kmb: Define driver date and major/minor versionEdmund Dea
Added macros for date and version Fixes: 7f7b96a8a0a1 ("drm/kmb: Add support for KeemBay Display") Signed-off-by: Edmund Dea <edmund.j.dea@intel.com> Signed-off-by: Anitha Chrisanthus <anitha.chrisanthus@intel.com> Acked-by: Sam Ravnborg <sam@ravnborg.org> Link: https://patchwork.freedesktop.org/patch/msgid/20210728003126.1425028-2-anitha.chrisanthus@intel.com
2021-07-29drm/kmb: Enable LCD DMA for low TVDDCVEdmund Dea
There's an undocumented dependency between LCD layer enable bits [2-5] and the AXI pipelined read enable bit [28] in the LCD_CONTROL register. The proper order of operation is: 1) Clear AXI pipelined read enable bit 2) Set LCD layers 3) Set AXI pipelined read enable bit With this update, LCD can start DMA when TVDDCV is reduced down to 700mV. Fixes: 7f7b96a8a0a1 ("drm/kmb: Add support for KeemBay Display") Signed-off-by: Edmund Dea <edmund.j.dea@intel.com> Signed-off-by: Anitha Chrisanthus <anitha.chrisanthus@intel.com> Acked-by: Sam Ravnborg <sam@ravnborg.org> Link: https://patchwork.freedesktop.org/patch/msgid/20210728003126.1425028-1-anitha.chrisanthus@intel.com
2021-07-29scsi: fas216: Fix fall-through warning for ClangGustavo A. R. Silva
Fix the following fallthrough warning (on ARM): drivers/scsi/arm/fas216.c:1379:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] default: ^ drivers/scsi/arm/fas216.c:1379:2: note: insert 'break;' to avoid fall-through default: ^ break; Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/202107260355.bF00i5bi-lkp@intel.com/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-07-29scsi: acornscsi: Fix fall-through warning for clangGustavo A. R. Silva
Fix the following fallthrough warning (on ARM): drivers/scsi/arm/acornscsi.c:2651:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] case res_success: ^ drivers/scsi/arm/acornscsi.c:2651:2: note: insert '__attribute__((fallthrough));' to silence this warning case res_success: ^ __attribute__((fallthrough)); drivers/scsi/arm/acornscsi.c:2651:2: note: insert 'break;' to avoid fall-through case res_success: ^ break; Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/202107260355.bF00i5bi-lkp@intel.com/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-07-29ASoC: cs42l42: Fix bclk calculation for monoRichard Fitzgerald
An I2S frame always has a left and right channel slot even if mono data is being sent. So if channels==1 the actual bitclock frequency is 2 * snd_soc_params_to_bclk(params). Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Fixes: 2cdba9b045c7 ("ASoC: cs42l42: Use bclk from hw_params if set_sysclk was not called") Link: https://lore.kernel.org/r/20210729170929.6589-3-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-07-29ASoC: cs42l42: Don't allow SND_SOC_DAIFMT_LEFT_JRichard Fitzgerald
The driver has no support for left-justified protocol so it should not have been allowing this to be passed to cs42l42_set_dai_fmt(). Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Fixes: 2c394ca79604 ("ASoC: Add support for CS42L42 codec") Link: https://lore.kernel.org/r/20210729170929.6589-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-07-29ASoC: cs42l42: Correct definition of ADC Volume controlRichard Fitzgerald
The ADC volume is a signed 8-bit number with range -97 to +12, with -97 being mute. Use a SOC_SINGLE_S8_TLV() to define this and fix the DECLARE_TLV_DB_SCALE() to have the correct start and mute flag. Fixes: 2c394ca79604 ("ASoC: Add support for CS42L42 codec") Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://lore.kernel.org/r/20210729170929.6589-1-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-07-29ARM: riscpc: Fix fall-through warning for ClangGustavo A. R. Silva
Fix the following fallthrough warning: arch/arm/mach-rpc/riscpc.c:52:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] default: ^ arch/arm/mach-rpc/riscpc.c:52:2: note: insert 'break;' to avoid fall-through default: ^ break; Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/202107260355.bF00i5bi-lkp@intel.com/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-07-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix MTE shared page detection - Enable selftest's use of PMU registers when asked to s390: - restore 5.13 debugfs names x86: - fix sizes for vcpu-id indexed arrays - fixes for AMD virtualized LAPIC (AVIC) - other small bugfixes Generic: - access tracking performance test - dirty_log_perf_test command line parsing fix - Fix selftest use of obsolete pthread_yield() in favour of sched_yield() - use cpu_relax when halt polling - fixed missing KVM_CLEAR_DIRTY_LOG compat ioctl" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: add missing compat KVM_CLEAR_DIRTY_LOG KVM: use cpu_relax when halt polling KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl KVM: SVM: tweak warning about enabled AVIC on nested entry KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated KVM: s390: restore old debugfs names KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized KVM: selftests: Introduce access_tracking_perf_test KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing x86/kvm: fix vcpu-id indexed array sizes KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access docs: virt: kvm: api.rst: replace some characters KVM: Documentation: Fix KVM_CAP_ENFORCE_PV_FEATURE_CPUID name KVM: nSVM: Swap the parameter order for svm_copy_vmrun_state()/svm_copy_vmloadsave_state() KVM: nSVM: Rename nested_svm_vmloadsave() to svm_copy_vmloadsave_state() KVM: arm64: selftests: get-reg-list: actually enable pmu regs in pmu sublist KVM: selftests: change pthread_yield to sched_yield KVM: arm64: Fix detection of shared VMAs on guest fault
2021-07-29Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu Pull m68knommu fix from Greg Ungerer: "A single compile time fix" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: m68k/coldfire: change pll var. to clk_pll
2021-07-29xfs: prevent spoofing of rtbitmap blocks when recovering buffersDarrick J. Wong
While reviewing the buffer item recovery code, the thought occurred to me: in V5 filesystems we use log sequence number (LSN) tracking to avoid replaying older metadata updates against newer log items. However, we use the magic number of the ondisk buffer to find the LSN of the ondisk metadata, which means that if an attacker can control the layout of the realtime device precisely enough that the start of an rt bitmap block matches the magic and UUID of some other kind of block, they can control the purported LSN of that spoofed block and thereby break log replay. Since realtime bitmap and summary blocks don't have headers at all, we have no way to tell if a block really should be replayed. The best we can do is replay unconditionally and hope for the best. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-07-29xfs: limit iclog tail updatesDave Chinner
From the department of "generic/482 keeps on giving", we bring you another tail update race condition: iclog: S1 C1 +-----------------------+-----------------------+ S2 EOIC Two checkpoints in a single iclog. One is complete, the other just contains the start record and overruns into a new iclog. Timeline: Before S1: Cache flush, log tail = X At S1: Metadata stable, write start record and checkpoint At C1: Write commit record, set NEED_FUA Single iclog checkpoint, so no need for NEED_FLUSH Log tail still = X, so no need for NEED_FLUSH After C1, Before S2: Cache flush, log tail = X At S2: Metadata stable, write start record and checkpoint After S2: Log tail moves to X+1 At EOIC: End of iclog, more journal data to write Releases iclog Not a commit iclog, so no need for NEED_FLUSH Writes log tail X+1 into iclog. At this point, the iclog has tail X+1 and NEED_FUA set. There has been no cache flush for the metadata between X and X+1, and the iclog writes the new tail permanently to the log. THis is sufficient to violate on disk metadata/journal ordering. We have two options here. The first is to detect this case in some manner and ensure that the partial checkpoint write sets NEED_FLUSH when the iclog is already marked NEED_FUA and the log tail changes. This seems somewhat fragile and quite complex to get right, and it doesn't actually make it obvious what underlying problem it is actually addressing from reading the code. The second option seems much cleaner to me, because it is derived directly from the requirements of the C1 commit record in the iclog. That is, when we write this commit record to the iclog, we've guaranteed that the metadata/data ordering is correct for tail update purposes. Hence if we only write the log tail into the iclog for the *first* commit record rather than the log tail at the last release, we guarantee that the log tail does not move past where the the first commit record in the log expects it to be. IOWs, taking the first option means that replay of C1 becomes dependent on future operations doing the right thing, not just the C1 checkpoint itself doing the right thing. This makes log recovery almost impossible to reason about because now we have to take into account what might or might not have happened in the future when looking at checkpoints in the log rather than just having to reconstruct the past... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: need to see iclog flags in tracingDave Chinner
Because I cannot tell if the NEED_FLUSH flag is being set correctly by the log force and CIL push machinery without it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: Enforce attr3 buffer recovery orderDave Chinner
From the department of "WTAF? How did we miss that!?"... When we are recovering a buffer, the first thing we do is check the buffer magic number and extract the LSN from the buffer. If the LSN is older than the current LSN, we replay the modification to it. If the metadata on disk is newer than the transaction in the log, we skip it. This is a fundamental v5 filesystem metadata recovery behaviour. generic/482 failed with an attribute writeback failure during log recovery. The write verifier caught the corruption before it got written to disk, and the attr buffer dump looked like: XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8 XFS (dm-3): Unmount and run xfs_repair XFS (dm-3): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1 ........;...M*.. 00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38 ...............8 ^^^^^^^^^^^^^^^^^^^^^^^ 00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17 .9^QX.D.....D... 00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00 .............$.. 00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00 .h.............. 00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00 ..<1.$....<2.... 00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00 ..<3............ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ ..... The highlighted bytes are the LSN that was replayed into the buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay, that block on disk looks like this: $ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol hdr.info.hdr.forw = 0 hdr.info.hdr.back = 0 hdr.info.hdr.magic = 0x3bee hdr.info.crc = 0xb5af0bc6 (correct) hdr.info.bno = 105448 hdr.info.lsn = 0x100000900 ^^^^^^^^^^^ hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17 hdr.info.owner = 131203 hdr.count = 2 hdr.usedbytes = 120 hdr.firstused = 3796 hdr.holes = 1 hdr.freemap[0-2] = [base,size] Note the LSN stamped into the buffer on disk: 1/0x900. The version on disk is much newer than the log transaction that was being replayed. That's a bug, and should -never- happen. So I immediately went to look at xlog_recover_get_buf_lsn() to check that we handled the LSN correctly. I was wondering if there was a similar "two commits with the same start LSN skips the second replay" problem with buffers. I didn't get that far, because I found a much more basic, rudimentary bug: xlog_recover_get_buf_lsn() doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!! IOWs, attr3 leaf buffers fall through the magic number checks unrecognised, so trigger the "recover immediately" behaviour instead of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf buffers and that causes silent on disk corruption of inode attribute forks and potentially other things.... Git history shows this is *another* zero day bug, this time introduced in commit 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery") which failed to handle the attr3 leaf buffers in recovery. And we've failed to handle them ever since... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: logging the on disk inode LSN can make it go backwardsDave Chinner
When we log an inode, we format the "log inode" core and set an LSN in that inode core. We do that via xfs_inode_item_format_core(), which calls: xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn); to format the log inode. It writes the LSN from the inode item into the log inode, and if recovery decides the inode item needs to be replayed, it recovers the log inode LSN field and writes it into the on disk inode LSN field. Now this might seem like a reasonable thing to do, but it is wrong on multiple levels. Firstly, if the item is not yet in the AIL, item->li_lsn is zero. i.e. the first time the inode it is logged and formatted, the LSN we write into the log inode will be zero. If we only log it once, recovery will run and can write this zero LSN into the inode. This means that the next time the inode is logged and log recovery runs, it will *always* replay changes to the inode regardless of whether the inode is newer on disk than the version in the log and that violates the entire purpose of recording the LSN in the inode at writeback time (i.e. to stop it going backwards in time on disk during recovery). Secondly, if we commit the CIL to the journal so the inode item moves to the AIL, and then relog the inode, the LSN that gets stamped into the log inode will be the LSN of the inode's current location in the AIL, not it's age on disk. And it's not the LSN that will be associated with the current change. That means when log recovery replays this inode item, the LSN that ends up on disk is the LSN for the previous changes in the log, not the current changes being replayed. IOWs, after recovery the LSN on disk is not in sync with the LSN of the modifications that were replayed into the inode. This, again, violates the recovery ordering semantics that on-disk writeback LSNs provide. Hence the inode LSN in the log dinode is -always- invalid. Thirdly, recovery actually has the LSN of the log transaction it is replaying right at hand - it uses it to determine if it should replay the inode by comparing it to the on-disk inode's LSN. But it doesn't use that LSN to stamp the LSN into the inode which will be written back when the transaction is fully replayed. It uses the one in the log dinode, which we know is always going to be incorrect. Looking back at the change history, the inode logging was broken by commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way back in 2016 by a stupid idiot who thought he knew how this code worked. i.e. me. That commit replaced an in memory di_lsn field that was updated only at inode writeback time from the inode item.li_lsn value - and hence always contained the same LSN that appeared in the on-disk inode - with a read of the inode item LSN at inode format time. CLearly these are not the same thing. Before 93f958f9c41f, the log recovery behaviour was irrelevant, because the LSN in the log inode always matched the on-disk LSN at the time the inode was logged, hence recovery of the transaction would never make the on-disk LSN in the inode go backwards or get out of sync. A symptom of the problem is this, caught from a failure of generic/482. Before log recovery, the inode has been allocated but never used: xfs_db> inode 393388 xfs_db> p core.magic = 0x494e core.mode = 0 .... v3.crc = 0x99126961 (correct) v3.change_count = 0 v3.lsn = 0 v3.flags2 = 0 v3.cowextsize = 0 v3.crtime.sec = Thu Jan 1 10:00:00 1970 v3.crtime.nsec = 0 After log recovery: xfs_db> p core.magic = 0x494e core.mode = 020444 .... v3.crc = 0x23e68f23 (correct) v3.change_count = 2 v3.lsn = 0 v3.flags2 = 0 v3.cowextsize = 0 v3.crtime.sec = Thu Jul 22 17:03:03 2021 v3.crtime.nsec = 751000000 ... You can see that the LSN of the on-disk inode is 0, even though it clearly has been written to disk. I point out this inode, because the generic/482 failure occurred because several adjacent inodes in this specific inode cluster were not replayed correctly and still appeared to be zero on disk when all the other metadata (inobt, finobt, directories, etc) indicated they should be allocated and written back. The fix for this is two-fold. The first is that we need to either revert the LSN changes in 93f958f9c41f or stop logging the inode LSN altogether. If we do the former, log recovery does not need to change but we add 8 bytes of memory per inode to store what is largely a write-only inode field. If we do the latter, log recovery needs to stamp the on-disk inode in the same manner that inode writeback does. I prefer the latter, because we shouldn't really be trying to log and replay changes to the on disk LSN as the on-disk value is the canonical source of the on-disk version of the inode. It also matches the way we recover buffer items - we create a buf_log_item that carries the current recovery transaction LSN that gets stamped into the buffer by the write verifier when it gets written back when the transaction is fully recovered. However, this might break log recovery on older kernels even more, so I'm going to simply ignore the logged value in recovery and stamp the on-disk inode with the LSN of the transaction being recovered that will trigger writeback on transaction recovery completion. This will ensure that the on-disk inode LSN always reflects the LSN of the last change that was written to disk, regardless of whether it comes from log recovery or runtime writeback. Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: avoid unnecessary waits in xfs_log_force_lsn()Dave Chinner
Before waiting on a iclog in xfs_log_force_lsn(), we don't check to see if the iclog has already been completed and the contents on stable storage. We check for completed iclogs in xfs_log_force(), so we should do the same thing for xfs_log_force_lsn(). This fixed some random up-to-30s pauses seen in unmounting filesystems in some tests. A log force ends up waiting on completed iclog, and that doesn't then get flushed (and hence the log force get completed) until the background log worker issues a log force that flushes the iclog in question. Then the unmount unblocks and continues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: log forces imply data device cache flushesDave Chinner
After fixing the tail_lsn vs cache flush race, generic/482 continued to fail in a similar way where cache flushes were missing before iclog FUA writes. Tracing of iclog state changes during the fsstress workload portion of the test (via xlog_iclog* events) indicated that iclog writes were coming from two sources - CIL pushes and log forces (due to fsync/O_SYNC operations). All of the cases where a recovery problem was triggered indicated that the log force was the source of the iclog write that was not preceeded by a cache flush. This was an oversight in the modifications made in commit eef983ffeae7 ("xfs: journal IO cache flush reductions"). Log forces for fsync imply a data device cache flush has been issued if an iclog was flushed to disk and is indicated to the caller via the log_flushed parameter so they can elide the device cache flush if the journal issued one. The change in eef983ffeae7 results in iclogs only issuing a cache flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not added to the iclogs that the log force code flushes to disk. Hence log forces are no longer guaranteeing that a cache flush is issued, hence opening up a potential on-disk ordering failure. Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that the actual iclogs it forces to the journal are also on stable storage before it returns to the caller. This patch introduces the xlog_force_iclog() helper function to encapsulate the process of taking a reference to an iclog, switching its state if WANT_SYNC and flushing it to stable storage correctly. Both xfs_log_force() and xfs_log_force_lsn() are converted to use it, as is xlog_unmount_write() which has an elaborate method of doing exactly the same "write this iclog to stable storage" operation. Further, if the log force code needs to wait on a iclog in the WANT_SYNC state, it needs to ensure that iclog also results in a cache flush being issued. This covers the case where the iclog contains the commit record of the CIL flush that the log force triggered, but it hasn't been written yet because there is still an active reference to the iclog. Note: this whole cache flush whack-a-mole patch is a result of log forces still being iclog state centric rather than being CIL sequence centric. Most of this nasty code will go away in future when log forces are converted to wait on CIL sequence push completion rather than iclog completion. With the CIL push algorithm guaranteeing that the CIL checkpoint is fully on stable storage when it completes, we no longer need to iterate iclogs and push them to ensure a CIL sequence push has completed and so all this nasty iclog iteration and flushing code will go away. Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: factor out forced iclog flushesDave Chinner
We force iclogs in several places - we need them all to have the same cache flush semantics, so start by factoring out the iclog force into a common helper. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: fix ordering violation between cache flushes and tail updatesDave Chinner
There is a race between the new CIL async data device metadata IO completion cache flush and the log tail in the iclog the flush covers being updated. This can be seen by repeating generic/482 in a loop and eventually log recovery fails with a failures such as this: XFS (dm-3): Starting recovery (logdev: internal) XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0) XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify XFS (dm-3): Unmount and run xfs_repair XFS (dm-3): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117 Analysis of the logwrite replay shows that there were no writes to the data device between the FUA @ write 124 and the FUA at write @ 125, but log recovery @ 125 failed. The difference was the one log write @ 125 moved the tail of the log forwards from (1,8) to (1,32) and so the inode create intent in (1,8) was not replayed and so the inode cluster was zero on disk when replay of the first inode item in (1,32) was attempted. What this meant was that the journal write that occurred at @ 125 did not ensure that metadata completed before the iclog was written was correctly on stable storage. The tail of the log moved forward, so IO must have been completed between the two iclog writes. This means that there is a race condition between the unconditional async cache flush in the CIL push work and the tail LSN that is written to the iclog. This happens like so: CIL push work AIL push work ------------- ------------- Add to committing list start async data dev cache flush ..... <flush completes> <all writes to old tail lsn are stable> xlog_write .... push inode create buffer <start IO> ..... xlog_write(commit record) .... <IO completes> log tail moves xlog_assign_tail_lsn() start_lsn == commit_lsn <no iclog preflush!> xlog_state_release_iclog __xlog_state_release_iclog() <writes *new* tail_lsn into iclog> xlog_sync() .... submit_bio() <tail in log moves forward without flushing written metadata> Essentially, this can only occur if the commit iclog is issued without a cache flush. If the iclog bio is submitted with REQ_PREFLUSH, then it will guarantee that all the completed IO is one stable storage before the iclog bio with the new tail LSN in it is written to the log. IOWs, the tail lsn that is written to the iclog needs to be sampled *before* we issue the cache flush that guarantees all IO up to that LSN has been completed. To fix this without giving up the performance advantage of the flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1 compared to 5.13), we need to ensure that we always issue a cache flush if the tail LSN changes between the initial async flush and the commit record being written. THis requires sampling the tail_lsn before we start the flush, and then passing the sampled tail LSN to xlog_state_release_iclog() so it can determine if the the tail LSN has changed while writing the checkpoint. If the tail LSN has changed, then it needs to set the NEED_FLUSH flag on the iclog and we'll issue another cache flush before writing the iclog. Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: fold __xlog_state_release_iclog into xlog_state_release_iclogDave Chinner
Fold __xlog_state_release_iclog into its only caller to prepare make an upcoming fix easier. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: external logs need to flush data deviceDave Chinner
The recent journal flush/FUA changes replaced the flushing of the data device on every iclog write with an up-front async data device cache flush. Unfortunately, the assumption of which this was based on has been proven incorrect by the flush vs log tail update ordering issue. As the fix for that issue uses the XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache flush, we now need to (once again) ensure that an iclog write to external logs that need a cache flush to be issued actually issue a cache flush to the data device as well as the log device. Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: flush data dev on external log writeDave Chinner
We incorrectly flush the log device instead of the data device when trying to ensure metadata is correctly on disk before writing the unmount record. Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29firmware_loader: fix use-after-free in firmware_fallback_sysfsAnirudh Rayabharam
This use-after-free happens when a fw_priv object has been freed but hasn't been removed from the pending list (pending_fw_head). The next time fw_load_sysfs_fallback tries to insert into the list, it ends up accessing the pending_list member of the previously freed fw_priv. The root cause here is that all code paths that abort the fw load don't delete it from the pending list. For example: _request_firmware() -> fw_abort_batch_reqs() -> fw_state_aborted() To fix this, delete the fw_priv from the list in __fw_set_state() if the new state is DONE or ABORTED. This way, all aborts will remove the fw_priv from the list. Accordingly, remove calls to list_del_init that were being made before calling fw_state_(aborted|done). Also, in fw_load_sysfs_fallback, don't add the fw_priv to the pending list if it is already aborted. Instead, just jump out and return early. Fixes: bcfbd3523f3c ("firmware: fix a double abort case with fw_load_sysfs_fallback") Cc: stable <stable@vger.kernel.org> Reported-by: syzbot+de271708674e2093097b@syzkaller.appspotmail.com Tested-by: syzbot+de271708674e2093097b@syzkaller.appspotmail.com Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Anirudh Rayabharam <mail@anirudhrb.com> Link: https://lore.kernel.org/r/20210728085107.4141-3-mail@anirudhrb.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>