summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2016-09-21tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88Neal Cardwell
The TCP CUBIC module already uses 64 bytes. The upcoming TCP BBR module uses 88 bytes. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: new CC hook to set sending rate with rate_sample in any CA stateYuchung Cheng
This commit introduces an optional new "omnipotent" hook, cong_control(), for congestion control modules. The cong_control() function is called at the end of processing an ACK (i.e., after updating sequence numbers, the SACK scoreboard, and loss detection). At that moment we have precise delivery rate information the congestion control module can use to control the sending behavior (using cwnd, TSO skb size, and pacing rate) in any CA state. This function can also be used by a congestion control that prefers not to use the default cwnd reduction approach (i.e., the PRR algorithm) during CA_Recovery to control the cwnd and sending rate during loss recovery. We take advantage of the fact that recent changes defer the retransmission or transmission of new data (e.g. by F-RTO) in recovery until the new tcp_cong_control() function is run. With this commit, we only run tcp_update_pacing_rate() if the congestion control is not using this new API. New congestion controls which use the new API do not want the TCP stack to run the default pacing rate calculation and overwrite whatever pacing rate they have chosen at initialization time. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: allow congestion control to expand send buffer differentlyYuchung Cheng
Currently the TCP send buffer expands to twice cwnd, in order to allow limited transmits in the CA_Recovery state. This assumes that cwnd does not increase in the CA_Recovery. For some congestion control algorithms, like the upcoming BBR module, if the losses in recovery do not indicate congestion then we may continue to raise cwnd multiplicatively in recovery. In such cases the current multiplier will falsely limit the sending rate, much as if it were limited by the application. This commit adds an optional congestion control callback to use a different multiplier to expand the TCP send buffer. For congestion control modules that do not specificy this callback, TCP continues to use the previous default of 2. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segmentsNeal Cardwell
To allow congestion control modules to use the default TSO auto-sizing algorithm as one of the ingredients in their own decision about TSO sizing: 1) Export tcp_tso_autosize() so that CC modules can use it. 2) Change tcp_tso_autosize() to allow callers to specify a minimum number of segments per TSO skb, in case the congestion control module has a different notion of the best floor for TSO skbs for the connection right now. For very low-rate paths or policed connections it can be appropriate to use smaller TSO skbs. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: allow congestion control module to request TSO skb segment countNeal Cardwell
Add the tso_segs_goal() function in tcp_congestion_ops to allow the congestion control module to specify the number of segments that should be in a TSO skb sent by tcp_write_xmit() and tcp_xmit_retransmit_queue(). The congestion control module can either request a particular number of segments in TSO skb that we transmit, or return 0 if it doesn't care. This allows the upcoming BBR congestion control module to select small TSO skb sizes if the module detects that the bottleneck bandwidth is very low, or that the connection is policed to a low rate. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: export data delivery rateYuchung Cheng
This commit export two new fields in struct tcp_info: tcpi_delivery_rate: The most recent goodput, as measured by tcp_rate_gen(). If the socket is limited by the sending application (e.g., no data to send), it reports the highest measurement instead of the most recent. The unit is bytes per second (like other rate fields in tcp_info). tcpi_delivery_rate_app_limited: A boolean indicating if the goodput was measured when the socket's throughput was limited by the sending application. This delivery rate information can be useful for applications that want to know the current throughput the TCP connection is seeing, e.g. adaptive bitrate video streaming. It can also be very useful for debugging or troubleshooting. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: track application-limited rate samplesSoheil Hassas Yeganeh
This commit adds code to track whether the delivery rate represented by each rate_sample was limited by the application. Upon each transmit, we store in the is_app_limited field in the skb a boolean bit indicating whether there is a known "bubble in the pipe": a point in the rate sample interval where the sender was application-limited, and did not transmit even though the cwnd and pacing rate allowed it. This logic marks the flow app-limited on a write if *all* of the following are true: 1) There is less than 1 MSS of unsent data in the write queue available to transmit. 2) There is no packet in the sender's queues (e.g. in fq or the NIC tx queue). 3) The connection is not limited by cwnd. 4) There are no lost packets to retransmit. The tcp_rate_check_app_limited() code in tcp_rate.c determines whether the connection is application-limited at the moment. If the flow is application-limited, it sets the tp->app_limited field. If the flow is application-limited then that means there is effectively a "bubble" of silence in the pipe now, and this silence will be reflected in a lower bandwidth sample for any rate samples from now until we get an ACK indicating this bubble has exited the pipe: specifically, until we get an ACK for the next packet we transmit. When we send every skb we record in scb->tx.is_app_limited whether the resulting rate sample will be application-limited. The code in tcp_rate_gen() checks to see when it is safe to mark all known application-limited bubbles of silence as having exited the pipe. It does this by checking to see when the delivered count moves past the tp->app_limited marker. At this point it zeroes the tp->app_limited marker, as all known bubbles are out of the pipe. We make room for the tx.is_app_limited bit in the skb by borrowing a bit from the in_flight field used by NV to record the number of bytes in flight. The receive window in the TCP header is 16 bits, and the max receive window scaling shift factor is 14 (RFC 1323). So the max receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we only need 30 bits for the tx.in_flight used by NV. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: track data delivery rate for a TCP connectionYuchung Cheng
This patch generates data delivery rate (throughput) samples on a per-ACK basis. These rate samples can be used by congestion control modules, and specifically will be used by TCP BBR in later patches in this series. Key state: tp->delivered: Tracks the total number of data packets (original or not) delivered so far. This is an already-existing field. tp->delivered_mstamp: the last time tp->delivered was updated. Algorithm: A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis: d1: the current tp->delivered after processing the ACK t1: the current time after processing the ACK d0: the prior tp->delivered when the acked skb was transmitted t0: the prior tp->delivered_mstamp when the acked skb was transmitted When an skb is transmitted, we snapshot d0 and t0 in its control block in tcp_rate_skb_sent(). When an ACK arrives, it may SACK and ACK some skbs. For each SACKed or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct to reflect the latest (d0, t0). Finally, tcp_rate_gen() generates a rate sample by storing (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us. One caveat: if an skb was sent with no packets in flight, then tp->delivered_mstamp may be either invalid (if the connection is starting) or outdated (if the connection was idle). In that case, we'll re-stamp tp->delivered_mstamp. At first glance it seems t0 should always be the time when an skb was transmitted, but actually this could over-estimate the rate due to phase mismatch between transmit and ACK events. To track the delivery rate, we ensure that if packets are in flight then t0 and and t1 are times at which packets were marked delivered. If the initial and final RTTs are different then one may be corrupted by some sort of noise. The noise we see most often is sending gaps caused by delayed, compressed, or stretched acks. This either affects both RTTs equally or artificially reduces the final RTT. We approach this by recording the info we need to compute the initial RTT (duration of the "send phase" of the window) when we recorded the associated inflight. Then, for a filter to avoid bandwidth overestimates, we generalize the per-sample bandwidth computation from: bw = delivered / ack_phase_rtt to the following: bw = delivered / max(send_phase_rtt, ack_phase_rtt) In large-scale experiments, this filtering approach incorporating send_phase_rtt is effective at avoiding bandwidth overestimates due to ACK compression or stretched ACKs. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: count packets marked lost for a TCP connectionNeal Cardwell
Count the number of packets that a TCP connection marks lost. Congestion control modules can use this loss rate information for more intelligent decisions about how fast to send. Specifically, this is used in TCP BBR policer detection. BBR uses a high packet loss rate as one signal in its policer detection and policer bandwidth estimation algorithm. The BBR policer detection algorithm cannot simply track retransmits, because a retransmit can be (and often is) an indicator of packets lost long, long ago. This is particularly true in a long CA_Loss period that repairs the initial massive losses when a policer kicks in. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net_sched: sch_fq: add low_rate_threshold parameterEric Dumazet
This commit adds to the fq module a low_rate_threshold parameter to insert a delay after all packets if the socket requests a pacing rate below the threshold. This helps achieve more precise control of the sending rate with low-rate paths, especially policers. The basic issue is that if a congestion control module detects a policer at a certain rate, it may want fq to be able to shape to that policed rate. That way the sender can avoid policer drops by having the packets arrive at the policer at or just under the policed rate. The default threshold of 550Kbps was chosen analytically so that for policers or links at 500Kbps or 512Kbps fq would very likely invoke this mechanism, even if the pacing rate was briefly slightly above the available bandwidth. This value was then empirically validated with two years of production testing on YouTube video servers. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21tcp: use windowed min filter library for TCP min_rtt estimationNeal Cardwell
Refactor the TCP min_rtt code to reuse the new win_minmax library in lib/win_minmax.c to simplify the TCP code. This is a pure refactor: the functionality is exactly the same. We just moved the windowed min code to make TCP easier to read and maintain, and to allow other parts of the kernel to use the windowed min/max filter code. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21lib/win_minmax: windowed min or max estimatorNeal Cardwell
This commit introduces a generic library to estimate either the min or max value of a time-varying variable over a recent time window. This is code originally from Kathleen Nichols. The current form of the code is from Van Jacobson. A single struct minmax_sample will track the estimated windowed-max value of the series if you call minmax_running_max() or the estimated windowed-min value of the series if you call minmax_running_min(). Nearly equivalent code is already in place for minimum RTT estimation in the TCP stack. This commit extracts that code and generalizes it to handle both min and max. Moving the code here reduces the footprint and complexity of the TCP code base and makes the filter generally available for other parts of the codebase, including an upcoming TCP congestion control module. This library works well for time series where the measurements are smoothly increasing or decreasing. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20bpf: direct packet write and access for helpers for clsact progsDaniel Borkmann
This work implements direct packet access for helpers and direct packet write in a similar fashion as already available for XDP types via commits 4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and 6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a complementary feature to the already available direct packet read for tc (cls/act) programs. For enabling this, we need to introduce two helpers, bpf_skb_pull_data() and bpf_csum_update(). The first is generally needed for both, read and write, because they would otherwise only be limited to the current linear skb head. Usually, when the data_end test fails, programs just bail out, or, in the direct read case, use bpf_skb_load_bytes() as an alternative to overcome this limitation. If such data sits in non-linear parts, we can just pull them in once with the new helper, retest and eventually access them. At the same time, this also makes sure the skb is uncloned, which is, of course, a necessary condition for direct write. As this needs to be an invariant for the write part only, the verifier detects writes and adds a prologue that is calling bpf_skb_pull_data() to effectively unclone the skb from the very beginning in case it is indeed cloned. The heuristic makes use of a similar trick that was done in 233577a22089 ("net: filter: constify detection of pkt_type_offset"). This comes at zero cost for other programs that do not use the direct write feature. Should a program use this feature only sparsely and has read access for the most parts with, for example, drop return codes, then such write action can be delegated to a tail called program for mitigating this cost of potential uncloning to a late point in time where it would have been paid similarly with the bpf_skb_store_bytes() as well. Advantage of direct write is that the writes are inlined whereas the helper cannot make any length assumptions and thus needs to generate a call to memcpy() also for small sizes, as well as cost of helper call itself with sanity checks are avoided. Plus, when direct read is already used, we don't need to cache or perform rechecks on the data boundaries (due to verifier invalidating previous checks for helpers that change skb->data), so more complex programs using rewrites can benefit from switching to direct read plus write. For direct packet access to helpers, we save the otherwise needed copy into a temp struct sitting on stack memory when use-case allows. Both facilities are enabled via may_access_direct_pkt_data() in verifier. For now, we limit this to map helpers and csum_diff, and can successively enable other helpers where we find it makes sense. Helpers that definitely cannot be allowed for this are those part of bpf_helper_changes_skb_data() since they can change underlying data, and those that write into memory as this could happen for packet typed args when still cloned. bpf_csum_update() helper accommodates for the fact that we need to fixup checksum_complete when using direct write instead of bpf_skb_store_bytes(), meaning the programs can use available helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(), csum_block_add(), csum_block_sub() equivalents in eBPF together with the new helper. A usage example will be provided for iproute2's examples/bpf/ directory. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-09-19 Here's the main bluetooth-next pull request for the 4.9 kernel. - Added new messages for monitor sockets for better mgmt tracing - Added local name and appearance support in scan response - Added new Qualcomm WCNSS SMD based HCI driver - Minor fixes & cleanup to 802.15.4 code - New USB ID to btusb driver - Added Marvell support to HCI UART driver - Add combined LED trigger for controller power - Other minor fixes here and there Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21Merge branch 'stable-4.9' of git://git.infradead.org/users/pcmoore/selinux ↵James Morris
into next
2016-09-21power: supply: bq27xxx_battery: allow kernel poll_interval parameter runtime ↵Matt Ranostay
update Fix issue with poll_interval being not updated till the previous interval expired. Cc: Tony Lindgren <tony@atomide.com> Cc: Liam Breck <liam@networkimprov.net> Signed-off-by: Matt Ranostay <matt@ranostay.consulting> Signed-off-by: Sebastian Reichel <sre@kernel.org>
2016-09-21power: supply: sbs-battery: Cleanup removal of chip->pdataPhil Reid
There where still a few lingering references to pdata after commit power: supply: sbs-battery: simplify DT parsing. Remove pdata from struct·sbs_info and conditional checks to ser if this was set from the i2c read / write functions. Instead of call max in each function for incrementing poll_retry_count do it once in the probe function. Fixup null pointer dereference in to pdata in sbs_external_power_changed. Change retry counts to u32 to avoid need for max. Signed-off-by: Phil Reid <preid@electromag.com.au> Signed-off-by: Sebastian Reichel <sre@kernel.org>
2016-09-20clk: imx6: fix i.MX6DL clock tree to reflect realityLucas Stach
The current clock tree only implements the minimal set of differences between the i.MX6Q and the i.MX6DL, but that doesn't really reflect reality. Apply the following fixes to match the RM: - DL has no GPU3D_SHADER_SEL/PODF, the shader domain is clocked by GPU3D_CORE - GPU3D_SHADER_SEL/PODF has been repurposed as GPU2D_CORE_SEL/PODF - GPU2D_CORE_SEL/PODF has been repurposed as MLB_SEL/PODF Cc: stable@vger.kernel.org Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2016-09-20clk: imx53: Add clocks configurationKalle Kankare
Add clocks configuration for CSI, FIRI and IEEE1588. Signed-off-by: Fabien Lahoudere <fabien.lahoudere@collabora.co.uk> Acked-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2016-09-20fix fault_in_multipages_...() on architectures with no-op access_ok()Al Viro
Switching iov_iter fault-in to multipages variants has exposed an old bug in underlying fault_in_multipages_...(); they break if the range passed to them wraps around. Normally access_ok() done by callers will prevent such (and it's a guaranteed EFAULT - ERR_PTR() values fall into such a range and they should not point to any valid objects). However, on architectures where userland and kernel live in different MMU contexts (e.g. s390) access_ok() is a no-op and on those a range with a wraparound can reach fault_in_multipages_...(). Since any wraparound means EFAULT there, the fix is trivial - turn those while (uaddr <= end) ... into if (unlikely(uaddr > end)) return -EFAULT; do ... while (uaddr <= end); Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Cc: stable@vger.kernel.org # v3.5+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-20Merge branch 'irq/urgent' into irq/coreThomas Gleixner
Merge urgent fixes so pending patches for 4.9 can be applied.
2016-09-20PCI/AER: Remove duplicate AER severity translationTyler Baicar
Currently the AER severity is being translated twice in the code flow for PCIe errors. It is first translated in ghes_do_proc() before calling into the AER driver. Then it is translated again when the AER driver calls cper_print_aer(). This causes the severity that is used in cper_print_aer() to be incorrect. Remove the second translation that is in cper_print_aer() since this function is already receiving the correct AER severity. Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Borislav Petkov <bp@suse.de>
2016-09-20parisc: Drop BROKEN_RODATA config optionHelge Deller
PARISC was the only architecture which selected the BROKEN_RODATA config option. Drop it and remove the special handling from init.h as well. Signed-off-by: Helge Deller <deller@gmx.de>
2016-09-20Merge branch 'efi/urgent' into efi/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-20dma-buf/sync_file: fix documentation errorEmilio López
The ioctl name and description on the documentation block don't match the ioctl being defined. This was probably overlooked while renaming the ioctls during the sync file destaging. This patch provides a more accurate description of what the ioctl actually does. Signed-off-by: Emilio López <emilio.lopez@collabora.co.uk> Reviewed-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Link: http://patchwork.freedesktop.org/patch/msgid/20160919042120.6280-1-emilio.lopez@collabora.co.uk
2016-09-20Merge branches 'x86/amd', 'x86/vt-d', 'arm/exynos', 'arm/mediatek', ↵Joerg Roedel
'arm/renesas' and 'arm/smmu' into next
2016-09-20Merge branch 'for-joerg/arm-smmu/updates' of ↵Joerg Roedel
git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu
2016-09-20blk/mq: Reserve hotplug states for block multiqueueSebastian Andrzej Siewior
This patch only reserves two CPU hotplug states for block/mq so the block tree can apply the conversion patches. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-20-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-20rhashtable: Add rhlist interfaceHerbert Xu
The insecure_elasticity setting is an ugly wart brought out by users who need to insert duplicate objects (that is, distinct objects with identical keys) into the same table. In fact, those users have a much bigger problem. Once those duplicate objects are inserted, they don't have an interface to find them (unless you count the walker interface which walks over the entire table). Some users have resorted to doing a manual walk over the hash table which is of course broken because they don't handle the potential existence of multiple hash tables. The result is that they will break sporadically when they encounter a hash table resize/rehash. This patch provides a way out for those users, at the expense of an extra pointer per object. Essentially each object is now a list of objects carrying the same key. The hash table will only see the lists so nothing changes as far as rhashtable is concerned. To use this new interface, you need to insert a struct rhlist_head into your objects instead of struct rhash_head. While the hash table is unchanged, for type-safety you'll need to use struct rhltable instead of struct rhashtable. All the existing interfaces have been duplicated for rhlist, including the hash table walker. One missing feature is nulls marking because AFAIK the only potential user of it does not need duplicate objects. Should anyone need this it shouldn't be too hard to add. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20Merge branch 'linus' into x86/asm, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-19net sched ife action: Introduce skb tcindex metadata encap decapJamal Hadi Salim
Sample use case of how this is encoded: user space via tuntap (or a connected VM/Machine/container) encodes the tcindex TLV. Sample use case of decoding: IFE action decodes it and the skb->tc_index is then used to classify. So something like this for encoded ICMP packets: .. first decode then reclassify... skb->tcindex will be set sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \ u32 match u32 0 0 flowid 1:1 \ action ife decode reclassify ...next match the decode icmp packet... sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \ u32 match ip protocol 1 0xff flowid 1:1 \ action continue ... last classify it using the tcindex classifier and do someaction.. sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \ handle 0x11 tcindex classid 1:1 \ action blah.. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19net sched ife action: add 16 bit helpersJamal Hadi Salim
encoder and checker for 16 bits metadata Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19fanotify: fix list corruption in fanotify_get_response()Jan Kara
fanotify_get_response() calls fsnotify_remove_event() when it finds that group is being released from fanotify_release() (bypass_perm is set). However the event it removes need not be only in the group's notification queue but it can have already moved to access_list (userspace read the event before closing the fanotify instance fd) which is protected by a different lock. Thus when fsnotify_remove_event() races with fanotify_release() operating on access_list, the list can get corrupted. Fix the problem by moving all the logic removing permission events from the lists to one place - fanotify_release(). Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events") Link: http://lkml.kernel.org/r/1473797711-14111-3-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19fsnotify: add a way to stop queueing events on group shutdownJan Kara
Implement a function that can be called when a group is being shutdown to stop queueing new events to the group. Fanotify will use this. Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events") Link: http://lkml.kernel.org/r/1473797711-14111-2-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19ALSA: line6: Add hwdep interface to access the POD control messagesAndrej Krutak
We must do it this way, because e.g. POD X3 won't play any sound unless the host listens on the bulk EP, so we cannot export it only via libusb. The driver currently doesn't use the bulk EP messages in other way, in future it could e.g. sense/modify volume(s). Signed-off-by: Andrej Krutak <dev@andree.sk> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-09-20Merge tag 'imx-drm-next-2016-09-19' of ↵Dave Airlie
git://git.pengutronix.de/git/pza/linux into drm-next imx-drm active plane reconfiguration, cleanup, FSU/IC/IRT/VDIC support - add active plane reconfiguration support (v4), use the atomic_disable callback - stop calling disable_plane manually in the plane destroy path - let mode cleanup destroy mode objects on driver unbind - drop deprecated load/unload drm_driver ops - add exclusive fence to plane state, so the atomic helper can wait on them, remove the open-coded fence wait from imx-drm - add low level deinterlacer (VDIC) support - add support for channel linking via the frame synchronisation unit (FSU) - add queued image conversion support for memory-to-memory scaling, rotation, and color space conversion, using IC and IRT. * tag 'imx-drm-next-2016-09-19' of git://git.pengutronix.de/git/pza/linux: gpu: ipu-v3: Add queued image conversion support gpu: ipu-v3: Add ipu_rot_mode_is_irt() gpu: ipu-v3: fix a possible NULL dereference drm/imx: parallel-display: detach bridge or panel on unbind drm/imx: imx-ldb: detach bridge on unbind drm/imx: imx-ldb: detach panel on unbind gpu: ipu-v3: Add FSU channel linking support gpu: ipu-v3: Add Video Deinterlacer unit drm/imx: add exclusive fence to plane state drm/imx: fold ipu_plane_disable into ipu_disable_plane drm/imx: don't destroy mode objects manually on driver unbind drm/imx: drop deprecated load/unload drm_driver ops drm/imx: don't call disable_plane in plane destroy path drm/imx: Add active plane reconfiguration support drm/imx: Use DRM_PLANE_COMMIT_NO_DISABLE_AFTER_MODESET flag drm/imx: ipuv3-crtc: Use the callback ->atomic_disable instead of ->disable gpu: ipu-v3: Do not wait for DMFC FIFO to clear when disabling DMFC channel
2016-09-20Merge tag 'drm-intel-next-2016-09-19' of ↵Dave Airlie
git://anongit.freedesktop.org/drm-intel into drm-next - refactor the sseu code (Imre) - refine guc dmesg output (Dave Gordon) - more vgpu work - more skl wm fixes (Lyude) - refactor dpll code in prep for upfront link training (Jim Bride et al) - consolidate all platform feature checks into intel_device_info (Carlos Santa) - refactor elsp/execlist submission as prep for re-submission after hang recovery and eventually scheduling (Chris Wilson) - allow synchronous gpu reset handling, to remove tricky/impossible/fragile error recovery code (Chris Wilson) - prep work for nonblocking (execlist) submission, using fences to track depencies and drive elsp submission (Chris Wilson) - partial error recover/resubmission of non-guilty batches after hangs (Chris Wilson) - full dma-buf implicit fencing support (Chris Wilson) - dp link training fixes (Jim, Dhinkaran, Navare, ...) - obey dp branch device pixel rate/bpc/clock limits (Mika Kahola), needed for many vga dongles - bunch of small cleanups and polish all over, as usual [airlied: printing macros collided] * tag 'drm-intel-next-2016-09-19' of git://anongit.freedesktop.org/drm-intel: (163 commits) drm/i915: Update DRIVER_DATE to 20160919 drm: Fix DisplayPort branch device ID kernel-doc drm/i915: use NULL for NULL pointers drm/i915: do not use 'false' as a NULL pointer drm/i915: make intel_dp_compute_bpp static drm: Add DP branch device info on debugfs drm/i915: Update bits per component for display info drm/i915: Check pixel rate for DP to VGA dongle drm/i915: Read DP branch device SW revision drm/i915: Read DP branch device HW revision drm/i915: Cleanup DisplayPort AUX channel initialization drm: Read DP branch device id drm: Helper to read max bits per component drm: Helper to read max clock rate drm: Drop VGA from bpc definitions drm: Add missing DP downstream port types drm/i915: Add ddb size field to device info structure drm/i915/guc: general tidying up (submission) drm/i915/guc: general tidying up (loader) drm/i915: clarify PMINTRMSK/pm_intr_keep usage ...
2016-09-20Merge branch 'drm-next-4.9' of git://people.freedesktop.org/~agd5f/linux ↵Dave Airlie
into drm-next More radeon and amdgpu changes for 4.9. Highlights: - Initial SI support for amdgpu (controlled by a Kconfig option) - misc ttm cleanups - runtimepm fixes - S3/S4 fixes - power improvements - lots of code cleanups and optimizations * 'drm-next-4.9' of git://people.freedesktop.org/~agd5f/linux: (151 commits) drm/ttm: remove cpu_address member from ttm_tt drm/radeon/radeon_device: remove unused function drm/amdgpu: clean function declarations in amdgpu_ttm.c up drm/amdgpu: use the new ring ib and dma frame size callbacks (v2) drm/amdgpu/vce3: add ring callbacks for ib and dma frame size drm/amdgpu/vce2: add ring callbacks for ib and dma frame size drm/amdgpu/vce: add common ring callbacks for ib and dma frame size drm/amdgpu/uvd6: add ring callbacks for ib and dma frame size drm/amdgpu/uvd5: add ring callbacks for ib and dma frame size drm/amdgpu/uvd4.2: add ring callbacks for ib and dma frame size drm/amdgpu/sdma3: add ring callbacks for ib and dma frame size drm/amdgpu/sdma2.4: add ring callbacks for ib and dma frame size drm/amdgpu/cik_sdma: add ring callbacks for ib and dma frame size drm/amdgpu/si_dma: add ring callbacks for ib and dma frame size drm/amdgpu/gfx8: add ring callbacks for ib and dma frame size drm/amdgpu/gfx7: add ring callbacks for ib and dma frame size drm/amdgpu/gfx6: add ring callbacks for ib and dma frame size drm/amdgpu/ring: add an interface to get dma frame and ib size drm/amdgpu/sdma3: drop unused functions drm/amdgpu/gfx6: drop gds_switch callback ...
2016-09-19s390/mm/pfault: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-s390@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: rt@linutronix.de Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Link: http://lkml.kernel.org/r/20160906170457.32393-18-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19mips/octeon/smp: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. [ tglx: Renamed the state to MIPS_SOC_PREPARE so it can be reused by other SOCs ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-16-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19fault-injection/cpu: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. This is just a temporary vehicle to keep the interface working for now, It'll be replaced by the sysfs interface which allows to step through the hotplug state machine step by step. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-15-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19padata: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. CPU-hotplug multinstance support is used with the nocalls() version. Maybe parts of padata_alloc() could be moved into the online callback so that we could invoke ->startup callback for instance and drop get_online_cpus(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-crypto@vger.kernel.org Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-14-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19ACPI/processor: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-acpi@vger.kernel.org Cc: rt@linutronix.de Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/20160906170457.32393-12-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19virtio scsi: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. It uses the multi instance infrastructure of the hotplug code to handle each interface. virtscsi_set_affinity() is removed from virtscsi_init() because virtscsi_cpu_notif_add() (the function which registers the instance) is invoked right after it and the cpuhp_state_add_instance() functions invokes the startup callback on all online CPUs. The same thing can not be applied virtscsi_cpu_notif_remove() because virtscsi_remove_vqs() invokes virtscsi_set_affinity() with affinity = false as argument but the old CPU_DEAD state invoked the function with affinity = true (which does not match the DEAD callback). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com> Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: virtualization@lists.linux-foundation.org Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-11-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19block/softirq: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-9-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19lib/irq_poll: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-8-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19sh/SH-X3 SMP: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-sh@vger.kernel.org Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-6-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19ARM/OMAP/wakeupgen: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tony Lindgren <tony@atomide.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Cc: linux-omap@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20160906170457.32393-4-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19ARM/shmobile: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine so the old notifier based cpuhotplug infrastructure can be removed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-sh@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Magnus Damm <magnus.damm@gmail.com> Cc: Simon Horman <horms@verge.net.au> Cc: rt@linutronix.de Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20160906170457.32393-3-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19arm64/FP/SIMD: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: rt@linutronix.de Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20160906170457.32393-2-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>