bpf: Ringbuf: Use an acquire load to read the ring buffer entry length

When a sample is committed to the ring buffer, the kernel updates the
record's length field with xchg(), which implies a full memory barrier
on the kernel side [1].

On the user-space side, reading the length through a volatile pointer
is not sufficient to prevent the reads of the sample data from being
reordered before the load of the length. Read the length with an
acquire load instead.
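
The pairing can be sketched in isolation as follows. This is a minimal
illustration, not the BpfRingbuf.h code: Record, kBusyBit, Commit,
TryConsume and Demo are hypothetical names, and a seq_cst exchange() is
assumed here as a stand-in for the kernel's xchg().

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only.
constexpr uint32_t kBusyBit = 1u << 31;  // plays the role of BPF_RINGBUF_BUSY_BIT

struct Record {
  std::atomic<uint32_t> length{kBusyBit};  // header word, starts out "busy"
  uint32_t payload = 0;                    // plain, non-atomic sample data
};

// Producer side: write the payload first, then publish it by replacing the
// busy length with the real one. The exchange orders the payload store
// before the length store (assumed analogue of the kernel's xchg()).
void Commit(Record& rec, uint32_t value) {
  rec.payload = value;
  rec.length.exchange(static_cast<uint32_t>(sizeof(rec.payload)),
                      std::memory_order_seq_cst);
}

// Consumer side: the acquire load keeps the payload read below from being
// reordered before the length load. A volatile read gives no such guarantee.
bool TryConsume(const Record& rec, uint32_t* out) {
  uint32_t length = rec.length.load(std::memory_order_acquire);
  if (length & kBusyBit) return false;  // not committed yet
  *out = rec.payload;
  return true;
}

// Single-threaded walk-through; in practice Commit() runs on another CPU.
uint32_t Demo() {
  Record rec;
  uint32_t out = 0;
  bool ready_before = TryConsume(rec, &out);  // record is still busy
  Commit(rec, 42);
  bool ready_after = TryConsume(rec, &out);   // record is now committed
  assert(!ready_before && ready_after);
  return out;
}
```

With the acquire load, a consumer that observes a length without the
busy bit is also guaranteed to observe the payload written before the
commit.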


[1] https://github.com/torvalds/linux/blob/a20971c187522f5a7cd8e961e7e9c88f31ea2bed/kernel/bpf/ringbuf.c#L484

Bug: 374722456
Bug: 368624834
Bug: 376536942
Change-Id: I75eee3deee2afce83c1b760e6df383375f926ebb
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
diff --git a/bpf/headers/include/bpf/BpfRingbuf.h b/bpf/headers/include/bpf/BpfRingbuf.h
index 4bcd259..5fe4ef7 100644
--- a/bpf/headers/include/bpf/BpfRingbuf.h
+++ b/bpf/headers/include/bpf/BpfRingbuf.h
@@ -99,6 +99,7 @@
   // 32-bit kernel will just ignore the high-order bits.
   std::atomic_uint64_t* mConsumerPos = nullptr;
   std::atomic_uint32_t* mProducerPos = nullptr;
+  std::atomic_uint32_t* mLength = nullptr;
 
   // In order to guarantee atomic access in a 32 bit userspace environment, atomic_uint64_t is used
   // in addition to std::atomic<T>::is_always_lock_free that guarantees that read / write operations
@@ -247,7 +248,8 @@
     //   u32 len;
     //   u32 pg_off;
     // };
-    uint32_t length = *reinterpret_cast<volatile uint32_t*>(start_ptr);
+    mLength = reinterpret_cast<decltype(mLength)>(start_ptr);
+    uint32_t length = mLength->load(std::memory_order_acquire);
 
     // If the sample isn't committed, we're caught up with the producer.
     if (length & BPF_RINGBUF_BUSY_BIT) return count;