Here's my note on doing the [One billion rows challenge](https://github.com/gunnarmorling/1brc).
My goal here is to:
1. Find a reasonably optimized solution that's still readable and maintainable by a new-grad programmer.
2. Show how easy it is to optimize code with the help of a flamegraph.
You can find the code [here](https://github.com/poga/onebrc/blob/main/src/main.rs).
# Step 1. Unoptimized: 73s
The baseline, running on my M2 MacBook with 24GB of RAM, took 73s. Not bad.
```rust
use std::collections::HashMap;
use std::env::args;
use std::io::Read;

struct Record {
    count: i32,
    min: f32,
    max: f32,
    sum: f32,
}

impl Record {
    fn default() -> Record {
        Record {
            count: 0,
            min: f32::INFINITY,
            max: f32::NEG_INFINITY,
            sum: 0.0,
        }
    }

    fn add(&mut self, value: f32) {
        self.count += 1;
        self.sum += value;
        self.min = self.min.min(value);
        self.max = self.max.max(value);
    }

    fn avg(&self) -> f32 {
        self.sum / self.count as f32
    }
}

fn main() {
    let filename = args().nth(1).unwrap_or("measurements.txt".to_string());
    let mut data = String::new();
    std::fs::File::open(filename)
        .expect("Could not open file")
        .read_to_string(&mut data)
        .expect("Could not read file");

    let mut h = HashMap::new();
    for line in data.lines() {
        let (name, value) = line.split_once(';').unwrap();
        let value: f32 = value.parse().expect("Could not parse value");
        h.entry(name.to_string())
            .or_insert(Record::default())
            .add(value);
    }

    let mut v: Vec<_> = h.iter().collect();
    v.sort_unstable_by_key(|p| p.0);
    for (name, r) in v {
        println!("{name}: {:.1}/{:.1}/{:.1}", r.min, r.avg(), r.max);
    }
}
```
Here's the flamegraph generated with `cargo flamegraph`. Some obvious bottlenecks are the `HashMap` operations and the string `split`.
# Step 2. `memchr` crate: 40s
Simply switching from `split_once` to the SIMD-optimized [`memchr`](https://docs.rs/memchr/latest/memchr/) crate reduces the runtime to 40s.
```rust
use memchr::memchr;
use std::collections::HashMap;
use std::env::args;
use std::io::Read;

struct Record {
    count: i32,
    min: f32,
    max: f32,
    sum: f32,
}

impl Record {
    fn default() -> Record {
        Record {
            count: 0,
            min: f32::INFINITY,
            max: f32::NEG_INFINITY,
            sum: 0.0,
        }
    }

    fn add(&mut self, value: f32) {
        self.count += 1;
        self.sum += value;
        self.min = self.min.min(value);
        self.max = self.max.max(value);
    }

    fn avg(&self) -> f32 {
        self.sum / self.count as f32
    }
}

fn main() {
    let filename = args().nth(1).unwrap_or("measurements.txt".to_string());
    let mut data = String::new();
    std::fs::File::open(filename)
        .expect("Could not open file")
        .read_to_string(&mut data)
        .expect("Could not read file");

    let mut h = HashMap::new();
    let mut data = data.as_bytes();
    loop {
        let Some(sep) = memchr(b';', data) else {
            break;
        };
        let end = memchr(b'\n', &data[sep..]).unwrap();
        let name = &data[..sep];
        let value = &data[sep + 1..sep + end];
        let val = String::from_utf8_lossy(value)
            .parse::<f32>()
            .expect("Could not parse value");
        h.entry(name).or_insert(Record::default()).add(val);
        data = &data[sep + end + 1..];
    }

    let mut v: Vec<_> = h.iter().collect();
    v.sort_unstable_by_key(|p| p.0);
    for (name, r) in v {
        println!(
            "{}: {:.1}/{:.1}/{:.1}",
            std::str::from_utf8(name).unwrap(),
            r.min,
            r.avg(),
            r.max
        );
    }
}
```
However, we introduced an ugly and slow `String::from_utf8_lossy` call to convert the value bytes into a string for `parse`, and it shows up in the flamegraph.
![](./02.svg)
# Step 3. Manual parsing: 28.1s
By manually parsing the temperature into a fixed-precision signed `i32` (the value times ten), we can reduce the runtime to 28.1s.
```rust
use memchr::memchr;
use std::collections::HashMap;
use std::env::args;
use std::io::Read;

struct Record {
    count: i32,
    min: i32,
    max: i32,
    sum: i32,
}

impl Record {
    fn default() -> Record {
        Record {
            count: 0,
            min: i32::MAX,
            max: i32::MIN,
            sum: 0,
        }
    }

    fn add(&mut self, value: i32) {
        self.count += 1;
        self.sum += value;
        self.min = self.min.min(value);
        self.max = self.max.max(value);
    }

    fn avg(&self) -> i32 {
        self.sum / self.count
    }
}

fn main() {
    let filename = args().nth(1).unwrap_or("measurements.txt".to_string());
    let mut data = String::new();
    std::fs::File::open(filename)
        .expect("Could not open file")
        .read_to_string(&mut data)
        .expect("Could not read file");

    let mut h = HashMap::new();
    let mut data = data.as_bytes();
    loop {
        let Some(sep) = memchr(b';', data) else {
            break;
        };
        let end = memchr(b'\n', &data[sep..]).unwrap();
        let name = &data[..sep];
        let value = &data[sep + 1..sep + end];
        h.entry(name).or_insert(Record::default()).add(parse(value));
        data = &data[sep + end + 1..];
    }

    let mut v: Vec<_> = h.iter().collect();
    v.sort_unstable_by_key(|p| p.0);
    for (name, r) in v {
        println!(
            "{}: {}/{}/{}",
            std::str::from_utf8(name).unwrap(),
            format(r.min),
            format(r.avg()),
            format(r.max)
        );
    }
}

// parse into a fixed-precision i32 signed integer
fn parse(mut s: &[u8]) -> i32 {
    let neg = if s[0] == b'-' {
        s = &s[1..];
        true
    } else {
        false
    };
    // s = abc.d
    let (a, b, c, d) = match s {
        [c, b'.', d] => (0, 0, c - b'0', d - b'0'),
        [b, c, b'.', d] => (0, b - b'0', c - b'0', d - b'0'),
        [a, b, c, b'.', d] => (a - b'0', b - b'0', c - b'0', d - b'0'),
        [c] => (0, 0, 0, c - b'0'),
        [b, c] => (0, b - b'0', c - b'0', 0),
        [a, b, c] => (a - b'0', b - b'0', c - b'0', 0),
        _ => panic!("Unknown pattern {:?}", std::str::from_utf8(s).unwrap()),
    };
    let v = a as i32 * 1000 + b as i32 * 100 + c as i32 * 10 + d as i32;
    if neg {
        -v
    } else {
        v
    }
}

fn format(v: i32) -> String {
    format!("{:.1}", v as f64 / 10.0)
}
```
From the flamegraph, we can see that `String::from_utf8_lossy` is gone. The `HashMap` is our next target.
![](./03.svg)
# Step 4. `hashbrown` crate: 23.1s
I googled around and found that Rust's standard library already uses `hashbrown` as its default `HashMap` implementation, but I still decided to try the crate version.
Thanks to Rust's ecosystem, we can simply switch to the [`hashbrown`](https://docs.rs/hashbrown/0.9.1/hashbrown/) crate, a faster drop-in replacement for the standard library's `HashMap`.
The code only needs a single change: `use hashbrown::HashMap;`.
![](./04.svg)
Why does switching to the crate version improve the performance? I don't know.
From the above (kind of useless) flamegraph, we can't see any obvious bottlenecks. I guess everything is inlined?
The next step would be to profile the code with a more fine-grained tool like `perf report`.
I also tried `memmap` to map the file into memory, but it didn't show a visible performance improvement.
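One plausible explanation for the Step 4 speedup: std's `HashMap` internally uses the same `hashbrown` table, but pairs it with the DoS-resistant SipHash hasher, while the crate defaults to a much faster hash function. The hasher is pluggable on std's `HashMap` too; here's a sketch with a hand-rolled FNV-1a hasher (illustration only, not the hasher `hashbrown` actually ships):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy FNV-1a hasher: far simpler than SipHash, but not DoS-resistant.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    // Same HashMap API, different hash function.
    let mut h: HashMap<String, i32, BuildHasherDefault<Fnv1a>> = HashMap::default();
    h.insert("Taipei".to_string(), 1);
    assert_eq!(h["Taipei"], 1);
}
```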
# Step 5. parallelization: 11.29s, where the actual processing took only 3.68s
Check the code [here](https://github.com/poga/onebrc/commit/f3ce194adea4952c4beb134e331001d8273c3e96)
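The idea is to split the input across threads and merge the partial results at the end. Here's a minimal sketch of that shape using only `std::thread::scope`; the chunking and merging below are my reconstruction of the approach, not the exact code from the commit, and the per-chunk work is simplified to line counting in place of the real min/max/sum bookkeeping:

```rust
use std::collections::HashMap;
use std::thread;

// Split `data` into roughly `n` chunks, each ending on a newline
// boundary, so no line is ever split across two threads.
fn chunks(data: &[u8], n: usize) -> Vec<&[u8]> {
    let target = (data.len() / n.max(1)).max(1);
    let mut out = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let mut end = (start + target).min(data.len());
        // extend the chunk until it ends right after a '\n'
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        out.push(&data[start..end]);
        start = end;
    }
    out
}

// Per-chunk work: count lines per station (a stand-in for the real
// min/max/sum Record updates).
fn process(chunk: &[u8]) -> HashMap<Vec<u8>, u32> {
    let mut h = HashMap::new();
    for line in chunk.split(|&b| b == b'\n').filter(|l| !l.is_empty()) {
        let sep = line.iter().position(|&b| b == b';').unwrap();
        *h.entry(line[..sep].to_vec()).or_insert(0) += 1;
    }
    h
}

// Fan out one scoped thread per chunk, then merge the partial maps.
fn run(data: &[u8], n: usize) -> HashMap<Vec<u8>, u32> {
    let parts = chunks(data, n);
    thread::scope(|s| {
        let handles: Vec<_> = parts
            .iter()
            .map(|&chunk| s.spawn(move || process(chunk)))
            .collect();
        let mut total = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *total.entry(k).or_insert(0) += v;
            }
        }
        total
    })
}

fn main() {
    let data = b"Taipei;27.5\nTokyo;18.2\nTaipei;30.1\n";
    let totals = run(data, 2);
    assert_eq!(totals[&b"Taipei".to_vec()], 2);
    assert_eq!(totals[&b"Tokyo".to_vec()], 1);
}
```

Cutting chunks only at newline boundaries is the key detail: it lets each thread parse its slice independently with no locking, so the only synchronization is the final merge.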
# Step 6. use `fs::read`: 4.9s
Somehow I missed the fact that `read_to_string` validates the file's UTF-8 encoding, which is unnecessary in this case. By using `fs::read`, we can reduce the runtime to 4.9s.
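The change itself is just swapping the read call. A minimal before/after sketch (the temp file here is only so the example runs standalone):

```rust
use std::fs;
use std::io::Read;

fn main() {
    let path = std::env::temp_dir().join("onebrc_read_demo.txt");
    fs::write(&path, "Taipei;27.5\n").unwrap();

    // Before: read_to_string walks the entire buffer to validate UTF-8.
    let mut text = String::new();
    fs::File::open(&path)
        .unwrap()
        .read_to_string(&mut text)
        .unwrap();

    // After: fs::read hands back the raw bytes, with no validation pass.
    let bytes: Vec<u8> = fs::read(&path).unwrap();

    assert_eq!(text.as_bytes(), bytes.as_slice());
    fs::remove_file(&path).unwrap();
}
```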
Check the code [here](https://github.com/poga/onebrc/blob/main/src/main.rs).
# Conclusion
The whole process achieved a 15x speedup with little effort, mostly by mindlessly following the guidance of the flamegraph.