Introduction
Ben Hoyt wrote an excellent blog post comparing programming languages with a non-trivial program: counting the number of words in a text file.
You can find his repo here.
Of all the languages, only C did not provide a hashmap out of the box, so Ben wrote his own linearly probed map with a simple hash function for his version.
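For readers who have not seen his code: a linearly probed map keeps its entries in one flat array and, on a collision, simply walks forward to the next free slot. The sketch below is not Ben’s actual code, only a rough illustration of the idea, using an FNV-1a-style hash and assuming a power-of-two table size.
#include <stddef.h>
#include <stdint.h>
#include <string.h>
// An FNV-1a-style hash; simple and fast, which is the whole point.
static uint64_t simple_hash(const char* s, size_t len) {
    uint64_t h = 14695981039346656037ULL; // FNV-1a offset basis
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 1099511628211ULL;            // FNV-1a prime
    }
    return h;
}
typedef struct {
    char* word;   // NULL marks an empty bucket
    size_t len;
    int count;
} entry;
// Linear probing: start at the hashed slot and walk forward until we find
// either the word or an empty bucket; cap is assumed to be a power of two.
static entry* find(entry* buckets, size_t cap, const char* word, size_t len) {
    size_t i = simple_hash(word, len) & (cap - 1);
    while (buckets[i].word != NULL) {
        if (buckets[i].len == len && memcmp(buckets[i].word, word, len) == 0) {
            return &buckets[i];   // existing entry: caller bumps the count
        }
        i = (i + 1) & (cap - 1);  // collision: try the next slot
    }
    return &buckets[i];           // empty slot: caller inserts here
}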
As it turns out, I have already written what is supposed to be a production-ready hashmap for C, so I thought I would try it out.
It went hilariously wrong.
Fighting Bugs
First, I had recently made a change to the implementation. My existing tests had not caught any bugs, and since I wasn’t actually using the hashmap in anything important, I had left it at that.
But there were bugs, very subtle ones, and my adaptation of Ben’s test caught them.
They were so subtle that it took me several days to figure out what was actually going wrong.
And the fix was not simple either.
To ensure that sort of thing does not happen again, I actually added my adaptation as a test in my repo, and I added a bunch of assert code to explode at the slightest provocation.
The Code
Without further ado, the code:
/**
* This file is under the following copyright and license:
*
* MIT License
*
* Copyright (c) 2021 Ben Hoyt
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#include <assert.h>
#include <ctype.h>
#include <search.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <yc/yc.h>
#include <yc/hash.h>
#include <yc/map.h>
#include <yc/vector.h>
#include <xxhash.h>
#define BUF_SIZE 65536
// Used both for hash table buckets and array for sorting.
typedef struct {
char* word;
int word_len;
} word_data;
typedef struct {
word_data word;
int count;
} count;
// Comparison function for qsort() ordering by count descending.
int cmp_count(const void* p1, const void* p2) {
int c1 = ((count*)p1)->count;
int c2 = ((count*)p2)->count;
if (c1 == c2) return 0;
if (c1 < c2) return 1;
return -1;
}
ssize_t cmp(const void* p1, const void* p2) {
return (ssize_t) strcmp(((count*)p1)->word.word, ((count*)p2)->word.word);
}
ssize_t cmp_safe(const void* p1, const void* p2) {
count* c1 = (count*) p1;
count* c2 = (count*) p2;
int min = c1->word.word_len < c2->word.word_len ?
c1->word.word_len : c2->word.word_len;
int res = memcmp(c1->word.word, c2->word.word, min);
if (!res) {
if (c1->word.word_len == c2->word.word_len) return 0;
if (c1->word.word_len < c2->word.word_len) return -1;
return 1;
}
return res;
}
YMap table;
uint64_t xxhash_hash(const void* key) {
const word_data* w = (const word_data*) key;
return XXH3_64bits(w->word, w->word_len);
}
uint64_t hash(const void* key) {
const word_data* w = (const word_data*) key;
return yhash64(w->word, w->word_len, w->word_len);
}
bool eq(const void* key1, const void* key2) {
const word_data* w1 = (const word_data*) key1;
const word_data* w2 = (const word_data*) key2;
return w1->word_len == w2->word_len &&
!memcmp(w1->word, w2->word, w1->word_len);
}
bool eq_test(const void* key1, const void* key2) {
const count* c1 = (const count*) key1;
const count* c2 = (const count*) key2;
return c1->word.word_len == c2->word.word_len &&
!memcmp(c1->word.word, c2->word.word, c1->word.word_len) &&
c1->count == c2->count;
}
void dtor(void* wrd) {
word_data* w = (word_data*) wrd;
free(w->word);
}
// Increment count of word in hash table (or insert new word).
void increment(char* word, int word_len) {
static int times = 0;
times += 1;
word_data w;
word_data* temp;
w.word = word;
w.word_len = word_len;
int* val;
if (ymap_exists_v(table, &w, &val)) {
*val += 1;
}
else {
int wcount = 1;
w.word = malloc(word_len + 1);
if (w.word == NULL) {
fprintf(stderr, "could not allocate\n");
exit(1);
}
memcpy(w.word, word, word_len);
w.word[word_len] = '\0';
YStatus s = ymap_insert(table, &w, &wcount);
if (s != YSTATUS_SUCCESS) {
fprintf(stderr, "could not insert\n");
exit(1);
}
}
}
int main() {
// Allocate hash table buckets.
table = ymap_create(0.96f, sizeof(word_data), sizeof(int),
xxhash_hash, eq, dtor, NULL);
if (table == NULL) {
fprintf(stderr, "out of memory\n");
return 1;
}
char buf[BUF_SIZE];
int offset = 0;
while (1) {
// Read file in chunks, processing one chunk at a time.
size_t num_read = fread(buf+offset, 1, BUF_SIZE-offset, stdin);
if (num_read == 0) {
break;
}
// Find last space or linefeed in buf and process up to there.
int space;
for (space = offset+num_read-1; space>=0; space--) {
char c = buf[space];
if (c <= ' ') {
break;
}
}
int num_process = (space >= 0) ? space : (int)num_read+offset;
// Scan chars to process: tokenize, lowercase, and hash as we go.
int i = 0;
while (1) {
// Skip whitespace before word.
for (; i < num_process; i++) {
char c = buf[i];
if (c > ' ') {
break;
}
}
// Look for end of word, lowercase and hash as we go.
int start = i;
for (; i < num_process; i++) {
char c = buf[i];
if (c <= ' ') {
break;
}
if (c >= 'A' && c <= 'Z') {
c += ('a' - 'A');
buf[i] = c;
}
}
if (i <= start) {
break;
}
// Got a word, increment count in hash table.
increment(buf+start, i-start);
}
// Move down remaining partial word.
if (space >= 0) {
offset = (offset+num_read-1) - space;
memmove(buf, buf+space+1, offset);
} else {
offset = 0;
}
}
size_t num_unique = ymap_size(table);
count* ordered_map = calloc(num_unique, sizeof(count));
if (ordered_map == NULL) {
fprintf(stderr, "out of memory\n");
exit(1);
}
size_t i = 0;
YIterator it;
for (it = ymap_begin(table); it; it = ymap_next(table, it), ++i) {
word_data* w = ymap_key(table, it);
int* c = ymap_at(table, w);
ordered_map[i].word = *w;
ordered_map[i].count = *c;
}
i = 0;
qsort(ordered_map, num_unique, sizeof(count), cmp_count);
for (size_t i = 0; i < num_unique; i++) {
printf("%s %d\n", ordered_map[i].word.word, ordered_map[i].count);
}
#ifndef NDEBUG
free(ordered_map);
ymap_free(table);
#endif // NDEBUG
return 0;
}
I also added the following to test.sh:
echo C optimized yc
gcc -static -O2 -DNDEBUG optimized_yc.c /usr/local/lib/libyc.a /usr/local/lib/libxxhash.a -o optimized-c-yc
./optimized-c-yc <kjvbible_x10.txt | python3 normalize.py >output.txt
git diff --exit-code output.txt
Be sure that libyc.a and libxxhash.a are accessible to the compiler.
Make sure to use this commit.
And finally, I also added the following item to the list of programs in benchmark.py:
('C (Yc)', './optimized-c', './optimized-c-yc', ''),
Benchmarking
When I ran the benchmarks, Ben’s optimized C took 17 milliseconds, and the version that used my map’s builtin hash function (which you can get by changing the ymap_create() call to use hash instead of xxhash_hash) took about 43 milliseconds.
That’s not ideal, but it is more realistic.
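Concretely, the only difference between the two runs is which hash callback gets handed to ymap_create(); it is the same call from the listing above with one argument swapped:
// xxh3, via the xxhash_hash wrapper from the listing:
table = ymap_create(0.96f, sizeof(word_data), sizeof(int),
                    xxhash_hash, eq, dtor, NULL);
// the map's builtin hash instead:
table = ymap_create(0.96f, sizeof(word_data), sizeof(int),
                    hash, eq, dtor, NULL);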
However, my map’s builtin hash function is SpookyHash, which fares poorly against xxh3 in some benchmarks, so I also added xxh3 to my version and used that instead.
Yes, my hashmap can use whatever hash function you want.
The result: my version ran in 30 milliseconds.
I was happy with that; 30 milliseconds for using a full hashmap is very close to the Rust version with a full hashmap.
That Rust version took 27 milliseconds on my machine, and my hashmap is probably not as optimized as Rust’s.
Lessons
The first lesson is: test.
I am glad that I took the time to make sure that my version worked; it would have been embarrassing otherwise.
Second, hash functions matter.
Cutting almost a third off the time for this exercise is significant for one simple change, especially since my map has to dynamically grow several times, which is a somewhat costly operation.
Growing the map was where the bug was, in fact.
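To make that concrete, here is a rough sketch of what growing a simple open-addressed table involves. This is not the actual YMap code; the names and layout are purely hypothetical, but it shows the kind of rehash loop where subtle bugs hide, and the kind of assert I now use to explode at the slightest provocation.
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
// Hypothetical bucket layout; the real YMap internals differ.
typedef struct {
    uint64_t hash;
    bool used;
    /* key and value payload omitted */
} bucket;
typedef struct {
    bucket* buckets;
    size_t cap; // always a power of two
    size_t len; // number of live entries
} toy_map;
// Growing means re-probing every live bucket against the new capacity.
// Dropping an entry, duplicating one, or probing with the old capacity
// all corrupt the table in ways that only show up much later.
static bool toy_map_grow(toy_map* m) {
    size_t new_cap = m->cap * 2;
    bucket* nb = calloc(new_cap, sizeof(bucket));
    if (nb == NULL) return false;
    size_t moved = 0;
    for (size_t i = 0; i < m->cap; i++) {
        if (!m->buckets[i].used) continue;
        size_t j = m->buckets[i].hash & (new_cap - 1);
        while (nb[j].used) j = (j + 1) & (new_cap - 1); // linear probe
        nb[j] = m->buckets[i];
        moved++;
    }
    // Debug-build tripwire: growing must preserve exactly the entries we had.
    assert(moved == m->len);
    free(m->buckets);
    m->buckets = nb;
    m->cap = new_cap;
    return true;
}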
Indeed, I would argue that Ben’s optimized C version is as fast as it is largely because of his hash.
I don’t know how good his hash would be in the real world, but I would bet that it would not stand up well to xxh3 in the general case.
However… I hesitate to use xxh3. I chose SpookyHash because it was public domain, and xxh3 is not, though its license is very liberal.
But I don’t want to have any questions about the origin of the code in Yc, much like the authors of SQLite.
I may add xxh3 as a submodule, or write my own version from scratch, but I think I will keep SpookyHash, just in case.
Third, Rust can hold its own. I say this because some people think I hate Rust, given that I have criticized it before.
Conclusion
There’s no real conclusion here.
However, just because C doesn’t have nice things by default doesn’t mean that we can’t make them.