End of the Cantonese Goldlisting Project

In March 2015, I posted about my Cantonese experiment, where I used a list of characters and a dictionary to goldlist Cantonese. At first, I thought it was just going to be a slightly inefficient way of going about it, but there were several unforeseen problems that made the experiment a major failure. Now that the project is in the closing stages and I have to wait before I can keep distilling, I wanted to write about what I learned. To be frank, it was more of a waste of time and effort than expected, and I would not do it again.

The whole project has spanned five years, or seven if I count the work I did before the actual experiment. While I haven’t actually spent five years working on it, the project has cost me more time than I actually put into it. For most of the last five years, I haven’t done much goldlisting. This was partly because of work, but also because the problems that kept surfacing really demotivated me. That was probably the worst outcome of the entire project, because I could have finished many shorter projects instead of wasting time on one big one.

Length of the Project

I originally thought it would only take me about 20,000 headlist lines, but I didn’t stop until I reached 40,000. There was still no end in sight, and it was apparent that I had included hundreds (if not thousands) of words I shouldn’t have. There were also many duplicates, because it was hard to keep track of which characters I had already covered.

Lack of Context

I knew before I started that I would be working without context, but I didn’t think it would be such a big problem. I had done small-scale projects like this before, and those worked out well. I sometimes misunderstood what a word meant or how to use it, but then I’d encounter it in the wild and realize the mistake. Although I had no context, the words I was learning were all very basic, so I could always imagine a context and be fairly sure it was accurate. This time, the words ranged from very basic all the way to needlessly advanced, and I had no idea which were which. CantoDict does divide them into five categories, but almost all of the words are put in the middle category, which defeats the purpose.

Source Material

The biggest problem turned out to be my choice of source material. CantoDict is maintained by volunteers who work when they can and want to. Nobody is responsible for checking every single entry to ensure accuracy, so there are many inaccurate entries. I suspect many of these were added and parsed automatically, because they often lack the changed tones and are listed with their citation tones instead, while others use the wrong reading of one of the characters.

For this reason, CantoDict is useful as a reference or second opinion, but it should not be relied upon alone. Be warned that many online dictionaries get their Cantonese data directly from CantoDict. Fortunately, the editors do a good job of linking to discussions about each entry when they come up on the forums, in which case you do get a second opinion right there.

Early on, I did not realize just how many wrong entries there would be. I used the forums to ask when I suspected something was off, but I didn’t always get a reply there either, so I soon resorted to asking friends instead. Eventually, I got pretty good at spotting suspicious words to ask about, and this revealed another problem, namely that many of the words and expressions were completely unknown to my friends! Often, I’d ask how to pronounce something, and they’d ask me what it was supposed to mean.

The sheer number of such words was the main reason I not only stopped adding to the headlist, but also simply gave up and crossed out hundreds of the words I had left. I crossed out anything that seemed suspicious, although I’d sometimes ask someone to make sure. I’m sure I also crossed out many legitimate words by doing it this way, but it was better than wasting more time learning the useless ones.

Final Stages

The last stages of the project involved filling up the last bronze book and distilling all the words I had left. The CantoDict headlist only reached 38,725 lines, so I had space left over in the book. I filled it up with Teach Yourself Cantonese, Colloquial Cantonese, and Intermediate Cantonese. I wish I had started with these instead of saving them for last, but they also came with their own problems, which I may write about later.

After cutting out all the words I didn’t trust, there wasn’t that much left to distill, but I did it anyway. Now, I’m all caught up and waiting to be able to keep distilling. I estimate that I’ll finish the entire project in less than two months from now and have 16 bronze books, 3 silver books, and 1 gold book to show for it.


This experiment was really a failure. It was meant to be a somewhat inefficient way to achieve a very useful goal, but it did not turn out that way. Even if I had used a more efficient method with the same source material, I would just have failed faster (which would have been preferable, but now I know).

That’s not to say I got nothing out of it, though. I did learn thousands of useful words and expressions, and I got really good at spotting suspicious information about Cantonese. That came in handy as I was going through the textbooks at the end, because Cantonese learning materials unfortunately always contain lots of mistakes. Still, it wasn’t worth the wasted time and effort in the end, and I would not do it this way again.

The Way Forward

Although the experiment is over, my Cantonese goldlisting doesn’t have to be. I only got halfway through Intermediate Cantonese before running out of pages in my bronze book. I could start a 17th bronze book and keep going, but I haven’t decided yet. For future projects, I’ll choose dictionaries that at least have example sentences. I do happen to have one for Cantonese, namely 東方廣東語辭典 (a Cantonese–Japanese dictionary that is better than any Cantonese–English one I’ve seen), so I might goldlist that and see how it goes. It is unfortunately not without errors either, but it’s far more accurate than CantoDict.