End of the Cantonese Goldlisting Project

In March 2015, I posted about my Cantonese experiment, where I used a list of characters and a dictionary to goldlist Cantonese. At first, I thought it was just going to be a slightly inefficient way of going about it, but there were several unforeseen problems that made the experiment a major failure. Now that the project is in the closing stages and I have to wait before I can keep distilling, I wanted to write about what I learned. To be frank, it was more of a waste of time and effort than expected, and I would not do it again.

The whole project has spanned five years—seven if I count the work I did before the actual experiment. While I haven’t actually spent five years on it, this project has caused me to lose more time than what I actually spent working on it. For most of the last five years, I haven’t done much goldlisting. This was partly because of work, but also because all the problems that became apparent really demotivated me. That was probably the worst outcome of the entire project, because I could have finished many shorter projects instead of wasting time on one big one.

Length of the Project

I originally thought it would only take me about 20,000 headlist lines, but I didn’t stop until I reached 40,000. There was still no end in sight, and it was apparent that I had included hundreds (if not thousands) of words I shouldn’t have. There were also many duplicates, because it was hard to keep track of which characters I had already covered.

Lack of Context

I knew this before I started, but I didn’t think lack of context would be such a big problem. I had done small-scale projects like this before, and those worked out well. I sometimes misunderstood what a word meant or how to use it, but then I’d encounter it in the wild and realize the mistake. Although I had no context, the words I was learning were all very basic, so I could always imagine a context and be fairly sure it was accurate. This time, the words ranged from very basic all the way to needlessly advanced, and I had no idea which were which. CantoDict does divide them into five categories, but almost all of the words are put in the middle category, which kind of defeats the purpose.

Source Material

The biggest problem turned out to be my choice of source material. CantoDict is maintained by volunteers who work when they can and want to. Nobody is responsible for checking every single entry to ensure accuracy, so there are many inaccurate entries. I suspect these were added and parsed automatically, because they often lack the changed tones and are listed with their citation tones instead, and others use the wrong reading of one of the characters.

For this reason, it is useful as a reference or second opinion, but should not be relied upon alone (but be warned: many online dictionaries get their Cantonese data directly from CantoDict). Fortunately, the editors do a good job linking to discussions about each entry when they come up on the forums, in which case you do get a second opinion right there.

Early on, I did not realize just how many wrong entries there would be. I used the forums to ask when I suspected something was off, but I didn’t always get a reply there either, so I soon resorted to asking friends instead. Eventually, I got pretty good at spotting suspicious words to ask about, and this revealed another problem, namely that many of the words and expressions were completely unknown to my friends! Often, I’d ask how to pronounce something, and they’d ask me what it was supposed to mean.

The sheer number of such words was the main reason I not only stopped adding to the headlist, but also simply gave up and crossed out hundreds of the words I had left. I crossed out anything that seemed suspicious, although I’d sometimes ask someone to make sure. I’m sure I also crossed out many legitimate words by doing it this way, but it was better than wasting more time learning the useless ones.

Final Stages

The last stages of the project involved filling up the last bronze book and distilling all the words I had left. I had to fill up the book because the CantoDict headlist only reached 38,725 lines, so I had space left over. I filled it up with Teach Yourself Cantonese, Colloquial Cantonese, and Intermediate Cantonese. I wish I had started with these instead of saving them for last, but they also came with their own problems, which I may write about later.

After cutting out all the words I didn’t trust, there wasn’t that much left to distill, but I did it anyway. Now, I’m all caught up and waiting to be able to keep distilling. I estimate that I’ll finish the entire project in less than two months from now and have 16 bronze books, 3 silver books, and 1 gold book to show for it.


This experiment was really a failure. It was meant to be a somewhat inefficient way to achieve a very useful goal, but it turned out not to be. Even if I had used a more efficient method with the same source material, I would just have failed faster (which would have been preferable, but now I know).

That’s not to say I got nothing out of it, though. I did learn thousands of useful words and expressions and I got really good at spotting suspicious information about Cantonese (which did come in handy as I was going through the textbooks at the end—Cantonese learning materials always come with lots of mistakes in them, unfortunately!). Still, it wasn’t worth the wasted time and effort in the end, and I would not do it this way again.

The Way Forward

Although the experiment is over, my Cantonese goldlisting doesn’t have to be. I only got halfway through Intermediate Cantonese before running out of pages in my bronze book. I could start a 17th bronze book and keep going, but I haven’t decided yet. For future projects, I’ll also choose dictionaries that at least have example sentences. I do happen to have one for Cantonese, namely 東方廣東語辭典 (a Cantonese–Japanese dictionary that is better than any Cantonese–English one I’ve seen), so I might goldlist that and see how it goes. It is unfortunately not without errors either, but it’s far more accurate than CantoDict.

著 and 着

Few characters are as quirky as 著. It seems to have been a variant that eventually split off from 箸, and now its own variant 着 is in the process of splitting off from 著. This illustrates a trend in the development of Chinese characters: they often start out as one sign that may represent several different (though usually similar-sounding) syllables and thus several meanings. Then, they split into more signs that share the burden of sounds and meanings associated with them. It’s rare for characters to merge, but there are many examples of such splits.


The Republic of China

It has been assigned 5 common readings in the national language of the Republic of China, or Standard Mandarin: ㄓㄨˋ、ㄓㄨㄛˊ、ㄓㄠˊ、ㄓㄠˉ、˙ㄓㄜ (zhù, zhuó, zháu, zhāu, zhe), as well as 2 that are rare enough to be ignored: ㄔㄨˊ、ㄓㄨˇ (chú, zhǔ). As far as the Ministry of Education of the Republic of China is concerned, 着 is a variant of 著, and that’s the whole story.

ㄓㄨㄛˊ is a merger of two Middle Chinese readings: one with a voiceless initial consonant, and one with a voiced one (cf. Cantonese zhoek³ and zhoek⁶). It is interesting to note that ㄓㄨㄛˊ and ㄓㄠˊ probably both developed from the reading with the voiced initial (one being literary and the other colloquial), but they ended up acquiring different meanings. ㄓㄠˉ is possibly a further development of the latter.

˙ㄓㄜ is a Mandarin-specific particle that had to be written down somehow, and this character did the job.

ㄔㄨˊ is only used in 著雍・著雝 (ㄔㄨˊ ㄩㄥˉ [chúyūng]), an alternate name for 戊, the fifth of the ten heavenly stems (十天干). ㄓㄨˇ is only used in 著任 (ㄓㄨˇ ㄖㄣˋ [zhǔrèn]), but I’m not sure what it means. Most dictionaries ignore these two, for obvious reasons.



In pre-war Japan, the situation was the same: 著 had two Sino-Japanese readings (since Japan never had the ㄓㄨㄛˊ–ㄓㄠˊ–ㄓㄠˉ split or a reading corresponding to the Mandarin particle): チョ and チャク (cho and chaku). Here is the 着 entry from 詳解漢和字典, published just after WWII, which identifies 着 as a vulgar variant of 著.



And here is the 著 entry in the same dictionary:

[Screenshots]


However, after WWII, 著 and 着 were assigned different roles: 著 for チョ, 着 for チャク. Native Japanese readings follow the meanings connected to the Sino-Japanese readings, so 著 for あらはꜜす and いちじるしꜜい (arawaꜜsu and ichijirushiꜜi) and 着 for きる and つꜜく・つくꜜ (kiru and tsuꜜku/tsukuꜜ).


Mainland China

When the Communist Party of China developed their standard, they assigned the two characters the same roles as the Japanese. 著 for ㄓㄨˋ, 着 for ㄓㄨㄛˊ、ㄓㄠˊ、˙ㄓㄜ and ㄓㄠˉ (this last one is also written 招, since they sound the same in Mandarin).



Korean usage is the same as well, suggesting this usage wasn’t just invented after the war. Korea never had any major character reforms, and still 著 is usually reserved for 저 (chŏː) and 着 is usually reserved for 착 (ch’ak). However, either can be used for either reading. In practice, though, most Koreans unfortunately don’t use characters at all anymore.



In Vietnam, characters are unfortunately used even less than in Korea, but dictionaries from the late 1800s suggest trứ and trước were both written 著, as in the Republic of China. Interestingly, there doesn’t seem to be a trược reading (corresponding to a Middle Chinese voiced initial consonant) for this character. The meanings that would be associated with that reading are listed under trước.

著 (trứ) in Bonet’s (1899) Vietnamese–French dictionary:

[Screenshot]


And 著 (trước) in the same dictionary:

[Screenshot]


著 (trứ) in Génibrel’s (1898) Vietnamese–French dictionary:

[Screenshot]


And 著 (trước) in the same dictionary:

[Screenshots]


Modern dictionaries do include 着, and though they seem to prefer the reading trước for it, some also list trứ. 著 is always listed with both readings.


Hong Kong (and Macau)

Until recently, I was under the impression that Hong Kong usage, and presumably Macanese usage as well, differed from all of the above. CantoDict distinguishes them this way: 著 for zhy³ (ㄓㄨˋ) and zhoek³ (ㄓㄨㄛˊ), 着 for zhoek⁶ (ㄓㄨㄛˊ、ㄓㄠˊ、˙ㄓㄜ and ㄓㄠˉ).

著 in CantoDict (24.04.2017)

[Screenshot]


着 in CantoDict (24.04.2017)

[Screenshot]

However, my friend Kumono Shōta showed me the List of Graphemes of Commonly-used Chinese Characters, published by the Hong Kong Education Bureau. I was surprised to learn that people on the CantoDict forums seem to be wrong about the official Hong Kong division of 著 and 着. According to this list, 著 is for zhy³ (ㄓㄨˋ) and 着 is for zhoek³ and zhoek⁶ (ㄓㄨㄛˊ、ㄓㄠˊ、˙ㄓㄜ and ㄓㄠˉ), just like in all the other jurisdictions (except for the Republic of China, of course).


著 in the List of Graphemes of Commonly-used Chinese Characters:


着 in the List of Graphemes of Commonly-used Chinese Characters:


Correspondence List

I made a list with the readings and the meanings the characters (roughly) correspond to!

Mandarin – Cantonese – Japanese – Korean – Vietnamese – English

ㄓㄨˋ – zhy³ – チョ – 저 – trứ – notoriety, authorship

ㄓㄨㄛˊ – zhoek³ – チャク – 착 – trước – to don

ㄓㄨㄛˊ – zhoek⁶ – チャク – 착 – trước – to make contact, to apply

ㄓㄠˊ – zhoek⁶ – チャク – 착 – trước – to ignite, to affect

ㄓㄠˉ – zhoek⁶ – チャク – 착 – trước – (boardgame) move

˙ㄓㄜ – zhoek⁶ – チャク – 착 – trước – stative particle

Cantonese Goldlisting Project

My goldlisting project for this year is an average of 100 lines per day (but I’m currently 18 days ahead of schedule), and the language I’m goldlisting is Cantonese. My goal is 15,000 items, but because I need extra lines for readings, I’ll need to reach about 20,000 headlist lines.

Source and Method

I’m using a great list of characters encountered in the Taiwanese school system, finding the CantoDict page for one character at a time, and goldlisting all the “compounds” that seem worthwhile. This usually means I’ll skip:

  • Things I don’t understand the English translation of.
  • Transparent compound words.
  • Proper nouns that I don’t recognize.

The first few characters in the list have huge lists of words containing them, and the last ones may have only one or two. Some are not even in the dictionary, and others are there, but with no words. So in the beginning, I will be using the same list for a long time, but the further I get, the shorter the lists get. In addition, I keep track of what characters I’ve already goldlisted, so I skip any words containing characters I’ve already done, which further shortens the lists.


In calculating project sizes in the goldlist system, David James multiplies the number of headlist lines by three to get an approximate total number of lines. In my case, the entire project should have 60,000 lines in total, although in reality it will likely have fewer than that. I’m currently at 10,300 headlist lines, but I’ve already started distilling, so I have about 8,000 lines in my distillations too. If I see every item approximately 3 times on average, I’ve finished about 1/3 of the project.

If I keep going at 100 lines per day, the remaining ~40,000 lines should take me ~400 days. But if I can manage to keep going at the current rate of 200 lines per day, I’ll need only ~200 days, which means I could finish before next year, if work and other circumstances allow it.
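The estimates above can be reproduced with a few lines of Python. This is only a rough sketch using the figures quoted in this post; the function names are made up for illustration, and the exact results (417 and 209 days) are what I round to ~400 and ~200.

```python
# Rough progress estimate using the figures quoted in the post.
# All names here are illustrative, not part of any goldlist tool.

HEADLIST_TARGET = 20_000   # planned headlist lines
LINES_PER_ITEM = 3         # David James's rule-of-thumb multiplier


def project_total(headlist_target: int, multiplier: int = LINES_PER_ITEM) -> int:
    """Approximate total number of lines written over the whole project."""
    return headlist_target * multiplier


def days_remaining(total: int, done: int, lines_per_day: int) -> int:
    """Days needed to write the remaining lines, rounded up."""
    remaining = max(total - done, 0)
    return -(-remaining // lines_per_day)  # ceiling division


total = project_total(HEADLIST_TARGET)   # 20,000 * 3
done = 10_300 + 8_000                    # headlist lines + distillation lines so far
fraction_done = done / total

print(f"total lines:   {total:,}")
print(f"fraction done: {fraction_done:.3f}")
print(f"days at 100/day: {days_remaining(total, done, 100)}")
print(f"days at 200/day: {days_remaining(total, done, 200)}")
```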


Initially, I planned on doing the project in 20 batches of 1,000 headlist lines each and just finishing them one by one, but I found out about David’s superior batch system early on and decided to use that instead. If the numbers look scary, don’t worry; I felt completely overwhelmed when I first looked at them too. But it’s actually very easy:

  • Write your headlist for the first batch (in my case 2,000 lines).
  • Distil the headlist from 1–2,000, and continue making the headlist from 2,001–3,900.
  • Return to the beginning of the book, where your D1 (first distillation) starts, and distil it. Now distil the second headlist batch, and finally continue making the headlist from 3,901–5,700.
  • Return to the beginning of the book, where your D2 starts. Distil your way through the book again, distilling one page at a time. Now continue making the headlist from 5,701–6,300.
  • Return to the beginning of the book, where your D3 starts. Distil your way through the whole book. Add 1,500 lines to your headlist.
  • Return to the beginning of the book, where your D4 starts. Distil your way through the whole book. Add 1,400 lines to your headlist.
  • Return to the beginning of the book, where your D5 starts. Distil your way through the whole book. Add 1,300 lines to your headlist.
  • Return to the beginning of the book, where your D6 starts. Distil your way through the whole book. Add 1,200 lines to your headlist.
  • And so on.
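If you want to work out the headlist ranges for your own batch sizes, a small sketch like the following does the bookkeeping. The function name is hypothetical, and the example only uses the first three batch sizes from the steps above (2,000, 1,900 and 1,800 lines); substitute whatever sizes you actually plan to use.

```python
from itertools import accumulate


def headlist_ranges(batch_sizes):
    """Return (first_line, last_line) for each headlist batch, 1-indexed."""
    ends = list(accumulate(batch_sizes))           # running totals = last line of each batch
    starts = [1] + [end + 1 for end in ends[:-1]]  # each batch starts right after the previous one
    return list(zip(starts, ends))


# First three batch sizes from the steps above.
first_batches = [2_000, 1_900, 1_800]

for number, (start, end) in enumerate(headlist_ranges(first_batches), start=1):
    print(f"batch {number}: headlist lines {start:,}-{end:,}")
```

Only the headlist ranges are computed here; the distillation passes between batches still follow the order described in the list.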

Of course, you won’t find a single book that can fit everything, so since my books have 100 sheets and are 35 lines deep, I can fit 2,500 headlist words with three distillations in each if I use the very last page and the very first as if they were a double page. After D3, I have to sample from several pages to make a new list of 25 lines per page in a new book. We call the first book “bronze” and the second “silver” – the next is “gold” and then even “platinum” if you want to keep going, but you may not need to continue once you finish the silver book.
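The capacity figure can be checked with a quick calculation. This sketch assumes 25 headlist lines per double page, which is what 2,500 lines in a 100-sheet book works out to; the function names are invented for the example.

```python
import math

# Assumption: 25 headlist lines per double page (2,500 lines / 100 double pages).
HEADLIST_LINES_PER_SPREAD = 25


def spreads_per_book(sheets: int, use_outer_pages: bool = True) -> int:
    """Number of double pages in a notebook with the given number of sheets.

    A book with n sheets has n - 1 interior double pages. Treating the very
    first and very last page as one more double page brings the count to n.
    """
    interior = sheets - 1
    return interior + (1 if use_outer_pages else 0)


def headlist_capacity(sheets: int, lines_per_spread: int = HEADLIST_LINES_PER_SPREAD) -> int:
    """Headlist lines that fit in one bronze book."""
    return spreads_per_book(sheets) * lines_per_spread


def books_needed(headlist_lines: int, sheets: int = 100) -> int:
    """Bronze books required for a headlist of the given length, rounded up."""
    return math.ceil(headlist_lines / headlist_capacity(sheets))


print(headlist_capacity(100))   # capacity of one 100-sheet book
print(books_needed(20_000))     # bronze books for a 20,000-line headlist
```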

Having explained the system I planned on using, I have to say I ended up not sticking to it after all, and there’s a good reason for that. I can only do the headlist when I have access to my source, and my source is online. Therefore, I decided to do the headlist when I can, since the distillations can be done anywhere (except when I sample from one book to put into another, but at least I don’t need to be online for that). I do think I will follow the batch system for distillations, though, and so far, I have.

Books and Pens

I use these 100-sheet, 35-line Kokuyo Campus notebooks:


Each one can fit 2,500 headlist lines, so I’ll need 8 of them at the bronze stage, and probably 2 at the silver stage. I might just use a smaller book at the gold stage.

I don’t have a specific type of pen that I use, but I try to use comfortable ones. I prefer 0.38 mm, but 0.5 mm pens are OK as well. If you want to try this at home, make sure you stock up on pens, because this really eats them up! I like to use black for the headlist, blue for D1, red for D2, and green for D3, then black for D4, and so on, rotating the colours, but you can do it with any colour you like.