February 26, 2014 at 8:53pm
At first, people thought the Sun revolved around us. Then we learned that it is the Earth that orbits the Sun. Yet there is some qualitative difference between science and an internally consistent system of semantic memory.
Now we learn that both the Sun and the Earth orbit a common point in space, the barycentre. Since the Sun has far more mass, this point is ‘pulled’ deep inside the Sun, though not to its very centre.
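A quick back-of-the-envelope check, using the standard textbook values (m_Sun ≈ 1.99 × 10^30 kg, m_Earth ≈ 5.97 × 10^24 kg, mean distance a ≈ 1.496 × 10^11 m):

$$ r = a \cdot \frac{m_{Earth}}{m_{Sun} + m_{Earth}} \approx 1.496 \times 10^{11}\,\mathrm{m} \times \frac{5.97 \times 10^{24}}{1.99 \times 10^{30}} \approx 4.5 \times 10^{5}\,\mathrm{m} \approx 450\,\mathrm{km} $$

With the Sun’s radius at about 696,000 km, the Sun-Earth barycentre sits deep inside the Sun, offset a mere ~450 km from its centre; it takes a planet as heavy as Jupiter to pull the combined barycentre just outside the solar surface.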
February 25, 2014 at 8:49pm
What is fluency?
Empirically, we know that Stephen Hawking “speaks” very fluent English, while Sarah Scott does not.
February 19, 2014 at 8:05pm
A few casual notes on ESL vs. STK
Perhaps due to the small number of speakers, there is more ‘special status’ to go around for the variations of, or deviations from, the idealised standard of the Finnish language; in pluricentric languages, such variants are more readily dismissed as wrong or vulgar.
* Colloquial usage equivalent to ‘gonna’, ‘gotta’, ‘wanna’ is taught alongside the ‘formal versions’ from Lesson 1. These (what we consider to be) phonetic phenomena are treated as if they were additional verb forms to remember.
* There’s also a tendency to ‘grammatise’ or codify these phonotactically convenient events into formal rules. For example, ‘Alvar Aalto’ -> ‘Alvar Aallon’ and the description of ‘consonant gradation’.
English, or in fact Latin, does this on an orthographic level, e.g. ‘inv-’ vs. ‘imp-’… however only phoneticians really care about making explicit the subtler ‘gradations’, such as the ‘n’ in ‘ink’ being realised as [ŋ] rather than a plain [n].
Here I’m gonna throw in a theory, or hypothesis, on the language–appearance parallel:
- Why do Finns look like their neighbours but speak differently?
- Why are Turkish and Japanese geographically so far apart?
Well, the first possibility is that our century-old models of how languages and genes flow are completely pareidolic: seeing patterns where there are none, as a result of heuristic bias whose potency comes from confirmation bias…
Perhaps, instead of a clear-cut, discretely clustered, lineage-based, tree-like mapping of the flows — what happened in (pre)history resembled much more a network-like, incidental, trickle-feeding absorption model.
And when 18th-century pseudo-geneticists, anthropologists, and philologists fixated on a handful of hand-picked traits out of thousands to fulfil the romantic fantasies of their time, some very convenient groupings naturally emerged.
Okay, now, onto the hypothesis:
- Let’s assume that language families are real;
- Say each language followed a group of its core speakers;
- When a group settled down, a ‘homeland’ for that language got established;
- Therefore, the geographical continuity of a language.
- Until the group was displaced, dispersed, or absorbed by other groups.
And my theory or hypothesis goes:
- In certain ‘pockets’, due to bleak climate or geographical barriers, interruptions from other groups were rare;
- Gene flow came from trickle-feeding from the group’s neighbours;
- Since the absorption was slow and gradual and bit-by-bit, the small numbers of incomers were absorbed and assimilated into the group linguistically.
- But over time, the genetic frequencies from the absorption accumulated — a slow but steady process.
- And since the skin-deep “grouping criteria” based on appearances may be from only a handful of alleles — the group would start to look like their neighbours when the frequencies reach a relative threshold.
- Which means this could happen in not that many generations, while the group still retains, by and large, the core of its language, with some borrowed elements.
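To put toy numbers on that threshold idea, here is a minimal sketch; the 2% per-generation inflow is an invented illustrative figure, not an estimate from any population study:

```python
# Minimal sketch of the trickle-feed absorption model.
# Assumption (invented for illustration): each generation, a fixed
# fraction `m` of the group are assimilated incomers who adopt the
# language but carry the neighbours' appearance-related alleles.

def neighbour_fraction(m: float, generations: int) -> float:
    """Share of the gene pool traceable to neighbours after t generations."""
    return 1.0 - (1.0 - m) ** generations

for t in (10, 25, 35, 50):
    print(f"after {t:2d} generations: {neighbour_fraction(0.02, t):.0%}")
# after 10 generations: 18%
# after 25 generations: 40%
# after 35 generations: 51%
# after 50 generations: 64%
```

At roughly 25 years per generation, 35 generations is under a millennium: on these made-up numbers the group crosses the 50% mark on appearance-related allele frequencies while the language itself has been transmitted continuously the whole time.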
A theoretical framework
* A language is made up of smaller components: phonology, morphology, etc.
* These components form various levels of expressions (words, phrases, sentences) based on rules
* These rules may be generative, or they may be branching exceptions, or they may be idiomatic exceptions.
* Think of exceptions as rules that apply to a more specific subset. Even morphemes and their meaning can be considered as rules.
* The frequency of these rules in a natural language shows a non-linear distribution
* The process of learning a language is to internalise these rules to a point of being able to interpret and generate expressions
* We learn rules based on Bayesian evidence. Exposure to comparative materials gradually builds a map of which rules are validated with what certainty.
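A minimal sketch of that last point, treating each rule as a Beta-Binomial hypothesis; the rule name and the exposure counts are invented for illustration:

```python
# Sketch: certainty about a candidate rule as a Beta posterior,
# updated by exposures that either conform to or violate the rule.

from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    conforming: int = 0   # exposures where the rule held
    violating: int = 0    # exposures where it did not (exceptions)

    def certainty(self, prior_a: float = 1.0, prior_b: float = 1.0) -> float:
        """Posterior mean of a Beta(prior_a, prior_b) belief in the rule."""
        return (prior_a + self.conforming) / (
            prior_a + prior_b + self.conforming + self.violating
        )

# Invented counts: the English "-ed" past tense, dented by irregulars.
past_ed = Rule("V -> V+ed", conforming=120, violating=18)
print(f"{past_ed.name}: certainty {past_ed.certainty():.2f}")  # ~0.86
```

The non-linear frequency distribution then just means that a few rules pile up evidence very quickly, while the long tail of exceptions is validated slowly, one encounter at a time.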
How can we visualize a conversation? Francis Lam is a new media artist from MIT who created the above visualization to represent the kinds of shapes our exchanges can take. According to Lam’s professor at MIT, Judith Donath, the image has “a subtly prescriptive air…one…
December 20, 2013 at 3:17am
Finnish vocabulary is in fact quite logical:
land-ball = Earth
word-set = glossary
book-set = library
air-set = climate
land-air-all-ness = cosmos
big-big-ma = great-grandma
tooth-bump = surprise
flying-machine = airplane
knowing-machine = computer
November 16, 2013 at 12:28pm
Can we quantify the vocabulary of a language?
I would say the practical answer is yes.
Bear in mind that there are qualitative differences embedded in the quantity; it’s not just linear addition. This is like asking how many apples you have. There may be larger ones and smaller ones, different varieties, some ripening, some decaying to the point of bordering on becoming something else…
But to think of this question as an engineering problem, we can simplify “languages” as shared sets of vocabularies. Naturally there’s a specialising effect: printers know more about fonts, zoologists are familiar with animals, and Finns may have more words for snow. But then again, if you think about learning a second language like French or Spanish or German — you would aim for a set of vocabulary that will allow you to “use that language”.
Then you can imagine the job of lexicographers: making dictionaries:
- Why does someone refer to a dictionary? Because he encountered some word that he doesn’t know…
- And the dictionary contains an entry that gives him that information.
- There is a frequency effect: if there’s a word you don’t know, it’s probably a rarer word.
- Then we can have dictionaries of different sizes: a pocket-sized one, or a very big one; how many words to include is prioritised by frequency.
And this progression from a 20,000-word dictionary to a 100k dictionary is not linear. For a beginner, a thin dictionary is a good and practical goal. It’s less likely that you would encounter a word that you couldn’t find in there. Then as you become more advanced, you might encounter rarer words where you need a bigger dictionary. Until at some point, you might be standing at the frontier of a very niche specialised sub-field, and you need to invent some new words.
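This non-linearity falls straight out of a Zipfian frequency distribution. A minimal sketch, assuming an idealised 1/rank Zipf law over an invented inventory of 500,000 word types (not fitted to any real corpus):

```python
# Sketch: what fraction of running text a top-N dictionary covers,
# assuming word frequencies follow an idealised Zipf law (1/rank).

from itertools import accumulate

TYPES = 500_000                     # invented total inventory of word types
cum = list(accumulate(1.0 / rank for rank in range(1, TYPES + 1)))

def coverage(top_n: int) -> float:
    """Share of tokens accounted for by the top_n most frequent words."""
    return cum[top_n - 1] / cum[-1]

for n in (1_000, 20_000, 100_000):
    print(f"top {n:>7,} words cover {coverage(n):.0%} of tokens")
# top   1,000 words cover ~55% of tokens
# top  20,000 words cover ~77% of tokens
# top 100,000 words cover ~88% of tokens
```

Each jump in dictionary size buys a smaller and smaller gain in coverage, which is exactly the 20,000-to-100k experience described above; the last few percent are the specialist frontier.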
The level of competency in a second language learning setting is well studied — I suppose the same standard would just move on a continuum for native speakers as well (say, developmentally).
When you look at the Common European Framework of Reference for Languages, the question as an engineering problem becomes simpler:
To reach an equivalent level of C2, what is the size of the dictionary one needs to master?
We may say it’s quite viable to quantify the “C2-Equivalent Vocabulary Size”.
A corpus-centric sociolinguistic approach is helpful in surfacing many of the issues this question faces, but it does not lead to an answer.
I still think this question is most relevant to the field of lexicography: the practical art of making a dictionary. It concerns itself with all the variety and variability of words, minus morphology, syntax, orthography, and all those other fields of linguistic competence or technology.
Again, let’s consider a dictionary for an adult native speaker who already has the linguistic competence:
- Why would he refer to a dictionary?
- Because he encountered a ‘word’ whose meaning cannot be derived from morphology, syntax, compositional analysis, or logical screening, e.g. “Ah, that’s an onomatopoeic sequence”, or “That’s a typo”, or “That’s just some gibberish”…
Let’s consider the “principle of compositionality”:
- If I say: “There’s a pink dog in the living room.”
- You would know what I mean and be able to imagine the situation.
- Even if I say: “There’s a pink-ish doggie in the living room.” It doesn’t really change much, you still get it. Because you have those faculties for morphological and syntactic processing.
But if someone says: “There’s a DALEK in the TARDIS.” Then, a competent but naïve speaker will have to refer to a dictionary, because the meanings cannot be deduced from other linguistic processes.
After those lexical entries are resolved, you can easily move to “There’s a dalekishness in his tardismanship.” Right? Because these are not new lexical entries, they are just new lexical compositions.
As shown in the example of “naikimlyiia”, a verb can have half a million forms. In English, there’s “antidisestablishmentarianism”. It’s not sensible to cover infinity with morphological look-up tables.
There must be an algorithmic solution: you put a string through, and the function breaks it up into sensible components.
Then for speed, we can cache the common look-ups in a table.
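A minimal sketch of that architecture; the affix inventory here is a toy set rigged for one famous word, nothing like a real analyser:

```python
# Sketch: rule-based decomposition, with a cache in front so that
# frequent look-ups cost a single table hit instead of a re-analysis.

from functools import lru_cache

PREFIXES = ("anti", "dis")          # toy rule set, invented for this demo
SUFFIXES = ("ism", "arian", "ment")
STEMS = {"establish"}

@lru_cache(maxsize=None)   # the "common look-ups cached in a table" step
def analyse(word: str):
    """Break a string into prefixes + stem + suffixes, or return None."""
    if word in STEMS:
        return (word,)
    for p in PREFIXES:
        if word.startswith(p) and (rest := analyse(word[len(p):])):
            return (p + "-",) + rest
    for s in SUFFIXES:
        if word.endswith(s) and (rest := analyse(word[:-len(s)])):
            return rest + ("-" + s,)
    return None

print(analyse("antidisestablishmentarianism"))
# ('anti-', 'dis-', 'establish', '-ment', '-arian', '-ism')
```

The recursion is the algorithmic function; lru_cache is the table. In a real system the rules would of course be weighted and ambiguous, with several candidate analyses scored against each other.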
Also, in learning, words should be used as examples (or Bayesian evidence) for rules (lexical hypotheses), rather than each as a new separate rule.
In some languages, a whole sentence can be just a single word:
Polysynthetic languages typically have long “sentence-words” such as the Yupik word tuntussuqatarniksaitengqiggtuq which means “He had not yet said again that he was going to hunt reindeer.”
See: Polysynthetic language
What cannot be deduced compositionally, becomes lexicalised, e.g. “He kicked the bucket.” It’s no longer the composition of ‘kick’ and ‘bucket’.
Before man wrote the first tokeniser, there was already vocabulary. Just like in a dictionary, its size can be quantified: the number of lexical entries can be counted, and the number of senses listed under each entry is finite.
If we could compute a frequency distribution not just over text strings and lemmas, but also accounting for the senses, then we’d have a good picture of the cumulative curve, and at some point it must approach a limit.
Indeed, there are more variables to be accounted for, such as how transparent or opaque a derivation is, each modulated by its own frequency effect, e.g. if you knew ‘ridonculous’ is from ‘ridiculous’ and ‘donkey’ and you had heard it used a number of times in the last year…
These variables can still be modelled, and quantified.
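And here is a minimal simulation of why the cumulative curve should flatten, drawing tokens from a finite, Zipf-weighted sense inventory; the inventory size of 50,000 is invented:

```python
# Sketch: distinct (lemma, sense) pairs observed vs. tokens read,
# with tokens drawn from a finite Zipf-weighted sense inventory.

import random
from itertools import accumulate

SENSES = 50_000                                 # invented finite inventory
cum = list(accumulate(1.0 / r for r in range(1, SENSES + 1)))
draws = random.Random(0).choices(range(SENSES), cum_weights=cum, k=1_000_000)

seen = set()
for i, sense in enumerate(draws, start=1):
    seen.add(sense)
    if i in (10_000, 100_000, 1_000_000):
        print(f"{i:>9,} tokens read: {len(seen):>6,} distinct senses seen")
```

The count climbs quickly at first and then crawls, bending towards the size of the finite inventory: that flattening is the limit in question.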
November 15, 2013 at 7:13pm
Thinking about typical Broca’s aphasia and tip-of-the-tongue (ToT) problems… what would help the word-search? I remember the Mi’kmaq method of having a lot of photos on the wall right there in the classroom. You can say “a man”, “an old man”, “an old man is fishing” and expand almost indefinitely.
When students are producing the sentences, they always have the reference of the pictures on the wall. Would a similar method help adult second language learners? What would the cue cards or hints really do?
One thing to note is that mutual intelligibility and ‘closeness’ may be two closely related but distinct things.
Here’s a project studying the mutual intelligibility of European languages:
Mutual intelligibility of closely related languages in Europe: linguistic and non-linguistic determinants
Today I was having lunch with a French mathematician friend who works for a linguistic data company, and this question came up. Here is a summary of the main points:
- If ANOVA can be used to analyse basic stats on a handful of samples, what would one use to study a massive and complex system? Your English differs from your parents’ English, and AusE from AmE from SSBE. Then there’s Dutch, German, etc. What modelling tools can we use to compare one person’s “language” to someone else’s?
- Let’s consider a simplified problem. Say there are 1,000 lexical items (or concepts referring to actual entities or ideas in the world, e.g. APPLE, ORANGE). Then let’s assume that adults across the globe have the same mental models or ideas of these items; they just have different names for them: you call it ‘dog’, I call it ‘Hund’, and so on.
- Such differences also occur within the Englishes, to a somewhat amusing extent.
- Then, can we compare two languages simply by counting how many of these word-meaning pairs overlap? Say we take the 1k most frequent items from BrE and ask an American what they would call them; if 95% come up the same, then we say the two are very similar… (see the sketch at the end of this entry)
- This is complicated by two issues. One is that you can deduce meaning from context. This is why AmE and BrE speakers can still talk to each other: capsicum? Oh, you mean pepper! Because most of the other words and other linguistic aspects are intelligible. If everything were scrambled, it would be like a foreign language.
- The other issue is that, though we theorise that every speaker has an internal mental model of his or her own “English”, when two speakers converse, they adjust their output to accommodate and facilitate the interaction. For example, if a Scot speaks with a Kiwi, they both modify their phonology to make it easier to understand. And when they speak to a non-native speaker, they use simpler, more common words, and lower their speed.
- This means that both the sampling and the comparison must be very dynamic…
- Then we take it one step further and consider derivational changes: when an Australian says ‘brekkie’, you may wonder: what? Then you learn it’s ‘breakfast’. The two terms share a common root. But can we say the same about the descendants of Latin?
- One way to consider this is that all generative approaches assume the evolution and transitions are rule-based: under Grimm’s Law, ‘p’ systematically becomes ‘f’, and so on. Then you can view the changes from one word to another as a chain of rule-based events (like a Levenshtein algorithm; see the sketch at the end of this entry). Then we might say that the more steps in between, the more distant the two are. But this is always gonna be somewhat fuzzy.
- Another point to consider is lexical borrowing: though the French for ‘attention’ is ‘attention’, English speakers are generally unaware of this. How do we account for that relatedness, then? A silly question: for speakers of these languages, if two things are related but you don’t know they are, are they still related?
- Then we could consider the non-lexical aspects, say how syntax changed from Dutch to Afrikaans, or how phonology changed from German to English. We would assume these are also rule-based chains of steps…
- The German for ‘orange’ is ‘Orange’; you just have phoneme-conversion rules and stress-determination rules. Then the distance can be calculated by counting and weighing up the steps.
But whether some researchers have already tried these, I’m yet to find out.
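Still, to make the two counting ideas above concrete, here is a minimal sketch: word-meaning overlap on one hand, step-counting between cognates on the other. The three-concept ‘lexicons’ are invented stand-ins for the 1k-item lists, and plain Levenshtein distance stands in for a proper chain of weighted sound laws:

```python
# Sketch: two toy distance measures between language varieties.

def overlap(lex_a: dict, lex_b: dict) -> float:
    """Share of shared concepts for which both varieties use the same word."""
    shared = [c for c in lex_a if c in lex_b]
    same = sum(lex_a[c] == lex_b[c] for c in shared)
    return same / len(shared)

def edit_steps(a: str, b: str) -> int:
    """Levenshtein distance: a crude proxy for a chain of rule-based changes."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev[j] + 1, row[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = row
    return prev[-1]

# Invented mini-lexicons (CONCEPT -> word):
br = {"CAPSICUM_FRUIT": "pepper", "DOG": "dog", "AUBERGINE": "aubergine"}
am = {"CAPSICUM_FRUIT": "bell pepper", "DOG": "dog", "AUBERGINE": "eggplant"}
print(f"BrE/AmE overlap on this toy sample: {overlap(br, am):.0%}")  # 33%
print(f"'pater' -> 'father' edit steps: {edit_steps('pater', 'father')}")  # 2
```

Of course, plain Levenshtein treats ‘p’ -> ‘f’ the same as any arbitrary substitution; a serious model would weight each step by attested sound laws like Grimm’s, and would have to handle the dynamic accommodation problem above as well.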