The Syllable Inventory of Basay: Native Vocabulary (source=B)

A Source-Separated Analysis of the Phonological System

Author: Tsai, Yung-kuei (蔡永桂)
Date: June 23, 2026
Type: Original research (linguistic typology / quantitative phonology)
License: CC BY 4.0
Citation ID: basay.tw/research/2026-06-basay-syllable-B/

Abstract

This paper presents a syllable inventory of Basay, an extinct Austronesian language of northern Taiwan, based exclusively on native vocabulary entries (source=B, 1,117 entries) drawn from the Basay lexical database. A prior analysis pooling all source types (B, T, M) yielded 486 syllable types; the present source-separated analysis identifies 266 syllable types and 22 onset categories from the native vocabulary alone. CVC is the dominant syllable structure (134 types, 50%), overturning the CV-dominant picture produced by the mixed analysis. The onsets h, /ʃ/ (s'), /tʃ/ (ts'), and sj are distinctive to native vocabulary and absent from the Yilan dialect data (T+M). Conversely, q, z, /ɮ/ (z'), and /ɭ/ (l'), which appeared prominently in the mixed inventory, are entirely absent from source=B, indicating that these phonemes do not belong to the core Basay phonological system.

Keywords: Basay, native vocabulary, syllable inventory, source separation, phonological system, Formosan languages

📚 Cite this article

APA:

Tsai, Y.-k. (2026). The syllable inventory of Basay: Native vocabulary (source=B) — A source-separated analysis of the phonological system. basay.tw. https://basay.tw/research/2026-06-basay-syllable-B/en/

BibTeX:

@misc{tsai2026syllableB_en,
  author = {Tsai, Yung-kuei},
  title  = {The Syllable Inventory of {Basay}: {Native} Vocabulary (source=B)},
  year   = {2026},
  month  = {6},
  url    = {https://basay.tw/research/2026-06-basay-syllable-B/en/}
}

1. Introduction

Basay is a Formosan Austronesian language formerly spoken by the Basay, a Pingpu (plains indigenous) people of northern Taiwan. Lexical records survive in Dutch colonial sources and Qing dynasty documents; the language ceased to have living speakers by the early twentieth century (Li 1996, 2000). Documentation and revitalization efforts continue at the Institute of Linguistics, Academia Sinica.

A preceding study (the mixed analysis) pooled all non-PAN entries (2,364) from the lexical database and reported 486 syllable types. The database contains entries from several distinct source types, listed in the table below, which represent different dialect layers and contact situations. Pooling them without separation conflates distinct phonological systems under a single "Basay" label.

| Source | Entries | Content |
|---|---|---|
| B | 1,117 | Native Basay vocabulary |
| T | 588 | Trobiawan dialect (Yilan area) |
| M | 541 | Trobiawan (vocabulary-only collection) |
| S | 113 | Suspected Kavalan admixture (excluded) |
| V | 5 | Unidentified (excluded) |
| PAN | 960 | Proto-Austronesian reconstructions (excluded) |

This paper analyzes source=B exclusively. The T+M Yilan dialect data and the contact-with-Kavalan hypothesis are treated in a companion paper.
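The separation step can be sketched as a simple filter on the source tag. This is a minimal illustration, not the author's pipeline: the entry counts come from the source table above, while the record layout (the field names `form` and `source`) and the placeholder forms are assumptions, since the paper does not describe the database schema.

```python
# Entry counts per source tag, copied from the source table above.
SOURCE_COUNTS = {"B": 1117, "T": 588, "M": 541, "S": 113, "V": 5, "PAN": 960}

# The mixed analysis pooled every non-PAN entry.
non_pan_total = sum(n for tag, n in SOURCE_COUNTS.items() if tag != "PAN")


def select_sources(records, keep):
    """Return the records whose 'source' tag is in `keep`."""
    keep = set(keep)
    return [r for r in records if r["source"] in keep]


# Placeholder records: the forms are invented strings, not attested Basay
# words, and the field names are assumptions about the schema.
records = [
    {"form": "form1", "source": "B"},
    {"form": "form2", "source": "T"},
    {"form": "form3", "source": "M"},
    {"form": "form4", "source": "B"},
    {"form": "form5", "source": "PAN"},
    {"form": "form6", "source": "S"},
]

native = select_sources(records, {"B"})      # analyzed in this paper
yilan = select_sources(records, {"T", "M"})  # left to the companion paper

print(non_pan_total)            # 2364
print(len(native), len(yilan))  # 2 2
```

Filtering before any counting keeps the S, V, and PAN entries from leaking into either inventory.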


2. Method

Syllable extraction followed the procedure of the preceding study, with two refinements. First, the syllable structure classification now distinguishes onset-less syllables: forms without an onset consonant are classified as V-type (V, VC, VV, VVC) rather than being conflated with onset-bearing types. Second, complex onset clusters (two or more distinct consonant phonemes) are classified as "other" and kept distinct from multi-character spellings of a single phoneme, such as ts, n', s', l', z', and ts'.
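One way to picture the two refinements is a longest-match tokenizer that maps each orthographic unit to C or V and then looks the resulting skeleton up in the list of recognized types. The sketch below is an assumption-laden illustration rather than the procedure actually used: the vowel set, the treatment of sj as a single unit, and the greedy reading of o' as schwa (rather than o plus a glottal stop) are all choices made here for the example.

```python
# Orthographic units and their class. Multi-character units spell one phoneme.
# Treating "sj" as a single unit and always reading "o'" as schwa are assumptions.
UNITS = {
    "ts'": "C", "ts": "C", "n'": "C", "s'": "C", "l'": "C", "z'": "C", "sj": "C",
    "o'": "V", "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
}
# Try longer units first so that "ts'" wins over "ts" and "o'" over "o".
ORDERED_UNITS = sorted(UNITS, key=len, reverse=True)

KNOWN_TYPES = {"V", "VC", "VV", "VVC", "CV", "CVC", "CVV", "CVVC"}


def skeleton(syllable):
    """Return the C/V skeleton of an orthographic syllable."""
    kinds = []
    i = 0
    while i < len(syllable):
        for unit in ORDERED_UNITS:
            if syllable.startswith(unit, i):
                kinds.append(UNITS[unit])
                i += len(unit)
                break
        else:
            # Any other character (b, k, l, ..., or coda ' for the glottal
            # stop) counts as one consonant.
            kinds.append("C")
            i += 1
    return "".join(kinds)


def classify(syllable):
    """Map a syllable to one of the recognized types, or 'other'."""
    shape = skeleton(syllable)
    return shape if shape in KNOWN_TYPES else "other"


for s in ["a", "at", "ai", "oat", "la", "no'", "man", "ts'i"]:
    print(s, classify(s))
```

With these assumptions no' and ts'i both come out as CV, and any syllable with a two-consonant onset falls through to "other".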

Table 1. Orthography–IPA correspondence for source=B

| Orthography | IPA | Description |
|---|---|---|
| n' | ŋ | Velar nasal |
| s' | ʃ | Palato-alveolar fricative |
| o' | ə | Mid central vowel (schwa) |
| ' (coda) | ʔ | Glottal stop (syllable-final coda) |
| ts | ts | Alveolar affricate |
| ts' | tʃ | Palato-alveolar affricate |
| sj | sj | Palatal fricative variant |
| j | j ~ dʒ | Approximant or affricate |

Note: source=B contains no occurrences of /ɭ/ (l'), /ɮ/ (z'), /q/, or /z/.


3. Results

3.1 Overall Statistics

| Parameter | Value |
|---|---|
| Entries analyzed | 1,117 |
| Syllable types (frequency ≥ 2) | 266 |
| Onset categories | 22 |
| Highest-frequency syllable | la (148 occurrences) |
| High frequency (≥ 50) | 12 types |
| Mid frequency (10–49) | 53 types |
| Low frequency (2–9) | 201 types |
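The frequency bands amount to a threshold function over per-syllable token counts. A minimal sketch follows, assuming the counts are held in a plain dictionary; apart from la (148), the toy counts are invented for illustration.

```python
from collections import Counter


def band(freq):
    """Assign a token frequency to a band; singletons are not counted as types."""
    if freq >= 50:
        return "high"
    if freq >= 10:
        return "mid"
    if freq >= 2:
        return "low"
    return None


def band_sizes(token_counts):
    """Count how many syllable types fall into each band."""
    sizes = Counter()
    for freq in token_counts.values():
        label = band(freq)
        if label is not None:
            sizes[label] += 1
    return sizes


# Toy input: only la = 148 is taken from the paper.
toy_counts = {"la": 148, "ka": 12, "at": 2, "zzz": 1}
print(band_sizes(toy_counts))

# The three reported band sizes add up to the reported inventory size.
reported = {"high": 12, "mid": 53, "low": 201}
print(sum(reported.values()))  # 266
```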

3.2 Syllable Structure Distribution

Note: o' represents the single vowel /ə/; syllables such as no' /nə/ and ko' /kə/ are classified as CV, not CVC.

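| Structure | Types | % | Description |
|---|---|---|---|
| V | 4 | 1.5% | Vowel nucleus only (a, i, o, u) |
| VC | 1 | 0.4% | Vowel + coda (at) |
| VV | 2 | 0.8% | Diphthong nucleus (ai, au) |
| VVC | 1 | 0.4% | Diphthong + coda (oat) |
| CV | 75 | 28.2% | Basic canonical type |
| CVC | 134 | 50.4% | Most frequent type |
| CVV | 36 | 13.5% | Diphthong nucleus |
| CVVC | 7 | 2.6% | Diphthong + coda |
| other | 6 | 2.3% | Onset clusters etc. |
| Total | 266 | 100% | |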
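The percentage column follows directly from the type counts. As a small arithmetic check, the sketch below recomputes each share from the counts listed in the table; the variable names are illustrative only.

```python
# Type counts per syllable structure, copied from the table above.
STRUCTURE_TYPES = {
    "V": 4, "VC": 1, "VV": 2, "VVC": 1,
    "CV": 75, "CVC": 134, "CVV": 36, "CVVC": 7,
    "other": 6,
}

total_types = sum(STRUCTURE_TYPES.values())

# Share of the inventory held by each structure, rounded to one decimal place.
shares = {
    structure: round(100 * count / total_types, 1)
    for structure, count in STRUCTURE_TYPES.items()
}

# Structures ordered from most to least frequent.
ranking = sorted(STRUCTURE_TYPES, key=STRUCTURE_TYPES.get, reverse=True)

print(total_types)
for structure in ranking:
    print(structure, STRUCTURE_TYPES[structure], shares[structure])
```

Because each share is rounded separately, the rounded values need not add up to exactly 100.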

3.3 Onset Distribution

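| Onset | IPA | Types | Tokens | Key syllables |
|---|---|---|---|---|
| (none) | | 8 | 41 | a, i, o, u |
| b | b | 19 | 207 | ba, be, bu |
| h | h | 20 | 138 | ha, hi, he |
| j | j~dʒ | 6 | 44 | ja, jen |
| k | k | 23 | 228 | ka, ke, ku |
| l | l | 32 | 434 | la, li, lu |
| m | m | 21 | 215 | ma, man, mu |
| n | n | 20 | 211 | na, nan, nu |
| n' | ŋ | 3 | 17 | n'a, n'o |
| p | p | 22 | 243 | pa, pu, pi |
| r | r | 10 | 65 | ra, ri, ru |
| s | s | 30 | 476 | se, sa, su |
| s' | ʃ | 3 | 16 | s'i, s'a |
| sj | sj | 3 | 13 | sja, sje |
| t | t | 26 | 328 | te, ta, ti |
| ts | ts | 8 | 60 | tsa, tse |
| ts' | tʃ | 2 | 9 | ts'i, ts'a |
| v | v | 2 | 5 | va, ve |
| w | w | 5 | 52 | wa, wan |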
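A per-onset tally of this kind can be produced by taking everything before the first vowel as the onset and summing types and tokens per onset. The sketch below is illustrative only: the list of multi-character onset spellings and the vowel set are assumptions, and every toy count except la (148) is invented.

```python
from collections import defaultdict

# Multi-character spellings assumed to represent a single onset phoneme.
SINGLE_PHONEME_ONSETS = {"ts'", "ts", "sj", "n'", "s'"}
VOWELS = set("aeiou")


def onset(syllable):
    """Return the onset spelling, '' if there is none, or 'other' for clusters."""
    i = 0
    while i < len(syllable) and syllable[i] not in VOWELS:
        i += 1
    head = syllable[:i]
    if len(head) <= 1 or head in SINGLE_PHONEME_ONSETS:
        return head
    return "other"


def tally_onsets(token_counts):
    """Map each onset to a (types, tokens) pair."""
    types = defaultdict(int)
    tokens = defaultdict(int)
    for syllable, freq in token_counts.items():
        key = onset(syllable)
        types[key] += 1
        tokens[key] += freq
    return {key: (types[key], tokens[key]) for key in types}


# Toy input: only la = 148 is taken from the paper.
toy_counts = {"la": 148, "li": 40, "ts'i": 5, "a": 20, "pra": 3}
print(tally_onsets(toy_counts))
```

Here "types" counts distinct syllables and "tokens" sums their occurrences, which is the distinction the two numeric columns of the table draw.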

4. Discussion

4.1 CVC Dominance: A Corrected Picture

The finding that CVC is the dominant structure (134 of 266 types, about 50%) represents a significant correction to the mixed analysis. Formosan languages generally retain syllable-final consonants inherited from Proto-Austronesian, in contrast to the coda-less tendencies of Polynesian languages (Blust 1999). Source=B's CVC dominance aligns with this conservative Formosan profile and suggests that the "CV-dominant Basay" picture of the mixed analysis was an artifact of mixing in the T+M data.

4.2 Palatality in Native Vocabulary

The onsets s' /ʃ/, ts' /tʃ/, and sj are present in source=B but entirely absent from T+M, which indicates a palatal contrast in the native phonological system. A parallel palatal series is attested in Atayal (Li 1996), suggesting a possible areal feature of the northern Taiwan linguistic zone.

4.3 Absent Phonemes and Their Implications

The complete absence of /q/, /z/, /ɮ/ (z'), and /ɭ/ (l') from source=B — phonemes that featured prominently in the mixed analysis — indicates that these do not belong to the core Basay phonological inventory. Their appearance in the mixed data is attributable to T+M Yilan dialect entries, where they are present at high frequency and are hypothesized to reflect contact with Kavalan (see companion paper).


5. Conclusion

The source=B-only analysis yields a syllable inventory of 266 types across 22 onset categories, with CVC as the dominant structure (50%). The findings correct three erroneous claims of the mixed analysis: (1) CV dominance, (2) the presence of /q/, /z/, /ɭ/, /ɮ/ as native phonemes, and (3) the 486-type inventory size. Source separation is thus a methodological prerequisite for the phonological description of multilayered documentary lexical databases.

References

