Standard number: ISO 24614-1:2010
Chinese title: 语言资源管理 书面文本的单词分割 第1部分:基本概念和一般原则
English title: Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles
Publication date: 2010-11
Scope
This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and provides language-independent guidelines to enable written texts to be segmented, in a reliable and reproducible manner, into word segmentation units (WSU).
NOTE 1: In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical to have a universal definition of what comprises a word for the purposes of segmenting a text into words. One cannot simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and Japanese, and for agglutinative languages, where some functional word classes are realized as affixes, such as Korean.
The many applications and fields that need to segment texts into words, and thus to which this part of ISO 24614 can be applied, include the following.