Monday, November 14, 2005

A few PHP books and some PHP-related resource links

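Related links:

http://www.php.net  The official PHP site, with very comprehensive PHP information; you can get help here.
http://pear.php.net  The official PEAR home page.
http://smarty.php.net  The official home page of the Smarty template engine.
http://news.php.net  A lively PHP discussion community.
http://bugs.php.net  Report the PHP bugs you find here.
http://snaps.php.net/  The latest PHP source code and archives can always be found here.
http://cvs.php.net
http://qa.php.net
http://www.zend.com  The official Zend site.
http://www.mysql.com  The official MySQL site.
http://www.apache.org  The official Apache site.
http://www.phpe.net  超越 PHP (Beyond PHP), with many classic articles and PHP class downloads.
http://expert.csdn.net  Has a popular PHP discussion board, from which I have learned a great deal.
http://blog.csdn.net/countstars/  My blog; if you have any questions or suggestions, you can reach me here.
http://www.openphp.cn  Lastly, a PHP technical site I am currently writing. I hope to finish it in the near future and offer more PHP resources to people who like PHP.

Below are some PHP-related books that I have read and consider very good. They all sit next to my computer. Having some books on hand is essential when learning PHP, even though the Internet already offers a great many PHP resources.

1. 《PHP 程序设计》 (Programming PHP)
Publisher: 中国电力出版社 (China Electric Power Press); original publisher: O'Reilly & Associates, Inc.
Authors: Rasmus Lerdorf and Kevin Tatroe (US); translated by 邓云佳 et al.
Published: July 2003. Price: ¥68. Word count: 519,000. Pages: 544.
Notes: One of the authors is Rasmus Lerdorf, the creator of PHP. The book explains deep topics in simple terms, offers many good suggestions and techniques, and distills the authors' years of PHP development experience. A very good PHP book; some of the experience and tips summarized in it are simply excellent.

2. 《PHP & MYSQL Web数据库应用开发指南》 (Web Database Applications with PHP & MySQL)
Publisher: 中国电力出版社 (China Electric Power Press); original publisher: O'Reilly & Associates, Inc.
Authors: Hugh E. Williams et al.; translated by 谢君英 and 欧阳宇.
Published: May 2003. Price: ¥69.00. Word count: 570,000. Pages: 599.
Notes: The book uses one well-chosen example to show how PHP and MySQL interact, including normalized database design. I recommend it because it works through a concrete application and uses a variety of techniques to handle the interaction between PHP and a MySQL database. You do need some PHP background before reading it. It is a five-star book on Amazon, and a second edition is already out. Well worth reading: once you have built the system along with the authors, you will find that your PHP skills have improved a great deal.

3. 《PHP 经典实例》 (PHP Developer's Cookbook)
Publisher: 中国电力出版社 (China Electric Power Press); original publisher: Pearson Education.
Authors: Sterling Hughes et al.; translated by 徐牧 et al.
Published: April 2003. Price: ¥39.00. Word count: 536,000. Pages: 359.
Notes: One of the authors is also a member of the PHP development team. This is not a systematic PHP textbook; instead it uses many very useful examples to present PHP's features, with a great deal of experience, summaries, and tips. It explores many aspects of PHP in depth. Not a thick book, but truly valuable.

4. 《MySQL 核心编程》 (Core MySQL: The Serious Developer's Guide)
Publisher: 清华大学出版社 (Tsinghua University Press); original publisher: Pearson Education.
Author: Leon Atkinson (US); translated by 周靖 and 许青松.
Published: 2003-4-1. Price: ¥69.00. Pages: 552.
Notes: A good MySQL reference is the official MySQL manual. I own several MySQL books, one of them from O'Reilly, which I did not find very good, so I bought this systematic MySQL book as well. I read it in one go over a morning. It is easy to follow, and I especially liked the discussion of database design through the normal forms, the detailed coverage of built-in functions, and the detailed explanation of statements. A very good book, although it does contain a few errors.

5. 《JavaScript 权威指南(第四版)》 (JavaScript: The Definitive Guide, Fourth Edition)
Publisher: 机械工业出版社 (China Machine Press); original publisher: O'Reilly & Associates, Inc.
Author: David Flanagan; translated by 张铭泽 et al.
Published: January 2003. Price: ¥99. Word count: 964,000. Pages: 1015.
Notes: A classic JavaScript book, now in its fourth edition. If you are serious about Web development, a systematic JavaScript book is a must. JavaScript can do a lot for you in some tasks and makes it easier to control HTML. In particular, I learned regular expressions from this book, which covers them in great detail.

PHP Blog

Personal Blog
Aaron Wormus : http://www.wormus.com/aaron/
Adam Trachtenberg : http://www.trachtenberg.com
Andrei Zmievski : http://www.gravitonic.com
Avenger : http://blog.phpe.net
Bitflux : http://blog.bitflux.ch
Binzy Wu : http://0926.net/blog/
Chris Shiflett : http://shiflett.org
David Sklar : http://www.sklar.com/blog/
Derick Rethans : http://derickrethans.nl
George Schlossnagle : http://www.schlossnagle.org/~george/blog/
EasyChen : http://blog.ibkmk.com
Harry Fuecks : http://www.sitepoint.com/blog-view.php?blogid=9
HaoHappy : http://blog.csdn.net/haohappy2004/
Ilia Alshanetsky : http://ilia.ws
James Cox : http://imajes.info
John Coggeshall : http://blog.coggeshall.org
Justin Wu : http://www.phpsalon.com
Marco Tabini : http://blogs.phparch.com/mt/
Martin Fowler : http://martinfowler.com/bliki/
Miguel de Icaza : http://primates.ximian.com/~miguel/activity-log.php
phpComplete : http://phpcomplete.com/
Rasmus Lerdorf : http://lerdorf.com/
sebastian : http://www.sebastian-bergmann.de/blog/
ShenKong : http://blog.csdn.net/countstars/
Sterling Hughes : http://www.edwardbear.org/serendipity/
Wez Furlong : http://netevil.org

PHP Website
PHP Official Site: http://www.php.net
Online Manual: http://www.php.net/manual/zh/
Smarty Template Engine: http://smarty.php.net
PEAR: http://pear.php.net
PECL: http://pecl.php.net
PHP Snapshots: http://snaps.php.net
PHP-GTK: http://gtk.php.net

Database Site
MySQL Official Site: http://www.mysql.com
SQLite Official Site: http://www.sqlite.org
PostgreSQL Official Site: http://www.postgresql.com
PostgreSQL Chinese Site: http://www.pgsqldb.org

Scripts Site
PHP Classes: http://www.phpclasses.org
PHP code exchange: http://px.sklar.com

Software Download
PHP: http://www.php.net/downloads.php
PHP Manual: http://www.php.net/download-docs.php
Apache: http://httpd.apache.org/download.cgi
MySQL: http://dev.mysql.com/downloads/
SQLite: http://www.sqlite.org/download.html & http://pecl.php.net/package/SQLite

Other Resources
Sitepoint: http://www.sitepoint.com
PHP Hub: http://www.phphub.com
Zend: http://www.zend.com
Open Source Web Development: http://www.devshed.com
PHP Freaks: http://www.phpfreaks.com
PHP Builder: http://www.phpbuilder.com
WeberDev: http://www.weberdev.com
PHP Editor Review: http://www.php-editors.com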

Monday, October 31, 2005

A complete guide to quickly launching programs on a USB drive (using a batch file to start several programs)

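  At work I often need to copy frequently used software and files onto a USB drive and carry them around. In practice, though, every time I want to run a program I have to open several levels of folders and hunt for the main executable among many files, which is quite tedious. Shortcuts to these programs would make things much simpler, but ordinary Windows shortcuts will not do, because the drive letter of a removable drive changes from one computer to another. So I found a special kind of "shortcut" that is not affected by the drive letter and can also launch several programs on the USB drive at once.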

  Open Notepad and enter the following:

  Start "" "工具\网络\Foxmail\Foxmail.EXE"

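  Save it in the root directory of the USB drive with the file name "电子邮件.bat" and the file type set to "All Files (*.*)". Double-clicking "电子邮件.bat" then launches Foxmail directly, just like a Windows shortcut. If you want one shortcut to open several programs at once, change the contents to: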

  start "" 工具\网络\Foxmail\Foxmail.EXE

  start "" 工具\网络\QQ\QQ.exe

  start "" 文件\素材\图形\

  start "" 工具\娱乐\Winamp\Winamp.exe 文件\MP3\MP3列表.m3u

  start "" EXPLORER.EXE /e,文件\素材\图形\

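  Save it as "运行常用.bat". Double-clicking "运行常用.bat" then does all of the following at once: launches Foxmail and QQ, opens the "图形" folder, plays the playlist "MP3列表.m3u" in Winamp, and opens the "图形" folder in Windows Explorer. The command format is: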

  start + space + "" + space + path to the program, file, or folder

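  The empty "" is the window title that start expects as its first quoted argument; without it, a quoted path would be taken as the title. If the path to a program or folder contains spaces, the whole path must be enclosed in ASCII double quotes, otherwise it will not be recognized. If what follows is a file or folder, it is opened with its default program. If you want to run a program with arguments, or open a file with a program of your own choice, put the program path and its arguments before the file path. In the last line of the example, EXPLORER.EXE is the program chosen to do the opening, and the "/e," argument tells it to open the folder in Windows Explorer.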

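  Here the paths to programs, files, and folders are all relative paths without a drive letter. They are resolved against the current directory, which is the batch file's own folder when it is started by double-clicking, while a bare program name such as Explorer.exe is found through the system search path, so a changed drive letter does no harm. For the same reason, do not give Explorer.exe a full path: not every computer has Windows installed in C:\Windows. The DOS window closes by itself once the programs have started; if it does not, check the batch file for superfluous blank lines and delete them.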

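  The start command can open not only programs and files but also web addresses, by launching IE (or whichever browser is the default). So the same method can open all of your frequently visited sites at once. Enter the following commands: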

  start "" http://www.cce.com.cn

  start "" http://www.ccidnet.com

  start "" http://www.163.com

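  Save this as a BAT file; from then on, running it opens 《中国电脑教育报》 (China Computer Education), 赛迪网 (CCIDNet), and 网易 (NetEase) at the same time. In addition, because this is an ordinary batch file, you can also add a command line such as "copy "文件\MP3\*.*" D:\" to copy all the files in a folder to the hard disk.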

  The settings above are for Windows XP. In Windows 98 the command format is different, and the double quotes in start "" must be removed.

Thursday, August 25, 2005

Many useful articles

http://lovehome.stronglong.com/lovehome/exchm/phpAndLinux/



?用PHP如何获得浏览器信息 ?php多文件上载系统完整版
?线人谁显示php代码 ?自己编的分页模块 ?php命令行参数详解及应用
?php积累的一些技巧 ?phpini中文版 ?php分页显示详解
?一家之言的经验之谈php+mysql扎实个人基本功 ?windowsxp快捷键完美篇 ?做个自己站内搜索引擎
?经验积累 ?数据库的日期格式转换 ?php邮件专题
?多文件上载系统完整版 ?文件上传类 ?浅谈phpmysql身份验证的方法
?关于下载docxls文件的办法javasource ?php新手上路 ?禁止ip的函数
?php时间显示例 ?php用于登录的类 ?使用模板
?如何使用php中的正则表达式 ?一个浏览器检查类 ?缩略图
?最简单的文本计数器 ?删除目录及其下的文件函数 ?让你拥有自己的qq在线显示代码
?中文时间显示的程序 ?日历显示程序 ?文本计数器
?一个判断oicq是否在线的小程序 ?php中如何使用header发送头部信息 ?用php调用oicq在线程序
?获取访问者操作系统增加了win2003 ?用php调用浏览者真实ip方法 ?url地址合法性检查
?页面运行时间代码 ?保存全部页面的方法 ?如何在多个页面之间传递数组
?php中重新定向到另一个页面三种方法 ?php中cookie及其使用 ?用php制作动态计数器
?一个全面获取图象信息的函数getimageinfo ?php文本数据库的搜索方法 ?php如何读取cookies
?php中比较简单的数据验证 ?php的密码验证 ?检验email地址的合法性函数
?定制浏览器地址栏前的小图标 ?一个取ip地址取网卡地址取ip网卡地址的函数 ?用正则表达式得到一个页面的所有链接
?取得当前得页面url ?一个分页导航类 ?制作网页的目录式导航菜单
?一个目录类 ?两个日期类 ?计算程序运算时间的类
?又一个发送mime邮件的类 ?ftp类 ?如何用php判断客户端浏览器的语系
?php中实现大图自动缩成小图及gd库的安装 ?关于生成缩略图的问题各位大侠请进来 ?上传图片自动生成缩略图函数
?自动生成缩略图的函数改进版 ?生成缩略图代码经本人测试 ?图片缩略图的类
?php实现实时时间 ?一个非常棒的上传附件函数 ?php处理http上传文件函数转载
?网页文字简繁转换函数 ?取得文件扩展名方法 ?按比例控制图片显示自动缩放函数
?首页的页面运行时间代码 ?中文时间日期显示的程序 ?php中的时间处理中的时间处理
?一个实时显示服务器时间的小程序 ?计算程序运行时间的类 ?计算两个时间相差的天数
?一个菜单类满好用的 ?判断一个email是否存在的类 ?贴一个文本操作的类
?一个改写的表单验证类 ?计算日期差的函数 ?写了个以交替背景色显示输出的函数
?substr函数中文版终极完美版 ?用php实现验证码功能 ?超越模板引擎
?php安全及相关 ?在php中得到多选的下拉菜单的各项值一个例子 ?php4新函数集锦
?用php写的进度条 ?加速php程序 ?测试gd库
?开发大型php项目的方法 ?关于在php中使用中文命名变量、函数、类 ?实现windows资源管理器风格的树型菜单
?初学php编写了一个显示天气预报的程序 ?繁体中文转换成简体中文 ?ip来源查询php源代码
?php5对盗链说再见 ?怎么在一个图片上打上标签 ?用gd库生成高质量的缩略图片已测试成功
?用php如何获得浏览器信息 ?用php获得浏览器信息 ?用php控制您的浏览器cache
?如何在页面显示来访者分辨率浏览器 ?文件的操作修改删除 ?如何删除文件内第一行或指定一行数据
?php十大 ?php动态图像的创建 ?基于php的聊天室
?获取客户端分辨率 ?如何对php程序中的常见漏洞进行攻击 ?使用php的错误处理
?用php42书写安全的脚本 ?php应用技巧七则 ?php读取某站点的链接的函数
?php中的日期处理 ?显示图片 ?简体中文转换为繁体中文的php函数
?繁体中文转换为简体中文的php函数 ?能把汉字转化为拼音的一个函数 ?功能齐全的发送邮件类
?用php发电子邮件1很简单?我也是这样认为的 ?文件系统基本操作类 ?好东西和大家分享同时上传100个文件上传的程序代码文
?php的ftp学习 ?实时时间显示代码 ?如何将php的结果输出到非php页面中
?输出控制类 ?php的十个高级技巧 ?文本数据库自定义函数集
?请问php里怎么得到访客的ip和端口号 ?php的编译配置详细选项 ?用php调用数据库的存贮过程
?如何取得用户的真实ip ?随机输出目录中的图片 ?怎样使phpinfo函数不起作用
?从网址分离得到域名 ?如何恢复mysql的root口令 ?用php来计算某个目录的大小
?通过gd库为图片添加透明水印 ?升级安装gd ?用php将mysql数据表转换为excel文件格式
?修改zend引擎实现php源码加密的原理及实践 ?喜悦国际村php函数库大全 ?一个用php实现的ubb类
?如何编译php源代码 ?highlight_stringphp语法加亮函数 ?在php中使用与perl兼容的正则表达式
?中文字符串截取函数 ?如何打印 ?文件上传??终结者
?如何对php程序中的常见漏洞进行攻击下 ?web追捕php版源代码 ?php新手上路进入实质性阶段
?php文件上载暴露任意文件 ?php聊天室技术 ?如何将字串里的小写全部转成大写,但不破坏中文字
?基于phpmysql的聊天室设计 ?php设计聊天室步步通 ?php输出控制功能在简繁体转换中的应用
?1900-2100超酷两百年日历 ?删除无限级目录与文件代码共享 ?php源码学习站内搜索html版
?实例学习php之投票程序 ?简单的页面缓冲技术 ?php代码优化及php相关问题总结
?用php计算身份证校验码 ?网页后退不再出现过期 ?php中的类什么叫类
?php柱形统计图 ?超酷php饼图 ?php实现全局静态变量类的一种实现方式
?php文件系统基本操作类 ?如何避免表单的重复提交 ?php5zendengine20的改进
?用php人工使网页过期 ?一个phpmysql的用户验证 ?zendoptimizer配置指南
?php的字符编码转换工具 ?判断是否全为中文另一方法繁体不能用哦 ?php的汉字转换gbk-big5
?php的汉字换unicodeutf8-gbk ?汉字转化为拼音 ?php中实现数字金额到中文大写字符的转换
?基于php的聊天室编程思想 ?正则表达式perl语言的文字处理模式 ?php中的正则表达式
?php正则表达式,删除链接 ?php自动生成月历代码 ?安全天使??端口在线检测php代码
?显示页面运行时间的代码3个 ?一个简单的防盗链东东 ?图片防盗链功能
?用脚本修改用户注册表 ?php套接字编程 ?php做的端口嗅探器--可以指定网站和端口
?php开发文件系统实例讲解 ?生成随机图象的代码php的 ?用zendencode编译php程序
?用php实现xml备份mysql数据库 ?实现跨域名cookie ?用文本文件实现的动态实时发布新闻的程序
?php生成html ?介绍几个array库的新函数 ?outputbuffer输出缓冲函数的妙用
?中文汉字截取函数支持gb2312、big5、utf-8 ?利用static实现表格的颜色隔行显示 ?php中的正规表达式一
?用php发送有附件的电子邮件 ?php和正则表达式 ?不用gd库生成当前时间的png格式图象的程序
?用libtemplate实现静态网页生成 ?短短几行代码就可以把sina的新闻偷过来新闻小偷程序 ?生成适合图片比例的假缩略图js实现
?框架自动跳转的困惑 ?web打印大全 ?实用代码
?自动浏览 ?怎样实现在线用户列表 ?如何正确统计中文字数
?日期联动菜单 ?php中利用gd输出汉字实例 ?怎么设session
?big5码完全解析 ?跟我学小偷程序教程之小偷原理 ?php的模板
?wwwyournametk国际域名完美攻略 ?分页类终结者 ?js滚动特效
?winxp开始→运行→输入的命令集锦 ?php实现文件安全下载 ?php原码颜色
?利用qq的ip查询数据库查询ip所在地php源码 ?js下拉框选择头像图片 ?php50新特性zt
?php生成带有雪花背景的验证码 ?如何去掉文章里的html语法 ?域名查询代码公布
?模式修正符 ?后缀自加程序 ?点每张不同的图片在右面显示出来的内容都不一样
?php中的正规表达式 ?取ip函数 ?用php实现pop3邮件的收取
?gd输出汉字的函数的分析 ?表单中一个文本框自动对另一个文本框赋值改变其样式 ?php在线统计
?怎样查数据库在100秒钟内的记录 ?研究了一下连动下拉菜单共享一下希望有人能继续完善 ?利用js调用后台php进行数据处理原码
?浅析php中实现多线程 ?smarty入? ?获取各用户分辨率
?文件结构图 ?用php得到网卡mac ?db_mysql.php
?ubb正则替换 ?请问如何得到90天以后的日期 ?新身份证校验位算法
?从代码安装完整的http+ftp+mail的linuxserver ?自己编译(升级)php5中的gd库中的jpeg、freetype2、png ?登录的类
?使cookie实现跨域名 ?php中的面向对象和面向过程 ?这些基本的东西你掌握了吗
?实例讲解session的使用方法 ?页面压缩gzip的运用 ?用gd库给图片加中文实例
?smarty的修饰符 ?cookie与session ?抛开cookie使用session
?验证码登陆校验 ?session全面教程 ?php4中session处理的定制
?中文字符串截取涵数 ?图片处理程序 ?得到所有的get参数
?一棵php的类树(支持无限分类) ?php如何更好更有效的实现用户注册页面 ?用php实现真正的连动下拉列表
?php分类列表模版 ?无限分类树型论坛的实现 ?无限分类
?php处理html ?简单的无限分类思想 ?ubbcode类
?简单的树形菜单 ?一个翻页类 ?日历类
?一个在php中利用递归实现论坛分级显示的例子 ?连动下拉菜单 ?计算农历的函数
?discuz!跨站大全 ?如何把php转成exe文件 ?wap服务器如何知道用户手机号码
?ie定制404错误及execommand saveas 警告绕过漏洞 ?php部分常见问题总结 ?php生成wap页面
?搜索引擎技术核心揭密php ?php中重新定向到另一个页面 ?程序员的进化从学生到首席执行官
?一篇介绍hawhaw及用它来做wap站的文章 ?session生存时间设置问题 ?php网站漏洞的相关总结
?功能齐全的发送php邮件类 ?以mysql方式操作文本数据库--又强又实用 ?动态页面生成静态页面的类
?取得随机字符串 ?抓取和分析一个文件 ?自动跳转中英文页面
?php简易实现域名判断跳转 ?让下拉列表又能选择又能输入 ?通用表单验证函数-改进版
?搜索和替换文件或目录的一个好类--很实用 ?php图形处理中的中文输出 ?非常好的目录导航文件代码
?header函数使用说明 ?php页面生成的html代码生成独立的文件 ?php调用功能强大的java 类库classes
?一个生成条形码的东东 ?签名档 ?通过对php一些服务器端特性的配置加强php的安全
?使用无限生命期session的方法 ?使用数据库保存session的方法 ?类的另类用法--数据的封装
?多php服务器实现多session并发运行 ?php脚本的8个技巧 ?最近收集的一些php的经经验技巧
?php-push技术实现刷新功能 ?正则表达式中的特殊字符一览 ?php5中xml-rpc函数的使用
?如何正确理解php的错误信息 ?htaccess文件使用手册 ?php面向对象编程快速入门
?php4之cookie支持详解档 ?php中的xml应用 ?phpshell的编写改进版
?php与xul ?关于php操作文件的一些faq总结 ?简单的数据缓存技术
?php中xml操作指南 ?正则表达式使用详解 ?gb码完全解析
?md5加密算法简介 ?收发邮件的一个程序 ?用php实现pop3邮件的解码
?smtp协议原始命令码和工作原理 ?pop3协议命令原始码及工作原理 ?发送mime邮件类
?发送mime邮件类--实例 ?用php读取imap邮件 ?最好的邮件编码解码类
?用socket发送电子邮件利用需要验证的smtp服务器 ?使用php的编码功能-问题发现 ?使用php的编码功能-mime.inc
?php实现ping ?用socket发送电子邮件(利用需要验证发smtp服务器 ?apache服务器配置全攻略
?在php5中实现自动装载类库 ?一个登录的类 ?一个简单的php在线端口扫描器
?如何将php作为shell脚本语言使用 ?生成excel的文件 ?年月日三下拉框联动
?重新整理源码下载地址及各类资源站点 ?取得源码里面的里面的url ?用php动态创建flash动画
?将access中的数据导入mysql ?简单音乐盒的实现 ?使用header发送状态代码
?php的面向对象编程 ?非技术类 - linux网址精选 ?非技术类 - 中国著名正版软体的网站
?非技术类 - joke ?非技术类 - Just for Fun ?非技术类 - 美国公认顶尖黑客榜
?非技术类 - 100个最佳linux站点 ?非技术类 - 鹦鹉的故事 ?非技术类 - 经典的海盗问题
?非技术类 - 关于软件的说法 ?非技术类 - 请不要做浮躁的人 ?非技术类 - 600个优秀网站
?非技术类 - Linux历史篇 ?非技术类 - 打围巾的八种方法 ?非技术类 - 用语言控制Linux:Linux的语音识别软件
?非技术类 - 完全用GNU - Linux工作 ?非技术类 - 教你如何在5460同学录贴图 ?非技术类 - 四个人性的经典故事
?非技术类 - 成长中必须知道的10个故事 ?非技术类 - CIO的诞生 ?非技术类 - 中关村的“技术个体户”
?基础知识 - 关于WINS服务器的问题 ?基础知识 - linux网络服务器配置基础 ?基础知识 - 网络配置文件
?基础知识 - linux系统服务 ?基础知识 - Modules的概念及使用 ?基础知识 - linux常用工具软件
?基础知识 - linux最多支持多少用户 ?基础知识 - 解析linux操作系统文件目录 ?基础知识 - FAQ
?基础知识 - 信号集合 ?基础知识 - linux重要知识 ?基础知识 - 几个小问题
?基础知识 - 系统日志 ?基础知识 - 文件和目录的权限 ?基础知识 - linux日志
?基础知识 - shadow密码 ?基础知识 - 嵌入式系统 ?基础知识 - 在linux下实现设备的配置
?基础知识 - linux新手最经常遇到的问题 ?基础知识 - linux中文件查找技术大全 ?基础知识 - 信号
?基础知识 - LVS的配置详解配置 ?基础知识 - Liunx系统的LOG日志文件 ?基础知识 - UNIX简介
?基础知识 - 网络协议全了解 ?基础知识 - 在Linux下配置TCP - IP ?基础知识 - 常见的几种光盘文件系统和刻录方式
?基础知识 - unix基础教程 ?基础知识 - 技巧小全 ?基础知识 - 跨平台开发
?基础知识 - LINUX系统、设备、软件简易安装指南 ?基础知识 - www.linuxforum.net入门版常见问题 ?基础知识 - 在linux下使用HPCD-Writer Plus 8210e (USB-接口)刻录机
?基础知识 - 目录结构 ?基础知识 - 日志管理 ?PHP安全编程之加密功能
?基础知识 - linux知识大全 ?基础知识 - fstab格式 ?基础知识 - Red Hat Linux 8.0自动运行程序的方法
?基础知识 - linux重要知识 ?基础知识 - 端口基础常识大全贴 ?基础知识 - 常用端口对照
?基础知识 - Linux系统备份 ?基础知识 - RedHat日志文件 ?基础知识 - 重新规划分割区
?基础知识 - 在Linux中访问硬盘DOS分区、软盘和光盘 ?基础知识 - linux99问 ?基础知识 - 在Linux中设置磁盘限额
?基础知识 - Linux基本安装方法 ?基础知识 - linux安装常见的FAQ问题(第二版) ?基础知识 - Linux与其他操作系统的区别
?基础知识 - 文件的存取权限?模式位疑难详解 ?基础知识 - Linux中文件的压缩与解压缩 ?基础知识 - ReiserFS文件系统
?基础知识 - Linux各项系统开机服务的功能是什么?有哪些可以关掉? ?基础知识 - Linux各种发行版简易说明 ?基础知识 - 什么是Linux
?基础知识 - Linux下的中文显示和支持常见问题解答 ?基础知识 - 如何在Linux下通过WEB认证方式上网 ?基础知识 - redhat8死机解决一例
?基础知识 - 必不可少的4个服务 ?基础知识 - linux爱好者入门教程及相关配置 ?基础知识 - Linux的发行版制作简要过程
?基础知识 - Linux下安装和使用杀毒软件AntiVir ?基础知识 - 关于磁盘阵列,分区加载的问题 ?基础知识 - 红旗桌面4.0正式版最新使用方法和问题解答200例
?基础知识 - ext2和ext3的区别 ?基础知识 - 虚拟块硬盘,新增点swap分区空间 ?基础知识 - 配额
?基础知识 - 如何增加swap ?基础知识 - 一些比较经典的问题与解答 ?基础知识 - Linux使用技巧33条
?PHP实现文件安全下载 ?基础知识 - 使用Linux的8个小技巧 ?基础知识 - Linux应用问答
?基础知识 - Linux与Windows硬盘资源互访 ?基础知识 - label的问题 ?基础知识 - Linux简明系统维护手册
?基础知识 - 菜鸟心得(二)请共享,请指正 ?基础知识 - Linux使用技巧集锦 ?基础知识 - . - configure make make install分别是什么意思
?基础知识 - halt poweroff reboot问题 ?基础知识 - 使用mc恢复被删除文件 ?基础知识 - 用红帽子的chkconfig管理Init脚本
?基础知识 - 为linux服务器增加新分区 ?基础知识 - 一些奇怪的unix指令名字的由? ?基础知识 - Linux下文件属性
?基础知识 - 三层交换技术解析 ?基础知识 - Wiki的初步了解 ?指令大全 - gunzip
?指令大全 - vi用法 ?指令大全 - find用法 ?指令大全 - find实例
?指令大全 - hdparm ?指令大全 - xargs实例 ?指令大全 - vi编辑器
?指令大全 - mkisofs ?指令大全 - sort ?指令大全 - man实例
?指令大全 - md5sum ?指令大全 - cdrecord ?指令大全 - du
?指令大全 - shutdown,halt,reboot,init ?指令大全 - wget ?指令大全 - ping
?指令大全 - linux指令大全 ?指令大全 - vi的简单用法 ?指令大全 - ls
?指令大全 - PortTunnel ?指令大全 - mount ?指令大全 - Red Hat9实用工具
?指令大全 - linux环境下的undelete ?指令大全 - chattr ?指令大全 - sar,iostat,vmstat
?指令大全 - ps ?指令大全 - du和df ?指令大全 - netstat
?指令大全 - RPM命令大全 ?指令大全 - rpm实例 ?指令大全 - df
?指令大全 - hwclock ?指令大全 - mkdev ?指令大全 - tar实例
?指令大全 - setup实例 ?指令大全 - vim实例 ?指令大全 - fuser
?指令大全 - od ?指令大全 - locate实例 ?指令大全 - quota
?指令大全 - vi大全 ?指令大全 - 指令大全 ?指令大全 - 命令技巧大全(需分解)
?指令大全 - head,tail,sed ?指令大全 - file ?网站加速 PHP 缓冲的免费实现方法
?指令大全 - man2html ?指令大全 - 一次处理整个目录 ?指令大全 - hexed
?指令大全 - bc ?指令大全 - vmstat ?指令大全 - quota
?指令大全 - 精通RPM之安装篇 ?指令大全 - scp ?指令大全 - rz - sz
?指令大全 - fdformat ?指令大全 - redhat9键盘的快捷操作 ?指令大全 - vim中的颜色
?指令大全 - emacs ?指令大全 - 格式化软盘 ?指令大全 - RPM命令手册
?指令大全 - 全文替换以修改档案方法 ?指令大全 - Linux中文件查找技术大全 ?指令大全 - 对光驱和软驱实现Automount
?指令大全 - RPM的使用 ?指令大全 - 在Linux中限制用户空间 ?指令大全 - vim显示彩色
?指令大全 - tee ?指令大全 - 档案目录管理--cat ?指令大全 - Linux 指令篇:档案目录管理--cd
?指令大全 - Linux 指令篇:档案目录管理--chmod ?指令大全 - Linux 指令篇:档案目录管理--chown ?指令大全 - Linux 指令篇:档案目录管理--cp
?指令大全 - Linux 指令篇:档案目录管理--cut ?指令大全 - Linux 指令篇:档案目录管理--find ?指令大全 - Linux 指令篇:档案目录管理--less
?指令大全 - Linux 指令篇:档案目录管理--ln ?指令大全 - Linux 指令篇:档案目录管理--locate ?指令大全 - Linux 指令篇:档案目录管理--ls
?指令大全 - Linux 指令篇:档案目录管理--mkdir ?指令大全 - Linux 指令篇:档案目录管理--more ?指令大全 - Linux 指令篇:档案目录管理--mv
?指令大全 - Linux 指令篇:档案目录管理--rm ?指令大全 - Linux 指令篇:档案目录管理--rmdir ?指令大全 - Linux 指令篇:档案目录管理--split
?指令大全 - Linux 指令篇:档案目录管理--touch ?指令大全 - Linux 指令篇:日期时间排程--at ?指令大全 - Linux 指令篇:日期时间排程--cal
?指令大全 - Linux 指令篇:日期时间排程--crontab ?PHP中的正规表达式(二) ?指令大全 - Linux 指令篇:日期时间排程--date
?指令大全 - Linux 指令篇:日期时间排程--sleep ?指令大全 - Linux 指令篇:日期时间排程--time ?指令大全 - Linux 指令篇:日期时间排程--uptime
?指令大全 - Linux 指令篇:使用者资讯与管理--chfn ?指令大全 - Linux 指令篇:使用者资讯与管理--chsh ?指令大全 - Linux 指令篇:使用者资讯与管理--finger
?指令大全 - Linux 指令篇:使用者资讯与管理--last ?指令大全 - Linux 指令篇:使用者资讯与管理--passwd ?指令大全 - Linux 指令篇:使用者资讯与管理--who
?指令大全 - Linux 指令篇:讯息传送与信件管理--aliases ?指令大全 - Linux 指令篇:讯息传送与信件管理--mail ?指令大全 - Linux 指令篇:讯息传送与信件管理--mailq
?指令大全 - Linux 指令篇:讯息传送与信件管理--mesg ?指令大全 - Linux 指令篇:讯息传送与信件管理--newaliases ?指令大全 - Linux 指令篇:讯息传送与信件管理--talk
?指令大全 - Linux 指令篇:讯息传送与信件管理--wall ?指令大全 - Linux 指令篇:讯息传送与信件管理--write ?指令大全 - Linux 指令篇:工作行程资讯与管理--kill
?指令大全 - Linux 指令篇:工作行程资讯与管理--nice ?指令大全 - Linux 指令篇:工作行程资讯与管理--ps ?指令大全 - Linux 指令篇:工作行程资讯与管理--pstree
?指令大全 - Linux 指令篇:工作行程资讯与管理--renice ?指令大全 - Linux 指令篇:工作行程资讯与管理--skill ?指令大全 - Linux 指令篇:文件系统--fstab
?指令大全 - Linux 指令篇:文件系统--fsck ?指令大全 - Linux 指令篇:文件系统--fdisk ?指令大全 - Linux 指令篇:文件系统--exportfs
?指令大全 - Linux 指令篇:文件系统--e2fsck ?指令大全 - Linux 指令篇:文件系统--df ?指令大全 - Linux 指令篇:文件系统--dd
?指令大全 - Linux 指令篇:设备管理--setleds ?指令大全 - Linux 指令篇:设备管理--rdev ?指令大全 - Linux 指令篇:设备管理--loadkeys
?指令大全 - Linux 指令篇:设备管理--dumpkeys ?指令大全 - Linux 指令篇:设备管理--MAKEDEV ?指令大全 - Linux 指令篇:磁片工具--mkdosfs
?指令大全 - Linux 指令篇:磁片工具--mformat ?指令大全 - Linux 指令篇:磁片工具--fdformat ?指令大全 - Linux 指令篇:文件打印--lprm
?指令大全 - Linux 指令篇:文件打印--lpr ?指令大全 - Linux 指令篇:文件打印--lpq ?指令大全 - Linux 指令篇:文件打印--lpd
?指令大全 - Linux 指令篇:编码压缩打包--uuencode ?指令大全 - Linux 指令篇:编码压缩打包--uudecode ?指令大全 - Linux 指令篇:编码压缩打包--uudecode
?指令大全 - Linux 指令篇:编码压缩打包--compress ?指令大全 - Linux 指令篇:终端机管理--reset ?指令大全 - Linux 指令篇:终端机管理--clear
?指令大全 - Linux 指令篇:字串处理--tr ?指令大全 - Linux 指令篇:字串处理--expr ?指令大全 - Linux 指令篇:工作行程资讯与管理--top
?指令大全 - Linux 指令篇:DOS相容指令--mlabel ?指令大全 - Linux 指令篇:DOS相容指令--mdeltree ?指令大全 - Linux 指令篇:DOS相容指令--mdel
?指令大全 - Linux 指令篇:DOS相容指令--mcopy ?指令大全 - Linux 指令篇:DOS相容指令--mcd ?指令大全 - Linux 指令篇:DOS相容指令--mattrib
?指令大全 - Linux 指令篇:起始管理--shutdown ?指令大全 - Linux 指令篇:起始管理--reboot ?指令大全 - Linux 指令篇:起始管理--init
?指令大全 - Linux 指令篇:起始管理--halt ?指令大全 - Linux 指令篇:使用者管理--sudo ?指令大全 - Linux 指令篇:使用者管理--su
?指令大全 - Linux 指令篇:使用者管理--adduser ?指令大全 - Linux 指令篇:文件系统--sync ?指令大全 - Linux 指令篇:文件系统--swapon
?指令大全 - Linux 指令篇:文件系统--mount ?指令大全 - Linux 指令篇:文件系统--mkfs ?指令大全 - hostid
?指令大全 - 用RPM校验文件 ?指令大全 - userdel ?指令大全 - chsh
?指令大全 - chfn ?指令大全 - MIRROR ?指令大全 - man.conf
?指令大全 - man ?指令大全 - ftpaccess ?指令大全 - groupadd
?PHP中的正规表达式(一) ?指令大全 - groupdel ?指令大全 - usermod
?指令大全 - whatis ?指令大全 - groupmod ?指令大全 - losetup
?指令大全 - mkdir ?指令大全 - mkfs ?指令大全 - apropos
?指令大全 - sudo ?指令大全 - useradd ?指令大全 - shutdown
?指令大全 - PPPD ?指令大全 - rpm命令参数列表 ?指令大全 - tripwire的用法
?指令大全 - 如何在vi中做到高亮显示和彩色 ?指令大全 - 如何模拟dos下的copy con a.txt生成a.txt文件 ?指令大全 - vi中怎么去掉响铃
?指令大全 - xhost ?指令大全 - scp ?指令大全 - Linux学习手册
?指令大全 - 已经配置了hosts.equiv及.rhosts, 为何仍不能使用rcp ?指令大全 - 如何取出RPM包中的文件 ?指令大全 - vi编辑器的使用技巧
?指令大全 - vi同时编辑多个文件 ?指令大全 - ip ?shell - 打印通配结果
?shell - Bash中对变量的操作 ?shell - 列出目录树 ?shell - while循环中使用read
?shell - 将多个空格替换为一个空格 ?shell - shell ?shell - 用脚本实现分割文件
?shell - shell ?shell - 删除一个月以前的文件 ?shell - shell中循环取出文件中每一行赋予一变量的问题
?shell - 在每个文件夹下建一个.qmail文件 ?shell - shell简介 ?shell - 得到上月未日期,格式为YYYYMMDD
?shell - 实现用backup或tar命令来做目录备份 ?shell - 编写一个只允许用户执行telnet的shell ?shell - 关于awk中计算正弦90度的问题
?shell - SHELL Warming up ?shell - 时间同步 ?shell - 判断文件的访问权限是不是600
?shell - 从一个目录中提取文件的问题 ?shell - rsh ?shell - 关于KSH中select建立菜单的问题
?shell - 设shell自己的隐含变量 ?shell - 合并某些行 ?shell - sh与csh的比较
?shell - 用shell编出来的查看dbf文件的脚本 ?shell - SCO UNIX 5.0.5下通用菜单程序(用bsh制作,含源码) ?shell - "Dogs" of the linux Shell
?shell - 算青蛙的脚本 ?shell - 怎么把一个文本的一列,换成一行 ?Cookie及其使用(二)
?shell - 用awk实现删除文件 ?shell - 怎样把一字符串(在变量里)翻转过来,再存到变量里 ?shell - 关于expr的用法
?shell - 查找日期为某一天的文件 ?shell - sed用法 ?shell - 用cshell逐行读文件逐行处理
?shell - awk中如何用print输出单引号 ?shell - 长篇连载--arm linux演艺---序 ?shell - 正则表达式一例
?shell - 把csh,sh或ksh语法的脚本相互转换的程序 ?shell - shell用法 ?shell - finger统计同ip地址的tty终端数
?shell - shell计算明天和昨天日期的函数 ?shell - 合并两个文件 ?shell - 数值和字母表一一对应
?shell - 停止终端多个进程 ?shell - 调试makefile ?shell - 读文本的最后一行
?shell - 实现两次变量替代 ?shell - 用gawk分析脚本调用ping命令时ping的进程号 ?shell - awk中如何使用shell的环境变量
?shell - 什么格式才能让SHELL正确地替换这样两个变量 ?shell - 找出一个文件中出现某str的次数 ?shell - 我想每天自动执行该shell
?shell - 重定向是什么 ?shell - shell中的行和列 ?shell - 隐藏光标的方法
?shell - 当while遇到重定向----sh的陷阱 ?shell - 在SHELL程序中实现‘按任意键继续’ ?shell - 在shell程序中判断一个变量是不是由4个数字组成
?shell - 代码解释 ?shell - awk中str可不可以相加 ?shell - 正则表达式
?shell - 关于设置命令行提示符(PS1) ?shell - 非交互方式改变登录用户密码 ?shell - 去掉awk中单引号的特殊性
?shell - 登陆到多台主机上去查看相关进程返回结果 ?shell - 实现两次变量替代 ?shell - linux网络安全和优化
?shell - shell的问题 ?shell - awk文本处理 ?shell - 用date获得前一天的日期
?shell - 如何用date获得前一天的日期 ?shell - 论正则表达式的“贪婪”性 ?shell - 在awk中如何引用shell的变量
?shell - 用sh列表显示oracle数据库单条查询结果 ?shell - awk中使用shell变量疑问 ?shell - 什么格式才能让SHELL正确的替换这样两个变量
?shell - shell参数问题,linux ?shell - eval用法三例 ?shell - 大小写转化
?shell - 显示小数点后面几位 ?shell - 获得一个变量的长度 ?shell - remsh疑问
?shell - 将文件中的“" 替换成“" ?shell - 重定向 ?shell - 用sed删除由空格组成的空行
?shell - 请问如何用shell作隔行删除 ?shell - 如何判断读入字符是回车键还是方向键 ?shell - 请问如何用Shell编: 在当前目录下保留指定日期的文件,其余的全部删除?
?shell - 如何计算一个日期是星期几 ?shell - 如何用Bshell转换cgi传入的变量中的非ASCII字符(汉字) ?shell - 请问如何抽取特征字的下一行
?shell - sed中如何替换出新行来 ?shell - 请问trap的用法和其作用 ?shell - shell 编程中的信号处理(signal handling in shell programming)
?shell - 定制自己的linux应用环境 ?shell - 删除指定内容的重复行 ?shell - 测试硬盘性能
?shell - 在linux环境下启动时打开numlock ?shell - 取出文件中特定的列内容 ?Cookie及其使用(一)
?shell - 实现Hex和Dec转换 ?shell - linux Boot Scripts ?shell - 可否用SHELL实现对SQL进行查询,修改,删除等等呢
?shell - 什么是Shell ?shell - 使程序的执行结果同时定向到屏幕和文件 ?shell - $@等特定shell变量的含义
?shell - cut的用法 ?shell - 把一个shell程序编译成二进制可执行文件 ?shell - 在shell里如何限制输入的长度
?shell - 双机(多机)自动互备份方案 ?shell - 用awk显示出现在两个模式之间的内容 ?shell - 文件序列a1,a2,a3...a11,a12...a1000改成a0001,a0002...a1000?
?shell - 做到限时登录 ?shell - awk脚本一例 ?shell - Bash的环境设定
?shell - 执行脚本 ?shell - 在shell中捕捉信号的trap命令 ?shell - 禁止从一个IP登录的shell
?shell - bash简介 ?shell - 重定向一例 ?shell - passwd -d aaa时报错
?shell - eval用法三例 ?shell - xset设置的含义 ?shell - 批量建立用户
?shell - 改变UNIX终端颜色 ?shell - 很方便的两个shell script ?shell - 有关awk字段分隔符
?shell - 在等待read时如何不换行输入 ?shell - Unix系列shell程序编写 ?shell - shell脚本问题
?shell - bash ?shell - Shell递归程序设计 - 批量转换大写文件名为小写 ?shell - HANDY ONE-LINERS FOR SED
?shell - 把连你电脑的人踢出去 ?shell - 如何取消beep声音 ?shell - 一个判断文件日期的问题
?shell - ls的问题 ?shell - ORACLE自动备份并且自动FTP到备份机的SHELL脚本 ?shell - 文件名转化大小写
?shell - ASH Shell的脚本编程 ?shell - php4使用session的时候出现O_RDWR failed ?shell - 如何比较两个字符串啊
?shell - 基于PPP协议的linux与Windows CE网络 ?shell - Shell高级屏幕输出 ?shell - 在Linux Shell程序中进行身份验证
?shell - 也谈在Unix系统中杀死相关终端的进程 ?shell - FAQ ?shell - bash
?shell - 续-----一个杀死终端所有进程的 Shell ?shell - 替换文件中的文本 ?shell - bash
?shell - gawk的使用方法 ?shell - sed实例 ?shell - SED手册
?shell - 设定环境变数 ?shell - 使用命令trap来捕捉信号 ?shell - expect使用一例
?shell - 恢复缺省bash提示符 ?shell - expect用法 ?shell - 一个检测show128文件更新的shell脚本
?shell - DISPLAY变量的用法 ?shell - 一支反砍站的iptables script ?安装启动 - 一张软盘启动的linux
?安装启动 - 解决多系统的最好、最安全的方法 ?安装启动 - 如何在单个硬盘驱动器上构建双引导linux系统 ?安装启动 - 忘了root密码的解决方法
?安装启动 - linux的引导过程 ?安装启动 - lilo中把dos - windows改为缺省启动的OS ?安装启动 - 操作系统的灵活性
?安装启动 - linux安全设置手册 ?安装启动 - 让双CPU的linux机器自动关机 ?安装启动 - linux新手安装教训
?安装启动 - linux启动盘制作法 ?安装启动 - 一个硬盘上装好win98 - nt - linux ?安装启动 - linux运行级别详解
?安装启动 - linux运行级init详解 ?安装启动 - linux单用户方式 ?安装启动 - linux系统分区
?安装启动 - grub引导管理器下恢复linux的root密码 ?安装启动 - 硬盘安装rh8 ?安装启动 - 用ghost对linux系统做备份
?安装启动 - 双引导问题 ?安装启动 - 将linux硬盘ghost到另一颗去 ?安装启动 - 安装rh72如何选择引导工具
?安装启动 - win2k和linux共存(grub在mbr) ?安装启动 - linux分区配制教程 ?安装启动 - linux的安装教程
?安装启动 - 开机和关机 ?安装启动 - lilo.conf中的read-only的作用 ?安装启动 - 比lilo更强劲的多操作系统引导程序grub
?安装启动 - 装win9x后lilo失效的解决方案 ?安装启动 - 如何配置VMware来通过令牌环卡访问外部LAN ?安装启动 - 双启动型USB优盘的使用举例和注意问题
?安装启动 - 3的硬盘安装 ?安装启动 - BluePoint linux的安装过程 ?安装启动 - 我的六个系统安装方法及其应用
?安装启动 - 修改grub的安装位置 ?安装启动 - grub为什么会在访问某些scsi硬盘的时候挂起 ?安装启动 - 在grub中指定内存大小
?安装启动 - 已经装了最新的binutils,为什么grub还是不能用 ?安装启动 - 反黑行动之数据恢复 ?安装启动 - fstab文件
?安装启动 - Red Hat linux 8.0 Package List ?安装启动 - linux各项系统开机服务的功能是什么? ?安装启动 - 恢复redhat的grub
?安装启动 - 如何进入linux去查看启动记录,这启动记录存在哪里? ?安装启动 - 一个PC上同时安装37个系统 ?安装启动 - 启动过程中sendmail启动慢
?安装启动 - 恢复redhat的grub ?安装启动 - 2000与linux双系统的安装 ?安装启动 - 将装过的lilo移到MBR上
?安装启动 - xfs文件系统 ?安装启动 - 从硬盘安装rh9的朋友注意了 ?安装启动 - linux上远程启动的无盘98
?安装启动 - linux系统的自动作业控制 ?安装启动 - win - linux双系统安装grub ?安装启动 - grub安装配置及使用汇总
?安装启动 - linux上远程启动的无盘98 ?安装启动 - 装scsi硬盘 ?安装启动 - 开机简述
?安装启动 - 将分区(39G)加载到 - home上 ?安装启动 - 关于redhat linux8.0系统的备份的体会和心得 ?安装启动 - 目录解析
?安装启动 - 不用软驱也照样能启动linux ?安装启动 - 让redhat8安装时使用reiserfs ?安装启动 - 文本模式的分辨率
?安装启动 - 制作linux的优盘启动盘 ?安装启动 - 建立优盘启动盘 ?安装启动 - sun服务器上安装linux
?安装启动 - 制作启动盘 ?安装启动 - 修改登录画面 ?安装启动 - 为linux划分分区
?安装启动 - 远程安装linux ?安装启动 - dos下用grub.exe修复启动故障 ?安装启动 - grub入门
?安装启动 - 大硬盘系统上安装linux系统的问题及其解决方案 ?安装启动 - 简单实现NT或WIN2000与linux共存 ?安装启动 - grub.conf中加入一项
?安装启动 - single模式需要密码怎么办 ?安装启动 - 关于lilo和nt loader的问题 ?安装启动 - 用安装光盘来修复grub
?安装启动 - os loader引导多系统实战 ?安装启动 - winxp - windows2003,还有mandrake9.1同时一个菜单引导 ?安装启动 - boot loader
?安装启动 - 全面探讨lilo---lilo学习笔记 ?安装启动 - grub学习笔记 ?安装启动 - grub的图形配置器--grubconf
?安装启动 - 再探安装多操作系统分区,grub的设置问题 ?安装启动 - 我的分区 ?安装启动 - 升级Linux系统的硬盘
?安装启动 - 安装grub ?安装启动 - 挑食的企鹅 ?安装启动 - 我的硬盘分区[双硬盘参考篇]
?PHP的面向对象编程 ?安装启动 - grub多重启动管理器 ?安装启动 - 关于硬盘分区
?安装启动 - 小菜鸟与grub的故事 ?安装启动 - 系统安装引导盘的制作 ?安装启动 - 关于安装redhat8后,导至原win2000变慢的解决方法
?安装启动 - 解决2k与linux共存后起动慢的经历 ?安装启动 - 装Linux后win2k - xp - server启动变慢的解决之道 ?安装启动 - 无光驱软驱恢复grub一例
?安装启动 - 差点要重装机器了 ?安装启动 - 给grub加上密码锁 ?安装启动 - grub能引导sco unix505吗
?安装启动 - 灾难恢复 ?安装启动 - grub scsi硬盘 mbr ?安装启动 - gnu grub faq (简体中文版)
?安装启动 - lilo使用指南 ?安装启动 - linux引导过程 ?PHP中重新定向到另一个页面
?安装启动 - 硬盘主引导记录详解 ?安装启动 - lilo配置攻略 ?安装启动 - vmware的vmware tools安装
?安装启动 - LILO的问题 ?安装启动 - lilo.conf之中文man手册 ?安装启动 - lilo原理
?安装启动 - 启动过程跟踪 ?安装启动 - grub中的分区命名方法 ?安装启动 - 重装grub
?安装启动 - GRUB三步通 ?安装启动 - 制作启动盘 ?安装启动 - LINUX和LILO
?安装启动 - grub一例 ?安装启动 - Windows 2000 Server - FreeBSD - RedHat Advanced Server 2.1 ?安装启动 - vga选项
?安装启动 - 脆弱的grub ?安装启动 - 自己动手做一个迷你Linux系统 ?安装启动 - 非正常关机导致文件系统破坏了
?安装启动 - 无法启动系统 ?安装启动 - RedHat开机起动流程 ?安装启动 - 开机 - 关机管理
?安装启动 - 装win9x后lilo失效的解决方案 ?安装启动 - Linux远程启动 ?安装启动 - lilo大杂耍
?安装启动 - 与NT和平共处 ?安装启动 - redflag的vmware安装 ?安装启动 - Linux关机命令详解
?安装启动 - 解读LILO错误提示信息 ?安装启动 - 安装Linux无盘工作站 ?安装启动 - MBR如果被覆盖了怎麽办
?安装启动 - Linux开机过程的分析(关于bootsect.S) ?安装启动 - Debian GNU - Linux 完全安装手册 ?安装启动 - 实例讲解LILO的配置和使用
?安装启动 - 标题 ?安装启动 - 备份和修复Linux LILO指南 ?安装启动 - 深入Linux的LILO
?安装启动 - lilo ?安装启动 - Lilo.conf(LILO 配置文件)手册 ?安装启动 - 深入解剖LILO
?安装启动 - Linux下Grub开机管理程式安装简介 ?安装启动 - 主引导扇区释疑 ?安装启动 - 安装xp - freebsd - linux
?安装启动 - 制作自己的Floppy-Linux Step By Step ?安装启动 - 如何找回redhat7.2的root密码 ?安装启动 - 安装redhat9时键盘找不到
?安装启动 - RH8,9中安装后如何添加新的语言包 ?安装启动 - 硬盘改变位置之后重新安装GRUB以及修改相应文件的方法 ?安装启动 - Linux启动盘boot - root盘的制作
?安装启动 - Kickstart+HTTP+DHCP+TFTP+PXElinux实现RedHat的网络自动安装 ?安装启动 - 制作Linux启动盘的四种方法 ?安装启动 - 无软驱和光驱安装Redhat方法
?安装启动 - 自动fsck ?安装启动 - 制作Fedora DVD ISO的方法 ?安装启动 - 使用yum把内核升级到Kernel 2.6.0test9
?安装启动 - edHat 7.3 Live in CDROM HowTo ?安装启动 - 朋友给的mosix不敢独享 ?安装启动 - 朋友给的openmosix不敢独享
?安装启动 - 终结大硬盘安装linux采用lilo启动问题 ?安装启动 - 一张光盘的RedHat Linux 9.0(387兆) ?安装启动 - 如何在win2000下隐藏linux的分区
?安装启动 - 没有软驱安装不能光盘引导的RH7.0心得(VPC中) ?安装启动 - 如何clear mbr ?安装启动 - 操作系统的启动
?安装启动 - Linux关机重启流程分析 ?安装启动 - Linux只能以软盘引导方式进入之处理办法 ?安装启动 - 如何在一个硬盘上装好了WIN98 - NT - Linux
?安装启动 - 制作Linux的优盘启动盘 ?安装启动 - 自己定制软盘上的Linux系统 ?安装启动 - 用DOS命令破除UNIX系统管理员口令
?安装启动 - kickstart无人值守安装linux ?安装启动 - 多系统安装实践(Win2k Server、FreeBSD、RH Linux AS2.1) ?安装启动 - UNIX的启动和关机过程
?安装启动 - RedHat 9.0的“绿色”安装 ?长沙发上的对话(四) ?安装启动 - lilo的问题
?安装启动 - 35M的迷你linux系统 ?安装启动 - lilo.conf中文手册 ?安装启动 - lilo启动的故障判断
?安装启动 - 如何用lilo引导不同的运行级别 ?安装启动 - 备份和修复linux LILO指南 ?安装启动 - linux忘记了密码怎么办
?安装启动 - 用安装盘来修复GRUB ?安装启动 - GRUB使用说明 ?安装启动 - 关于mbr的存取控制
?安装启动 - 安装scsi硬盘 ?安装启动 - scsi硬盘的安装 ?安装启动 - 解决RH9.0自动升级出现的SSL连接错误
?安装启动 - 安装grub ?安装启动 - 引导linux的3种方法 ?xwindow - x终端的详细使用方法
?xwindow - 桌面不见了 ?xwindow - XF86Conifg文件详解 ?xwindow - 完美安装mplayer手册
?xwindow - linux上安装QQ ?xwindow - 关闭Mozilla的自动安装插件对话框 ?xwindow - 如何取消虚拟屏幕
?xwindow - 以指定的颜色深度启动xwindow ?xwindow - 在XWindows环境下阅读PDF ?xwindow - 开启多个xwindow
?xwindow - RH7.2下XDM如何起动 ?xwindow - xwindow问题 ?xwindow - Redhat8.0下XMMS播放Mp3快速解决方案
?xwindow - RH8中配置非即插即用声卡(ISA,SB16) ?xwindow - linux下DISPLAY的使用方法 ?xwindow - REDHAT 7.2安装声卡驱动
?xwindow - flash插件的安装 ?xwindow - 通过exceed使用KDE ?xwindow - 同时启动6个X控制台
?xwindow - 解决gnome-terminal里汉字花屏问题 ?xwindow - 显示卡攻略 ?xwindow - 声卡配置
?xwindow - nvidia显卡驱动程序的安装 ?xwindow - 使用Xwin32登陆Redhat7.2图形界面的问题 ?xwindow - linux下的X Server配置快速攻略
?xwindow - 设置和修改XWindow的显示模式 ?xwindow - Debian中升级到gnome2.2 ?xwindow - X-windows下设置墙纸
?xwindow - xf86config使用说明 ?xwindow - vnc一问 ?xwindow - ound Blaster AWE 32 - 64 HOWTO 如何在Linux设定声卡
?xwindow - Soundblaster 16 PnP Mini-Howto 如何在Linux设定16位PnP声卡 ?xwindow - 设置和修改 X Window 的显示模式 ?xwindow - Linux中的字型(FONTS)设定
?xwindow - 修改刷新率 ?xwindow - 远程xwindow登录 ?xwindow - 如何将.tif.rgb.gif......的图片转换成.xpm的格式
?长沙发上的对话(三) ?xwindow - xwindow的语言选择 ?xwindow - 万能声卡驱动(Alsa)的安装方法
?xwindow - 让RedHat允许从Windows上的X登陆 ?xwindow - 把PC变成X Server ?xwindow - exceed中添加中文字体
?xwindow - 在linux怎样设置双显卡 ?xwindow - 板载声卡的四种安装方法 ?xwindow - 万能声卡驱动(Alsa)的安装方法
?xwindow - mozilla中使用realplay插件以及文本阅读插件的使用 ?xwindow - Play Encoded DVDs in Xine ?xwindow - Linux中文拼音输入法全接触
?xwindow - SWT(implemented with gtk)的可视化控件的X11窗口句柄 ?xwindow - GObject对象系统 ?xwindow - Redhat Linux9 Gnome桌面上搭建C - C++IDE开发环境
?xwindow - rh9下图形登陆windows ?xwindow - 在RH7.2中装上VIA的AC97的板载声卡 ?xwindow - x-win32和exceed的简单使用 v0.3
?xwindow - fedora core1中flash插件不能用的解决办法 ?xwindow - 在Virtual PC 5.2上配置Debian的网络 ?xwindow - *NIX下远程连接X Server几个方法
?xwindow - gaim0.76+libqq0.25+ssl(MSN)简明安装 ?kde - qt的安装 ?kde - kde下的软件
?kde - kde快捷键 ?kde - KDE 2.1安装及使用介绍 ?kde - 什么是kde---基本概念介绍
?gnome - gnome下的软件 ?gnome - 桌面不见了 ?gnome - gnome的快捷键
?gnome - 什么是gnome---基本概念介绍 ?输入法类 - 安装ole ?输入法类 - 在redhat 7.3或8.0下用智能ABC
?输入法类 - 输入法 ?长沙发上的对话(二) ?输入法类 - xsim的安装
?输入法类 - redhat8.0上成功运行xsim ?美化汉化 - rh8汉字乱码 ?美化汉化 - 解决rh8汉字乱码
?美化汉化 - 美化rh8 ?美化汉化 - 我的RedHat8.0美化方案 ?美化汉化 - 关于RH 8.0汉化的另类选择
?美化汉化 - RedHat linux 7.2xmms的汉化 ?美化汉化 - XFree86 字体美化 Mini HOWTO ?美化汉化 - Magic Chinese RedHat 7.1汉化redhat
?美化汉化 - 关于Redhat7.2系统汉化 ?美化汉化 - RedHat 8.0下使用缺省英文系统语言,同时可阅读和输入中文的方法 ?美化汉化 - 完美的RH8+gnome+KDE使用simsun的方案
?美化汉化 - RH8-----懒人不汉化 ?美化汉化 - oss397h声卡驱动for redhat 9 ?美化汉化 - 在linux下制作MP3
?美化汉化 - 扮靓你的Red Hat Linux 7.2 ?美化汉化 - 彻底搞定7.3下kde字体发虚 ?美化汉化 - 解决gnome-terminal里汉字花屏问题
?美化汉化 - 谈谈redhat9KDE的汉化 ?美化汉化 - Firefly的Xft2 for Fedora下载安装 ?美化汉化 - 我的redhat8.0完全桌面设置
?美化汉化 - 我的Redhat 7.3汉化 - 美化过程 ?美化汉化 - Fedora 1.0 core 安装Nvidia显卡驱动 ?美化汉化 - 用开源唐体包来美化Fedora Redhat 8.0 9.0 AS 3.0
?美化汉化 - 中文支持问题 ?美化汉化 - Linux英文环境下的中文输入 ?网络配置 - rh8下加入静态路由
?网络配置 - 一个网卡绑定多个IP地址 ?网络配置 - 多个网卡绑定一个IP地址(bonding) ?网络配置 - 使用nfs进行网络备份
?网络配置 - Redhat安装和使用40问 ?网络配置 - redhat linux8.0安装和相关软件配置(包括mplayerQQ等) ?网络配置 - 安装Webmin
?网络配置 - 为什么用了bond反而变慢了呢 ?网络配置 - redhat7.2安装双网卡 ?网络配置 - 网卡的安装
?网络配置 - vnc远程控制linux主机 ?网络配置 - 设置好telnet服务 ?网络配置 - 阻止用户浏览使用外部代理
?网络配置 - red hat as 2.1 (linux) 串行控制台配置实例 ?网络配置 - route ?网络配置 - 允许root用户远程登录
?网络配置 - 根据NETBIOS名字查找计算机IP ?网络配置 - 某个端口现在运行什么监听程序 ?网络配置 - 只ping得通网关,访问局域网的资源不能在浏览器里访问网页
?网络配置 - 监视某个tty ?网络配置 - linux下如何接ADSL一类的宽带猫 ?网络配置 - 查找给出的MAC是属于什么厂商的
?网络配置 - 在用户idle一定时间以后就断开连接 ?网络配置 - 局域网实现VLAN实例 ?网络配置 - linux下使用Win Modem
?网络配置 - Setting up PPTPD on Red Hat 8.0 with RPM packages ?网络配置 - 在red hat 7.3版里安装3c905网卡 ?网络配置 - 修改机器名和Ip
?网络配置 - a network tools of linux neighborhood browser ?网络配置 - 了解网络接口的状态 ?长沙发上的对话(一)
?网络配置 - x-win32的简单使用 ?网络配置 - ADSL上网 ?网络配置 - 在命令行下,给网卡加第二个IP地址且重启有效

Yet another class for sending MIME email
2004-02-26 13



<?php
// Save as "mime_mail.inc"

class mime_mail {

var $parts;
var $to;
var $from;
var $headers;
var $subject;
var $body;


/*
* void __construct()
* Class constructor (the PHP 4 style mime_mail() name is no longer treated as a constructor in PHP 8)
*/

function __construct() {
$this->parts = array();
$this->to = "";
$this->from = "";
$this->subject = "";
$this->body = "";
$this->headers = "";
}


/*
* void add_attachment(string message,
* [string name],
* [string ctype])
* Add an attachment to the mail object
*/

function add_attachment($message, $name = "",
$ctype = "application/octet-stream") {
// Every part is sent base64-encoded (see build_message), so the encoding is fixed here.
$this->parts[] = array( "ctype" => $ctype,
"message" => $message,
"encode" => "base64",
"name" => $name);
}


/*
* string build_message(array part)
* Build one message part of a multipart email
*/

function build_message($part) {
$message = $part["message"];
// Base64-encode the content and wrap it into 76-character lines.
$message = chunk_split(base64_encode($message));
$encoding = "base64";
return "Content-Type: " . $part["ctype"] .
($part["name"] ? "; name=\"" . $part["name"] . "\"" : "") .
"\nContent-Transfer-Encoding: $encoding\n\n$message\n";
}


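/*
* string build_multipart()
* Build a multipart email
*/

function build_multipart() {
$boundary = "b" . md5(uniqid(time()));
// No spaces are allowed around "=" in a MIME parameter, so the boundary is written as boundary="...".
$multipart = "Content-Type: multipart/mixed; " .
"boundary=\"$boundary\"\n\n" .
"This is a MIME encoded message.\n\n--$boundary";

// Walk the parts in reverse so that the body text, added last by get_mail(), comes first.
for ($i = sizeof($this->parts) - 1; $i >= 0; $i--) {
$multipart .= "\n" . $this->build_message($this->parts[$i]) . "--$boundary";
}

return $multipart . "--\n";
}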


/*
* string get_mail()
* ?回已?合完成的?件
*/

function get_mail($complete = true) {
$mime = "";
if (!empty($this->from))
$mime .= "From: " . $this->from . "n";
if (!empty($this->headers))
$mime .= $this->headers . "n";

if ($complete) {
if (!empty($this->to))
$mime .= "To: $this->ton";
if (!empty($this->subject))
$mime .= "Subject: $this->subjectn";
}

if (!empty($this->body))
$this->add_attachment($this->body, "", "text/plain");

$mime .= "MIME-Version: 1.0n" . $this->build_multipart();

return $mime;
}


/*
* void send()
* 寄出?封信(最後一?被呼叫的函式)
*/

function send() {
$mime = $this->get_mail(false);
mail($this->to, $this->subject, "", $mime);
}

} // ???束

?>


<?php
/*
include "mime_mail.inc";

$filename = "testfile.jpg";
$content_type = "image/jpeg";

# Read the JPEG image from disk
$fd = fopen($filename, "rb");
$data = fread($fd, filesize($filename));
fclose($fd);

# Create the object instance
$mail = new mime_mail;

# Set all the fields
$mail->from = "your@address.com";
$mail->to = "recipient@remote.net";
$mail->subject = "Welcome!";
$mail->body = "This is a real email message,
and of course... it can span more than one line.";
# Add the attachment
$mail->add_attachment($data, $filename, $content_type);

# Send the email
$mail->send();
*/
?>

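<?php
// by Hsu
$myemail = isset($_POST["myemail"]) ? $_POST["myemail"] : "";

// Show the form unless the request came from this site with a plausible address
if (!preg_match("!^(http://)?" . getenv("SERVER_NAME") . "!", getenv("HTTP_REFERER")) ||
    !preg_match("/^.+@.+\..+/", $myemail)) {

    echo "<form method=\"post\">\n";
    echo "Your E-Mail addr: <input type=\"text\" name=\"myemail\">\n";
    echo "<input type=\"submit\">\n";
    echo "</form>\n";

} else {

    include "mime_mail.inc";

    $filename = "php_logo.gif";
    $content_type = "image/gif";

    # Read the GIF image from disk
    $fd = fopen($filename, "rb");
    $data = fread($fd, filesize($filename));
    fclose($fd);

    # Create the object instance
    $mail = new mime_mail;

    # Set all the fields
    $mail->from = $myemail;
    $mail->to = $myemail;
    $mail->subject = "Welcome!";
    $mail->body = "This is a real email message,\n" .
                  "and of course... it can span more than one line.";
    # Add the attachment
    $mail->add_attachment($data, $filename, $content_type);

    # Send the email
    $mail->send();
    echo "The mail has been sent, please check your mailbox.";
}
?>
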

将MySQL迁移到Microsoft SQL Server 2000

摘要
本白皮书描述了 Microsoft SQL Server 2000 的迁移能力,并提供了帮助开发人员将 MySQL 数据库迁移到 SQL Server 2000 的特定信息。
引言

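This guide explains how to migrate MySQL to Microsoft SQL Server 2000 using several built-in SQL Server tools and utilities. It also gives guidance on modifying MySQL applications so that they work with SQL Server 2000. If you have purchased a MySQL application, you can keep that investment working while giving the application access to the advanced features of SQL Server 2000.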
读者对象

本白皮书的读者对象可以是刚接触 SQL Server 及其操作的人,但应非常熟悉 MySQL DBMS 和普通数据库的概念。目标读者必须具备:
一般的数据库管理知识。
足够的 MySQL DBMS 基础知识背景。
熟悉 MySQL 语言。
具有 sysadmin 固定服务器角色的成员资格。sysadmin 角色对该服务器有全权控制。要想了解登录 SQL Server 的更多信息,请参见 SQL Server 2000 联机图书的“登录”一节。

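To keep the instructions clear, the reference development and application platform is the Microsoft Windows 2000 operating system with SQL Server 2000. The MySQL ODBC driver is used with MySQL, and the MySQL platform is Red Hat Linux 7.1 running MySQL 3.23.37.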
概述

MySQL 是一个开放源代码的数据库管理系统 (DBMS)。它采用客户端/服务器结构,是一个多线程、多用户的数据库服务器。MySQL 是为高速应用设计的,因此,它并不提供关系数据库系统提供的许多功能,比如子查询、外键、引用完整性、存储过程、触发器和视图。此外,它有一个锁定机制,这对同时有不同用户进行许多写操作的数据表来说是不够的。它还缺少对软件应用程序和工具的支持。
SQL Server 2000 是一个完整的关系数据库管理系统 (RDBMS),它还包括用于 OLAP 和数据挖掘的集成分析功能。SQL Server 2000 满足最大的数据处理系统和商业 Web 站点对数据及分析的存储要求,同时可以为个人和小企业提供易用的数据存储服务。
Microsoft SQL Server 的结构支持高级的服务器功能,比如行一级的锁定、高级查询优化、数据复制、分布式数据库管理以及分析服务。Transact-SQL (T-SQL) 是 SQL Server 2000 支持的 SQL 语言。
本章中提到的结构特点只是 SQL Server 2000 提供的众多特点的一部分。SQL Server 2000 联机图书是安装应用程序时可以利用的一个有用资源。要使用联机图书,请打开 Microsoft SQL Server 程序组并单击“联机图书”。
迁移过程

本章通过列出 MySQL 和 Microsoft SQL Server 2000 的结构来介绍迁移过程。本章包括以下内容:
迁移准备
数据类型、保留字和运算符
MySQL 的数据迁移工具
Microsoft SQL Server 的数据迁移工具
直接迁移:数据转换服务 (DTS)
使用数据加载能力:查询分析程序
扩展应用程序
故障排除
迁移准备

正确的迁移规划对确保最终成功极其重要。开始迁移前,请查看待迁移 MySQL 数据库的架构。比较 MySQL 和 SQL Server 2000 的数据类型,了解二者的区别。本白皮书的“比较 MySQL 与 Microsoft SQL Server”一节提供了可比数据类型的框架。注意某些 MySQL 数据库对象可能会与 SQL Server 2000 的保留字冲突。下一节中有这些保留字。使用 DTS 迁移到 SQL Server 2000 之前应该先备份并复制 MySQL 数据库文件。
数据类型、保留字和运算符

本节介绍 SQL Server 2000 中使用的数据类型。为了顺利迁移,这里提供了一张 MySQL 和 SQL Server 2000 的数据类型对照表。同时还提供了 Microsoft SQL Server 中使用的保留字列表。它包括以下信息:
支持的 SQL Server 数据类型
比较 MySQL 与 SQL Server 2000
SQL Server 保留字
支持的 SQL Server 数据类型
数据类型
说明

BIGINT
从 -2^63 (-9223372036854775808) 到 2^63-1 (9223372036854775807) 的整型数据(整数)。

INT
从 -2^31 (-2,147,483,648) 到 2^31-1 (2,147,483,647) 的整型数据(整数)。

SMALLINT
Integer data from -2^15 (-32,768) through 2^15 - 1 (32,767).

TINYINT
从 0 到 255 的整型数据。

BIT
非 1 即 0 的整型数据。

DECIMAL
从 -10^38 +1 到 10^38 -1 的固定精度和标度的数字数据。

NUMERIC
功能上相当于十进制数。

MONEY
从 -2^63 (-922,337,203,685,477.5808) 到 2^63 - 1 (+922,337,203,685,477.5807) 的货币型数据,精确到货币单位的万分之一。

SMALLMONEY
从 -214,748.3648 到 +214,748.3647 的货币型数据,精确到货币单位的万分之一。

FLOAT
从 -1.79E + 308 到 1.79E + 308 的浮点精度数字数据。

REAL
从 -3.40E + 38 到 3.40E + 38 的浮点精度数字数据。

DATETIME
从 1753 年 1 月 1 日到 9999 年 12 月 31 日的日期和时间数据,精确到三百分之一秒(3.33 毫秒)。

SMALLDATETIME
从 1900 年 1 月 1 日到 2079 年 6 月 6 日的日期和时间数据,精确到一分钟。

CHAR
最大长度 8000 个字符的固定长度非 Unicode 字符数据。

VARCHAR
最大长度 8000 个字符的可变长度非 Unicode 字符数据。

TEXT
最大长度 2^31 - 1 (2,147,483,647) 个字符的可变长度非 Unicode 数据。

NCHAR
最大长度 4,000 个字符的固定长度 Unicode 数据。

NVARCHAR
最大长度 4000 个字符的可变长度 Unicode 数据。sysname 是系统提供的用户定义数据类型,功能上相当于 nvarchar(128),用于引用数据库对象名称。

NTEXT
Variable-length Unicode data with a maximum length of 2^30 - 1 (1,073,741,823) characters.

BINARY
最大长度 8,000 个字节的固定长度二进制数据。

VARBINARY
最大长度 8,000 个字节的可变长度二进制数据。

IMAGE
最大长度 2^31 - 1 (2,147,483,647) 字节的可变长度二进制数据。

CURSOR
对光标的引用。

SQL_VARIANT
存储 SQL Server 支持的数据类型(text、ntext、timestamp 和 sql_variant 除外)值的数据类型。

TABLE
用于存储结果集合供以后处理的特殊数据类型。

TIMESTAMP
整个数据库中都唯一的一个数字,随着行的每次更新而更新。

UNIQUEIDENTIFIER
全局唯一标识符 (GUID)。


详细信息请参见 SQL Server 2000 联机图书的“数据类型”主题。
比较 MySQL 与 SQL Server 2000

下表显示了 MySQL 和 SQL Server 2000 的数据类型映射关系。对于某些 MySQL 数据类型,SQL Server 中有不止一种对应的数据类型。此表包括以下信息:
Numeric types
Date and time types
String types

注意
D:用于浮点型,表示小数点后面的位数。最大值可以是 30,但至少应大于 M-2。
L:列值的实际长度
M:表示最大显示尺寸。最大有效显示尺寸是 255。
数字类型
MySQL
大小
SQL Server 2000

TINYINT
1 字节
TINYINT

SMALLINT
2 字节
SMALLINT

MEDIUMINT
3 字节


INT
4 字节
INT

INTEGER
4 字节
INT

BIGINT
8 字节
BIGINT

FLOAT(X<=24)
4 字节
FLOAT(0)

FLOAT(25<=X<=53)
8 字节
FLOAT(25)

DOUBLE
8 字节
FLOAT(25)

DOUBLE PRECISION
8 字节
FLOAT (53)

REAL
8 字节
REAL

DECIMAL
M bytes (D+2 if M < D)
DECIMAL

NUMERIC
M bytes (D+2 if M < D)
NUMERIC


日期和时间类型
MySQL
大小
SQL Server 2000

DATE
3 字节
SMALLDATETIME

DATETIME
8 字节
DATETIME

TIMESTAMP
4 字节
TIMESTAMP

TIME
3 字节
SMALLDATETIME

YEAR
1 字节
SMALLDATETIME


字符串类型
MySQL
大小
SQL Server 2000

CHAR(m)
M 字节,1<=M<=255
CHAR

VARCHAR(m)
L+1 字节,L<=M 且 1<=M<=255
VARCHAR

TINYBLOB
L + 1 字节,L<2^8
BINARY

BLOB
L + 2 字节,L<2^16
VARBINARY

TEXT
L + 2 字节,L<2^16
TEXT

MEDIUMBLOB
L + 3 字节,L<2^24
IMAGE

MEDIUMTEXT
L + 3 字节,L<2^24
TEXT

LONGBLOB
L + 4 字节,L<2^32
IMAGE

LONGTEXT
L + 4 字节,L<2^32
TEXT

ENUM (VALUE1, VALUE2, ...)
1 或 2 字节,取决于枚举值的数量(最多 65535 个值)。
无可用数据类型,但 CHECK 约束* 提供功能。

SET (VALUE1, VALUE2, ...)
1、2、3、4 或 8 字节,取决于集合成员的最大数量



* Check 约束通过限制字段中可以接受的值,强制实现数据完整性。详细信息请参见联机图书的“CHECK 约束”主题。
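
The mapping tables above can be turned into a small lookup when a dump script has to be converted by hand. The sketch below is only an illustration: `sqlserver_type_for()` is a hypothetical helper name, not part of DTS or any SQL Server tool, and it covers only the rows that list a SQL Server 2000 equivalent. Types without one (MEDIUMINT, ENUM, SET) return null so they can be handled case by case.

```php
<?php
// Hypothetical helper: looks up the SQL Server 2000 type listed in the
// comparison tables for a given MySQL type name. Returns null when the
// tables give no direct equivalent.
function sqlserver_type_for($mysqlType)
{
    static $map = array(
        // numeric types
        'TINYINT' => 'TINYINT', 'SMALLINT' => 'SMALLINT',
        'INT' => 'INT', 'INTEGER' => 'INT', 'BIGINT' => 'BIGINT',
        'DOUBLE' => 'FLOAT(25)', 'DOUBLE PRECISION' => 'FLOAT(53)',
        'REAL' => 'REAL', 'DECIMAL' => 'DECIMAL', 'NUMERIC' => 'NUMERIC',
        // date and time types
        'DATE' => 'SMALLDATETIME', 'DATETIME' => 'DATETIME',
        'TIMESTAMP' => 'TIMESTAMP', 'TIME' => 'SMALLDATETIME',
        'YEAR' => 'SMALLDATETIME',
        // string types
        'CHAR' => 'CHAR', 'VARCHAR' => 'VARCHAR',
        'TINYBLOB' => 'BINARY', 'BLOB' => 'VARBINARY', 'TEXT' => 'TEXT',
        'MEDIUMBLOB' => 'IMAGE', 'MEDIUMTEXT' => 'TEXT',
        'LONGBLOB' => 'IMAGE', 'LONGTEXT' => 'TEXT',
    );
    $key = strtoupper(trim($mysqlType));
    return isset($map[$key]) ? $map[$key] : null;
}

var_dump(sqlserver_type_for('integer'));   // mapped type
var_dump(sqlserver_type_for('MEDIUMINT')); // no equivalent listed
```

A null result is the signal to choose a type manually, for example INT for MEDIUMINT or a CHECK constraint for ENUM.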
Microsoft SQL Server 2000 保留字
ADD
EXCEPT
PERCENT

ALL
EXEC
PLAN

ALTER
EXECUTE
PRECISION

AND
EXISTS
PRIMARY

ANY
EXIT
PRINT

AS
FETCH
PROC

ASC
FILE
PROCEDURE

AUTHORIZATION
FILLFACTOR
PUBLIC

BACKUP
FOR
RAISERROR

BEGIN
FOREIGN
READ

BETWEEN
FREETEXT
READTEXT

BREAK
FREETEXTTABLE
RECONFIGURE

BROWSE
FROM
REFERENCES

BULK
FULL
REPLICATION

BY
FUNCTION
RESTORE

CASCADE
GOTO
RESTRICT

CASE
GRANT
RETURN

CHECK
GROUP
REVOKE

CHECKPOINT
HAVING
RIGHT

CLOSE
HOLDLOCK
ROLLBACK

CLUSTERED
IDENTITY
ROWCOUNT

COALESCE
IDENTITY_INSERT
ROWGUIDCOL

COLLATE
IDENTITYCOL
RULE

COLUMN
IF
SAVE

COMMIT
IN
SCHEMA

COMPUTE
INDEX
SELECT

CONSTRAINT
INNER
SESSION_USER

CONTAINS
INSERT
SET

CONTAINSTABLE
INTERSECT
SETUSER

CONTINUE
INTO
SHUTDOWN

CONVERT
IS
SOME

CREATE
JOIN
STATISTICS

CROSS
KEY
SYSTEM_USER

CURRENT
KILL
TABLE

CURRENT_DATE
LEFT
TEXTSIZE

CURRENT_TIME
LIKE
THEN

CURRENT_TIMESTAMP
LINENO
TO

CURRENT_USER
LOAD
TOP

CURSOR
NATIONAL
TRAN

DATABASE
NOCHECK
TRANSACTION

DBCC
NONCLUSTERED
TRIGGER

DEALLOCATE
NOT
TRUNCATE

DECLARE
NULL
TSEQUAL

DEFAULT
NULLIF
UNION

DELETE
OF
UNIQUE

DENY
OFF
UPDATE

DESC
OFFSETS
UPDATETEXT

DISK
ON
USE

DISTINCT
OPEN
USER

DISTRIBUTED
OPENDATASOURCE
VALUES

DOUBLE
OPENQUERY
VARYING

DROP
OPENROWSET
VIEW

DUMMY
OPENXML
WAITFOR

DUMP
OPTION
WHEN

ELSE
OR
WHERE

END
ORDER
WHILE

ERRLVL
OUTER
WITH

ESCAPE
OVER
WRITETEXT



用于数据迁移的 MySQL 工具

MySQL 提供了几个客户端工具和实用程序,最常用的有:
mysql - 一个交互式客户程序,可以对数据库发布查询并查看结果
mysqldump - 此工具可以提取 MySQL 数据库中的架构和数据,并放到一个文件中
mysqlimport - 此工具可以读取文件中的架构和数据,并放到一个 MySQL 数据库中
mysqladmin - 此工具可以执行管理任务,比如创建数据库和删除数据库
myODBC - 一个 32 位的开放式数据库连接软件,可提供 ODBC 级别 0(有级别 1 和级别 2 的功能)驱动程序,用于将 ODBC 识别的应用程序连接到 MySQL
SQL Server 的迁移工具

SQL Server 有一组丰富的工具和实用程序,可以简化从 MySQL 的迁移。SQL Server 2000 数据转换服务 (DTS) 是一组图形化工具和可编程对象,用于从各种来源提取、转换和合并数据到一个或多个目标。
数据转换服务的功能

Microsoft SQL Server 2000 中的数据转换服务提供了从不同数据源迁移数据的方法。DTS 可以用向导程序驱动,也可以用 DTS 程序包设计器创建。DTS 向导可以快速完成数据直接复制。程序包设计器允许开发人员用多种编程语言编写自定义转换脚本。DTS 工具允许您:
将数据从 MySQL 迁移到 SQL Server 2000
在迁移前显示数据
迁移数据表、数据类型,例如文本和日期
用 MySQL 数据表迁移 MySQL 数据库
生成并查看迁移报告
自定义数据表和默认的数据类型映射规则
解决冲突,比如 SQL Server 保留字冲突
删除并重命名 SQL Server 架构模型中的对象
迁移单个数据表数据
数据转换服务术语

以下是用于描述 DTS 的术语:
DTS 程序包是一个连接、DTS 任务、DTS 转换以及工作流约束的有组织的集合,可以在 DTS 设计器中用图形化方式或用编程方式汇编在一起。
DTS 任务是一个分立的功能集合,在程序包中单步执行。每个任务都定义一个数据移动和数据转换过程中要执行的工作项目,或者一个要执行的作业。
DTS 转换是数据到达目的地之前要对它应用的一个或多个功能或操作。
DTS 程序包工作流允许数据转换服务 (DTS) 逐步运行,由优先约束对 DTS 程序包中的工作项目进行排序。您可以在 DTS 设计器中用图形方式设计 DTS 数据包工作流,也可用编程方式设计。
元数据为 DTS 提供的功能可以将程序包元数据和数据沿袭信息保存到元数据服务,并链接那些信息类型。您可以存储程序包中引用的数据库的类别元数据,以及统计与数据集市或数据仓库中特定数据行有关的历史信息。
直接迁移

将数据从 MySQL 迁移到 Microsoft SQL Server 的最直接选项是安装 myODBC 支持并创建一个 DTS 程序包,用它们将数据库从 MySQL 导入并创建到 Microsoft SQL Server。
以下是设置 Microsoft SQL Server 以迁移 MySQL 数据库的逐步操作。
安装 MyODBC 支持,它可在以下网址下载 http://www.mysql.com/
安装过程中,系统会提供以下对话框:

填写 ODBC 安装设置,使用如下信息:
Windows DSN name: test
Description: This is a test database
MySQL database: test
Server: seawolf.microsoft.com
User: cgunn
Password: my_password
Port: 3306


使用上述设置后,Windows DSN 名称在建立连接的计算机上必须唯一,服务器设置会完全验证域名(确保 DNS 或您提供的名称具备名称解析)或 IP 地址的有效性。
然后,执行 DTS 向导程序。从 Microsoft SQL Server 程序组中选择“Import and Export Data”,您会看到以下对话框。

单击 Next 到下一步。
现在提供必要的数据源选择信息,此信息应该是,ODBC 数据源为 MySQL,test 为 System DSN,然后提供安全证书、用户名和密码(见下一个对话框),然后单击 Next。

填写目标连接的详细信息,如下面的对话框所示,然后单击 Next。

Specify Table Copy or Query 对话框让您在此选择数据源中的数据库对象选项,这里的数据源是 MySQL。在来源数据库中选择 Copy Table(s) 和 View(s)。另外需要说明的重要一点是,MySQL 不支持视图,所以选择此选项后,它将只复制数据表对象,单击 Next 继续。

下一个是 Select Source Tables and View 对话框,您可以在这个对话框中选择来源数据表和目标数据表。

单击椭圆按钮进行数据转换,如下面的 Column Mappings and Transformations 对话框所示。

在这个对话框中,来源数据类型已经与目标数据类型匹配,空数据字段已经被选中。完成后,单击 OK。
然后会出现 Save, Schedule, and Replicate Package 对话框,允许您安排迁移时间,避开使用高峰期,同时允许您将 DTS 程序包用不同格式保存到不同地方。

DTS 保存程序包对话框对 DTS 程序包提供了两类密码。第一个密码是所有者密码,允许您保护程序包内的所有用户/密码信息,而用户密码用于执行程序包和防止对 DTS 程序包的任何未授权执行,如下所示,单击 Next 继续。

最后,Completing the DTS Import/Export Wizard 对话框会显示在 DTS 向导程序中所选选项的概要。

单击 Finish 开始数据迁移过程。
Executing Package 对话框显示每项任务执行时的状态。绿色对钩表示任务成功完成。如果任务不能完成,有错误终止了进程,则会出现显示此错误的错误对话框。


现在您可以成功地将数据从 MySQL 迁移到 SQL Server 2000。
使用数据加载

您可以使用与 MySQL Server 一起提供的客户程序 mysqldump 将 MySQL 数据库的架构和数据输出到各种格式的 .sql/.txt 文件。DTS 可以使用 mysqldump 输出文件为大型数据表提供脱机数据加载能力。以下主题解释了数据加载过程:
生成 mysqldump 数据提取脚本
设置脚本传输
使用提取的脚本
生成 mysqldump 数据提取脚本

MySQL 有一个实用程序可以转储数据库和数据库集合进行备份,或者将数据传输到 SQL Server。
mysqldump 实用程序提供了创建数据库 SQL 脚本的能力。
mysqldump 最简短的语法是:
Shell> mysqldump [OPTIONS] database [tables]
本白皮书后面有 mysqldump 的可用选项信息,也可以查看 MySql 参考手册获得此信息。
使用 mysqldump 后,您会获得一个数据库的 SQL 脚本。
设置脚本传输

用 mysqldump 生成脚本后,可以将脚本传输到 SQL Server - 使用类似文件传输协议 (FTP) 的应用程序将脚本从 MySQL 主机传输到 SQL Server 2000 计算机。
通过 SQL 查询分析器使用提取的脚本

生成的脚本现在可以用于创建数据库对象和插入数据。从 MySQL 脚本构建数据库架构的比较好的方法是使用 SQL Server 2000 中的 SQL 查询分析器。
您可以直接从开始菜单运行 SQL 查询分析器,也可以从 SQL Server 企业管理器运行。也可以通过执行 isqlw 实用程序从命令行运行 SQL 查询分析器。
为了让脚本正确执行,还需要一些额外的工作,这需要对 SQL 语言进行某些更改。同样,记住逐步运行 SQL 脚本,并将数据类型更改为 SQL Server 兼容类型。下图显示了从 mysqldump 导入的一个脚本,需要说明的重要一点是,转储的是一个 ASCII 脚本文件。

Microsoft SQL Server 2000 SQL 查询分析器允许您:
创建查询和其它 SQL 脚本并对 SQL Server 数据库执行这些脚本
用预定义脚本迅速创建常用数据库对象
迅速复制现有的数据库对象
无需知道参数就可以执行存储过程
调试存储过程
调试查询性能问题
定位数据库中的对象,或者查看并使用对象
在数据表中迅速插入、更新或删除行
为常用查询创建键盘快捷方式
将常用命令添加到工具菜单
扩展应用程序

将 MySQL 应用程序的数据管理部分移到 Microsoft SQL Server 后,您可以让 SQL Server 保护数据并维护所有引用完整性和用 Transact-SQL 编写的业务规则。
诸如 ADO、OLE DB 和 ODBC 这样的数据库应用程序编程接口 (API) 通过多种编程语言显示数据库数据。您可以用 Microsoft Visual C++、Microsoft Visual Basic 和 Microsoft Visual J++ 这样的开发系统访问这些 API。
此外,如果应用不断扩展,您不需要更改应用程序就可以将 Microsoft SQL Server 移到更大的计算机;SQL Server 能自动识别硬件配置,并因此自我调节,以获得最佳的内存、I/O 和处理器利用率。
从 Internet 访问数据

SQL Server 提供了将应用程序扩展到基于 Web 的接口的能力。这个能力使您可以随时随地访问应用程序。通过使用 IIS Web 服务器并在 Active Server Pages (ASP) 中使用 ActiveX 数据对象 (ADO),SQL Server 可以与 Microsoft Internet Information Services (IIS) 集成在一起,从而提供了一个访问 SQL Server 中所保存数据的快速、高效的用户接口。
详细信息请参见 http://www.msdn.microsoft.com
安全性

SQL Server 2000 中的数据库安全性既稳定又便于维护。不论是 SQL Server 还是 MySQL,重要的是要在两个层面考虑安全性。1) 能访问服务器,2) 能访问单个数据库。
MySQL 有一个独特的加强服务器访问安全性的方法 - 限制对数据源的访问。如果是客户端,则使用 IP 地址或完全合格的域名、通配符(如‘%’)。SQL Server 需要用户帐户,不论是由操作系统管理还是保存在 SQL Server 的 master 数据库中。
SQL Server 利用角色提供了组访问,这可以通过为用户组建立通用访问来方便数据库的管理。
以下步骤概要介绍了 Microsoft SQL Server 如何通过企业管理器工具提供对服务器和数据库的访问。
打开企业管理器,找到“Security Folder”,选择 Logins 图标,用鼠标右键单击并选择 New Login。

出现 SQL Server Login Properties 对话框后,输入登录名称,这与 MySQL 中的用户名类似。选择 SQL Server 身份验证以提供一个对该 SQL Server 有效的安全级别。
指定默认数据库和语言。

在对话框顶部选择 Server Roles 选项卡,以提供对服务器权限的访问信息,这里突出显示的角色是 sysadmins(系统管理员),它相当于 MySQL 中的根访问。

下一个选项卡是 Database Access。这个属性页不但提供对单个数据库的访问,而且可以访问实际位于 SQL Server 上的索引数据库。选择数据库后,再设置数据库角色。默认情况下,所有用户都可访问公共角色。这个角色仍然需要分配权限。此图中还选择了另一个角色 db_owner,它只允许用户无限制访问数据库,但不能无限制访问整个 SQL Server 或者其它数据库,除非单独选择了其它数据库并分配了 db_owner 权限。

单击 OK 后,会出现输入密码的提示。


企业管理器中出现新的登录。您还会注意到此图中有一个名为“sa”的登录帐户,这个系统管理员帐户需要有密码,在安装 SQL Server 的过程中,会有一个为此登录保留空密码的选项,您应该指定这个密码。
有关创建 Microsoft SQL Server 登录的详细信息,请参阅 SQL Server 联机图书的“管理安全性”主题。
数据库权限

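SQL Server 2000 also protects databases by restricting access to data definition language (DDL) and data manipulation language (DML) statement permissions; the setup steps are similar to creating a login. SQL Server database permissions are easy to set with Enterprise Manager.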
数据操纵语言权限
打开企业管理器,找到数据库文件夹,然后选择要设置权限的数据库。选择 users 图标,然后选择数据库用户,用右键单击并选择 Properties。

单击 permissions 按钮。

权限窗口提供了对所有数据库对象(比如数据表、视图和存储过程)设置 DML 声明的能力。选择权限后,请单击 OK。

数据定义语言权限
要想为数据库提供 DDL 声明访问,需要选择该数据库的属性。选择数据库图标并用右键单击。选择 Properties。

然后选择数据库属性窗口中的 permissions 选项卡。

选择相应权限后,单击 OK。
故障排除

本章提供以下方面的故障排除方案和有关信息:
定义用户帐户
转储 MySQL 数据
优化命令行选项
定义用户帐户

当您往系统中安装 MySQL 服务器时,会默认设置一个根用户,它是拥有全部 DBA 特权的用户帐户。您应该用根用户通过 ODBC 登录到 MySQL 服务器。(注意:默认情况下,根用户只能登录对本地主机的访问,请记住允许根用户从运行 DTS 向导的计算机 IP 或 DNS 地址登录。)
转储 MySQL 数据

下面的表提供了转储 MySQL 数据和用 mysqldump 文本文件重新生成数据库时使用的语法解释。 命令
说明

mysqldump

此工具可以将 MySQL 数据库中的架构和数据提取到一个文件中。


mysql

加载 MySQL 以便您使用命令。


-u user name

MySQL 根用户名。此用户应该有全部的 DBA 特权。


-ppassword

您的 MySQL 数据库服务器的根用户密码。


--opt

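Optimizes table dump speed and writes a dump file that reloads as fast as possible. This option turns on --add-drop-table, --add-locks, --all, --extended-insert, --quick and --lock-tables. The options enabled by --opt are listed in the section on command-line optimization options.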


databasename

您要将其内容转储到一个文本输出文件的数据库的名称。


<

用于重定向 UNIX 和 Windows NT/2000 中的输入的符号。


filename.sql

含有 MySQL 的文件名。



To dump MySQL data, redirect the output of mysqldump to a file with >:
#> mysqldump -u username -ppassword --opt databasename > filename.sql
To recreate the database from the mysqldump text file, feed the file to mysql with <:
#> mysql -u username -ppassword databasename < filename.sql
优化命令行选项

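Using --opt automatically enables a set of options on the mysqldump command line. For more on dumping MySQL data, see the section on dumping MySQL data. The table below lists the options that --opt enables:
Command
Description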

--add-drop-table

在每个 CREATE TABLE 语句之前添加 DROP TABLE If EXISTS 语句。


--all

包括所有 MySQL 特有的创建选项。


--extended-insert

写多个行插入语句


--quick

不缓存查询,直接转储到标准输出。如果使用此选项时您暂停了 mysqldump,您可能会干扰其它客户机,因为它会导致服务器等待。


--lock-tables

将所有表锁定为只读



MySQL 错误消息

本节提供了在 MySQL 数据库迁移到 SQL Server 2000 的过程中可能会出现的错误消息。
错误消息

用 DTS 迁移数据时,可能会出现以下错误消息: 错误消息
解决方案

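Cannot connect to MySQL Server
Is a MySQL server running on the system and port you are connecting to?

Possible causes:
- The source port defaults to 3306, the port MySQL uses for communication. If MySQL is configured to use a different port, change the port in the MySQL ODBC settings.
- Make sure the user has the DBA privileges needed to access the MySQL server.
- Make sure the user name is valid.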


数据库中已经有一个名为“tablename”的对象

这个数据表是在执行 DTS 程序包的过程中创建的,请确保在程序包执行过程中已经删除或重新创建了该数据表。



结论

本白皮书提供了成功将数据库架构和数据从 MySQL 迁移到 Microsoft SQL Server 2000 所需的基本信息和背景知识。对应用程序来说,SQL Server 2000 更可靠、伸缩性更强、功能更多。

搜索引擎技术核心揭密(PHP)

编者按:这是一篇精彩的编程教学文章,不但详细地剖析了搜索引擎的原理,也提供了笔者自己对使用PHP编制搜索引擎的一些思路。整篇文章深入浅出,相信无论是高手还是菜鸟,都能从中得到不少的启发。

  谈到网页搜索引擎时,大多数人都会想到雅虎。的确,雅虎开创了一个互联网络的搜索时代。然而,雅虎目前用于搜索网页的技术却并非该公司原先自己开发的。2000年8月,雅虎采用了Google(www.google.com)这家由斯坦福大学学生创建的风险公司的技术。理由非常简单,Google的搜索引擎比雅虎先前使用的技术能更快、更准确搜索到所需要的信息。

  让我们自己来设计、开发一个强劲、高效的搜索引擎和数据库恐怕短时间内在技术、资金等方面是不可能的,不过,既然雅虎都在使用别人的技术,那么我们是不是也可以使用别人现成的搜索引擎网站呢?

剖析编程思路

  我们可以这样设想:模拟一个查询,向某个搜索引擎网站发出相应格式的搜索命令,然后传回搜索结果,对结果的HTML代码进行分析,剥离多余的字符和代码,最后按所需要的格式显示在我们自己的网站页面里。

  这样,问题的关键就在于,我们要选定一个搜索信息准确(这样我们的搜索才会更有意义啊)、速度快(因为我们分析搜索结果并显示需要额外的时间),搜索结果简洁(便于进行HTML源代码分析和剥离)的搜索网站,由于新一代搜索引擎Google的各种优良特性,这里我们选择它为例,来看看用PHP怎样实现后台对Google(www.google.com)搜索、前台个性化显示这一过程。

  我们先来看看Google的查询命令的构成。进入www.google.com网站,在查询栏中输入“abcd”,点击查询按钮,我们可以发现浏览器的地址栏变成:"http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=",可见,Google是通过表单的get方式来传递查询参数并递交查询命令的。我们可以使用PHP中的file()函数来模拟这个查询过程。

了解File()函数

  语法: array file(string filename);

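  The return value is an array holding the whole file, one line per element. The file may be local or remote; a remote file must name its protocol. For example, $result = file("http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr="); simulates searching Google for the word "abcd" and returns the result page in the array $result, one line per element. Because the file is remote, the "http://" protocol prefix cannot be omitted.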

  如果要让用户输入搜索字符进行任意搜索,我们可以做一个输入文本框和提交按钮,并将上文中的被搜索字符“abcd”用变量替换:
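<?php
echo '<form>'; // a form with no attributes submits via GET to the same page
echo '<input type="text" name="keywords">'; // text input box
echo '<input type="submit" value="Search">'; // submit button
echo '</form>';

if (isset($_GET['keywords'])) // only run the rest after the form has been submitted
{
$keywords = urlencode($_GET['keywords']); // URL-encode the user's input
$result = file("http://www.google.com/search?q=" . $keywords . "&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=");
// substitute the variable into the query and store the result page in the array $result
$result_string = join(" ", $result); // join the array $result into one string, elements separated by spaces
// ... further processing
}
?>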

  上面的这段程序已经能按用户输入内容进行查询,并将返回的结果合成一个字符串变量$result_string。请注意要使用urlencode()函数将用户输入内容进行URL编码,才可以正常地对输入的汉字、空格以及其他特殊字符进行查询,这样做也是尽可能逼真地模拟Google的查询命令,保证搜索结果的正确性。
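
To see what urlencode() does to the query before any request is made, the URL can be built in a small function and printed. This is a sketch only: `build_search_url()` is a hypothetical helper name, and the fixed parameters are copied from the example URL in this article, which may no longer match what the search site expects.

```php
<?php
// Hypothetical helper: builds the search URL from raw user input.
// urlencode() turns spaces into "+" and escapes characters such as "&"
// that would otherwise break the query string.
function build_search_url($keywords)
{
    return "http://www.google.com/search?q=" . urlencode($keywords)
         . "&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=";
}

echo build_search_url("php search engine"), "\n";
echo build_search_url("a b&c"), "\n";
```

Without the encoding step, the "&" in the second input would be read as the start of a new parameter and the search term would be cut short.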

对Google的分析

  为了便于理解,现在假设我们所真正需要的东西是:搜索结果的标题。网址和简介等,这是一个简洁而典型的需求。这样,我们所要做的便是:去除Google搜索结果的台头和脚注,包括一个Google的标志、再次搜索的输入框和搜索结果说明等,并且在剩余的搜索结果各项条目中剥离原来的HTML格式标记,替换成我们想要的格式。

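  To do this we have to study the HTML source of a Google result page and find its regularities. The body of the results always sits between the first <p> tag and the second-to-last <p> tag; the second-to-last <p> tag is immediately followed by the word table, and the combination "<p><table" occurs only once in the source. We can use this to strip Google's header and footer.

  All the code below continues at the "further processing" point of the program above.

$result_string = strstr($result_string, "<p>"); // keep everything from the first <p> on, dropping Google's header
$position = strpos($result_string, "<p><table"); // position of the "<p><table" marker
$result_string = substr($result_string, 0, $position); // keep everything before the marker, dropping the footer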
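
The same strstr/strpos/substr trimming can be tried on a made-up page without touching the network. In this sketch `cut_between()` is a hypothetical helper and the two markers are assumptions chosen for the sample string; a real result page has to be inspected to find markers that actually occur in it.

```php
<?php
// Hypothetical helper: keeps the text from the first start marker up to
// (but not including) the first end marker that follows it.
function cut_between($html, $startMarker, $endMarker)
{
    $rest = strstr($html, $startMarker);   // drop everything before the start marker
    if ($rest === false) {
        return '';                         // start marker not found
    }
    $end = strpos($rest, $endMarker);      // locate the end marker in what remains
    return $end === false ? $rest : substr($rest, 0, $end);
}

// A made-up page standing in for the downloaded result source.
$page = "<html>header<p>result one<p>result two<p><table>footer</table></html>";
echo cut_between($page, "<p>", "<p><table"), "\n";
```

Checking the return value of strstr() and strpos() against false matters here: if the page layout changes and a marker disappears, the unchecked version would silently return an empty or truncated string.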

应用与实现

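  OK, we now have the useful core of the HTML source; what remains is displaying it our own way. Looking at the result entries again, they are regularly separated by <p>, each forming its own paragraph, so we can split them with explode():

  Syntax: array explode(string separator, string string);

  It returns an array holding the substrings obtained by cutting the string at each separator.

  So:
$result_array = explode("<p>", $result_string); // split the results at "<p>"

  This gives an array $result_array in which each element is one search result entry. All we need to do is study each entry and its HTML formatting and replace what we want. A loop handles every entry in $result_array:
for ($i = 0; $i < count($result_array); $i++) {
// ... process each entry
}

  Each entry also has a recognizable shape: it consists of a title, excerpt, description, category, URL and so on, each on its own line, that is, separated by a <br> tag. So we split again (the code below goes inside the loop above):
$every_item = explode("<br>", $result_array[$i]);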

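  This gives an array $every_item in which $every_item[0] is the title and $every_item[1] and $every_item[2] are the two excerpt lines. If $every_item[3], $every_item[4] and so on begin with "简介:" or "< font size=-1 color=#6f6f6f>类别:< /font>", they are the description or the category (not every entry has them); if one begins with "< font color=green>" it is certainly the URL. Such checks are usually done with regular expressions (omitted here). Replacing is just as easy. $every_item[0], the title, carries a link, and we want to change the link so that it opens in a new window:
echo eregi_replace('<a ', '<a target="_blank" ', $every_item[0]);
for ($j = 1; $j < count($every_item); $j++) {
// ... process every item of the entry except the first (the title, already shown)
// ... more formatting changes
}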

  这样就修改了链接属性,其余很多显示格式的修改、剥离、替换都能用正则替换eregi_replace()来完成。
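
eregi_replace() was removed in PHP 7, so on current PHP versions the same replacement is normally written with preg_replace() and the case-insensitive /i flag. The sketch below shows one way to add the new-window attribute; `open_in_new_window()` is a hypothetical helper name, and the pattern assumes the anchor tag is written as "<a" followed by whitespace.

```php
<?php
// Hypothetical helper: makes every link in a fragment open in a new window.
// The /i flag matches both <a ...> and <A ...>.
function open_in_new_window($html)
{
    return preg_replace('/<a\s+/i', '<a target="_blank" ', $html);
}

echo open_in_new_window('<A href="http://www.php.net/">PHP</A>'), "\n";
```

Text without an anchor tag passes through unchanged, so the function can be applied to every item of an entry without checking first.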

  至此我们已经得到了每个搜索条目的每一项,并能任意修改每项的格式,甚至可以给他套上漂亮的表格。然而一个好的程序应该能适应各种运行环境的,这里也不例外,我们其实还只是讨论了搜索结果的HTML剥离的一种框架方法,真正要做得完美,还要考虑很多内容,比如要显示一共搜索出多少结果,分成多少页等等,甚至还可以刨除与Google相关的那些“类别”、“简介”等代码,让客户根本看不到原始网站。不过这些内容和要求我们都能通过分析HTML进行剥离得到。现在大家完全能自己动手,做个极富个性化的搜索引擎啦。

如何将动态页面改为用HTM格式访问(php/asp通用)

Internet上网站数量的增多,网站的宣传越来越多地依赖搜索引擎的搜索结果,怎样让搜索引擎更好地为站点服务与提高站点的访问量有着非常密切的关系。搜索引擎并非是上帝的赠送给Internet产物,搜索引擎其本身既是站点,同时也是由各个程序来建设的。而各种搜索引擎一般都使用一种称为搜索引擎机器人的技术,这种机器人会根据一定规则的在Internet上访问站点,并把有价值的页面收集到搜索引擎的缓存数据库保存。一旦有用户来搜索,那么搜索引擎会直接在其缓存中搜索结果,并将结果报给用户。

搜索机器人的查找规则比较复杂,但是其中有一个很重要的规则,就是搜索机器人对静态页面的处理能力要强于动态页面。一般情况下搜索机器人简单的把静态页面理解为扩展名成.html或者.htm的页面,而将扩展名是.ASP、.PHP及.CGI的页面理解成动态页面。换言之如果一个站点都是.html页面,那么它被搜索引擎全文搜到的可能性就要比.PHP的页面高几个数量级,当然因此而来的访问量也会高出很多。


如何把自己站点的内容全都静态页面化,最简单的做法自然是每个页面都用页面设计软件直接作成静态页面,这对小型站点不是难事,但是对页面总数上万的大中型站点,都用手工的静态页面设计就会带来高昂的成本和保存、修改上的困难。在这种情况下,资金雄厚的大网站会采用能在后台生成.html文件的内容管理(CMS)系统管理。无论是手工做的.html文件,还是后台生成的.html文件,都能实现真正意义上的静态页面。

但仍有相当数量的中型站点采用动态发布的CMS系统,动态系统对网页的更新效率很高,可在后台发排的同时在前台显示,缺点是要消耗相当量的服务器资源,同时得到一堆扩展名为.ASP.PHP的页面。要完全替换CMS系统并不容易,而且具有静态页面后台生成功能的成熟CMS系统价格都很高昂。

动态CMS系统有无简单获取.html文件扩展名的方法?当然有,采用URL重写转向功能。

对URL重写转向的支持,在Apache服务器上由一非缺省模块(mod_rewrite)来完成,这个模块的功能很强大,同时也很烦琐。而在IIS下也同样有类似的模块,分别是ISAPI REWRITE及IIS REWRITE。无论是在Apache下还是在IIS下,重写转向的语法都基于正则表达式,只有少量的不同。当然对一般的应用,没必要把所有手册和说明文档翻熟,下面以一个虚拟的http://www.siyizhu.com动态站点为例介绍一些简单的方法,读者可以根据自己网站的情况做调整。

网络栏目:http://www.siyizhu.com/content.asp?sort=3

With ISAPI_Rewrite installed on IIS, one rule is all that is needed: RewriteRule /content/(\d+)\.html /content.asp?sort=$1 [N,I]

这样就将:/content/3.html 这样的请求映射成为/content.asp?sort=3

然后通过:http://www.siyizhu.com/content/3.html 同样能访问到刚才的页面。
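
What the rule does to a request path can be reproduced with a regular expression in PHP. This is only a simulation for checking the pattern: the real rewriting happens inside the web server before PHP or ASP ever sees the request, and `rewrite_content_url()` is a hypothetical name.

```php
<?php
// Hypothetical helper: applies the same pattern and substitution as the
// rewrite rule to a request path. Paths that do not match are returned as-is.
function rewrite_content_url($path)
{
    return preg_replace('#^/content/(\d+)\.html$#i', '/content.asp?sort=$1', $path);
}

echo rewrite_content_url('/content/3.html'), "\n";
echo rewrite_content_url('/about.html'), "\n";
```

The escaped dot (\.) and the digit class (\d+) are the two details that matter: an unescaped dot matches any character, and a bare "d+" would only match the letter d.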

另一个更通用的能将所有的动态页面做参数映射的表达式是:

RewriteRule (.*?\.php)(\?[^/]*)?/([^/]*)/([^/]*)(.+?)? $1(?2$2&:\?)$3=$4?5$5: [N,I]

这样就把http://www.siyizhu.com/foo.php?a=A&b=B&c=C表现成http://www.siyizhu.com/foo.php/a/A/b/B/c/C。

当然用URL重写转向而得的.html的URL实质上还是个动态页面,只是搜索引擎上的机器人及浏览器上的链接与正常的静态页面一摸一样,URL对用户的亲和度非常高。即便是在用模块方式运行的Apache下,这样或多或少都会有一些性能上的损失。同时如果真的把论坛这种更新非常快的内容也让Google搜索进去并不能方便用户,有时候还会带来很多负面影响。所以URL重写转向最合适的用途是一些中小型CMS动态页面发布平台,以便让搜索引擎能记录下主页内容从而让更多的人能搜索到。

用正确的小汽车对象学习和熟悉类的概念

很多书讲到类总喜欢拿小汽车来做例子,但是有些例子实在是又臭又烂误人子弟,骗人钱财,毁人前程,弱智低级到瞎编一个什么 set_color()函数来教人。实在是白白糟踏了好东西。今天在phpx.com又看到一个受害者,忍不住花了两个小时写了这个教程。

闲话少说,我们来正经的,我们的小车可不是随便让人图图颜色就完了(只能图颜色的是废车)。我们的这个小车不但能够到处乱跑,而且装备了高级GPS全球定位系统,油表,里程表。由于使用了面向对象的技术,驾驭这样的一部小汽车一点都不难。

举例子首先要提供一些背景材料。我们有一辆小汽车,可以在一个拥有xy坐标的地图上按照东南西北方向任意的行驶,你可以设定小车行驶的方向和距离,小车会向你汇报它的坐标位置。

其实学习类应该和我们学习其它事物一样,从学习使用开始,然后再学习他的原理。所以我们先来熟悉一下如何正确驾驶这样的一个小汽车:


<?php
$startPoint = new Position(3, 9); // starting coordinates x=3, y=9

$myCar = new Car(500, $startPoint); // a new car with 500 litres of fuel, starting at $startPoint

$myCar->setHeading('s'); // set the heading: s = south, n = north, w = west, e = east

if ($myCar->run(100)) // drive 100 km; on success show the remaining fuel, otherwise show the warning
{
print('<br>The car is fine, fuel remaining: ' . $myCar->getGas()); // read the fuel gauge
}
else
{
print('<br>The car has a problem: ' . $myCar->getWarning()); // show the warning message
}

$myPosition = $myCar->getPosition(); // get the car's current position

print('<br>My car is now at X:' . $myPosition->getX() . ' Y:' . $myPosition->getY()); // show the car's coordinates
?>
先给自己制造一个小汽车,并且给他装备上一个定位对象 Position。 然后设定方向, 然后让小车奔跑。 最后检查并输出小车的方位。 复杂么?很难理解吗? 虽然这里我们用到了两个对象(类):Car 和 Position 但是我相信即使是初学者也不会觉得上面的代码很困难。

我们学会如何开车了以后,再来仔细看一看这个小车对象是怎样工作的。定义一个对象其实很简单只需要 用一个关键字class 和一对{}就可以了,所以我们这样定义这两个对象:

class Car {}
class Position{}

当然,仅仅这样的两个类什么也做不了,我们还需要给他们增加一些功能,先从小汽车开始,我们需要能够给小车设定方向并且让小车奔跑所以我们增加两个方法,也就是2个函数只不过这两个函数包含在小车对象内只有通过小车对象才可以使用。

setHeading()
run()
class Car
{
function setHeading($direction)
{

}

function run($km)
{

}
}

特别提示:设计一个良好的类的窍门是从如何使用它下手,也就是说先考虑这个对象应当有哪些方法,而不是先确定它有哪些属性。
为了更好的了解小车的状况我们还需要这些方法:
getGas() 获得小车当前的燃油数
getPosition() 获得小车当前的位置
getWarning() 警报信息
为了完成这些功能我们的小车还需要自己的油表,警报消息,和定位仪。我们把这些也添加到 Car 类中,同时我们还给这个类增加了一个初始化的函数 这个函数名字和类的名字一样,这样就有了一个大体的框架。


class Car
{
/**
* 小车的汽油量
*
*@var
*@access
*/
var $gas;

/**
* 里程记录
*
*@var
*@access
*/
var $meter;

/**
* 车的位置(由GPS自动控制)
*
*@var Object position
*@access private
*/
var $position;

/**
* 发动机每1公里耗油量,这个车是0.1升
*
*@var Integer
*@access private
*/
var $engine=0.1;

/**
* 警报信息
*
*@var
*@access
*/
var $warning;

/**
小车的初始化。新车出场当然要
1、加汽油。
2、里程表归零。
3、清除警报信息。
4、设定出发位置。
*/
function Car($gas,&$position)
{
$this->gas= $gas; //加汽油
$this->meter = 0;
$this->warning =''; //清除警报信息
$this->position = $position; //设定出发位置
}

function getWarning() //返回警报信息
{
return $this->warning;
}

function getGas() //返回汽油表指数
{
return $this->gas;
}

function &getPosition()
{
return $this->position; //返回当前小车的位置
}

function setHeading($direction='e')
{

}

/**
* 开动小汽车
*@access public
*@param INT 公里数
*/
function run($km)
{

}

}
?>
这时候最关键的两个方法 setHeading 和 run 就变得简单了,由于小车装备了 Position 对象 $this->position, 所以关于坐标定位的事情它也不用管了, 交给 Position 对象好了, 他自己只要管理好自己的油表,里程表就可以了。完成以后的 Car 类变成这个样子了:


class Car
{
/**
* 小车的汽油量
*
*@var
*@access
*/
var $gas;

/**
* 里程记录
*
*@var
*@access
*/
var $meter;

/**
* 车的位置(由GPS自动控制)
*
*@var Object position
*@access private
*/
var $position;

/**
* 发动机每1公里耗油量,这个车是0.1升
*
*@var Integer
*@access private
*/
var $engine=0.1;

/**
* 警报信息
*
*@var
*@access
*/
var $warning;


/**
小车的初始化。新车出场当然要
1、加满汽油。
2、里程表归零。
3、清除警报信息。
4、设定出发位置。
*/
function Car($gas,&$position)
{
$this->gas= $gas; //加满汽油
$this->meter = 0;
$this->warning =''; //清除警报信息
$this->position = $position; //设定初始位置
}

function getWarning() //返回警报信息
{
return $this->warning;
}

function getGas() //返回汽油表指数
{
return $this->gas;
}

function &getPosition()
{
return $this->position; //返回当前小车的位置
}

function setHeading($direction='e')
{
$this->position->setDirection($direction); //因为使用了Position 对象,小汽车不需要自己来操心XY坐标值了,交给Position 对象吧。
}

/**
* 开动小汽车
*@access public
*@param INT 公里数
*/

function run($km)
{
$goodRunFlag = true;//是否成功完成任务。
$maxDistance = $this->gas/$this->engine; //小车能够跑的最大距离。

if(($maxDistance)<$km)
{
$this->warning = '没有汽油了!';//设定警告信息,能跑多远就跑多远吧。
$goodRunFlag = false;//但是任务肯定完成不了。
}
else
{
$maxDistance=$km; //没有问题,完成任务以后就可以停下来休息了。
}
$this->position->move($maxDistance);//在坐标上移动由Position对象来完成,小汽车只要负责自己的油耗和公里表就可以了。
$this->gas -= $maxDistance*$this->engine;//消耗汽油
$this->meter += $maxDistance; //增加公里表计数
return $goodRunFlag;
}
}
?>
讲到这里我想我的这篇文章也该结束了。别着急,我当然还记得 Position 类还没有完成,但是有了上面小汽车的例子 Position 应该就非常简单了, 如果你理解了这个小汽车的类, 现在就是你一展身手的时候了, 你来完成这个Position 对象吧, 我相信你能够完成它(其实这正是面向对象和封装的美妙之处)。你需要记住先从Position 的方法开始设计比如:


getX()
getY()
move()
setDirection()
所谓类就是指某一类的事物,它可以是具体的(Car)也可以是抽象的(Position),我们通过封装简化了使用和操作就像我们使用电视,手机一样一点都不复杂。
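
For readers who want to compare notes after trying the exercise, here is one possible Position class. It is a sketch, not the author's version: the article only names the four methods, so the coordinate convention (north increases y, east increases x), the default heading and the handling of unknown directions are all assumptions made here.

```php
<?php
// One possible Position class for the exercise above.
// Assumptions: north increases y, east increases x, default heading is east,
// and an unknown direction letter leaves the heading unchanged.
class Position
{
    private $x;
    private $y;
    private $direction = 'e';

    public function __construct($x, $y)
    {
        $this->x = $x;
        $this->y = $y;
    }

    public function getX() { return $this->x; }
    public function getY() { return $this->y; }

    public function setDirection($direction)
    {
        $direction = strtolower($direction);
        if (in_array($direction, array('n', 's', 'e', 'w'), true)) {
            $this->direction = $direction;
        }
    }

    // Move the given distance along the current heading.
    public function move($distance)
    {
        switch ($this->direction) {
            case 'n': $this->y += $distance; break;
            case 's': $this->y -= $distance; break;
            case 'e': $this->x += $distance; break;
            case 'w': $this->x -= $distance; break;
        }
    }
}

$p = new Position(3, 9);
$p->setDirection('s');
$p->move(100);
echo 'X:', $p->getX(), ' Y:', $p->getY(), "\n";
```

Because the car only ever calls setDirection() and move(), the convention could be flipped (south increasing y, as on a screen) without touching the Car class at all, which is the point the article makes about encapsulation.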

一篇好的入门教程应该

1.生动真实的例子。
2.不但提供了正确的概念,在变量和函数命名,函数封装和调用上也值的学习。
3.即便你熟悉了面向对象编程以后也不会认为当初的例子有什么不妥之处。
4.如果你读完教程动手的话一定能够深刻体会到教程的美妙之处,大大减少了走弯路的机会。
5.好的代码是可以被人像书一样读懂,你认为呢?

PHP实现文件安全下载

  你一定会笑我"下载文件"如此简单都值得说?当然并不是想象那么简单。例如你希望客户要填完一份表格,才可以下载某一文件,你第一个想法一定是用"Redirect"的方法,先检查表格是否已经填写完毕和完整,然后就将网址指到该文件,这样客户才能下载,但如果你想做一个关于"网上购物"的电子商务网站,考虑安全问题,你不想用户直接复制网址下载该文件,笔者建议你使用PHP直接读取该实际文件然后下载的方法去做。程序如下:

<?php
$file_name = "info_check.exe";
$file_dir = "/public/www/download/";

if (!file_exists($file_dir . $file_name)) { // check that the file exists
    echo "File not found";
    exit;
} else {
    $file = fopen($file_dir . $file_name, "rb"); // open the file in binary mode

    // send the file headers
    Header("Content-type: application/octet-stream");
    Header("Accept-Ranges: bytes");
    Header("Content-Length: " . filesize($file_dir . $file_name));
    Header("Content-Disposition: attachment; filename=" . $file_name);

    // send the file contents
    echo fread($file, filesize($file_dir . $file_name));
    fclose($file);
    exit;
}

  而如果文件路径是"http"或者"ftp"网址的话,则源代码会有少许改变,程序如下:

<?php
$file_name = "info_check.exe";
$file_dir = "http://www.easycn.net/"; // a remote path must include the protocol

$file = @fopen($file_dir . $file_name, "rb");
if (!$file) {
    echo "File not found";
} else {
    Header("Content-type: application/octet-stream");
    Header("Content-Disposition: attachment; filename=" . $file_name);
    while (!feof($file)) {
        echo fread($file, 50000); // pass the file through in 50000-byte pieces
    }
    fclose($file);
}

  这样就可以用PHP直接输出文件了
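
When the file name comes from user input rather than a fixed string, it is worth building the headers in one place and stripping any directory part from the name first. The sketch below shows that step only; `download_headers()` is a hypothetical helper name, and quoting the file name in Content-Disposition is a choice made here so that names containing spaces survive.

```php
<?php
// Hypothetical helper: returns the header lines for a file download.
// basename() removes any directory component, so "../secret/x.exe"
// is offered to the browser as plain "x.exe".
function download_headers($fileName, $size)
{
    $safeName = basename($fileName);
    return array(
        "Content-Type: application/octet-stream",
        "Content-Length: " . (int) $size,
        "Content-Disposition: attachment; filename=\"" . $safeName . "\"",
    );
}

print_r(download_headers("../secret/info_check.exe", 1024));
```

In a real script each returned line would be passed to header() before any output, followed by readfile() on the server-side path.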

用php写的进度条

iwind 2004-03-01 15


<?php
// by iwind netcool@163.com
set_time_limit(3600);
ob_end_clean();
for ($i = 1; $i <= 300; $i++) echo(" ");
$file = "jicheng.rar";        // the file to upload
$obj = "upload/website.rar";  // target file, i.e. where the upload goes
$length = 100;                // length of the progress bar (may not be exact)
$pimg = "pro.gif";            // progress bar image
$csize = 100000;              // bytes copied per step

$size = filesize($file);
if (file_exists($obj) && is_file($obj)) {
    $fsize = filesize($obj);  // continue after what has already been copied
} else {
    $fsize = 0;
}
$in = fopen($file, "rb");
$data = fread($in, $size);
fclose($in);
$nums = ceil(($size - $fsize) / $csize);
echo "";
$out = fopen($obj, "ab");     // open the target once instead of on every pass
for ($i = 0; $i < $nums; $i++) {
    $start = $fsize + $i * $csize;
    $cdata = substr($data, $start, $csize);
    $msize = strlen($cdata);
    fwrite($out, $cdata);
    echo "";
    flush();
    sleep(1);
}
fclose($out);
echo "Upload complete";

?>
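
The core of the script, copying in fixed-size pieces and reporting after each one, can be separated from the page output. The sketch below does that with a callback; `copy_with_progress()` and the callback parameter are hypothetical names, and unlike the script above it streams from the source file instead of loading it into memory first.

```php
<?php
// Hypothetical helper: copies $source to $target in chunks of $chunkSize
// bytes and calls $onProgress with the percentage done after each chunk.
// Returns the number of bytes written.
function copy_with_progress($source, $target, $chunkSize, $onProgress)
{
    clearstatcache();
    $total = filesize($source);
    $in = fopen($source, "rb");
    $out = fopen($target, "wb");
    $done = 0;
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // nothing left to read
        }
        $done += fwrite($out, $chunk);
        $onProgress($total > 0 ? (int) floor($done * 100 / $total) : 100);
    }
    fclose($in);
    fclose($out);
    return $done;
}

// Usage: copy a 250-byte temporary file in 100-byte chunks.
$src = tempnam(sys_get_temp_dir(), 'src');
$dst = tempnam(sys_get_temp_dir(), 'dst');
file_put_contents($src, str_repeat('a', 250));
copy_with_progress($src, $dst, 100, function ($percent) {
    echo $percent, "%\n";
});
unlink($src);
unlink($dst);
```

In a web page the callback would print whatever markup updates the bar and then call flush(), which is the role the empty echo statements play in the script above.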

Wednesday, August 17, 2005

www.canvasrus.com.au/

手头有一个项目,一个购物网站,例子:http://www.canvasrus.com.au/

这里写一个帖子,作为记录,总结,一边做一边写,希望对以后的项目有大的帮助。


已经完成的:
database design
GD libary 分割图片,参考
http://www.nyphp.org/content/presentations/GDintro/gd1.php

alpha
上传图片,储存图片路径,生成缩略图。
参考
http://www.devarticles.com/c/a/PHP/Creating-a-MultiFile-Upload-Script-in-PHP/1/

alpha1
改用 Build An Automated PHP Gallery System In Minutes
 参考 http://www.sitepoint.com/article/php-gallery-system-minutes

alpha2
图片的管理功能完全修改完毕,以适应现在的数据库
找到 shopping cart 例子
根据
Dreamweaver Article
Building a Persistent Shopping Cart with PHP and MySQL
http://www.macromedia.com/devnet/dreamweaver/articles/php_cart.html
admin部分全部完成,有log in,
(参考http://www.tutorialized.com/tutorial/Simple-login-script-with-Sessions/5288)
图片和各种产品的修改



2005-8-17
alpha3
To do:
shopping cart
根据
Dreamweaver Article
Building a Persistent Shopping Cart with PHP and MySQL
http://www.macromedia.com/devnet/dreamweaver/articles/php_cart.html
改写

改写products.php基本完毕,目前每次只能添加一个商品

2005-8-18
alpha3
To do:
完成cart.php,实现购买,查看和删除商品
给出所有product的preview链接

问题:是否要给出每次添加n个的选项?
页面间跳转注意给出的参数是否正确?
用户上传部分的具体细节要跟client讨论.

Done:
完成cart.php,实现购买,查看和删除商品
给出所有product的preview链接

出现的问题:sql的问题,已经解决

New problem: images generated by the GD library have a color cast, cause unknown; I have asked on the PHP user group New Zealand list.

答案:
Okay here try this :) (btw this can resize the image too)





$theImage = imagecreatetruecolor($newWidth, $newHeight);

$theSource = imagecreatefromjpeg($sourceImage);



// okay do it

imagecopyresampled($theImage, $theSource, 0, 0, 0, 0, $newWidth, $newHeight, $currentWidth, $currentHeight);

根据上面,把生成缩略图的地方也改为imageCreateTrueColor
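
imagecopyresampled() needs the target width and height up front, so a thumbnail routine usually computes them from the source size first. The sketch below shows one way to do that while keeping the aspect ratio; `fit_within()` is a hypothetical helper name and the 200x200 box is just an example value.

```php
<?php
// Hypothetical helper: returns array(width, height) scaled to fit inside
// the given box without changing the aspect ratio and without enlarging.
function fit_within($width, $height, $maxWidth, $maxHeight)
{
    $scale = min($maxWidth / $width, $maxHeight / $height, 1);
    return array(
        max(1, (int) round($width * $scale)),
        max(1, (int) round($height * $scale)),
    );
}

list($newWidth, $newHeight) = fit_within(800, 600, 200, 200);
echo $newWidth, "x", $newHeight, "\n";
```

The two results are what would be passed as $newWidth and $newHeight to imagecreatetruecolor() and imagecopyresampled() in the answer quoted above.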

现在整个网站基本成形

To do:
credit card gateway
user input details
write order to db
user upload image and order based on this image

beta1
解决的问题:
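After upload some files collapsed into a single line. Fixed by creating a new blank file in UltraEdit, pasting in the earliest version that still had line breaks, and saving it under a new name. It looks like a DOS-format issue, probably the difference between \n, \r and \r\n line endings on different operating systems.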
另一个是mysql查询的问题
DO:
$query = "SELECT * FROM table";
$result = mysql_query($query);

DO NOT:
$result = mysql_query("SELECT * FROM table");

Monday, July 18, 2005

锦集:2000-XP-2003操作系统常见问题(一)

第一部分:安装启动问题

1、关于所有版本XP在安装进度还剩下34分钟进度条就停止不动的说明以及解决方法

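At the point where 34 minutes remain, that is, during device installation, the screen stays frozen and the hard disk light is off. The most likely cause is USB 2.0 on the motherboard: Windows XP did not support USB 2.0 when it was first released, and only supports USB 2.0 devices once Service Pack 1 is installed. If a mouse or keyboard is attached to a USB 2.0 port during setup, the problem appears because XP cannot recognize the USB 2.0 device while installing devices.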

解决方法:在安装之前用PS/2的鼠标换下USB的鼠标,或者在BIOS设置不加载USB设备,等安装结束以后再设置为加载。


2、双启动菜单丢失

故障现象:这是多操作系统不注意安装顺序经常出现的问题,如Windows 98和Windows 2000双系统,在重装Windows 98后,双启动菜单就会丢失。

解决之道:用Windows 2000启动光盘启动电脑并选择“安装新的Windows 2000”,按默认状态安装。在“复制文件”过程结束后安装程序会给出一个“正在重新启动计算机”的对话框,请马上单击“不要重新启动”按钮以退出安装过程。

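If you did not press the button in time and the system has already restarted, that is fine too. You will see a boot menu with three entries; choose the first or the third. Once the system is up, open drive C: and you will find Windows 2000 setup files in the root directory: the folder $WIN_NT$.~BT and five files, $DRVLTR$.~_~, $LDR$, boot.bak, bootsect.dat and txtsetup.sif. The other partitions will each contain a $DRVLTR$.~_~ file as well. Delete them all.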

Then, in Windows 98, choose "Show all files" under the View options and edit boot.ini in the root of drive C:. In the [boot loader] section, change "default=C:\$WIN_NT$.~BT\BOOTSECT.DAT" to "default=C:\" (Windows 98 as the default system) or to "default=multi(0)disk(0)rdisk(0)partition(1)\WINNT" (Windows 2000 as the default system). Then, in the [operating systems] section, delete the line C:\$WIN_NT$.~BT\BOOTSECT.DAT="Microsoft Windows 2000 Professional 安装程序".


3、XP系统启动时出现NTLDR is missing的错误提示

出现这种情况一般有以下两种情况:

1)ntldr文件丢失/破坏:这个文件位于C盘根目录,我们只需要从WinXP安装光盘里面提取这个文件,然后放到C盘根目录上即可。

2)如果替换文件后仍出现上述提示,则可以按以下方法进行修复:

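Boot the computer from a Win9X startup disk that contains SYS.COM and run SYS C:. After a restart WinXP will not boot; this is expected. Restart again, boot from the WinXP installation CD, enter the Recovery Console and run fixboot. If you do not know how to use the Recovery Console and have an XP/9x dual-boot setup, you can instead start the WinXP installation from within Win9X. After copying files the system restarts; as it starts, quickly press the up or down arrow key and choose Windows to return to Win9X. Then edit Boot.ini so that it matches the WinXP installation on your machine, and finally delete every file in the root of drive C: whose name begins with $.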

顺便说一下,如果XP/9x双系统中的9x系统启动失败并提示I/O错误,可能是错误删除C盘根目录上的一个启动Win9x的重要文件bootsect.dos造成的,修复方法同上述2)所示。


4、XP系统启动时提示找不到HAL.DLL文件,启动中止

这个是由于C盘根目录下的boot.ini文件非法,导致默认从C:\Windows启动,但是又由于你的WinXP没有安装在C盘,所以系统提示找不到HAL.DLL文件。启动因而失败。解决方法是重新编辑Boot.ini文件。可用的方法有很多,在此不一一详述,最简单的方法是使用故障恢复控制台里面的bootcfg命令,当然也可以在别的电脑上创建好以后,拷贝到受损电脑的C盘根目录上覆盖源文件。


5、安装的简体中文版的WinXP在启动选单的时候出现的是英文提示

一般是由于C:\Bootfont.bin丢失造成的,但是如果你安装了更高版本的英文版本的Windows,那么这个现象就是很正常的。如果没有的话,从WinXP安装光盘里面提取bootfont.bin到C盘根目录即可。


6、安装 Windows 2000 后2000/xp双系统中无法启动 Windows XP

试图启动 Windows XP 时,您可能会收到下面的错误消息:

"Starting Windows...

Windows 2000 could not start because the following file is missing or corrupt: \WINDOWS\SYSTEM32\CONFIG\SYSTEMd startup options for Windows 2000, press F8.

You can attempt to repair this file by starting Windows 2000 Setup using the original Setup floppy disk or CD-ROM.

Select 'r' at the first screen to start repair."

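The cause is that Windows XP did not exist when Windows 2000 was released. The Windows 2000 boot loader is unaware of the changes made in Windows XP, and the computer needs those changes in order to load Windows XP. To resolve the problem, start the computer in Windows 2000 and copy NTLDR, Bootfont.bin and Ntdetect.com from the I386 folder of the Windows XP CD to the root of the system drive.


7. Drive C: opens automatically when the system starts

Open Windows Optimization Master (优化大师) → System Performance Optimization → Desktop Menu Optimization, and clear "Create separate processes for the desktop and Explorer at startup". If that does not help, check the programs loaded at startup for anything suspicious.

8. Autostart programs in Windows 2000/XP

When Windows finishes logging on and the mouse pointer settles from busy to idle, what do you see besides the desktop icons? Perhaps nothing looks different on the surface, but have you noticed how many icons have appeared in the system tray and how many processes are in the process list? Windows loads many programs automatically at startup. Do you know where they are loaded from?

Autostart undeniably brings a lot of convenience, but is every autostart program actually useful? Worse, a virus or trojan may be among them without your knowing.

By now you probably agree that it is worth knowing where autostart entries hide. They are listed one by one below, so that they have nowhere left to hide.

Apart from the Autoexec.bat file inherited from earlier systems, Windows 2000/XP loads autostart programs from two folders and nine core registry subkeys.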

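1) The "Startup" folder: the most common autostart location. It is on the system partition under "Documents and Settings → User → Start Menu → Programs", where User is the name of the logged-on user.

2) The "All Users" Startup folder: another common autostart location. It is on the system partition under "Documents and Settings → All Users → Start Menu → Programs". The "Startup" folder above runs programs for the logged-on user only, whereas programs in the "All Users" folder run for every user, whichever account you log on with.

3) The "Load" value: a registry value buried fairly deep, at [HKEY_CURRENT_USER\Software\Microsoft\Windows NT\CurrentVersion\Windows\load].

4) The "Userinit" value: located at [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\Userinit], it is also used to load programs at startup. Its default data is normally "userinit.exe". Because the value may hold several programs separated by commas, other programs can be appended to it.

5) The "Explorer\Run" key: unlike "Load" and "Userinit", "Explorer\Run" exists under both the [HKEY_CURRENT_USER] and [HKEY_LOCAL_MACHINE] root keys, at [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer\Run] and [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer\Run].

6) The "RunServicesOnce" subkey: loaded before the user logs on and before the other registry autostart entries. It exists at both [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\RunServicesOnce] and [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\RunServicesOnce].

7) The "RunServices" subkey: also loaded before the user logs on and before the other registry autostart entries. It exists at both [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\RunServices] and [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\RunServices].

8) The "RunOnce\Setup" subkey: its default value names programs loaded after the user logs on. It exists at both [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\RunOnce\Setup] and [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\RunOnce\Setup].

9) The "RunOnce" subkey: many autostart programs use RunOnce for their first load. It exists at both [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\RunOnce] and [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\RunOnce]. The RunOnce subkey under [HKEY_CURRENT_USER] loads its programs after the user logs on and before the other registry Run entries, whereas the RunOnce subkey under [HKEY_LOCAL_MACHINE] is processed after the operating system has handled the other registry Run subkeys and the programs in the Startup folders. Windows XP adds one more subkey, [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\RunOnceEx], which works the same way.

10) The "Run" subkey: currently the most common place for autostart programs. It exists at both [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Run] and [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Run]. The Run entries under [HKEY_CURRENT_USER] start immediately after those under [HKEY_LOCAL_MACHINE], and both are loaded before the "Startup" folder.

11) Then there are the Windows services, which have high priority and are loaded first. They live under [HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services]; every service you load is registered there.

12) The Windows shell: the string value Shell under [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\]. Its default data is Explorer.exe. A trojan may add itself here and start Explorer as one of its own parameters, so that the user notices nothing.

13) BootExecute: under [HKEY_LOCAL_MACHINE\System\ControlSet001\Control\Session Manager\] there is a multi-string value named BootExecute whose default data is "autocheck autochk *". It is used for certain automatic checks at startup. Programs listed here run before the graphical interface is up, so they have very high priority.

14) Group Policy startup programs: open Gpedit.msc and expand "User Configuration → Administrative Templates → System → Logon" to find the "Run these programs at user logon" item, where you can add programs. The corresponding values also appear in the registry under [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Group Policy Objects\本地User\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer\Run].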
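As a checklist aid, the registry locations from items 5) to 10) that exist under both root keys can be generated rather than typed by hand. The Python below is only a sketch: the names `HIVES`, `BASE`, `DUAL_HIVE_SUBKEYS` and `autostart_keys` are invented for this example, it covers only those six subkeys, and it builds path strings without reading the registry.

```python
# Illustrative only: build the full key paths for the autostart subkeys
# that the list above says exist under both root keys.
HIVES = ("HKEY_CURRENT_USER", "HKEY_LOCAL_MACHINE")
BASE = r"Software\Microsoft\Windows\CurrentVersion"
DUAL_HIVE_SUBKEYS = (
    r"Policies\Explorer\Run",  # item 5
    "RunServicesOnce",         # item 6
    "RunServices",             # item 7
    r"RunOnce\Setup",          # item 8
    "RunOnce",                 # item 9
    "Run",                     # item 10
)


def autostart_keys() -> list[str]:
    """Return every hive/subkey combination as a full registry path."""
    return [
        f"{hive}\\{BASE}\\{subkey}"
        for hive in HIVES
        for subkey in DUAL_HIVE_SUBKEYS
    ]


if __name__ == "__main__":
    for key in autostart_keys():
        print(key)
```

The printed paths can then be visited one at a time in Registry Editor when looking for unexpected entries.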

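Part 2: Shutdown problems

1. Fixing Windows XP when it will not shut down properly

1) The operating system or the motherboard does not fully support ACPI or APM.

2) Hardware other than the motherboard does not fully support ACPI or APM.

3) Bugs in hardware drivers.

4) The motherboard BIOS needs an update.

5) The system does not handle fast shutdown well (Windows 98 only).

6) Memory-resident programs have not exited before shutdown and conflict with the shutdown process.

7) Viruses.

8) Faults in the disk subsystem, such as an IDE driver that is not fully compatible with the system.

First, enter the motherboard's CMOS setup and turn off "PM control by APM" under power management. Boot into the system and shut down: the computer still restarts. Start Windows XP again, open Control Panel\Power Options and click the "Hibernate" tab. "Enable hibernation" is unchecked, so check it; this has no effect either. Go back to Power Options in Control Panel and click the "APM" tab. The system reports that this computer supports Advanced Power Management (APM) and that using APM reduces power consumption, and "Enable Advanced Power Management support" is checked. Clear that option without hesitation: using a little more power is tolerable, whereas not being able to shut down properly is a real nuisance. Then shut down, and it works.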


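2. Hibernate and Stand By

Symptom: In the "Turn off Windows" dialog on the Start menu, Stand By or Hibernate has disappeared. Why?

Solution:

1) The BIOS power-management function has been turned off; turn Power Management back on. Taking an Award BIOS as an example, press Del at power-on to enter CMOS Setup, open "Power Management Setup" and set "Power Management" to "Enable", "Min Saving", "Max Saving" or "User Define". Any of these works, but the setting must not be "Disable".

2) Hibernate is missing after Windows XP is installed. On the author's K7S5A motherboard running Windows XP, for example, the former Hibernate option had turned into "Stand By", which suggested that hibernation was not available from the shutdown menu. On closer inspection, a plain mouse click does not hibernate, but holding down the Shift key turns "Stand By" into "Hibernate", which can then be clicked. There is in fact an (S) hint in parentheses after the label; S is the first letter of Shift.

3) Not enough temporary space. Hibernation needs as much disk space as there is physical memory, and that space is allocated on the partition where Windows is installed. If the partition does not have enough room, hibernation is disabled automatically and so disappears from the menu. On Windows 98, disabling virtual memory also makes the "Stand By" option disappear.

In general, problems with automatic shutdown, Stand By and Hibernate come from incorrect or incompatible power-related settings. Adjusting Power Management in the BIOS or installing the motherboard's patches usually resolves them.


3. Win2000 takes too long to save settings at shutdown

If Windows 2000 takes too long to save settings at shutdown, try the following to see whether it solves the problem:

Run gpedit.msc and select Computer Configuration → Administrative Templates → System → Logon, then set "Maximum retries to unload and update user profile" to 1.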

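Part 3: Everyday usage problems

1. Why the displayed CD contents "prefer the old to the new"

Symptom: In Windows XP, the disc in the drive has clearly been changed, but the contents shown are still those of the previous disc.

Solution: This is a built-in feature of Windows XP. It first stores the disc contents as an image file so that you can still see them for a short time after removing the disc, but this also means that a newly inserted disc cannot be used immediately. Click Start → Control Panel → Performance and Maintenance → Administrative Tools → Computer Management. In Computer Management, select Storage → Removable Storage → Libraries, find the model of your optical drive, right-click it and choose Properties. In the dialog that opens, set the "Deferred dismount" time to 0 minutes. After you confirm, Windows XP shows a newly inserted disc at once.



2. Disk space shrinks day by day

Symptom: After Windows XP has been installed and used for a while, a lot of disk space has gone.

Solution: This is not really a fault. System Restore is an important feature of Windows XP: when Windows runs into trouble, it can restore the system to an earlier working state. Because Windows XP must record operations in order to restore them later, the disk space used to store this data keeps growing over time. To turn the feature off, right-click the My Computer icon and choose Properties to open the System Properties dialog, then on the "System Restore" tab select "Turn off System Restore on all drives". This disables System Restore and frees a large amount of valuable disk space.

Windows Me also has System Restore, but the feature is immature there and is best disabled.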



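3. Removing the user document icons from My Computer in WinXP (the same fix applies when My Computer is very slow to open on double-click)

On a freshly installed Windows XP, opening My Computer shows the "Files Stored on This Computer" group. It is not only unattractive but also seriously slows down opening My Computer. It can be removed by changing specific registry values:

Method 1 (hide)

Open Registry Editor and navigate to

HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer

Create a new binary value named NoSharedDocuments and set its data to 01 00 00 00. The change takes effect after you log off and on again. The corresponding REG file reads: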

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]

"NoSharedDocuments"=hex:01,00,00,00

Method 2 (remove completely):

Open Registry Editor and navigate to

HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Explorer\My Computer\NameSpace\DelegateFolders

Under it, find the key named {59031a47-3f72-44a7-89c5-5595fe6b30ee} and delete it.


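4. Restoring the classic search style in Windows XP

In the registry, go to HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\CabinetState and create a string value named "Use Search Asst". Set it to "no" to use the classic Win2000 interface, or to "yes" for the WinXP interface. To restore the search interface in IE, go to HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main, create a string value named "Use Search Asst", and set it the same way: "no" for the classic Win2000 interface, "yes" for the WinXP interface.

Incidentally, if the search function is missing altogether, check the language version of SHELL32.DLL in C:\WINDOWS\SYSTEM32\ {Chinese, PRC (0804); English, US (0409); Chinese, Singapore (1004); Chinese, Taiwan (0404)} and rename the directory under C:\windows\srchasst\mui\, originally 0804, 0409 or similar, to the number that matches the language version of SHELL32.DLL. That should bring the search function back. The problem usually appears on so-called VLK editions of XP.



5. Using the Windows NT/2000 logon screen

Windows XP comes with a new logon screen that makes logging on smoother and can also show users' settings. Sometimes the old Windows NT/2000 logon screen is still needed, and it can be reached only through a key combination: hold down Ctrl+Alt and press Del twice. The logon screen switches to the old logon window; click Cancel to return to the XP screen.



6. Enabling autorun for USB devices in XP

If you want USB devices to run automatically when they are plugged in, open the registry and find

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\USBSTOR]

Set the DWORD value AutoRun to 0x00000001 (1). Conversely, setting it to 0 turns autorun off.

7. Turning CD-ROM autorun on or off in XP

If you want discs to run automatically when they are inserted, open the registry and find

[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]

Set the DWORD value NoDriveTypeAutoRun to hexadecimal 95; to turn autorun off, set it to b5.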
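The two values differ in a single bit. NoDriveTypeAutoRun is a bit mask in which each set bit disables autorun for one drive type, and the CD-ROM bit is 0x20. The Python sketch below decodes a mask; the table and function names are assumptions made for this example, and the labels paraphrase the Windows drive-type constants rather than quoting any official wording.

```python
# Bit meanings of NoDriveTypeAutoRun (a set bit disables autorun for that type).
DRIVE_TYPE_BITS = {
    0x01: "unknown drive type",
    0x02: "no root directory",
    0x04: "removable drive",
    0x08: "fixed drive",
    0x10: "network drive",
    0x20: "CD-ROM drive",
    0x40: "RAM disk",
    0x80: "reserved / unspecified type",
}


def autorun_disabled_for(mask: int) -> list[str]:
    """List the drive types whose autorun is disabled by this mask."""
    return [name for bit, name in DRIVE_TYPE_BITS.items() if mask & bit]


if __name__ == "__main__":
    for mask in (0x95, 0xB5):
        print(hex(mask), autorun_disabled_for(mask))
```

With 0x95 the CD-ROM bit is clear, so discs autorun; 0xB5 is the same mask with 0x20 added, which disables autorun for CD-ROM drives.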



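8. Opening User Accounts in Control Panel produces an HTML error: "Microsoft (R) HTML Application host has encountered a problem and needs to close. We are sorry for the inconvenience." What is going on?

The known cause of this problem is Kingsoft Antivirus (金山毒霸) 2003. The fix is to update from Kingsoft's official site. Kingsoft has also published a fairly complete procedure:

1) Open a command prompt (click Start → Run and enter "cmd"; use "cmd" on Win2000 and WinXP, "command" on Win98 and WinMe).

2) At the command prompt, enter "cd c:\kav2003" (assuming the installation directory is c:\kav2003; otherwise substitute your own installation directory).

3) At the command prompt, enter "regsvr32 /u C:\kav2003\kaieplus.dll"

4) At the command prompt, enter "regsvr32 /u C:\kav2003\kascript.dll"

5) Delete the two files c:\kav2003\kaieplus.dll and C:\kav2003\kascript.dll

6) Copy the three files from the attachment into the installation directory c:\kav2003

7) At the command prompt, enter "regsvr32 C:\kav2003\kaieplus.dll"

8) Open IE, right-click the IE toolbar and tick the Kingsoft Antivirus toolbar so that it is shown.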


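9. Fixing files that cannot be deleted in Windows 2000/XP

This problem usually falls into one of the following cases:

1) The file is on an NTFS file system that uses an ACL (Access Control List), and you do not have permission to access the file you want to delete;

2) The file is in use by another program;

3) File system corruption prevents access to the file you want to delete;

4) The path to the file is too long to be accessed;

5) The file name contains illegal characters or a Windows reserved word.

The likely cause or remedy in each case:

1) Use an administrator account and reset the ACL to gain access;

2) Find the program that is using the file and close it;

3) Check the file system and repair the errors;

4) The path is too long, exceeding the 255 bytes that most versions of Windows accept (the NTFS file system itself does not have this limit);

5) Windows considers the name illegal, or the name refers to a hardware device. Common reserved words include LPT1 and CON.

The detailed solution for each case:

1) Log on with an administrator account, right-click the inaccessible file and choose Properties. Open the "Security" tab, click "Advanced", then open the "Owner" tab. In the "Change owner to" box, select the administrator account so that it is highlighted, and click "Apply" to make yourself the owner. Click OK twice to close the Properties dialog. When you open the Properties dialog again, the "Add" button on the "Security" tab has become available. Click it and, in the "Select Users or Groups" dialog, enter the account that should access the file (note the format: computername\accountname).

Click OK to return to the previous dialog, then tick the Full Control check box in the "Permissions for <account>" box and click OK to regain access. Permissions can also be assigned from the command line with the cacls command.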

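2) This commonly happens when deleting an AVI file. Windows has a preview (pre-read) mechanism, and the pre-read leaves the file marked as in use, so it cannot be deleted. There are several ways around it:

a) Close all Explorer windows and delete the file or directory with the del or rd command (recommended);

b) Delete the following registry key:

HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\{87D62D94-71B3-4b9a-9489-5FE6850DC73E}\InProcServer32. Export the key as a backup first so that it can be restored later if needed;

c) Open a DOS command window and run REGSVR32 /U SHMEDIA.DLL to unregister the pre-read feature;

d) Use the "Use Windows classic folders" view (Folder Options, under Tasks);

e) Delete the file with a third-party tool that can browse local files, such as FlashFXP or CuteFTP.

3) If you see messages like the following, pay attention to your file system: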

:\ is not accessible

The file or directory is corrupt and non-readable. The file or directory is corrupt and non-readable. The file or directory \ is corrupt and unreadable.

Please Run the Chkdsk utility.

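Remedy: check the drive with the chkdsk command and repair the damaged file system. There are many possible causes: bad sectors on the disk, hardware errors and software bugs can all lead to this problem.

4) Use 8.3 names to shorten the path, or rename some of the directories in the path to reduce its length. For example, you can temporarily rename some directories along the path, or use the 8.3 form at the command line. Suppose the file you want to delete is under the following path: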

C:\Documentations\HOWTO\2003\May\WindowsDocumentations\ForWebsites_Forum\Tips\Smallfrogs\Smallfrogs_Test_Project\YuanChuan_Articls\20030530\TheTroubleShootingAboutCannotDeleteFilesInWindows
The directory above is enough to illustrate the point, even though the path is still shorter than 255 bytes.

You could then enter:

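cd C:\Docume~1\howto\2003\may\window~1\forweb~1\tips\smallf~1\smallf~1\yuanch~1\20030530\thetro~1
As you can see, the 8.3 form saves a lot of length, so entering even a deeply nested directory with a long path is no problem. Once you are inside the directory, you can use the del command to delete whatever files you like.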
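To see how much the 8.3 form saves, the two path lengths can be compared in code. The Python below is a rough sketch under stated assumptions: `approx_short_name` and `approx_short_path` are hypothetical helpers that shorten any component longer than eight characters to its first six characters plus "~1". Real short names are assigned by the file system, and the numeric suffix depends on what else is in each directory, so `dir /x` remains the reliable way to read them.

```python
# The long example path from the text above.
LONG_PATH = (
    r"C:\Documentations\HOWTO\2003\May\WindowsDocumentations"
    r"\ForWebsites_Forum\Tips\Smallfrogs\Smallfrogs_Test_Project"
    r"\YuanChuan_Articls\20030530"
    r"\TheTroubleShootingAboutCannotDeleteFilesInWindows"
)
LIMIT = 255  # the length limit quoted in the text


def approx_short_name(component: str) -> str:
    """Crude 8.3 approximation: first six characters plus ~1 for long names."""
    if len(component) <= 8:
        return component
    return component[:6] + "~1"


def approx_short_path(path: str) -> str:
    """Apply the approximation to every component of a backslash-separated path."""
    return "\\".join(approx_short_name(part) for part in path.split("\\"))


if __name__ == "__main__":
    short = approx_short_path(LONG_PATH)
    print(len(LONG_PATH), LONG_PATH)
    print(len(short), short)
    print("within limit:", len(short) <= LIMIT)
```

The same comparison can be run on any path that really does exceed the limit, to check whether the shortened form would fit.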

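5) For files whose names contain reserved words: when a delete command is issued, Windows checks whether the file has a legal path. If the file name contains characters that Windows considers illegal, or a reserved word, the deletion fails.

There are three ways to delete such files:

a) Use Linux or another non-Windows operating system. Taking Linux/Unix as an example, the rm command can be used: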

rm -d //driveletter/path using forward slashes/filename

rm -r "//C/Program Files/BadFolder"

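b) Use a special argument of the command-line tools:

RD \\.\:\
DEL \\.\driveletter:\path\filename

Prefixing the path with \\.\ stops Windows from checking whether the file name is legal, so files with reserved words or illegal names can be deleted.

c) For files, if wildcards can be used, they offer another way:

DEL PR?.*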

DEL LPT?.*
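Before deleting, it can help to confirm that a name really is reserved. The Python sketch below checks a file name against the classic DOS device names; the set and the function name `is_reserved_name` are assumptions made for this example, and the check simply ignores anything after the first dot and any trailing spaces.

```python
# Classic DOS device names that Windows treats as reserved.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_reserved_name(filename: str) -> bool:
    """True if the part before the first dot is a reserved device name."""
    stem = filename.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED_NAMES


if __name__ == "__main__":
    for name in ("LPT1", "con.txt", "console.log", "prn.doc", "report.prn"):
        print(name, is_reserved_name(name))
```

Names flagged by the check are the ones that need the \\.\ prefix or the wildcard trick described above.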



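10. Fixing RealONE Player when it will not run properly

When RealONE Player will not run properly, the usual cause is a conflict between Real's own codecs and third-party codecs, and sometimes even uninstalling and reinstalling does not help. In that case you can try the following:

Run rlpclean.exe from the Program Files\real\realone player\setup directory, type Y in the command window that opens and press Enter. When it has finished, run the installer again.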

Wednesday, June 29, 2005

What to do when you forget the MySQL superuser password

  If MySQL is running, kill it first: killall -TERM mysqld

  Start MySQL with: bin/safe_mysqld --skip-grant-tables &

  You can now get into MySQL without a password.

  Then run:

  >use mysql

  >update user set password=password("new_pass") where user="root";

  >flush privileges;

  Kill MySQL again and start it the normal way.

Sunday, June 26, 2005

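A method for counting clicks on Google ads

  With ads such as Google's, what is displayed is controlled entirely by Google. The ordinary approach of routing link clicks through one of our own pages cannot tell us how many times the Google ads on our page were clicked, or by whom, because those pages are outside our control.
  Below is a method that can count clicks on ads like Google's.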



Click counter

value='1119342517' id="keyid">

Clicks inside the IFrame: 0

Other links on this page


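  In the code above, the click, mouse-move and similar event handlers check whether the user's click falls inside the region we care about, and then count it. If extra logging is needed, these handlers can submit to a page that we control. So as not to disturb the page display, that submission goes through a hidden IFrame; the code above shows the details.
With this method, whether the user clicks one of our own ads or a Google ad, we can implement logic such as awarding a certain number of usable points per click. (The logic can of course be more elaborate.)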

Saturday, January 29, 2005

Character Model for the World Wide Web 1.0: Fundamentals

W3C Proposed Recommendation 22 November 2004

This version:
http://www.w3.org/TR/2004/PR-charmod-20041122
Latest version:
http://www.w3.org/TR/charmod
Previous version:
http://www.w3.org/TR/2004/WD-charmod-20040225
Editors:
Martin J. Dürst, W3C
François Yergeau (Invited Expert)
Richard Ishida, W3C
Misha Wolf (until Dec 2002), Reuters Ltd.
Tex Texin (Invited Expert), XenCraft

This document is also available in these non-normative formats: XHTML with visible change markup and zip file.

Copyright © 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Abstract

This Architectural Specification provides authors of specifications, software developers, and content developers with a common reference for interoperable text manipulation on the World Wide Web, building on the Universal Character Set, defined jointly by the Unicode Standard and ISO/IEC 10646. Topics addressed include use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, and string indexing.

For normalization and string identity matching, see the companion document Character Model for the World Wide Web 1.0: Normalization [CharNorm]. For resource identifiers, see the companion document Character Model for the World Wide Web 1.0: Resource Identifiers [CharIRI].
Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Publication as a Proposed Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than "work in progress."

This is the 22 November 2004 Proposed Recommendation of “Character Model for the World Wide Web 1.0: Fundamentals”. Publication as a Proposed Recommendation indicates that W3C seeks endorsement of the stable technical report. The W3C Membership and other interested parties are invited to review the document and send comments to www-i18n-comments@w3.org (public archive) through 20 December 2004. Advisory Committee Representatives should consult their WBS questionnaires. Note that substantive technical comments were expected during the Last Call review period that ended 19 March 2004. Changes since the Last Call draft are documented in a version with change markup. The Last Call dispositions are available in a public version and a Members only version. There is also an implementation report.

This document has been developed as part of the W3C Internationalization Activity by the W3C Internationalization Working Group (I18N WG) (Members only), with the help of the Internationalization Interest Group.

Patent disclosures relevant to this specification may be found on the Working Group's patent disclosure page. This document has been produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.
Table of Contents

1 Introduction
1.1 Goals and Scope
1.2 Background
1.3 Terminology and Notation
2 Conformance
3 Perceptions of Characters
3.1 Introduction
3.2 Units of aural rendering
3.3 Units of visual rendering
3.3.1 Visual Rendering and Logical Order
3.4 Units of input
3.5 Units of collation
3.6 Units of storage
3.7 Summary
4 Digital Encoding of Characters
4.1 Character Encoding
4.2 Transcoding
4.3 Reference Processing Model
4.4 Choice and Identification of Character Encodings
4.4.1 Mandating a unique character encoding
4.4.2 Character encoding identification
4.5 Private use code points
4.6 Character Escaping
5 Compatibility and Formatting Characters
6 Strings
6.1 String concepts
6.2 String indexing
7 Referencing the Unicode Standard and ISO/IEC 10646
Appendices

A References
A.1 Normative References
A.2 Other References
B Examples of Characters, Keystrokes and Glyphs (Non-Normative)
C Example text (Non-Normative)
D List of conformance criteria (Non-Normative)
E Acknowledgements (Non-Normative)
1 Introduction
1.1 Goals and Scope

The goal of the Character Model for the World Wide Web is to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.

The main target audience of this specification is W3C specification developers. This specification and parts of it can be referenced from other W3C specifications. It defines conformance criteria for W3C specifications as well as other specifications.

Other audiences of this specification include software developers, content developers, and authors of specifications outside the W3C. Software developers and content developers implement and use W3C specifications. This specification defines some conformance criteria for implementations (software) and content that implement and use W3C specifications. It also helps software developers and content developers to understand the character-related provisions in W3C specifications.

The character model described in this specification provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a more international Web.

Topics addressed in this part of the Character Model for the World Wide Web include use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, and string indexing.

Other parts of the Character Model address normalization and string identity matching ([CharNorm]) and Internationalized Resource Identifiers (IRI) conventions ([CharIRI]).

Topics as yet not addressed or barely touched include fuzzy matching, and language tagging. Some of these topics may be addressed in a future version of this specification.

At the core of the model is the Universal Character Set (UCS), defined jointly by the Unicode Standard [Unicode] and ISO/IEC 10646 [ISO/IEC 10646]. In this document, Unicode is used as a synonym for the Universal Character Set. The model will allow Web documents authored in the world's scripts (and on different platforms) to be exchanged, read, and searched by Web users around the world.
1.2 Background

This section provides some historical background on the topics addressed in this specification.

Starting with Internationalization of the Hypertext Markup Language [RFC 2070], the Web community has recognized the need for a character model for the World Wide Web. The first step towards building this model was the adoption of Unicode as the document character set for HTML.

The choice of Unicode was motivated by the fact that Unicode:

* is the only universal character repertoire available,
* provides a way of referencing characters independent of the encoding of the text,
* is being updated/completed carefully,
* is widely accepted and implemented by industry.

W3C adopted Unicode as the document character set for HTML in [HTML 4.0]. The same approach was later used for specifications such as XML 1.0 [XML 1.0] and CSS2 [CSS2]. W3C specifications and applications now use Unicode as the common reference character set.

When data transfer on the Web remained mostly unidirectional (from server to browser), and where the main purpose was to render documents, the use of Unicode without specifying additional details was sufficient. However, the Web has grown:

* Data transfers among servers, proxies, and clients, in all directions, have increased.
* Characters outside the US-ASCII [ISO/IEC 646][MIME-charset] repertoire are being used in more and more places.
* Data transfers between different protocol/format elements (such as element/attribute names, URI components, and textual content) have increased.
* More and more APIs are defined, not just protocols and formats.

In short, the Web may be seen as a single, very large application (see [Nicol]), rather than as a collection of small independent applications.

While these developments strengthen the requirement that Unicode be the basis of a character model for the Web, they also create the need for additional specifications on the application of Unicode to the Web. Some aspects of Unicode that require additional specification for the Web include:

* Choice of Unicode encoding forms (UTF-8, UTF-16, UTF-32).
* Counting characters, measuring string length in the presence of variable-length character encodings and combining characters.
* Duplicate encodings of characters (e.g. precomposed vs decomposed).
* Use of control codes for various purposes (e.g. bidirectionality control, symmetric swapping, etc.).

It should be noted that such aspects also exist in various encodings, and in many cases have been inherited by Unicode in one way or another from these encodings.
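The first three aspects listed above are easy to observe in practice. The following Python sketch is an illustration added here, not part of the specification; the function name `describe` and the sample strings are assumptions. It uses only the standard library to show that the same visible letter can have different lengths depending on the encoding form and on whether it is stored precomposed or decomposed.

```python
import unicodedata

PRECOMPOSED = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE as one code point
DECOMPOSED = "e\u0301"   # letter e followed by COMBINING ACUTE ACCENT


def describe(s: str) -> dict:
    """Measure one string in code points and in UTF-8/16/32 code units."""
    return {
        "code_points": len(s),
        "utf8_units": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf32_units": len(s.encode("utf-32-le")) // 4,
    }


if __name__ == "__main__":
    print(describe(PRECOMPOSED))
    print(describe(DECOMPOSED))
    # The two spellings are different strings until they are normalized.
    print(PRECOMPOSED == DECOMPOSED)
    print(unicodedata.normalize("NFC", DECOMPOSED) == PRECOMPOSED)
```

Any specification that speaks of "string length" therefore needs to say which of these units it is counting.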

The remainder of this specification presents additional requirements to ensure an interoperable character model for the Web, taking into account earlier work (from W3C, ISO and IETF).

The first few chapters of the Unicode Standard [Unicode] provide very useful background reading. The policies adopted by the IETF on the use of character sets on the Internet are documented in [RFC 2277].
1.3 Terminology and Notation

Unicode code points are denoted as U+hhhh, where "hhhh" is a sequence of at least four, and at most six hexadecimal digits.
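The notation is simple to produce and to parse. The following Python sketch is an illustration, not part of the specification; the function names `u_notation` and `parse_u_notation` are assumptions. It formats a character as U+hhhh with at least four and at most six hexadecimal digits, and reads the notation back.

```python
import re

# Four to six hexadecimal digits after the "U+" prefix.
_U_NOTATION = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def u_notation(ch: str) -> str:
    """Format a single character as U+hhhh, padded to at least four digits."""
    return f"U+{ord(ch):04X}"


def parse_u_notation(text: str) -> str:
    """Return the character named by a U+hhhh string, or raise ValueError."""
    match = _U_NOTATION.fullmatch(text)
    if match is None:
        raise ValueError(f"not in U+hhhh form: {text!r}")
    code_point = int(match.group(1), 16)
    if code_point > 0x10FFFF:
        raise ValueError(f"beyond the Unicode code space: {text!r}")
    return chr(code_point)


if __name__ == "__main__":
    print(u_notation("A"))
    print(u_notation("\U0001F600"))
    print(parse_u_notation("U+00E9"))
```

The upper bound check reflects that the largest code point, 10FFFF, already fits in six digits.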

Text has been used for examples to allow them to be cut and pasted by the reader. Characters used will not appear as intended unless you have the appropriate font, but care has been taken to annotate the examples so that they remain understandable even if you do not. In some cases it is important to see the result of an example, so images have been used; by clicking on the image it is possible to link to the text for these examples in C Example text.
2 Conformance

This section explains the conditions that specifications, software, and Web content have to fulfill to be able to claim conformance to this specification.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC 2119].

NOTE: RFC 2119 makes it clear that requirements that use SHOULD are not optional and must be complied with unless there are specific reasons not to: "This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course."

This specification defines conformance criteria for specifications, for software, and for Web content. To aid the reader, all conformance criteria are preceded by '[X]' where 'X' is one of 'S' for specifications, 'I' for software implementations, and 'C' for Web content. These markers indicate the relevance of the conformance criteria and allow the reader to quickly locate relevant conformance criteria by searching through this document.
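Because the markers are regular, the criteria can be located mechanically. The Python sketch below is an illustration, not part of the specification: it assumes each criterion sits on one line and begins with an identifier of the form C plus three digits, as the criteria later in this document do, and the names `CRITERION` and `parse_criterion` are invented for the example.

```python
import re

# An identifier such as C001, one or more [S]/[I]/[C] markers, then the text.
CRITERION = re.compile(r"^(C\d{3})\s+((?:\[[SIC]\]\s*)+)(.*)$")


def parse_criterion(line: str):
    """Return (identifier, audiences, text), or None if the line is not a criterion."""
    match = CRITERION.match(line)
    if match is None:
        return None
    identifier, markers, text = match.groups()
    audiences = re.findall(r"\[([SIC])\]", markers)
    return identifier, audiences, text


if __name__ == "__main__":
    sample = ("C001 [S] [I] [C] Specifications, software and content MUST NOT "
              "require or depend on a one-to-one correspondence between "
              "characters and the sounds of a language.")
    print(parse_criterion(sample))
```

Filtering the parsed results on, say, "I" in the audience list then yields only the criteria that apply to software implementations.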

A specification conforms to this document if it:

1. does not violate any conformance criteria preceded by [S],
2. documents the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED,
3. where applicable, requires implementations conforming to the specification to conform to this document,
4. where applicable, requires content conforming to the specification to conform to this document.

An implementation (software) conforms to this document if it does not violate any conformance criteria preceded by [I].

Content conforms to this document if it does not violate any conformance criteria preceded by [C].

NOTE: Requirements placed on specifications might indirectly cause requirements to be placed on implementations or content that claim to conform to those specifications. Likewise, requirements placed on content may affect implementations designed to produce such content, and so on.

Where this specification places requirements on processing, it is to be understood as a way to specify the desired external behavior. Implementations can use other means of achieving the same results, as long as observable behavior is not affected.
3 Perceptions of Characters
3.1 Introduction

The glossary entry in the Unicode Standard [Unicode 4.0] gives:

"Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape ..."

The word 'character' is used in many contexts, with different meanings. Human cultures have radically differing writing systems, leading to radically differing concepts of a character. Such wide variation in end user experience can, and often does, result in misunderstanding. This variation is sometimes mistakenly seen as the consequence of imperfect technology. Instead, it derives from the great flexibility and creativity of the human mind and the long tradition of writing as an important part of the human cultural heritage. The alphabetic approach used by scripts such as Latin, Cyrillic and Greek is only one of several possibilities.

EXAMPLE: A character in Japanese hiragana and katakana scripts corresponds to a syllable (usually a combination of consonant plus vowel).

EXAMPLE: Korean Hangul combines symbols for individual sounds of the language into square blocks, each of which represents a syllable. Depending on the user and the application, either the individual symbols or the syllabic clusters can be considered to be characters.

EXAMPLE: In Indic scripts each consonant letter carries an inherent vowel that is eliminated or replaced using semi-regular or irregular ways to combine consonants and vowels into clusters. Depending on the user and the application, either individual consonants or vowels, or the consonant or consonant-vowel clusters can be perceived as characters.

EXAMPLE: In Arabic and Hebrew vowel sounds are typically not written at all. When they are written they are indicated by the use of combining marks placed above and below the consonantal letters.

The developers of specifications, and the developers of software based on those specifications, are likely to be more familiar with usages of the term 'character' they have experienced and less familiar with the wide variety of usages in an international context. Furthermore, within a computing context, characters are often confused with related concepts, resulting in incomplete or inappropriate specifications and software.

This section examines some of these contexts, meanings and confusions.
3.2 Units of aural rendering

In some scripts, characters have a close relationship to phonemes (a phoneme is a minimally distinct sound in the context of a particular spoken language), while in others they are closely related to meanings. Even when characters (loosely) correspond to phonemes, this relationship may not be simple, and there is rarely a one-to-one correspondence between character and phoneme.

EXAMPLE: In the English sentence, "They were too close to the door to close it." the same character 's' is used to represent both /s/ and /z/ phonemes.

EXAMPLE: In the English language the phoneme /k/ of "cool" is like the phoneme /k/ of "keel".

EXAMPLE: In many scripts a single character may represent a sequence of phonemes, such as the syllabic characters of Japanese hiragana.

EXAMPLE: In many writing systems a sequence of characters may represent a single phoneme, for example 'th' and 'ng' in "thing".

C001 [S] [I] [C] Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language.
3.3 Units of visual rendering

Visual rendering introduces the notion of a glyph. Glyphs are defined by ISO/IEC 9541-1 [ISO/IEC 9541-1] as "a recognizable abstract graphic symbol which is independent of a specific design". There is not a one-to-one correspondence between characters and glyphs:

* A single character can be represented by multiple glyphs (each glyph is then part of the representation of that character). These glyphs may be physically separated from one another.
* A single glyph may represent a sequence of characters (this is the case with ligatures, among others).
* A character may be rendered with very different glyphs depending on the context.
* A single glyph may represent different characters (e.g. capital Latin A, capital Greek A and capital Cyrillic A).

A set of glyphs makes up a font. Glyphs can be construed as the basic units of organization of the visual rendering of text, just as characters are the basic unit of organization of encoded text.

C002 [S] [I] [C] Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text.

See the appendix B Examples of Characters, Keystrokes and Glyphs for examples of the complexities of character to glyph mapping.
3.3.1 Visual Rendering and Logical Order

Some scripts, in particular Arabic and Hebrew, are written from right to left. Text including characters from these scripts can run in both directions and is therefore called bidirectional text. The Unicode Standard [Unicode] requires that characters be stored and interchanged in logical order, i.e. roughly corresponding to the order in which text is typed in via the keyboard or spoken (for a more detailed definition see [Unicode 4.0], Section 2.2). Logical ordering is important to ensure interoperability of data, and also benefits accessibility, searching, and collation.

C003 [S] [I] [C] Protocols, data formats and APIs MUST store, interchange or process text data in logical order.
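A small illustration of logical order, added here and not part of the specification: in the Python sketch below, a Hebrew word followed by a year is stored in the order in which it would be typed. The sample string and the function name `bidi_classes` are assumptions. The direction of each character is a property looked up from the Unicode Character Database, not something expressed by reordering the stored text.

```python
import unicodedata

# Four Hebrew letters (shin, lamed, vav, final mem), a space and a year,
# stored in logical order: index 0 is the first letter typed, even though
# a renderer draws it at the right-hand end of the word.
TEXT = "\u05e9\u05dc\u05d5\u05dd 2004"


def bidi_classes(s: str) -> list[str]:
    """Return the Unicode bidirectional class of each character in storage order."""
    return [unicodedata.bidirectional(ch) for ch in s]


if __name__ == "__main__":
    for ch, cls in zip(TEXT, bidi_classes(TEXT)):
        print(f"U+{ord(ch):04X}", cls)
```

Reordering for display is left to the rendering layer, which applies the bidirectional algorithm to these classes; the stored sequence itself never changes.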

In the presence of bidirectional text, two possible selection modes can be considered. The first is logical selection mode, which selects all the characters logically located between the end-points of the user's mouse gesture. Here the user selects from between the first and second letters of the second word to the middle of the number. Logical selection looks like this:
[Figure: Logical selection resulting in discontiguous visual ranges. The example string contains two Arabic words followed by a year number. In logical order, the selection from the middle of the second word to the middle of the year number covers a single block of contiguous characters; in the visual display, the same selection appears as two separate highlighted character ranges.]

It is a consequence of the bidirectionality of the text that a single, continuous logical selection in memory results in a discontinuous selection appearing on the screen. This discontinuity makes some users prefer a visual selection mode, which selects all the characters visually located between the end-points of the user's mouse gesture. With the same mouse gesture as before, we now obtain:
[Figure: Visual selection resulting in discontiguous logical ranges. The example string contains two Arabic words followed by a year number. In the visual display, the selection from the middle of the second word to the middle of the year number appears as a single highlighted block of contiguous characters; in logical order, the same selection covers two separate blocks of characters.]

In visual selection mode, as seen in the example above, a single visual selection range may result in two or more logical ranges, which may have to be accommodated by protocols, APIs and implementations. Other, related aspects of a user interface for bidirectional text include caret movement, behavior of backspace/delete keys, and so on.

Currently, most implementations provide logical selection, while only very few provide visual selection.

C075 [I] Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage.

C004 [S] Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs.
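
As a rough illustration of why C004 asks for discontiguous logical selections, the following Python sketch converts one visual selection into the logical ranges it covers. The `visual_to_logical` list is a hypothetical, hand-written mapping from screen positions to storage positions (a real implementation would obtain it from a bidirectional layout engine), and the function name is an assumption made for this example only.

```python
def logical_ranges(visual_to_logical, vis_start, vis_end):
    """Return the logical [start, end) ranges covered by a visual selection."""
    # Collect the logical indices of every selected screen position.
    selected = sorted(visual_to_logical[v] for v in range(vis_start, vis_end))
    ranges = []
    for index in selected:
        if ranges and ranges[-1][1] == index:
            ranges[-1][1] = index + 1          # extend the current range
        else:
            ranges.append([index, index + 1])  # start a new range
    return [tuple(r) for r in ranges]


# Hypothetical layout: six right-to-left letters (logical 0-5) followed by a
# four-digit number (logical 6-9), displayed in a right-to-left paragraph.
VISUAL_TO_LOGICAL = [6, 7, 8, 9, 5, 4, 3, 2, 1, 0]

# Screen positions 2 to 6 cover part of the number and part of the letters.
print(logical_ranges(VISUAL_TO_LOGICAL, 2, 7))  # [(3, 6), (8, 10)]
```

A single contiguous run on screen yields two logical ranges here, which is the case a protocol or API limited to one start/end pair cannot represent.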
3.4 Units of input

In keyboard input, it is not always the case that keystrokes and input characters correspond one-to-one. A limited number of keys can fit on a keyboard. Some keyboards will generate multiple characters from a single keypress. In other cases ('dead keys') a key will generate no characters, but affect the results of subsequent keypresses. Many writing systems have far too many characters to fit on a keyboard and must rely on more complex input methods, which transform keystroke sequences into character sequences. Other languages may make it necessary to input some characters with special modifier keys. See B Examples of Characters, Keystrokes and Glyphs for examples of non-trivial input.

C005 [S] [I] Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world.
3.5 Units of collation

String comparison as used in sorting and searching is based on units which do not in general have a one-to-one relationship to encoded characters. Such string comparison can aggregate a character sequence into a single collation unit with its own position in the sorting order, can separate a single character into multiple collation units, and can distinguish various aspects of a character (case, presence of diacritics, etc.) to be sorted separately (multi-level sorting).

In addition, a certain amount of pre-processing may also be required, and in some languages (such as Japanese and Arabic) sort order may be governed by higher order factors such as phonetics or word roots. Collation methods may also vary by application.

EXAMPLE: In traditional Spanish sorting, the character sequences 'ch' and 'll' are treated as atomic collation units. Although Spanish sorting, and to some extent Spanish everyday use, treat 'ch' as a single unit, current digital encodings treat it as two characters, and keyboards do the same (the user types 'c', then 'h').

EXAMPLE: In some languages, the letter 'æ' is sorted as two consecutive collation units: 'a' and 'e'.

EXAMPLE: The sorting of text written in a bicameral script (i.e. a script which has distinct upper and lower case letters) is usually required to ignore case differences in a first pass; case is then used to break ties in a later pass.
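
A minimal Python sketch of such two-pass (multi-level) sorting follows. It is only an approximation under stated assumptions: `str.casefold` stands in for the case-insensitive first level, and plain code point order breaks ties at the second level, which is not what a full collation algorithm would do.

```python
words = ["b", "A", "a", "B"]

# Plain code point order: all upper case letters sort before lower case ones.
code_point_order = sorted(words)

# Two-level key: compare without case first, then use the original
# string to break ties between entries that differ only in case.
two_level_order = sorted(words, key=lambda w: (w.casefold(), w))

print(code_point_order)  # ['A', 'B', 'a', 'b']
print(two_level_order)   # ['A', 'a', 'B', 'b']
```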

EXAMPLE: Treatment of accented letters in sorting is dependent on the script or language in question. The letter 'ö' is treated as a modified 'o' in French, but as a letter completely independent from 'o' (and sorting after 'z') in Swedish. In German certain applications treat the letter 'ö' as if it were the sequence 'oe'.

EXAMPLE: In Thai the sequence 'ไก' (U+0E44 U+0E01) must be sorted as if it were written 'กไ' (U+0E01 U+0E44). Reordering is typically done during an initial pre-processing stage.

EXAMPLE: German dictionaries typically sort 'ä', 'ö' and 'ü' together with 'a', 'o' and 'u' respectively. On the other hand, German telephone books typically sort 'ä', 'ö' and 'ü' as if they were spelled 'ae', 'oe' and 'ue'. Here the application is affecting the collation algorithm used.

C006 [S] [I] Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.

C007 [S] [I] Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user.

C066 [S] [I] Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering.

C008 [S] [I] Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode.

Note that this requires, as a minimum, that a collation algorithm does not break down if the text contains Unicode characters that are not covered by its rules. It does not necessarily require full implementation of complex algorithms for all scripts. One useful way of satisfying the requirement is to apply a default collation algorithm that covers all Unicode characters.

ISO/IEC 14651 [ISO/IEC 14651] and Unicode Technical Report #10, the Unicode Collation Algorithm [UTR #10], describe a model for collation that accommodates most languages and provide a default collation order. They are appropriate references for collation and provide implementation guidelines. The default collation order can be used in conjunction with rules tailored for a particular locale to ensure a predictable ordering and comparison of strings, whatever characters they include.
3.6 Units of storage

Computer storage and communication rely on units of physical storage and information interchange, such as bits and bytes (8-bit units, also called octets). A frequent error in specifications and implementations is the equating of characters with units of physical storage. The mapping between characters and such units of storage is actually quite complex, and is discussed in the next section, 4.1 Character Encoding.

C009 [S] [I] Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage.
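
The lack of a one-to-one relationship can be observed directly. The short Python sketch below counts the same four-character string in code points and in bytes under three Unicode encodings; the sample string is an arbitrary choice for illustration.

```python
# 'a', 'é' (U+00E9), '€' (U+20AC) and MUSICAL SYMBOL G CLEF (U+1D11E)
text = "a\u00e9\u20ac\U0001D11E"

code_points = len(text)                     # Python strings count code points
utf8_bytes = len(text.encode("utf-8"))      # 1 + 2 + 3 + 4 bytes
utf16_bytes = len(text.encode("utf-16-le")) # 2 + 2 + 2 + 4 bytes
utf32_bytes = len(text.encode("utf-32-le")) # 4 bytes per code point

print(code_points, utf8_bytes, utf16_bytes, utf32_bytes)  # 4 10 10 16
```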
3.7 Summary

The term character is used differently in a variety of contexts and often leads to confusion when used outside of these contexts. In the context of the digital representations of text, a character can be defined as a small logical unit of text. Text is then defined as sequences of characters. While such an informal definition is sufficient to create or capture a common understanding in many cases, it is also sufficiently open to create misunderstandings as soon as details start to matter. In order to write effective specifications, protocol implementations, and software for end users, it is very important to understand that these misunderstandings can occur.

This section, 3 Perceptions of Characters, has discussed terms for units that do not necessarily overlap with the term 'character', such as phoneme, glyph, and collation unit. The next section, 4.1 Character Encoding, lists terms that should be used rather than 'character' to precisely define units of encoding (code point, code unit, and byte).

C010 [S] When specifications use the term 'character' the specifications MUST define which meaning they intend.

C067 [S] Specifications SHOULD use specific terms, when available, instead of the general term 'character'.
4 Digital Encoding of Characters
4.1 Character Encoding

On the WWW, as in any computing environment, characters must be encoded to be of any use. To achieve text encoding, a large variety of character encodings have been devised. Character encodings can loosely be explained as mappings between the character sequences that users manipulate and the sequences of bits that computers manipulate.

Given the complexity of text encoding and the large variety of mechanisms for character encoding invented throughout the computer age, a more formal description of the encoding process is useful. The process of defining a text encoding can be described as follows (see Unicode Technical Report #17: Character Encoding Model [UTR #17] for a more detailed description):

1. A set of characters to be encoded is identified. The characters are pragmatically chosen to express text and to efficiently allow various text processes in one or more target languages. They may not correspond precisely to what users perceive as letters and other characters. The set of characters is called a repertoire.

2. Each character in the repertoire is then associated with a (mathematical, abstract) non-negative integer, the code point (also known as a character number or code position). The result, a mapping from the repertoire to the set of non-negative integers, is called a coded character set (CCS).

3. To enable use in computers, a suitable base datatype is identified (such as a byte, a 16-bit unit of storage or other) and a character encoding form (CEF) is used, which encodes the abstract integers of a coded character set (CCS) into sequences of the code units of the base datatype. The character encoding form can be extremely simple (for instance, one which encodes the integers of the CCS into the natural representation of integers of the chosen datatype of the computing platform) or arbitrarily complex (a variable number of code units, where the value of each unit is a non-trivial function of the encoded integer).

4. To enable transmission or storage using byte-oriented devices, a serialization scheme or character encoding scheme (CES) is next used. A character encoding scheme is a mapping of the code units of a character encoding form (CEF) into well-defined sequences of bytes, taking into account the necessary specification of byte-order for multi-byte base datatypes and including in some cases switching schemes between the code units of multiple character encoding schemes (an example is ISO 2022). A character encoding scheme, together with the coded character sets it is used with, is called a character encoding, and is identified by a unique identifier, such as an IANA charset identifier. Given a sequence of bytes representing text and a character encoding identified by a charset identifier, one can in principle unambiguously recover the sequence of characters of the text.

NOTE: See 4.4.2 Character encoding identification for a discussion of the term 'charset' and further details on character encodings.

NOTE: The term 'character encoding' is somewhat ambiguous, as it is sometimes used to describe the actual process of encoding characters and sometimes to denote a particular way to perform that process (as in "this file is in the X character encoding"). Context normally allows the distinction of those uses, once one is aware of the ambiguity.

NOTE: Given a sequence of characters, a given 'character encoding' may not always produce the same sequence of bytes. In particular for encodings based on ISO 2022, there may be choices available during the encoding process.

In very simple cases, the whole encoding process can be collapsed to a single step, a trivial one-to-one mapping from characters to bytes; this is the case, for instance, for US-ASCII [ISO/IEC 646] and ISO-8859-1.

Text is said to be in a Unicode encoding form if it is encoded in UTF-8, UTF-16 or UTF-32.
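
The steps of the encoding model described in this section can be traced for a single character in Python. This is only a sketch: the helper `utf16_code_units` is a name invented for the example, and Python's built-in codecs stand in for the character encoding form and character encoding scheme.

```python
import struct


def utf16_code_units(text):
    """Character encoding form: code points -> 16-bit code units."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


char = "\U0001D11E"                # a character from the repertoire (G CLEF)
code_point = ord(char)             # coded character set: the code point
units = utf16_code_units(char)     # encoding form: two UTF-16 code units
big_endian = char.encode("utf-16-be")     # encoding scheme: bytes, big-endian
little_endian = char.encode("utf-16-le")  # encoding scheme: bytes, little-endian

print(hex(code_point))             # 0x1d11e
print([hex(u) for u in units])     # ['0xd834', '0xdd1e']
print(big_endian.hex())            # d834dd1e
print(little_endian.hex())         # 34d81edd
```

The same code units serialize to different byte sequences depending on the byte order, which is why a byte stream needs a character encoding scheme and not just an encoding form.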
4.2 Transcoding

Transcoding is the process of converting text from one character encoding to another. Transcoders work only at the level of character encoding and do not parse the text; consequently, they do not deal with character escapes such as numeric character references (see 4.6 Character Escaping) and do not adjust embedded character encoding information (for instance in an XML declaration or in an HTML meta element).

NOTE: Transcoding may involve one-to-one, many-to-one, one-to-many or many-to-many mappings. In addition, the storage order of characters varies between encodings: some, such as the Unicode encoding forms, prescribe logical ordering, while others use visual ordering; among encodings that have separate diacritics, some prescribe that they be placed before the base character, some after. Because of these differences in sequencing characters, transcoding may involve reordering: thus XYZ may map to yxz.

EXAMPLE: This first example shows the transcoding of the Russian word 'Русский' meaning 'Russian' (language), from the UTF-16 encoding of Unicode to the ISO 8859-5 encoding:
Code unit (UTF-16) | Char. name (abbreviated) | Code unit (ISO 8859-5) | Char. name (abbreviated)
0420 | CAPITAL ER | C0 | CAPITAL ER
0443 | SMALL U | E3 | SMALL U
0441 | SMALL ES | E1 | SMALL ES
0441 | SMALL ES | E1 | SMALL ES
043A | SMALL KA | DA | SMALL KA
0438 | SMALL I | D8 | SMALL I
0439 | SMALL SHORT I | D9 | SMALL SHORT I
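
The Russian example can be reproduced with Python's built-in codecs, as a sketch of the simple one-to-one case. The variable names are illustrative; the codec names "utf-16-be" and "iso-8859-5" are those of the Python standard library.

```python
word = "Русский"

# Source data: the word in UTF-16 (big-endian, without a byte order mark).
utf16_bytes = word.encode("utf-16-be")

# Transcoding: decode from the source encoding, encode into the target one.
iso_bytes = utf16_bytes.decode("utf-16-be").encode("iso-8859-5")

utf16_units = " ".join(
    utf16_bytes[i:i + 2].hex().upper() for i in range(0, len(utf16_bytes), 2)
)
iso_units = " ".join(f"{b:02X}" for b in iso_bytes)

print(utf16_units)  # 0420 0443 0441 0441 043A 0438 0439
print(iso_units)    # C0 E3 E1 E1 DA D8 D9
```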

EXAMPLE: This second example shows a much more complex case, where the Arabic word 'السلام', meaning 'peace', is transcoded from the visually-ordered, contextualized encoding IBM CP864 to the UTF-16 encoding of Unicode:
Code unit (IBM CP864) | Char. name (abbreviated) | Code unit (UTF-16) | Char. name (abbreviated)
EF | FINAL MEEM | 0627 | ALEF
9E | MEDIAN LAM-ALEF | 0644 | LAM
D3 | MEDIAN SEEN | 0633 | SEEN
E4 | MEDIAN LAM | 0644 | LAM
C7 | INITIAL ALEF | 0627 | ALEF
- | - | 0645 | MEEM

Notice that the order of the characters has been reversed, that the single LAM-ALEF in CP864 has been converted to a LAM ALEF sequence in UTF-16, and that the contextual variants (initial, median or final) in the source encoding have been converted to generic characters in the target encoding.
4.3 Reference Processing Model

Many Internet protocols and data formats, most notably the very important Web formats HTML, CSS and XML, are based on text. In those formats, everything is text but the relevant specifications impose a structure on the text, giving meaning to certain constructs so as to obtain functionality in addition to that provided by plain text (text that is not in the context of markup or a programming language). HTML and XML are markup languages, defining documents entirely composed of text but with conventions allowing the separation of this text into markup and character data. Citing from the XML 1.0 specification [XML 1.0], section 2.4:

"Text consists of intermingled character data and markup. [...] All text that is not markup constitutes the character data of the document."

For the purposes of this section, the important aspect is that everything is text, that is, a sequence of characters.

A textual data object is a whole text protocol message or a whole text document, or a part of it that is treated separately for purposes of external storage and retrieval. Examples include external parsed entities in XML and textual MIME entity bodies [MIME-entity].

C013 [S] [C] Textual data objects defined by protocol or format specifications MUST be in a single character encoding.

Note that this does not imply that character set switching schemes such as ISO 2022 cannot be used, since such schemes perform character set switching within a single character encoding.

Since its early days, the Web has seen the development of a Reference Processing Model, first described for HTML in RFC 2070 [RFC 2070]. This model was later embraced by XML and CSS. It is applicable to any data format or protocol that is text-based as described above. The essence of the Reference Processing Model is the use of Unicode as a common reference. Use of the Reference Processing Model by a specification does not, however, require that implementations actually use Unicode. The requirement is only that the implementations behave as if the processing took place as described by the Model. Also, while this document uses the term Reference Processing Model and describes its properties in terms of processing, the model also applies to specifications that do not explicitly define a processing model.

C014 [S] All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model, namely:

1. Specifications MUST define text in terms of Unicode characters, not bytes or glyphs.

2. For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form.

3. Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows:

   * The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form.

   * All processing MUST take place on this sequence of Unicode characters.

   * If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification.

4. If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects.
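
The decode, process, encode sequence required by C014 can be sketched as follows. The function `process_text` and its parameters are hypothetical names for this illustration; the point is only that the transformation sees Unicode characters, never the bytes of the incoming or outgoing encoding.

```python
def process_text(data, source_encoding, target_encoding, transform):
    """Apply `transform` to text as if it had arrived as Unicode characters."""
    text = data.decode(source_encoding)    # interpret bytes as Unicode characters
    result = transform(text)               # all processing happens on characters
    return result.encode(target_encoding)  # encode only on output


# ISO-8859-1 input, upper-cased, written out as UTF-8.
output = process_text(b"caf\xe9", "iso-8859-1", "utf-8", str.upper)
print(output)  # b'CAF\xc3\x89'
```

Because `str.upper` operates on characters, the result is the same whichever admissible encoding the input used.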

NOTE: All specifications which define applications of the XML 1.0 specification [XML 1.0] automatically inherit this Reference Processing Model. XML is entirely defined in terms of Unicode characters and requires the UTF-8 and UTF-16 character encodings while allowing any other character encoding for parsed entities.

NOTE: When specifications choose to allow character encodings other than Unicode encoding forms, implementers should be aware that the correspondence between the characters of such encodings and Unicode characters may in practice depend on the software used for transcoding. See the Japanese XML Profile [XML Japanese Profile] for examples of such inconsistencies.

C070 [S] Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.

C077 [S] Specifications MUST NOT allow code points above U+10FFFF.

Unicode contains some code points for internal use (such as noncharacters) or special functions (such as surrogate code points).

C079 [S] Specifications SHOULD NOT allow the use of code points reserved by Unicode for internal use.

C078 [S] Specifications MUST NOT allow the use of surrogate code points.
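
A possible way to check code points against C077, C078 and C079 is sketched below in Python. The function names are assumptions for this example, and "internal use" is interpreted here as the Unicode noncharacters only, which is a simplification.

```python
def is_surrogate(cp):
    """Surrogate code points, U+D800 to U+DFFF (C078: MUST NOT be allowed)."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """Noncharacters: U+FDD0 to U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_allowed(cp):
    """True if cp is in range (C077) and is neither a surrogate nor a noncharacter."""
    in_range = 0 <= cp <= 0x10FFFF
    # The noncharacter test reflects C079, which is a SHOULD NOT, not a MUST NOT.
    return in_range and not is_surrogate(cp) and not is_noncharacter(cp)


print(is_allowed(0x1D11E))   # True: a supplementary-plane character
print(is_allowed(0xD800))    # False: surrogate code point
print(is_allowed(0xFFFE))    # False: noncharacter
print(is_allowed(0x110000))  # False: above U+10FFFF
```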

Excluding code points without good reason conflicts with the W3C goal of universal accessibility. Excluding code points would prevent some scripts from being used which may be important to a user community or communities. For example, without strong reasons to do so, decisions to exclude code points above the Basic Multilingual Plane or to limit code points to the US-ASCII or Latin-1 repertoire are inappropriate. Also, please note that the Unicode Standard requires software to not corrupt any code points.

Other examples of legitimate and non-arbitrary reasons to exclude characters can be seen in Unicode in XML and other Markup Languages [UXML], where the use of certain characters is discouraged for reasons such as:

* They are deprecated in the Unicode Standard.

* They cannot be supported without additional data.

* They are better handled by markup.

* They conflict with equivalent markup.

4.4 Choice and Identification of Character Encodings

Because encoded text cannot be interpreted and processed without knowing the encoding, it is vitally important that the character encoding (see 4.1 Character Encoding) is known at all times and places where text is exchanged, stored or processed. In what follows we use 'character encoding' to mean either character encoding form (CEF) or character encoding scheme (CES) depending on the context. When text is transmitted or stored as a byte stream, for instance in a protocol or file system, specification of a CES is required to ensure proper interpretation. In contexts such as an API, where the environment (typically the processor architecture) specifies the byte order of multibyte quantities, specification of a CEF suffices.

C015 [S] Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified.

C016 [S] When designing a new protocol, format or API, specifications SHOULD require a unique character encoding.

C017 [S] When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules.

EXAMPLE: An XML-based format should use the existing XML rules for choosing and determining the character encoding of external entities, rather than invent new ones.
4.4.1 Mandating a unique character encoding

Mandating a unique character encoding is simple, efficient, and robust. There is no need for specifying, producing, transmitting, and interpreting encoding tags. At the receiver, the character encoding will always be understood. There is also no ambiguity as to which character encoding to use if data is transferred non-electronically and later has to be converted back to a digital representation. Even when there is a need for compatibility with existing data, systems, protocols and applications, multiple character encodings can often be dealt with at the boundaries or outside a protocol, format, or API. The DOM [DOM Level 1] is an example of where this was done. The advantages of choosing a unique character encoding are greater when text sizes are small or the specification is close to the actual processing.

C018 [S] When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32.

US-ASCII is upwards-compatible with UTF-8 (a US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired. In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes.

NOTE: The IETF Charset Policy [RFC 2277] specifies that on the Internet "Protocols MUST be able to use the UTF-8 charset".

NOTE: The XML 1.0 specification [XML 1.0] requires all conforming XML processors to accept both UTF-16 and UTF-8.
4.4.2 Character encoding identification

The MIME Internet specification provides a good example of a mechanism for character encoding identification [MIME-charset][RFC 2978]. The MIME charset parameter definition is intended to supply sufficient information to uniquely decode the sequence of bytes of the received data into a sequence of characters. The values are drawn from the IANA charset registry [IANA].

NOTE: Unfortunately, some charset identifiers do not represent a single, unique character encoding. Instead, these identifiers denote a number of small variations. Even though small, the differences may be crucial and may vary over time. For these identifiers, recovery of the character sequence from a byte sequence is ambiguous. For example, the character encoded as 0x5C in Shift_JIS is ambiguous. This code point sometimes represents a YEN SIGN and sometimes represents a REVERSE SOLIDUS. See the [XML Japanese Profile] for more detail on this example and for additional examples of such ambiguous charset identifiers.

NOTE: The term charset derives from 'character set', an expression with a long and tortured history (see [Connolly] for a discussion).

C020 [S] Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED.

NOTE: In XML, the XML declaration or the text declaration contains the encoding pseudo-attribute which identifies the character encoding using the IANA charset.

The IANA charset registry is the official list of names and aliases for character encoding schemes on the Internet.

C021 [S] If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs.

C022 [S] [I] [C] Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement.

C023 [S] [I] [C] If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed.

C049 [I] [C] The character encoding of content SHOULD be chosen so that it maximizes the opportunity to directly represent characters (i.e. minimizes the need to represent characters by markup means such as character escapes) while avoiding obscure encodings that are unlikely to be understood by recipients.

NOTE: Due to Unicode's large repertoire and wide base of support, a character encoding based on Unicode is a good choice to encode a document.

C034 [C] If facilities are offered for identifying character encoding, content MUST make use of them; where the facilities offered for character encoding identification include defaults (e.g. in XML 1.0 [XML 1.0]), relying on such defaults is sufficient to satisfy this identification requirement.

C024 [I] [C] Content and software that label text data MUST use one of the names required by the appropriate specification (e.g. the XML specification when editing XML text) and SHOULD use the MIME preferred name of a character encoding to label data in that character encoding.

C025 [I] [C] An IANA-registered charset name MUST NOT be used to label text data in a character encoding other than the one identified in the IANA registration of that name.

C026 [S] If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification).

C027 [S] Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them.

C028 [S] Specifications MUST NOT propose the use of heuristics to determine the encoding of data.

Examples of heuristics include the use of statistical analysis of byte (pattern) frequencies or character (pattern) frequencies. Heuristics are bad because they will not work consistently across different implementations. Well-defined instructions of how to unambiguously determine a character encoding, such as those given in XML 1.0 [XML 1.0], Appendix F, are not considered heuristics.

C029 [I] Receiving software MUST determine the encoding of data from available information according to appropriate specifications.

C030 [I] When an IANA-registered charset name is recognized, receiving software MUST interpret the received data according to the encoding associated with the name in the IANA registry.

C031 [I] When no charset is provided receiving software MUST adhere to the default character encoding(s) specified in the specification.

Receiving software may recognize as many character encodings and as many charset names and aliases for them as appropriate.

A field-upgradeable mechanism may be appropriate for this purpose. Certain character encodings are more or less associated with certain languages (e.g. Shift_JIS with Japanese). Trying to support a given language or set of customers may mean that certain character encodings have to be supported. However, one cannot assume universal support for a favoured but non-required encoding. The character encodings that need to be supported may change over time. This document does not give any advice on which character encoding may be appropriate or necessary for the support of any given language.

Because of the layered Web architecture (e.g. formats used over protocols), there may be multiple and at times conflicting information about character encoding.

C035 [S] Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding.

C033 [I] Software MUST completely implement the mechanisms for character encoding identification and conflict resolution.
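
One way a receiver might combine explicit labels, a priority order and a specified default, without resorting to heuristics, is sketched below. The function `determine_encoding`, the source labels and the choice of UTF-8 as the default are all assumptions for this example; the actual priorities and default must come from the relevant specification.

```python
import codecs


def determine_encoding(declarations, default="utf-8"):
    """Pick the encoding from (source, charset) pairs given in priority order.

    `charset` is None when a source carries no encoding information.
    """
    for source, charset in declarations:
        if charset is not None:
            codecs.lookup(charset)  # raises LookupError for an unknown name
            return charset
    return default                  # no label at all: use the specified default


# Hypothetical priority order: protocol-level label first, then the
# declaration embedded in the document.
print(determine_encoding([("protocol", None), ("document", "iso-8859-1")]))  # iso-8859-1
print(determine_encoding([("protocol", "utf-16"), ("document", "iso-8859-1")]))  # utf-16
print(determine_encoding([("protocol", None), ("document", None)]))  # utf-8
```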
4.5 Private use code points

Certain ranges of Unicode code points are designated for private use: the Private Use Area (PUA) (U+E000-F8FF) and planes 15 and 16 (U+F0000-FFFFD and U+100000-10FFFD). These code points are guaranteed to never be allocated to standard characters, and are available for use by private agreement. However, private agreements do not scale on the Web. Code points from different private agreements may collide. Also, a private agreement, and therefore the meaning of the code points, can quickly become lost.

C073 [C] Publicly interchanged content SHOULD NOT use code points in the private use area.
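
A content checker could flag private use code points along the following lines. The ranges are the ones given above; the function names are invented for this sketch.

```python
# Private use ranges: the PUA in the BMP, plus planes 15 and 16.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def is_private_use(cp):
    """True if the code point lies in one of the private use ranges."""
    return any(low <= cp <= high for low, high in PRIVATE_USE_RANGES)


def find_private_use(text):
    """Return (index, 'U+XXXX') for each private use code point in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if is_private_use(ord(ch))
    ]


print(find_private_use("a\ue000b"))  # [(1, 'U+E000')]
```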

NOTE: A typical exception would be the use of the PUA to design and test the encoding of not yet encoded (e.g. historic or rare) scripts.

C076 [C] Content MUST NOT use a code point for any purpose other than that defined by its coded character set.

This prohibits the construction of fonts that misuse e.g. iso-8859-1 to represent different scripts, characters, or symbols than what is actually encoded in iso-8859-1.

C038 [S] Specifications MUST NOT require the use of private use area characters with particular assignments.

C039 [S] Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points.

C040 [S] [I] Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement.

As an example, XML does not disallow the use of private use code points.

C041 [S] Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters.

EXAMPLE: MathML (see [MathML2] section 3.2.9) defines an element mglyph for mathematical symbols not in Unicode.

EXAMPLE: SVG (see [SVG] section 10.14) defines an element altglyph which allows the identification of specific display variants of Unicode characters.

C068 [S] Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics.
4.6 Character Escaping

Markup languages or programming languages often designate certain characters as syntax-significant, giving them specific functions within the language (e.g. '<' and '&' serve as markup delimiters in HTML and XML). As a consequence, these syntax-significant characters cannot be used to represent themselves in text in the same way as all other characters do, creating the need for a mechanism to "escape" their syntax-significance. There is also a need, often satisfied by the same or similar mechanisms, to express characters not directly representable in the character encoding chosen for a particular document or program (an instance of the markup or programming language).

Formally, a character escape is a syntactic device defined in a markup or programming language that allows one or more of:

1. expressing syntax-significant characters while disregarding their significance in the syntax of the language, or

2. expressing characters not representable in the character encoding chosen for an instance of the language, or

3. expressing characters in general, without use of the corresponding encoded characters.

Escaping a character means expressing it using such a syntactic device, appropriate to the format or protocol in which the character appears; expanding a character escape (or unescaping) means replacing it with the character that it represents.

EXAMPLE: HTML and XML define 'Numeric Character References' which allow both the escaping of syntax-significance and the expression of arbitrary Unicode characters. Expressed as &#x3C; or &#60; the character '<' will not be parsed as a markup delimiter.

EXAMPLE: The programming language Java uses '"' to delimit strings. To express '"' within a string, one may escape it as '\"'.

EXAMPLE: XML defines 'CDATA sections' which allow escaping the syntax-significance of all characters between the CDATA section delimiters. CDATA sections prevent the expression of characters using numeric character references.
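
The escaping and expanding operations described above can be sketched in Python. This is only an illustration using the standard library's html module, which is not part of this specification; note that html.escape emits named references (such as &lt;) rather than numeric ones, while html.unescape expands both decimal and hexadecimal numeric character references.

```python
import html

# Escaping: syntax-significant characters are replaced by character escapes.
escaped = html.escape("a < b & c")

# Expanding (unescaping): the hexadecimal and the decimal numeric character
# reference both expand to the same character.
from_hex = html.unescape("&#x3C;")
from_dec = html.unescape("&#60;")

print(escaped)
print(from_hex == from_dec == "<")
```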

The following guidelines apply to the way specifications define character escapes.

* C042 [S] Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists.

* C043 [S] The number of different ways to escape a character SHOULD be minimized (ideally to one).

  A well-known counter-example is that, for historical reasons, both HTML and XML have redundant decimal (&#ddddd;) and hexadecimal (&#xhhhh;) character escapes.

* C044 [S] Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided.

  Character escapes of the latter kind are not clear visually, and can cause an editor to insert spurious line-breaks when word-wrapping on spaces. Forms like SPREAD's &UABCD; [SPREAD] or XML's &#xhhhh;, where the character escape is explicitly terminated by a semicolon, are much better.

* C045 [S] Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation.

* C046 [S] Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable.
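
As an illustration of C044 and C045, the following minimal sketch builds and expands escapes in the XML-style &#xhhhh; form: the number is the Unicode code point in hexadecimal, and the semicolon is an explicit end delimiter. The function names escape_char and expand_escapes are hypothetical and chosen only for this example.

```python
import re

def escape_char(ch: str) -> str:
    """Return an XML-style hexadecimal escape for a single character."""
    return f"&#x{ord(ch):X};"

# The trailing ';' is an explicit end delimiter, so the end of the escape is
# unambiguous even when it is followed by characters that look like hex digits.
_ESCAPE = re.compile(r"&#x([0-9A-Fa-f]+);")

def expand_escapes(text: str) -> str:
    """Replace every hexadecimal escape with the character it represents."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

escaped = escape_char("\u00E9") + "1"   # escaped e-acute followed by the digit '1'
print(escaped)
print(expand_escapes(escaped))
```

Without the end delimiter, the digit following the escape in this example could be misread as part of the code point.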

The following guidelines apply to content developers, as well as to software that generates content:

* C047 [I] [C] Escapes SHOULD only be used when the characters to be expressed are not directly representable in the format or the character encoding of the document, or when the visual representation of the character is unclear.

  NOTE: An example of when the visual representation of the character is unclear is the use of &nbsp; to distinguish a non-breaking space from a normal space.

* C048 [I] [C] Content SHOULD use the hexadecimal form of character escapes rather than the decimal form when there are both.

  NOTE: The hexadecimal form is preferred because character encoding standards (in particular Unicode) usually list character numbers as hexadecimal, making lookup easier.
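
A content generator following C047 and C048 might escape only those characters that the target encoding cannot represent, and use the hexadecimal form when it does. The sketch below assumes XML-style escapes; escape_unrepresentable is a hypothetical helper, and it deliberately ignores syntax-significant characters such as '&', which would need separate handling.

```python
def escape_unrepresentable(text: str, encoding: str) -> str:
    """Escape only the characters that `encoding` cannot represent."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)                   # directly representable: keep as-is
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")   # hexadecimal form, as in C048
    return "".join(out)

# U+00E9 exists in ISO-8859-1 and is kept; the two kanji are escaped.
print(escape_unrepresentable("caf\u00E9 \u65E5\u672C", "iso-8859-1"))
```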

5 Compatibility and Formatting Characters

This specification does not address the suitability of particular characters for use in markup languages, in particular formatting characters and compatibility equivalents. For detailed recommendations about the use of compatibility and formatting characters, see Unicode in XML and other Markup Languages [UXML].

C050 [S] Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define.
6 Strings
6.1 String concepts

Various specifications use the notion of a 'string', sometimes without defining precisely what is meant and sometimes defining it differently from other specifications. The reason for this variability is that there are in fact multiple reasonable definitions for a string, depending on one's intended use of the notion; the term 'string' is used for all these different notions because these are actually just different views of the same reality: a piece of text stored inside a computer.

Byte string: A string viewed as a sequence of bytes representing characters in a particular character encoding. This corresponds to a character encoding scheme (CES). Text processing of a byte string is dependent on the particular encoding used. When the encoding changes, the processing must also be changed to reflect the structure of the new encoding. Such a change could require significant redesign of the functions or API used to process the byte strings as text. Therefore, this definition is only useful in specifications when the textual nature of a string is unimportant and the string is considered only as a piece of opaque data with a length in bytes (such as when copying a buffer).

C011 [S] Specifications SHOULD NOT define a string as a 'byte string'.

EXAMPLE: This is a counter-example, illustrating one reason why considering strings as byte strings may be problematic. Consider text containing the character U+233B4 (a Chinese character meaning 'stump of tree') encoded as UTF-16 in big-endian byte order (UTF-16BE). The text will contain the bytes D8 4C DF B4. If one searches this text, considered as a byte string, for the character U+4CDF (another Chinese character meaning 'phoenix'), an erroneous match will be found on the bytes 4C DF that are the UTF-16BE representation of U+4CDF.
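
The erroneous match described in this counter-example can be reproduced with a short Python sketch. Python's str type is used here as a stand-in for a character string and bytes for a byte string; the variable names are illustrative only.

```python
text = "\U000233B4"      # one character outside the Basic Multilingual Plane
needle = "\u4CDF"        # a different character

haystack_bytes = text.encode("utf-16-be")    # bytes D8 4C DF B4
needle_bytes = needle.encode("utf-16-be")    # bytes 4C DF

# Searching the byte string finds a false match straddling the two code units.
byte_match = needle_bytes in haystack_bytes
byte_offset = haystack_bytes.find(needle_bytes)

# Searching the character string correctly finds nothing.
char_match = needle in text

print(haystack_bytes.hex().upper(), byte_match, byte_offset, char_match)
```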

Code unit string: A string viewed as a sequence of code units representing characters in a particular character encoding. This corresponds to a character encoding form (CEF). A definition of a code unit string needs to include the size of the code units (e.g. 16 bits) and the character encoding used (e.g. UTF-16). Code unit strings are useful in APIs that expose a physical representation of string data based on reliable knowledge of the encoding forms that are likely candidates for implementation. Example: For the DOM [DOM Level 1], UTF-16 was chosen based on widespread implementation practice. In general, 'code unit string' is only useful if the implementation candidates are likely to be either UTF-16 or UTF-32.

Character string: A string viewed as a sequence of characters, each represented by a code point in Unicode [Unicode]. This is usually what programmers consider to be a string, although it may not match exactly what most users perceive as characters. This is the highest layer of abstraction that ensures interoperability with very low implementation effort. The 'character string' definition of a string is generally the most useful. Good examples using this definition include the Production [2] of XML 1.0 [XML 1.0], the SGML declaration of HTML 4.0 [HTML 4.01], and the character model of RFC 2070 [RFC 2070].

C012 [S] The 'character string' definition SHOULD be used by most specifications.

EXAMPLE: Consider the string comprising the characters U+233B4 (a Chinese character meaning 'stump of tree'), U+2260 NOT EQUAL TO, U+0071 LATIN SMALL LETTER Q and U+030C COMBINING CARON, encoded in UTF-16 in big-endian byte order. The rows of the following table show the string viewed as a character string, code unit string and byte string, respectively:
Glyphs: Ideographic supplementary character: Archaic Chinese character meaning "the stump of a tree" (still in current use in Cantonese) | NOT EQUAL TO | LATIN SMALL LETTER Q | COMBINING CARON
Character string: U+233B4 | U+2260 | U+0071 | U+030C
Code unit string: D84C DFB4 | 2260 | 0071 | 030C
Byte string: D8 4C DF B4 | 22 60 | 00 71 | 03 0C
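
The three views in this table can be derived from the same string programmatically. The following is a minimal Python sketch, assuming UTF-16 in big-endian byte order as in the example; the variable names are illustrative.

```python
import struct

s = "\U000233B4\u2260q\u030C"

# Character string: one entry per Unicode code point.
code_points = [f"U+{ord(ch):04X}" for ch in s]

# Byte string: the UTF-16BE serialization, one entry per byte.
data = s.encode("utf-16-be")
byte_view = [f"{b:02X}" for b in data]

# Code unit string: the same bytes grouped into 16-bit big-endian units.
units = struct.unpack(f">{len(data) // 2}H", data)
code_units = [f"{u:04X}" for u in units]

print(code_points)
print(code_units)
print(byte_view)
```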

NOTE: It is also possible to view a string as a sequence of grapheme clusters. Grapheme clusters divide the text into units that correspond more closely than character strings to the user's perception of where character boundaries occur in a visually rendered text. A discussion of grapheme clusters is given at the end of Section 2.10 of the Unicode Standard, Version 4 [Unicode 4.0]; a formal definition is given in Unicode Standard Annex #29 [UTR #29]. The Unicode Standard defines default grapheme clustering. Some languages require tailoring to this default. For example, a Slovak user might wish to treat the default pair of grapheme clusters "ch" as a single grapheme cluster. Note that the interaction between the language of string content and the end-user's preferences may be complex.
6.2 String indexing

There are many situations where a software process needs to access a substring or to point within a string and does so by the use of indices, i.e. numeric "positions" within a string. Where such indices are exchanged between components of the Web, there is a need for an agreed-upon definition of string indexing in order to ensure consistent behavior. The requirements for string indexing are discussed in Requirements for String Identity Matching [CharReq], section 4. The two main questions that arise are: "What is the unit of counting?" and "Do we start counting at 0 or 1?".

The example in the previous section, 6.1 String concepts, shows a string viewed as a character string, code unit string and byte string, respectively, each of which involves different units for indexing.

Depending on the particular requirements of a process, the unit of counting may correspond to definitions of a string provided in section 6.1 String concepts. In particular:

* C051 [S] [I] The character string is RECOMMENDED as a basis for string indexing.

  (Example: the XML Path Language [XPath]).

* C052 [S] [I] A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string.

  (Example: the use of UTF-16 in [DOM Level 1]).

* C071 [S] [I] Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern.

  See Unicode Standard Annex #29, Text Boundaries [UTR #29].

  C074 [S] Specifications that define indexing in terms of grapheme clusters MUST either: a) define grapheme clusters in terms of default grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UTR #29], or b) define specifically how tailoring is applied to the indexing operation.

* C072 [S] [I] The use of byte strings for indexing is NOT RECOMMENDED.
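
To see why the unit of counting matters, the sketch below locates the same character in the example string from section 6.1 using three different units. It assumes UTF-16BE for the code unit and byte views; the variable names are illustrative.

```python
s = "\U000233B4\u2260q\u030C"
target = "q"

# Index counted in characters (code points), the basis recommended by C051.
char_index = s.index(target)

# Index counted in UTF-16 code units: count the units that precede the target.
prefix = s[:char_index].encode("utf-16-be")
unit_index = len(prefix) // 2

# Index counted in UTF-16BE bytes (not recommended, see C072).
byte_index = len(prefix)

print(char_index, unit_index, byte_index)
```

The three indices differ because the first character occupies one code point, two code units and four bytes.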

It is noteworthy that there exist other, non-numeric ways of identifying substrings which have favorable properties. For instance, substrings based on string matching are quite robust against small edits; substrings based on document structure (in structured formats such as XML) are even more robust against edits and even against translation of a document from one human language to another.

C053 [S] Specifications that need a way to identify substrings or point within a string SHOULD provide ways other than string indexing to perform this operation.

C054 [I] [C] Users of specifications (software developers, content developers) SHOULD whenever possible prefer ways other than string indexing to identify substrings or point within a string.

Experience shows that more general, flexible and robust specifications result when individual characters are understood and processed as substrings, identified by a position before and a position after the substring. Understanding indices as boundary positions between the counting units also makes it easier to relate the indices resulting from the different string definitions.

C055 [S] Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units.

C056 [S] Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types.

EXAMPLE: The function uppercase("ß") cannot return the proper result (the two-character string 'SS') if the return type of the uppercase function is defined to be a single character. Note, also, that there is not necessarily a one-to-one mapping between characters and units of sound, input, etc. as described in 3 Perceptions of Characters.
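
Python's built-in str.upper illustrates the point of this example: because it returns a string rather than a single character, it can return the two-character result.

```python
lower = "\u00DF"         # LATIN SMALL LETTER SHARP S
upper = lower.upper()    # the return type is a string, so two characters fit

print(len(lower), len(upper), upper)
```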

The issue of index origin, i.e. whether we count from 0 or 1, actually arises only after a decision has been made on whether it is the units themselves that are counted or the positions between the units.

C057 [S] When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string.
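
Python slicing happens to follow the model recommended in C055 and C057, so it can serve as a small illustration: indices denote boundary positions counted from 0, and a single character is addressed as a substring between two adjacent positions. The variable names below are illustrative.

```python
s = "\U000233B4\u2260q\u030C"

# Boundary positions run from 0 (start of string) to len(s) (end of string).
boundaries = list(range(len(s) + 1))

# A single character is the substring between two adjacent boundaries.
second_char = s[1:2]

# Using the same boundary twice identifies a position, not a character.
insertion_point = s[2:2]

print(boundaries, repr(second_char), repr(insertion_point))
```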
7 Referencing the Unicode Standard and ISO/IEC 10646

Specifications often need to make references to the Unicode Standard or International Standard ISO/IEC 10646. Such references must be made with care, especially when normative. The questions to be considered are:

* Which standard should be referenced?
* How to reference a particular version?
* When to use versioned vs. unversioned references?

ISO/IEC 10646 is developed and published jointly by ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission). The Unicode Standard is developed and published by the Unicode Consortium, an organization of major computer corporations, software producers, database vendors, national governments, research institutions, international agencies, various user groups, and interested individuals. The Unicode Standard is comparable in standing to W3C Recommendations.

ISO/IEC 10646 and the Unicode Standard define exactly the same coded character set (CCS) (same repertoire, same code points) and encoding forms. They are actively maintained in synchrony by liaisons and overlapping membership between the respective technical committees. In addition to the jointly defined CCS and encoding forms, the Unicode Standard adds normative and informative lists of character properties, normative character equivalence and normalization specifications, a normative algorithm for bidirectional text and a large amount of useful implementation information. In short, the Unicode Standard adds semantics to the characters that ISO/IEC 10646 merely enumerates. Conformance to the Unicode Standard implies conformance to ISO/IEC 10646, see [Unicode 4.0] Appendix C.

C062 [S] Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646.

When a specification provides a reference to the Unicode Standard, implementers can benefit from the wealth of information provided in the standard and on the Unicode Consortium Web site.

The fact that both ISO/IEC 10646 and the Unicode Standard are evolving (in synchrony) raises the issue of versioning: should a specification refer to a specific version of the standard, or should it make a generic reference, so that the normative reference is to the version current at the time of reading the specification? In general the answer is both.

C063 [S] A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time.

An example would be the set of characters acceptable as Name characters in XML 1.0 [XML 1.0], which is an enumerated list that parsers must implement to validate names.

NOTE: See http://www.unicode.org/unicode/standard/versions/#Citations for guidance on referring to specific versions of the Unicode Standard.

A generic reference can be formulated in two ways:

1. By explicitly including a generic entry in the bibliography section of a specification and simply referring to that entry in the body of the specification. Such a generic entry contains text such as "... as it may from time to time be revised or amended".
2. By including a specific entry in the bibliography and adding text such as "... as it may from time to time be revised or amended" at the point of reference in the body of the specification.

It is an editorial matter, best left to each specification, which of these two formulations is used. Examples of the first formulation can be found in the bibliography of this specification (see the entries for [ISO/IEC 10646] and [Unicode]). Examples of the latter, as well as a discussion of the versioning issue with respect to MIME charset parameters for UCS encodings, can be found in [RFC 3629] and [RFC 2781].

C064 [S] All generic references to the Unicode Standard [Unicode] MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification.

C065 [S] All generic references to ISO/IEC 10646 [ISO/IEC 10646] MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification.
A References
A.1 Normative References

IANA
Internet Assigned Numbers Authority, Official Names for Character Sets. (See http://www.iana.org/assignments/character-sets.)
ISO/IEC 10646
ISO/IEC 10646:2003, Information technology -- Universal Multiple-Octet Coded Character Set (UCS), as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. (See http://www.iso.org/iso/en/ISOOnline.openerpage for the latest version.)
MIME-entity
N. Freed, N. Borenstein, Multipurpose Internet Mail Extensions (MIME). Part One: Format of Internet Message Bodies, RFC 2045, November 1996, http://www.ietf.org/rfc/rfc2045.txt.
MIME-charset
N. Freed, N. Borenstein, Multipurpose Internet Mail Extensions (MIME). Part Two: Media Types, RFC 2046, November 1996, http://www.ietf.org/rfc/rfc2046.txt.
RFC 2119
S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119. (See http://www.ietf.org/rfc/rfc2119.txt.)
Unicode
The Unicode Consortium, The Unicode Standard, Version 4, ISBN 0-321-18578-1, as updated from time to time by the publication of new versions. (See http://www.unicode.org/unicode/standard/versions for the latest version and additional information on versions of the standard and of the Unicode Character Database).
Unicode 3.2
The Unicode Consortium, The Unicode Standard, Version 3.2.0 is defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (see http://www.unicode.org/reports/tr27) and by the Unicode Standard Annex #28: Unicode 3.2 (see http://www.unicode.org/reports/tr28).
Unicode 4.0
The Unicode Consortium. The Unicode Standard, Version 4.0, Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1. (See http://www.unicode.org/versions/Unicode4.0.0/)

A.2 Other References

CharNorm
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin, Addison Phillips, Character Model for the World Wide Web 1.0: Normalization, W3C Working Draft. (See http://www.w3.org/TR/charmod-norm.)
CharIRI
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin, Character Model for the World Wide Web 1.0: Resource Identifiers, W3C Candidate Recommendation. (See http://www.w3.org/TR/charmod-resid.)
CharReq
Martin J. Dürst, Requirements for String Identity Matching and String Indexing, W3C Working Draft. (See http://www.w3.org/TR/WD-charreq.)
Connolly
D. Connolly, Character Set Considered Harmful, W3C Note. (See http://www.w3.org/MarkUp/html-spec/charset-harmful.)
CSS2
Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation. (See http://www.w3.org/TR/REC-CSS2.)
DOM Level 1
Vidur Apparao et al., Document Object Model (DOM) Level 1 Specification, W3C Recommendation. (See http://www.w3.org/TR/REC-DOM-Level-1.)
HTML 4.0
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation, 18-Dec-1997 (See http://www.w3.org/TR/REC-html40-971218.)
HTML 4.01
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.01 Specification, W3C Recommendation. (See http://www.w3.org/TR/html401.)
ISO/IEC 646
ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange. This standard defines an International Reference Version (IRV) which corresponds exactly to what is widely known as ASCII or US-ASCII. ISO/IEC 646 was based on the earlier standard ECMA-6. ECMA has maintained its standard up to date with respect to ISO/IEC 646 and makes an electronic copy available at http://www.ecma-international.org/publications/standards/Ecma-006.htm
ISO/IEC 9541-1
ISO/IEC 9541-1:1991, Information technology -- Font information interchange -- Part 1: Architecture. (See http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277 for the latest version.)
ISO/IEC 14651
ISO/IEC 14651:2000, Information technology -- International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. (See http://www.iso.org/iso/en/ISOOnline.openerpage for the latest version.)
MathML2
David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Eds., Mathematical Markup Language (MathML) Version 2.0, W3C Recommendation. (See http://www.w3.org/TR/MathML2.)
Nicol
Gavin Nicol, The Multilingual World Wide Web, Chapter 2: The WWW As A Multilingual Application. (See http://www.mind-to-mind.com/library/papers/multilingual/multilingual-www.html.)
RFC 2070
F. Yergeau, G. Nicol, G. Adams, M. Dürst, Internationalization of the Hypertext Markup Language, IETF RFC 2070, January 1997. (See http://www.ietf.org/rfc/rfc2070.txt.)
RFC 2277
H. Alvestrand, IETF Policy on Character Sets and Languages, IETF RFC 2277, BCP 18, January 1998. (See http://www.ietf.org/rfc/rfc2277.txt.)
RFC 2978
N. Freed, J. Postel, IANA Charset Registration Procedures, IETF RFC 2978, BCP 19, October 2000. (See http://www.ietf.org/rfc/rfc2978.txt.)
RFC 3629
F. Yergeau, UTF-8, a transformation format of ISO 10646, IETF RFC 3629, STD 63, November 2003. (See http://www.ietf.org/rfc/rfc3629.txt.)
RFC 2781
P. Hoffman, F. Yergeau, UTF-16, an encoding of ISO 10646, IETF RFC 2781, February 2000. (See http://www.ietf.org/rfc/rfc2781.txt.)
SPREAD
SPREAD - Standardization Project for East Asian Documents Universal Public Entity Set. (See http://www.ascc.net/xml/resource/entities/index.html)
SVG
Jon Ferraiolo, Ed., Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation. (See http://www.w3.org/TR/SVG.)
UTR #10
Mark Davis, Ken Whistler, Unicode Collation Algorithm, Unicode Technical Report #10. (See http://www.unicode.org/unicode/reports/tr10.)
UTR #17
Ken Whistler, Mark Davis, Character Encoding Model, Unicode Technical Report #17. (See http://www.unicode.org/unicode/reports/tr17.)
UTR #29
Mark Davis, Text Boundaries, Unicode Standard Annex #29. (See http://www.unicode.org/unicode/reports/tr29 for the latest version).
UXML
Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages, Unicode Technical Report #20 and W3C Note. (See http://www.w3.org/TR/unicode-xml.)
XML 1.0
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation. (See http://www.w3.org/TR/REC-xml.)
XML Japanese Profile
MURATA Makoto Ed., XML Japanese Profile, W3C Note. (See http://www.w3.org/TR/japanese-xml.)
XPath
James Clark, Steve DeRose, Eds, XML Path Language (XPath) Version 1.0, W3C Recommendation. (See http://www.w3.org/TR/xpath.)

B Examples of Characters, Keystrokes and Glyphs (Non-Normative)

A few examples will help make sense of all this complexity of text in computers (which is mostly a reflection of the complexity of human writing systems). Let us start with a very simple example: a user, equipped with a US-English keyboard, types "Foo", which the computer encodes as 16-bit values (the UTF-16 encoding of Unicode) and displays on the screen.
Keystrokes: Shift-f | o | o
Input characters: F | o | o
Encoded characters (byte values in hex): 0046 | 006F | 006F
Display: Foo
Example: Basic Latin

The only complexity here is the use of a modifier (Shift) to input the capital 'F'.

A slightly more complex example is a user typing 'çé' on a traditional French-Canadian keyboard, which the computer again encodes in UTF-16 and displays. We assume that this particular computer uses a fully composed form of UTF-16.
Keystrokes: ¸ | c | é
Input characters: ç | é
Encoded characters (byte values in hex): 00E7 | 00E9
Display: çé
Example: Latin with diacritics

A few interesting things are happening here: when the user types the cedilla ('¸'), nothing happens except for a change of state of the keyboard driver; the cedilla is a dead key. When the driver gets the c keystroke, it provides a complete 'ç' character to the system, which represents it as a single 16-bit code unit and displays a 'ç' glyph. The user then presses the dedicated 'é' key, which results in, again, a character represented by two bytes. Most systems will display this as one glyph, but it is also possible to combine two glyphs (the base letter and the accent) to obtain the same rendering.
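
The difference between a fully composed representation and one that uses base letters plus combining marks can be inspected with Python's unicodedata module. This sketch assumes that "fully composed" corresponds to Unicode Normalization Form C (NFC) and uses NFD for the decomposed counterpart; that mapping is an assumption made for illustration, not a statement from this appendix.

```python
import unicodedata

composed = "\u00E7\u00E9"                             # two precomposed characters
decomposed = unicodedata.normalize("NFD", composed)   # base letters plus combining marks

print([f"{ord(c):04X}" for c in composed])
print([f"{ord(c):04X}" for c in decomposed])
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Both sequences are expected to render identically, even though they differ in the number of encoded characters.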

On to a Japanese example: our user employs a romaji input method to type '日本語' (U+65E5, U+672C, U+8A9E), which the computer encodes in UTF-16 and displays.
Keystrokes: n | i | h | o | n | g | o
Input characters: 日 | 本 | 語
Encoded characters (byte values in hex): 65E5 | 672C | 8A9E
Display Three Kanji characters, U+65E5, U+672C, U+8A9E, pronounced 'nihongo'.
Example: Japanese

The interesting aspect here is input: the user types Latin characters, which are converted on the fly to kana (not shown here), and then to kanji when the user requests conversion by pressing a conversion key; the kanji characters are finally sent to the application when the user presses a confirmation key. The user has to type a total of nine keystrokes before the three characters are produced, which are then encoded and displayed rather trivially.

A Persian example, using Arabic script, will show different phenomena:
Keystrokes: ARABIC LETTER LAM | ARABIC LETTER ALEF | Arabic ligature 'lam-alef' | ARABIC LETTER FARSI YEH | ARABIC LETTER FARSI YEH
Input characters: ل | ا | ل | ا | ی | ی
Encoded characters (byte values in hex): 0644 | 0627 | 0644 | 0627 | 06CC | 06CC
Display: The displayed output appears, from right to left, as: two lam-alef ligatures, and an initial farsi yeh glyph attached to a final farsi yeh glyph.
Example: Persian

Here the first two keystrokes each produce an input character and an encoded character, but the pair is displayed as a single glyph (a lam-alef ligature). The next keystroke is a lam-alef, which some Arabic script keyboards have; it produces the same two characters, which are displayed similarly, but this second lam-alef is placed to the left of the first one when displayed. The last two keystrokes produce two identical characters, which are rendered by two different glyphs (an initial form followed to its left by a final form). We thus have 5 keystrokes producing 6 characters and 4 glyphs laid out right-to-left.

A final example in Tamil, typed with an ISCII keyboard, will illustrate some additional phenomena:
Keystrokes: TAMIL LETTER TTA | TAMIL VOWEL SIGN AA | TAMIL LETTER NGA | TAMIL SIGN VIRAMA | TAMIL LETTER KA | TAMIL VOWEL SIGN OO
Input characters: ட | ா | ங | ் | க | ோ
Encoded characters (byte values in hex): 0B9F | 0BBE | 0B99 | 0BCD | 0B95 | 0BCB
Display: 'Tango' in Tamil letters.
Example: Tamil

Here input is straightforward, but note that contrary to the preceding accented Latin example, the virama diacritic ' ்' (U+0BCD) is entered after the 'ங' (U+0B99) to which it applies. Rendering is interesting for the last two characters. The last one ' ோ' (U+0BCB) clearly consists of two glyphs which surround the glyph of the next to last character 'க' (U+0B95).
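
The encoded characters in the Tamil table can be checked against the Unicode character names using Python's unicodedata module. This is only an inspection aid added for illustration; it says nothing about keystrokes or rendering.

```python
import unicodedata

# Code points taken from the "Encoded characters" row of the Tamil example.
encoded = [0x0B9F, 0x0BBE, 0x0B99, 0x0BCD, 0x0B95, 0x0BCB]
text = "".join(chr(cp) for cp in encoded)

names = [unicodedata.name(ch) for ch in text]
for cp, name in zip(encoded, names):
    print(f"U+{cp:04X} {name}")
```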
C Example text (Non-Normative)

The following are textual versions of strings or characters used in image-based examples in this document. They are provided here for the benefit of those who want to cut and paste the text for their own testing.

1. Section: 3.3 Units of visual rendering

   Example: An example showing the logical order of characters in a string containing two Arabic words followed by a year number. In logical selection mode, the range of characters selected by starting the selection in the middle of the second word and ending in the middle of the year number is depicted using highlighting. The highlighting covers a single block of contiguous characters.

   Text: ??? ???? ????

2. Section: 6.1 String concepts

   Example: Ideographic supplementary character: Archaic Chinese character meaning "the stump of a tree" (still in current use in Cantonese), NOT EQUAL TO, LATIN SMALL LETTER Q, COMBINING CARON

   Text: 𣎴≠q̌

3. Section: B Examples of Characters, Keystrokes and Glyphs

   Example: Three Kanji characters, U+65E5, U+672C, U+8A9E, pronounced 'nihongo'.

   Text: 日本語

4. Section: B Examples of Characters, Keystrokes and Glyphs

   Example: The displayed output appears, from right to left, as: two lam-alef ligatures, and an initial farsi yeh glyph attached to a final farsi yeh glyph.

   Text: لالایی

5. Section: B Examples of Characters, Keystrokes and Glyphs

   Example: 'Tango' in Tamil letters.

   Text: டாங்கோ

D List of conformance criteria (Non-Normative)

This is a list of the conformance criteria in this specification, in document order. This list can be used to check specifications, implementations, and content for conformance to this specification.

When doing so, the following points should be kept in mind:

* To ensure that you understand the meaning, read the whole document first. Use this list as a quick reference only after having first read the conformance criteria in context in the main body of the text.

* If the meaning of a conformance criterion in this list is still unclear after referring back to the surrounding text in the main body of the document, consider sending a comment to www-i18n-comments@w3.org (publicly archived).

* Not all conformance criteria apply to all specifications, implementations, or content. Before checking for actual conformance, applicability should be checked. As an example, C010 only applies to specifications. As another example, C002 applies to specifications, implementations, and content, but only if it deals with mapping between characters and units of displayed text.

C001 [S] [I] [C] Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language.
C002 [S] [I] [C] Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text.
C003 [S] [I] [C] Protocols, data formats and APIs MUST store, interchange or process text data in logical order.
C075 [I] Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage.
C004 [S] Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs.
C005 [S] [I] Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world.
C006 [S] [I] Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.
C007 [S] [I] Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user.
C066 [S] [I] Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering.
C008 [S] [I] Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode.
C009 [S] [I] Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage.
C010 [S] When specifications use the term 'character' the specifications MUST define which meaning they intend.
C067 [S] Specifications SHOULD use specific terms, when available, instead of the general term 'character'.
C013 [S] [C] Textual data objects defined by protocol or format specifications MUST be in a single character encoding.
C014 [S] All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model, namely:

1. Specifications MUST define text in terms of Unicode characters, not bytes or glyphs.

2. For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form.

3. Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows:

   * The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form.

   * All processing MUST take place on this sequence of Unicode characters.

   * If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification.

4. If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects.
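
As an informal illustration of the Reference Processing Model in C014, the following Python sketch decodes received bytes into Unicode characters, processes only those characters, and re-encodes the result on output. The function name `process_text`, its parameters, and the choice of UTF-8 as the default output encoding are assumptions made for this example and are not defined by any specification.

```python
from typing import Callable


def process_text(
    data: bytes,
    declared_encoding: str,
    transform: Callable[[str], str],
    output_encoding: str = "utf-8",  # assumed default, chosen for this sketch
) -> bytes:
    """Hypothetical helper following the three processing steps of C014."""
    # Step 1: interpret the received bytes as a sequence of Unicode
    # characters, using the encoding that was determined for the data.
    text = data.decode(declared_encoding)

    # Step 2: all processing takes place on Unicode characters,
    # never on the raw bytes of the original encoding.
    processed = transform(text)

    # Step 3: encode the output in an encoding the specification allows.
    return processed.encode(output_encoding)


# The same text received in two different encodings gives the same result.
latin1_input = "café".encode("iso-8859-1")
utf16_input = "café".encode("utf-16")
result_a = process_text(latin1_input, "iso-8859-1", str.upper)
result_b = process_text(utf16_input, "utf-16", str.upper)
```

Because the transformation only ever sees decoded characters, the observable behavior does not depend on the encoding in which the data arrived, which is the point of the model.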

C070 [S] Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.
C077 [S] Specifications MUST NOT allow code points above U+10FFFF.
C079 [S] Specifications SHOULD NOT allow the use of code points reserved by Unicode for internal use.
C078 [S] Specifications MUST NOT allow the use of surrogate code points.
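
A minimal sketch of the range checks implied by C070, C077 and C078, written in Python. The function name `is_allowed_code_point` is hypothetical; the numeric limits are the Unicode code point range U+0000 to U+10FFFF and the surrogate range U+D800 to U+DFFF. The code points covered by C079 are not handled here.

```python
MAX_CODE_POINT = 0x10FFFF      # upper limit of the Unicode code space
SURROGATE_FIRST = 0xD800       # first surrogate code point
SURROGATE_LAST = 0xDFFF        # last surrogate code point


def is_allowed_code_point(cp: int) -> bool:
    """Return True if cp is in range (C077) and is not a surrogate (C078)."""
    if cp < 0 or cp > MAX_CODE_POINT:
        return False
    if SURROGATE_FIRST <= cp <= SURROGATE_LAST:
        return False
    # In line with C070, nothing else is excluded arbitrarily.
    return True
```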
C015 [S] Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified.
C016 [S] When designing a new protocol, format or API, specifications SHOULD require a unique character encoding.
C017 [S] When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules.
C018 [S] When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32.
C020 [S] Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED.
C021 [S] If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs.
C022 [S] [I] [C] Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement.
C023 [S] [I] [C] If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed.
C049 [I] [C] The character encoding of content SHOULD be chosen so that it maximizes the opportunity to directly represent characters (i.e. minimizes the need to represent characters by markup means such as character escapes) while avoiding obscure encodings that are unlikely to be understood by recipients.
C034 [C] If facilities are offered for identifying character encoding, content MUST make use of them; where the facilities offered for character encoding identification include defaults (e.g. in XML 1.0 [XML 1.0]), relying on such defaults is sufficient to satisfy this identification requirement.
C024 [I] [C] Content and software that label text data MUST use one of the names required by the appropriate specification (e.g. the XML specification when editing XML text) and SHOULD use the MIME preferred name of a character encoding to label data in that character encoding.
C025 [I] [C] An IANA-registered charset name MUST NOT be used to label text data in a character encoding other than the one identified in the IANA registration of that name.
C026 [S] If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification).
C027 [S] Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them.
C028 [S] Specifications MUST NOT propose the use of heuristics to determine the encoding of data.
C029 [I] Receiving software MUST determine the encoding of data from available information according to appropriate specifications.
C030 [I] When an IANA-registered charset name is recognized, receiving software MUST interpret the received data according to the encoding associated with the name in the IANA registry.
C031 [I] When no charset is provided receiving software MUST adhere to the default character encoding(s) specified in the specification.
C035 [S] Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding.
C033 [I] Software MUST completely implement the mechanisms for character encoding identification and conflict resolution.
C073 [C] Publicly interchanged content SHOULD NOT use code points in the private use area.
C076 [C] Content MUST NOT use a code point for any purpose other than that defined by its coded character set.
C038 [S] Specifications MUST NOT require the use of private use area characters with particular assignments.
C039 [S] Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points.
C040 [S] [I] Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement.
C041 [S] Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters.
C068 [S] Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics.
C042 [S] Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists.
C043 [S] The number of different ways to escape a character SHOULD be minimized (ideally to one).
C044 [S] Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided.
C045 [S] Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation.
C046 [S] Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable.
C047 [I] [C] Escapes SHOULD only be used when the characters to be expressed are not directly representable in the format or the character encoding of the document, or when the visual representation of the character is unclear.
C048 [I] [C] Content SHOULD use the hexadecimal form of character escapes rather than the decimal form when there are both.
C050 [S] Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define.
C011 [S] Specifications SHOULD NOT define a string as a 'byte string'.
C012 [S] The 'character string' definition SHOULD be used by most specifications.
C051 [S] [I] The character string is RECOMMENDED as a basis for string indexing.
C052 [S] [I] A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string.
C071 [S] [I] Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern.
C074 [S] Specifications that define indexing in terms of grapheme clusters MUST either: a) define grapheme clusters in terms of default grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UTR #29], or b) define specifically how tailoring is applied to the indexing operation.
C072 [S] [I] The use of byte strings for indexing is NOT RECOMMENDED.
C053 [S] Specifications that need a way to identify substrings or point within a string SHOULD provide ways other than string indexing to perform this operation.
C054 [I] [C] Users of specifications (software developers, content developers) SHOULD whenever possible prefer ways other than string indexing to identify substrings or point within a string.
C055 [S] Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units.
C056 [S] Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types.
C057 [S] When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string.
C062 [S] Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646.
C063 [S] A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time.
C064 [S] All generic references to the Unicode Standard [Unicode] MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification.
C065 [S] All generic references to ISO/IEC 10646 [ISO/IEC 10646] MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification.
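
The string-related criteria above (C009 on storage units, and C051 to C057 on indexing) can be illustrated with a short Python sketch. It relies on the fact that Python `str` values are indexed by Unicode code point; the helper `substring` and the sample text are assumptions made for this example.

```python
# 'a', 'é' (U+00E9) and the musical symbol G clef (U+1D11E, outside the BMP).
text = "a\u00e9\U0001D11E"

# One character does not correspond to one unit of storage (C009):
char_count = len(text)                             # characters
utf8_byte_count = len(text.encode("utf-8"))        # bytes in UTF-8
utf16_unit_count = len(text.encode("utf-16-le")) // 2  # UTF-16 code units


def substring(s: str, start: int, end: int) -> str:
    """Hypothetical helper: indices are boundary positions between characters.

    Index 0 is the position at the start of the string and len(s) is the
    position at its end (C057). A single character is simply a substring
    of length one (C055).
    """
    if not 0 <= start <= end <= len(s):
        raise IndexError("boundary positions must satisfy 0 <= start <= end <= len(s)")
    return s[start:end]


second_character = substring(text, 1, 2)   # a one-character substring
whole_string = substring(text, 0, len(text))
empty_at_end = substring(text, len(text), len(text))
```

The three counts differ for the same string, which is why C072 discourages indexing by bytes and why C052 treats code unit indexing as an efficiency trade-off rather than the default.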
E Acknowledgements (Non-Normative)

Tim Berners-Lee and James Clark provided important details in the section on URIs. Asmus Freytag, Addison Phillips, and in early stages Ian Jacobs, provided significant help in the authoring and editing process. The W3C I18N WG and IG, as well as many others, provided many helpful comments and suggestions.