What is the difference between "UTF-8 with BOM" and "UTF-8 without BOM"? Which one is generally used for web page code?
Answering by invitation. I should have known that if you walk the hills often enough you eventually meet the tiger: showing off Unicode in front of @梁海 was always going to end this way.
First, what is a BOM? I won't explain it here; Wikipedia covers it in detail: http://en.wikipedia.org/wiki/Byte_order_mark.
Using a BOM in web pages is a mistake. The BOM was not designed to support HTML and XML. To identify the text encoding, HTML has the charset attribute and XML has the encoding attribute; there is no need to drag in the BOM. In theory a BOM could be used to identify UTF-16 encoded HTML pages, but in practice hardly anyone does that: UTF-16 spends two bytes even on ASCII, which makes it a poor fit for web pages.
Actually, it is not entirely fair to call the BOM a bad habit. The BOM is part of the Unicode standard too, with its own specific scope of application. Normally the BOM is used to tag a plain-text Unicode byte stream, giving text-processing programs a convenient way to tell which Unicode encoding an incoming .txt file uses (UTF-8, UTF-16BE, or UTF-16LE). Windows handles the BOM comparatively well because it builds Unicode detection into its APIs, mainly CreateFile(): when a text file is opened, the BOM is detected and stripped automatically. There are historical reasons for this. Windows grew out of a multi-code-page world, and when Unicode was introduced its designers wanted to support Unicode and non-Unicode (multi-byte) text files side by side without users noticing, so they had to resort to this little trick. By contrast, systems like Linux spent less time steeped in a multi-locale environment, and the community had enough momentum to travel light (aside: Microsoft's insistence on backward compatibility really does border on the obsessive; nothing that breaks compatibility is ever allowed, to the point where it frequently ties its own hands), so they simply jumped straight to UTF-8. There was a transition period in between, of course; from the release of the fully UTF-8 GTK+ 2.0 to the point where essentially all GTK developers had abandoned the multi-locale GTK+ 1.2, my recollection is that at least three or four years passed.
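A minimal sketch in Python of the kind of signature sniffing described above; the function name and the None fallback are my own illustration, not how Windows implements it:

```python
import codecs

def sniff_unicode_signature(path):
    """Report which Unicode signature, if any, a text file starts with."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "UTF-8 with BOM"
    if head.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "UTF-16BE"
    if head.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "UTF-16LE"
    return None  # no signature; fall back to other heuristics
```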
The BOM is unpopular mainly in UNIX environments, because many UNIX programs refuse to deal with it. The main trouble is the #! marker on the first line that every scripting language relies on. That line depends on the shell to parse it, and many shells, for compatibility reasons, do not check for a BOM; when a BOM is added, the shell interprets it as ordinary character input, which breaks the #! marker. That is where the trouble starts. Many modern scripting languages, Python for example, have interpreters that can handle a BOM perfectly well, but with the shell stuck in the way there is nothing they can do; they just catch a stray bullet. To be fair, you cannot really blame the shell either, because the BOM violates a common principle of UNIX design: data present in a document must be visible. A BOM cannot be seen or edited as a visible character in a text editor, and that alone is enough to put off many UNIX developers.
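You can see the mechanics of the breakage without any shell at all; a few lines of Python (the file contents are made up for the demonstration) show that a BOM'd script simply no longer starts with the two bytes '#!':

```python
script = b"#!/bin/sh\necho hello\n"
with_bom = b"\xef\xbb\xbf" + script   # what an editor that adds a BOM would save

print(script[:2])     # b'#!'        recognized as an interpreter line
print(with_bom[:2])   # b'\xef\xbb'  not '#!', so the shebang is not honored
```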
As an aside, even when a scripting language can handle the BOM, sprinkling BOMs everywhere is still not the recommended approach. Each scripting language has its own way of dealing with Unicode: Python's # -*- coding: utf-8 -*- and Perl's use utf8 are both simpler and more reliable than a BOM. The other piece of good news is that those of us who must switch between Windows and UNIX are not doomed either. Fortunately, under UNIX we still have that wonder weapon VIM; even with a BOM in the way, the three commands set nobomb; set fileencoding=utf8; w will solve the problem.
Looking back on it, it really does seem that only Windows insists on the BOM.
P.S.: This question is my 150th answer; I suddenly realize how very few answers I have written. P.S. 2: It just occurred to me that I should explain why the VIM BOM-removal trick above needs to be done under UNIX. VIM on Windows has a strange bug: it keeps recognizing UTF-16 files as binary, whereas VIM under UNIX (Linux or Mac, either is fine) has no such problem. This issue has followed me from VIM 6.8 all the way to VIM 7.3. I still do not know whether it is a bug in VIM or in my own .vimrc; I would be grateful if an expert could explain.
UTF-8 does not need a BOM, even though the Unicode standard allows a BOM in UTF-8.

So UTF-8 without a BOM is the standard form; putting a BOM in UTF-8 files is mainly a Microsoft habit (as is, by the way, calling little-endian UTF-16 with a BOM "Unicode" without further qualification).

The BOM (byte order mark) was devised for UTF-16 and UTF-32, to mark the byte order. Microsoft uses a BOM in UTF-8 because it cleanly separates UTF-8 from ASCII and other encodings, but files like that cause problems on operating systems other than Windows.

The difference between "UTF-8" and "UTF-8 with BOM" is exactly the BOM, that is, whether the file starts with U+FEFF.
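A two-line Python illustration of that statement; "utf-8-sig" is Python's name for the BOM'd variant:

```python
s = "hello"
print(s.encode("utf-8"))      # b'hello'
print(s.encode("utf-8-sig"))  # b'\xef\xbb\xbfhello', the BOM, then the same bytes
```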
Web page code in UTF-8 should not use a BOM, or errors are likely. Here is a small example: 為什麼這個網頁代碼 <head> 內的信息會被瀏覽器理解為在 <body> 內? (Why does the browser treat the information inside this page's <head> as if it were inside <body>?) And here is a passage from The Unicode Standard, Version 6.0, Section 3.10, D95 UTF-8 encoding scheme: "While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme."
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
On whether to use the BOM in web programming I have nothing to add: when software problems already make it unusable, the question answers itself.
I have recently been learning cocos2d-x, pure C++, and found that things break when the code contains Chinese or other non-ASCII characters. The code was written on a Mac in Xcode, then compiled on Windows with VS.

After I converted all the source files to a BOM'd format, the compile succeeded but the link failed; I figure that part is no longer an encoding problem.

The usual advice is not to put Chinese in C++ code, but we programmers sometimes want code that is comfortable for us to read. Why shouldn't we get to write Chinese?

So on Windows I wrote a helloworld.cpp kind of file that prints Chinese, saved it as UTF-8 with BOM, copied it to a Mac and built it with g++: it compiled cleanly and ran correctly, and the source also displayed fine when opened in Xcode.

So my suggestion: if a program has to run on Windows, Mac and Linux, it is more portable to save the sources as UTF-8 with BOM. UTF-16, whether big- or little-endian, g++ will not accept at all. The alternative is UTF-8 without BOM, with the code kept free of any character beyond ASCII 127.
As for the claim that only UTF-8 without BOM is "the standard", I suspect that is personal sentiment talking. The actual standard makes the BOM optional. Why optional? Because sometimes leaving the BOM out causes errors. Take Windows, with its long history: users in many countries have files in their local ANSI encodings, such as GBK and GB2312 on the mainland and Big5 in Hong Kong and Taiwan. Those encodings were designed around the local character set, so the files they produce are smaller, which is why they were used in huge quantities and still exist in huge quantities. Microsoft cannot blindly change how it decodes files and disregard billions of users worldwide, and Microsoft is itself one of the makers of the Unicode standard, so UTF-8 with a BOM conforms to the international standard too.

Perhaps for the personal reasons of their authors, or perhaps for efficiency, many programs fail to correctly distinguish a UTF-8 file with a BOM from one without, and that is where all the mojibake comes from.

Personally I do not want to declare either form the standard, nor to attack any company or group. Microsoft is not wrong to stick with the BOM, because it does so for its users. It may inconvenience those of us who write programs, but the computer's largest user base is not programmers.

UTF-8, by the nature of its encoding, is byte-order independent, which is why it does not need a BOM.
I think Windows still has to carry most of the blame for "UTF-8 with BOM". I am not entirely sure about the question "may a UTF-8 file carry a BOM at all", but precisely because UTF-8 does not need one, many cross-platform programs simply do not support the format.
If you are curious, my rambling write-up 編碼歪傳——番外篇 covers what the BOM is.
It is a goose with its head on versus a goose with its head off; some rather dim editors mistake the headless goose for a duck…
UTF-8 with BOM is plain, naked hooliganism!

Windows always thinks it is being clever while doing things nobody else can understand! UTF-8 does not need a BOM!

From when I first started learning to write code (I can hardly call what I make "programs") until now, I have lost count of how many times this BOM has burned me. For someone completely self-taught like me, do you know how long hunting down a bug like that takes?

With or without BOM: the difference is exactly that BOM header; see the top-voted answers for the details. It is a Windows-only oddity. Please use UTF-8 without BOM!
The bugs it produces include, but are not limited to:
- 鍩 (thanks to @飛揚; see his answer)
- blank lines in HTML
- inexplicable gaps between divs
- mojibake!
If you use SSL, you are guaranteed to have problems!

While I am at it, a side-eye for Sony's Memory Stick and the iPhone connector~~

Rants like this deserve to stay folded. PHP and "鍩": those who have used them together know…
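For the curious, here is where that "鍩" (锘) comes from, sketched in Python. The first two bytes of the UTF-8 BOM, EF BB, happen to be the GBK encoding of the single character 锘, so a BOM leaking into output that gets interpreted as GBK shows up as 锘 plus one broken byte. This is my reading of the symptom, assuming the output really is being decoded as GBK:

```python
bom = "\ufeff".encode("utf-8")              # b'\xef\xbb\xbf'
print(bom[:2].decode("gbk"))                # 锘
print(bom.decode("gbk", errors="replace"))  # 锘 plus a replacement char: the stray BF has no trailing byte
```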
notepad++ automatically saving as BOM'd UTF-8 is a real trap.
A few weeks ago I was still agonizing over a BOM problem…
As @梁海 said, "UTF-8 without a BOM is the standard form". That is true, and the BOM-less form is the more widely used, so my personal recommendation is to use the BOM-less form in ordinary cases and only consider switching to the BOM'd form when you actually hit a problem. Files saved by Windows all carry a BOM: save a UTF-8 .txt in Notepad and you will find it has a BOM, which is worth keeping in mind. Also, different text editors name the two forms slightly differently. In EditPlus the BOM'd form is called UTF-8+ and the BOM-less form UTF-8, while in Notepad++ the BOM'd form is labeled standard UTF-8 and the BOM-less form "UTF-8 without BOM".
On this question only 吳秀軍's answer is correct. The people here evangelizing BOM-less UTF-8 are all long on hair and short on insight (one of them even recommends a Mac; allow me a "heh"), and I will not lower myself to debating them.
Let me just ask: how do you handle this situation?

At some point you save a document containing Chinese as BOM-less UTF-8.

The next time, you or someone else opens the file in an editor; the editor auto-detects that it is a UTF-8 file (or you explicitly opened it as UTF-8, though I assume nobody is that diligent), edits it, and saves it, and the file is still UTF-8.

Up to this point everything is lovely and runs exactly as the ideal scenario says it should.

Does it? Well, maybe one day one of your edits happens to remove every Chinese character in the file, and you save it as UTF-8 as usual, with nothing looking out of place.

Now the problem: the next time this file is opened, how is the editor supposed to identify its encoding? ASCII? GBK? Or UTF-8?
UTF-8's compatibility with ASCII really is one of its strengths, but at moments like this that very strength turns into a weakness that hides problems. Hence: the BOM way is the great way; add a BOM and stay safe.
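The ambiguity the author describes is easy to make concrete in Python: a pure-ASCII file is byte-for-byte identical in ASCII, GBK and UTF-8, so once the Chinese text is gone there is nothing left to detect:

```python
text = "hello, world"   # no non-ASCII characters left after the edit
print(text.encode("ascii") == text.encode("gbk") == text.encode("utf-8"))  # True
```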
Addendum:
If you live exclusively in the world of professional programmers, and especially if you are on Linux, then yes, UTF-8 is close to a final solution. But most of the world still runs Windows. The person handing you a file might be a teammate, someone from another department, or a customer; you cannot guarantee that everyone will follow your standard in every case, and the vast majority of people know nothing about encodings. If something breaks, then as far as they are concerned it is your problem, not theirs: the file was fine when they gave it to you. You may want to argue, but a mature professional learns to listen to and understand the customer's needs rather than lecture the customer on your technicalities.
That is exactly why Windows Notepad insists on adding a BOM to UTF-8 (to cope with the encoding problems of older systems), and why the UNIX camp abandoned BOM'd UTF-8 (so that its ancient programs can keep running). Each side has its own interests at stake, and the difference itself is neither right nor wrong. But which is harder and more costly to fix: teaching a crowd of programs to chomp the UTF-8 BOM, or leaving hundreds of millions of users with no technical background to face mojibake? I think the answer is obvious.

A text's encoding is metadata about the text; HTML, XML and the like already declare the encoding in their headers, so they do not need a BOM. But for a plain text file with no metadata, what gives *nix the right to decree that it must be UTF-8? Why can it not be GB2312, or JIS? So in my view, Windows's approach reflects thoroughness and a sense of responsibility toward its users.
Also: the claim that "Unicode recommends against using a BOM with UTF-8" is pure invention.
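For what it is worth, the "teach programs to chomp the BOM" side of that trade-off is nearly a one-liner in Python: the built-in "utf-8-sig" codec strips a leading BOM when present and behaves like plain UTF-8 when there is none, so one open() call accepts files from both camps ("report.txt" is a made-up name):

```python
with open("report.txt", encoding="utf-8-sig") as f:  # tolerant of BOM or no BOM
    text = f.read()
assert not text.startswith("\ufeff")                 # any BOM has been consumed
```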
Been bitten by this; since then I save files with extra care. And of course gbk -> utf8 is another pit.
What the hell. Because of this BOM, the first field always came out wrong when importing a CSV into mongodb, which meant a find() using the first field as the condition returned nothing!
It cost me a whole damn evening.
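The usual cause is that the BOM gets glued onto the first header name. A minimal Python sketch of the fix; the file name and the 'name' column are invented for the example:

```python
import csv

with open("data.csv", encoding="utf-8-sig", newline="") as f:  # strips a leading BOM
    rows = list(csv.DictReader(f))
# With plain "utf-8" the first header would be '\ufeffname' rather than 'name',
# and a later find({"name": ...}) on the imported collection would match nothing.
```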
Never mind the rambling walls of text above that never quite say what they mean; read the Unicode Consortium's official explanation:
16.8 Specials

The Specials block contains code points that are interpreted as neither control nor graphic characters but that are provided to facilitate current software practices.

For information about the noncharacter code points U+FFFE and U+FFFF, see Section 16.7, Noncharacters.

Byte Order Mark (BOM): U+FEFF

For historical reasons, the character U+FEFF used for the byte order mark is named zero width no-break space. Except for compatibility with versions of Unicode prior to Version 3.2, U+FEFF is not used with the semantics of zero width no-break space (see Section 16.2, Layout Controls). Instead, its most common and most important usage is in the following two circumstances:

1. Unmarked Byte Order. Some machine architectures use the so-called big-endian byte order, while others use the little-endian byte order. When Unicode text is serialized into bytes, the bytes can go in either order, depending on the architecture. Sometimes this byte order is not externally marked, which causes problems in interchange between different systems.

2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.

In these two cases, the character U+FEFF is used as a signature to indicate the byte order and the character set by using the byte serializations described in Section 3.10, Unicode Encoding Schemes. Because the byte-swapped version U+FFFE is a noncharacter, when an interpreting process finds U+FFFE as the first character, it signals either that the process has encountered text that is of the incorrect byte order or that the file is not valid Unicode text.

In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.
The byte sequences <FE FF> or <FF FE> may serve as a signature identifying a file as containing UTF-16 text. Either sequence is exceedingly rare at the outset of text files using other character encodings, whether single- or multiple-byte, and therefore not likely to be confused with real text data. For example, in systems that employ ISO Latin-1 (ISO/IEC 8859-1) or the Microsoft Windows ANSI Code Page 1252, the byte sequence <FE FF> represents the sequence <thorn, y diaeresis> "þÿ"; in systems that employ the Apple Macintosh Roman character set or the Adobe Standard Encoding, this sequence represents the sequence <ogonek, caron> "˛ˇ"; and in systems that employ other common IBM PC code pages (for example, CP 437, 850), this sequence represents <black square, space> "■ ".

In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>. Because there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings. For example, in systems that employ Microsoft Windows ANSI Code Page 1252, <EF BB BF> corresponds to the sequence <i diaeresis, right-pointing double angle quotation mark, inverted question mark> "ï»¿".

For compatibility with versions of the Unicode Standard prior to Version 3.2, the code point U+FEFF has the word-joining semantics of zero width no-break space when it is not used as a BOM. In new text, these semantics should be encoded by U+2060 word joiner. See "Line and Word Breaking" in Section 16.2, Layout Controls, for more information.
Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters, even at the very beginning of the text, are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces. For example, for strings in an API, the memory architecture of the processor provides the explicit byte order. For databases and similar structures, it is much more efficient and robust to use a uniform byte order for the same field (if not the entire database), thereby avoiding use of the byte order mark.

Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. To represent an initial U+FEFF zero width no-break space in a UTF-16 file, use U+FEFF twice in a row. The first one is a byte order mark; the second one is the initial zero width no-break space. See Table 16-4 for a summary of encoding scheme signatures.
Table 16-4. Unicode Encoding Scheme Signatures
Encoding Scheme        Signature
UTF-8                  EF BB BF
UTF-16 Big-endian      FE FF
UTF-16 Little-endian   FF FE
UTF-32 Big-endian      00 00 FE FF
UTF-32 Little-endian   FF FE 00 00
If U+FEFF had only the semantics of a signature code point, it could be freely deleted from text without affecting the interpretation of the rest of the text. Carelessly appending files together, for example, can result in a signature code point in the middle of text. Unfortunately, U+FEFF also has significance as a character. As a zero width no-break space, it indicates that line breaks are not allowed between the adjoining characters. Thus U+FEFF affects the interpretation of text and cannot be freely deleted. The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 word joiner has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics are intended.

An initial U+FEFF also takes a characteristic form in other charsets designed for Unicode text. (The term "charset" refers to a wide range of text encodings, including encoding schemes as well as compression schemes and text-specific transformation formats.) The characteristic sequences of bytes associated with an initial U+FEFF can serve as signatures in those cases, as shown in Table 16-5.
Table 16-5. U+FEFF Signature in Other Charsets
Charset      Signature
SCSU         0E FE FF
BOCU-1       FB EE 28
UTF-7        2B 2F 76 38, 2B 2F 76 39, 2B 2F 76 2B, or 2B 2F 76 2F
UTF-EBCDIC   DD 73 66 73
Most signatures can be deleted either before or after conversion of an input stream into a Unicode encoding form. However, in the case of BOCU-1 and UTF-7, the input byte sequence must be converted before the initial U+FEFF can be deleted, because stripping the signature byte sequence without conversion destroys context necessary for the correct interpretation of subsequent bytes in the input sequence.

Specials: U+FFF0–U+FFF8

The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for special character definitions.
Annotation Characters: U+FFF9–U+FFFB

An interlinear annotation consists of annotating text that is related to a sequence of annotated characters. For all regular editing and text-processing algorithms, the annotated characters are treated as part of the text stream. The annotating text is also part of the content, but for all or some text processing, it does not form part of the main text stream. However, within the annotating text, characters are accessible to the same kind of layout, text-processing, and editing algorithms as the base text. The annotation characters delimit the annotating and the annotated text, and identify them as part of an annotation. See Figure 16-4.

The annotation characters are used in internal processing when out-of-band information is associated with a character stream, very similarly to the usage of U+FFFC object replacement character. However, unlike the opaque objects hidden by the latter character, the annotation itself is textual.

[Figure 16-4, Annotation Characters: contrasts the text display with the text stream, showing how annotation characters delimit the annotated text and the annotating text.]

Conformance. A conformant implementation that supports annotation characters interprets the base text as if it were part of an unannotated text stream. Within the annotating text, it interprets the annotating characters with their regular Unicode semantics.

U+FFF9 interlinear annotation anchor is an anchor character, preceding the interlinear annotation. The exact nature and formatting of the annotation depend on additional information that is not part of the plain text stream. This situation is analogous to that for U+FFFC object replacement character.

U+FFFA interlinear annotation separator separates the base characters in the text stream from the annotation characters that follow. The exact interpretation of this character depends on the nature of the annotation. More than one separator may be present. Additional separators delimit parts of a multipart annotating text.

U+FFFB interlinear annotation terminator terminates the annotation object (and returns to the regular text stream).

Use in Plain Text. Usage of the annotation characters in plain text interchange is strongly discouraged without prior agreement between the sender and the receiver, because the content may be misinterpreted otherwise. Simply filtering out the annotation characters on input will produce an unreadable result or, even worse, an opposite meaning. On input, a plain text receiver should either preserve all characters or remove the interlinear annotation characters as well as the annotating text included between the interlinear annotation separator and the interlinear annotation terminator.

When an output for plain text usage is desired but the receiver is unknown to the sender, these interlinear annotation characters should be removed as well as the annotating text included between the interlinear annotation separator and the interlinear annotation terminator.

This restriction does not preclude the use of annotation characters in plain text interchange, but it requires a prior agreement between the sender and the receiver for correct interpretation of the annotations.

Lexical Restrictions. If an implementation encounters a paragraph break between an anchor and its corresponding terminator, it shall terminate any open annotations at this point. Anchor characters must precede their corresponding terminator characters. Unpaired anchors or terminators shall be ignored. A separator occurring outside a pair of delimiters shall be ignored. Annotations may be nested.

Formatting. All formatting information for an annotation is provided by higher-level protocols. The details of the layout of the annotation are implementation-defined. Correct formatting may require additional information that is not present in the character stream, but rather is maintained out-of-band. Therefore, annotation markers serve as placeholders for an implementation that has access to that information from another source. The formatting of annotations and other special line layout features of Japanese is discussed in JIS X 4501.

Input. Annotation characters are not normally input or edited directly by end users. Their insertion and management in text are typically handled by an application, which will present a user interface for selecting and annotating text.

Collation. With the exception of the special case where the annotation is intended to be used as a sort key, annotations are typically ignored for collation or optionally preprocessed to act as tie breakers only. Importantly, annotation base characters are not ignored, but rather are treated like regular text.

Replacement Characters: U+FFFC–U+FFFD

U+FFFC. The U+FFFC object replacement character is used as an insertion point for objects located within a stream of text. All other information about the object is kept outside the character data stream. Internally it is a dummy character that acts as an anchor point for the object's formatting information. In addition to assuring correct placement of an object in a data stream, the object replacement character allows the use of general stream-based algorithms for any textual aspects of embedded objects.

U+FFFD. The U+FFFD replacement character is the general substitute character in the Unicode Standard. It can be substituted for any "unknown" character in another encoding that cannot be mapped in terms of known Unicode characters (see Section 5.3, Unknown and Missing Characters).
I see a lot of "programmers" here who cannot even use Notepad…
I do not know what Microsoft was up to; handling Chinese was troublesome enough already.

notepad++'s encoding menu has two options, "Encode in UTF-8" and "Encode in UTF-8 without BOM". At a glance, of course you pick "Encode in UTF-8". And then, copying things from notepad++ into Mongodb, a mysterious pile of extra bytes appears. Without notepad++, using the default Notepad, the trap is even worse: the BOM is always there and you cannot opt out.
A BOM problem once cost me hours of debugging. Once bitten, twice shy: be very careful when programming on Windows.
I was uploading problems to pintia; the manual said UTF-8, so out of habit I saved with a BOM, and then during internal testing every submission failed, no matter what.

In the end I spent a while re-encoding everything and uploading it again~
I would like to know how to convert BOM-less utf8 into BOM'd utf8 under UNIX; CSV mojibake is killing me. Help, gurus!
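One hedged way to do it on any system with Python available (the file names are invented; writing through the "utf-8-sig" codec is what prepends the EF BB BF signature):

```python
with open("in.csv", encoding="utf-8-sig") as src:        # reads with or without BOM
    data = src.read()
with open("out.csv", "w", encoding="utf-8-sig") as dst:  # writes a leading BOM
    dst.write(data)
```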
Everyone pursuing their own interests has dragged UTF-8 through the mud.