What is the difference between "UTF-8 with BOM" and "UTF-8 without BOM"? Which one is generally used for web page code?

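Actually, calling the BOM a bad habit is not entirely fair. The BOM is part of the Unicode standard and has its own specific scope of use. It is normally used to mark a Unicode plain-text byte stream, giving text-processing programs a convenient way to tell which Unicode encoding (UTF-8, UTF-16BE, UTF-16LE) a .txt file is in. Windows handles the BOM comparatively well because Unicode detection is built into its text-handling layers. (Strictly speaking this is not CreateFile() itself, which only hands back raw bytes, but the layers built on top of it, such as the C runtime's text-mode fopen() with a ccs= flag.) When a text file is opened this way, the BOM is recognized and stripped automatically. Windows does this for historical reasons: it grew out of a multi-code-page environment, and when Unicode was introduced its designers wanted to support both Unicode and non-Unicode (multi-byte) text files without users noticing, so they had to rely on a small trick like this. By contrast, systems such as Linux spent much less time steeped in a multi-locale environment, and the community had plenty of motivation to travel light (a gripe: Microsoft's demand for compatibility really is obsessive, forbidding anything that breaks compatibility, to the point that it often ties its own hands), so they simply moved to UTF-8 in one step. There was of course a transition period: from the release of GTK+ 2.0, the first all-UTF-8 version, to the point where essentially all GTK developers had abandoned the multi-locale GTK+ 1.2, I recall it taking at least three to four years.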


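By the way, even when a scripting language can cope with a BOM, using BOMs everywhere is not recommended. Each scripting language has its own way of handling Unicode: Python's # -*- coding: utf-8 -*- and Perl's use utf8 are both simpler and more reliable than a BOM. The other good news is that even people who must switch between Windows and UNIX need not suffer. Fortunately, on UNIX we have a wonder tool in VIM: even when a BOM gets in the way, three commands solve the problem: :set nobomb, then :set fileencoding=utf-8, then :w.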
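For readers without VIM at hand, the same cleanup can be sketched in Python. This is only a minimal illustration: the function name strip_utf8_bom and the temporary demo file are made up for this example and are not part of any tool mentioned above.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM (EF BB BF) from the file at `path`.

    Hypothetical helper for illustration. Returns True if a BOM was
    removed, False if the file did not start with one.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True


# Usage: create a BOM-prefixed file in a temporary directory, then strip it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

removed = strip_utf8_bom(demo_path)
with open(demo_path, "rb") as f:
    remaining = f.read()
print(removed, remaining)
```

The file is read and written in binary mode on purpose, so that nothing other than the three signature bytes is touched.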


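P.S. 2: It just occurred to me that I should explain why I said the VIM operation to remove the BOM has to be done on UNIX. VIM on Windows has a strange bug for me: it always identifies UTF-16 files as binary, whereas VIM on UNIX (Linux or Mac both work) has no such problem. This has followed me from VIM 6.8 all the way to VIM 7.3. I still do not know whether it is a bug in VIM or in my own .vimrc. If an expert can explain, I would be most grateful.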

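UTF-8 does not need a BOM, although the Unicode standard permits one in UTF-8.
So UTF-8 without a BOM is the standard form; putting a BOM in UTF-8 files is mainly a Microsoft habit (incidentally, calling little-endian UTF-16 with a BOM simply "Unicode" without further explanation is also a Microsoft habit).
The BOM (byte order mark) was intended for UTF-16 and UTF-32, to mark the byte order. Microsoft uses a BOM in UTF-8 because it cleanly distinguishes UTF-8 from encodings such as ASCII, but such files cause problems on operating systems other than Windows.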

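The difference between "UTF-8" and "UTF-8 with BOM" is simply whether a BOM is present, that is, whether the file begins with U+FEFF (encoded in UTF-8 as the three bytes EF BB BF).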
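The difference can be seen directly in Python, whose standard codecs include both "utf-8" and "utf-8-sig" (the latter writes and strips the signature). This is just a small sketch; the sample string "abc" is arbitrary.

```python
import codecs

text = "abc"

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # same bytes, with the BOM prepended

# U+FEFF on its own encodes to the three signature bytes.
bom_bytes = "\ufeff".encode("utf-8")

# Decoding with plain "utf-8" keeps U+FEFF as the first character;
# decoding with "utf-8-sig" drops it.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

print(plain, with_bom, bom_bytes)
print(repr(kept), repr(stripped))
```

In other words, the two forms differ by exactly three leading bytes, and whether those bytes end up in the decoded text depends on which decoder the reader uses.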

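Web page code in UTF-8 should not use a BOM; otherwise things often go wrong. Here is a small example: "Why does the browser treat the contents of <head> in this page as being inside <body>?"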

Also attached is a passage from The Unicode Standard, Version 6.0, Section 3.10, D95 (UTF-8 encoding scheme):

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.



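I have recently been learning cocos2d-x, which is pure C++, and found that things go wrong if the code contains non-ASCII characters such as Chinese. The code was written in Xcode on a Mac and then compiled with VS on Windows.

The conventional wisdom is not to use Chinese when writing C++ code, but we programmers often want something comfortable to read too, so why on earth shouldn't we write Chinese?

So I wrote a helloworld.cpp-style file on Windows whose output text is Chinese, saved it as UTF-8 with BOM, copied it to the Mac, and compiled it with g++. It compiled and ran correctly, and Xcode displayed the source file correctly as well.

So my suggestion is: if a program has to build on Windows, Mac, and Linux, save the source as UTF-8 with BOM, which is the more portable choice. g++ does not accept UTF-16, whether big-endian or little-endian. Alternatively, use UTF-8 without BOM and keep anything beyond ASCII (code points above 127) out of the code.

As for the claim that UTF-8 without BOM is the standard, I suspect that is a statement colored by personal feeling. The real standard is that the BOM is optional. Why optional? Because sometimes leaving it out causes errors. Take Windows, with its long history: users in many countries run Windows, and their files were created in the local ANSI encoding, such as GBK and GB2312 on the mainland or Big5 in Hong Kong and Taiwan. Because these encodings were designed around the characters used locally, files stored in them are smaller, so they were used heavily and still exist in large numbers. Microsoft cannot ignore the files of its billions of users worldwide and blindly change how text is decoded. Microsoft is also one of the parties that define Unicode. So UTF-8 with BOM also conforms to the international standard.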



If you are interested in what a BOM is, have a look at this rambling write-up of mine: 編碼歪傳 (番外篇).


UTF-8 with BOM is outright hooliganism!!!!!!!!!



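The only difference between having and not having a BOM header is the BOM header itself; for details, see the top-ranked answers from the experts. It is a Windows-specific oddity. Please use UTF-8 without a BOM header!!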


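  • A stray "锘" at the start of the text (the BOM bytes misread as GBK) -- thanks to @飛揚 for pointing this out; see his answer
  • Blank lines in HTML
  • Unexplained gaps between divs
  • Garbled text!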
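The stray characters in the list above are easy to reproduce. The following Python sketch decodes a BOM-prefixed byte string with two legacy encodings; the sample content "<html>" is arbitrary and only there to give the BOM something to precede.

```python
bom = b"\xef\xbb\xbf"
page = bom + b"<html>"

# Windows code page 1252 maps each of the three bytes to a visible character.
as_cp1252 = page.decode("cp1252")

# In GBK the first two bytes (EF BB) form the character "锘"; the third byte
# cannot pair with "<", so it is replaced rather than decoded.
as_gbk = page.decode("gbk", errors="replace")

print(as_cp1252)
print(as_gbk)
```

A program that reads the file with the wrong encoding therefore shows visible junk before the first real character, which is what lies behind symptoms such as mysterious blank lines and gaps at the top of a page.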


While I am at it, a bit of scorn for Sony's Memory Stick and the iPhone's connector too~~




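As @梁海 said, "UTF-8 without a BOM is the standard form," and that is indeed so. The BOM-less form is the more widely used, so I personally recommend BOM-less UTF-8 in general, switching to the BOM form only when a problem calls for it. Windows' own tools save UTF-8 with a BOM: save a UTF-8 .txt from Notepad and it will in fact contain a BOM, which is worth keeping in mind. Different text editors also name the two variants slightly differently. In EditPlus, the BOM form is called UTF-8+ and the BOM-less form UTF-8; in Notepad++, the BOM form is called standard UTF-8, while the BOM-less form is called UTF-8 without BOM.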











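This is why Windows Notepad insists on adding a BOM to UTF-8: to stay compatible with the encodings of older systems. The UNIX camp abandoned UTF-8 with BOM so that its ancient programs could keep running. Each decision reflects that side's own interests, and neither is really right or wrong. But which is more troublesome and more costly to fix: having a batch of programs add the ability to chomp the UTF-8 BOM, or leaving hundreds of millions of users without any technical background to face garbled text? I think the answer is obvious.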

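A text's encoding is part of its metadata. HTML, XML, and the like already declare their encoding in the header, so they do not need a BOM.
But for a text file with no metadata, on what grounds does *nix decree that it is UTF-8? Why could it not be GB2312, or JIS? So I think the Windows approach reflects careful consideration and a sense of responsibility toward users.
Also: the claim that Unicode does not recommend using a BOM with UTF-8 is made up out of thin air.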
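The ambiguity of unlabeled text described here can be made concrete. In the Python sketch below, the same four bytes decode without error under two different legacy encodings, while a strict UTF-8 decoder rejects them. The helper name is_valid_utf8 is invented for this illustration.

```python
raw = "中文".encode("gbk")  # four bytes: D6 D0 CE C4

as_gbk = raw.decode("gbk")          # the intended Chinese text
as_sjis = raw.decode("shift_jis")   # also "valid": four halfwidth katakana


def is_valid_utf8(data):
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(as_gbk, as_sjis, is_valid_utf8(raw))
```

Legacy multi-byte encodings generally cannot be told apart from the bytes alone, which is the case for wanting some kind of marker. Non-ASCII text in those encodings rarely happens to be well-formed UTF-8, which is why tools that assume UTF-8 often get away without one.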



Ignore the pile of rambling above that misses the point, and read the official explanation from The Unicode Consortium:

16.8 Specials

The Specials block contains code points that are interpreted as neither control nor graphic characters but that are provided to facilitate current software practices.

For information about the noncharacter code points U+FFFE and U+FFFF, see Section 16.7, Noncharacters.

Byte Order Mark (BOM): U+FEFF

For historical reasons, the character U+FEFF used for the byte order mark is named zero width no-break space. Except for compatibility with versions of Unicode prior to Version 3.2, U+FEFF is not used with the semantics of zero width no-break space (see Section 16.2, Layout Controls). Instead, its most common and most important usage is in the following two circumstances:

1. Unmarked Byte Order. Some machine architectures use the so-called big-endian byte order, while others use the little-endian byte order. When Unicode text is serialized into bytes, the bytes can go in either order, depending on the architecture. Sometimes this byte order is not externally marked, which causes problems in interchange between different systems.

2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.

In these two cases, the character U+FEFF is used as a signature to indicate the byte order and the character set by using the byte serializations described in Section 3.10, Unicode Encoding Schemes. Because the byte-swapped version U+FFFE is a noncharacter, when an interpreting process finds U+FFFE as the first character, it signals either that the process has encountered text that is of the incorrect byte order or that the file is not valid Unicode text.

In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.

The byte sequences <FE FF> or <FF FE> may also serve as a signature to identify a file as containing UTF-16 text. Either sequence is exceedingly rare at the outset of text files using other character encodings, whether single- or multiple-byte, and therefore not likely to be confused with real text data. For example, in systems that employ ISO Latin-1 (ISO/IEC 8859-1) or the Microsoft Windows ANSI Code Page 1252, the byte sequence <FE FF> constitutes the string <thorn, y diaeresis> "þÿ"; in systems that employ the Apple Macintosh Roman character set or the Adobe Standard Encoding, this sequence represents the sequence <ogonek, hacek> "˛ˇ"; in systems that employ other common IBM PC code pages (for example, CP 437, 850), this sequence represents <black square, no-break space> "■ ".

Copyright © 1991-2007, Unicode, Inc. The Unicode Standard 5.0 – Electronic edition

In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>. Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings. For example, in systems that employ Microsoft Windows ANSI Code Page 1252, <EF BB BF> corresponds to the sequence <i diaeresis, guillemet, inverted question mark> "ï » ¿".

For compatibility with versions of the Unicode Standard prior to Version 3.2, the code point U+FEFF has the word-joining semantics of zero width no-break space when it is not used as a BOM. In new text, these semantics should be encoded by U+2060 word joiner. See "Line and Word Breaking" in Section 16.2, Layout Controls, for more information.

Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters, even at the very beginning of the text, are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces. For example, for strings in an API, the memory architecture of the processor provides the explicit byte order. For databases and similar structures, it is much more efficient and robust to use a uniform byte order for the same field (if not the entire database), thereby avoiding use of the byte order mark.

Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. To represent an initial U+FEFF zero width no-break space in a UTF-16 file, use U+FEFF twice in a row. The first one is a byte order mark; the second one is the initial zero width no-break space. See Table 16-4 for a summary of encoding scheme signatures.

Table 16-4. Unicode Encoding Scheme Signatures

Encoding Scheme          Signature
UTF-8                    EF BB BF
UTF-16 Big-endian        FE FF
UTF-16 Little-endian     FF FE
UTF-32 Big-endian        00 00 FE FF
UTF-32 Little-endian     FF FE 00 00

If U+FEFF had only the semantics of a signature code point, it could be freely deleted from text without affecting the interpretation of the rest of the text. Carelessly appending files together, for example, can result in a signature code point in the middle of text. Unfortunately, U+FEFF also has significance as a character. As a zero width no-break space, it indicates that line breaks are not allowed between the adjoining characters. Thus U+FEFF affects the interpretation of text and cannot be freely deleted. The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 word joiner has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics are intended.

An initial U+FEFF also takes a characteristic form in other charsets designed for Unicode text. (The term "charset" refers to a wide range of text encodings, including encoding schemes as well as compression schemes and text-specific transformation formats.) The characteristic sequences of bytes associated with an initial U+FEFF can serve as signatures in those cases, as shown in Table 16-5.

Table 16-5. U+FEFF Signature in Other Charsets

Charset       Signature
SCSU          0E FE FF
BOCU-1        FB EE 28
UTF-7         2B 2F 76 38 or 2B 2F 76 39 or 2B 2F 76 2B or 2B 2F 76 2F
UTF-EBCDIC    DD 73 66 73

Most signatures can be deleted either before or after conversion of an input stream into a Unicode encoding form. However, in the case of BOCU-1 and UTF-7, the input byte sequence must be converted before the initial U+FEFF can be deleted, because stripping the signature byte sequence without conversion destroys context necessary for the correct interpretation of subsequent bytes in the input sequence.

Specials: U+FFF0–U+FFF8

The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for special character definitions.

Annotation Characters: U+FFF9–U+FFFB

An interlinear annotation consists of annotating text that is related to a sequence of annotated characters. For all regular editing and text-processing algorithms, the annotated characters are treated as part of the text stream. The annotating text is also part of the content, but for all or some text processing, it does not form part of the main text stream. However, within the annotating text, characters are accessible to the same kind of layout, text-processing, and editing algorithms as the base text. The annotation characters delimit the annotating and the annotated text, and identify them as part of an annotation. See Figure 16-4.

The annotation characters are used in internal processing when out-of-band information is associated with a character stream, very similarly to the usage of U+FFFC object replacement character. However, unlike the opaque objects hidden by the latter character, the annotation itself is textual.

[Figure 16-4. Annotation Characters: text display and the corresponding text stream]

Conformance. A conformant implementation that supports annotation characters interprets the base text as if it were part of an unannotated text stream. Within the annotating text, it interprets the annotating characters with their regular Unicode semantics.

U+FFF9 interlinear annotation anchor is an anchor character, preceding the interlinear annotation. The exact nature and formatting of the annotation depend on additional information that is not part of the plain text stream. This situation is analogous to that for U+FFFC object replacement character.

U+FFFA interlinear annotation separator separates the base characters in the text stream from the annotation characters that follow. The exact interpretation of this character depends on the nature of the annotation. More than one separator may be present. Additional separators delimit parts of a multipart annotating text.

U+FFFB interlinear annotation terminator terminates the annotation object (and returns to the regular text stream).

Use in Plain Text. Usage of the annotation characters in plain text interchange is strongly discouraged without prior agreement between the sender and the receiver, because the content may be misinterpreted otherwise. Simply filtering out the annotation characters on input will produce an unreadable result or, even worse, an opposite meaning. On input, a plain text receiver should either preserve all characters or remove the interlinear annotation characters as well as the annotating text included between the interlinear annotation separator and the interlinear annotation terminator.

When an output for plain text usage is desired but the receiver is unknown to the sender, these interlinear annotation characters should be removed as well as the annotating text included between the interlinear annotation separator and the interlinear annotation terminator.

This restriction does not preclude the use of annotation characters in plain text interchange, but it requires a prior agreement between the sender and the receiver for correct interpretation of the annotations.


Lexical Restrictions. If an implementation encounters a paragraph break between an anchor and its corresponding terminator, it shall terminate any open annotations at this point. Anchor characters must precede their corresponding terminator characters. Unpaired anchors or terminators shall be ignored. A separator occurring outside a pair of delimiters, shall be ignored. Annotations may be nested.

Formatting. All formatting information for an annotation is provided by higher-level protocols. The details of the layout of the annotation are implementation-defined. Correct formatting may require additional information that is not present in the character stream, but rather is maintained out-of-band. Therefore, annotation markers serve as placeholders for an implementation that has access to that information from another source. The formatting of annotations and other special line layout features of Japanese is discussed in JIS X


Input. Annotation characters are not normally input or edited directly by end users. Their insertion and management in text are typically handled by an application, which will present a user interface for selecting and annotating text.

Collation. With the exception of the special case where the annotation is intended to be used as a sort key, annotations are typically ignored for collation or optionally preprocessed to act as tie breakers only. Importantly, annotation base characters are not ignored, but rather are treated like regular text.

Replacement Characters: U+FFFC–U+FFFD

U+FFFC. The U+FFFC object replacement character is used as an insertion point for objects located within a stream of text. All other information about the object is kept outside the character data stream. Internally it is a dummy character that acts as an anchor point for the object's formatting information. In addition to assuring correct placement of an object in a data stream, the object replacement character allows the use of general stream-based algorithms for any textual aspects of embedded objects.

U+FFFD. The U+FFFD replacement character is the general substitute character in the Unicode Standard. It can be substituted for any "unknown" character in another encoding that cannot be mapped in terms of known Unicode characters (see Section 5.3, Unknown and Missing Characters).
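The encoding scheme signatures of Table 16-4 in the excerpt above can be turned into a small detection routine. The Python sketch below is only one possible reading of that table: the function names sniff_signature and decode_with_signature, and the UTF-8 fallback for unsigned data, are choices made for this example and are not prescribed by the standard.

```python
import codecs

# Longer signatures are checked first: the UTF-32 little-endian signature
# FF FE 00 00 begins with the UTF-16 little-endian signature FF FE.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_signature(data):
    """Return (encoding, signature_length), or (None, 0) if unsigned."""
    for signature, encoding in SIGNATURES:
        if data.startswith(signature):
            return encoding, len(signature)
    return None, 0


def decode_with_signature(data, fallback="utf-8"):
    """Strip a recognized signature and decode; `fallback` is an assumption."""
    encoding, skip = sniff_signature(data)
    return data[skip:].decode(encoding or fallback)


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(sniff_signature(sample), decode_with_signature(sample))
```

One caveat of this ordering: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be mistaken for UTF-32. That is rare in text files, but it shows that a signature check is a heuristic rather than a proof.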


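In Notepad++ the Encoding menu has two options: "Encode in UTF-8" and "Encode in UTF-8 without BOM". At first glance you would of course pick "Encode in UTF-8". And so, when copying things from Notepad++ --> MongoDB, a number of extra bytes mysteriously appeared. If you do not install Notepad++ and use the default Notepad instead, the trap is even deeper: it adds a BOM by default and gives you no choice.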
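The "mysterious extra bytes" can be reproduced without any database. The Python sketch below assumes, purely for illustration, that the text being copied is a small JSON document; MongoDB itself is not involved, and the sample payload is made up.

```python
import json

payload = b'{"name": "test"}'
saved_with_bom = b"\xef\xbb\xbf" + payload  # what a BOM-adding editor writes

extra_bytes = len(saved_with_bom) - len(payload)

# Decoded as plain UTF-8, the BOM survives as an invisible leading U+FEFF,
# and Python's json module refuses such text.
as_text = saved_with_bom.decode("utf-8")
try:
    json.loads(as_text)
    parsed_with_bom = True
except ValueError:
    parsed_with_bom = False

# Decoding with "utf-8-sig" drops the signature, and parsing succeeds.
clean = json.loads(saved_with_bom.decode("utf-8-sig"))

print(extra_bytes, parsed_with_bom, clean)
```

Because U+FEFF has no width, the text looks identical in an editor either way; only the byte count or a strict parser reveals the difference.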






