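Why did C design strings to be zero-terminated, instead of stating the length at the start of the string as Pascal does?

It seems to me that reserving element 0 of a string to record its length would be easier to work with than ending the string with a zero.

Was there a reason or a particular purpose behind this design in C?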


Reference: http://delivery.acm.org/10.1145/2020000/2010365/p40-kamp.pdf?ip=114.240.64.53acc=OPENCFID=97964950CFTOKEN=11676210__acm__=1334754475_bdb51fde42cd85383eea1bf50c7ff3e7

This may well be a design mistake in C. But culture grows out of accident, as Asimov put it, and by now it is an established fact.

=============

Update: a comment on Slashdot:

Interesting, but I think this article largely misses the point.

Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.

Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)

Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.

What the article missed was the real security problems caused by the NUL terminator. The obvious fact that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".
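To make the strcmp problem from the comment concrete, here is a minimal sketch in C. The helper names names_equal_cstr and names_equal_counted are made up for illustration and are not taken from any real crypto library; the point is only that a NUL-terminated comparison stops at the first NUL byte, while a comparison that carries explicit lengths does not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* NUL-terminated comparison: strcmp stops at the first NUL byte,
   so anything after an embedded NUL is silently ignored. */
static int names_equal_cstr(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Hypothetical length-aware comparison: both lengths are passed in,
   so bytes after an embedded NUL still take part in the comparison. */
static int names_equal_counted(const char *a, size_t a_len,
                               const char *b, size_t b_len)
{
    assert(a != NULL && b != NULL);
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

Given the buffers "google.com" and "google.com\0.malicioushacker.com", names_equal_cstr treats them as equal, while names_equal_counted, called with sizeof minus one for each array, does not.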


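What is stable and reliable is what is simple and intuitive. This is the simplest possible model of a string: the string itself does not maintain an extra length attribute.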

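If, as in other languages, a length were stored in front of the string, it would have to be maintained, and anything that needs maintaining gets tedious. Drop it, and querying the length when you need it is cheap. Many people find this hard to see clearly, but for something like a string, which is constantly being sliced and shuffled around, the simplest exposed model is the most efficient, and a C string has no encapsulation at all. If I truncate a string, or append a few characters to the end, I would also have to update a length attribute in step; just thinking about it feels tedious. If you measure strlen, you will find that computing the length on demand is the better deal.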
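As a rough illustration of the point about truncation, here is a small sketch in C. The helper names cstr_truncate and cstr_suffix are invented for this example. Truncating only writes one terminator byte, and taking a suffix only moves the pointer; no length field has to be kept in sync.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Truncate a NUL-terminated string in place to at most n characters.
   There is no separate length field to update. */
static void cstr_truncate(char *s, size_t n)
{
    assert(s != NULL);
    if (strlen(s) > n)
        s[n] = '\0';
}

/* Take a suffix by moving the pointer forward; nothing is copied. */
static const char *cstr_suffix(const char *s, size_t skip)
{
    assert(s != NULL && skip <= strlen(s));
    return s + skip;
}
```

For example, truncating a buffer holding "hello world" to 5 characters leaves "hello", and cstr_suffix on that result with a skip of 2 points at "llo". Note that strlen itself walks the string up to the terminator, so how cheap the on-demand length is depends on how long the strings are and how often it is called.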

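You are also free to wrap strings yourself and provide a more complete class that suits your needs. The strength of the C string is efficiency, but it has one nearly intolerable drawback: you are responsible for all of the memory management. That is why programmers are usually advised not to rely on C strings. The cost of managing memory is too high and usually outweighs the benefit, so you need a class that handles these chores for you, such as basic_string.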


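Long ago, machines had very little memory, and this was a compromise. By contrast, storing the length explicitly looked like the wrong decision. The tradition has carried on to this day.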


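If char* were not used, a string would have to be designed as a composite type (a struct), which would make it much harder to keep the C standard library binary compatible, because any later change to the struct (adding or removing members, or even just reordering them) would unquestionably break it.

This is probably also why the C standard library contains no containers or similar data structures.

The benefit of the approach C took is just as clear: the extremely simple API surface keeps the C standard library highly stable, and C has become arguably the most enduring language to date.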


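Ending with 0 means a string can, in theory, be arbitrarily long. Stating the length at the head of the string cannot offer that: no matter how many bytes or characters you use to describe the length, it is a finite value.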


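Store the length in element 0? Then what happens when it overflows? What about character streams? One careless step and you are out of bounds.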


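You can also define your own string type with a 32-bit length field, add a set of string-handling functions plus functions to convert to and from C-style strings, and use that.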

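#include <stdint.h>

/* Length-prefixed string: the length is stored explicitly. */
struct NString {
    uint32_t len;  /* number of bytes in str (fixed 32-bit width) */
    char *str;     /* character data */
};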
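The conversion functions mentioned above could look something like the sketch below. The names nstring_from_cstr, nstring_to_cstr and nstring_free are hypothetical, and the choice to copy the data onto the heap (rather than share the caller's buffer) is an assumption made to keep ownership simple.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct NString {
    uint32_t len;  /* number of bytes in str, no terminator counted */
    char *str;     /* heap-allocated data, not NUL-terminated */
};

/* Hypothetical helper: build an NString by copying a C-style string.
   Returns 0 on success, -1 on allocation failure or if s is too long. */
static int nstring_from_cstr(struct NString *out, const char *s)
{
    assert(out != NULL && s != NULL);
    size_t n = strlen(s);
    if (n > UINT32_MAX)
        return -1;
    char *p = malloc(n ? n : 1);   /* avoid a zero-size allocation */
    if (p == NULL)
        return -1;
    memcpy(p, s, n);
    out->len = (uint32_t)n;
    out->str = p;
    return 0;
}

/* Hypothetical helper: copy an NString into a fresh NUL-terminated
   string. The caller frees the result. Returns NULL on failure. */
static char *nstring_to_cstr(const struct NString *ns)
{
    assert(ns != NULL);
    if ((size_t)ns->len >= SIZE_MAX)
        return NULL;               /* len + 1 would overflow size_t */
    char *p = malloc((size_t)ns->len + 1);
    if (p == NULL)
        return NULL;
    memcpy(p, ns->str, ns->len);
    p[ns->len] = '\0';
    return p;
}

/* Release the data owned by an NString and reset it to empty. */
static void nstring_free(struct NString *ns)
{
    assert(ns != NULL);
    free(ns->str);
    ns->str = NULL;
    ns->len = 0;
}
```

Every function that changes the contents has to keep len and str consistent, which is exactly the maintenance cost some of the other answers object to; in exchange, reading the length is a single field access.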


In fact, a character array in PASCAL also has one more element than you declare. Given a:string[5], a[0] does exist; try printing it and see what it holds!
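For readers who want to see that layout from the C side, here is a sketch that mimics a classic Pascal short string (as in Turbo Pascal-style string[N] types), where element 0 holds the length. The struct PString and its helpers are hypothetical names for illustration, not part of any Pascal or C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PSTR_MAX 255  /* largest value a one-byte length can hold */

/* Hypothetical layout modeled on a Pascal short string:
   data[0] is the length byte, data[1..len] are the characters. */
struct PString {
    unsigned char data[PSTR_MAX + 1];
};

/* Copy a C-style string in. Returns 0 on success,
   -1 if the input does not fit in a one-byte length. */
static int pstring_set(struct PString *p, const char *s)
{
    assert(p != NULL && s != NULL);
    size_t n = strlen(s);
    if (n > PSTR_MAX)
        return -1;
    p->data[0] = (unsigned char)n;
    memcpy(&p->data[1], s, n);
    return 0;
}

/* Constant-time length lookup: just read element 0. */
static size_t pstring_len(const struct PString *p)
{
    assert(p != NULL);
    return p->data[0];
}
```

The one-byte length is what gives the 255-byte ceiling discussed earlier in the thread: pstring_set has to reject anything longer, because the length simply cannot be represented.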

