Chinese characters show up as question marks in Stata. How do I fix this?

01-14

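I connected Stata to a MySQL database and exported the data to .dta format, but all the Chinese text shows up as question marks. Is this a character set problem, and where should I configure it?
Thanks a lot~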

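I used to run a cracked copy of Stata 12 and, for whatever reason, never hit this problem; presumably whoever produced the crack had already dealt with it.

But!!!

Later I switched to a licensed copy of Stata 14, and things broke as soon as I opened my old datasets!!! Stata 14 does support Unicode, but garbled Chinese is not fixed automatically; it depends on how the Chinese text in the dataset was originally encoded. To fix it in Stata 14, follow the steps below (using psu.dta from the CHARLS 2011 data as the example; the commands were in bold in the original post, my notes are in brackets, and the rest is Stata output):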

clear [Added after a reminder from fellow user 馬克思福柯: no dataset may be open when you start using unicode, otherwise Stata warns you that "data in memory would be lost" and "there must be no data in memory"]

cd "D:long term careCHARLSstatadatahousehold_and_community_questionnaire_data" 【First set your working directory】

unicode analyze psu.dta

(Directory ./bak.stunicode created; please do not delete)

File summary (before starting):

1 file(s) specified

1 file(s) to be examined ...

File psu.dta (Stata dataset)

2 str# variables need translation

----------------------------------------------------------------------------------------------------------

File needs translation. Use unicode translate on this file.

File psu.dta needs translation

File summary:

1 file(s) need translation

【Summary: Stata is telling you that this file needs to be translated】

unicode encoding set "GB18030"

(default encoding now GB18030)

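【This tells Stata which encoding the data was originally stored in. I could not find the encoding used by CHARLS anywhere; then I read on an overseas site that GB18030 generally works for Chinese, so I tried it. Before that I had tried Windows-1252, which is supposedly very common (it is a Western European encoding), but it did not work here!!!】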
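If you are unsure which encoding to pass to unicode encoding set, one way to reason about it outside Stata is to take a few raw bytes and see which candidate encodings decode them sensibly. The following is only a sketch in Python: the function name try_encodings and the sample string are made up for illustration, and the bytes are simulated rather than read from psu.dta.

```python
def try_encodings(raw: bytes, candidates=("utf-8", "gb18030", "cp1252")):
    """Decode raw bytes with each candidate encoding.

    A value of None means the bytes are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Simulate a label as an older Stata on a Chinese Windows system might store it.
raw = "北京".encode("gb18030")

results = try_encodings(raw)
for name, text in results.items():
    print(f"{name:8s} -> {text!r}")
```

In this sample the bytes are not valid UTF-8, GB18030 gives back the Chinese text, and Windows-1252 (cp1252) decodes without error but produces Western symbols instead of Chinese. A clean decode is therefore not proof that the encoding is right; you still have to look at the result.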

unicode retranslate psu.dta, transutf8

(using GB18030 encoding)

File summary (before starting):

1 file(s) specified

1 file(s) to be examined ...

File psu.dta (Stata dataset)

all variable names translated

all data labels translated

all variable labels translated

all value-label names translated

all value-label contents translated

all characteristic names translated

all characteristic contents translated

all str# variables translated

----------------------------------------------------------------------------------------------------------

File successfully translated

File summary:

all files successfully translated

【This step has Stata convert the data to UTF-8, which supports Chinese. After it finished I checked the data and the Chinese was finally readable. Done!】

To convert more than one dataset in a single pass, you can also try:

unicode analyze *

unicode encoding set "GB18030"

unicode retranslate *, transutf8

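This converts all the files in your working directory in one go.

If something goes wrong, no problem. After the conversion you can still use the following commands to restore or adjust the files.

unicode restore filespec

or unicode retranslate filespec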

For more, see:

Chinese support in Stata 14

http://www.stata.com/manuals14/dunicodetranslate.pdf

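It is a character set problem, which means the exported data is not actually corrupted, it just cannot be displayed~~~ That said, I have never used Stata myself~~

Step 1: in the Windows Control Panel, under the language settings, set the language for non-Unicode programs (the system locale) to Chinese. The system display language does not have to be Chinese, but this setting must be changed.

Step 2: in Stata's Preferences, change the color scheme to Classic. The background of the Results window will turn black.

After this, Chinese in the data (variable names, Chinese characters in the values, and Chinese in labels) displays correctly.

This is how I set up Stata 13 on Windows 7.

There is one small problem: Chinese typed in a do-file sometimes turns into garbage when it sits right next to punctuation, and you have to press Delete once by hand to restore it. It does not get in the way much; just pay a bit more attention when typing.

So, when processing data in Stata or other econometrics software, it is best to use English (ASCII) characters everywhere, from variable names and the strings in observations to label contents. Likewise the data file name and the folders holding it (that is, every folder name on the path) should use English characters with no spaces.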
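The path advice above is easy to check mechanically. Here is a small Python sketch; the function name path_problems and the example paths are hypothetical, and it only flags the two things mentioned above (non-ASCII characters and spaces).

```python
from pathlib import PureWindowsPath


def path_problems(path: str) -> list[str]:
    """List the parts of a path that contain spaces or non-ASCII characters."""
    problems = []
    # PureWindowsPath accepts both / and \ separators and runs on any OS.
    for part in PureWindowsPath(path).parts:
        if not part.isascii():
            problems.append(f"non-ASCII characters in {part!r}")
        if " " in part:
            problems.append(f"space in {part!r}")
    return problems


for candidate in ("D:/data/charls/psu.dta",
                  "D:/long term care/psu.dta",
                  "D:/數據/psu.dta"):
    print(candidate, "->", path_problems(candidate) or "looks fine")
```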

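Starting with version 14, Stata uses a new encoding system in order to support multiple languages. Stata 15 made further, fairly substantial adjustments. As a result, when you open do-files and .dta files from Stata 13 or earlier in Stata 14 and 15, the Chinese characters in them show up garbled. Don't worry, though: it is only a display problem.

The unicode command mentioned by the posters above can convert all files in the current working directory.

To convert all files in the current working directory plus its subfolders and sub-subfolders in one pass, you can use the ua prefix command.

Download: Stata連享會/ua - 碼雲 Gitee.com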

Stata example:

* Change current working directory (CWD)

. cd D:stata15adopersonalmypaper

* Unicode all .dta files in CWD and files in sub-directories

. ua: unicode encoding set gb18030

. ua: unicode translate *.dta

* Unicode all files (.do, .ado, .dta, .hlp, etc.) in CWD and files in sub-directories

. ua: unicode encoding set gb18030

. ua: unicode translate *

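Detailed documentation of the command: Jianshu (簡書) - "Stata15 Unicode: convert everything in one pass to fix garbled Chinese"; or Zhihu (知乎) - "Stata15: convert everything in one pass to fix garbled Chinese"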
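To make the "current folder plus all subfolders" idea concrete, here is a rough Python sketch of a recursive conversion. It is not how ua is implemented; it only handles plain-text files such as .do and .ado (binary .dta files still need Stata's unicode translate), and the function name convert_do_files, the .bak backup naming, and the demo files are all invented for illustration.

```python
import tempfile
from pathlib import Path


def convert_do_files(root, source_encoding="gb18030", suffixes=(".do", ".ado")):
    """Re-encode Stata text files under root (recursively) to UTF-8.

    Files that are already valid UTF-8 are skipped. Each rewritten file
    gets a .bak copy of the original bytes first. Returns rewritten paths.
    """
    converted = []
    # sorted() materializes the listing before any new files are written.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already UTF-8 (plain ASCII counts too)
        except UnicodeDecodeError:
            pass
        text = raw.decode(source_encoding)  # raises if the guess is wrong
        path.with_name(path.name + ".bak").write_bytes(raw)
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted


# Demo on a throwaway directory with one old-style file in a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "sub"
    sub.mkdir()
    (sub / "old.do").write_bytes("* 北京 sample\n".encode("gb18030"))
    (Path(tmp) / "plain.do").write_bytes(b"display 1\n")
    converted_names = [p.name for p in convert_do_files(tmp)]
    new_text = (sub / "old.do").read_text(encoding="utf-8")

print(converted_names)
```

Keeping a backup before overwriting follows the same caution Stata shows with its bak.stunicode directory; if the encoding guess is wrong the function raises instead of writing anything.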

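With Stata 14, use the lovely unicode command.

cd "working directory"

unicode encoding set GBK (GBK covers Chinese text; files saved in Japanese or Korean encodings such as Shift_JIS or EUC-KR need their own setting)

unicode analyze *.do (or *.dta)

unicode translate *.do (or *.dta)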

-----------------------------------------------

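If you run into nonconvertible characters, do not let that stop you from reading the file! Type one more line:

unicode retranslate *.do (or *.dta), invalid(mark)

Apart from a few characters that cannot be converted, everything else gets decoded...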
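For intuition about what "marking" invalid characters means, Python's decoders offer a comparable choice between failing, marking, and dropping bad bytes. This is an analogy only, not Stata's implementation, and the sample bytes are constructed by hand.

```python
# Two valid Chinese characters followed by a byte that GB18030 cannot decode.
raw = "北京".encode("gb18030") + b"\xff"

# Strict decoding refuses the whole string.
try:
    raw.decode("gb18030")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "replace" keeps the good text and marks the bad byte with U+FFFD.
marked = raw.decode("gb18030", errors="replace")

# "ignore" silently drops the bad byte.
ignored = raw.decode("gb18030", errors="ignore")

print(strict_ok, repr(marked), repr(ignored))
```

Marking is usually the safer of the two lenient modes, because the replacement character shows you where something was lost.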

Just set the color scheme to Classic.

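I once solved this for a classmate with Python. The actual code is gone, so here is the rough idea.

First read the Stata file with the pandas library, declaring the encoding as GB18030; then write the data back out as a CSV file, encoded as UTF-8.

Finally, import it back into Stata.

This is a roundabout fix from someone who does not know how to use Stata.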
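Since the original code is gone, here is a guess at the middle of that workflow, written with the standard library only so it runs without pandas. It assumes the strings came out of the reader as GB18030 bytes wrongly decoded as Latin-1 (I believe recent pandas versions do this for pre-Stata-14 files, but check yours); the function name repair and the sample row are made up.

```python
import csv
import io


def repair(text, wrong="latin-1", right="gb18030"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # leave values that do not fit the assumption untouched


# One simulated row whose city name was decoded with the wrong encoding.
rows = [{"id": "1", "city": "北京".encode("gb18030").decode("latin-1")}]
fixed = [{key: repair(value) for key, value in row.items()} for row in rows]

# Write the repaired rows as CSV text, then encode the whole thing as UTF-8.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "city"])
writer.writeheader()
writer.writerows(fixed)
csv_bytes = buffer.getvalue().encode("utf-8")

print(fixed)
```

Saving csv_bytes to a file gives a UTF-8 CSV that can then be imported back into Stata, which is the last step described above.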

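Was this ever solved? I have the same problem: Chinese labels show up as question marks and I don't know what to do...

The simplest, most direct fix I have used so far is to change the whole system's language setting to Chinese...

If you are using import excel, there is a way: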

In Stata, I run the unicode locale list command and get the following output (with many lines removed)

Code:

. unicode locale list

    #   Locale        Language    Country
-------------------------------------------------------------------------------
    1   af            Afrikaans
    2   af_NA         Afrikaans   Namibia
    3   af_ZA         Afrikaans   South Africa
    4   agq           Aghem
    5   agq_CM        Aghem       Cameroon
[lines removed]
  673   zh            Chinese
  674   zh_Hans       Chinese
  675   zh_Hans_CN    Chinese     China
  676   zh_Hans_HK    Chinese     Hong Kong SAR China
  677   zh_Hans_MO    Chinese     Macau SAR China
  678   zh_Hans_SG    Chinese     Singapore
  679   zh_Hant       Chinese
  680   zh_Hant_HK    Chinese     Hong Kong SAR China
  681   zh_Hant_MO    Chinese     Macau SAR China
  682   zh_Hant_TW    Chinese     Taiwan
[lines removed]
-------------------------------------------------------------------------------

What you see in 673-682 are the possible locale specifications for the Chinese language ("zh"). There are two "scripts" ("Hans" and "Hant") and several country specifications. My belief is that you want either zh_Hans or zh_Hant, depending on the script. (You should run the command in your Stata 14 implementation and see the locales that are shown in its output; perhaps some that I show are new in Stata 15.)

Then, you add to your import excel command the appropriate locale option, for example

Code:

import excel yourworkbook.xlsx, locale("zh_Hans")

as documented in

Code:

help import excel

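Put the file under a directory whose path is all English.

In Edit > Preferences, after making the change, set the background to white and everything else to black.

On Linux, do-files do not support Chinese.

Stata does not support some Chinese character sets. If the dataset is not large, saving it as .txt and importing it again works; I have tested this myself.

Version 14 no longer has this problem, because it started supporting Unicode encoding.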