User:Jamesboston/nsIProcess/meeting-110708
Date : Nov 7, 2008
Topic : Unicode!
People: James Boston, David Humphrey, Jason Orendorff, Ted Mielczarek, Benjamin Smedberg
09:37 <@humph> jboston: you there?
09:37 < jboston> humph: ack
09:38 <@humph> jboston: meet jorendorff
09:38 < jboston> jorendorff: allo
09:38 < jorendorff> jboston: hi!
09:38 <@humph> jboston: I think your project, and some stuff he wants to do, is doing, are tied together
09:38 <@humph> jboston: he works on the js engine and other scary things
09:39 < jboston> ah!
09:39 < jorendorff> jboston: what should I read? point me at your blog and stuff
09:39 < jboston> jamesboston.ca
09:39 < jboston> http://zenit.senecac.on.ca/wiki/index.php/User:Jamesboston
09:40 <@humph> jboston: where is that initial conversation with bsmedberg from irc
09:40 <@humph> that might help here
09:40 < jboston> http://zenit.senecac.on.ca/wiki/index.php/User:Jamesboston/nsIProcess/meeting-092008
09:41 * jorendorff reads and reads
09:45 < jorendorff> so, that IRC chatlog refers to "issues with character sets"
09:45 <@humph> yeah, the unicode piece is right in front of him now
09:45 < jboston> Yes. A tricky issue.
09:45 < jorendorff> jboston: Yeah, do you have ideas already?
09:46 < jboston> Well, the problem I have is that I am using the Netscape Portable Runtime, but the APIs I use don't support unicode.
09:46 < jorendorff> I want to look at how this is handled in Python 3.0... maybe the stdin/stdout deal in bytes, not text
09:46 < jorendorff> jboston: This is a horrible problem to have :-P
09:46 < jorendorff> jboston: It feels too much like NSPR is just in the wy
09:46 < jorendorff> *way
09:47 <@humph> there's a whole bunch of never-implemented or implemented-badly bits in nspr that sort of draw your attention, though
09:47 < jboston> Yes and no. It looks like over the years people have hacked things into nsIProcess to avoid using the NSPR. But there's a lot of useful stuff in there.
09:47 < ted> heh
09:48 < jorendorff> Ultimately I want JS to have a byte-array type.
09:48 < jorendorff> The language has strings, which are immutable arrays of 16-bit "characters" (actually UTF-16 or UCS2 code units)
09:48 <@humph> so bypassing the unicode problem by doing byte-by-byte is interesting
09:48 < ted> JS could use some way to handle binary data
09:48 < jorendorff> yes, it's a fairly common request; Flash has such a thing
09:49 <@humph> jorendorff: do you have a bug on this?
09:49 <@humph> the byte array?
09:49 < jorendorff> jboston: the awkward thing here is that it feels like a prerequisite to what you're doing, and certainly I don't want to block what you're doing
09:49 < ted> if you can give the user a stream of bytes, we have streams that will let you get out unicode data
09:49 < ted> https://developer.mozilla.org/en/Reading_textual_data
09:50 <@humph> nice
09:50 < jorendorff> let me search, i'm not aware of a bug
09:50 < jboston> I need to do more research on how NSPR APIs for piping handle unicode. The problem I had run into with character sets had to do with passing arguments to processes.
09:50 < jorendorff> for byte arrays
09:50 < ted> like nsIFileInputStream just gives you bytes
09:50 < ted> yeah, isn't that one of the main problems you were looking into?
09:50 < ted> since on windows, filenames are actually UTF-16
09:50 <@humph> right
09:50 < jorendorff> filenames and command lines both
09:51 < ted> yeah
09:51 < jorendorff> on unix, the executable filename is 8-bit, and the argv strings are 8-bit, and there is no command line
09:51 <@humph> do either of you have any tips for him on solving this?
09:51 < ted> so i guess ideally, your interface would just take nsStrings
09:51 < ted> jorendorff: well, not true
09:51 < jorendorff> ted: ?
09:51 < ted> linux and OSX use UTF-8 natively, most of the time, now
09:52 < ted> (although you can change the encoding you use)
09:52 < jorendorff> ted: yep, most of the time.
09:52 < ted> you can find the platform charset in those cases though, it shouldn't be a big deal
09:52 < ted> and we have plenty of APIs for converting charsets
09:53 < ted> jboston: are you going to just make a new API?
09:53 < ted> something like nsIProcess2 ?
09:53 * ctyler wonders why MS chose UTF-16. Not big enough to encode the
09:53 < jorendorff> The problem is drawing the boundary... in particular, you have to use whatever NSPR exposes
09:53 < ted> they committed too early
09:53 < jboston> ted: I think that will happen.
09:53 < ted> and then unicode said oops
09:54 <@humph> jorendorff: or change nspr
09:54 <@ctyler> ah
09:54 < jorendorff> ctyler: that decision predates Unicode being >16bits
09:54 < ted> of course, using UCS-4 natively is sort of insane
09:54 <@ctyler> Unicode was always >16 bits, it's just the BMP that was <16
09:54 < ted> i'm pretty sure glibc does that
09:55 < ted> "let's use 4x the memory of ascii just in case we have to support insane non-BMP characters!"
09:55 < mhoye> you're not seriously defending ASCII, are you?
09:55 <@humph> ted: that's how you sell new machines with more ram, note.
09:55 < jorendorff> ctyler: i... that is inconsistent with my vague understanding of the history
09:55 * humph just read knuth ranting about 64-bit pointers being a sin for the same reason :)
09:56 < ted> mhoye: no, i support UTF-8
09:56 < ted> all the compatibility without paying the insane memory cost
09:56 < mhoye> ?
09:56 < ted> of UCS-2
09:56 < ted> er
09:56 < ted> UCS-4
09:56 < mhoye> Man, memory is free.
09:56 < ted> sez you
09:57 < mhoye> At least as far as text data is concerned.
09:57 < mhoye> Hellz, yeah.
09:57 <@ctyler> imho, the only sane options are UTF-8 (decent size for most data streams) and UCS-4/UTF-32 (no escape tokens to parse)
09:57 * ted shudders to think of what mozilla's memory footprint would look like if we used UCS-4 natively
09:57 < ted> mhoye: databases?
09:57 < ted> ctyler: i agree
09:57 < ted> i think UCS-4 has its place, if you know you're going to be dealing with lots of non-ascii data
09:59 < jorendorff> jboston: but we digress
09:59 < ted> yeah
09:59 < jboston> I'm wondering what I an change in the NSPR? I don't want to break things.
10:00 < jorendorff> jboston: back to first principles - we definitely want to support launching a process by providing a bunch of strings
10:00 < jboston> Well, that's possible if you don't use Japanese.
10:00 < ted> NSPR is just code :)
10:01 <@humph> it's just macros, actually :)
10:01 < jorendorff> jboston: Suppose one of the strings contains Japanese
10:01 < jorendorff> like, Popen(['hg', 'commit', '-u', username])
10:01 < jorendorff> jboston: You have some working code -- what are you doing right now?
10:02 < jboston> http://jamesboston.ca/patches/patch100308.txt
10:02 < jboston> That bascilly just fixes nsIProcess so that you can start and stop a process. Nothing else.
10:03 <@humph> jorendorff: he's been trying to decided how to approach this, from js-api level or up from nspr. the path is not clear atm
10:03 < jorendorff> So, is the NSPR process management stuff just totally undocumented?
10:04 < jboston> No. There
10:04 < jboston> There's some stuff at devmo.
10:04 < jboston> https://developer.mozilla.org/en/NSPR_API_Reference
10:04 < jboston> But it's the usual terse description.
10:04 < jorendorff> Right, I see that, but https://developer.mozilla.org/en/NSPR_API_Reference#Process_Management_and_Interprocess_Communication
10:04 < jorendorff> is empty
10:05 < jorendorff> OK, does NSPR deal with character encodings anywhere else?
10:05 * jorendorff doesn't see it if so
10:05 <@humph> I thought there was something with filenames
10:06 < jboston> I think so.
10:07 < ted> http://www.mozilla.org/projects/nspr/reference/html/prprocess.html#24349
10:07 < ted> not all of the NSPR docs have made it to MDC yet
10:07 < jorendorff> well that sucks!
10:08 < ted> yep
10:08 < ted> should get sheppy to fix that
10:08 * humph tries to use ted's voice
10:08 < jorendorff> we should double the size of our doc team... to 2
10:08 <@humph> "it's a wiki! fix it!"
10:08 < ted> hah
10:08 < jboston> I'm fishing around in the code looking for unicode stuff. Here's something: http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/io/prfile.c#801
10:08 < ted> humph: yeah, but migrating lots of docs over is a better task for someone dedicated
10:09 <@humph> jboston: yes that's what I remember
10:09 <@humph> ted: for sure
10:09 <@humph> actually, that could be a good project for our doc writing team
10:09 < jorendorff> ok, I'm searching for where this stuff is implemented...
10:09 < jorendorff> humph: gosh yes
10:09 * humph sends a mail
10:12 * jorendorff sees #define _MD_OPEN_FILE _PR_MD_OPEN_FILE
10:12 < jorendorff> and vice versa!
10:12 <@humph> it's macro mania
10:12 < jboston> Experimental: http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/prio.h#671
10:13 < ted> jboston: you could email wtc and ask him about these things
10:13 < jboston> There are a lot of defines. You have to go through 3 or 4 levels to reach the thing being defined.
10:13 < ted> if you're interested in modifying NSPR
10:13 < jorendorff> good idea...
10:13 < jboston> Who is wtc?
10:13 < jorendorff> jboston: yeah, I just needed to poke around a little and find stuff
10:13 < jorendorff> I see the implementation now
10:14 < jorendorff> maze of twisty little passages -- it happens, not necessarily for bad reasons
10:14 <@humph> yeah, code grows hair
10:15 < ted> jboston: Wan-Teh Chang, the NSPR owner
10:15 < ted> wtc@google.com
10:15 < jboston> ted: thanks.
10:15 < ted> he doesn't irc much, but he's responsive to email
10:15 < jorendorff> The problem with making any kind of change to NSPR is that there are more platforms than any human can understand and test
10:16 < ted> sure
10:16 < jorendorff> and my impression is that they really really don't want regressions, but I'm sure I'd have a friendlier impression if I'd actually spoken to any of them
10:16 < ted> nspr is used in other projects, afaik
10:16 < jboston> It is.
10:17 < ted> i've worked with wtc to get fixes to the NSPR build system that i needed
10:17 < ted> he's pretty helpful
10:18 < jorendorff> so, I see an implementation of _PR_MD_OPEN_FILE_UTF16 in w95io.c
10:18 < jorendorff> but not in ntio.c
10:18 < jorendorff> which worries me a touch
10:19 < jboston> Perhaps utf16 is default for nt?
10:19 < ted> well, it is
10:19 < ted> but does NSPR know that? :)
10:19 < jorendorff> jboston: at the OS level, but not in NSPR
10:22 < jboston> jorendorff: Yes. The nt function only takes a char*. Hrm...
10:22 < jorendorff> jboston: So, regarding stdin/stdout... totally agree with ted that you should just produce byte streams,
10:22 < jorendorff> and use JS, and existing classes, to
10:22 <@humph> can I suggest that we get all of this into a suitable bug?
10:23 < jorendorff> provide text streams as desired
10:23 < jboston> So I should ask wtc if I can create PR_MD_OPEN_FILE_UTF16 for nt. That sort of thing.
10:23 < jorendorff> command lines and filenames are a separate thing
10:23 < jboston> https://bugzilla.mozilla.org/show_bug.cgi?id=459572
10:23 < jorendorff> i'll write all this in the bug in a sec
10:23 < firebot> jboston: Bug 459572 nor, --, ---, wtc@google.com, UNCO, PR_CreateProcess in NSPR needs unicode support
10:24 < jorendorff> the thing about this is, mostly we're interested in JS users, who just want to pass JS strings
10:24 < jorendorff> which we should treat as UTF-16.
10:24 < jorendorff> It would be nice if NSPR supported that. Then you wouldn't have to worry about it.
10:25 < jorendorff> ted: what's our usual XPCOM class for filenames in Moz?
10:25 * jorendorff can never remember
10:25 < jorendorff> nsIProcess knows it...
10:25 < jorendorff> s/class/interface/
10:26 <@humph> nsIFile?
10:26 < jboston> nsIFile
10:27 < ted> jorendorff: XPCOM does pretty seamless translation from JS strings to nsString
10:27 < ted> which is in turn pretty easy to get to whatever encoding you want
10:27 < jorendorff> all the implicitness makes my head hurt, but yeah
10:28 < jboston> Getting wide characters from js into xpcom is easy. But then how to pass them to NSPR?
10:28 < jorendorff> the thing is: http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsIFile.idl#242
10:29 < jorendorff> (the preceding comment explains that warning somewhat)
10:29 < ted> well yeah, mac classic had that problem
10:29 < ted> not sure it's relevant for mozilla
10:30 < ted> nsLocalFileMac might still hold a FSref or something
10:30 < jorendorff> yeesh, does NSPR ever end-of-life anything?
10:30 < ted> ostensibly you can have paths on windows that aren't really representable by pathnames as well, like the Control Panel
10:30 * jorendorff looks surprised
10:31 < jorendorff> I thought everything on Windows could be represented by a path somehow, but it's all kind of mysterious
10:31 < jboston> I think osx using the unix implementation in the nspr.
10:31 < jboston> There's some stuff with paths that has to be handled: http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsProcessCommon.cpp#98
10:32 < ted> http://en.wikipedia.org/wiki/Windows_Shell_namespace
10:32 < jorendorff> oh, that's not what I meant
10:32 < jorendorff> I meant that there are filenameoids on Windows for stuff like devices and registry keys
10:33 < jorendorff> names that you can use to attach permissions and stuff
10:33 < jboston> Am I going the correct route using the nspr at all? I think it makes design sense.
10:35 < jorendorff> judgement call
10:35 < jorendorff> you have some working code, which tends to make me believe you're on the right track :)
10:35 < ted> jorendorff: ah, yeah
10:35 < ted> the shell deals with PIDLs though
10:36 < jorendorff> jboston: ok, so I suspect that you'll have OS-specific code eventually anyway
10:36 < jorendorff> unless NSPR wants to add some features.
10:36 < jboston> They must want unicode? Everybody want unicode.
10:36 < jorendorff> The reason I think this is because I think you want something like Python's shell=True option
10:37 < jorendorff> right, I figure NSPR probably wouldn't mind adding UTF-16 APIs, it's worth asking
10:37 < jorendorff> But
10:38 < jorendorff> it's unobvious what those APIs should do on POSIX, though for any given UNIX it's pretty straightforward, if clunky
10:38 < jorendorff> what I would do
10:38 < jorendorff> in terms of implementing the UTF-16 API on a UNIX
10:38 < jorendorff> is, first convert it to wchar_t if wchar_t is not already UTF-16 on that platform; then use wcstombs
10:39 < jorendorff> and pass the resulting char string to the relevant UNIX api.
10:39 < jorendorff> shell=True is very much a separate issue; NSPR probably doesn't want it.
10:39 * jorendorff doesn't know
10:42 < jboston> I think that I will try to implement ipc without unicode before moving on to unicode.
10:55 < mhoye> http://www.joelonsoftware.com/articles/Unicode.html
11:00 < jorendorff> jboston is right, byte streams first
11:05 < ted> well, as stated, there are two issues here
11:05 < ted> the encoding of the file names/command line args
11:06 < ted> and the encoding of the stdio
11:06 < ted> the first is kind of hard
11:06 < ted> the second we already have plenty of ways to work with in the tree
11:09 < bsmedberg> yeah, the filename/commandline args are more important to me
11:10 < bsmedberg> the nsIScriptable{Input,Output}Stream interfaces mostly take care of the stream stuff
11:10 < bsmedberg> although bytearray would be nice
11:10 < jorendorff> yeah, bytearray :|
11:10 < jorendorff> do we have a bug on that?
11:11 < jorendorff> "Bug xxx - can i has bytearray"
11:11 * humph would love if it was a 3 digit bug num
11:11 < bsmedberg> does ES3.1 have a spec for ByteArray?
11:11 < ted> have you ever seen my pure JS+XPCOM EXIF parser?
11:12 * jorendorff pastes bsmedberg's question in #jslang
11:15 < jorendorff> mailing list suggests no...
11:18 < jboston> jorendorff: I just read your comment on my blog. Very informative, thanks. I'll have to investigate that further.
11:19 < jorendorff> yeah, i can't honestly tell if that code really does what they claimed it did
11:19 < jorendorff> but it seemed like my experience was worth sharing anyway :-P
11:20 < jboston> I will look through the nspr code to see how/if they handle that problem.
12:06 < bsmedberg> jboston: I think that NSPR will hold back progress significantly
12:06 < jboston> Do you recommend bypassing nspr?
12:07 < bsmedberg> yes, probably
12:07 < bsmedberg> it took nearly a year to get PR_LoadLibraryWithFlags to accept wide-character paths
12:08 < bsmedberg> and the WithFlags API already existed, we were just adding a new flag
12:08 <@humph> holy crap
12:10 < jboston> Oh dear. Well, for the filename + arguments problem I can do it another way. But if the i/o stuff is a stream of bytes that should be ok?
12:11 < jboston> I'll try to do it a way where process creation can be swapped out from one to the other as the situation evolves.
12:26 < jorendorff> i/o = stream of bytes is not just ok but a hard requirement, anything else is nuts
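
Note added after the meeting: a minimal Python sketch of the two issues ted separates in the log, namely the encoding of filenames/command-line arguments and the encoding of stdio. It is not nsIProcess or NSPR code. The helper names (encode_argument, run_and_capture_bytes, decode_output) are made up for illustration, and treating the locale's preferred encoding as "the platform charset" is an assumption. The child process is simply the running Python interpreter, so nothing beyond the standard library is needed.

```python
import locale
import subprocess
import sys


def encode_argument(arg, charset=None):
    """Encode one text argument into the platform charset (the 'hard' half)."""
    if charset is None:
        # Assumption: the locale's preferred encoding stands in for
        # "the platform charset" mentioned in the log.
        charset = locale.getpreferredencoding(False)
    # Raises UnicodeEncodeError when the charset cannot represent the text,
    # which is the "possible if you don't use Japanese" failure mode.
    return arg.encode(charset)


def run_and_capture_bytes(argv):
    """Run a child process and return its stdout as raw bytes, undecoded."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout


def decode_output(raw, charset="utf-8"):
    """Turn the byte stream into text in a separate, explicit step."""
    return raw.decode(charset)


# Three Japanese characters, written as escapes so this source stays ASCII.
JAPANESE = "\u65e5\u672c\u8a9e"

# The child writes the same text to stdout as UTF-8 bytes.
CHILD_CODE = (
    r"import sys; "
    r"sys.stdout.buffer.write('\u65e5\u672c\u8a9e'.encode('utf-8'))"
)

arg_bytes = encode_argument(JAPANESE, "utf-8")
raw = run_and_capture_bytes([sys.executable, "-c", CHILD_CODE])
text = decode_output(raw, "utf-8")
```

The process layer only ever hands back bytes here; choosing a charset is left to the caller, which is the split the log settles on for i/o. The argument side stays platform-specific: the sketch only shows the conversion step and does not claim to solve the Windows UTF-16 command-line case.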
Jason Orendorff's comment on my blog dealing with children inheriting handles from parents:
http://jamesboston.ca/cms/node/87