[QUOTE=xilman;299855]There's a (to my mind) serious conceptual difficulty with C and with many other languages. Back when it was invented, a char could hold a character (hence the name) but what it really held was a small integer. Even then there were problems, some of which have been raised in this thread, with what happened when a char was widened to a short, int or long.
Toto, I've a feeling we're not in Ascii anymore. In other words, a char can no longer hold a character, in general. Some encodings allow £ or ß to be stored in a char but I don't know of any which allow 㒩 or 𓂸 or 𒄐 to be so stored. This is actually a serious problem which a few languages have addressed seriously. I'm a Perl fan partly because it has, to my mind, taken the problem seriously and has come up with a remarkably useful solution. The solution isn't perfect but it works well, by and large, despite having to try very hard to support older software and notions of what character strings should be.[/QUOTE] My phone can't render the last three chars, whatever they are :P I'm not sure what Perl's solution is, but Java chars are two (8 bit) bytes of UTF-16; Python is capable of using many, many different encodings, and defaults to UTF-8. (For anyone reading this, look in my previous post for another addendum on pointers.)
[QUOTE=Dubslow;299856]My phone can't render the last three chars, whatever they are :P
I'm not sure what Perl's solution is, but Java chars are two (8 bit) bytes of UTF-16; Python is capable of using many, many different encodings, and defaults to UTF-8. (For anyone reading this, look in my previous post for another addendum on pointers.)[/QUOTE]Yup. Java is an example of an implementation of a language which did not take the problem seriously. Two 8-bit bytes can hold at most 2^16 different characters, which is nowhere near enough to represent the world's scripts. I also suggest that you review what "UTF-8" and "UTF-16" actually mean. You'll find out some rather important information. Finding out which characters I used is left as an exercise. As a hint, the first is from an extant language but the other two are used only to represent now dead scripts. Paul
[QUOTE=xilman;299857]Yup. Java is an example of an implementation of a language which did not take the problem seriously. Two 8-bit bytes can hold at most 2^16 different characters, which is nowhere near enough to represent the world's scripts. I also suggest that you review what "UTF-8" and "UTF-16" actually mean. You'll find out some rather important information.
Finding out which characters I used is left as an exercise. As a hint, the first is from an extant language but the other two are used only to represent now dead scripts. Paul[/QUOTE] Hmmm... Not gonna try figuring it out in my phone :P I'm sure my desktop would render it just fine. It seems UTF-* are all just different ways of encoding the same ~1.1 million characters.
[QUOTE=Dubslow;299853]
How do you do multi-line macros? I think I still prefer the gotos of the original attempt to these macros, but it's fun to experiment with the macro at least. [/QUOTE] Well, you have to use the line continuation char '\'. To the compiler then, it's still one line. [QUOTE=Dubslow;299853] [URL]http://publications.gbdirect.co.uk/c_book/chapter5/pointers.html[/URL] There they say to use 0, not NULL, and [URL="http://publications.gbdirect.co.uk/c_book/chapter9/introduction.html"]here[/URL]: I think I assign 0, but compare to NULL. Arbitrary, yes, but hey. (This might have changed since C89, but most compilers (read: gcc) still use that standard anyways AFAICT.):smile: [/QUOTE] That is new to me. They keep referring to the "STANDARD", so probably C89. I would love to see an example where sizeof( aaa* ) != sizeof( bbb* ). [QUOTE=Dubslow;299853] Addendum on pointers: I think the confused type/dereferencing notation is confusing for a lot of beginners; understanding that in a declaration 'char *d', 'd' as a variable has its own value which is separate from *d. It really does look like you're declaring the variable *d, so working with d is counter-intuitive. That's why I like 'char* d' because then it's clear what the variable actually is, and then *d is just the natural extension of d. IM(A)O this makes pointers conceptually cleaner, and therefore easier to learn.[/QUOTE] I think we're in agreement here.:smile: It's only natural IMHO to want to look at one area of your screen to see the type of variable and then look on the other side for the name. Multiple declarations on one line don't really bother me, but look a little messy. something like: int i,j,k; Looks pretty though. (This is all subjective after all).
[QUOTE=Dubslow;299860]Hmmm... Not gonna try figuring it out in my phone :P I'm sure my desktop would render it just fine.
It seems UTF-* are all just different ways of encoding the same ~1.1 million characters.[/QUOTE]Very largely true. They are different ways of encoding Unicode. Most, but not all, of the world's supply of characters are characterized in Unicode but there's no particular reason why Unicode should be the only way of describing characters. Actually, the second of my example characters is a rather recent addition to the Unicode tables. For many years those who are experts in the language in question could not agree on the members of the definitive set. Other ways of writing the same language are still absent from Unicode AFAIK. One day perhaps I'll compose something in the language in question and post it here. I came extremely close to doing so when trading insults with another poster who delighted in exposing his inadequate knowledge of Latin and Ancient Greek. Unfortunately he earned himself a permanent ban before I collected enough round tuits.
[QUOTE=Dubslow;299856]I'm not sure what Perl's solution is, but Java chars are two (8 bit) bytes of UTF-16; Python is capable of using many, many different encodings, and defaults to UTF-8.[/QUOTE]Further on this: Perl uses UTF-8 which is based on an 8-bit byte encoding of Unicode. That does [i]not[/i] mean that Perl uses 8-bit characters. It means that Perl represents characters in an integral number of 8-bit bytes. With default semantics the expression [FONT="Fixedsys"]length($c)[/FONT] has the value 1 when the variable [FONT="Fixedsys"]$c[/FONT] holds a single character. If the pragma [FONT="Fixedsys"]use bytes[/FONT] is in effect that same expression may have a value of 1 or it may well be 2, 3 or 4 depending on which character is stored. Perl, quite remarkably in my view, almost always does The Right Thing and entirely without any fuss when manipulating characters. I wish some other languages were as good.
You'd be amazed at how much difficulty I and a couple of other colleagues had when trying to teach American programmers how to use Unicode in the FlyBase project. They somehow [i]knew[/i] that any particular character had a unique representation, that all characters required the same amount of storage in memory and files, that it is straightforward to check things like the length of strings, that converting to UPPER CASE or lower case is straightforward, that it is easy to determine whether one character sorts before or after another, and so forth.
[QUOTE=jcrombie;299829]@jyb Just want to say thanks again for that tip on computing LM primitives.
[/QUOTE] Glad it helped! [QUOTE=jcrombie;299829] A nitpicky thing -- I prefer calloc() to malloc(). Using malloc() introduces a lot of randomness to your program's behaviour. Better to clear it out and then when you're tracking down your bugs, things will behave more predictably. [/QUOTE] Be very careful about the things you say to novices. I don't think your advice to use calloc instead of malloc is bad (though I *hate* the way calloc has a different interface from malloc). However, saying that it's because malloc "introduces...randomness" can be very misleading. Getting in the habit of always zeroing out your memory can be convenient, but there's nothing "random" at all about how malloc is used here. His code never tries to access memory before assigning it, and by the time the function ends the entirety of the allocated space has been written, so there's no randomness at all. Saying that malloc introduces randomness can be very confusing/misleading for a novice if they don't know exactly what you mean by it. [QUOTE=jcrombie;299829] I think jyb mentioned this, but I see that you assigned 0 to a pointer. This is programming by coincidence. It just so happens that NULL is address 0 on all implementations that I've seen, but theoretically that could change. Besides, what you want is an address type and not an integer. [/QUOTE] Sorry, but this is just plain wrong. First of all, I don't see where he assigns 0 to a pointer in his code. I could well be missing it; please point it out. But more importantly, when the constant value 0 is assigned to a pointer type, that pointer is a null pointer. This is completely unequivocal in C. By definition, this is the way you get a null pointer. The macro NULL is just a convenience, but it's guaranteed to be defined as either 0 or (void *)0. Note that this *doesn't* mean that a null pointer must be "address 0". The concept of "address 0" isn't really defined by the language. I.e. 
the internal representation of a null pointer might be some arbitrary bit pattern other than all 0s, in which case the compiler would have to set the bits accordingly when it sees a line like char *p = 0; But assigning the value 0 to a pointer is entirely correct and defined by the language.
[QUOTE=jyb;299869]
Be very careful about the things you say to novices. I don't think your advice to use calloc instead of malloc is bad (though I *hate* the way calloc has a different interface from malloc). However, saying that it's because malloc "introduces...randomness" can be very misleading. Getting in the habit of always zeroing out your memory can be convenient, but there's nothing "random" at all about how malloc is used here. His code never tries to access memory before assigning it, and by the time the function ends the entirety of the allocated space has been written, so there's no randomness at all. Saying that malloc introduces randomness can be very confusing/misleading for a novice if they don't know exactly what you mean by it. [/QUOTE] Yes, that was a very brief comment. I see that his code used malloc() correctly this time and in fact would never actually encounter randomness. However, a program is in general a very dynamic thing and in general much larger in size. My comment was really about fortifying your code so that say you forget to set '\0' on the end of your string and most times your program works just fine because uninitialized memory usually is zeroes anyway, but just once in a while it isn't and crashes. But I guess that's what memory checkers are for. [QUOTE=jyb;299869] Sorry, but this is just plain wrong. First of all, I don't see where he assigns 0 to a pointer in his code. I could well be missing it; please point it out. But more importantly, when the constant value 0 is assigned to a pointer type, that pointer is a null pointer. This is completely unequivocal in C. By definition, this is the way you get a null pointer. The macro NULL is just a convenience, but it's guaranteed to be defined as either 0 or (void *)0. Note that this *doesn't* mean that a null pointer must be "address 0". The concept of "address 0" isn't really defined by the language. I.e. 
the internal representation of a null pointer might be some arbitrary bit pattern other than all 0s, in which case the compiler would have to set the bits accordingly when it sees a line like char *p = 0; But assigning the value 0 to a pointer is entirely correct and defined by the language.[/QUOTE] My apologies for that. I saw the "*out = 0" and missed the asterisk. Yes, I see from Dubslow's link that the "STANDARD" allows for this. I still think this is most confusing though. sizeof(0) is 4 bytes on my machine. Same as sizeof(int). One could assume that 0 is in fact an integer. So, we are relying on an implicit cast across the "=" to promote this integer to the 8 bytes of a char* (for a 64-bit address). However, with this use of NULL, you could do sizeof(NULL) and get 8 bytes and believe you were just assigning something of pointer type to another thing of pointer type. The wrinkle in my logic, as you pointed out, is that according to the STANDARD, pointers can be arbitrary sizes and bit patterns and that simple model of how things work is no longer valid. Sorry, but the STANDARD [U]sucks[/U] if this is really true. Oh well, enough ranting. And next time I'll read the code twice before commenting.
[Split into two posts because of character-count limitation]
[QUOTE=Dubslow;299852]I think many of these can be attributed to the fact that this is essentially the only C I've done since February (first attempt), and in lieu of that it's mostly been Python or Java (e.g. exit(1); vs. return 1; vs. return NULL; that was the Python talking :razz:). [/QUOTE] Ah, well then help me out here. What were you actually trying to accomplish with that macro? Using exit(1) means the program will just end, regardless of what else may have been going on. Many would consider this to be rude behavior for a utility function like this, but again it's not entirely clear what the correct behavior should be in case of memory allocation failure. Was that what you thought, or did you think that exit(1) would just return from the function? [QUOTE=Dubslow;299852] See my first comment :razz: (edit: and the following post.) This was just a plain slip up, though if I had caught myself I probably would have returned NULL. How is that ambiguous? If I read an empty line, (I think) it returns a 1-byte chunk initialized to zero, i.e. just the null terminator, which isn't the same thing as NULL, because *out==0, not out==0. [/QUOTE] Go back and look at the description I gave for ReadLine (in essence its "spec") in post #195. What is it supposed to do when there are no characters to read because it's at the end of the file? I.e. how should a program that is using ReadLine know when it's read everything? The answer is that in this case ReadLine should return NULL. So if it also returns NULL to indicate that a memory error occurred, then that's ambiguous. What should a program do if it calls ReadLine and gets back NULL? Assume it got everything, or assume that something went wrong while trying? [QUOTE=Dubslow;299852] 'gcc -Wall' gave me a warning about incompatible pointer types without the explicit cast (although now there are no pointers, I kept the cast anyways. It doesn't hurt). [/QUOTE] Ah, but dealing with pointer types is a different thing. 
size_t is assignment compatible with int (or other integer types). But a size_t * is [I]not[/I] assignment compatible with an int * (or other pointers to integer types). But as for the cast not hurting, in general I disagree rather strongly with that. More on this below. [QUOTE=Dubslow;299852] This one is due to that reference above. Though I can't find the part where it's explicitly stated, throughout the book (mostly in Chapter 5) all malloc()s are cast to whatever type. That was the whole point of void*, if that book is to be believed. [/QUOTE] If that book really said that, then no, it is not to be believed and I would recommend throwing it out! The "whole point" of void * is exactly the opposite. It's a pointer type which is guaranteed to be assignment compatible with most other pointer types, so it doesn't require a cast when assigning between them. The cast is not necessary; as for whether it's actually harmful, well again, more on that below.
[QUOTE=Dubslow;299852]
One thing that I came away with about pointers is that for all purposes that I can see, they are effectively primitive types, if not labelled as such. Their use is syntactically built into the language, without which it wouldn't be C. 'char c' has a very different type from 'char* d'. For instance, for someone who isn't looking real hard, 'int *x=0' looks like a pointer-dereference, when in fact all you're doing is setting it to NULL. 'int* x=0' is much clearer. Writing 'char *d' to me is just a great way of mixing the type-declaration with the variable name d, so I do 'char* d'. [strike]It's just unfortunate that pointer-dereference is done with *d =..., and if I were designing C myself I wouldn't have chosen that notation;[/strike] It's just unfortunate that pointers aren't [I]actually[/I] their own types, and if I were designing C myself I would have made it such and chosen a different dereference notation; if jcrombie is to be believed, Mr. Kernighan shares this view. (When I first came to these conclusions, I realized that 'char* a, b' is bad, but I decided that it would just be best to keep them on separate lines, much like jcrombie does. I shall rue the day when I break my own rule.) tl;dr: Agree to disagree :razz: [/QUOTE] I won't try to persuade you to change your mind on this. As I said before, there is much variation in how people do this. But I do want to address one point you make: pointers are inherently not "primitive" because by their very nature they must have something to point to. If you want to maintain a distinction between different pointer types (and I claim this is a very useful distinction to have, about which more below), then pointer types have to be defined by their "base" type. They're really no different from arrays in that respect. If you want to declare an array, it has to be an array of some type; the same goes for pointers. So the pointed-to type really is a fundamental part of the pointer type. 
And this isn't just arbitrary, it really matters. If you dereference a pointer, how big is the memory you're reading/writing? The answer lies in the pointer's type, which is derived from the type it's pointing to. Given char *cp; and int *ip;, the assignments *cp = 0; and *ip = 0; do very different things to memory. [QUOTE=Dubslow;299852] I understand the subtlety, but I'm not sure how this is fundamental to C (like on the same level as pointers). I had in fact seen that getc() etc. are declared to return ints, but as you say, I couldn't find a good answer for 'Why?'. Thanks to you, I now have :smile: (I'm not sure that a detailed understanding of types and conversions is necessary; the only key part is that technically EOF is not a char value. The rest is easily inferred from long->int = truncation, int->char = truncation (or some sort of clobbering). [/QUOTE] I see it as fundamental for two reasons. One, it requires a fairly sophisticated understanding of how integer conversions work in order to know exactly what will go wrong if you assign the return value to a char. It's not hugely complicated, but it requires more understanding than most people have when they first see the getchar/getc functions. Two (and more importantly) it gets into some fairly deep philosophical questions about language design and exception handling. (When I say exception handling here, I'm not referring specifically to exception handlers as they exist in C++ and Java, though that's related of course; I just mean the handling of exceptional conditions in general.) To take getchar as an example, its interface sounds pretty straightforward: read and return the next character from stdin. What type would you want it to return? Well duh, it's supposed to return a character, so it should return a char. But what happens when something exceptional happens (the most common being there are no more characters in the file)? 
Well, there are three possibilities that come to mind: 1) f*ck up your nice simple interface by having the return type be something unintuitive, just so it can also indicate an exceptional condition. I.e. do a form of in-band signaling. 2) Add an extra pointer argument so getchar can pass back an error indicator. This, too, f*cks up the nice simple interface. 3) Have a dynamically-scoped exception mechanism, a la C++ and Java, and have getchar throw an exception if something exceptional happens. This preserves the simple interface, but requires a great deal more language support. Obviously the designers of C chose #1, probably because it required the fewest changes and the least thinking. But it also happens to be a choice which is highly prone to mistakes. You would be surprised by the number of professional C programmers who make the same mistake you made, and don't even realize they have a bug, or why. [QUOTE=Dubslow;299852] [code]int c; while( c=getc()...) { ...*out++ = (char)c; ...}[/code] Yes, the cast isn't necessary, but that's the Java showing through, where it is necessary, and it can't possibly hurt the C. [URL="http://publications.gbdirect.co.uk/c_book/chapter4/function_types.html"]Here[/URL], though talking about function declarations/prototypes and implicit return types (from "Old C"), the point is always declare your functions, and even though it's valid, it's better to explicitly return an int than just let the compiler assume it, because then the guy reading the code knows it's deliberate and not a mistake. That's also at least partially why I cast malloc()s. Being explicit doesn't hurt, and it's required in many other languages, so there's no reason not to do it in C. (This is also why I used sizeof(char) in the first attempt; even then I knew chars were one byte, but I was just being explicit. This time I was being lazy :razz:)[/QUOTE] I just have to disagree that casting can't possibly hurt. Here's how I think of it. 
If you really wanted a bare-metal language, you wouldn't really have to have types at all. You could just always indicate a size when you declare a variable and then do what you want with it. Well C is often described as a bare-metal language, and certainly compared to many languages it does tilt that way. But it does actually do a fair amount for you, and the biggest way it does this is through the type system. The type system allows the compiler to keep track of sizes and do address calculations so you don't have to (most of the time), and it includes a lot of safety mechanisms to prevent you from doing things that probably don't make sense. So for example, the types int * and long * are not assignment compatible. This is a good thing, because it prevents you from doing something like this: [code] int i; long *p; p = &i; *p = 5; [/code] If we assume that long is a wider type than int (which is not required but common these days), then that code would probably be disastrous. And the compiler can dutifully give you an error (in fact, it's required to "emit a diagnostic" for the first assignment). That is a good prompt that you should take a closer look. Now it sometimes happens that you really do know better than the compiler what needs to happen, and for such cases C provides an override. That override is the cast operator. You really want to have the above code? Fine, make the first assignment be [code] p = (long *)&i; [/code] (That may not work due to alignment considerations, another reason why the compiler warning/error would be helpful, but let's not worry about that now.) I.e. casting is a way of saying to the compiler "I know what I'm doing, let me do this." If you are in the habit of casting things all over the place, just because you can ("Hey, being explicit doesn't hurt"), then you are taking some type safety features of C out of the picture, and thereby preventing the compiler from helping you when it can.
[QUOTE=jcrombie;299872]Yes, that was a very brief comment. I see that his code used malloc() correctly this time and in fact would never actually encounter randomness. However, a program is in general a very dynamic thing and in general much larger in size. My comment was really about fortifying your code so that say you forget to set '\0' on the end of your string and most times your program works just fine because uninitialized memory usually is zeroes anyway, but just once in a while it isn't and crashes. But I guess that's what memory checkers are for.
[/QUOTE] No, no, memory checkers may be helpful, but I agree completely with your general point about avoiding uninitialized memory. Zeroing can be very effective for that (though I'll point out that there's no zeroing variant of realloc, so in this case it would be somewhat futile anyway). My point was just that your first post about this could have been confusing to Dubslow if he didn't know what you meant. [QUOTE=jcrombie;299872] My apologies for that. I saw the "*out = 0" and missed the asterisk. Yes, I see from Dubslow's link that the "STANDARD" allows for this. I still think this is most confusing though. sizeof(0) is 4 bytes on my machine. Same as sizeof(int). One could assume that 0 is in fact an integer. So, we are relying on an implicit cast across the "=" to promote this integer to the 8 bytes of a char* (for a 64-bit address). However, with this use of NULL, you could do sizeof(NULL) and get 8 bytes and believe you were just assigning something of pointer type to another thing of pointer type. The wrinkle in my logic, as you pointed out, is that according to the STANDARD, pointers can be arbitrary sizes and bit patterns and that simple model of how things work is no longer valid. Sorry, but the STANDARD [U]sucks[/U] if this is really true. Oh well, enough ranting. And next time I'll read the code twice before commenting.[/QUOTE] But how is this different from "long l = 0;"? sizeof(long) may be 8 and sizeof(0) may be 4, but I assume you don't have a problem with that? The point is that the compiler has to do *something* when it does an assignment across types, and it frequently does this quite transparently. In my example, the "something" is to widen the 0 value from 4 bytes to 8 when writing to memory. When the lhs is a pointer, the "something" is to simply write the bit pattern for a null pointer into the pointer variable's memory. I don't understand your point about sizeof(NULL), or indeed why it matters what sizeof(NULL) is. 
The compiler just takes care of this for you when it needs to, and you never have to worry about it. And as for the standard sucking, I quite disagree: if it worked the way you apparently want it to, then C would be limited to working on a particular kind of machine architecture (which happens to be about the only architecture still around, but still). The way the standard is written doesn't impose any such requirements on the machine architecture, and that's a good thing. And after all, there's nothing that prevents an implementation from using all-0s as the representation for a null pointer (and that's exactly what pretty much all modern machines/compilers do).