Literal Strings and Pointers
Question: "How does printf know when to stop?"

    printf("Hello, world\n");
    printf("The value of i is %i\n", i);
    printf("The area is %f\n", PI * radius * radius);

Reminder of the prototype:

    int printf(const char *format_string, ...);

where:

    format_string   A string (character array)
    ...             An optional list of expressions to print.
Strictly speaking, it is not a good idea to use NULL when talking about string terminators. NULL is a macro that expands to a null pointer constant, so it denotes a pointer, not a character. NUL is the ASCII name for the zero character; it may not be defined as a macro, and if it is, it is likely to be an escaped zero: '\0'. However, you will see NULL, NUL, and null being used interchangeably when talking about null-terminated strings. See the comp.lang.c FAQ, specifically section 5.9.
There is a subtle difference between a string and an array of characters. This is how the first literal string above would be laid out in memory:

    'H' 'e' 'l' 'l' 'o' ',' ' ' 'w' 'o' 'r' 'l' 'd' '\n' 0

Literal strings are much like character arrays in that they can be used with pointers. In this example, p is a char pointer (a pointer to char) and it points to the first element in the string:

    char *p = "Hello, world\n";

Visually:

    [Figure: p points at the 'H' that starts "Hello, world\n"]

We can print the string just as if it were a literal string:

    printf(p);

Using the %s format specifier to print strings:

    char *ph = "Hello";
    char *pw = "world";
    printf("%s, %s\n", ph, pw);

These three strings would look something like this (not necessarily adjacent in memory):
    [Figures: "Hello", "world", and "%s, %s\n", each followed by its terminating 0]

The terminating NUL (zero) character is very important when treating the array as a string. It is what tells printf when to stop:

    char *ph = "Hello";
    char w[] = {'H', 'e', 'l', 'l', 'o'};

    printf("%s\n", ph); /* OK, a string      */
    printf("%s\n", w);  /* Bad, not a string */

Output:

    Hello
    Hello¦¦¦¦¦¦¦¦¦¦¦<@B

Another attempt:

    /* Manually add the terminator to the array */
    char w[] = {'H', 'e', 'l', 'l', 'o', 0};

    /* Ok, now it's a string */
    printf("%s\n", w);

We could print strings "the hard way", by printing one character at a time:

    char *p = "Hello, world\n";

    while (*p != 0)
      printf("%c", *p++); /* Compact pointer notation */

After initialization:

    [Figure: p points at the 'H' at the start of the string]

After the while loop:

    [Figure: p points at the terminating 0]
Make sure that you fully understand the difference between the pointer and the value that the pointer is pointing to:

    char *p = "Hello, world\n";

    /* This is the correct condition */
    while (*p != 0)
      printf("%c", *p++);

    char *p = "Hello, world\n";

    /* INCORRECT */
    while (p != 0)
      printf("%c", *p++);

Output from the incorrect code (using gcc):

    Hello, world
    %c Hello world %s, %s %s The value of i is %i ? a @ @ hA x@ l@ xA ¤@ ¬@ ,@ E@ O@ è@ ?A ?A ?A $A 0 A I a`y a¤% a?~?aàS a&+ a,/ aZ1 aN5 a I Y| 5 __main F?_impure_ptr ·?calloc ï?cygwin_internal ??dll_crt0__FP11per_process e?free K? malloc >?printf ?realloc O?GetModuleHandleA @ @ @ @ @ @ @ @ @ cygwin1.dll ¶@ KERNEL32.dll
    118871 [main] a 1808 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
    118871 [main] a 1808 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
    119867 [main] a 1808 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
    119867 [main] a 1808 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
    810735 [main] a 1808 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
    841004 [main] a 1808 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)

The output before it crashed and burned when using Microsoft's compiler.
Note: When using printf to print strings, only the first string (the format string) is being interpreted. For example, this code:

    char *p1 = "%s%d";
    printf("A string with %%: %s\n", p1);

will print this:

    A string with %: %s%d

as none of the other arguments (p1 in this case) will have their % symbols evaluated. They will just be printed verbatim.
String Variables and Initialization
Initialization with character arrays:

    char s1[] = {'H', 'e', 'l', 'l', 'o'};    /* array of 5 chars */
    char s2[] = {'H', 'e', 'l', 'l', 'o', 0}; /* array of 6 chars */

Initializing with strings:

    char s3[] = "Hello"; /* array of 6 chars; 5 + terminator */
    char *s4 = "Hello";  /* pointer to a char; 6 chars in the "string"; 5 + terminator */

What is sizeof s1, s2, s3, s4? (Hint: What are the types?)

    [Figures: memory layout of s1, s2, s3, and s4]

Initializing with fewer characters:

    char s5[10] = {'H', 'e', 'l', 'l', 'o'}; /* array of 10 chars, 5 characters are 0 */
    char s6[8] = "Hello";                    /* array of 8 chars; 3 characters are 0 */
Given these declarations:

    char s[5]; /* array of 5 chars, undefined values */
    char *p;   /* pointer to a char, undefined value */

Use a loop to set each character and then print them out (assume i is an integer):

    /* Set each character to A - E */
    for (i = 0; i < 5; i++)
      s[i] = i + 'A';

    /* Print out the characters: ABCDE */
    /* Uses array notation             */
    for (i = 0; i < 5; i++)
      printf("%c", s[i]);
    printf("\n");

A different loop doing the same thing (assume c is an integer; see an ASCII chart for the character values):

    /* Set each character to A - E */
    for (c = 'A'; c < 'A' + 5; c++)
      s[c - 'A'] = c;

    /* Print out the characters: ABCDE */
    /* Uses pointer notation           */
    for (i = 0; i < 5; i++)
      printf("%c", *(s + i));

Do something similar with p:

    /* Print out the character that p points to */
    printf("%c", p[0]);
    printf("%c", *p);

You may get garbage, or it may crash:

    65 [main] a 2020 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
    22906 [main] a 2020 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
    65 [main] a 2020 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
    22906 [main] a 2020 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
    686199 [main] a 2020 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
    707734 [main] a 2020 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)

Set p to point at something first:

    /* Point p at s[0] */
    p = s;

Now print out the value:

    /* Print out the character that p points to */
    printf("%c", p[0]);
    printf("%c", *p);

In a loop, print out all the characters that p points to: ABCDE. These are both the same (due to the Basic Rule):

    for (i = 0; i < 5; i++)
      printf("%c", p[i]);

    for (i = 0; i < 5; i++)
      printf("%c", *(p + i));
String Input/Output
There's a convenient function for printing strings:

    int puts(const char *string);

The puts function will print a newline automatically. Examples:

Sample code:

    char *p1 = "Hello";
    char p2[] = "Hello";

    puts("Hello");  /* literal string  */
    puts(p1);       /* string variable */
    puts(p2);       /* string variable */
    puts("%s%i%d"); /* literal string  */

Output:

    Hello
    Hello
    Hello
    %s%i%d

There's also a convenient function for printing a single character:

    int putchar(int c);

Example:

Sample code:

    char c = 'H';
    char *p = "ello";

    putchar(c);      /* outputs one char, no newline */
    while (*p)
      putchar(*p++); /* outputs one char, no newline */
    putchar('\n');   /* print new line */

Output:

    Hello

For input, we can use this:
    char *gets(char *string);

Be aware that gets has no way to limit how many characters it stores, so long input overruns the array. It was removed from the language in C11; fgets is the usual replacement.

Example:

    char string[100];         /* 99 chars + NUL terminator */

    puts("Type something: "); /* prompt the user */
    gets(string);             /* read the string */
    puts(string);             /* print it out    */

Output (the first copy of the sentence is what the user typed):

    Type something:
    I am not a great fool, so I can clearly not choose the wine in front of you.
    I am not a great fool, so I can clearly not choose the wine in front of you.
To read a single character, there is this function:

    int getchar(void);

Example:

Sample code:

    int c = 0;
    while (c != 'a')
    {
      c = getchar(); /* read in a character   */
      putchar(c);    /* print out a character */
    }

Output (the first line is typed by the user):

    This is a string <NL>
    This is a (no newline)

Notice how the loop only printed part of the phrase that was typed in. The getchar function did not return until the user pressed the enter/return key. (All of the characters are buffered.) Then, the loop continued.
In C, a literal string has type array of char, which is used as a char *. In C++, it has type array of const char, which is used as a const char *. The const will help prevent errors that may occur due to writing to the read-only string pool. More on this later.
String Functions
Although strings are not truly built into the language, there are many functions specifically for dealing with NUL-terminated strings. You will need to include this:

    #include <string.h>

Here are four of the more popular ones. Familiarize yourself (i.e. practice) with them as you will be using them a lot in the near future.

Function Prototype | Description |
size_t strlen(const char *string); | Returns the length of the string, which is the number of characters in the string. It does not include the terminating 0. |
char *strcpy(char *destination, const char *source); | Copies the string pointed to by source into the array pointed to by destination. Destination must have enough space to hold the string from source, including its terminating 0. The return value is destination. |
char *strcat(char *destination, const char *source); | Concatenates (joins) two strings by appending the string in source to the end of the string in destination. Destination must have enough space to accommodate both strings. The return value is destination. |
int strcmp(const char *s1, const char *s2); | Compares two strings lexicographically (i.e. alphabetically). If s1 is less than s2, the return value is negative. If s1 is greater than s2, the return value is positive. Otherwise the return value is 0 (they are the same). UPPERCASE is considered different from lowercase. |
Sample implementations of strlen:

    size_t mystrlen1(const char *string)
    {
      size_t len;

      for (len = 0; *string != 0; string++)
        len++;

      return len;
    }

    size_t mystrlen2(const char *string)
    {
      size_t len = 0;

      while (*string++)
        len++;

      return len;
    }

    size_t mystrlen3(const char *string)
    {
      const char *start = string;

      /* Leaves string pointing at NUL byte */
      while (*string)
        string++;

      return (size_t)(string - start);
    }

    size_t mystrlen4(const char *string)
    {
      const char *start = string;

      /* Leaves string pointing at one past the NUL byte */
      while (*string++)
        ;

      return (size_t)(string - start - 1);
    }

Most compilers/libraries will have a highly optimized version of strlen (and other string-related functions), possibly even written in assembly code, so you should never need to write your own. The version in glibc (The GNU C Library) is one example. From my simple tests, it's about 2.5 to 3 times faster than any of the ones shown above. Some of the optimizations may depend on the architecture of the CPU, e.g. SSE (Streaming SIMD Extensions) and vectorization, which is certainly well beyond the scope of this course.
Self check: Using the above implementations of mystrlen as a guide, write your own version of mystrcpy and mystrcat.
The String Pool
Given the code below, the three variables p1, p2, and p3 live on the stack. The three (NUL-terminated) strings live in the string pool.

    #include <stdio.h>

    int main(void)
    {
      /* p1, p2, p3 are on the stack */
      char *p1 = "Hello";
      char *p2 = "Hello";
      char *p3 = "Hello";

      /* Display the address of each string (%p expects a void *) */
      printf("%p, %p, %p\n", (void *)p1, (void *)p2, (void *)p3);

      return 0;
    }

The string pool is an area of memory that contains all of the constant literal strings in the program. It is generally a read-only area of memory that is protected from being overwritten.

Here's a possible layout in memory (with arbitrary addresses):

    [Figure: p1, p2, and p3 each pointing at a separate copy of "Hello"]

And here's the output of the program:

    0x400652, 0x400652, 0x400652

What?!? All of the strings have the same address! That means that there is only one copy of "Hello" in the program. This is a more accurate diagram:

    [Figure: p1, p2, and p3 all pointing at a single copy of "Hello"]

This is an optimization that most, if not all, compilers implement. Since they are literal constants, they can never change, so it is totally acceptable to do this. If you have a large program with many strings that are the same, this can save a lot of memory.
There is only one string pool that is shared by all functions and files in a program. So, if the string "Hello" exists in other functions, or even in other files (in the same program), all of the copies can be merged into one string. Some compilers will provide a command line option to enable/disable this optimization.
For strings within a single file, GNU gcc will automatically merge identical strings and this cannot be disabled. For programs with multiple files, this is disabled by default. To enable it, you need the option:

    -fmerge-constants

This tells the compiler and linker to remove any duplicate strings. Here's a larger example with multiple functions and multiple files. This program has 13 occurrences of the string "Hello".

merge1.c:

    #include <stdio.h>

    /* prototypes */
    void f21(void); void f22(void); void f23(void); void f24(void);
    void f31(void); void f32(void); void f33(void); void f34(void);

    void f11(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f12(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f13(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f14(void) { char *p = "Hello"; printf("%p\n", (void *)p); }

    int main(void)
    {
      char *p = "Hello";
      printf("%p\n", (void *)p);

      f11(); f12(); f13(); f14();
      f21(); f22(); f23(); f24();
      f31(); f32(); f33(); f34();

      return 0;
    }

merge2.c:

    #include <stdio.h>

    void f21(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f22(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f23(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f24(void) { char *p = "Hello"; printf("%p\n", (void *)p); }

merge3.c:

    #include <stdio.h>

    void f31(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f32(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f33(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
    void f34(void) { char *p = "Hello"; printf("%p\n", (void *)p); }
Build the program:

    gcc -Wall -Wextra -ansi -pedantic merge1.c merge2.c merge3.c -o merge

There is a tool called strings (part of Cygwin on Windows, built into Linux and Mac) that displays the strings used in a program:

    strings merge

This produces about 78 lines of output on my Linux computer. Since I'm only interested in the strings that are Hello, I can filter the output:

    strings merge | grep Hello

and this is the output:

    Hello
    Hello
    Hello

This tells me that there are three Hello strings in the program. The reason is that there is one for each of the three files. Now, if I execute the program, this is the output:

    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x40081e
    0x40081e
    0x40081e
    0x40081e
    0x400828
    0x400828
    0x400828
    0x400828

You can see there are three different addresses. The first five are from merge1.c, the second four are from merge2.c, and the last four are from merge3.c.
If I build it like this (with the appropriate option):

    gcc -Wall -Wextra -ansi -pedantic merge1.c merge2.c merge3.c -o merge -fmerge-constants

Then when I run the strings program I just get this:

    Hello

and executing the program gives this output:

    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814
    0x400814

Clearly, there is now only one copy of the string Hello in the entire program.

It is the linker that is doing the merging across files, since the compiler can only see one file at a time.
Here's another example that shows how clever the compiler and linker can be at times:

    #include <stdio.h>

    int main(void)
    {
      char *p1 = "123456";
      char *p2 = "23456";
      char *p3 = "3456";

      printf("%p, %p, %p\n", (void *)p1, (void *)p2, (void *)p3);

      return 0;
    }

Building the program:

    gcc -Wall -Wextra -ansi -pedantic pool2.c -o pool2

Then run strings on it:

    strings pool2 | grep 3456

And the output:

    123456
    23456
    3456

This is probably as expected, since there are three different strings. If we execute the program, we will see three distinct addresses:

    0x4005e4, 0x4005eb, 0x4005f1

However, if I include the -fmerge-constants option and then run strings we get this:

    123456

What happened to the other two strings (23456 and 3456)? Executing the program gives this output:

    0x4005e4, 0x4005e5, 0x4005e6

There are still three distinct addresses, but what do you notice about them? This is another way that the compiler/linker can optimize for memory.
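This is what is happening (with arbitrary addresses):

    [Figure: a single copy of "123456" in memory; p1 points at the '1', p2 at the '2', and p3 at the '3']

Again, because these strings are literal constants, there is no way they can change, so doing this is fine. Also, realize that the compiler/linker can't help with this:

    #include <stdio.h>

    int main(void)
    {
      char *p1 = "123456";
      char *p2 = "2345";
      char *p3 = "34";

      printf("%p, %p, %p\n", (void *)p1, (void *)p2, (void *)p3);

      return 0;
    }

This is because the second and third strings ("2345" and "34") stop before the end of "123456": they don't include every character up to the NUL character, so their terminators would have to sit in the middle of the longer string.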
You don't necessarily have to use the -fmerge-constants command line option. Any optimization option (e.g. -O, -O1, -O2, -O3, or -Os) will enable this feature. If you need to force the compiler/linker to NOT merge strings:

    -fno-merge-constants

From my research on -fmerge-constants:

    From GNU gcc documentation: Options That Control Optimization
So what happens if you do attempt to modify a string in the pool?

    int main(void)
    {
      char *p1 = "Hello"; /* The "Hello" string is in the string pool.      */
      *p1 = 'C';          /* Change first char to 'C', now it's "Cello".    */
      return 0;
    }

Output:

    Segmentation fault

This means that something bad happened. Essentially, you are trying to write to a read-only section of memory and the operating system is terminating the program immediately. Running it under a memory debugger (Valgrind) gives a little more information:

    ==26788==
    ==26788== Process terminating with default action of signal 11 (SIGSEGV)
    ==26788==  Bad permissions for mapped region at address 0x4005C4
    ==26788==    at 0x400526: main (pool3.c:4)
    Segmentation fault

The "Bad permissions" basically means that the area of memory is marked as read-only, but we are trying to write to it. Just as you can have read-only files on the disk, you can have read-only memory.
As a reminder:

    char *p1 = "Hello";       /* OK in C, Warning in C++. */
    const char *p2 = "Hello"; /* OK in both C and C++.    */

    *p1 = 'C'; /* Unsafe in both C and C++. */
    *p2 = 'C'; /* Error in both C and C++.  */

To be on the safe side, and to share code with C++, you should use the const keyword so the compiler can warn you if you do something potentially dangerous. (The const keyword wasn't present in the original C compilers, so that's why C accepts the dangerous code.)