References: 1) 'Java 1.2 Developer's Handbook', Heller & Roberts, 2) 'Just Java', P. van der Linden,
3) 'Unicode support in Solaris Operating Systems', a Sun Site white paper, URL noted below
IO stands for input and output. The use of a computer depends on the input and
output process. Information has to be entered into the machine for it to process.
Information might be entered into the computer from the keyboard. Data can also
be supplied to the computer from a file. Initially the standard means of outputting
data from a computer was to send the processed information to a printer. The
monitor has since become the commonest means of viewing a computer's
output. Frequently data is input from files located on hard drives. When a program
has finished, the results are typically output to a file for storage. Java supplies the io
package and its classes to provide Java programs with the capability to input
and output data.
Streams
It is easy to picture a stream of water running
from a lake down a grade to a
small pond. The lake is the source for the stream
of water. The pond is the
destination. Drawing the comparison to the world
of computers, the water is
the carrier of data or information. The source
might be a data source such as
a file or a data structure in a program. The
destination might be another file,
or an internet connection. In the water metaphor,
the stream runs through
a trench or perhaps is carried in pipes or aquifers.
In Java, the conduit for
data is supplied by the different classes of
the io package.
Encoding
Before streams can be used effectively, there has to be agreement on both
ends of a transmission about just what the data means. A series of 1s and 0s can
be interpreted in a number of ways. Do you count 8, 16, 24 or 32 bits per
character? In what order are the bits to be interpreted? Are the most significant
bits first or last in the transmission?
ASCII and ISO-8859-1
Encoding describes the schemes used to translate characters into binary bit
patterns, represented by 0s and 1s. The ASCII capital letter 'A' has the decimal
value 65. In binary this can be described in a single byte as 01000001. Using
the low seven bits of a byte to describe 128 characters gives the original ASCII
character set. Using all 8 bits to describe 256 characters gives a character set called
ISO-8859-1 by the ISO standards organization.
As time went on, different characters were mapped to the numeric values
of the byte to derive other character sets (e.g. ISO-8859-9). A bewildering array
of these character sets has been created, describing not just different languages
but also different platform versions of each of these languages. Soon schemes
were sought to bring these character sets into single manageable groups.
Unicode and UCS-2
One of these was Unicode, which uses 2 bytes to describe 65,536 character values.
This is the character set Java uses. Java was internationalized from its inception.
Unicode currently has about 35,000 of its values assigned to the characters that
make up the world's languages. Unicode is the same set, by a different name, as
UCS-2. UCS is an abbreviation for Universal Character Set and is an ISO
standard. Like Unicode, UCS-2 uses two bytes to describe each character.
UCS-4
Though it is likely that Unicode will supply everyone's character set needs for some
time to come, ISO recognizes that in addition to spoken languages there are
non-spoken languages used in mathematics, science and commerce. There are
experimental invented languages. There are also many dialects in the world being
discovered or just now being committed to script. Beyond these are ancient dialects
that are being discovered and need representation. Considering these facts, it is
apparent that Unicode is too small! The larger character set endorsed by ISO
is UCS-4, which takes 4 bytes for each character it describes. It is also referred to
as the ISO-10646 character set.
UTF-8 and UTF-16
As you can imagine, if you were sending a stream of data halfway around the world
in UCS-4 encoding, but all your characters were ASCII values, you would be sending
3 blank bytes along with the one byte that held the ASCII information. This is a very
inefficient use of bandwidth. (Bandwidth in this context is the number of bytes of data
used per character.)
UTF-8 and UTF-16 are clever schemes that allow a variable number of bytes of data
to be used depending on what type of character is being sent. UTF stands for Universal
Character Set Transformation Format. UTF first determines what kind of character
it is sending. ASCII always has the most significant bit empty. UTF-8 reads this bit
and sees that only a single byte needs to be used to send an ASCII character. For Unicode,
one to three bytes are used. For UCS-4 up to six bytes are needed to send a character.
This implies that UTF-8 may be more or less efficient, in terms of bandwidth used, than
transmissions based solely on fixed-length character sets.
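The variable-length behaviour can be checked from Java itself. The following is a minimal sketch, assuming Java 7 or later (for java.nio.charset.StandardCharsets); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as UTF-16 (big-endian, so no byte-order mark is added).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // An ASCII letter, a Latin-1 letter (e-acute) and the euro sign.
        String[] samples = { "A", "\u00E9", "\u20AC" };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.charAt(0)).toUpperCase()
                    + ": UTF-8 uses " + utf8Length(s)
                    + " byte(s), UTF-16 uses " + utf16Length(s) + " byte(s)");
        }
    }
}
```

The three samples need 1, 2 and 3 bytes respectively in UTF-8, while each needs 2 bytes in UTF-16.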
The good news is that character sets were devised to be backwards compatible with
earlier character sets, so ASCII is a subset of ISO-8859-1. ISO-8859-1 is a
subset of UCS-2 or Unicode. UCS-2 or Unicode are subsets of UCS-4.
Table depicting character set relationships (outermost set: UCS-4 // ISO 10646)
Common Data Formats
ASCII | American Standard Code for Information Interchange | 7 bits, [1 byte] | 128 mostly readable characters |
ISO 8859-1 | 256 ISO character code | 8 bits, [1 byte] | adds many non-English characters |
Unicode | synonymous with UCS-2 | 16 bits, [2 bytes] | most of the world's characters |
UCS-2 | Universal Character Set two byte encoding | 16 bits, [2 bytes] | 1st plane of ISO/IEC 10646 in two bytes (0 to 64K) |
UCS-4 | Universal Character Set four byte encoding | 32 bits, [4 bytes] | Full ISO/IEC implementation in 4 bytes // ISO 10 |
UTF-8 * | UCS Transformation Format, versatile but complex | [1 to 6 bytes] | if 1st bit is 0 --> 1 byte (ASCII); if 1st bits are 110 --> 2 bytes; if 1st bits are 1110 --> 3 bytes, etc. |
UTF-16 | extended variant of UCS-2 | [2 to 4 bytes] | |
binary | data transfer in numeric form | [1 to 8 bytes] | binary version of Java chars |
objects | streaming Java objects | [variable length] | the serialization process |
Note: ASCII is a subset of ISO
8859-1 which is a subset of Unicode/(UCS-2)
which is a subset of UCS-4/(ISO/IEC 10646)
* A Table describing Details of UTF-8 encoding
Bits | Hex Min | Hex Max | UTF-8 Binary Encoding |
7 | 00000000 | 0000007F | 0xxxxxxx |
11 | 00000080 | 000007FF | 110xxxxx 10xxxxxx |
16 | 00000800 | 0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
21 | 00010000 | 001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
26 | 00200000 | 03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
31 | 04000000 | 7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
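The first three rows of the table can be turned into code fairly directly. The sketch below is illustrative only: it hand-encodes values up to 0xFFFF and compares the result with the library's own UTF-8 encoder. The class name Utf8ByHand and its methods are hypothetical, and Java 7 or later is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes one value in the range 0..0xFFFF using the 1, 2 and 3 byte patterns.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0xFFFF) {
            throw new IllegalArgumentException("only 0..0xFFFF is handled in this sketch");
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };                  // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                    // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))                   // 10xxxxxx
            };
        } else {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                   // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),           // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                   // 10xxxxxx
            };
        }
    }

    // Compares the hand-made bytes with the library encoder.
    // Surrogate values (0xD800..0xDFFF) are not valid characters on their own and are not expected to match.
    static boolean matchesLibrary(char c) {
        byte[] expected = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(expected, encode(c));
    }

    public static void main(String[] args) {
        char[] samples = { 'A', (char) 0xE9, (char) 0x20AC };
        for (char c : samples) {
            System.out.println(Integer.toHexString(c) + " matches library encoder: " + matchesLibrary(c));
        }
    }
}
```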
Character Set Translation between Java & the Operating System
As mentioned earlier, Java uses Unicode internally. The operating system that Java
is running on may not be using Unicode. Solaris, for instance, may be using ASCII,
ISO 8859-1 or UTF. Mac uses ASCII coupled with a proprietary character format.
NT can use Unicode, ASCII or UCS. The Java IO functions take care of translating
between the character set(s) being used by the underlying operating system and
the Unicode character set Java uses.
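As a rough illustration of this translation, the sketch below decodes the same letter from two different byte encodings by naming the character set explicitly. It assumes Java 7 or later; the class name DecodeDemo is hypothetical, and in-memory byte arrays stand in for data arriving from the operating system.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Reads every byte through an InputStreamReader that decodes with the given character set.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = { (byte) 0xE9 };               // e-acute as one ISO-8859-1 byte
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 };    // e-acute as two UTF-8 bytes
        String a = decode(latin1, StandardCharsets.ISO_8859_1);
        String b = decode(utf8, StandardCharsets.UTF_8);
        System.out.println("Same Java char from both encodings: " + a.equals(b));
        System.out.println("Platform default charset: " + Charset.defaultCharset());
    }
}
```

When no character set is named, InputStreamReader falls back to the platform default reported by Charset.defaultCharset( ).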
For more details on character encoding you may wish to see the white
paper on
Unicode support in Solaris Operating Systems at
www.sun.com/software/white-papers/wp-unicode/#1014899
The Package Gets Bigger
When Java was first released it had a set of IO classes that were based on working
with bytes of data. Unfortunately, when this byte-based system of classes was put
into service with the various character encodings described earlier, problems arose.
The fix that was taken was simple. Wherever applicable, for each byte-based stream
class, a char-based counterpart was created. The new set of classes was designed
to work with 2-byte Unicode characters instead of bytes. So, in addition to the 8-bit
InputStream/OutputStream classes of JDK 1.0, the Reader/Writer classes were added
in JDK 1.1.
This has made the io package larger by 13 classes. The io package had already been
criticized for creating more classes than were needed to provide input and output services.
In any case, it is what it is. Although it is a little more work to learn than perhaps it
needed to be, once you get used to the io package, you will find it easy to use.
Reader and InputStream read in, while Writer and OutputStream write out. The
8-bit streams are still used to exchange binary data in many of the Java APIs.
There are three streams that are automatically opened.
1) System.in // reads bytes from the keyboard
2) System.out // writes bytes to the screen, famous for System.out.println( )
3) System.err // separate output stream to the screen for reporting errors
Each has a corresponding set method in the System class, setIn( ), setOut( ) and setErr( ), to redirect the stream.
Example showing setIn( ) & setOut( ) // from 'Just Java', P. van der Linden, page 615
FileOutputStream fs1 = new FileOutputStream("stdlog.txt");
System.setOut(new PrintStream(fs1)); // System.out now writes to stdlog.txt
FileInputStream fi = new FileInputStream("input.txt");
System.setIn(fi); // System.in now reads from input.txt
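A self-contained variation on the same idea is sketched below. Instead of a file, it redirects System.out into memory so the result can be inspected, and then restores the original stream. The class name RedirectDemo and the capture( ) helper are hypothetical, and the lambda assumes Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class RedirectDemo {
    // Runs the task with System.out redirected into memory and returns what was printed.
    static String capture(Runnable task) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream replacement = new PrintStream(buffer, true); // true = flush automatically
        System.setOut(replacement);
        try {
            task.run();
        } finally {
            replacement.flush();
            System.setOut(original); // always put the real stream back
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String text = capture(() -> System.out.print("logged"));
        System.out.println("Captured: " + text);
    }
}
```

Restoring the original stream in a finally block keeps later output visible even if the task throws.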
One nice feature of the stream classes is that they follow predictable naming patterns.
The stream class names can be determined from the categories they fall into.
1) Stream width, 8 or 16 bit // InputStreams & OutputStreams are 8 bit, Readers & Writers are 16 bit
2) Source / Destination or Function // for example, File, ByteArray, Pushback etc.
3) Direction of Input or Output // InputStream & Reader are IN, OutputStream & Writer are OUT
Most of the time you can deduce the name of the class you need by deciding the
source/destination or function, whether you need an 8 or 16 bit stream, and which I/O
direction you are going in. So to write bytes out to a byte array you would use
ByteArray + OutputStream or ByteArrayOutputStream. To read in a file of 16-bit
char types you would use File + Reader or FileReader.
Basic Methodology
To use the IO classes the following steps are followed.
1) Select the appropriate I/O class for the objective.
2) Instantiate the class using the most appropriate constructor.
3) High-level streams may be layered over low-level streams by making the
low-level stream objects the arguments to the high-level stream's constructor.
4) Call the appropriate read( ) or write( ) methods on the top-level stream.
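The steps above (select, instantiate, layer, call) might look like the following in practice. This is only a sketch: the class name LayeringDemo is hypothetical, a temporary file is used as scratch space, and try-with-resources assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class LayeringDemo {
    // Writes one line of text to a scratch file and reads it back again.
    static String roundTrip(String line) throws IOException {
        File file = File.createTempFile("layering", ".txt"); // scratch file for the demo
        file.deleteOnExit();

        // Low-level FileWriter opens on the file; high-level BufferedWriter is layered over it.
        try (BufferedWriter out = new BufferedWriter(new FileWriter(file))) {
            out.write(line);   // the write call goes to the top-level stream
            out.newLine();
        }

        // The same layering in the input direction.
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello streams"));
    }
}
```

Closing the top-level stream also closes the stream it wraps, which is why only the outermost object is managed by each try block.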
In general, the methods of the stream classes throw IOException. The methods that do
not throw exceptions are involved with processes other than streaming. IOException is
the parent exception of 15 specialized exception classes in the io package, which are
described in the JDK documentation. The try{ }catch( ){ } construct is usually present
when doing IO to handle the potential of exceptions being thrown (unless you elect to
pass the exception on to the calling method with a throws clause).
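A small sketch of the usual pattern follows. It catches the specialized FileNotFoundException before its parent IOException. The class name ExceptionDemo and the path used in main are made up for illustration, and try-with-resources assumes Java 7 or later.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

class ExceptionDemo {
    // Tries to open the file and read one byte, reporting what happened as text.
    static String tryRead(String path) {
        try (FileInputStream in = new FileInputStream(path)) {
            int b = in.read();
            return b == -1 ? "empty file" : "read one byte";
        } catch (FileNotFoundException e) {
            // The more specific subclass has to be caught before IOException.
            return "file not found";
        } catch (IOException e) {
            return "I/O error";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRead("no_such_directory/no_such_file.txt"));
    }
}
```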
Note the abstract superclasses are not all that abstract! Only one or two methods in each
class are abstract and require implementation. Both Reader and Writer have an additional
constructor which takes an Object instance as an argument. The object's lock is used to
synchronize thread access to shared code in a multithreaded environment. The methods
contained in the 8-bit classes are mirrored in the 16-bit classes, distinguished primarily by
the argument types, being byte for InputStream/OutputStream and char for Reader/Writer methods.
InputStream
abstract int read( ) | reads one byte from a source, returning the value in the low-order 8 bits of an int |
int read(byte[] dest) | reads bytes from source into dest array, returning an int value describing the number of bytes read |
int read(byte[] dest, int offset, int length) | reads up to length bytes into dest array, beginning at offset. All three forms of read return -1 when no more data is available |
void close( ) | releases system resources associated with the source, e.g. the file descriptor |
int available( ) | returns the number of bytes that can be read or skipped from the given input stream without blocking |
long skip(long nbytes) | attempts to skip and discard nbytes, returns the number actually skipped |
boolean markSupported( )* | returns true if mark/reset mechanism is supported |
void mark(int readlimit)* | sets a mark in the input stream |
void reset( ) | resets the stream to repeat the read from mark |
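The array form of read( ) is normally called in a loop until it returns -1, because each call may deliver fewer bytes than the array can hold. The following sketch shows that loop. The class name ReadLoopDemo is hypothetical, and a ByteArrayInputStream stands in for a real source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadLoopDemo {
    // Drains the stream into a byte array using read(byte[]) until -1 signals end of data.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[4]; // deliberately tiny so the loop runs several times
        int n;
        while ((n = in.read(chunk)) != -1) {
            collected.write(chunk, 0, n); // only the first n bytes of chunk are valid
        }
        return collected.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        InputStream in = new ByteArrayInputStream(data);
        System.out.println("available before reading: " + in.available());
        System.out.println("bytes read: " + readAll(in).length);
    }
}
```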
OutputStream
abstract void write(int b) | writes the byte in the low-order 8 bits of b, discarding the high 24 |
void write(byte b[]) | writes an array of bytes, b |
void write(byte b[], int offset, int length) | writes a subset of array b, length bytes long, starting at offset |
void flush( ) | writes out any bytes which may have been buffered |
void close( ) | releases system resources associated with the stream |
Reader
int read( ) | reads one character from source, returned in the low-order 16 bits of an int |
int read(char[] dest) | reads characters from source into dest array, returns the number read |
abstract int read(char[] dest, int offset, int length) | reads up to length chars into array dest, beginning at offset. All three forms of read return -1 when no more data is available |
abstract void close( ) | releases system resources associated with source, e.g. the file descriptor |
long skip(long nchars) | attempts to skip and discard nchars, returns the number actually skipped |
boolean markSupported( )* | returns true if mark/reset mechanism is supported |
void mark(int readlimit)* | sets a mark in the input stream |
void reset( ) | resets the stream to repeat the read from mark |
boolean ready( ) | returns true if stream has data immediately available (so read( ) won't block) |
Writer // note Writer has two extra write methods that take a String or a String subset
void write(int c) | writes char in low-order 16 bits of argument |
void write(char[] c) | writes an array of characters |
abstract void write(char[] c, int offset, int length) | writes a subset of an array of characters |
void write(String s) | writes a string |
void write(String s, int offset, int length) | writes a subset of String s of given length from offset |
abstract void flush( )* | writes out any characters the stream has buffered |
abstract void close( ) | releases system resources associated with the stream |
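The different write( ) forms can be tried out against java.io.StringWriter, a Writer that collects its output in memory. This is a sketch only, and the class name WriterDemo is hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

class WriterDemo {
    // Exercises each write( ) form of Writer and returns the accumulated text.
    static String build() throws IOException {
        Writer out = new StringWriter();          // in-memory destination
        out.write('J');                           // write(int c)
        out.write(new char[] { 'a', 'v' });       // write(char[] c)
        out.write("xxaxx", 2, 1);                 // write(String s, int offset, int length) writes "a"
        out.write(" IO".toCharArray(), 0, 3);     // write(char[] c, int offset, int length)
        out.write("!");                           // write(String s)
        out.flush();
        return out.toString();                    // StringWriter returns what was written
    }

    public static void main(String[] args) throws IOException {
        System.out.println(build()); // prints "Java IO!"
    }
}
```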
Miscellaneous
1) available( ) is replaced by ready( ) in Reader.
2) Because Reader and Writer methods convert native codesets properly, they should be used
when processing character data.
3) read( ) methods return an int, allowing for both 16-bit char values (0 to 0xFFFF) and EOF,
which is the int value -1 (0xFFFFFFFF).
4) The number of bytes read by a method is dictated by the system's default encoding, e.g.
(i) if ASCII, one byte is read and promoted internally to two-byte Unicode.
(ii) if Unicode, two bytes are read and no conversion is needed.
(iii) if UTF is in effect, 1 to 3 bytes are read, and the corresponding Unicode (Java) char is assembled.
5) Currently the encoding in effect for a file cannot be changed.
Survey of the IO classes
Ignoring the Exception classes, IO classes can be grouped by whether they deal
with 8 or 16-bit data types, and whether they are designed to do input or output. Of
the stream classes, a couple are unique and fall into a separate category. Further, there
are a number of non-stream type classes that form another division. Finally, within the
general divisions there are low-level and high-level stream classes. Low-level streams
open on specific sources or destinations while high-level streams open on other streams.
The tables below categorize the io classes along the lines just described. Notice there
is not perfect symmetry. Sometimes stream classes fall into one or two categories instead
of having a version in all four general divisions.
8 Bit Input Output Stream Classes
Low Level InputStreams | Low Level OutputStreams |
FileInputStream, ByteArrayInputStream, PipedInputStream | FileOutputStream, ByteArrayOutputStream, PipedOutputStream |
High Level InputStreams | High Level OutputStreams |
BufferedInputStream, DataInputStream, PushbackInputStream | BufferedOutputStream, DataOutputStream, PrintStream |
16 Bit Reader Writer Classes
Low Level Readers | Low Level Writers |
FileReader, CharArrayReader, PipedReader, StringReader | FileWriter, CharArrayWriter, PipedWriter, StringWriter |
High Level Readers | High Level Writers |
BufferedReader, PushbackReader, LineNumberReader | BufferedWriter, PrintWriter |
Specialty Classes
Unique Stream Classes | Non-stream based IO Classes |
SequenceInputStream, InputStreamReader, OutputStreamWriter, ObjectInputStream & ObjectOutputStream | File, RandomAccessFile, StreamTokenizer, FileDescriptor |
Low-Level Stream Classes // open on files, arrays, strings, pipes
FileInputStream | a byte-based input stream that reads from a file |
FileOutputStream | a byte-based output stream that writes to a file |
FileReader | a character-based stream for reading from a file |
ByteArrayInputStream | takes input from a byte array or a subset of a byte array |
ByteArrayOutputStream | writes to a byte array |
CharArrayReader | reads characters from a character array |
StringReader | reads characters from a string |
// StringReader has no corresponding 8-bit stream. There was a StringBufferInputStream,
// but it is deprecated as it does not properly convert characters into bytes.
// The recommended way to create a stream from a string is via the StringReader class.
PipedInputStream | reads bytes from a corresponding piped output stream |
PipedOutputStream | writes bytes to a corresponding piped input stream |
PipedReader | reads characters written to a corresponding piped writer |
// Piped streams are used for inter-thread communication in multithreaded environments.
High-Level Stream Classes // take their input from other streams
BufferedInputStream | uses an internal byte array to buffer data read from source |
BufferedOutputStream | collects byte data until buffer is full, then writes in one operation |
BufferedReader | uses an internal character array to buffer data read from source |
DataInputStream | reads bytes and translates them into primitives, char arrays & strings |
DataOutputStream | writes primitive data types, strings and byte arrays to output stream |
// DataInputStream and DataOutputStream have no Reader/Writer equivalents
LineNumberReader | maintains a count of the number of lines it has read // BufferedReader subclass |
// LineNumberInputStream is deprecated since JDK 1.1. It had no writer counterpart.
PrintStream | supports writing text // deprecated except for System.out, a static PrintStream instance in the System class |
PrintWriter | has all of PrintStream's methods except that it writes characters, not bytes |
PushbackInputStream | allows byte(s) read to be pushed back to source |
PushbackReader | allows char(s) to be pushed back to source |
// both use internal stacks; there are no corresponding subclasses of OutputStream or Writer
SequenceInputStream | combines two or more input streams // takes two streams as args or an enumeration |
// SequenceInputStream has no counterparts and no corresponding opposite classes
InputStreamReader | takes an InputStream subclass, reads bytes and converts them to chars |
ObjectOutputStream | writes serialized objects to stream |
ObjectInputStream | reads serialized objects from stream |
// ObjectOutputStream and ObjectInputStream have no corresponding Reader/Writer classes
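To give a feel for how the high-level classes stack up, the sketch below joins two in-memory byte sources with SequenceInputStream, converts the bytes to characters with InputStreamReader, and reads the result as a line with BufferedReader. The class name SequenceDemo is hypothetical and Java 7 or later is assumed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;

class SequenceDemo {
    // Reads two byte sources back to back as if they were a single line of text.
    static String concat(String first, String second) throws IOException {
        InputStream a = new ByteArrayInputStream(first.getBytes(StandardCharsets.US_ASCII));
        InputStream b = new ByteArrayInputStream(second.getBytes(StandardCharsets.US_ASCII));
        // Three layers: sequence of bytes, then bytes to chars, then buffered line reading.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new SequenceInputStream(a, b), StandardCharsets.US_ASCII))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(concat("high-", "level")); // prints "high-level"
    }
}
```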
File class
In Java, file meta-data (information about a given file but not its contents) is returned by
methods of the File class. The File class doesn't itself do any IO. The File class
1) provides a File object which a file can be opened on.
2) tests whether a file exists, or can be read or written.
3) tests whether a File object represents a file or a directory.
4) returns the number of bytes in a file and when the file was last modified.
5) has methods to create and delete files and directories.
Some of File's methods // see JDK API for fuller descriptions
public int compareTo(java.io.File) | lexicographical comparison, returning < 0, 0 or > 0 |
public static File createTempFile(String1, String2, File) throws IOException | File is the dir where the temp file will be created; String1 --> file name prefix (min. 3 chars long), String2 --> suffix |
public void deleteOnExit( ) | marks file to be deleted when program ends (can't rescind call) |
public boolean mkdir( ) | creates one directory |
public boolean mkdirs( ) | as mkdir( ) plus any needed parent directories |
public java.lang.String[] list( ) | returns a String array of the files & dirs contained in the dir on which the method has been invoked |
public java.io.File[] listFiles( ) | same as list( ) but returns File objects, not Strings |
public static java.io.File[] listRoots( ) | lists the available filesystem roots (e.g. "A:\", "C:\" etc.) |
public java.net.URL toURL( ) throws MalformedURLException | puts File object into URL form (file:///something) (if the File object is a directory it ends in a slash "/") |
public boolean createNewFile( ) throws java.io.IOException | creates a new file if it doesn't already exist and returns true if a new file has been created |
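A few of these methods are exercised in the sketch below. The class name FileInfoDemo and the "demo"/".tmp" prefix and suffix are arbitrary choices for illustration. Passing null as the directory asks for the default temporary-file directory.

```java
import java.io.File;
import java.io.IOException;

class FileInfoDemo {
    // Summarizes the meta-data of a File object without reading its contents.
    static String describe(File f) {
        return f.getName() + " exists=" + f.exists() + " file=" + f.isFile()
                + " dir=" + f.isDirectory() + " bytes=" + f.length();
    }

    // Creates a temporary file, checks its meta-data, and deletes it again.
    static boolean createAndDelete() throws IOException {
        File temp = File.createTempFile("demo", ".tmp", null);
        boolean looksRight = temp.exists() && temp.isFile() && temp.length() == 0;
        boolean deleted = temp.delete();
        return looksRight && deleted && !temp.exists();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(describe(new File(".")));
        System.out.println("temp file created and deleted: " + createAndDelete());
    }
}
```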
FileDescriptor
Instances of the file descriptor class serve as an opaque handle to
the underlying machine-
specific structure representing an open file, an open socket, or another
source or sink of
bytes. The main practical use for a file descriptor is to create a
FileInputStream or
FileOutputStream to contain it. Applications should not create their
own file descriptors.
// info straight from the JDK API docs
RandomAccessFile
RandomAccessFile supports reading and writing as well as pointer positioning. It is not
based on the streaming model, so it cannot be chained or layered with other streaming
classes. RandomAccessFile has the same wide variety of reading and writing methods that
DataInputStream and DataOutputStream have.
RandomAccessFile also has the methods,
void seek(long position) | sets the position of the file pointer |
long getFilePointer( ) | returns the current location of the file pointer |
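The sketch below shows seek( ) and getFilePointer( ) in use. It writes five ints to a scratch file and then jumps straight to the third one. The class name RandomAccessDemo and the helper readIntAt( ) are hypothetical, and try-with-resources assumes Java 7 or later.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class RandomAccessDemo {
    // Positions the file pointer at the given int slot (4 bytes per int) and reads it.
    static int readIntAt(RandomAccessFile raf, int index) throws IOException {
        raf.seek(index * 4L);
        return raf.readInt();
    }

    // Writes 0, 10, 20, 30, 40 to a scratch file and returns the value in slot 2.
    static int demo() throws IOException {
        File file = File.createTempFile("random", ".bin");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            for (int i = 0; i < 5; i++) {
                raf.writeInt(i * 10);
            }
            System.out.println("pointer after writing: " + raf.getFilePointer());
            int value = readIntAt(raf, 2);
            System.out.println("pointer after reading slot 2: " + raf.getFilePointer());
            return value;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("value in slot 2: " + demo());
    }
}
```

After the five writes the pointer sits at byte 20, and after reading slot 2 it sits at byte 12.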
StreamTokenizer
StreamTokenizer parses input into tokens, usually from inside a while loop, calling nextToken( )
until the end of the input is reached. The int returned by nextToken( ) describes the type of the
next token (as described by static fields such as TT_NUMBER, TT_WORD, TT_EOL &
TT_EOF). Many methods are available to recognize various user-specified characters.
StreamTokenizer is particularly suited for parsing Java, C or C++.
// note there is also a StringTokenizer but it is in the java.util package and not a part of the
// IO package. It is described as being easier to use than StreamTokenizer so it might be worth
// visiting for certain applications
example from the JDK 1.3 documentation
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens( )) {
    System.out.println(st.nextToken( ));
}
// prints the following output:
this
is
a
test
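For comparison, a StreamTokenizer loop of the kind described above might look like the following sketch. It relies on the tokenizer's default syntax settings, and the class name TokenizerDemo and the labels it produces are invented for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TokenizerDemo {
    // Splits the text into tokens and labels each one by its type.
    static List<String> tokens(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> found = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_WORD:
                    found.add("word:" + st.sval);      // words arrive in sval
                    break;
                case StreamTokenizer.TT_NUMBER:
                    found.add("number:" + st.nval);    // numbers arrive in nval as a double
                    break;
                default:
                    found.add("char:" + (char) st.ttype); // ordinary single characters
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens("width = 42"));
    }
}
```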
The following examples show the basic approach the io classes use to stream data.
1) Opening an input stream to read bytes from a file
file bytes --> retrieved via FileInputStream --> to bytes --> of an object in memory
// getting bytes from a file for a program's use
example FileInputStream fis = new FileInputStream("disk_file.txt");
// once created, the object's read methods can be called to access the data of the file
2) Opening an output stream to stream bytes to a file
From an object reference representing data --> FileOutputStream --> bytes --> to a file
// storing bytes produced in a program to a file
example FileOutputStream fos = new FileOutputStream("disk_file.txt");
// once created, the object's write methods can be called to send bytes to file
3) Opening a high order stream on a low order stream
bytes in a disk file --> FileInputStream --> bytes --> DataInputStream --> ints, doubles etc.
or
ints, doubles etc. --> DataOutputStream --> bytes --> FileOutputStream --> bytes in a disk file
example
FileInputStream fis = new FileInputStream("disk_file.txt");
DataInputStream dis = new DataInputStream(fis);
FileOutputStream fos = new FileOutputStream("disk_file.txt");
DataOutputStream dos = new DataOutputStream(fos);
// chained or coupled streams where a 'high order' stream takes the 'low level' stream as input
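The chained-stream idea can be tried end to end with the sketch below. To stay self-contained it swaps the file streams for in-memory byte array streams; the layering of DataOutputStream and DataInputStream over a low-level stream is otherwise the same. The class name DataRoundTrip is hypothetical and Java 7 or later is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class DataRoundTrip {
    // Writes an int, a double and a string as bytes, then reads them back in the same order.
    static String roundTrip(int i, double d, String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bytes)) {
            dos.writeInt(i);
            dos.writeDouble(d);
            dos.writeUTF(s);
        }
        try (DataInputStream dis = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            // Values must be read in exactly the order they were written.
            int readInt = dis.readInt();
            double readDouble = dis.readDouble();
            String readString = dis.readUTF();
            return readInt + " " + readDouble + " " + readString;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(7, 2.5, "ok")); // prints "7 2.5 ok"
    }
}
```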
The following example shows both input and output processes in one short piece
of code. First a PrintWriter object is created, ready to act as a conduit for streaming
data to a file. Then a BufferedReader object is built based on System.in, the keyboard,
as a source. Then a read method, readLine( ), reads data from the keyboard, which is
stored in a String object. This object is then written out to file via the println method.
Example using System.in and the IO Classes
import java.io.*;

public class ToFile {
    public static void main(String[] args) {
        String line;
        PrintWriter out = null;
        // Open a character stream onto the file "target", layered over a byte stream.
        try {
            out = new PrintWriter(new FileOutputStream("target"));
        }
        catch (FileNotFoundException f) {
            System.out.println("file not found");
            System.exit(1); // non-zero status signals failure
        }
        // Wrap System.in so whole lines of characters can be read from the keyboard.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Type in a line");
        try {
            line = in.readLine();
            out.println(line); // write the line to the file
            out.close();
            in.close();
        }
        catch (IOException io) {
            System.out.println("IOException");
        }
    }
}