Design and Implementation of a Win32 Text Editor

Neatpad - Win32 Text Editor

Welcome to the brand new tutorial series - "Design and Implementation of a Win32 Text Editor!"

The purpose of these tutorials is to follow the development of a win32 text editor - codenamed Neatpad. Each tutorial will take you step-by-step over the major components and design decisions that lie ahead.

Part 1 - Overview

Introduction

Now, whilst the title says "text-editor", this is not the real goal of this tutorial series. Rather, we will aim to write a complete edit control which will be able to provide the core editing front-end for a text editor. Of course, we will have to build a small test application (Neatpad) to test the code with, but we will not go any further than the standard Windows Notepad goes.

Alternative Win32 Solutions

Before we begin you should also check out the following three win32 open-source editor components:

Scintilla, Brainchild and CrystalEdit.

Scintilla in particular looks very nice (it has loads of good features). However, all three editors suffer from a large, unwieldy code-base, and all have limitations in their memory-management schemes, meaning that large files are not handled particularly smartly.

Text Editor Resources

The following set of resources discuss text editor design rather than individual applications, so you may find it a useful exercise to look through the links and read what's there.

Edit-control requirements

The very first thing we must do is to make sure that there aren't any text controls already available which will do the job perfectly well. I imagine that anyone who is reading this has come to the same conclusions that I have - the standard Windows edit and rich-edit controls are not up to scratch - they are slow, don't handle large files and have bad flickering problems for a start, and boy do I hate apps that flicker :-)

The primary reason I want to start this project is as a learning exercise - for me and for anyone else who reads this series. So now that we have decided to make our own edit control, the next step is to decide what features the edit control will have. This is important, because these initial design ideas can have big impacts on the final implementation. So, in no particular order, here are the features that I want an edit control to offer:

I'm sure that I'll dream up extra features as we move through this article series, but for now I think that'll be enough :-)

Choice of language

This is a tough one. Realistically speaking we have the choice between C and C++. C is good because everyone can compile it, whereas C++ has the advantage that it is easier to write this type of control using it. For this tutorial we are actually going to use a mixture of C and C++. The text editor control will be written using C++, but we will provide a "C" interface to the control, because our main application (Neatpad) will be written in "pure win32" C.

Don't expect this control to be available in .NET, or Visual Basic, or any other high-level language. Maybe towards the end of this article series we will look at putting the editor control in an external DLL so that other languages can use it, but for now this is a pure C/C++ solution with no 3rd party libraries.

Design of a Text Editor

A text editor is conceptually very simple. Just take a look at Notepad - all it has is a main window and an "edit" control inside this main window. This is the design we will follow when we are implementing our text editor.

The diagram below illustrates the key components of a text editor. First of all we have a top-level main window. This is the window that contains the title, menu and status bars. The main window doesn't know how to edit text files; its only purpose is to provide an interface to the user. This component will be written in pure win32 C.

The most important component is the "TextView". This is a separate window that is a child of the main (parent) window. The TextView is the visual component of a text editor. This control's primary purpose is to display and edit text, but it needs to do other things besides this, such as display scrollbars, process mouse and keyboard input, support drag and drop etc.

The "TextDocument" component is not a visible one, but is used to store and manipulate a text file once it has been loaded into memory. The text object has no concept of windows, mice, drawing or painting. All it knows how to do is manipulate text, and supply text to the edit control when it wants to draw something.

The last component is not really part of a text editor, but plays an important role none-the-less. It is the disk-file that stores the text we want to edit. The TextDocument will interface directly with this text file, reading and writing to it when it is time to load or save a new document.

It is important to separate the TextView from the data it stores (the TextDocument). This has two major benefits. One is that we can attach multiple edit controls to one text object - this will enable us to add "split-views" sometime in the future. The biggest advantage is that we can change the way we store and represent a text file at any time without this impacting upon the design of the visual component.

One thing to note on the component design are the "interfaces" between the various components. The TextView is a line-oriented, graphical entity. Most of the work performed by the edit-window will be updating the display. This will entail retrieving data from the TextDocument on a line-by-line basis.

On the other end of the scale is the interface between the TextDocument and the disk-file. Rather than using a line-by-line strategy, the TextDocument will either load the entire file into memory in one go, or access the file in manageable chunks. No matter how this happens, we've got some interesting design decisions ahead :-)
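To make the separation between view and document a little more concrete, here is a rough sketch in portable C++. It is only an illustration of the idea and not the Neatpad code: the names LineSource, VectorDocument and render are made up for this example, and the "view" simply builds a string instead of drawing into a window.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical interface: the only thing a view needs from a document.
struct LineSource
{
    virtual ~LineSource() {}
    virtual std::size_t linecount() const = 0;
    virtual std::string getline(std::size_t lineno) const = 0;
};

// One possible storage strategy: every line held in memory.
class VectorDocument : public LineSource
{
public:
    explicit VectorDocument(const std::vector<std::string> &lines) : m_lines(lines) {}

    std::size_t linecount() const { return m_lines.size(); }
    std::string getline(std::size_t lineno) const { return m_lines.at(lineno); }

private:
    std::vector<std::string> m_lines;
};

// Stand-in for the view: it "draws" a range of lines into a string and
// never looks at how the document stores its text.
std::string render(const LineSource &doc, std::size_t first, std::size_t count)
{
    std::string out;

    for(std::size_t i = first; i < doc.linecount() && i < first + count; i++)
    {
        out += doc.getline(i);
        out += '\n';
    }

    return out;
}
```

Calling render twice with different ranges over the same document behaves much like two split-views, and swapping VectorDocument for a different storage scheme would not require any change to render.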

Neatpad - a Win32 Text Editor

At the top of this tutorial there is a link to a zip-file, which contains the sourcecode to a skeleton text editor. I have called this project "Neatpad" in passing reference to the standard Windows Notepad.

When you unzip the download, there will be a single Visual C++ project workspace (neatpad.dsw) and two subdirectories (Neatpad and TextView). The Neatpad directory contains several files, the important ones listed below:

It is intended that you work using the main neatpad.dsw workspace in the top-level directory. From this workspace you can build the complete project (Neatpad and TextView). Although the workspace and projects are in Visual Studio 6.0 format, it shouldn't be difficult to convert the projects to other IDEs.

TextView - a Win32 Custom Control

The TextView directory contains the source-code to a skeleton win32 custom control, implemented using C++. The control does nothing apart from display "Hello World" in its window - in fact most of the code is simply what is required to register and create the empty TextView window.

There are four files in the TextView directory - TextView.dsp, TextView.c, TextView.h and TextViewInternal.h:

Most of the future tutorials will concentrate on adding functionality to the TextView project.

Public Interface to the TextView

The basic method of controlling the TextView will be Windows Messages, using the SendMessage API, just like you would use with a standard Edit control. To achieve this we need to define a range of message values that we can use. These message values will be defined in the "public" TextView.h, so whenever you want to use the TextView control in your projects, simply #include TextView.h:

#define TEXTVIEW_CLASS   "TextView32"

#define TXM_BASE (WM_USER)
#define TXM_OPENFILE (TXM_BASE + 0)

#define TextView_OpenFile(hwndTV, szFile) \
SendMessage((hwndTV), TXM_OPENFILE, 0, (LPARAM)(szFile))

As you can see, only one message has been defined so far - TXM_OPENFILE. As we progress through this tutorial and add more functionality to the TextView, we will add more messages as well. I imagine that our TextView will also support the standard EM_xxx edit control messages, so our TextView could be a simple drop-in replacement for that control.


The macro defined above (TextView_OpenFile) provides an alternative interface to the control. It is basically a wrapper around the SendMessage call and makes it easier to use. An example of opening a file is shown below:


TextView_OpenFile(hwndTextView, "C:\\SRC\\TEXTVIEW\\README.TXT");
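For readers who have not written a message-based control before, the following is a rough, portable sketch of what happens on the receiving side of such a call. Everything in it is a stand-in invented for this example (WPARAM_T, LPARAM_T, MSG_OPENFILE, TextViewStub, HandleMessage and SendOpenFile); the real control uses the Win32 types and a proper window procedure, and would actually load the file rather than just remember its name.

```cpp
#include <cstdint>
#include <string>

// Stand-ins for the Win32 types and constants, so the sketch builds anywhere.
typedef std::uintptr_t WPARAM_T;
typedef std::intptr_t  LPARAM_T;

const unsigned MSG_USER     = 0x0400;        // same value as WM_USER
const unsigned MSG_OPENFILE = MSG_USER + 0;  // plays the role of TXM_OPENFILE

// Hypothetical control state; a real control would load the file here.
struct TextViewStub
{
    std::string filename;
};

// Plays the role of the control's window procedure.
long HandleMessage(TextViewStub &tv, unsigned msg, WPARAM_T wParam, LPARAM_T lParam)
{
    (void)wParam;   // unused by this message

    switch(msg)
    {
    case MSG_OPENFILE:
        // the sender packed a string pointer into the LPARAM
        tv.filename = reinterpret_cast<const char *>(lParam);
        return 1;

    default:
        return 0;   // message not handled
    }
}

// Plays the role of the TextView_OpenFile macro.
inline long SendOpenFile(TextViewStub &tv, const char *szFile)
{
    return HandleMessage(tv, MSG_OPENFILE, 0, reinterpret_cast<LPARAM_T>(szFile));
}
```

The point of the wrapper is the same as the macro's: callers never have to remember which parameter carries the filename or how it is cast.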

Coming up in Part 2


This first part in the tutorial series was really just an introduction to what we are trying to achieve. I have assumed that the reader (you) has at least some knowledge of C and Win32, because you will need a reasonable level of programming experience if you intend to benefit from these tutorials. Please take some time to study the skeleton project download - the code is very simple but it is important that you understand how the various components are all going to plug together.


Future tutorials will begin to flesh out the functionality of the text-view control. As I've been experimenting with text editors I have drawn up a list of steps that I believe will be necessary to cover. The list below will hopefully give you an idea of the sequence of this tutorial series.



I will try and stick to the above topics as closely as possible, but bear in mind that I don't have a fixed project plan, so we will see this text editor evolve incrementally over the next few weeks.


Attachment: neatpad1.zip (37.53 KB)

Part 2 - Loading a text file the easy way

Introduction

This is the second part of the "Design and Implementation of a Win32 Text Editor" article series. If you haven't already read part 1 then please do so now!

OK, so assuming you've downloaded, assimilated and compiled the source code that was made available, you should have a basic skeleton text editor which doesn't do anything yet. Our mission in part 2 is to load a text file into memory and display it in our TextView control. But let's not get carried away. The only aim right now is to load a text file and provide very basic display, we are nowhere near providing scrolling or keyboard and mouse support.

Text Documents

A Text Document is nothing more than a basic binary file, with the commonly understood convention that a text-file should not contain unprintable characters (i.e. ASCII control characters), and that lines of text are separated by a common end-of-line delimiter (such as a carriage-return / line-feed pair).

It is the task of a Text Editor to interpret a text-file's binary content and display this content in a line-oriented manner to the user. Part one of this tutorial series discussed the structure of a Text Editor - and described the TextView and TextDocument objects. The first thing we will concentrate on will therefore be the TextDocument object - which we will represent as a C++ class:

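class TextDocument
{
public:
    bool  init(char *filename);
    bool  init(HANDLE hFile);

    ULONG getline(ULONG lineno, char *buf, size_t len);
    ULONG linecount();

private:
    bool  init_linebuffer();

    char  *buffer;       // entire file contents
    ULONG  length;       // size of the file, in bytes

    ULONG *linebuffer;   // offset of the start of each line
    ULONG  numlines;     // number of lines recorded in linebuffer
};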

The basic C++ interface is very simple. We can load a file into the TextDocument using the init class member. We can retrieve a line of text using the getline method - where we specify a line number and a buffer into which to store the line contents.


Notice that the TextDocument class is entirely ASCII in operation at the moment - that is, there is no support for Unicode. We could have used C++ templates to support a variety of different types. However at this moment in time I am still undecided as to how best to approach this problem, so we will leave the interface as simple as possible. After all, this is a "throw-away" implementation of TextDocument, and we will completely re-write it later on in the series.


Loading a text file


Our first attempt at loading a text file will try to be as simple as possible. The TextDocument::init function below is the main interface to the TextDocument:


bool TextDocument::init(char *filename)
{
    HANDLE hFile;

    hFile = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, 0, 0);

    if(hFile == INVALID_HANDLE_VALUE)
        return false;

    return init(hFile);
}

TextDocument::init simply opens a file using the standard win32 CreateFile API, and then passes control to a helper function which loads the file contents using the HANDLE returned by CreateFile:


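bool TextDocument::init(HANDLE hFile)
{
    ULONG numread = 0;
    bool  success = false;

    // GetFileSize returns INVALID_FILE_SIZE on failure; empty files are rejected too
    length = GetFileSize(hFile, 0);

    if(length != 0 && length != INVALID_FILE_SIZE)
    {
        // allocate new file-buffer
        if((buffer = new char[length]) != 0)
        {
            // read entire file into memory, then work out where each line of text starts
            if(ReadFile(hFile, buffer, length, &numread, 0) && numread == length)
                success = init_linebuffer();
        }
    }

    // always close the handle, even if loading failed
    CloseHandle(hFile);
    return success;
}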

As you can see loading a file is very simple. We calculate how big the file is, allocate a buffer to hold the text, and then read the file into memory. This is not really a very smart thing to do, as large files will take a very long time to load, or may fail to load completely if there isn't enough memory available.


Please understand that this is a "throw-away" TextDocument class. Writing a TextDocument class which supports files of any size will be very difficult so I have deliberately kept this version of TextDocument as simple as possible. Later on in the series we will revisit file-loading and write this class properly.
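For anyone following along without a Windows compiler to hand, the same "read it all in one go" idea can be sketched with the standard C++ library. This is not the code from the download: load_whole_file is a name made up for this example, and it stores the text in a std::vector instead of a raw buffer.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Reads a whole file into 'out'. Returns false if the file cannot be
// opened or fully read; an empty file is treated as a failure too,
// like TextDocument::init above.
bool load_whole_file(const char *filename, std::vector<char> &out)
{
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if(!file)
        return false;

    // the file was opened "at end", so the current position is its size
    std::streamoff size = file.tellg();
    if(size <= 0)
        return false;

    out.resize(static_cast<std::size_t>(size));

    file.seekg(0, std::ios::beg);
    file.read(&out[0], size);

    // succeed only if every byte was read
    return file.gcount() == size;
}
```

It shares the weakness discussed above: the whole file has to fit in memory, so very large files will load slowly or not at all.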


Carriage-returns and Linefeeds


How many lines are there in a text document? How do we read a file line-by-line? The answer really depends on how you define what a line of text is. At its simplest, a line of text is a sequence of characters within a file, with a well-defined end-of-line marker. We don't really care what the characters are in each line, but we do need to know how to identify where each line starts and stops.


There are three main conventions for delimiting lines of text - under DOS and Windows, a carriage-return / line-feed pair is used. Under UNIX and Linux, a single line-feed character is used, and under the classic Macintosh operating system, a single carriage-return is used. There are a number of issues which become apparent once we try to tackle all these cases, but for now we will just concentrate on the DOS/Windows case. A future tutorial will address the other two options (and another case, where we can encounter a file with all combinations of line-separators).
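As a taste of what handling all three conventions might involve, here is a small portable sketch. It is not part of Neatpad (the tutorial only handles the DOS/Windows case for now), and the names eol_length and count_lines are invented for this example.

```cpp
#include <cstddef>

// Returns the length of the line terminator that starts at text[pos]:
// 2 for CR/LF, 1 for a lone CR or a lone LF, 0 if there is none.
std::size_t eol_length(const char *text, std::size_t length, std::size_t pos)
{
    if(pos >= length)
        return 0;

    if(text[pos] == '\r')
        return (pos + 1 < length && text[pos + 1] == '\n') ? 2 : 1;

    if(text[pos] == '\n')
        return 1;

    return 0;
}

// Counts lines in a buffer that may mix all three conventions.
// A final line without a terminator still counts as a line.
std::size_t count_lines(const char *text, std::size_t length)
{
    std::size_t lines = 0;
    std::size_t pos   = 0;
    std::size_t start = 0;

    while(pos < length)
    {
        std::size_t eol = eol_length(text, length, pos);

        if(eol > 0)
        {
            lines++;
            pos  += eol;
            start = pos;
        }
        else
        {
            pos++;
        }
    }

    return (start < length) ? lines + 1 : lines;
}
```

The important detail is that a carriage-return followed by a line-feed must be treated as one terminator and not two, otherwise every DOS/Windows line would be followed by a phantom empty line.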


The TextView control needs to know how many lines of text there are in its document, because it must be able to set up the scrollbars to reflect the document length. We also need to be able to locate individual lines of text in a random order because we will be drawing the text document line-by-line.


The most common solution to this problem is to implement what is known as a "line buffer". Conceptually the line buffer is very simple - it is just an array of integer offsets which specify where each line of text in the document starts.



The diagram above illustrates a line-buffer on the left hand side. The buffer has been initialized with the block of text on the right - by processing the text, searching for carriage-return / line-feed sequences, and recording the offset for each line. Each array item (numbered 0-6) contains the offset of the character which starts each line of text.


The TextDocument will implement the line-buffer as it seems a natural choice to perform line-parsing in the same place as the file loading. The TextDocument::init_linebuffer function below is called when a new file is loaded:


bool TextDocument::init_linebuffer()
{
    ULONG i = 0;
    ULONG linestart = 0;

    // allocate the line-buffer: at most one entry per byte in the file,
    // plus one for the end-of-file offset stored after the loop
    if((linebuffer = new ULONG[length + 1]) == 0)
        return false;

    numlines = 0;

    // loop through every byte in the file
    for(i = 0; i < length; )
    {
        if(buffer[i++] == '\r')
        {
            // carriage-return / line-feed combination
            if(i < length && buffer[i] == '\n')
                i++;

            // record where the line starts
            linebuffer[numlines++] = linestart;
            linestart = i;
        }
    }

    // record the last line if the file doesn't end with a carriage-return
    if(linestart < length)
        linebuffer[numlines++] = linestart;

    // the extra entry lets us work out the length of the last line
    linebuffer[numlines] = length;

    return true;
}

The init_linebuffer function does two things: it allocates space for the line buffer, and then processes the file. Because we don't know how many lines of text there will be until we have processed the entire file, how do we know how big to make the line-buffer? The short answer is, we can't possibly know this in advance. A real TextDocument class would either dynamically reallocate its line-buffer as it encountered more lines, or use some clever algorithms to limit the amount of memory consumed by the line-buffer.


For now we will cheat and give the linebuffer one entry for every byte in the file, plus one extra entry for the end-of-file offset - this way we know we won't run out of space. Processing the file requires us to loop through each and every byte in the file, looking for a carriage-return character.


The algorithm itself is straight-forward. Whenever a carriage-return is encountered (followed by an optional line-feed), a new entry is added to the end of the linebuffer, which records the current start-of-line. The linestart variable is then made to "point" to the character after the carriage-return - the start of the next line. This continues until there are no more characters left in the file. Any text left over after the last carriage-return is recorded as a final line, and one extra entry holding the file length is stored so that the length of the last line can be worked out. The number of lines processed is kept updated in the numlines variable.
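The "dynamically reallocate" alternative mentioned above is easy to sketch with a std::vector, which grows as entries are added. This is only an illustration under my own assumptions and not the Neatpad implementation: build_line_offsets is a made-up name, the text is passed in as a std::string, and a final entry equal to the text size is appended so that every line's length can be found by subtraction.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the starting offset of every line in 'text', followed by one
// final entry equal to text.size(). Only CR and CR/LF are treated as
// line terminators, matching the DOS/Windows case discussed above.
std::vector<std::size_t> build_line_offsets(const std::string &text)
{
    std::vector<std::size_t> offsets;
    std::size_t linestart = 0;

    for(std::size_t i = 0; i < text.size(); )
    {
        if(text[i++] == '\r')
        {
            if(i < text.size() && text[i] == '\n')
                i++;

            offsets.push_back(linestart);   // the vector grows on demand
            linestart = i;
        }
    }

    // a last line with no terminator still needs an entry
    if(linestart < text.size())
        offsets.push_back(linestart);

    offsets.push_back(text.size());         // end-of-text entry
    return offsets;
}
```

For the ten-character text "ab\r\ncd\r\nef" this yields the offsets 0, 4, 8 and 10: three line starts plus the end-of-text entry.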


Retrieving lines of text


Now that our TextDocument implements a rudimentary line buffer, line-based lookups will be very fast. The following TextDocument::getline function shows how we will access each line of text from the TextDocument:


ULONG TextDocument::getline(ULONG lineno, char *buf, size_t len)
{
    char *lineptr;
    ULONG linelen;

    // reject line numbers that are outside the document
    if(lineno >= numlines)
        return 0;

    // find the start of the specified line
    lineptr = buffer + linebuffer[lineno];

    // work out how long it is, by looking at the next line's starting point
    linelen = linebuffer[lineno+1] - linebuffer[lineno];

    // make sure we don't overflow caller's buffer
    linelen = min(len, linelen);

    memcpy(buf, lineptr, linelen);

    return linelen;
}

The function works by using the lineno parameter as a direct index into the linebuffer array. The offset stored in linebuffer[lineno] is added to the real buffer of text, resulting in a pointer to the start-of-line. The length of the line (in characters) is calculated by subtracting the current line's offset from the next line's, so it includes the carriage-return / line-feed that ends the line. Once the correct offset and length have been calculated, the raw line content is copied into the caller-supplied buffer. It's simple but it works.


char buf[200];

m_pTextDocument->getline(5, buf, sizeof(buf));

The example above shows how to retrieve a buffer of text at line number 5.
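Because getline works out a line's length from the start of the next line, the text it returns still has the carriage-return / line-feed on the end, and it is not null-terminated. A caller that only wants the visible characters could trim them with something like the helper below. The name strip_eol is made up for this sketch; it is not part of the download.

```cpp
#include <cstddef>

// Returns the length of the line once any trailing CR / LF characters
// have been dropped. 'buf' need not be null-terminated.
std::size_t strip_eol(const char *buf, std::size_t len)
{
    while(len > 0 && (buf[len - 1] == '\r' || buf[len - 1] == '\n'))
        len--;

    return len;
}
```

A caller would pass the buffer and the length returned by getline, and use the result as the number of characters to display.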


Drawing the lines of text


The whole point of accessing the TextDocument in a line-by-line manner is to make our lives simpler when it comes to drawing the text in our TextView control.


The entire drawing and painting logic in our TextView will be centered around the WM_PAINT message. In fact, our WM_PAINT handler will be the only place in the entire program where any form of painting takes place. As with every win32 program, the painting framework starts life as a basic BeginPaint / EndPaint sequence:


LONG TextView::OnPaint()
{
    PAINTSTRUCT ps;

    BeginPaint(m_hWnd, &ps);

    // do painting

    EndPaint(m_hWnd, &ps);
    return 0;
}

The function above simply validates the window's update region and returns - and because our window-class for the TextView specified that there is no background brush, not even a single pixel is displayed.


At this point, all we have is a basic drawing framework, with the update-area specified in the PAINTSTRUCT::rcPaint RECT structure. The diagram below illustrates this update rectangle in red.



Our task is to take this single, arbitrary rectangular region and convert it into a series of horizontal spans that will be filled with lines of text. In the diagram above the update region has been split into three such spans. The variables first and last denote the first and last lines that require updating.


LONG TextView::OnPaint()
{
    PAINTSTRUCT ps;

    BeginPaint(m_hWnd, &ps);

    ULONG first, last;
    ULONG i;

    // figure out which lines to draw
    first = ps.rcPaint.top / m_nFontHeight;
    last = ps.rcPaint.bottom / m_nFontHeight;

    // draw the display line-by-line
    for(i = first; i <= last; i++)
    {
        PaintLine(ps.hdc, i);
    }

    EndPaint(m_hWnd, &ps);
    return 0;
}

The OnPaint code above is simple but effective. The first two lines we added were "first =" and "last =". These are used to work out the starting and ending row which encompass the update region. We divide by the current font-height because we want to convert from pixels to logical lines.


Note that we have not taken into account the horizontal extents of the update region. This is deliberate, as it is simpler at this stage to draw each line in its entirety, and let the device-context's update-region clip our output if we draw too much.
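The pixel-to-line conversion is worth a worked example. With a font height of 15 pixels, an update rectangle running from y=30 to y=75 touches lines 2, 3 and 4. The helper below is a hypothetical, portable version of that calculation (lines_in_span is not a function in the download). It makes two assumptions of its own: the bottom edge is treated as exclusive, as is conventional for a Win32 RECT, and the result is clamped to the number of lines in the document.

```cpp
// Hypothetical helper: converts a vertical pixel span into the inclusive
// range of line numbers that it touches. 'bottom' is exclusive.
// Returns false if nothing needs drawing.
bool lines_in_span(long top, long bottom, long fontHeight, unsigned long lineCount,
                   unsigned long &first, unsigned long &last)
{
    if(fontHeight <= 0 || lineCount == 0 || top < 0 || bottom <= top)
        return false;

    first = static_cast<unsigned long>(top / fontHeight);
    last  = static_cast<unsigned long>((bottom - 1) / fontHeight);

    if(first >= lineCount)
        return false;           // the span lies entirely below the text

    if(last >= lineCount)
        last = lineCount - 1;   // never ask the document for a missing line

    return true;
}
```

Dividing the bottom edge directly, as OnPaint does, can visit one extra line when the rectangle ends exactly on a line boundary. That is harmless, because the output is clipped to the update region anyway.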


The individual line drawing has been deferred to a separate function:


LONG TextView::PaintLine(HDC hdc, ULONG nLineNo)
{
    TCHAR buf[LONGEST_LINE];
    ULONG len;

    RECT rect;

    GetClientRect(m_hWnd, &rect);

    // work out where the line should be drawn
    rect.top = nLineNo * m_nFontHeight;
    rect.bottom = rect.top + m_nFontHeight;

    // get the data for this single line of text
    len = m_pTextDocument->getline(nLineNo, buf, LONGEST_LINE);

    // draw text and erase the entire line background at the same time
    TabbedExtTextOut(hdc, &rect, buf, len);

    return 0;
}

As you can see drawing a line of text is relatively simple. The first thing we do is work out the pixel coordinates of where to draw the line. We use the window's client area as our starting point, and then adjust the top and bottom to describe the line as a simple rectangular region in pixel-based coordinates.


Once we have the line's bounding rectangle we use a further helper function to draw the text - TabbedExtTextOut. I won't include this function here - just look in the sources - suffice to say, TabbedExtTextOut is a simple wrapper function around the TabbedTextOut API, with the added feature that it also erases any background area that is not taken up by actual text - similar to the ExtTextOut API.


The actual line of text we want to draw is retrieved from the TextDocument object using the getline method we developed earlier. It doesn't matter if we change how we store our text-file inside the TextDocument: as long as we preserve the "getline" interface, the TextView and TextDocument can be entirely separate entities.


Note that our simple text-output at this stage is exactly that - we don't handle control characters, syntax colouring or scrolling. One step at a time though.


Coming up in Part 3


At this point we have a very simple text-viewing capability, but it is quite limited in what it can do. There is no scrolling, no keyboard or mouse input, no selection or highlighting, no colouring and no editing. Don't let this discourage you though - it is very important that we have a simple design with which to start.


If you are reading this tutorial series then you probably want to see how a real text-storage component is implemented - i.e. how large files are loaded, data structures managed etc. This will be covered in a future tutorial quite soon, but for now I want to get the basic graphical interface working first.


The next tutorial will therefore look at adding scrolling support, so at least we will be able to view an entire text document. Once we have finished that, we will look at mouse input, as this will be one of the most difficult areas to implement. It will require careful coding in both the mouse and drawing routines as we have to handle cursor placement and selection highlighting at the same time.


Attachment: neatpad2.zip (42.65 KB)

Part 3 - Scrollbars and Scrolling

Introduction

Welcome to the third installment of the "Design and Implementation of a Win32 Text Editor" article series! In this part we will look at adding scrollbars and scrolling to our TextView control.

Scrolling in Win32

In Windows, scrolling is divided into three areas: The first is the physical scrollbars, whether they be scrollbars built into a window, or separate scrollbar controls. There is a complete scrollbar API to set and retrieve the positional information described by a scrollbar - GetScrollPos, SetScrollInfo etc.

The next area is the window-messages that scrollbars send to their parent window when they are manipulated by the user - i.e. the WM_VSCROLL and WM_HSCROLL messages. These special scrolling messages can be used by an application to update its user-interface in response to the user's interactions.

The last category is the GDI scrolling API - i.e. ScrollWindow and ScrollWindowEx. Although they are grouped in the same category as the "regular" scrollbar API in the Platform SDK, these two GDI routines don't have anything to do with scrollbars - instead they are used to move/offset a bitmap region within a window to give the illusion of scrolling.

We will have to bring all these areas together to implement full scrolling support in our TextView control.

Scrollbars - built in or separate?

The first thing I want to discuss before we start is what type of scrollbar to use. We have two choices - "built-in" scrollbars which would be part of the TextView's non-client window area, or separate scrollbar controls.

There isn't much difference between the two types. With built-in scrollbars, the WM_xSCROLL messages are sent to the same window to which the scrollbars belong. With separate scrollbar controls, the scrollbar messages get sent to the control's parent window. However it would be very simple to forward the scrollbar messages from a parent window to the real TextView window, so this wouldn't be any kind of set-back.

The real difference comes when we have to position our TextView control, because the scrollbar controls have to be carefully positioned at the same time to make sure they are aligned correctly at the bottom/right edges of the TextView. We would probably delegate this task to a "container" window which managed the layout of a single TextView and its associated scrollbars.

Whilst separate scrollbar controls can provide additional flexibility (we can add grippers, buttons etc alongside the scrollbars), the extra work involved would detract from the task at hand. So for the time being the TextView will use normal built-in scrollbar controls - however later on in the tutorial series we will look at how to utilize separate controls because it is actually very simple, but just a little tedious to be concerned with at the moment.

Scrollbar settings

A scrollbar's state (i.e. its range and current thumb position) is set using the SetScrollInfo API, using a SCROLLINFO structure to specify each attribute. This structure is shown below:

struct SCROLLINFO
{
    UINT cbSize;
    UINT fMask;

    int  nMin;
    int  nMax;
    UINT nPage;
    int  nPos;
    int  nTrackPos;   // position while the thumb is being dragged (read-only)
};

The diagram below illustrates these four scrollbar properties and how they relate to each other. For the sake of this small example, the data range that the scrollbar represents is 100 lines. Although I've used a picture of a horizontal scrollbar, a vertical scrollbar is no different in operation.



The nMin and nMax values represent the range of "scrolling units" (or lines) that the scrollbar covers. For our TextView, we would set nMin to be zero as it makes no sense to allow scrolling to negative line numbers. nMax must be the total number of lines in the text document, minus one - in the case of our "100 line file", nMax must be 99.


The nPos value represents the current scrollbar position, in "scrolling units". This is not a pixel or coordinate based value, rather it is an arbitrary value somewhere within the range nMin...nMax.


This leaves us with the "odd one out" - nPage. This value has nothing to do with the scrolling range, the number of lines in a file or the current scrollbar position. It is used purely to specify how many scrolling units there are in the current window client-area. For example, if our window was big enough to hold 15 lines of text, then we would set nPage to equal 15. The built-in scrollbar will use this value to work out how big to make the scrollbar thumb - at no time do you have to do anything complicated to work out how big the thumb should be.


Note that the maximum thumb position will normally be less than nMax because the thumb width prevents it from reaching this far. Therefore we have two "maximum" values that relate to our scrollbar - what the scrollbar requires for its si.nMax value, and what results as the thumb position maximum value - which is nMax - (nPage - 1). For our 100-line file in a 15-line window this is 99 - 14 = 85, which is the same as the line count minus nPage.
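The relationship between nMax, nPage and the furthest thumb position can be written as a tiny function. The SetScrollInfo documentation describes the valid range of nPos as nMin to nMax - max(nPage - 1, 0); the helper below (max_thumb_pos, a name invented for this sketch) simply evaluates that expression, and additionally clamps the result to nMin for the case where the whole document fits in one page.

```cpp
// Largest position the thumb can reach for a scrollbar configured with
// the given nMin, nMax and nPage (all in scrolling units).
int max_thumb_pos(int nMin, int nMax, int nPage)
{
    if(nPage < 1)
        nPage = 1;              // a page of 0 leaves the full range reachable

    int maxPos = nMax - (nPage - 1);

    // if everything fits in one page there is nowhere to scroll to
    return (maxPos < nMin) ? nMin : maxPos;
}
```

For the 100-line example, max_thumb_pos(0, 99, 15) gives 85: when the thumb is at 85, lines 85 to 99 fill the 15-line window exactly.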


Adding scrollbars to a Window


Adding scrollbars to a window is very simple - all we have to do is add the WS_HSCROLL and WS_VSCROLL window styles, which we can do when we create the window:


CreateWindowEx(WS_EX_CLIENTEDGE, TEXTVIEW_CLASS, 0, WS_VSCROLL | WS_HSCROLL ...

Before we can do anything useful with our scrollbars we must create some variables which will represent the "scrollbar state" for the TextView C++ class:


class TextView
{
    ...
    ULONG m_nVScrollPos;
    ULONG m_nHScrollPos;

    ULONG m_nVScrollMax;
    ULONG m_nHScrollMax;

    int m_nWindowLines;
    int m_nWindowColumns;

    ULONG m_nLongestLine;
    ULONG m_nLineCount;
};

These variables (when initialized) will allow us to take into account the current scrollbar position when we are drawing our lines of text into the display.



Note that the m_nLongestLine variable has been introduced. This is used to represent the maximum horizontal scrolling extent. It should be obvious why we need to use the longest line in the text document to determine this horizontal scrolling range, and not some arbitrary fixed value.


ULONG TextDocument::longestline(int tabwidth);

The TextDocument has been given the task of calculating the width (in logical text units) of the longest line. The current tab-width setting must be specified so that tab characters can be taken into account. I won't include the code necessary to perform this task - just take a look at the sourcecode download.
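To show why the tab-width matters, here is one way the width of a single line could be measured. This is my own sketch and not the code from the download: line_width is a hypothetical name, and it assumes that a tab advances to the next multiple of the tab-width and that carriage-return and line-feed characters take up no space.

```cpp
#include <cstddef>

// Width of one line in columns, with each tab advancing to the next
// multiple of 'tabwidth'. CR and LF characters are not counted.
// If 'tabwidth' is 0, a tab is counted as a single column.
std::size_t line_width(const char *text, std::size_t len, std::size_t tabwidth)
{
    std::size_t col = 0;

    for(std::size_t i = 0; i < len; i++)
    {
        if(text[i] == '\t' && tabwidth > 0)
            col += tabwidth - (col % tabwidth);   // jump to the next tab stop
        else if(text[i] != '\r' && text[i] != '\n')
            col++;
    }

    return col;
}
```

With a tab-width of 4, the three characters "a", tab, "b" occupy 5 columns, because the tab pads column 1 out to column 4. The longest line is then just the maximum of this value over every line in the document.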


Configuring the scrollbars


Repeated Win32 projects have taught me that scrollbars only really need to be set up in one place in a program - during a window-size change. A single function can be used to set up both the horizontal and vertical scrollbars at the same time:


VOID TextView::SetupScrollbars()

First of all a SCROLLINFO structure is configured to set the vertical scrollbar properties:


VOID TextView::SetupScrollbars()
{
    SCROLLINFO si = { sizeof(si) };

    si.fMask = SIF_POS | SIF_PAGE | SIF_RANGE | SIF_DISABLENOSCROLL;

    si.nPos = m_nVScrollPos; // scrollbar thumb position
    si.nPage = m_nWindowLines; // number of lines in a page (i.e. rows of text in window)
    si.nMin = 0;
    si.nMax = m_nLineCount - 1; // total number of lines in file (i.e. total scroll range)

    SetScrollInfo(m_hWnd, SB_VERT, &si, TRUE);
    ...

The horizontal scrollbar is configured in a very similar way:


    ...
    si.nPos = m_nHScrollPos; // scrollbar thumb position
    si.nPage = m_nWindowColumns; // number of columns in the window
    si.nMin = 0;
    si.nMax = m_nLongestLine - 1; // width of longest line (i.e. total scroll range)

    SetScrollInfo(m_hWnd, SB_HORZ, &si, TRUE);

    ...

The very last thing to do is calculate the maximum positions that the scrollbar thumbs can take:


    ...
    m_nVScrollMax = m_nLineCount - m_nWindowLines;
    m_nHScrollMax = m_nLongestLine - m_nWindowColumns;
}

It is important that you understand these last two lines. The m_nVScrollMax and m_nHScrollMax values are not used to set the scrollbar's si.nMax properties - the m_nLongestLine and m_nLineCount are used for this purpose. Instead, the m_nxScrollMax values represent the maximum thumb positions - much more useful for our purposes because we will use these values a lot in the TextView.


Window size affects scrolling range


The first thing we must understand is that whenever we resize our TextView control, the amount of visible text within that window will change, and the scrollbar's nPage value must change to reflect this. We can therefore react to the WM_SIZE message that our TextView window receives whenever it is resized:


case WM_SIZE:
    return tvp->OnSize(LOWORD(lParam), HIWORD(lParam));

LONG TextView::OnSize(int width, int height)
{
    m_nWindowLines   = min(height / m_nFontHeight, m_nLineCount);
    m_nWindowColumns = min(width  / m_nFontWidth,  m_nLongestLine);

    // keep the scroll position inside the (possibly smaller) scroll range
    if(PinToBottomCorner())
        RefreshWindow();

    SetupScrollbars();
    return 0;
}

The first two lines simply work out how many lines/columns of text there are in the window. The min() function is used to handle the case when the window can display more text than there is available (i.e. when the entire file fits within the window).


Once these two values have been calculated the scrollbars can be configured using the SetupScrollbars function we wrote above. This leads us onto the next question:


What is "Pinning" ?


Imagine the following scenario. We have loaded a text document and scrolled right the way down to the bottom. We then drag the bottom window-border down to make the window larger. The question is, what happens to the file contents when we do this? We have two options to choose from, both of which are acceptable:


Do we leave the current scrollbar position intact and expose "void" space at the end of the file? This option would not affect the file's position - i.e. it would remain static. Some people prefer this because the text display remains in the same place on the screen.


The other option is to drag the file contents down at the same time (exposing more content at the top of the window), adjusting the scrollbar position at the same time, so that we always have a window full of text - in effect, "pinning" the file-content to the bottom edge of the window's client-area.


Have a look at the regular notepad utility and see what it does - you will find that it (or rather, the standard multi-line edit control) "pins" its content to the bottom-right corner of the control when the control is resized. This is the behaviour that we will use in our edit control.


bool TextView::PinToBottomCorner()
{
    bool repos = false;

    if(m_nHScrollPos + m_nWindowColumns > m_nLongestLine)
    {
        m_nHScrollPos = m_nLongestLine - m_nWindowColumns;
        repos = true;
    }

    if(m_nVScrollPos + m_nWindowLines > m_nLineCount)
    {
        m_nVScrollPos = m_nLineCount - m_nWindowLines;
        repos = true;
    }

    return repos;
}

The function above simply adjusts the scrollbar position to make sure that it is always within the scroll-range, and returns a boolean value to indicate if anything changed - so we know if we need to redraw the window or not.
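
The pinning rule is the same for both axes, so it can be tried out in isolation. The sketch below is a simplified, single-axis model of the function above; PinnedPos and its parameters are names invented for this example.

```cpp
#include <cassert>

// Returns the scroll position after pinning on one axis.
//   pos   - current scroll position
//   page  - lines (or columns) that fit in the window
//   total - lines in the file (or width of the longest line)
int PinnedPos(int pos, int page, int total)
{
    assert(page >= 0 && page <= total);
    return (pos + page > total) ? total - page : pos;
}
```

Take a 100-line file scrolled to the bottom of a 20-line window (position 80). Enlarging the window to 30 lines gives PinnedPos(80, 30, 100), which is 70: the text is dragged down by ten lines so the window stays full. A position that is already in range comes back unchanged.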


Note that the alternative (exposing empty space at the end rather than dragging the file down) would also be acceptable - in fact this is what Visual Studio and many other editors do. We might look at adding this behaviour as an optional extra later on in the series. Personally I prefer the "pinning" method much more so that's what I have implemented first of all.


Taking the scrollbar position into account when drawing


Currently our text-display is quite primitive because no scrollbar information is used when painting the display. Consequently the text-file is firmly rooted to the top-left of the TextView client-area and we have to drag the window larger if we want to see more text. Fortunately very little work is required to make our TextView fully scrollable.


If you remember from the last tutorial, the TextView::OnPaint routine calculated the first and last rows of text to update using a simple formula:


 first = ps.rcPaint.top    / m_nFontHeight;
last = ps.rcPaint.bottom / m_nFontHeight;

It is a simple matter to take the vertical-scrollbar position into account:


 first = m_nVScrollPos + (ps.rcPaint.top    / m_nFontHeight);
last = m_nVScrollPos + (ps.rcPaint.bottom / m_nFontHeight);

Basically what this does is change the line-index that we draw, so as the scrollbar moves down, the lines of text effectively move up the display. We still draw lines at the same physical location within the window, but we draw different lines to give the illusion that we are scrolling through the document.


Now that we can correctly identify which logical line of text needs updating, we can look at the actual text-output.


void TextView::PaintLine(HDC hdc, ULONG nLineNo)
{
    RECT rect;
    GetClientRect(m_hWnd, &rect);

    // calculate rectangle for entire length of line in window
    rect.left   = -m_nHScrollPos * m_nFontWidth;
    rect.right  = rect.right;     // right edge stays fixed to the window

    rect.top    = (nLineNo - m_nVScrollPos) * m_nFontHeight;
    rect.bottom = rect.top + m_nFontHeight;

    // rest of function body omitted

    // draw text and fill line background at the same time
    TabbedExtTextOut(hdc, &rect, buf, len);
}

First of all understand the horizontal text-positioning - the rect.left and rect.right values: When the horizontal scrollbar's position is increased (i.e. we scroll to the right), we would expect the page to scroll to the left. So as the scroll position increases, the text position must decrease. This is why the "-m_nHScrollPos" is used. This logical text position is then multiplied by the current font-width to produce a pixel-based coordinate that can be used for drawing. This is ideal for fixed-width font displays.


The right-most edge of the line of text must still be fixed to the right-edge of the window, so this value is left unchanged. This basically results in the following occurrence: As we scroll to the right, the length of text that we draw gets larger and larger because the starting x-coordinate becomes more and more negative. If the device-context didn't clip our output, lines of text would look something like this:



Of course, instead of offsetting our drawing to the left, we could have simply used the horizontal scrollbar position to find the correct place to draw within the line of buffered characters, and (sensibly) do all drawing from a fixed x-coordinate of zero. This would work well for our simple text display. However "tabbed" text display poses a problem, and once we start more complex syntax colouring and variable-width fonts, this quickly becomes a poor choice. I am undecided at present what the best method is to handle the horizontal scrolling so we will use this simple method for now.


Now onto the vertical scrollbar position. Because our PaintLine function is being supplied with a "logical" line number (nLineNo) - which is relative to the start of the document - we must subtract the vertical scrollbar's position from this value to arrive at a zero-based index, relative to the top of the client-area. This value is then multiplied by the font's height to provide a pixel-based y-coordinate.
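
The coordinate arithmetic described above can be checked with a few numbers. The sketch below mirrors the PaintLine calculations without any Win32 types; LineRect and LineRectFor are hypothetical names used only for this illustration.

```cpp
#include <cassert>

// Pixel coordinates of one line of text within the client area.
struct LineRect
{
    int left;
    int top;
    int bottom;
};

// lineNo is a logical line number (relative to the start of the document);
// the scroll positions are in text units, the font sizes in pixels.
LineRect LineRectFor(int lineNo, int vScrollPos, int hScrollPos,
                     int fontWidth, int fontHeight)
{
    assert(fontWidth > 0 && fontHeight > 0);

    LineRect r;
    r.left   = -hScrollPos * fontWidth;             // scrolling right moves text left
    r.top    = (lineNo - vScrollPos) * fontHeight;  // row index within the window
    r.bottom = r.top + fontHeight;
    return r;
}
```

With an 8x16 pixel font, scrolled 10 lines down and 5 columns across, logical line 12 is the third row in the window: it starts 40 pixels left of the client area and occupies the rows from y = 32 to y = 48.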


Scrollbar Messages


We are now at a point where we are ready to add real scrolling to the TextView. We will concentrate just on the WM_VSCROLL message, as the WM_HSCROLL handler is virtually identical. If you look at the Platform SDK documentation for WM_VSCROLL, it states that the low-order word of wParam contains the scrolling code. This can be one of the following values: SB_TOP, SB_BOTTOM, SB_LINEUP, SB_LINEDOWN, SB_PAGEUP, SB_PAGEDOWN etc. Basically, it tells us which part of the scrollbar has been clicked by the mouse.


We'll start by looking at the SB_TOP and SB_BOTTOM messages. The handler for the vertical scrollbar will look like this:


LONG TextView::OnVScroll(UINT nSBCode, UINT nPos)
{
    switch(nSBCode)
    {
    case SB_TOP:
        m_nVScrollPos = 0;
        RefreshWindow();
        break;

    case SB_BOTTOM:
        m_nVScrollPos = m_nVScrollMax;
        RefreshWindow();
        break;

    ...
    }

    // update the scrollbar metrics
    SetupScrollbars();
    return 0;
}

The SB_TOP and SB_BOTTOM cases are incredibly simple. All that is required is for us to move the scroll position to either extreme of the scrolling range, and then redraw the entire window to reflect the change.


Reacting to the scrollbar "thumb" messages is also very simple:


    case SB_THUMBPOSITION: case SB_THUMBTRACK:
        m_nVScrollPos = GetTrackPos32(m_hWnd, SB_VERT);
        RefreshWindow();
        break;

...

GetTrackPos32 is just a simple wrapper function around GetScrollInfo:


LONG GetTrackPos32(HWND hwnd, int nBar)
{
    SCROLLINFO si = { sizeof(si), SIF_TRACKPOS };
    GetScrollInfo(hwnd, nBar, &si);
    return si.nTrackPos;
}

This is required because we want the full 32bit scrollbar value rather than the 16bit position that we get from the WM_VSCROLL message.
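
To make the 16-bit limitation concrete, the sketch below models what happens when a thumb position is squeezed into a 16-bit field. ThumbPosFromMessage is a hypothetical stand-in for the value the message would carry, not a Win32 function.

```cpp
#include <cassert>
#include <cstdint>

// Models the 16-bit position field of a scroll message:
// only the low 16 bits of the real position survive.
std::uint16_t ThumbPosFromMessage(std::int32_t realPos)
{
    assert(realPos >= 0);
    return static_cast<std::uint16_t>(realPos & 0xFFFF);
}
```

A file with fewer than 65536 lines is unaffected, but dragging the thumb to line 70000 would be reported as 4464 (70000 - 65536), which is why the handler asks GetScrollInfo for the full 32-bit nTrackPos instead.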


The remaining four cases (SB_LINEUP, SB_LINEDOWN, SB_PAGEUP and SB_PAGEDOWN) are slightly different:


    case SB_LINEUP:
        Scroll(0, -1);
        break;

    case SB_LINEDOWN:
        Scroll(0, 1);
        break;

    case SB_PAGEUP:
        Scroll(0, -m_nWindowLines);
        break;

    case SB_PAGEDOWN:
        Scroll(0, m_nWindowLines);
        break;

...

As you can see we have deferred the real work to a private scrolling function, which we will implement in just a moment:


VOID TextView::Scroll(int dx, int dy);

This function takes two signed integer parameters. Their purpose is to specify the direction and amount (in text units) in which to scroll the viewport. If you look at the message-handlers for SB_LINEUP/DOWN, you can see that to scroll up, we specify a value of -1, and to scroll down we specify a value of 1. Similarly, for SB_PAGEUP/DOWN, we scroll up/down by an entire page's worth of text (i.e. the number of lines currently in the window). We will let the Scroll function take care of making sure that we never scroll off the start/end of the document.


Scrolling the Viewport


Scrolling has been isolated inside a single function call. This has a number of advantages - understanding these will help you develop better programs in the future. The first (and most obvious) advantage is the modular design benefits it brings. We will be able to re-use the scrolling function in many other areas of the control (we have already used it twice for the vertical+horizontal scrollbar messages), but we will also make good use of this function when it comes to keyboard and mouse scrolling.


The biggest advantage is not at first obvious, but greatly simplifies our design. By designing a scrolling function that can pan the display around in two directions at once (horizontal and vertical) we can greatly enhance the mouse-scrolling interactions. Many controls resort to sending two scrolling messages simultaneously - one to scroll up/down, and one to scroll left/right. This has the potential to introduce scrolling glitches and artifacts, and is also a little clumsy in operation.


VOID TextView::Scroll(int dx, int dy)
{
    // make sure dx,dy don't scroll past the end of the document!

    // adjust the scrollbar thumb position
    m_nVScrollPos += dy;
    m_nHScrollPos += dx;

    if(dx != 0 || dy != 0)
    {
        // perform the scroll
        ScrollWindowEx(
            m_hWnd,
            -dx * m_nFontWidth,
            -dy * m_nFontHeight,
            NULL,
            NULL,
            0, 0, SW_INVALIDATE
            );

        SetupScrollbars();
    }
}

The function above is not quite complete - it is missing some important logic to ensure that dx and dy never allow us to scroll outside the boundaries of the current file. The source-code download does include this code however.
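
For readers who want a feel for the missing boundary logic without opening the download, here is one possible way to write it. This is a sketch and may well differ from the code in the zipfile; ClampScrollDelta is a hypothetical helper that would be applied to dx (with m_nHScrollPos and m_nHScrollMax) and to dy (with m_nVScrollPos and m_nVScrollMax) before the positions are adjusted.

```cpp
#include <algorithm>
#include <cassert>

// Reduce a requested scroll amount so that pos + delta stays within [0, maxPos].
// Returns the amount that can actually be scrolled (possibly zero).
int ClampScrollDelta(int pos, int delta, int maxPos)
{
    assert(maxPos >= 0 && pos >= 0 && pos <= maxPos);

    int target = std::min(std::max(pos + delta, 0), maxPos);
    return target - pos;
}
```

Because the clamped values can come out as zero, the existing "dx != 0 || dy != 0" test would then skip the ScrollWindowEx call whenever the view is already at the edge of the document.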


All that I really wanted to show was the use of the ScrollWindowEx API to update the display. The TextView::Scroll function is very simple at the moment but it does work well - i.e. it scrolls the window smoothly with no flickering. We will revisit this function in the next tutorial when we look at selection-scrolling with the mouse.


MouseWheel support


Whilst we are looking at scrolling we may as well implement support for mouse-wheel scrolling because it is incredibly simple to do. The WM_MOUSEWHEEL message was added with Windows 98 and Windows NT4 to implement mouse-wheel scrolling. In order to support this message we will need a new-ish Platform SDK installed, and we must also define the _WIN32_WINNT macro (to 0x0400 or higher) before #including <windows.h>, so that the message is available for us to use.


Our handler for WM_MOUSEWHEEL simply extracts the wheel-rotation delta (i.e. the forwards/backwards vector) and calls the real message-handler in our TextView class:


case WM_MOUSEWHEEL:
return ptv->OnMouseWheel((short)HIWORD(wParam));

The handler uses the SPI_GETWHEELSCROLLLINES system-setting to work out how many lines to scroll:


LONG TextView::OnMouseWheel(int nDelta)
{
    UINT uScrollLines = 3;

    // ask the system how many lines one wheel "notch" should scroll
    SystemParametersInfo(SPI_GETWHEELSCROLLLINES, 0, &uScrollLines, 0);

    if(uScrollLines <= 1)
        uScrollLines = 3;

    Scroll(0, (-nDelta / 120) * (int)uScrollLines);
    return 0;
}

That's all it takes, because we have made good use of the Scroll function we developed to do the hard work. See, I told you it would be useful ;-)
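
The delta arithmetic is easy to get backwards, so here it is on its own. The sketch below assumes the usual value of 120 per wheel notch; WheelLines is a hypothetical helper and the lines-per-notch value is passed in rather than read from the system.

```cpp
#include <cassert>

// Convert a wheel delta into a number of lines to scroll.
// A positive delta (wheel pushed away from the user) scrolls up,
// i.e. produces a negative line count.
int WheelLines(int nDelta, int linesPerNotch)
{
    assert(linesPerNotch >= 0);

    if (linesPerNotch <= 1)
        linesPerNotch = 3;      // same fallback as the handler above

    return (-nDelta / 120) * linesPerNotch;
}
```

One notch forwards at three lines per notch gives -3 (scroll up three lines); two notches backwards gives +6.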


Coming up in Part 4


I hope you will take the time to download the zipfile at the top of this tutorial and have a play with the latest incarnation of Neatpad. For such a small amount of code it is quite a useful tool already - it is a pretty effective text-document viewing application. Of course large files aren't handled at all but it should give you a good idea about where to proceed.


Mouse support will be the aim of the next tutorial. We will look at how to add focus and text-caret support, caret positioning with the mouse, and full text selection. This will require modifications to the drawing code of course. We will also implement "mouse scrolling" - i.e. where the text selection is extended beyond the window and the contents must be scrolled into view. The scrolling work we have done so far will make this task very simple.


 


Attachment: neatpad3.zip (46.63 KB)

Part 4 - Improved Drawing

Introduction

Welcome to the fourth installment of the "Design and Implementation of a Win32 Text Editor" article series! I realise that you were probably expecting a different tutorial this time around. However, it was only after starting to implement mouse selection and highlighting that I realised there wasn't proper support in the drawing "engine". I've decided to implement all of my text-output requirements in this one tutorial. So we will be covering tabbed-output, multi-coloured text (i.e. for syntax highlighting and selection highlighting) and the problem of ASCII control characters.

Example of coloured text with ASCII control characters.

Multi-coloured text

The first reason the drawing code needs revising is to support multi-coloured text. And I don't just mean syntax colouring (which will be covered in a later tutorial). Selection highlighting is the main reason text must be drawn using different colours, simply to distinguish between selected and unselected text. The image below illustrates a line of text with the middle portion selected using the default system colours.

Note that the text isn't really selected - it is merely drawn using two different colour schemes. Just have a play with selecting this paragraph of text in whatever browser you are using, and understand that the "selection" you are making is really just a segment of text drawn in a different colour. So all we need to do for our TextView is divide up our text into the appropriate segments and call SetTextColor & SetBkColor to set the colours before calling ExtTextOut - there really isn't any magic involved here.

A common mistake when first starting a control such as this is to try drawing the selected text in the WM_MOUSEMOVE handler, using intricate fill modes, inverted rectangles or transparent text. This really is the worst approach you could take. The next tutorial will show how to update a selection using the mouse, but all the drawing will be happening in our WM_PAINT handler.

The drawing logic must therefore be able to handle combinations of any colour - because we must think ahead about syntax and block highlighting. And because a text file does not store colour or style information, character colours must be computed separately and applied to the displayed text at runtime.

The attributes of a single character will be represented by the ATTR structure, shown below:

typedef struct
{
    COLORREF fg;     // foreground colour
    COLORREF bg;     // background colour
    ULONG    style;  // font and style information
} ATTR;

Whenever a line of text is retrieved from the TextDocument, this text must be "colourised" before displaying it. Font, style and colour attributes will be computed by a separate routine, ApplyTextAttributes.

void TextView::ApplyTextAttributes( ULONG  nLineNo, 
                                    ULONG  nOffset, 
                                    TCHAR *szText, 
                                    int    nTextLen, 
                                    ATTR  *attr)
{
   // loop over each character in the text-buffer
   for(int i = 0; i < nTextLen; i++)
   {
        // apply highlight colours 
        if(nOffset + i >= m_nSelectionStart && nOffset + i < m_nSelectionEnd)
        {
            attr[i].fg = GetColour(TXC_HIGHLIGHTTEXT);
            attr[i].bg = GetColour(TXC_HIGHLIGHT);
        }
        // normal text colours
        else
        {
           attr[i].fg = GetColour(TXC_FOREGROUND);
           attr[i].bg = GetColour(TXC_BACKGROUND);
        }

        attr[i].style = 0;
   }
}

This function will be called whenever some text is about to be drawn. Two parameters are passed in - nLineNo and nOffset, which describe where in the file the text occurs. ApplyTextAttributes will use these two parameters to determine how to "format" the text - currently as selected or unselected.


TCHAR   buff[100];
ATTR    attr[100];
ULONG   fileoff;

// get some text from the TextDocument
int len = m_pTextDoc->getline(nLineNo, 0, buff, 100, &fileoff);

// calculate colour and font information
ApplyTextAttributes(nLineNo, fileoff, buff, len, attr);

The formatting information is returned through the ATTR array which must be the same size as the szText buffer which holds the text. For the moment this function is sufficient to describe "normal" and "highlighted" text, but will not perform any form of syntax highlighting. I imagine that the function prototype may have to change slightly because syntax highlighting will have additional requirements.

Multi-font display

In part one of this series I wrote that I wanted to restrict the TextView to a single, fixed-width font. The entire reason for using fixed-width fonts was to keep the TextView code as simple as possible - especially the mouse-cursor placement logic. However even a fixed-width text display is not truly fixed-width, because of TAB characters which cause variable sized gaps in the lines of text. Having to deal with TABs means writing additional code to parse each line of text - to determine where in the line the mouse has been placed. My reasoning for supporting variable-width fonts is this: if we have more work to do, we may as well do it properly and handle all types of font.

To work with multiple fonts we need somewhere to keep them. The FONT structure (defined below) holds a handle to a font (a normal HFONT) along with important information about the font's dimensions in a TEXTMETRIC structure.

struct FONT
{
    HFONT       hFont;
    TEXTMETRIC  tm;

    int         nInternalLeading;
    int         nDescent;
};

An array of these fonts is stored as a member of the TextView class:


FONT  m_FontAttr[MAX_FONTS];
int m_nNumFonts;

The first element in this array (element zero) is always the default display font. Additional fonts can be added to this array but will not be used unless some form of syntax-highlighting is developed. I only imagine variations of fonts would be used in the control (i.e. normal and bold fonts) rather than completely different styles, but the TextView will support whatever fonts you decide on.



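There are some added complications which will arise from the use of multiple fonts, which are highlighted by the picture above. The problem can be divided into three cases:

  1. Baseline alignment - text drawn in different fonts must share a common baseline.

  2. Line height - each line must be tall enough for the tallest font in use.

  3. Background painting - the background must be filled to the full height of the line, even behind shorter fonts.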

To solve these problems we must first understand how a font's structure is described by GDI. The diagram below illustrates the dimensions of a font as described by the TEXTMETRIC structure:



The Baseline of a font is measured by the Ascent dimension. Therefore if we keep track of the largest Ascent out of all the fonts we are using, we can work out the amount to vertically offset shorter fonts based on their individual Ascent value. For example, suppose we are using two fonts - "Courier New" with an ascent of 21, and "Lucida" with an ascent of 18. When drawing text using the Lucida font, we would have to offset this text down by 3 pixels in order to align the baselines correctly.
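
The baseline calculation amounts to a subtraction once the largest ascent is known. The sketch below works on plain integers rather than TEXTMETRIC structures; MaxAscent and BaselineOffset are hypothetical helpers for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Largest ascent out of all the fonts in use.
int MaxAscent(const std::vector<int> &ascents)
{
    assert(!ascents.empty());
    return *std::max_element(ascents.begin(), ascents.end());
}

// Number of pixels to move a font's text down so that its baseline
// lines up with the baseline of the font with the largest ascent.
int BaselineOffset(int fontAscent, int maxAscent)
{
    assert(fontAscent <= maxAscent);
    return maxAscent - fontAscent;
}
```

Using the figures from the paragraph above, the largest ascent is 21, so text in the font with an ascent of 18 is drawn 3 pixels lower, and text in the tallest font is not offset at all.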


The second problem is easily solvable. Previous incarnations of Neatpad used the m_nFontHeight variable to keep track of the height of each line. This variable has now disappeared, to be replaced by a more meaningful value - m_nLineHeight. The TextView code demonstrates how this value is calculated in the OnSetFont member function:


LONG TextView::OnSetFont(HFONT hFont)
{
    m_nLineHeight = 0;

    // the line must be tall enough for the tallest font in use
    for(int i = 0; i < m_nNumFonts; i++)
    {
        m_nLineHeight = max(m_nLineHeight, m_FontAttr[i].tm.tmHeight);
    }

    return 0;
}

The final painting problem is easily solved because there is a handy API at our disposal. ExtTextOut allows us to specify a rectangular region to paint when drawing some text:


ExtTextOut(HDC hdc, int x, int y, UINT flags, RECT *rect, TCHAR *szText, ...);

The x and y parameters specify where the text is to be drawn. The rect parameter specifies the background rectangle. As long as this rectangular region fills the entire height of the line we can be sure to paint the background properly.



Something we must be careful with is how we tell ExtTextOut to fill the background. The ETO_OPAQUE flag can be used to achieve this effect. However, we must also specify ETO_CLIPPED, otherwise we will get some strange effects when using ClearType fonts. The picture above illustrates the problem - even though the string length was calculated using GetTextExtentPoint32 (and the background rectangle set to this size), when ExtTextOut was called the string "bleeds" outwards by 1 pixel either side - an oddity unique to the way ClearType text is displayed. Specifying ETO_CLIPPED prevents this problem.


Line Spacing


There is something that I want to mention at this point: people are often confused as to how the height of a font (or line of text) should be calculated. There are usually four different methods I see of obtaining a font's height, shown below:


TEXTMETRIC tm;
GetTextMetrics(hdc, &tm);

height1 = tm.tmHeight;
height2 = tm.tmHeight + tm.tmExternalLeading;
height3 = tm.tmHeight + tm.tmInternalLeading;
height4 = tm.tmHeight + tm.tmInternalLeading + tm.tmExternalLeading;

Method #1 is correct and is what many editors utilize, because the height of a font is exactly determined by the tmHeight value. However, method #2 is more correct for a text-editor. The External Leading of a font is the value (as assigned by the font's designer) that should be added in-between lines of text when displayed in a paragraph.


In other words, a multi-line display should use this value when displaying lines of text. The problem is, the external-leading is not taken into account when text is drawn using the standard Windows text routines (TextOut/DrawText). When a line of text is drawn using TextOut, the background is only filled to cover the tmHeight of the font and does not include the tmExternalLeading value. This results in gaps in-between lines of text, which makes people think that they are doing something wrong. The gaps between lines must be filled manually if the correct results are desired. Our multi-font display handles this extra line-spacing correctly by using ExtTextOut and enlarging the background rectangle appropriately.


As a final note, the third and fourth methods are wrong and should not be used to calculate the height of a font, or the height of a line of text. Study the diagram above to see how the various dimensions relate to each other.


As an extra feature I have also added a new message to the TextView - TXM_SETLINESPACING. This message is used to set extra spacing above and below each line of text, in addition to the font's external leading value.


TextView_SetLineSpacing(hwndTextView, 3, 2);

The example above instructs the text-view to add 3 pixels above each line, and 2 pixels below each line (resulting in each line of text being 5 pixels taller). I think people will find this feature extremely useful, as many fonts sometimes don't look very good unless extra line-spacing is included.
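
Putting the pieces of this section together, the height of a line could be computed along the following lines. This is only a sketch: it assumes the tallest font (tmHeight plus tmExternalLeading, i.e. method #2) sets the base height and that the two spacing values are simply added on top, and the names FontMetrics and LineHeight are invented for this example rather than taken from the TextView source.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The two TEXTMETRIC values that matter for line height.
struct FontMetrics
{
    int height;           // tmHeight
    int externalLeading;  // tmExternalLeading
};

// Height of a line of text, in pixels, for a set of fonts plus
// the extra spacing requested above and below each line.
int LineHeight(const std::vector<FontMetrics> &fonts, int spaceAbove, int spaceBelow)
{
    assert(spaceAbove >= 0 && spaceBelow >= 0);

    int tallest = 0;
    for (const FontMetrics &fm : fonts)
        tallest = std::max(tallest, fm.height + fm.externalLeading);

    return spaceAbove + tallest + spaceBelow;
}
```

With one font of height 16 and leading 1, another of height 20 and leading 0, and the 3/2 spacing from the example above, each line would be 25 pixels tall.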



The picture above illustrates the various components of a line of text in Neatpad.


ASCII Control characters


The other reason I decided to revise the drawing code was to take into account control-characters (i.e. ASCII control characters 0 - 31). Many simple text editors (such as regular Notepad) do not handle control characters at all. The NUL character (ASCII value 'zero') is never displayed for example - and how could it be - there is no visual representation for this character. The other control characters are equally problematic because they are designed to have function rather than appearance - a backspace or linefeed character also cannot be displayed for this reason. The solution is to duplicate what the Scintilla edit component does to display control characters:



The line of text above has two control-characters embedded in it - the NUL character (value zero) and the SYN character (value 22). I'd never seen this method of displaying control-characters until I looked at Scintilla, however it does seem like a neat way of dealing with the problem. Every character in the range 0 to 31 is displayed using inverted colours, with a rectangular border to enclose the text. These "graphic/textual" representations of the ASCII control characters stand out from the surrounding text and are a great visual aid in my opinion.
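
The text drawn inside each box is the standard ASCII mnemonic for the character. A lookup table is the obvious way to get at these; the sketch below is a hypothetical helper (CtrlCharName is not a function from the TextView source) using the standard ASCII names.

```cpp
#include <cassert>
#include <string>

// Standard ASCII mnemonic for a control character (0 - 31),
// or an empty string for any other character.
std::string CtrlCharName(unsigned char ch)
{
    static const char *const names[32] =
    {
        "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
        "BS",  "HT",  "LF",  "VT",  "FF",  "CR",  "SO",  "SI",
        "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
        "CAN", "EM",  "SUB", "ESC", "FS",  "GS",  "RS",  "US"
    };

    return (ch < 32) ? names[ch] : "";
}
```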


Enhancing the text-output has again introduced further complexity. The "new style" control-characters are not fixed-width - the borders introduce a 4-pixel overlap which means that the display can no longer be fixed-width only. Actually it was the introduction of these control-character "bitmaps" that prompted me to switch to a fully variable-width font display.


The images that make up the control-characters will be quite complicated to draw - more so than just doing a simple "TextOut". In an ideal world we could pre-calculate the bitmaps and store them in an off-screen buffer, and "BitBlt" them to the screen each time we needed to - this would be an efficient method to draw these characters.



Unfortunately nothing is that simple. We must be able to deal with control-characters in multiple colours (i.e. when they are selected/highlighted or part of a syntax string). We also need to take into account the different fonts, heights and styles that I want the TextView to handle. This means that our drawing code will have to be modified to deal with the occurrence of ASCII control codes and handle them appropriately with a special routine DrawCtrlChar.


The basic breakdown of the drawing operation is as follows (for black-on-white text):



  1. Fill the background white.

  2. Make a rectangle the width of the text and the same height as the letters.

  3. Draw this rectangle in black.

  4. Expand this rectangle by 1 pixel at the top and bottom, and contract it by 1 pixel at the left and right (i.e. make it 2 pixels taller and 2 pixels narrower overall).

  5. Draw a second black rectangle - this gives the illusion of a rounded rectangle.

  6. Draw the text in white.
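
The only arithmetic in the steps above is in step 4, sketched below with a plain rectangle type. Rect and SecondRect are names made up for this example; the real DrawCtrlChar routine works with Win32 RECTs and may be organised differently.

```cpp
#include <cassert>

// Minimal stand-in for a Win32 RECT.
struct Rect
{
    int left;
    int top;
    int right;
    int bottom;
};

// Step 4: grow the rectangle by one pixel at the top and bottom and
// shrink it by one pixel at the left and right. Filling both rectangles
// in the same colour leaves the four corner pixels empty, which gives
// the rounded look.
Rect SecondRect(const Rect &r)
{
    assert(r.right - r.left >= 2);
    return Rect{ r.left + 1, r.top - 1, r.right - 1, r.bottom + 1 };
}
```

For a first rectangle of (10, 5) - (40, 20), the second rectangle is (11, 4) - (39, 21).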


The TextView will of course support the option not to display control-characters this way - I can imagine some scenarios where this would be useful. In this case control-characters can be replaced by a single character '.', or just drawn as best they can.


NeatTextOut


At this point we need to develop a new text-output routine which can handle drawing an arbitrary segment of text (possibly containing control-characters), in a specific font and colour. NeatTextOut (prototype shown below) is similar to TabbedTextOut, with the exception that colour and style information is supplied in the ATTR structure. This function also returns the width of the displayed text (in pixels) so that the calling function can keep track of where to draw (and also when to stop drawing when the text goes outside the window).


int NeatTextOut( HDC    hdc,
                 int    xpos,
                 int    ypos,
                 TCHAR *szText,
                 int    nLen,
                 int    nTabOrigin,
                 ATTR  *attr
                );

I really don't see the value of including the source-code here for you to see, because it is obviously included in the TextView class and is fully documented for you to see there. Suffice to say, if you don't want to support control-characters you can quite easily replace this function with a normal TabbedTextOut.


Improved drawing engine


The whole thrust behind this tutorial is to be able to render text to the TextView window, using the ATTR structure as a guide for how the text should look. The drawing logic is shared between the following functions:



The first function - OnPaint - hasn't changed at all, whereas the PaintLine function has actually been reduced in functionality. All it does is work out where to draw the line, and then passes control to a further function, PaintText, to do the actual work. My reasoning behind this decision is to allow PaintLine to handle selection margins and line-numbers at some time in the future, and to allow PaintText to just draw the text and nothing else.


The PaintText function is therefore responsible for drawing an entire line of text. It retrieves text from the TextDocument and applies colour formatting to the text by calling ApplyTextAttributes before calling NeatTextOut to display the text.


The benefit of this design is that the drawing code is completely independent of the rest of the TextView. I like this approach because it means that text drawing is isolated into one single place. Any future syntax-highlighting, mouse selection, bookmarks or any other changes will not impact on the drawing code because it is completely dumb - all it knows how to do is draw text in particular styles. It will be up to the syntax highlighter to create these styles, but this is a separate, well-contained problem of its own.


Understand that when we draw text we have two parallel arrays - one holding the text and the other holding each character's colour attributes. Iterating through these arrays and drawing the text character by character will be very slow, so instead PaintText collects together as much text as possible before drawing it. What it does is identify consecutive characters that share the same colour and font, and outputs each "span" of text in one go. The result is very fast and in practice is not really any slower than doing a single TextOut. The important part of the PaintText routine is shown below.


int   i     = 0;        // current character position
int   lasti = 0;        // start of the current span
int   xtab  = xpos;     // tab origin passed to NeatTextOut

TCHAR buff[100];
ATTR  attr[100];

//
// Display the text by breaking it into spans of colour/style
//

for(i = 0; i <= len; i++)
{
    // if the colour or font changes, then need to output
    if(i == len ||
       attr[i].fg    != attr[lasti].fg ||
       attr[i].bg    != attr[lasti].bg ||
       attr[i].style != attr[lasti].style)
    {
        xpos += NeatTextOut(hdc, xpos, ypos, buff + lasti, i - lasti, xtab, &attr[lasti]);

        lasti = i;
    }
}
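
The span-splitting idea can be tried out without any drawing code. The sketch below follows the same loop structure as the excerpt above but collects (start, length) pairs instead of calling NeatTextOut; the Attr type and SplitIntoSpans function are simplified stand-ins written for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in for the ATTR structure.
struct Attr
{
    unsigned fg;
    unsigned bg;
    unsigned style;
};

static bool SameAttr(const Attr &a, const Attr &b)
{
    return a.fg == b.fg && a.bg == b.bg && a.style == b.style;
}

// Returns one (start, length) pair for each run of consecutive
// characters that share the same colours and style.
std::vector<std::pair<std::size_t, std::size_t>>
SplitIntoSpans(const std::vector<Attr> &attr)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::size_t lasti = 0;

    for (std::size_t i = 1; i <= attr.size(); i++)
    {
        // end of text, or the attributes changed: close the current span
        if (i == attr.size() || !SameAttr(attr[i], attr[lasti]))
        {
            spans.emplace_back(lasti, i - lasti);
            lasti = i;
        }
    }

    return spans;
}
```

A five-character line whose middle two characters are selected produces three spans - (0, 2), (2, 2) and (4, 1) - so only three output calls are needed rather than five.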

I'm not going to include any more code in this tutorial because the sourcecode download is well documented - as always ;-) and it is very easy to see what is going on.


Coming up in Part 5


Mouse input and selection will definitely be the subject of the next tutorial. It was necessary to delay this subject because without proper drawing support selection highlighting would not have been realistically possible. We will also implement "mouse scrolling" - i.e. where the text selection is extended beyond the window and the contents must be scrolled into view. The scrolling work we have done so far will make this task very simple.


Attachment: neatpad4.zip (53.47 KB)

Part 5 - Mouse Selection and Highlighting

Introduction

Mouse input has proven to be the most intricate and difficult to write part of Neatpad to date. It hasn't been helped by the fact that Neatpad now supports variable-width fonts, so in some ways I am still unsure if this extra complexity is a good thing or not (from a tutorial / learning point of view). However, if I had stuck with fixed-width fonts it would be quite a task for anyone to move from that limited capability to a fully variable-width display, so from that perspective I think I made the right decision...

Most of the complexity has been caused by the deliberate separation between the TextView and TextDocument classes. If the TextView had direct memory-access to the underlying file some of the code could be a little simpler. However if we are to move to a 4Gb file-editor we must hide the memory-management behind a TextDocument class and force a strict interface between GUI and file-management, accessing the file content in small chunks at a time. It makes our code cleaner, but a little more difficult. Anyway, without further ado let's look at what happens when a person clicks the mouse-pointer inside our Neatpad window.

Carets, Focus and Activation

In Windows, whenever the mouse is clicked inside a control, the expected behaviour is for the window to receive the input-focus and display some form of graphical feedback which indicates this change in focus. This is a consistent user-interface detail present throughout all of Windows.

The default behaviour in Windows is to send a WM_MOUSEACTIVATE message during this initial window activation. However, at no point does the target window actually receive the input focus due to this mouse-click - this is a detail not known or understood by many people. We must process this WM_MOUSEACTIVATE message manually, and set the focus ourselves:

case WM_MOUSEACTIVATE:
    return ptv->OnMouseActivate();

LONG TextView::OnMouseActivate()
{
    SetFocus(m_hWnd);
    return MA_ACTIVATE;
}

So now whenever the mouse is clicked inside our TextView, it will receive the input focus. More importantly though, the TextView will now receive an additional message - WM_SETFOCUS. It is during the processing of this message that we can create and show a caret.

case WM_SETFOCUS:
    return ptv->OnSetFocus();

LONG TextView::OnSetFocus()
{
    // default caret width, used if the system setting is unavailable
    DWORD nWidth = 2;

    SystemParametersInfo(SPI_GETCARETWIDTH, 0, &nWidth, 0);

    CreateCaret(m_hWnd, (HBITMAP)NULL, nWidth, m_nLineHeight);
    ShowCaret(m_hWnd);

    RefreshWindow();
    return 0;
}

The text-caret will be just the same as in Notepad or Visual Studio. We specify a NULL bitmap handle to create a solid blinking caret. It will be exactly 2 pixels wide unless we are on Windows 2000 and above, in which case the SystemParametersInfo request for SPI_GETCARETWIDTH will succeed and we will use that setting instead. The caret will also be the same height as a line of text. This is a pretty standard way to show a text caret in Windows.


Whenever the TextView loses the input-focus, it will receive the WM_KILLFOCUS message and we can hide the text-caret and delete it:


case WM_KILLFOCUS:
    return ptv->OnKillFocus();

LONG TextView::OnKillFocus()
{
    HideCaret(m_hWnd);
    DestroyCaret();

    RefreshWindow();
    return 0;
}

Note that each time the TextView receives or loses focus, the entire window is redrawn. This will enable us to draw the text-selection in different colours depending on whether the window has focus or not. We now move onto the problem of placing the caret in the correct position whenever the mouse is clicked inside the window. Before we go this far however, we have a decision to make.


TextView Coordinates - offset or coordinate based?


A decision must be made before we go any further. We must decide how to keep track of the current cursor position and selection start/end positions. We have two choices, described below.


The first option is to use a single 32bit file-offset to store all "coordinates". In other words, this is much like a HexEditor or how the existing EDIT control works. The advantage of this method is it is very simple to store, manage and "work with" file offsets - all data accesses are much simpler to perform because everything is offset/buffer based. The disadvantage is that a file-offset does not directly translate to the GUI - remember that we have to use a line buffer to access the underlying file. There is the potential here to introduce extra dependencies on the line-buffer management that may cause performance problems.


The second option is to use a "text-coordinate" system. This would require using two values - one for the line number, and one for the character-offset (or column number) within that line. The advantage of this technique is that it translates directly to what you see on the screen. Moving the caret around with the mouse and keyboard (i.e. user GUI actions) will be potentially a lot easier than the first method because we are using a more natural coordinate system for those actions. The disadvantages are two-fold. Firstly, data-access is now more difficult, and all data-accesses will be required to go through the line-buffer to find the true file-offset - again with the potential performance problems. The second disadvantage is the text-painting. It is far more cumbersome to perform comparisons with an "x,y" coordinate than it is a single value. For example, the decision as to whether to highlight a character based on an x,y coordinate is much more complicated than performing a simple integer comparison.


ULONG  m_nCursorOffset;
ULONG m_nSelectionStart;
ULONG m_nSelectionEnd;

I have experimented with method two in the past and to be honest it was just as complicated as the "pure" file-offset method - the complications and performance issues just get moved to a different place. For this reason the TextView will use the simpler "file-offset" method of storing text coordinates. This will hopefully isolate the difficult issues just inside the mouse-input handling and make the rest of our code easier to write.
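To make the painting argument concrete, here is a small sketch comparing the "is this character selected?" test under both schemes. The TextPos structure and both function names are hypothetical and exist only for this comparison - Neatpad itself only needs the offset version.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical text coordinate, for comparison only.
struct TextPos
{
    unsigned long line;
    unsigned long column;
};

// Offset-based: the selection may have been dragged backwards, so
// order the two ends first, then do a simple integer comparison.
bool IsSelectedByOffset(unsigned long offset,
                        unsigned long selStart, unsigned long selEnd)
{
    unsigned long lo = std::min(selStart, selEnd);
    unsigned long hi = std::max(selStart, selEnd);

    return offset >= lo && offset < hi;
}

// Coordinate-based: every comparison has to look at both fields.
static bool Before(const TextPos &a, const TextPos &b)
{
    return a.line < b.line || (a.line == b.line && a.column < b.column);
}

bool IsSelectedByPos(const TextPos &pos, TextPos selStart, TextPos selEnd)
{
    if (Before(selEnd, selStart))
        std::swap(selStart, selEnd);

    return !Before(pos, selStart) && Before(pos, selEnd);
}
```

Both tests give the same answer, but the offset version is a pair of integer comparisons that can sit directly inside the painting loop.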


Placing the Text Caret


When the user clicks the mouse in a "normal" edit control, the text-caret is placed at the beginning of the nearest character to where the mouse is. It seems a kind of obvious thing to do, but this simple operation is going to be the most complex task we have tackled to date.


The first useful message we receive when the user clicks the mouse is WM_LBUTTONDOWN. The handler for this message is shown below and is actually quite simple. I have structured it in such a way that it doesn't matter if we are using offset-based or text-based coordinates.


LONG TextView::OnLButtonDown(UINT nFlags, int mx, int my)
{
    ULONG nLineNo;
    ULONG nCharOff;
    ULONG nFileOff;
    int   xpos;

    // map the mouse-coordinates to a real file-offset-coordinate
    MouseCoordToFilePos(mx, my, &nLineNo, &nCharOff, &nFileOff, &xpos);

    SetCaretPos(xpos, (nLineNo - m_nVScrollPos) * m_nLineHeight);

    // erase any existing selection
    InvalidateRange(m_nSelectionStart, m_nSelectionEnd);

    // reset cursor and selection offsets to the same location
    m_nCursorOffset   = nFileOff;
    m_nSelectionStart = nFileOff;
    m_nSelectionEnd   = nFileOff;

    // set capture for mouse-move selection
    m_fMouseDown = true;
    SetCapture(m_hWnd);

    return 0;
}

There are two basic tasks that must be performed. The first is to identify which text-character within the file has been clicked - we need to retrieve the zero-based file-offset of this selected character (so we can keep track of the current cursor position). The second task is to position the text-caret next to the character we selected in step#1.


These complex operations have been isolated inside a single TextView member-function, MouseCoordToFilePos. The purpose of this function is to return the line number, character offset within that line (i.e. the column number), physical file offset, and finally the x-coordinate of the character as it appears on screen.


BOOL MouseCoordToFilePos (int    mx,            // mouse x-coordinate
                          int    my,            // mouse y-coordinate
                          ULONG *pnLineNo,      // [out] line number
                          ULONG *pnCharOffset,  // [out] column number
                          ULONG *pnFileOffset,  // [out] file-offset
                          int   *px);           // [out] adjusted x coordinate

I'm not sure if I really want to include the full code for this function here because it is likely I will keep tweaking it to try and make it as clear/simple as possible. However I do want to describe the basic operation so people can understand exactly what is involved.


Find the line-number


The first thing to do is work out which line of text is under the mouse. This is actually very straightforward because we are using fixed-height lines of text (i.e. every line of text is the same height).


ULONG lineno = (my / m_nLineHeight) + m_nVScrollPos;

If we were writing a word-processor or a HTML viewer the operation above would be a lot more complex, but for our simple text-editor it is sufficient to just divide the mouse Y-coordinate by the line-height. As you will see below, knowing what line we are currently looking at is very important because we must parse each specific line of text in order to calculate the cursor x-position.
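One detail worth sketching: once the mouse is captured, the y-coordinate can be negative or lie below the last line of the document. The helper below is a hedged illustration of the division plus some clamping. The function name, the lineCount parameter and the clamping policy are my own assumptions here, not necessarily what the Neatpad source does.

```cpp
#include <cassert>

// Map a mouse y-coordinate (client pixels) to a document line number.
// While the mouse is captured, my may be negative or beyond the last
// line, so the result is clamped to the range [0, lineCount - 1].
unsigned long LineFromMouseY(int my, int lineHeight,
                             unsigned long vScrollPos,
                             unsigned long lineCount)
{
    assert(lineHeight > 0);

    if (lineCount == 0)
        return 0;

    // above the window: treat as the first visible line
    if (my < 0)
        my = 0;

    unsigned long lineno =
        static_cast<unsigned long>(my / lineHeight) + vScrollPos;

    // below the end of the document: treat as the last line
    if (lineno >= lineCount)
        lineno = lineCount - 1;

    return lineno;
}
```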


The GetTextExtent problem


Once we know what line of text we are dealing with, we must work out which character within that line has been selected. Due to the possibility of tabs and control-characters occurring in the line of text (or if we are using variable-width fonts), we must parse the entire line of text (from the start) to work out which character falls under the mouse.


In Windows there are many APIs which tell you how big a string of text is (in pixels). However there are no APIs which perform the opposite conversion - i.e. how many characters fit within the specified space. Starting with Windows 2000 two new routines were introduced to address this problem - GetTextExtentPointI and GetTextExtentExPointI.


However we can't rely on just Windows 2000, and due to the GUI-file separation of the TextView/TextDocument (i.e. accessing the content in chunks), we must devise our own strategy to work out which character has been selected.


We start by accessing the line of text in fixed-sized blocks. As we proceed in this manner, the width of each block of text is calculated using a function NeatTextWidth (which is basically a wrapper around GetTextExtentPoint32, but also takes into account tabs and control-characters). The mouse x-coordinate is then checked against this block of text to see if it falls inside.



The picture above should hopefully illustrate this process fairly clearly. It's not at all accurate (and it isn't intended to be). All we want at this stage is a rough guess as to where the mouse has been placed. The code snippet below is basically what is happening here:


int curx    = 0;
int charoff = 0;

for(;;)
{
    // grab some text
    if((len = m_pTextDoc->getline(nLineNo, charoff, buf, TEXTBUFSIZE, &fileoff)) == 0)
        break;

    // find its width
    int width = NeatTextWidth(hdc, buf, len, -(curx % TABWIDTHPIXELS));

    // does the cursor fall within this segment?
    if(mx >= curx && mx < curx + width)
    {
        // narrow down the search
    }

    // move onto the next block
    curx    += width;
    charoff += len;
}

Once the correct block of text has been identified we must narrow down the search using what is essentially a "binary chop" algorithm.


Binary-Chop


We are now working at the single character level. For efficiency reasons I really don't want to call NeatTextWidth for each character in turn so the binary-chop (or binary-search) is perfect for this situation. The diagram below shows the algorithm in action.


We keep track of the search using two variables, low and high, which specify offsets into the character-buffer we are searching. These offsets start at the two extreme ends of the buffer and then move inwards, narrowing the search down each iteration.



For each iteration, we take the mid-point between low and high. We then compare the mouse coordinate to see which side of this mid-point the cursor falls. If it is to the left then we center in on this segment, and likewise if it is to the right of this midline. We will eventually get to the point where we have closed in on a single character (low is exactly one less than high) with the mouse somewhere in this small range of pixels.


int low   = 0;
int high  = len;
int lowx  = 0;
int highx = width;

while(low < high - 1)
{
    int newlen = (high - low) / 2;

    width = NeatTextWidth(hdc, buf + low, newlen, -lowx-curx);

    if(mx - curx < width + lowx)
    {
        high  = low + newlen;
        highx = lowx + width;
    }
    else
    {
        low   = low + newlen;
        lowx  = lowx + width;
    }
}

In computer science terms this method needs O(log n) iterations for a block of n characters - a binary search is a very efficient algorithm, and even for very long lines of text it should still be quite fast. For variable-width fonts there really isn't any other way to perform this type of thing. Obviously for a fixed-width font display, we could simply scan through the whole line in one go but I won't bother "over-optimizing" just yet because it will just clutter the code up.
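For anyone who wants to experiment with the binary-chop away from GDI, the sketch below runs the same search over a table of per-character pixel widths. TextWidth is only a stand-in for NeatTextWidth (it ignores tabs and simply sums the table), and the Hit structure and FindCharAt name are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for NeatTextWidth: sums per-character pixel widths.
// A real implementation would measure the run with one GDI call
// and expand tabs and control-characters.
static int TextWidth(const std::vector<int> &widths,
                     std::size_t start, std::size_t count)
{
    int total = 0;

    for (std::size_t i = start; i < start + count; i++)
        total += widths[i];

    return total;
}

struct Hit
{
    std::size_t index;  // character under the mouse
    int         lowx;   // x-coordinate of its left edge
    int         highx;  // x-coordinate of its right edge
};

// Find the character whose horizontal extent contains mx, where mx is
// relative to the start of the block and 0 <= mx < total block width.
Hit FindCharAt(const std::vector<int> &widths, int mx)
{
    assert(!widths.empty());

    std::size_t low   = 0;
    std::size_t high  = widths.size();
    int         lowx  = 0;
    int         highx = TextWidth(widths, 0, widths.size());

    while (low + 1 < high)
    {
        // measure the left half of the remaining range
        std::size_t half  = (high - low) / 2;
        int         width = TextWidth(widths, low, half);

        if (mx < lowx + width)
        {
            high  = low + half;
            highx = lowx + width;
        }
        else
        {
            low  += half;
            lowx += width;
        }
    }

    return Hit{ low, lowx, highx };
}
```

With widths of 5, 10, 3, 8 and 4 pixels, a click at x = 16 lands on the third character, whose edges are at 15 and 18.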


Snap to middle of character


It is at this point that we know which character has been clicked/selected with the mouse, and we have the x-coordinates of this character's starting and ending positions, in the lowx and highx variables:



The final detail to be implemented is to determine which side of the character to place the text-cursor (caret). Sloppy text-editors simply "round-down" to the start of each character (i.e. they just position the cursor at the start of the character by choosing the lowx coordinate). However a more natural "feel" can be achieved by using the center of each character to decide which side to place the caret.


if(mousepos > highx - FontWidth/2)
    caret = highx;
else
    caret = lowx;

Notice that the "TAB" character shown above has the selection-line positioned on the right-hand-side rather than the middle. This is a deliberate detail because I want to emulate the way Visual Studio places the cursor when it is positioned over a TAB (or any control-character which is also wider than a single letter).
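The effect on wide characters is easier to see with numbers. The helper below simply wraps the test shown above in a function - the SnapCaret name and its parameters are hypothetical, and fontWidth is assumed to be the width of an ordinary character.

```cpp
#include <cassert>

// Choose the caret x-position for a click at mx on a character that
// spans [lowx, highx). The threshold is measured back from the right
// edge, so for a character wider than fontWidth (such as a TAB) the
// caret only snaps right when the click is close to the right edge.
int SnapCaret(int mx, int lowx, int highx, int fontWidth)
{
    assert(lowx <= highx && fontWidth > 0);

    return (mx > highx - fontWidth / 2) ? highx : lowx;
}
```

For a 10-pixel letter spanning 20 to 30 the threshold sits at 25, which is the middle of the character. For a 40-pixel TAB spanning 20 to 60 the threshold sits at 55, so most of the TAB snaps the caret to its left edge.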


Selecting with the mouse


Now that we are able to position the text-caret under any character within the TextView, we are ready to move onto mouse-selection. Cast your memory back to what we do when we process WM_LBUTTONDOWN - the m_nCursorOffset, m_nSelectionStart and m_nSelectionEnd variables were all set to point to the same location.


To extend the selection as we drag the mouse we can handle the WM_MOUSEMOVE message. Again we retrieve the file-offset under the mouse using MouseCoordToFilePos. Now however, we can modify just the m_nSelectionEnd variable to "point" to this new offset, leaving m_nSelectionStart where it is. This has the effect of extending the selection. To have this reflected on the screen we must obviously redraw the display, and this is where the tricky part comes in.


LONG TextView::OnMouseMove(UINT nFlags, int x, int y)
{
    if(m_fMouseDown)
    {
        ULONG nLineNo, nCharOff, nFileOff;
        int   px;

        MouseCoordToFilePos(x, y, &nLineNo, &nCharOff, &nFileOff, &px);

        // update the area that has changed
        if(m_nSelectionEnd != nFileOff)
        {
            InvalidateRange(m_nSelectionEnd, nFileOff);

            SetCaretPos(px, (nLineNo - m_nVScrollPos) * m_nLineHeight);

            m_nSelectionEnd = nFileOff;
            RefreshWindow();
        }
    }

    return 0;
}

The WM_MOUSEMOVE handler (above) is quite similar to the WM_LBUTTONDOWN handler. We first translate the mouse x,y coordinates to a file-offset. Assuming that this offset is different to the current cursor offset, we can reposition the text-caret and redraw the area of text between the old selection-end point and the new cursor position.


Invalidate a range of text


The key to a good selection/highlighting strategy is to only redraw the bare minimum at a time - we must only paint where there are changes and never anywhere else - simply to avoid flicker rather than for performance. The InvalidateRange member-function does exactly this.


LONG TextView::InvalidateRange(ULONG nStart, ULONG nFinish);

The two parameters (nStart and nFinish) specify the range as file-offsets. It is the job of InvalidateRange to convert these two parameters to screen coordinates and cause just the specified region to be redrawn. This is the exact "opposite" to MouseCoordToFilePos - we are moving from file offsets back to screen coordinates now. You can use whatever coordinate system you want (file-offsets or text-coordinates) - it really doesn't matter. It is the concept of limiting the redraw to the change in selection that is important here.


Note that the display isn't actually redrawn using this function - the specified area is instead invalidated (using a series of calls to InvalidateRect), and the task of doing the actual drawing is left up to the WM_PAINT handler which we have already implemented.


Even though our WM_PAINT handler redraws whole lines at a time, we have only invalidated a specific area, so the update region for the window will clip our output and prevent us from drawing over areas that didn't change.



The picture above is meant to illustrate a "selection in progress". The selection has been made in four steps, starting with the lightest blue segment. The basic idea is to break the task up into lines, calling InvalidateRect for each span of text. What I am trying to show is that a selection-change could be a small segment on one line, or a change involving multiple lines at the same time.


The InvalidateRange function (however it is implemented) must be able to handle these different scenarios correctly. There is little point in including the function body here so it is time to finish this part of the tutorial.
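That said, the line-by-line splitting can be sketched without any of the file-offset machinery. The code below assumes the two ends of the range have already been converted to a line number and a pixel x-coordinate, and returns the rectangles that would be passed to InvalidateRect. The Rect and LinePos structures, the RangeToRects name and its parameters are all assumptions for illustration, not the Neatpad implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Rect { int left, top, right, bottom; };

// A position already converted from a file-offset to a line number
// and a pixel x-coordinate within that line.
struct LinePos { long line; int x; };

// Break the range between a and b into one rectangle per visible line.
std::vector<Rect> RangeToRects(LinePos a, LinePos b,
                               long vScrollPos, long visibleLines,
                               int lineHeight, int clientWidth)
{
    assert(lineHeight > 0);

    // the range may run backwards, so order the two ends first
    if (b.line < a.line || (b.line == a.line && b.x < a.x))
        std::swap(a, b);

    // only lines that are actually on screen need invalidating
    long first = std::max(a.line, vScrollPos);
    long last  = std::min(b.line, vScrollPos + visibleLines - 1);

    std::vector<Rect> rects;

    for (long line = first; line <= last; line++)
    {
        int top   = static_cast<int>(line - vScrollPos) * lineHeight;
        int left  = (line == a.line) ? a.x : 0;
        int right = (line == b.line) ? b.x : clientWidth;

        if (right > left)
            rects.push_back(Rect{ left, top, right, top + lineHeight });
    }

    return rects;
}
```

A change within one line gives a single small rectangle. A change across several lines gives a partial first line, full-width middle lines and a partial last line.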


Coming up in Part 6


Mouse selection is quite a tricky subject but I hope I have covered it adequately for people to appreciate what is required for such a task. Also remember that this is a simple text editor - imagine how much more difficult it would be to write a real word processor or web-browser which has to handle many different types of text and graphics.


Part 6 will take what we have implemented here and add "mouse scrolling" to the equation.


 


Attachment: neatpad5.zip (72.41 KB)

Part 6 - Scrolling with the Mouse

Introduction

Originally I had planned to keep all of the mouse-related concepts together in one tutorial. The problem is, does mouse-scrolling belong in the mouse tutorial (part#5) or the scrolling tutorial (part#3)? In the end I've decided to make it a separate topic by itself. Actually it has worked quite well because it gives a good sense of progression between basic mouse selection (with no scrolling) and a fully working selectable, scrollable control.

Scrolling with the mouse

The basic idea behind "mouse scrolling" is to cause the window to scroll when the mouse is dragged outside the window whilst making a selection. Almost any application that hosts a window with scrollbars will support mouse-scrolling in some form or fashion. Try it with whatever browser you are using right now - select some text and whilst holding the left-button down, drag the mouse outside the browser window. The content is automatically scrolled into view.

The very first step we must take is to detect when the mouse leaves the window and initiate some kind of appropriate scrolling action. Fortunately we already have TextView::Scroll (written in part#3) which we will be able to use for this tutorial.

Now, take a quick look at Notepad and you will see how it handles mouse-scrolling. Notepad (or rather the embedded EDIT control) works by detecting when the mouse leaves the confines of the window, but it only does this whilst the mouse is moving. As soon as the EDIT control stops receiving mouse-movement, the scrolling stops. So to achieve this very basic functionality we first add some code to our WM_MOUSEMOVE handler which can detect when the mouse is outside the TextView:

RECT  rect;
POINT pt = { mx, my };

// get the non-scrolling area (an even no. of lines)
GetClientRect(m_hWnd, &rect);
rect.bottom -= rect.bottom % m_nLineHeight;

// detect where mouse is
if(PtInRect(&rect, pt) == FALSE)
{
    // mouse is outside window, scroll in that direction
}

This basic method is not enough for a "grown-up" editor so we need to take this one step further. We will use a timer to generate regular scrolling events. However we won't just stop there - the other issue that we will look at is what to do about variable-speed scrolling. For example, many controls scroll their window-content slowly when the mouse is just outside the window, and speed up incrementally when the mouse gets further and further away. The picture below should help to illustrate this idea.

The inner red rectangle represents the client-area of the window. When the mouse is inside this region no scrolling is required. As the mouse moves further and further away (represented by the red arrows) the window should be scrolled in that direction at the appropriate speed.

// If mouse is within client area, we don't need to scroll
if(PtInRect(&rect, pt))
{
    if(m_nScrollTimer != 0)
    {
        KillTimer(m_hWnd, m_nScrollTimer);
        m_nScrollTimer = 0;
    }
}
// If mouse is outside window, start a timer in
// order to generate regular scrolling intervals
else
{
    if(m_nScrollTimer == 0)
    {
        m_nScrollCounter = 0;
        m_nScrollTimer   = SetTimer(m_hWnd, 1, 10, 0);
    }
}
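Knowing that the mouse is outside the window is only half the story - we also need to know in which direction, and how far. The sketch below works this out for each axis separately. The structure and function names are hypothetical, and plain structs stand in for RECT and POINT so the code compiles without the Windows headers.

```cpp
#include <cassert>

struct Point { int x, y; };
struct Rect  { int left, top, right, bottom; };

struct ScrollDir
{
    int dx, dy;        // -1, 0 or +1 on each axis
    int distX, distY;  // pixels outside the rectangle on each axis
};

// Work out which way to scroll, and how far outside the rectangle
// the mouse is. Like PtInRect, the right and bottom edges are treated
// as being outside the rectangle.
ScrollDir DirectionOutside(const Rect &rc, Point pt)
{
    assert(rc.left <= rc.right && rc.top <= rc.bottom);

    ScrollDir d = { 0, 0, 0, 0 };

    if (pt.x < rc.left)
    {
        d.dx    = -1;
        d.distX = rc.left - pt.x;
    }
    else if (pt.x >= rc.right)
    {
        d.dx    = 1;
        d.distX = pt.x - (rc.right - 1);
    }

    if (pt.y < rc.top)
    {
        d.dy    = -1;
        d.distY = rc.top - pt.y;
    }
    else if (pt.y >= rc.bottom)
    {
        d.dy    = 1;
        d.distY = pt.y - (rc.bottom - 1);
    }

    return d;
}
```

Keeping the two axes independent is what later allows fast vertical scrolling and slow horizontal scrolling at the same time.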

Variable speed scrolling

There are several methods which we can use to create variable-speed scrolling - however all methods are based around WM_TIMER and the SetTimer API. The first method is to reprogram the timer interval each time the mouse gets closer/further away from the window, resulting in a faster/slower rate of WM_TIMER messages being received by the window. When we receive a WM_TIMER, we scroll +1/-1 in whatever direction the mouse is. This is a little messy because it causes the timer to be reset each time SetTimer is called. See the following excerpt from MSDN on SetTimer:

"If the hWnd parameter is not NULL and the window specified by hWnd already has a timer with the value nIDEvent, then the existing timer is replaced by the new timer. When SetTimer replaces a timer, the timer is reset. Therefore, a message will be sent after the current time-out value elapses, but the previously set time-out value is ignored."

I'm unsure if this behaviour will cause us problems (i.e. stuttering movement) or not. But the main reason I want to avoid this technique is that it doesn't support 2-dimensional scrolling very well. Imagine the following scenario: the mouse is held quite a distance from the top of the TextView (the vertical scrolling direction), but is only just outside on the left side (the horizontal direction). Which direction do we choose to base our scrolling speed on? The answer is, we can't have both fast scrolling (vertically) and slow scrolling (horizontally) at the same time with this technique.

The next method is to therefore use a constant timer interval set at a slowish rate. As the mouse moves further from the window (in either dimension), instead of speeding up the scrolling, the distance that is scrolled is increased - i.e. slow scrolling would be 1-line-at-a-time using the slow timer interval, faster scrolling would be 3-lines-at-a-time, and so on. This is a perfectly reasonable method and you can observe many controls using it for their scrolling.

The last method is similar to the previous one, but instead we program the timer to have a high repeat rate (i.e. 10ms). This fast interval allows us to scroll the window quickly when the mouse is at its furthest from the window. And as we move the mouse closer, we can simply "skip" processing selected WM_TIMER messages. For example, we would process every WM_TIMER for full-speed scrolling, 1 out of 2 for half speed and 1 out of every 5 for very slow speeds. This results in smoother scrolling (we always scroll a line-at-a-time) but requires more CPU because more redrawing needs to be done. It has the advantage that we only ever scroll 1 line at a time so it is a little simpler from a coding and debugging point of view.

 

I opted for method#3 for Neatpad, simply because it was easier.
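The tick-skipping idea behind method#3 can be sketched in a few lines. The distance thresholds below are arbitrary values chosen for illustration, and the two function names are hypothetical. The counter plays the same role as a member such as m_nScrollCounter, which is reset to zero whenever the timer is started.

```cpp
#include <cassert>

// How many timer ticks to wait between one-line scrolls, given how
// far (in pixels) the mouse is outside the window on one axis.
// The thresholds are arbitrary and only for illustration.
int TicksPerScroll(int distance)
{
    assert(distance >= 0);

    if (distance > 48) return 1;   // full speed: scroll on every tick
    if (distance > 16) return 2;   // half speed: 1 tick out of 2
    return 5;                      // slow: 1 tick out of every 5
}

// Called once per WM_TIMER with a running tick counter.
// Returns true when this particular tick should scroll by one line.
bool ShouldScroll(unsigned &counter, int distance)
{
    unsigned divisor = static_cast<unsigned>(TicksPerScroll(distance));

    return (counter++ % divisor) == 0;
}
```

Because the test is made per axis, the vertical and horizontal directions can each skip a different number of ticks while sharing one fast timer.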

Avoiding flicker

Perhaps the reason that many text-editors exhibit flickering artifacts when they are scrolled up and down is because mouse-scrolling (in general) is so hard to get right. I do hate any sign of flickering though so it is very important from my point of view to ensure that Neatpad suffers no such problems.

There are two basic manifestations of "scroll flicker" that can be found in some applications. Both occur because the mouse selection and scrolling are not correctly synchronized. A lot of the time people don't notice these problems because most text files contain lines of varying length, and it isn't until you scroll a solid block of text that you begin to see what is going wrong. The animated gifs below hopefully illustrate the two problems - well, I hope they do because they took me ages to make! Once you've had enough of the animations just click the "stop" button in your browser window and they should stop.

The first flicker problem occurs because as the mouse moves outside of the window, the selection is redrawn, briefly extending outside of the window. Then the window is scrolled, bringing the end-of-selection back into view. It has the effect that the area to the left of the cursor appears to toggle between selected and unselected states. It's quite unsettling in my opinion but many controls exhibit this problem, including many standard Windows controls. We can prevent this problem by "clipping" the mouse coordinates to the edge of the window before we work out where the selection end-point should be.

The second case is almost the same as the first, except this time the scrolling happens first, which causes the whole display to scroll down. The selection is extended upwards after the scrolling, which again results in some fairly nasty flickering - this time because the unselected area on the first line is (wrongly) scrolled downwards and briefly appears as unselected text - then it is correctly repainted as highlighted. We can also prevent this problem by using clipping when we scroll the window - in this case, restricting the area that we scroll to a specific region which doesn't include the top/bottom lines.

It is not easy to get the mouse-selection and scrolling correctly synchronized. The basic problem is, we can't really treat these two actions as separate events because (obviously) they must occur at the same time. Adding to the problem is the fact that we have already written the scrolling and selection functionality, and it would be nice if we could still keep these as "separate" as possible so the code is kept clean. Let's have a brief recap of what we have so far:

The basic strategy for synchronizing mouse selection and scrolling is outlined below.

  1. Work out the correct clipping region, based on the direction we are scrolling. i.e. if we are scrolling up, we exclude the top line from being scrolled. If we scroll left, we exclude the left-most column. This can all happen inside the existing TextView::Scroll routine.
  2. Scroll the window using ScrollWindowEx, but use the clipping rectangle worked out in step#1.
  3. Work out the new cursor position and selection end points. The scrolling had the effect of modifying the scrollbar positions (because we used TextView::Scroll), so we must work out the new cursor position after this scrolling has taken place.
  4. Redraw the region that wasn't scrolled (i.e. the area outside of the clipping rectangle).

In other words, all we are really doing is scrolling a sub-region of the window and then manually repainting the area we didn't scroll once the cursor/selection endpoint has been placed appropriately. How we fit all this together will determine how successful we will be.
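Step#1 of the strategy above can be sketched as a small pure function. This assumes the convention used in this tutorial (scroll amounts in text units, negative meaning up/left) and a scroll of one line or column per step. The Rect structure and the ClipForScroll name are hypothetical and are not taken from the Neatpad source.

```cpp
#include <cassert>

struct Rect { int left, top, right, bottom; };

// Work out the clipping rectangle for a one-unit scroll of (dx, dy),
// given in text units where negative values mean up/left. The line or
// column that the selection is being dragged towards is left out of
// the scrolled area so that it can be repainted by hand afterwards.
Rect ClipForScroll(Rect client, int dx, int dy,
                   int fontWidth, int lineHeight)
{
    assert(fontWidth > 0 && lineHeight > 0);

    Rect clip = client;

    if (dy < 0)
        clip.top += lineHeight;      // scrolling up: protect the top line
    else if (dy > 0)
        clip.bottom -= lineHeight;   // scrolling down: protect the bottom line

    if (dx < 0)
        clip.left += fontWidth;      // scrolling left: protect the left column
    else if (dx > 0)
        clip.right -= fontWidth;     // scrolling right: protect the right column

    return clip;
}
```

The area of the client rectangle that lies outside the returned rectangle is exactly the part that step#4 has to redraw manually.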

ScrollWindowEx

Let's look at ScrollWindowEx so we know what scrolling facilities we have at our disposal:

int ScrollWindowEx(
  HWND   hWnd,            // handle to window
  int    dx,              // horizontal scrolling
  int    dy,              // vertical scrolling
  RECT * prcScroll,       // client area              [optional]
  RECT * prcClip,         // clipping rectangle       [optional]
  HRGN   hrgnUpdate,      // handle to update region  [optional]  
  RECT * prcUpdate,       // invalidated region       [optional]
  UINT   flags            // scrolling options
);

I have highlighted (in bold) the two optional parameters that we will be using - but first a brief recap of all parameters is probably appropriate at this stage:

Due to the way we are scrolling (we always scroll away from the region we are protecting) it doesn't matter if we use the prcScroll or prcClip rectangles - both rectangles would hold the same values and the effect will be identical. In the sourcecode I have elected to use prcClip just because it is a little more obvious what it is being used for.

Scrolling Example

At this point we need to look at a scrolling example so that we are sure that we understand exactly what is happening with clipping rectangles, and update/invalid regions.

The picture above shows the Neatpad window before any scrolling has taken place. The Window is going to be scrolled upwards and left at the same time (i.e. -1, -1 in text-character-units). However, this means that we scroll the content down one line and right one character position (in order to expose new content at the top/left edges). In other words, the dx and dy parameters to ScrollWindowEx are positive.

Before the ScrollWindowEx function is invoked we define the clipping rectangle to be used. The cross-hatch shaded area represents content outside of the clipping rectangle - although this region is never invalidated or modified by the scrolling it is basically "dirty" and must be redrawn manually. The clipping rectangle we pass to ScrollWindowEx covers the client area of content/text that is not shaded. In this example, we took the top/left corner of the client-area and offset it to create the clipping rectangle.

The window after ScrollWindowEx. You should be able to notice that the content has scrolled down one line and right by one character. The cross-hatch/shaded area stays where it is because it is outside of the clipping rectangle. The inverted region represents the area that became invalid after the scrolling took place. I have chosen inverted colours purely to illustrate this area - in reality these pixels do not get modified by ScrollWindowEx and are only updated because the SW_INVALIDATE flag is specified.

Note: If we were using the hrgnUpdate parameter, this HRGN object would be modified to exactly fit the area represented by the inverted colours. It is quite an odd shape (an upside-down "L"), and for this example ScrollWindowEx returns COMPLEXREGION even though we're not using hrgnUpdate.

Both the cross-hatch and inverted regions need to be updated. For simplicity's sake we will use SW_INVALIDATE when we call ScrollWindowEx which will invalidate (and consequently update) the "inverted" region. However this leaves the cross-hatch region - we must manually create a HRGN which describes this area and call InvalidateRgn at some later point to redraw it. See below for how this is achieved.

Synchronized Scrolling

The first thing we must do is develop support for this "clipped" scrolling. I have rewritten the existing TextView::Scroll routine and called it TextView::ScrollRgn. It has an extra parameter now which specifies whether or not to return a handle to the region that is invalid after the scrolling (and indirectly controls the clipping behaviour).

When fReturnUpdateRgn is true, the scrolling is performed with the appropriate clipping, and a HRGN is returned to the caller. When fReturnUpdateRgn is false, the entire window is scrolled normally (i.e. with no clipping area defined).

HRGN TextView::ScrollRgn(int dx, int dy, bool fReturnUpdateRgn)
{
    RECT clip;
    GetClientRect(m_hWnd, &clip);

    // adjust the clipping rectangle if fReturnUpdateRgn is true

    // do the scrolling
    ScrollWindowEx(m_hWnd,
        -dx * m_nFontWidth,
        -dy * m_nFontHeight,
        NULL,       // scroll the entire window
        &clip,      // clip the non-scrolling part
        NULL,
        NULL,
        SW_INVALIDATE
        );

    if(fReturnUpdateRgn)
    {
        RECT client;
        GetClientRect(m_hWnd, &client);

        HRGN hrgnClient = CreateRectRgnIndirect(&client);
        HRGN hrgnUpdate = CreateRectRgnIndirect(&clip);

        // create a region that represents the area outside the
        // clipping rectangle (i.e. the part that is never scrolled)
        CombineRgn(hrgnUpdate, hrgnClient, hrgnUpdate, RGN_XOR);

        DeleteObject(hrgnClient);

        return hrgnUpdate;
    }

    return NULL;
}

It is important that we preserve the existing TextView::Scroll functionality so we make this a wrapper function around TextView::ScrollRgn and specify false for fReturnUpdateRgn (i.e. make ScrollRgn scroll the entire window as normal).


VOID TextView::Scroll(int dx, int dy)
{
    ScrollRgn(dx, dy, false);
}

One thing I should mention is the units used by Scroll and ScrollRgn. These are always "text" units (i.e. line/character based) rather than pixel coordinates. The ScrollRgn function converts these to pixel coordinates when it is time to scroll. Also understand that when I write Scroll(-1, -1) this scrolls the document up/left - however the screen content is scrolled down/right to achieve this.
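As a rough illustration of this sign convention, the sketch below converts a scroll request in text units into the pixel offsets that would be handed to ScrollWindowEx. The ScrollDelta type and the TextToPixelScroll name are invented for this example, and the two font parameters stand in for m_nFontWidth and m_nFontHeight.

```cpp
#include <cassert>

// Pixel offsets as they would be passed to ScrollWindowEx (hypothetical helper type).
struct ScrollDelta
{
    int xPixels;
    int yPixels;
};

// Convert a scroll request in text units (characters/lines) into pixel offsets.
// Scrolling the document by +dx characters moves the on-screen content by
// -dx * fontWidth pixels, hence the negation.
ScrollDelta TextToPixelScroll(int dx, int dy, int fontWidth, int fontHeight)
{
    assert(fontWidth > 0 && fontHeight > 0);
    return ScrollDelta{ -dx * fontWidth, -dy * fontHeight };
}
```

With an 8x16 pixel font, TextToPixelScroll(-1, -1, 8, 16) yields positive offsets in both directions, i.e. the screen content moves right and down.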


The function below is the "almost" full implementation of the TextView::OnTimer routine. The only part that has been omitted is the code that calculates what the values for dx and dy should be - it's clearer without this going on, so you can look in the sourcecode download to see how it is done.


LONG TextView::OnTimer()
{
    // [omitted] work out scrolling increments
    int dx = 0, dy = 0;

    // do the scroll but return the region to be manually painted
    HRGN hrgnUpdate = ScrollRgn(dx, dy, true);

    if(hrgnUpdate != NULL)
    {
        // do a "fake" WM_MOUSEMOVE to get the new cursor position
        OnMouseMove(0, mouse_x, mouse_y);

        // manually repaint the update region
        InvalidateRgn(m_hWnd, hrgnUpdate, FALSE);
        DeleteObject(hrgnUpdate);

        UpdateWindow(m_hWnd);
    }

    return 0;
}

The function fulfills all our criteria for smooth, synchronized scrolling. Firstly the window content is scrolled - however a HRGN is returned which specifies an invalid region that needs manually repainting. The text-caret and selection offsets are computed after doing the scroll (when the m_nxScrollPos variables become valid) - this is achieved by manually calling OnMouseMove and reusing the code that was already there. Just in case the mouse was moved whilst the timer went off, we also take advantage of the fact that OnMouseMove will also invalidate any area affected by selection change.


The very last thing to occur is to manually invalidate the region returned by ScrollRgn. The invalid window areas are finally repainted when UpdateWindow is called, using the updated cursor and selection offsets.


Neatpad additions


Most of this tutorial series will be focussed on the TextView component of Neatpad. I don't intend to cover the development of Neatpad in any great detail unless it is directly related to the support of the TextView. Instead I'll just give a brief mention of what has been added and let the readers study the sourcecode at their leisure.



This time around an Options dialog has been implemented which allows you to select the font and colours used by Neatpad. The dialog is fairly complete and the settings are saved to the registry each time Neatpad exits. The second options-pane doesn't do anything yet, but I have left it in as a "todo" which will be implemented at some point in the future. The code can be found in the Neatpad directory, in the Options.c and OptionsFont.c files.


A new message (TXM_SETCOLOR) has been added to the TextView in order to support programmatic control of colour settings:


#define TXM_SETCOLOR (TXM_BASE + 5)

// wParam = TXC_xxx index value
// lParam = RGB color

#define TextView_SetColor(hwndTV, nIndex, rgbColor)

To send the message, use the TextView_SetColor macro. There are two parameters in addition to the window-handle. nIndex is a zero-based value taken from the TXC_xxx range of numbers. rgbColor is (you guessed it) a COLORREF RGB colour. For example, to set the selection background colour, use the following code:


TextView_SetColor(hwndTV, TXC_HIGHLIGHT, RGB(200,100,240));

There is one nice feature about this message which I hope people will like. If you want to set the colour to one of the predefined system colours (i.e. COLOR_WINDOWTEXT used with GetSysColor), then use the SYSCOL macro (defined in TextView.h):


TextView_SetColor(hwndTV, TXC_HIGHLIGHT, SYSCOL(COLOR_3DFACE));

The SYSCOL macro creates a "special" RGB value which the TextView recognises as a system-colour, not a plain RGB value. You only need to set the colour this way once, and subsequent changes in system colour schemes are automatically reflected in the TextView.
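The exact encoding lives in TextView.h, so treat the following as a sketch of one plausible scheme rather than the real definition: it assumes the "special" value is made by setting the top bit of the colour, which a plain RGB value never uses. ColorRef, MakeRgb, SysCol and ResolveColor are stand-in names chosen so the example compiles without windows.h, and the lookup callback plays the role of GetSysColor.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t ColorRef;            // stand-in for COLORREF (0x00BBGGRR)

const ColorRef SYSCOL_FLAG = 0x80000000u;  // assumed marker bit, unused by plain RGB values

// Build a plain colour with the same byte layout as the Win32 RGB macro.
inline ColorRef MakeRgb(unsigned r, unsigned g, unsigned b)
{
    return (r & 0xFFu) | ((g & 0xFFu) << 8) | ((b & 0xFFu) << 16);
}

// Tag a system-colour index so it can be told apart from a plain RGB value.
inline ColorRef SysCol(int index)
{
    return SYSCOL_FLAG | static_cast<ColorRef>(index);
}

// Resolve a stored colour at paint time. Tagged values are looked up on every
// call, which is why a change of system colour scheme would be picked up
// without setting the colour again. 'lookup' stands in for GetSysColor.
inline ColorRef ResolveColor(ColorRef stored, ColorRef (*lookup)(int))
{
    assert(lookup != nullptr);
    if (stored & SYSCOL_FLAG)
        return lookup(static_cast<int>(stored & ~SYSCOL_FLAG));
    return stored;
}
```

The point of the design is that the control stores the tagged index, not the colour it currently maps to, and only resolves it when it is about to draw.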


Coming up in Part 7


Hopefully I have given a good overview and explanation of how to scroll using the mouse-selection. Although it's not a particularly technical subject, it is fairly difficult to iron out the finer subtleties of mouse and timer interactions, and to be able to visualise these interactions whilst designing something like this (and then write about it!).


Onto the next tutorial then. Part 7 will be something a little simpler (I need to give my brain a rest!), so I will be implementing support for borders and margins. Margins will give us the ability to show line numbers and custom icons in a "selection" area (i.e. like Visual Studio uses for placing breakpoints). I also want to provide a "printer margin" on the right-side so that the printable area of a text document is distinguishable from any text that might get clipped due to printing.


You've probably noticed that so far all of the tutorials have been focussed around the graphical / user-interface aspects of Neatpad. This is a deliberate tactic as I feel it is important to have a good foundation before diving into complicated memory/file-management techniques. From experience I know it is easy to get distracted because of a half-finished GUI so I want to get all the GUI details completed.


Attachment: neatpad6.zip (81.73 KB)

Part 7 - Margins and Long Lines

Introduction

This will probably be quite a short tutorial as the subject of margins is really quite simple to implement. This time around we will look at implementing a selection margin (complete with full-line selection), line numbers, line-indicator icons (e.g. like the breakpoint bitmaps in Visual Studio), and lastly the problem of long-line display.

Drawing a margin

The main screenshot shows Neatpad with a selection-margin, line-numbers, bitmapped bookmarks and long-line highlights. Although it looks quite complicated, drawing a margin is really easy, as long as you understand that it is just a rectangular area to the left of the display that is drawn and scrolled differently.

I've highlighted the picture above to illustrate that even though we've added a vertical margin, lines are still drawn horizontally one-by-one. However the line-drawing process (the TextView::PaintLine function) has changed to allow for the margins.

First of all the margin is drawn using the new function TextView::PaintMargin:

int TextView::PaintMargin(HDC hdc, ULONG nLineNo, RECT *margin)

The area taken by the margin is then clipped to prevent anything drawing over it. Next, the text is drawn as normal - but offset to the right of the margin. The clipping is important because it ensures that when we scroll horizontally the lines of text don't overdraw the margin and instead "disappear" behind the margin. Basically the process looks like this:

void TextView::PaintLine(HDC hdc, ULONG nLineNo)
{
    RECT rect;
    // work out where to draw the line

    // handle the margins
    if(LeftMarginWidth() > 0)
    {
        RECT margin;

        // work out the margin coordinates

        // paint the margin
        PaintMargin(hdc, nLineNo, &margin);

        // clip the margin so the text doesn't draw over it
        ExcludeClipRect(hdc, margin.left, margin.top, margin.right, margin.bottom);

        // offset the text placement
        rect.left += LeftMarginWidth();
    }

    // paint the text as normal
    PaintText(hdc, nLineNo, &rect);
}

The key point to understand is that the normal line-text must be offset to the right to take into account the margin. It's really very simple so just take a look at the sourcecode download to see it in action.


Two more important things to mention: I have also had to modify the scrolling routine so that the margin is not scrolled when the main text is scrolled. This was very simple as the clipping rectangle we specify in ScrollWindowEx is simply adjusted so that the margin does not get included.


HRGN TextView::ScrollRgn(int dx, int dy, bool fReturnUpdateRgn)
{
    ...

    // take margin into account
    clip.left += LeftMarginWidth();

    ...
}

The other modification was the mouse-input (selection) code. The cursor-position must be adjusted so that the margin is again taken into account. I really don't want to include any code for this as you should be getting the idea by now. Basically, any x-coordinate is shifted to the right by the size of the margin before any input or drawing takes place.


Line bitmaps


Just a quick word about the line-indicator bitmaps you can see. I've added a new ImageList member variable to the TextView class which can hold a user-specified collection of bitmaps:


HIMAGELIST m_hImageList;

The image-list can be set using the new TextView TXM_SETIMAGELIST message (or the TextView_SetImageList macro).


HIMAGELIST hImageList = ImageList_LoadImage(...);

TextView_SetImageList(hwndTextView, hImageList);

Using an image-list is the simplest way to work with images as it manages the bitmap memory, and also provides drawing support as well - so storing/drawing bitmaps is really easy.


The difficult part comes when we draw each line. How do we know what image to place in the margin-area? For text-editors which use an array or linked-list of lines this is really easy - because additional information can be easily stored for each line entry in the editor. However for our Neatpad design this is not possible because we have to support large-file editing at some point - and having a line-buffer for a 4GB file would not be a good idea.


For this reason I have used a separate array which holds only those lines that have bitmaps associated with them. The array is always sorted so that we can use a binary-search to quickly determine if a specific line has any bitmap associated with it.


typedef struct
{
    ULONG nLineNo;
    ULONG nImageIdx;

} LINEINFO;

LINEINFO m_LineInfo[MAX_LINE_INFO];

Each time a line is drawn the LINEINFO array is searched using the TextView::GetLineInfo function:


LINEINFO* TextView::GetLineInfo(ULONG nLineNo)
{
    LINEINFO key = { nLineNo, 0 };

    // perform the binary search
    return (LINEINFO *) bsearch(
        &key,
        m_LineInfo,
        m_nLineInfoCount,
        sizeof(LINEINFO),
        (COMPAREPROC)CompareLineInfo
        );
}

This function returns a pointer to the appropriate LINEINFO structure if successful, or NULL if there is no stored information for the specified line.
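The CompareLineInfo callback is not shown above, so here is a minimal sketch of what such a comparison could look like, together with a free-function version of the lookup so that it can be tried outside the TextView class. FindLineInfo and the ULONG typedef are stand-ins for this example only; the real code is in the sourcecode download.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

typedef unsigned long ULONG;   // stand-in for the Windows typedef

struct LINEINFO
{
    ULONG nLineNo;
    ULONG nImageIdx;
};

// Three-way comparison on the line number only, in the form bsearch expects.
// Comparing explicitly avoids the wrap-around that "a - b" risks with unsigned values.
int CompareLineInfo(const void *a, const void *b)
{
    const LINEINFO *lhs = static_cast<const LINEINFO *>(a);
    const LINEINFO *rhs = static_cast<const LINEINFO *>(b);

    if (lhs->nLineNo < rhs->nLineNo) return -1;
    if (lhs->nLineNo > rhs->nLineNo) return 1;
    return 0;
}

// Look up a line in 'table', which must already be sorted by nLineNo.
// Returns NULL if the line has no entry.
const LINEINFO *FindLineInfo(const LINEINFO *table, std::size_t count, ULONG nLineNo)
{
    assert(table != nullptr);
    LINEINFO key = { nLineNo, 0 };
    return static_cast<const LINEINFO *>(
        std::bsearch(&key, table, count, sizeof(LINEINFO), CompareLineInfo));
}
```

Because only nLineNo takes part in the comparison, the nImageIdx value in the search key is irrelevant, which is why the key can simply be initialised with zero.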


I anticipate that more information could be stored about lines such as a whole-line highlight colour, bookmarks, annotations etc. However for now I have just included support for an image-index. The images for each line can be set using another new TextView message, TXM_SETLINEIMAGE:


TextView_SetLineImage(hwndTextView, nLineNo, nImageIdx);
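For the binary search to keep working, whatever handles TXM_SETLINEIMAGE has to keep the array sorted as entries are added. The sketch below shows one way that could be done with a fixed-size array; the LineInfoTable wrapper, the SetLineImage free function and the capacity of 128 are assumptions made for this example, not the code from the download.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

typedef unsigned long ULONG;   // stand-in for the Windows typedef

struct LINEINFO
{
    ULONG nLineNo;
    ULONG nImageIdx;
};

const std::size_t MAX_LINE_INFO = 128;   // assumed capacity

struct LineInfoTable
{
    LINEINFO    items[MAX_LINE_INFO];
    std::size_t count = 0;
};

// Ordering used by lower_bound: compares a stored entry against a line number.
static bool LineLess(const LINEINFO &entry, ULONG nLineNo)
{
    return entry.nLineNo < nLineNo;
}

// Set (or replace) the image for a line, keeping 'items' sorted by line number.
// Returns false if the table is full.
bool SetLineImage(LineInfoTable &table, ULONG nLineNo, ULONG nImageIdx)
{
    assert(table.count <= MAX_LINE_INFO);

    LINEINFO *end = table.items + table.count;
    LINEINFO *pos = std::lower_bound(table.items, end, nLineNo, LineLess);

    if (pos != end && pos->nLineNo == nLineNo)
    {
        pos->nImageIdx = nImageIdx;          // line already present: just update it
        return true;
    }

    if (table.count == MAX_LINE_INFO)
        return false;

    std::copy_backward(pos, end, end + 1);   // shift the tail up by one slot
    pos->nLineNo   = nLineNo;
    pos->nImageIdx = nImageIdx;
    ++table.count;
    return true;
}
```

Insertion is linear in the number of stored entries, which should be acceptable as long as only a small number of lines carry images.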

Line-selection and mouse input


Now that we have a margin to the left of the main text display we can use this area to initiate line-based selections.



To allow for this line-based selection method, we need to keep track of more than just the fact that we are making a selection. The previous boolean m_fSelection (which was just used to indicate if a selection was in progress or not) has been replaced by a new variable:


 SELMODE m_nSelectionMode;

m_nSelectionMode is a SELMODE enumeration with the following values:


enum SELMODE
{
    SELMODE_NONE,
    SELMODE_NORMAL,
    SELMODE_MARGIN
};

So at the moment we support two types of selection - "normal" and "margin". At some point in the future this could be extended to support other types of selection such as column and block selections. Of course the mouse-routines have been modified to understand the new types of selection.
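To make the idea concrete, here is a small sketch of how a mouse-down handler might pick the selection mode, and how an x-coordinate could be adjusted for the margin before it is turned into a character offset. SelectionModeFromClick and TextXFromClientX are hypothetical helpers written for this example; the real mouse routines are in the sourcecode download.

```cpp
#include <cassert>

enum SELMODE
{
    SELMODE_NONE,
    SELMODE_NORMAL,
    SELMODE_MARGIN
};

// Decide which kind of selection a mouse-down should start.
// 'mouseX' is in client coordinates; 'leftMarginWidth' is the margin width in pixels.
SELMODE SelectionModeFromClick(int mouseX, int leftMarginWidth)
{
    assert(leftMarginWidth >= 0);
    return (mouseX < leftMarginWidth) ? SELMODE_MARGIN : SELMODE_NORMAL;
}

// Translate a client x-coordinate into one relative to the start of the text,
// i.e. with the margin taken out. Positions inside the margin clamp to zero.
int TextXFromClientX(int mouseX, int leftMarginWidth)
{
    int x = mouseX - leftMarginWidth;
    return (x < 0) ? 0 : x;
}
```

For a 40 pixel margin, a click at x = 10 would start a margin (whole-line) selection, whereas a click at x = 100 would start a normal selection at text position 60.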


Highlighting long lines


One feature which I find particularly useful is the highlighting of long lines. I can still remember the Borland C++ 4.0 IDE for Windows 3.1 which featured a single vertical-line margin to indicate where column-80 was. I actually found this a little tacky but I wanted to create a similar effect.


The reason for highlighting long lines is to provide a way to indicate to the user when a line of text becomes too long. This is most useful for programmers who don't want their text to "wrap" when it is printed out - so the aim is to keep all lines of text under a certain limit.


There are basically two ways of implementing long-line highlights:



In order to support this new long-line display I've had to extend the TextView::ApplyTextAttributes member-function with a new parameter - &nColumn :


int TextView::ApplyTextAttributes(ULONG nLineNo, ULONG nOffset, ULONG &nColumn, TCHAR *szText, int nTextLen, ATTR *attr)

Notice that nColumn is a C++ reference. It is continually updated by ApplyTextAttributes to keep track of the current column-position (we need a C++ reference so that the value is preserved through successive calls). Once nColumn reaches a certain value, ApplyTextAttributes will use a different default background colour for the text.
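The following sketch isolates just the column-tracking part of that idea. It is not the real ApplyTextAttributes: the Attr structure holds only a background colour, tabs are ignored, and the function name and parameters are made up for the example. What it does show is why the column is passed by reference, so that a line processed in several pieces carries on counting from where the previous call stopped.

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned long ULONG;   // stand-in for the Windows typedef

// Minimal stand-in for the ATTR structure: only a background colour here.
struct Attr
{
    ULONG bg;
};

// Fill in attr[0..textLen) for one run of text. 'column' is advanced as we go,
// so successive calls for the same line continue from the previous position.
// Characters at or beyond 'longLineLimit' get the long-line background.
void ApplyLongLineAttributes(ULONG &column, std::size_t textLen, ULONG longLineLimit,
                             ULONG normalBg, ULONG longLineBg, Attr *attr)
{
    assert(attr != nullptr || textLen == 0);

    for (std::size_t i = 0; i < textLen; ++i)
    {
        attr[i].bg = (column >= longLineLimit) ? longLineBg : normalBg;
        ++column;
    }
}
```

The caller would reset the column to zero at the start of each line and then pass the same variable to every call made for that line.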


A new message (TXM_SETLONGLINE) has been added to the TextView to allow the long-line limit to be programmatically altered. The TextView_SetLongLine macro can be used to set this value:


TextView_SetLongLine(hwndTextView, 80);

64bit support


This is probably a little late in the day (it should have happened from day#1) but I've started to make the Neatpad and TextView projects 64bit compatible. You will therefore need to use a recent Platform SDK when compiling Neatpad in order to get the new mixed 32bit/64bit definitions. This has also resulted so far in the following change:



To test that these changes actually worked I compiled the project using the Microsoft 64bit compiler, targeted for the IA64 architecture. Follow the steps below to duplicate this:



There are still a lot of changes to make (ULONG vs ULONG64 issues) and as I don't have access to a 64bit machine currently I won't be able to make any more modifications until I can test properly. However I hope to have Neatpad fully 64bit compatible before the end of the series!


Conclusion


This tutorial was a little off-track as margins and long-line highlights are not really that important for a text-editor's design. Anyhow I wanted to get it out of the way so I could concentrate on the core functionality.


Coming up in Part 8 will be support for UTF-8 and Unicode!


Attachment: neatpad7.zip (105.87 KB)

Part 8 - Introduction to Unicode

Introduction

The user-interface for the TextView has progressed enough to allow us to switch our attention back to the TextDocument. I am now concentrating on adding full Unicode support to Neatpad. Because Unicode is such a complicated subject I won't attempt to tackle it all at once so instead I will split the various aspects of Unicode across several tutorials.

The first Unicode topic (Part 8 - what you are reading now) will be an introduction to Unicode and the various encoding schemes that are in common use. There won't be any code download as this will be purely a discussion about Unicode to make sure everyone understands the various issues. The next Unicode topic (Part 9) will look at how to incorporate the concepts discussed here into Neatpad - so will focus on loading, storing and processing Unicode data. The rest of the Unicode topics will focus on the issues surrounding the display of Unicode text - and will include complex script support, bi-directional text and the Uniscribe API.

Unicode Myths

Before we start properly it may be worth dispelling the most common myths about Unicode - and hopefully by the end of this article you will have a good idea about what Unicode is all about.

The most common incorrect statement I see about Unicode is this: "In Unicode all characters are two bytes long."

This is totally incorrect. The Unicode standard has always defined more than one encoding form for its characters - with UCS-2 originally being the most common. However since Unicode 2.0 there no longer exists any encoding scheme which can represent all characters using two bytes (read about UTF-16 further down the page).

It doesn't help that even Microsoft gets it wrong in its own documentation sometimes with statements such as "Unicode is a wide-character set". Well, they're half-way right - on Windows at least, Unicode strings are typically encoded as UCS-2/UTF-16, but it is quite misleading to claim that Unicode is a "wide-character-set" because it is so much more than that.

The next most common question you hear Windows programmers asking is "How do I convert a UNICODE string to a UTF-8 string?". This is a question usually asked by somebody who doesn't understand Unicode. A UTF-8 string is a Unicode string so it cannot be converted. Probably the person asking the question meant "how do I convert between this UCS-2 formatted string and UTF-8?". Of course, the answer in this case would be the WideCharToMultiByte() API call.

One last misconception is that the wchar_t "wide character" type is 16bits. This is maybe true on Windows platforms, but the C language makes no such assumption about the width of a wchar_t type - it can be as wide as the C compiler wants it to be in order to represent a single "wide character", and on UNIX and Linux wchar_t is commonly a 32bit quantity.

Code Pages and Character Sets

Everyone is familiar with the ASCII character-set, which encodes 128 unique character values (0-127) using 7bit integers. Most people are also aware of the existence of the ANSI character-set(s), which use a full 8bit byte to encode 256 (0-255) character values. And it is probably fair to say, most people are fully aware that an 8bit byte is not sufficient to encode all of the world's writing systems, except maybe a very few European languages.

These "byte-based" character sets are often referred to as Single-Byte-Character-Sets, or SBCS for short. Most of the 8bit character-sets keep the bottom 128 characters as ASCII, and define their own characters in the top "half" of the byte. There are many, many single-byte character-sets in existence.

These extra character-sets are referred to as codepages (a traditional IBM term), and are each identified by a unique codepage number which is usually defined by the ANSI/ISO organisations. For example, the familiar ANSI codepage used by Windows is 1252. A Windows application could set its codepage number to tell Windows which character-set it wanted to work with.

Of course a single 8bit character was never going to be enough to represent the rest of the world's writing systems. The east-Asian languages especially needed a different approach and this is where Double-Byte-Character-Sets (DBCS) come into play. With these character-sets, a character can be represented by either one or two bytes. There are many other character-sets which share this design and Microsoft refers to them all as Multi-Byte-Character-Sets (MBCS) in its documentation. The one thing all these character sets have in common is their complexity - they are all quite difficult to work with from a programming perspective.

You may be familiar with the many APIs and support libraries for dealing with MBCS strings such as CharNext, CharPrev, _mbsinc, _mbsdec etc. All of these APIs are designed for a program to work with legacy character-sets - and rely on the correct codepage being set up before an application can display text correctly.

Note that all of these concepts are really quite out of date now. SBCS, MBCS and DBCS, and the whole idea of codepages all belong in the past and thankfully we no longer have to worry about them.

What is Unicode?

There seems to be a lot of confusion surrounding Unicode. This is mostly due to the fact that Unicode has evolved quite significantly since its first release in 1991. A great deal of information has been written about Unicode during this time and much of the earlier information is now inaccurate. Almost 15 years later Unicode is now at version 4.1 - and your perception of Unicode has probably been influenced most by when you were first exposed to the subject. Understanding the history of Unicode is almost as important as understanding Unicode itself.

Unicode is the universal character encoding standard for representing the world's writing systems in computer-form. It is also a practical implementation of the ISO-10646 standard. The Unicode consortium (represented by several international companies including Microsoft and IBM) develops the Unicode standard in parallel with ISO-10646. Often you will see terms from each standard used interchangeably but really they refer to the same thing.

The main purpose behind Unicode is to define a single code-page which holds all of the characters commonly in use today. At its heart Unicode is really just a big table which assigns a unique number to each character as defined by ISO-10646. Each of these numbers in the Unicode codepage is referred to as a "code-point". The following are examples of Unicode code-points:

U+0041 "Latin Capital Letter A"
U+03BE "Greek Small Letter Xi"
U+1D176 "Musical Symbol End Tie"

The standard convention is to write "U+" followed by a hexadecimal number which represents the codepoint value. Often you will also see a descriptive tag next to each code-point, which gives the full name to the codepoint as defined in the Unicode standard.

The Unicode standard can represent a little over one million code-points. With version 4.0, 96,382 characters have been assigned to actual code-points, leaving approximately 91% of the encoding space unallocated. With most of the world's writing systems already encoded (including the gargantuan Chinese-Japanese-Korean character-sets), this leaves a lot of expansion for future use.

A single code-point within this encoding space can take a value anywhere between 0x000000 and 0x10FFFF. Unicode codepoints can therefore be represented using 21-bit integer values. It is no accident that these numbers were chosen and if you read into the UTF-16 format more deeply you will understand why Unicode has been limited in this way. It is important to note that both the Unicode consortium and ISO pledge to never extend the encoding-space past this range.

UTF-32 and UCS4

Of course, a 21-bit integer is a bit of an "odd" sized unit and doesn't lend itself well to storage in a computer. As a result of this, Unicode defines several Transformation Formats with which to encode streams of Unicode code-points. The three most common are "UTF-8", "UTF-16" and "UTF-32".

Out of these three, UTF-32 is by far the easiest to work with. Exactly one 32-bit integer is required to store each Unicode character. However, UTF-32 is very wasteful - 11 bits out of the 32 are never used, and in the case of plain English text encoded as UTF-32, this means 75% waste overall. The table below illustrates how a 21bit integer (represented with 'x's) is encoded in a 32bit storage unit:

Unicode                 UTF-32
00000000 - 0010FFFF     00000000-000xxxxx-xxxxxxxx-xxxxxxxx

Some operating systems (like UNIX variants) use UTF-32 internally to process and store strings of text. However UTF-32 is rarely used to transmit and store text-files simply because it is so space-inefficient and because of this, it is not a commonly encountered format.

UCS-2

UCS is a term defined by ISO-10646 and stands for Universal Character Set. When Unicode was first released the primary encoding scheme was intended to be the UCS-2 format. UCS-2 uses a 16bit code-unit to store and represent each character. At the time this was considered an adequate scheme because only 55,000 characters had been assigned to Unicode code-points thus far - this meant that every Unicode character (at the time) could be represented by a single 16bit integer. Unfortunately we are still paying the consequences for this incredibly short-sighted decision.

Even before Unicode was developed there existed many "wide character sets" which required more than one byte to store each character. The most notable were IBM's DBCS (double-byte-character-set), JIS-208, SJIS and EUC to name just a few. To support the various wide-character sets, the wchar_t type was introduced into the C standard in the late 80s (although it wasn't ratified until 1995). The wchar_t type (and wide-character support in general) provided the mechanism to support these wide-character sets.

The companies which backed the UCS-2 format pledged support for Unicode. Microsoft in particular engineered its Windows NT OS line to be natively "Unicode" compatible right from the outset, and used the 16bit UCS-2 wide-character strings to store and process all text.

UTF-16

In 1996 Unicode 2.0 was released, extending the code-space beyond 65,535 characters - or what is known as the Basic Multilingual Plane (BMP). It was obvious that a single 16bit integer was insufficient to encode the entire Unicode code-space, and the UTF-16 format was introduced along with the UTF-16 Surrogate Mechanism. Importantly, UTF-16 is backward compatible with UCS-2 (it encodes the same values in the same way).

In order to represent characters from 0x10000 to 0x10FFFF, two 16bit values are now required - which together are called a Surrogate Pair. This also means that there is no longer a 1:1 mapping between the 16bit units and Unicode characters. The two 16bit values must be carefully formatted to indicate that they are surrogates: the first value (the high surrogate) must lie in the range D800 to DBFF, and the second value (the low surrogate) must lie in the range DC00 to DFFF.

This "surrogate range" between D800 and DFFF was "stolen" from one of the previously named "Private Use Areas" of UCS-2. When combined together a surrogate-pair provides 1024x1024 combinations, which results in 0x100000 (1,048,576) additional codepoints outside of the BMP. The table below illustrates how the Unicode code-space is represented using UTF-16.

Unicode                 UTF-16
00000000 - 0000FFFF     xxxxxxxx-xxxxxxxx
00010000 - 0010FFFF     110110yy-yyxxxxxx 110111xx-xxxxxxxx
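
The bit layout in the table can be turned into code fairly directly. The functions below are a minimal sketch written for this article (EncodeUtf16 and DecodeSurrogatePair are not Windows APIs): 0x10000 is subtracted from the code-point, the top ten bits of the remainder go into the high surrogate and the bottom ten bits into the low surrogate.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append the UTF-16 encoding of one code-point to 'out'.
// Returns false for values that cannot be encoded (surrogates, > 0x10FFFF).
bool EncodeUtf16(std::uint32_t cp, std::vector<std::uint16_t> &out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (cp < 0x10000)
    {
        out.push_back(static_cast<std::uint16_t>(cp));                 // BMP: stored as-is
        return true;
    }

    cp -= 0x10000;                                                     // 20 bits remain
    out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));    // high surrogate
    out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));  // low surrogate
    return true;
}

// Combine a high/low surrogate pair back into a code-point.
std::uint32_t DecodeSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low  >= 0xDC00 && low  <= 0xDFFF);

    return 0x10000 + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
                   + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

As a worked example, U+1D176 ("Musical Symbol End Tie") becomes the pair D834 DD76, whereas U+0041 is stored as the single unit 0041.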

So in essence, UTF-16 is a variable-width encoding scheme much like the multi-byte UTF-8. You may be wondering (as I did) exactly what the advantage is now that UTF-16 is no longer a fixed-width format. It would be interesting to see if UTF-16 would be in use today had UTF-8 been available right from the start.

Even without the variable-width problem, using UTF-16 from a C/C++ perspective is pretty tiresome because of the strange wchar_t type and the L"" syntax for wide-character string literals. Understand that UTF-16 is the dominant encoding format at the moment. Microsoft Windows and Macintosh OSX both use it for their operating systems, and the Java and C# languages also use UTF-16 for all string operations. UTF-16 is unlikely to go away any time soon.

Actually, the variable-width nature of UTF-16 and the slight complexity it now brings pales into insignificance when compared with the nightmare of displaying Unicode. It really doesn't matter that a string is in multibyte format - even with UTF-32 one codepoint does not necessarily map to one visible/selectable "glyph", as we will find out over the next parts of this series.

UTF-8

One very popular encoding format is UTF-8, officially presented in 1993. A common misconception is that UTF-8 is a "lesser" form of UTF-16. Nothing could be further from the truth - it encodes the exact same Unicode values as UTF-16 and UTF-32, but instead uses variable-length sequences of up to four 8-bit bytes. This means that UTF-8 is a true multi-byte format. Much of the text on the Internet (such as in web pages and XML) is transmitted using UTF-8, and many Linux and UNIX variants use UTF-8 internally.

The way UTF-8 works is quite clever. The MSB (most-significant-bit) in each byte is used to indicate whether a character-unit is a single 7bit ASCII value (top bit set to "0"), or is part of a multibyte sequence (top bit set to "1"). This means that UTF-8 is 100% backward compatible with plain 7bit ASCII text - of course, it was designed for this very purpose. The design allows older non-Unicode software to handle and process Unicode data with little or no modification.

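There are actually three basic constructs in UTF-8 text:

  1. Plain ASCII characters, of the form 0xxxxxxx.
  2. Lead-bytes, of the form 11xxxxxx, which start a multibyte sequence.
  3. Trail-bytes, of the form 10xxxxxx, which continue a multibyte sequence.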

So, a Unicode value in the range 0-127 is represented as-is. Values outside of this range (0x80 - 0x10FFFF) are represented using a multibyte sequence, comprising exactly one lead-byte and one-or-more trail-bytes. Each Unicode character with a value above 0x7F has its bits distributed over the "spare" bits in the multibyte sequence.

The following table illustrates this concept:

Unicode                  UTF-8
00000000 - 0000007F      0xxxxxxx
00000080 - 000007FF      110xxxxx 10xxxxxx
00000800 - 0000FFFF      1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF*     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF*     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Notice that the last two rows have been marked with an asterisk: they are illegal forms of UTF-8. Although UTF-8 can theoretically be used to encode a full 31bit integer using lead-bytes 111110xx and 1111110x, these sequences are not permitted because they represent numbers outside of the 0-10FFFF Unicode range. For the same reason, a four-byte sequence that decodes to a value above 10FFFF is also illegal. Remember that this "artificial" limit has been imposed due to the UTF-16 surrogate mechanism.
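
To show how the bits of a code-point are spread over the lead-byte and trail-bytes, here is a minimal encoder sketch covering the four legal rows of the table. EncodeUtf8 is a name chosen for this example and is not part of any Windows or C runtime API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one code-point to 'out'.
// Returns false for surrogates and for values above 0x10FFFF.
bool EncodeUtf8(std::uint32_t cp, std::string &out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return true;
}
```

Using the code-points listed earlier: U+0041 encodes as the single byte 41, U+03BE as CE BE, and U+1D176 as F0 9D 85 B6.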

Unicode text files

If you have ever used the regular Notepad on Windows NT you may be aware that text files can be saved in several formats - ASCII, Unicode (which is really UTF-16), Unicode-Big-Endian (which is big-endian-UTF-16) and lastly UTF-8.

The problem with text-files on Windows (and probably most other operating-systems) is that there is no way to tell what type of text is contained within a file - because plain-text files (by their very nature) provide no such facility. The Unicode standard therefore defines a method for tagging a text-file with a "Byte Order Mark" in order to identify the encoding scheme used to save the file. The optional "BOM" sequences are listed below.

Byte Signature    Unicode Format
none              Plain ASCII/ANSI
EF BB BF          UTF-8
FF FE             UTF-16, little-endian
FE FF             UTF-16, big-endian
FF FE 00 00       UTF-32, little-endian
00 00 FE FF       UTF-32, big-endian

The table above was taken from the Unicode Standard 4.0. The BOM values are chosen because it would be extremely unlikely to ever encounter those character-sequences at the start of a plain-text document. Of course it is still possible to encounter such files - it's just very rare.

With Neatpad, in the absence of any signature the file is treated as plain ANSI text. This is in contrast to how Notepad works - it uses statistical analysis of the file in order to make a 'best guess' as to the underlying format, and sometimes gets it wrong.
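
A signature check along these lines might look like the sketch below. The TextFormat enumeration and the DetectBom function are names invented for this example rather than the code Neatpad will end up with. One detail worth noticing is the order of the tests: the UTF-32 little-endian signature begins with the same two bytes as the UTF-16 little-endian one, so the longer signature has to be tried first.

```cpp
#include <cassert>
#include <cstddef>

enum TextFormat
{
    FORMAT_ANSI,      // no signature: treated as plain ANSI text
    FORMAT_UTF8,
    FORMAT_UTF16LE,
    FORMAT_UTF16BE,
    FORMAT_UTF32LE,
    FORMAT_UTF32BE
};

// Return true if 'data' starts with the 'sigLen' bytes in 'sig'.
static bool StartsWith(const unsigned char *data, std::size_t length,
                       const unsigned char *sig, std::size_t sigLen)
{
    if (length < sigLen)
        return false;

    for (std::size_t i = 0; i < sigLen; ++i)
        if (data[i] != sig[i])
            return false;

    return true;
}

// Identify the signature at the start of a buffer. 'bomLength' receives the
// number of bytes to skip before the text itself begins.
TextFormat DetectBom(const unsigned char *data, std::size_t length, std::size_t &bomLength)
{
    static const struct { TextFormat format; unsigned char sig[4]; std::size_t len; } table[] =
    {
        // The UTF-32 entries come first because FF FE 00 00 also starts with FF FE.
        { FORMAT_UTF32LE, { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { FORMAT_UTF32BE, { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { FORMAT_UTF8,    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { FORMAT_UTF16LE, { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
        { FORMAT_UTF16BE, { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
    };

    assert(data != nullptr || length == 0);

    for (std::size_t i = 0; i < sizeof(table) / sizeof(table[0]); ++i)
    {
        if (StartsWith(data, length, table[i].sig, table[i].len))
        {
            bomLength = table[i].len;
            return table[i].format;
        }
    }

    bomLength = 0;
    return FORMAT_ANSI;
}
```

Note that this ordering means a little-endian UTF-16 file whose first character after the signature is U+0000 would be reported as UTF-32; that is an inherent ambiguity of the signatures rather than something a simple check like this can resolve.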

Relevant reading

The first place you should start is www.unicode.org/faq. This is the official site for Unicode and contains the complete Unicode 4.1 standard. The standard is also available in book (hardback) form.

However if you want a really good book on Unicode then I can recommend "Unicode Demystified" by Richard Gillam. This book gives a really good practical coverage of the many Unicode issues and I found it indispensable whilst researching this project.

The following links may also prove to be useful:

Unicode C++ projects in Windows

Seeing as we want to support Unicode in Neatpad, it makes sense for us to use the native Unicode support provided by the Windows operating systems. In practice this means using the "Wide-character" Unicode APIs - which are basically UTF-16/UCS-2. There is a certain technique to writing Unicode-enabled applications under Windows which every programmer should be aware of.

  1. The first step in creating any Unicode Windows project is to enable support for the wide-character APIs. This is usually achieved by defining the UNICODE and _UNICODE macros for every source-file in your project (and removing macros such as _MBCS and _DBCS). The reason two macros are required is simple: UNICODE is used for the Windows/Platform SDK libraries, whereas _UNICODE is used for the standard C/C++ runtime libraries.
  2. The second step is to #include <tchar.h> - this file contains many "support macros" that are very useful for Unicode projects.
  3. The third step is to define any character type as TCHAR - this is another macro and results in WCHAR string types for UNICODE projects, and char string types for regular "non Unicode" projects.
  4. The fourth step is to declare all string literals using the _T and _TEXT macros, which are defined in <tchar.h>. These macros control how string-literals are defined. For a non-Unicode project these macros do nothing, however for a UNICODE project all string-literals have the L"" string prefix attached.
  5. The fifth and final step is to replace all calls to C-runtime string functions (such as strcpy) with their _tcs equivalent (e.g. _tcscpy). Although these equivalents can all be found in the <tchar.h> runtime header, there is a simple trick to obtaining the '_t' name from the original - just replace the 'str' part with '_tcs'.

You've hopefully got the idea that Unicode programming in Windows relies heavily on the C/C++ preprocessor for support. The example below illustrates all of these concepts together.

#include <windows.h>
#include <tchar.h>

TCHAR szFileName[MAX_PATH];

// calling one of the standard-C calls
_tcscpy(szFileName, _T("file.txt"));

// calling one of the Platform-SDK APIs
CreateFile(szFileName, GENERIC_READ, ...);

Because TCHAR, _tcscpy, _T and CreateFile are really MACROs, with UNICODE defined our sample program becomes:

WCHAR szFileName[MAX_PATH];

wcscpy(szFileName, L"file.txt");
CreateFileW(szFileName, ...);

Note that the WCHAR character-type is actually a typedef for wchar_t. The Visual-C compiler treats this wide-character type as a 16bit quantity. Don't assume that this is true across all platforms - for example on UNIX systems wchar_t is usually a 32bit quantity because the native Unicode format is UTF-32 on these systems.

Without the UNICODE setting defined our sample program becomes an ordinary "C" program:


char szFileName[MAX_PATH];

strcpy(szFileName, "file.txt");
CreateFileA(szFileName, ...);

Rather than putting UNICODE and _UNICODE at the top of every source-file we will make things easier on ourselves. The last thing we need to do is configure the Neatpad and TextView projects to build as Unicode applications on a project-wide basis. Rather than modify the existing projects, we will add two new project configurations (one for Debug and one for Release). This will allow us to build an ASCII-only Neatpad and a Unicode Neatpad from the same sourcecode.


Select the Build -> Configurations menu item in Visual Studio.



The new configurations are created by using the existing non-unicode projects as templates. We need to perform this task for both the Neatpad and TextView projects, and for each Debug and Release build as well. Once done we will have four project configurations for each project - Debug, Release, Unicode Debug and Unicode Release.


Coming up in Part 9


Hopefully this has been a useful introduction to Unicode. I felt it was necessary to cover the basics first before diving straight in as Unicode is such a complicated subject. The next part of this series will look at taking the ideas presented here and integrating them directly into Neatpad.


 


Part 9 - Unicode Text Processing

Introduction

The last tutorial presented an overview of the various encoding formats that are used to store Unicode text. It is now time to take this theory and apply it to Neatpad. Therefore the subject of this article will be Unicode text processing.

The image above shows Neatpad's new Encoding menu option - with a UTF-8 file displayed in all its glory. At the top of this tutorial is a collection of Unicode files which you can use to test Neatpad's Unicode capability.

Loading text files

Previous incarnations of Neatpad supported a single text encoding - plain ASCII text. A Unicode text editor must naturally support the various Unicode file-formats so our first step will be to modify the TextDocument's init() function to detect what type of file we are opening.

Of course it isn't possible to detect what type of encoding a text-file uses until we actually open the file and read the first few bytes. We will use what Unicode terms the "Byte Order Mark" - a specific sequence of bytes that can only appear at the start of a Unicode text file, and if present will determine the exact encoding method used to save the file.

Byte Signature   Unicode Format           Neatpad Format
none             Plain ASCII/ANSI         NCP_ASCII
EF BB BF         UTF-8                    NCP_UTF8
FF FE            UTF-16, little-endian    NCP_UTF16
FE FF            UTF-16, big-endian       NCP_UTF16BE
FF FE 00 00      UTF-32, little-endian    NCP_UTF32
00 00 FE FF      UTF-32, big-endian       NCP_UTF32BE

Therefore a new function has been added to the TextDocument - TextDocument::detect_file_format - whose purpose is to detect the format of the text-file as it is being loaded during TextDocument::init. In the absence of any file-signature we will assume that the file-contents are plain ASCII/ANSI text.

int TextDocument::detect_file_format(int *headersize);

This function's sole task is to analyse the first x bytes of a file and compare them against the various Byte-Order-Mark values that are defined in the table above. It is literally a task of performing a series of memcmp's until we match a format. The detect_file_format function returns an appropriate NCP_xxx value (Neatpad Codepage) to indicate what type of file is being processed.

The file's text-format is stored internally by the TextDocument (in member-variable 'fileformat'). The length of the Byte-Order-Mark header is also saved away in the 'headersize' member-variable - so that we can always identify the start of the real content no matter what type of file we are loading.
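To make the series of memcmp's concrete, here is a minimal sketch of how such a detection routine might look. It is not the actual Neatpad source: the free function detect_format, the BomEntry table and the numeric NCP_xxx values are assumptions made for illustration. The one detail worth noticing is the ordering of the table, because the UTF-32 little-endian signature starts with the same two bytes as the UTF-16 little-endian one.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Neatpad codepage identifiers, named as in the table above.
// The numeric values are arbitrary and chosen for this sketch only.
enum { NCP_ASCII, NCP_UTF8, NCP_UTF16, NCP_UTF16BE, NCP_UTF32, NCP_UTF32BE };

// Hypothetical table entry: signature bytes, signature length, format.
struct BomEntry {
    unsigned char bytes[4];
    int           len;
    int           format;
};

// The 4-byte UTF-32 signatures come first: FF FE 00 00 begins with the
// UTF-16 little-endian signature FF FE, so the longer match must be
// tried before the shorter one.
static const BomEntry bom_table[] = {
    { { 0xFF, 0xFE, 0x00, 0x00 }, 4, NCP_UTF32   },
    { { 0x00, 0x00, 0xFE, 0xFF }, 4, NCP_UTF32BE },
    { { 0xEF, 0xBB, 0xBF, 0x00 }, 3, NCP_UTF8    },
    { { 0xFF, 0xFE, 0x00, 0x00 }, 2, NCP_UTF16   },
    { { 0xFE, 0xFF, 0x00, 0x00 }, 2, NCP_UTF16BE },
};

// Hypothetical free-function version of detect_file_format: returns an
// NCP_xxx value and stores the length of the signature in *headersize.
int detect_format(const unsigned char *data, size_t length, int *headersize)
{
    assert(data != NULL || length == 0);

    for (const BomEntry &e : bom_table) {
        if (length >= (size_t)e.len && memcmp(data, e.bytes, (size_t)e.len) == 0) {
            *headersize = e.len;
            return e.format;
        }
    }

    *headersize = 0;      // no signature: treat as plain ASCII/ANSI
    return NCP_ASCII;
}
```

One consequence of this ordering is that a UTF-16 little-endian file whose first character after the signature is U+0000 would be reported as UTF-32. That is very unlikely in a real text file, but it is a limitation of signature-only detection.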

Internal text representations

Most text-editors (such as Notepad) will load an entire text-file into memory. No matter what the underlying file format (i.e. ASCII, UTF-8 or UTF-16), the contents will be converted to an internal format to make it easier to work with. For Windows programs, this is usually (but not always) the native UTF-16 format of Windows NT. This makes sense because all of the text-based Windows APIs are designed to handle UTF-16/UCS2.

This is a great way to structure a program because you maintain one set of source-code for the main editor (which interfaces directly with the OS's text routines), and then write a set of simple file I/O conversion routines which load and save each of your supported formats. The editor is kept very simple because the text it processes is always in one format. When it comes back to saving a file in its original format then the entire text is converted back again.

Of course this method could require large amounts of memory because the entire file must be loaded at one time. In order for us to support our goal of a multi-gigabyte text editor we must leave the file in its "raw" state and only map specific parts of a file into memory as required - much like the HexEdit program on this site.

However this leaves us with a problem - how do we handle many different forms of text within the same program but still keep a single code-base which is not over-complicated by the various encodings it must process? Here are the two basic strategies available:

  1. Write separate versions of the TextView/TextDocument for each specific file format. We would then create a specific instance of TextView (i.e. TextViewUtf8 / TextViewUtf16) depending on what type of file we encountered. We could potentially use macros / C++ templates to make our lives easier but I believe this method will be a code-maintenance nightmare. Avoid at all costs!
  2. Write a generic TextView which always handles text in the "native" format (i.e. UTF-16 for Windows). The TextView would have no knowledge of the underlying file-format, and it would be up to the TextDocument to convert the underlying file-format into UTF-16 as the TextView requested.

I think that method#2 will provide the greatest flexibility and with careful design should work well for Neatpad.

Generic text processing

The idea behind a "generic" design is that the TextView always gets to see and process UTF-16 text (i.e. standard wide-character Unicode strings). It is completely unaware that the underlying file the TextDocument is reading is anything other than UTF-16 text. This means that whenever the TextView asks for text to display, it is up to the TextDocument to translate (if necessary) the underlying file contents into UTF-16 (i.e. on the fly in realtime).

The TextDocument on the other hand understands "all" types of file-format. It knows how to read the various encodings that we will support - so this would be ASCII, UTF-8 and UTF-16.

I feel that this type of design will suit Neatpad very well. Because the user-interface (the TextView) has the potential to be so complicated, it is very important to try and isolate all of the text-conversion problems into one place so that we only have to worry about it once. It also has the advantage that we could add further text-formats to the TextDocument (i.e. UTF-32) and the TextView would never have to be modified. The TextView should only care about UTF-16.

Two coordinate systems

Deciding to move to this "generic" text model has introduced a major problem, because we now have two coordinate systems to consider - one for the TextView, and one for the TextDocument. At this point we could just say "we'll support UTF-16 for the moment and add UTF-8 later on" - but this would be a mistake. The design of a "single format" editor is very different to an editor that must handle arbitrary file-formats and we must move to a more generic design or this will cause us even bigger problems later on.

So, we have decided that the TextView will work exclusively in UTF-16 units. This is a good thing. It basically means that the entire "user-interface" to the TextView control is in the Native Windows Unicode format. Don't underestimate how important this is. We haven't progressed this far yet, but try to imagine a "user" of the TextView control (i.e. a programmer) using it in a C++ project:

This programmer's project will naturally be Unicode and all text operations will therefore also be UTF-16. TextView operations such as cursor positioning, selection management, getting and setting text at specific offsets, searching for text etc must be UTF-16 also. The user/programmer doesn't care what the underlying format of the textfile is, all they see of the world is UTF-16, and all operations must match this view of the world. Therefore our cursor offsets and selection offsets - our entire coordinate "front end" to the control - must be UTF-16 based. This is where we hit our problem though:

The TextDocument has a different view of things. It must work with arbitrary file formats and it won't know - until loaded - what format a text-file will be in. It could be dealing in single-byte formats (ASCII), multi-byte formats (UTF-8) or wide-character formats (UTF-16). The TextDocument must use a coordinate-system that is common to all these formats. Of course this will be a byte-oriented system - so all line-offsets and text-accesses must be byte-based.

To try and illustrate this TextView/TextDocument divide, let's look at a quick example of some Unicode text:

U+0041  LATIN CAPITAL LETTER A
U+06AF  ARABIC LETTER GAF
U+16D4  RUNIC LETTER DOTTED-P
U+10416 DESERET CAPITAL LETTER JEE

The text above is just a random collection of four Unicode characters - with the Unicode code-point values listed to the side. To see how these characters relate to Neatpad, we will imagine that the text above has been encoded as UTF-8 and loaded into Neatpad. The TextDocument would therefore be working in UTF-8 multi-byte units:

The TextView of course sees the file as UTF-16. Hopefully the diagram illustrates just how separated the TextView has become from the underlying file. Apart from the first character ('A'), the raw data that the TextView gets to see is completely different to how it really appears on disk. Remember, all this should be happening in realtime, not during the file-loading process.
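For readers who want to check the numbers themselves, the two views of this text can be worked out by hand. In UTF-8 the four characters occupy 41, DA AF, E1 9B 94 and F0 90 90 96 (ten bytes in total), whereas in UTF-16 they occupy 0041, 06AF, 16D4 and the surrogate pair D801 DC16 (five code units). The sketch below shows the arithmetic; encode_utf8 and encode_utf16 are helper names invented for this illustration and are not part of Neatpad.

```cpp
#include <assert.h>
#include <stdint.h>
#include <vector>

// Encode one code point as UTF-8. The caller is assumed to pass a valid
// code point (not a surrogate, not above U+10FFFF).
std::vector<uint8_t> encode_utf8(uint32_t cp)
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    std::vector<uint8_t> out;

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out.push_back((uint8_t)cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back((uint8_t)(0xC0 | (cp >> 6)));
        out.push_back((uint8_t)(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back((uint8_t)(0xE0 | (cp >> 12)));
        out.push_back((uint8_t)(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back((uint8_t)(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out.push_back((uint8_t)(0xF0 | (cp >> 18)));
        out.push_back((uint8_t)(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back((uint8_t)(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back((uint8_t)(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Encode one code point as UTF-16 code units (one unit, or a surrogate pair).
std::vector<uint16_t> encode_utf16(uint32_t cp)
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    std::vector<uint16_t> out;

    if (cp < 0x10000) {
        out.push_back((uint16_t)cp);
    } else {
        cp -= 0x10000;                                     // 20 bits remain
        out.push_back((uint16_t)(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back((uint16_t)(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Notice that the fourth character starts at byte-offset 6 in the UTF-8 file but at offset 3 in UTF-16 code units, which is exactly the mismatch between the two coordinate systems.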

But we still haven't solved our problem. The TextView speaks UTF-16 and the TextDocument speaks byte-offsets. We need to devise some kind of mechanism to perform mappings between UTF-16 offsets (i.e. code-unit offsets) and the underlying file-content (whatever that may be). This task will fall to the TextDocument, and it will be the line-offset buffer that will be doing all the hard work.

Reading Unicode data

The decision to make the TextView UTF-16 only means that our TextDocument::getline routine must change. Remember that this is the main "gateway" between the TextView and the TextDocument. Let's look at what we had before:

ULONG TextDocument::getline(ULONG lineno, ULONG offset, char *buf, size_t len, ULONG *fileoff)

The TextDocument::getline routine basically returns a block of text from the specified line - and always returns this text as plain ANSI. Two things are going to change here. Obviously the text-type must change from char* to wchar_t* if we want to support Unicode. This change has been achieved by converting all char* types to TCHAR* on a project-wide basis and creating a separate Unicode Build.

The second change is to move away from a line-oriented text-retrieval model. What we have now is a getline replacement - called TextDocument::gettext. The purpose of this new routine is to return UTF-16 text from the specified byte offset within the current file:

int TextDocument::gettext(ULONG offset, ULONG maxbytes, TCHAR *buf, int *buflen)

No matter what the underlying text-format, this routine will always return UTF-16 data (for a Unicode build). The text is stored in the buf parameter, and the number of "characters" stored in buf is returned in the *buflen parameter.

TCHAR buf[200];
int   buflen = 200;

// read a block of text as UTF-16 from the specified position
len_bytes = textDoc->gettext(byte_offset, max_bytes, buf, &buflen);

// adjust offsets ready for next read
byte_offset += len_bytes;
max_bytes   -= len_bytes;

Most importantly though, the number of bytes that were processed from the underlying file is returned from the function directly - i.e. the return value represents the number of ASCII/UTF-8/UTF-16 bytes that were processed during the conversion to UTF-16. This is required so that we can keep track of the "byte position" in the underlying file - to allow us to continue reading blocks of UTF-16 in an iterative fashion.

Even though the TextView will be reading UTF-16 data (and using UTF-16 based offsets for cursor positioning etc), we must access the underlying file using byte-offsets. This is to make the text-retrieval a direct form of access to the underlying file, converting whatever data happens to be at the byte-offsets into UTF-16. If we used UTF-16 coordinates to access the file content, we would have to convert this character-offset to a byte-offset by performing lengthy processing.

The new TextDocument::gettext function is a little more complicated than what we had before:

int TextDocument::gettext(ULONG offset, ULONG lenbytes, WCHAR *buf, int *buflen)
{
    // skip the Byte-Order-Mark so that offset 0 is the first real character
    BYTE *rawdata = buffer + headersize + offset;

    switch(fileformat)
    {
    case NCP_ASCII:
        return ascii_to_utf16(rawdata, lenbytes, buf, buflen);

    case NCP_UTF8:
        return utf8_to_utf16(rawdata, lenbytes, buf, buflen);

    case NCP_UTF16:
        return copy_utf16(rawdata, lenbytes/sizeof(WCHAR), buf, buflen);

    case NCP_UTF16BE:
        return swap_utf16(rawdata, lenbytes/sizeof(WCHAR), buf, buflen);

    default:
        // unsupported format: nothing was converted
        *buflen = 0;
        return 0;
    }
}

We must use the TextDocument::fileformat member-variable to decide how to convert the underlying text into UTF-16. Notice that there is one conversion routine for each type of text that we will support.

One thing I should mention which isn't detailed here is the actual conversion process to UTF-16. We must be very careful that we never "split apart" UTF-16 surrogate pairs accidentally when converting to UTF-16. This could potentially happen when converting from UTF-8 and we run out of buffer space to store both of the surrogate characters. The conversion routines all make sure that surrogate pairs are kept together.
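As a rough illustration of what the two simplest conversion routines might look like, here is a sketch of ascii_to_utf16 and swap_utf16. The function names come from the listing above, but the signatures shown here are my own assumptions (byte counts in, bytes consumed returned, number of UTF-16 units stored written back through destlen), and char16_t stands in for WCHAR so that the code is portable. The byte-swapping version also shows one way of keeping surrogate pairs together.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Widen single bytes to UTF-16 units. This is only strictly correct for
// 7-bit ASCII; a real ANSI codepage would need a proper conversion table.
size_t ascii_to_utf16(const uint8_t *src, size_t srcbytes,
                      char16_t *dest, size_t *destlen)
{
    size_t count = srcbytes < *destlen ? srcbytes : *destlen;

    for (size_t i = 0; i < count; i++)
        dest[i] = src[i];

    *destlen = count;      // UTF-16 units stored
    return count;          // bytes consumed
}

// Convert big-endian UTF-16 bytes to native UTF-16 units.
// The buffer must hold at least two units so that a surrogate pair can
// always be stored whole.
size_t swap_utf16(const uint8_t *src, size_t srcbytes,
                  char16_t *dest, size_t *destlen)
{
    assert(*destlen >= 2);

    size_t avail = srcbytes / 2;
    size_t count = avail < *destlen ? avail : *destlen;

    // assemble each unit from its two bytes, high byte first
    for (size_t i = 0; i < count; i++)
        dest[i] = (char16_t)((src[2 * i] << 8) | src[2 * i + 1]);

    // If the buffer filled up on a high surrogate and more input follows,
    // drop it again so that the pair is delivered together on the next call.
    if (count > 0 && count < avail &&
        dest[count - 1] >= 0xD800 && dest[count - 1] <= 0xDBFF)
        count--;

    *destlen = count;
    return count * 2;      // bytes consumed
}
```

Because both routines report how many bytes they consumed, the caller can simply advance its byte-offset by the return value and call again.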


Problems with MultiByteToWideChar


You may have noticed that I have written my own Unicode-conversion routines in the TextDocument::gettext function. I really wanted to use the MultiByteToWideChar API to perform all conversions to UTF-16. Unfortunately nothing is that simple. Although MultiByteToWideChar is good at converting valid UTF-8 data, it is not so good when it comes to invalid text-sequences (such as malformed or overlong sequences).


When it comes to processing this type of data the preferred behaviour for a text-editor is to indicate invalid sequences of UTF-8/UTF-16 by using a special Unicode character - "U+FFFD Unicode Replacement Character". The problem with MultiByteToWideChar is, it doesn't perform this conversion for invalid sequences - it just returns with a failure and you don't know how many characters were invalid. This makes it impossible to restart the conversion process, because you don't know where to restart.


By writing my own routines I was able to process both valid and invalid data in a manner that is more suitable for text-stream processing - i.e. more suitable for a text editor.
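The following is a minimal sketch of a UTF-8 to UTF-16 converter with this behaviour. It is not the routine from the Neatpad download: the signature, the helper decode_utf8 and the policy of consuming exactly one byte for every invalid sequence are assumptions made for illustration, and char16_t stands in for WCHAR. It also assumes that the block of bytes it is given does not end part-way through a multi-byte sequence.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Decode one UTF-8 sequence starting at s (len > 0). Returns the bytes
// consumed. Malformed, truncated, overlong, surrogate or out-of-range
// input produces U+FFFD and consumes a single byte.
static size_t decode_utf8(const uint8_t *s, size_t len, uint32_t *cp)
{
    // smallest code point that legitimately needs 2, 3 and 4 bytes
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint8_t  b = s[0];
    size_t   need;
    uint32_t c;

    if (b < 0x80)                { *cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { need = 2; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { need = 3; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { need = 4; c = b & 0x07; }
    else                         { *cp = 0xFFFD; return 1; }  // stray or invalid byte

    if (need > len) { *cp = 0xFFFD; return 1; }               // truncated sequence

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_value[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *cp = 0xFFFD;
        return 1;
    }

    *cp = c;
    return need;
}

// Convert UTF-8 to UTF-16. Returns the bytes consumed from src and stores
// the number of UTF-16 units written in *destlen.
size_t utf8_to_utf16(const uint8_t *src, size_t srclen,
                     char16_t *dest, size_t *destlen)
{
    size_t in = 0, out = 0;

    while (in < srclen && out < *destlen) {
        uint32_t cp;
        size_t used = decode_utf8(src + in, srclen - in, &cp);

        if (cp >= 0x10000) {
            // no room for both surrogates: stop before this character
            if (*destlen - out < 2)
                break;
            cp -= 0x10000;
            dest[out++] = (char16_t)(0xD800 + (cp >> 10));
            dest[out++] = (char16_t)(0xDC00 + (cp & 0x3FF));
        } else {
            dest[out++] = (char16_t)cp;
        }
        in += used;
    }

    *destlen = out;
    return in;
}
```

Because every invalid byte is replaced and skipped individually, the conversion never stalls and the caller always knows exactly where to resume reading.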


Line Buffer Management


Changing to Unicode and the "double coordinate system" means that the line-buffer scheme we developed earlier in the series needs revisiting. I'm not going to go into too much detail here because I know full well that I will be changing it yet again when we come to adding "gigabyte file support" later in the series. But this needs some discussion right now so here goes:


The line-buffer in Neatpad serves two purposes. Firstly, it provides a method to quickly locate a line of text's physical location within a file. This provides a kind of "random access" to the file content. The second purpose of the line buffer is to perform the reverse operation - i.e. given a physical "cursor offset", work out which line contains this offset.
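Both lookups can be sketched with nothing more than a sorted array of line-start offsets. The code below is an illustration only: the function names and the use of std::vector are assumptions, not Neatpad's actual line-buffer code. The forward lookup is a plain array index, and the reverse lookup is a binary search.

```cpp
#include <algorithm>
#include <assert.h>
#include <stddef.h>
#include <vector>

// linebuffer[i] holds the offset at which line i starts. The entries are
// in ascending order and the first line always starts at offset 0.

// Purpose 1: line number -> offset of the line's first character.
size_t offset_from_lineno(const std::vector<size_t> &linebuffer, size_t lineno)
{
    assert(lineno < linebuffer.size());
    return linebuffer[lineno];
}

// Purpose 2: offset -> number of the line which contains that offset.
size_t lineno_from_offset(const std::vector<size_t> &linebuffer, size_t offset)
{
    assert(!linebuffer.empty() && linebuffer[0] == 0);

    // find the first line that starts AFTER the offset, then step back one
    std::vector<size_t>::const_iterator it =
        std::upper_bound(linebuffer.begin(), linebuffer.end(), offset);

    return (size_t)(it - linebuffer.begin()) - 1;
}
```

The same two functions work whether the array holds byte-offsets or UTF-16 offsets, as long as the offset passed in uses the same units as the array.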


Now that the coordinate-system of the TextView is UTF-16, we need to rethink the design of the line-buffer. The problem is, we still need to know where lines of text are physically located within a file so we can't just change the line-buffer to UTF-16 coordinates. Of course a "single-format" editor such as Notepad would go down this route but because we must support multiple file-formats we need to be dealing with real physical locations.


What I have done is add a second line-buffer to the TextDocument - adjacent to the original "byte-based" line-buffer. So the original line-buffer still holds real, physical byte-offsets of each line's starting position within the file. The new line-buffer records each line's starting position, however this time it stores the information as UTF-16 offsets (character positions) rather than byte-offsets. Even if the underlying file is UTF-8, this second line-buffer stores each line's offset as if it were encoded as UTF-16.


The diagram below should hopefully illustrate what I am trying to explain.



Moving to Unicode has introduced yet another problem: Exactly how do we initialize the line-buffer(s) with these "line start" offsets with all these extra formats? Previously the method used to search for CR/LF combinations was a simple byte search. Unfortunately this method will no longer be sufficient for our multi-format text editor:


We can't do a byte-search for "\r" and "\n" characters and expect it to work any more. There is no problem with ASCII and UTF-8 (they are still byte-based and CR/LF are no different), but UTF-16 presents a challenge. The CR-LF ASCII sequence (0x0D followed by 0x0A), when read as a single big-endian UTF-16 code unit, is actually "U+0D0A MALAYALAM LETTER UU" in Unicode. We must search specifically for U+000D and U+000A when processing a UTF-16 text file.
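A small experiment makes the problem obvious. The two functions below are hypothetical helpers written for this illustration: one performs the old byte search, the other compares whole big-endian UTF-16 code units. Feeding them the bytes 0D 0A 00 41 (the Malayalam letter followed by 'A', with no line break at all) gives one false hit for the byte search and none for the code-unit search.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Naive search: counts every 0D byte that is immediately followed by 0A.
size_t count_crlf_bytes(const uint8_t *data, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i + 1 < len; i++)
        if (data[i] == 0x0D && data[i + 1] == 0x0A)
            count++;

    return count;
}

// Code-unit search for big-endian UTF-16: counts U+000D followed by U+000A.
size_t count_crlf_utf16be(const uint8_t *data, size_t len)
{
    assert(len % 2 == 0);       // whole code units only
    size_t count = 0;

    for (size_t i = 0; i + 3 < len; i += 2) {
        unsigned first  = (unsigned)(data[i] << 8)     | data[i + 1];
        unsigned second = (unsigned)(data[i + 2] << 8) | data[i + 3];

        if (first == 0x000D && second == 0x000A)
            count++;
    }
    return count;
}
```

The reverse case fails too: a genuine UTF-16 CR/LF is stored as 00 0D 00 0A, where the 0D and 0A bytes are never adjacent, so the byte search misses it completely.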


Unicode also defines its own line-breaking and paragraph-breaking codepoints: U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. However the convention in text-files is still to use CR/LF sequences so we must really support all these conventions.


We have two options for parsing lines of text and building the line-buffers. The first is to write separate routines - one for each format we will support. Although this might be the most efficient approach in terms of processing-speed, it is definitely not the most efficient in terms of code-maintenance. Perhaps when Neatpad is complete I will look at this approach, but for now I prefer the following method:


Quite simply, the better method for the time being is to use a generic line-parsing routine. The TextDocument::init_linebuffer remains intact, and still processes the file on a character-by-character basis, searching for CR/LF sequences. The difference is now, the file is converted into a stream of UTF-32 characters which enables us to handle all forms of text.
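To show the general shape of such a routine, here is a sketch which fills both line-buffers in a single pass. It is an illustration under several assumptions rather than the real TextDocument::init_linebuffer: the Decoder function-pointer type, the two tiny decoders (which expect well-formed input) and the LineBuffers structure are all invented for this example, and a lone CR is treated as a line break.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Assumed decoder contract: read one character at s, store it as UTF-32
// in *cp and return the number of bytes it occupied in the file.
typedef size_t (*Decoder)(const uint8_t *s, size_t len, uint32_t *cp);

// Minimal UTF-8 decoder for the sketch; assumes well-formed input.
size_t decode_utf8_valid(const uint8_t *s, size_t len, uint32_t *cp)
{
    size_t n = s[0] < 0x80 ? 1 : s[0] < 0xE0 ? 2 : s[0] < 0xF0 ? 3 : 4;
    assert(n <= len);

    uint32_t c = (n == 1) ? s[0] : (uint32_t)(s[0] & (0xFF >> (n + 1)));
    for (size_t i = 1; i < n; i++)
        c = (c << 6) | (s[i] & 0x3F);

    *cp = c;
    return n;
}

// Minimal UTF-16 little-endian decoder; joins surrogate pairs.
size_t decode_utf16le(const uint8_t *s, size_t len, uint32_t *cp)
{
    assert(len >= 2);
    uint32_t hi = s[0] | ((uint32_t)s[1] << 8);

    if (hi >= 0xD800 && hi <= 0xDBFF && len >= 4) {
        uint32_t lo = s[2] | ((uint32_t)s[3] << 8);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            return 4;
        }
    }
    *cp = hi;
    return 2;
}

struct LineBuffers {
    std::vector<size_t> byte_offsets;    // line starts as physical byte offsets
    std::vector<size_t> utf16_offsets;   // the same starts in UTF-16 code units
};

// One generic pass: every format is seen as a stream of UTF-32 characters.
LineBuffers init_linebuffers(const uint8_t *data, size_t len, Decoder decode)
{
    LineBuffers lb;
    size_t off_bytes = 0, off_utf16 = 0;

    lb.byte_offsets.push_back(0);        // the first line starts at the beginning
    lb.utf16_offsets.push_back(0);

    while (off_bytes < len) {
        uint32_t ch;
        off_bytes += decode(data + off_bytes, len - off_bytes, &ch);
        off_utf16 += (ch >= 0x10000) ? 2 : 1;   // surrogate pairs take two units

        bool is_break = (ch == '\n' || ch == 0x2028 || ch == 0x2029);

        if (ch == '\r') {
            // CR LF counts as a single break; swallow the LF if present
            uint32_t next = 0;
            if (off_bytes < len) {
                size_t n = decode(data + off_bytes, len - off_bytes, &next);
                if (next == '\n') { off_bytes += n; off_utf16 += 1; }
            }
            is_break = true;
        }

        if (is_break) {
            lb.byte_offsets.push_back(off_bytes);
            lb.utf16_offsets.push_back(off_utf16);
        }
    }
    return lb;
}
```

Supporting another file format then only means supplying another decoder; the line-parsing loop itself does not change.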


Text Iteration


You may be thinking that this is all getting quite complicated now - and you'd be right, it is! The main complication arises (as we already know) because the TextView deals in UTF-16 character offsets, whereas the TextDocument deals in byte-offsets. Although the TextView always retrieves UTF-16 text from the TextDocument, it must still do so using byte-offsets. This is not a terribly neat solution.


To solve the problem I have introduced a third C++ class called TextIterator. The purpose of this class is to provide a "bridge" between the coordinate system of the TextView and the underlying file-format that the TextDocument understands. This means that the TextView no longer asks the Document directly for text - all text retrieval now goes through the Iterator.


class TextIterator
{
    friend class TextDocument;

public:
    int gettext(WCHAR *buf, int len);

private:
    // only "friends" of the TextIterator can create them
    TextIterator(ULONG off, ULONG len, TextDocument *doc);

    // keep track of position within the specified TextDocument
    TextDocument * text_doc;
    ULONG off_bytes;
    ULONG len_bytes;
};

As you can see the class definition for a TextIterator is very simple. It keeps track of the TextDocument that it is being used for, and the byte-offset within the document. These values are set when the TextIterator is constructed. The only code which actually does anything useful is shown below in the TextIterator::gettext function.


int TextIterator::gettext(WCHAR *buf, int buflen)
{
    // get text from the TextDocument at the specified byte-position
    int len = text_doc->gettext(off_bytes, len_bytes, buf, &buflen);

    // adjust the iterator's internal position
    off_bytes += len;
    len_bytes -= len;

    return buflen;
}

The TextIterator basically encapsulates the byte-based fileoffset details and hides them from the TextView. A single function has also been added to the TextDocument which is used to start a line-iteration:


TextIterator TextDocument::iterate_line(ULONG lineno, ULONG *linestart, ULONG *linelen)
{
    ULONG offset_bytes;
    ULONG length_bytes;

    lineinfo_from_lineno(lineno, linestart, linelen, &offset_bytes, &length_bytes);

    return TextIterator(offset_bytes, length_bytes, this);
}

The iterate_line function returns an independent TextIterator object which can then be used to access the file's text in a transparent manner. An example of text-iteration using this new class is shown below:


ULONG linestart, linelen;
TextIterator itor = m_pTextDoc->iterate_line(100, &linestart, &linelen);

WCHAR buf[200];
int len;

len = itor.gettext(buf, 200);

You can see how simple the process is now. The TextView now accesses the file-content through the TextIterator. Everything is line/character offset based as far as the TextView is concerned. The nasty byte-offsets and conversion details are hidden away in the Iterator and TextDocument, which is exactly how we want it.
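A line will often be longer than the buffer we hand to the iterator, so in practice the TextView would call gettext in a loop until nothing more is returned. The sketch below demonstrates the idea with stand-in classes: MiniDocument and MiniIterator are simplified inventions for this illustration (ASCII only, with char16_t standing in for WCHAR) and are not the real TextDocument and TextIterator.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string>

// Stand-in for the TextDocument: an in-memory ASCII "file".
class MiniDocument
{
public:
    explicit MiniDocument(const std::string &bytes) : data(bytes) {}

    // Same contract as described for gettext: returns the bytes consumed and
    // stores the number of UTF-16 units written in *buflen.
    size_t gettext(size_t offset, size_t lenbytes, char16_t *buf, size_t *buflen) const
    {
        assert(offset + lenbytes <= data.size());
        size_t n = lenbytes < *buflen ? lenbytes : *buflen;

        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned char)data[offset + i];   // widen ASCII to UTF-16

        *buflen = n;
        return n;
    }

private:
    std::string data;
};

// Stand-in for the TextIterator: remembers the byte position between calls.
class MiniIterator
{
public:
    MiniIterator(size_t off, size_t len, const MiniDocument *doc)
        : text_doc(doc), off_bytes(off), len_bytes(len) {}

    size_t gettext(char16_t *buf, size_t buflen)
    {
        size_t used = text_doc->gettext(off_bytes, len_bytes, buf, &buflen);
        off_bytes += used;      // advance past the bytes just converted
        len_bytes -= used;
        return buflen;          // UTF-16 units now in buf
    }

private:
    const MiniDocument *text_doc;
    size_t off_bytes;
    size_t len_bytes;
};

// Read everything the iterator covers, four UTF-16 units at a time.
std::u16string read_all(MiniIterator itor)
{
    std::u16string text;
    char16_t buf[4];
    size_t len;

    while ((len = itor.gettext(buf, 4)) > 0)
        text.append(buf, len);

    return text;
}
```

The caller never sees a byte-offset: it just keeps asking for more text until the iterator returns zero.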


In all probability I will end up changing the design yet again when some other issue crops up (I am expecting headaches with bidirectional text and complex scripts), but for the time being the TextView/TextIterator/TextDocument design that I have outlined here seems to work pretty well.


Additions to Neatpad


A quick mention on some changes to the actual Neatpad application. I have added three things. The first is command-line support. Now it is possible to specify a text-file at the commandline (just like with Notepad) and the file opens automatically when Neatpad starts.


The second addition is shell-menu support. There is a new setting in Neatpad's options to add an entry to Explorer's shell context menu for all filetypes, enabling you to right-click any file and select "Open with Neatpad". I always add this entry for Notepad when I build a new system and having the same (automatic) feature for Neatpad will be very useful in my opinion.


The last addition is window-position persistence. You may have noticed with Notepad that it saves its window-position each time it exits, so that the next time it starts the window is restored to the saved position. I have gone one step further than this - Neatpad saves the window position for individual files rather than for the application as a whole. This means that you could open+close different files in Neatpad and they each remember their own position on screen.


The way I have done this is to use NTFS Alternate Data Streams. I have been dying to find a use for "NTFS Streams" since they first appeared in Windows NT and I believe I have found the perfect use for them. Each time a file is opened, an NTFS stream attached to the file (called Neatpad.WinPos) is opened as well. A WINDOWPLACEMENT structure is saved in this stream - so when a file is opened the SetWindowPlacement API is called using the saved structure. And when a file is closed, Neatpad's current window position is retrieved with GetWindowPlacement and saved back into the main file's Neatpad.WinPos stream.


Coming up in Part 10


The subject of Unicode has proven to be rather difficult to solve. In fact I went through several rewrites of the TextView/TextDocument classes before I arrived at the solution I've presented here. This is just one of the reasons why it took such a long time to get right - the other reason being that I had to do a lot of background reading to make sure I understood all of the issues surrounding Unicode before I started.


I will mention again the book "Unicode Demystified" by Richard Gillam - this book is well worth a read and covers many more Unicode topics than I can present here. Although it was written for Unicode 3.0 don't let this put you off - the changes between Unicode 3.0 and 4.0 are fairly minimal and are basically just things like additions to the character repertoire.


Moving on to Part 10. The next tutorial will focus on the proper display of Unicode text. Understand that at the moment, all I have really done is turn Neatpad into a "wide-character" text viewer which happens to support UTF-8 and UTF-16 encoding formats. Although we are now using the "unicode" Windows APIs (specifically TextOutW) we are still a long way off being a real Unicode editor. Complex scripts, combining characters and bidirectional text are not supported yet. If you thought displaying Unicode text was a matter of simply calling TextOutW then think again - Unicode text display is a very complicated problem which cannot be solved using TextOut on its own.


The next tutorial will therefore focus on the Uniscribe API. This API (available since Windows 2000) provides support for displaying complex-scripts and bidirectional text. We will have to redesign Neatpad's text-display engine slightly because of the way Uniscribe works, and also modify the mouse-input and selection routines, but hopefully after the next tutorial we will be in a very good position in terms of Unicode support.


Attachment      Size
neatpad9.zip    147.57 KB
unisamp.zip     16.56 KB

Part 10 - Transparent Text

Introduction

It's been almost three months since I posted the last Neatpad tutorial, and it's also been over a year since I started this project. During this time I have been steadily working on the migration to the Uniscribe API. A lot of issues have become apparent with the way I am rendering text in Neatpad, so this tutorial will hopefully highlight these issues and offer a solution. There isn't going to be any code-download this time as I will be incorporating the ideas presented here into subsequent tutorials (which will be about Uniscribe).

Rendering text

So before we start looking at Uniscribe I want to take a moment to re-examine the process of text rendering. This is not going to be a very in-depth discussion - rather it will be an overview of text-rendering in general. As the basis for our discussion a single lower-case letter 'f' in italics will be used, as shown below.

The image shows a black glyph, representing the lower-case italic letter 'f', drawn on top of a grey rectangular background. The grey background represents the bounding rectangle of the character. Notice how the character overhangs the rectangle on the left and right edges. These measurements are called the left-bearing and right-bearing of a glyph.

In Windows the total width of a character is also represented as three values, called the ABC width. The A-width represents the width of the left-most overhanging portion of the glyph, whilst the C-width represents the right-most overhanging portion. The B-width measures the extent of the glyph. For characters such as the lower-case 'f' above, the A and C-widths are frequently negative values, which allows the characters to be positioned closer together when displayed as part of a string. Of course when a character has a negative A or C-width, it will overlap the space occupied by its neighbouring characters. The purpose of this short tutorial is to highlight the issue with rendering glyphs that have negative A and C-widths.
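The arithmetic behind the ABC widths is easy to sketch. In the example below the AbcWidth structure mirrors the three fields of the Win32 ABC structure, but GlyphExtent, place_glyph and the widths used for the italic 'f' are made up for illustration (real values would come from a call such as GetCharABCWidths). The advance width of a character is simply A + B + C, and a negative A or C means the drawn glyph pokes out of its own character cell.

```cpp
#include <assert.h>

// Mirrors the three fields of the Win32 ABC structure (abcA, abcB, abcC).
struct AbcWidth {
    int      a;   // space added before the glyph; negative = left overhang
    unsigned b;   // width of the drawn portion of the glyph
    int      c;   // space added after the glyph; negative = right overhang
};

// Hypothetical result type: where the cell and the glyph end up on screen.
struct GlyphExtent {
    int cell_left, cell_right;   // rectangle filled by an opaque draw
    int ink_left,  ink_right;    // where the glyph's pixels actually land
};

// Work out both extents for a glyph drawn with the pen at pen_x.
GlyphExtent place_glyph(int pen_x, const AbcWidth &abc)
{
    GlyphExtent e;
    e.cell_left  = pen_x;
    e.cell_right = pen_x + abc.a + (int)abc.b + abc.c;   // advance = A + B + C
    e.ink_left   = pen_x + abc.a;
    e.ink_right  = e.ink_left + (int)abc.b;
    return e;
}

// Worked example with made-up widths for an italic 'f'.
void abc_example()
{
    AbcWidth f = { -2, 11, -3 };
    GlyphExtent e = place_glyph(10, f);

    assert(e.cell_left == 10 && e.cell_right == 16);   // cell is only 6 wide
    assert(e.ink_left  == 8  && e.ink_right  == 19);   // glyph is 11 wide
}
```

With these numbers the glyph overhangs its cell by 2 pixels on the left and 3 pixels on the right, and those are the pixels that a neighbouring character's background can paint over.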

All applications (including Neatpad) render text using one of the Windows text-display APIs - usually DrawText or TextOut. Text in Windows can be drawn in one of two ways - with and without a background rectangle. Text drawn with a background fill is referred to as opaque text, whilst text drawn without is called transparent text. This text-drawing feature is controlled with the SetBkMode API, or the ETO_OPAQUE option when calling ExtTextOut.

The problem of overhanging glyphs only presents itself when breaking a string into smaller segments and rendering them individually. The issue occurs because the act of filling the background causes the overhanging portions of a glyph to be overpainted, resulting in the characters being clipped by the neighbouring background rectangles. The image below illustrates the two methods of painting text. The text is made up of six individually rendered 'F's, each in a different colour.

Clipping (opaque text)
No clipping (transparent text)

The image on the left illustrates the text drawn using an opaque background-fill mode. This is the simplest way to draw text, because the text and background are rendered in one go. This is how Neatpad currently renders text. However problems occur when text is broken up into blocks. Of course the problem is made much worse by the Italic text - for most Roman text this issue never occurs.

SetBkMode(hdc, OPAQUE);
TextOut(hdc, x, y, szText, nTextLen);

The image on the right illustrates the ideal way to display multi-coloured text. In order to render the text this way the process had to be split into two phases. Firstly, the background was drawn in its entirety (from left-to-right). Then the text was drawn 'transparently' over the top of the background.

FillRect(hdc, &backgroundRect, hBackBrush);   // hBackBrush: a brush created for the background colour

SetBkMode(hdc, TRANSPARENT);
TextOut(hdc, x, y, szText, nTextLen);

Splitting the drawing into two phases will introduce flickering, so to remedy this side-effect we must resort to 'double-buffering' whenever we draw text. This means that we must draw each line of text into an off-screen buffer before displaying it on screen. It is entirely necessary to draw text this way, and this is the direction I am heading in with Neatpad. However a consequence of moving to a double-buffering scheme is that we can re-examine how selection-highlights are drawn.

Drawing Selection Highlighting


Selection-highlighting has always seemed to be a rather elusive subject - at least to me. The problem is, there aren't really any strict guidelines as to how one should represent text-selections, and it was only after delving into the world of Uniscribe and 'low level' glyph rendering, that I fully understood the issues involved.


The images below illustrate the two main methods of representing text-selection. For the purposes of this example I am using the same string of six 'F's. This time however two of the six glyphs are shown as 'selected' - using the default Windows system colours.






Background highlighting

Inversion highlighting


The two basic strategies can be referred to quite simply as Background highlighting, and Inversion highlighting.



My preference is for the 'inversion' method. When text is highlighted I like to think of the selection-rectangle being overlaid on top of the text, rather than being behind it. Personally I don't like the look and feel of the background-highlighting method. Because of the typical colour-scheme in Windows (white background, white selected text), some selected text seems to disappear when the selection highlight moves over it. If you look closely at the example you can see this effect clearly - because the two selected characters are white, any overlap onto a white background makes it look as if the characters have been truncated. I am moving to the 'inversion' method in Neatpad simply because it looks more professional.



The process of 'inversion highlighting' is rather more involved than you might expect. The problem we face is the issue of overhanging characters. In the example above the character is shown as selected - however the leading and trailing edges fall into unselected areas and must be painted in separate colours. When displaying this character in a selected state it must somehow be split apart - with the main character body rendered one way, and the left-and-right overhanging segments rendered differently. My solution to this problem is to use a three-phase rendering strategy, which is outlined below.






1a. Background

1b. Mask selected areas


The first stage is to paint the background using a series of simple flat fills. Every character has its regular bounding rectangle painted this way, with both normal and selected areas being rendered. An important detail to understand is the use of clipping at this early stage. Any selected areas are masked immediately after being painted. What this means is that the ExcludeClipRect API is used to remove the selected areas from the current device-context's clipping-region. The reason why this detail is important will soon become clear.






2a. Normal text

2b. Invert mask


Once the background has been painted in its entirety (for all characters/glyphs), the second stage is to render the text over the top. A transparent drawing-mode is used to paint the text (SetBkMode with the TRANSPARENT setting). Because of the clipping-region created during the first pass, any selected (blue) areas are protected as the text is drawn over them.


After the text has been drawn the clipping-region for the device-context is inverted (ExtSelectClipRgn with RGN_XOR), which results in the selected areas becoming unmasked. The text we have just rendered will now be protected from subsequent drawing operations.






3a. Selected text

3b. The result


The final stage is to redraw the text. The exact same text is drawn, directly over the top of the text we just drew. The difference is that the text is drawn in a single colour this time - using the system-highlight colour (i.e. white). Because of the clipping-region we have created, only the selected (blue) areas are modified this time. When the clipping-region is removed the drawing is complete. The result is high quality text-output that would be suitable for a word-processing package.


The use of clipping is the key to making this method work. Although it seems as if the text is overdrawn, in reality it is not because of the careful use of clipping-regions. Importantly, when ClearType or another form of anti-aliasing is enabled, using clipping this way is absolutely necessary, because if we really did overdraw the text we would ruin the ClearType effect. Using clipping in the method outlined above protects us from this 'overdraw', even though we attempt to draw the text twice.
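The three phases are easier to follow when reduced to a toy model. The sketch below is not GDI code: it paints a single 18-pixel scan-line held in a character array, with a byte-per-pixel mask standing in for the device-context's clipping-region. Every name and constant in it (render_row, the four glyphs, the ink offsets) is invented purely for illustration. Each glyph occupies a 4-pixel cell but puts one pixel of "ink" into the cell on either side, which is the overhang we are worried about.

```c
#include <assert.h>
#include <string.h>

#define ROW_W   18   /* pixels in the model scan-line                  */
#define GLYPHS  4    /* number of glyphs on the line                   */
#define ADVANCE 4    /* width of every glyph cell                      */
#define ORIGIN  1    /* x-coordinate of the first glyph cell           */

/* Ink offsets relative to the start of a cell:
   -1 overhangs to the left, ADVANCE overhangs to the right. */
static const int ink[3] = { -1, 1, ADVANCE };

/* Draws every glyph in one colour, but only where the mask allows it. */
static void draw_text(char *row, const unsigned char *clip, char colour)
{
    for (int g = 0; g < GLYPHS; g++) {
        for (int k = 0; k < 3; k++) {
            int x = ORIGIN + g * ADVANCE + ink[k];
            if (x >= 0 && x < ROW_W && clip[x])
                row[x] = colour;
        }
    }
}

/* Renders glyphs selFirst..selLast (inclusive) as selected.
   'w' = normal background, 'B' = selection background,
   'k' = normal text,       'W' = selected text. */
void render_row(char row[ROW_W + 1], int selFirst, int selLast)
{
    unsigned char clip[ROW_W];
    int selX0 = ORIGIN + selFirst * ADVANCE;
    int selX1 = ORIGIN + (selLast + 1) * ADVANCE;

    assert(selFirst >= 0 && selLast < GLYPHS && selFirst <= selLast);
    memset(clip, 1, sizeof clip);

    /* 1a + 1b: flat-fill the background, then mask the selected area */
    for (int x = 0; x < ROW_W; x++) {
        int selected = (x >= selX0 && x < selX1);
        row[x] = selected ? 'B' : 'w';
        if (selected)
            clip[x] = 0;
    }

    /* 2a: normal text - the selected area is protected by the mask */
    draw_text(row, clip, 'k');

    /* 2b: invert the mask, so only the selected area can be drawn on */
    for (int x = 0; x < ROW_W; x++)
        clip[x] = !clip[x];

    /* 3a: the same text again, this time in the highlight colour */
    draw_text(row, clip, 'W');

    row[ROW_W] = '\0';
}
```

Selecting the middle two glyphs with render_row(row, 1, 2) leaves "kwkwkWWBWWWBWkkwwk" in the buffer. Pixel 4 is the left overhang of a selected glyph but lands in an unselected area, so it keeps the normal text colour; pixel 5 is the right overhang of an unselected glyph inside the selection, so it takes the highlight colour. No pixel is ever painted with text twice.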


Of course there are always going to be issues with rendering text this way. Performing three separate passes will introduce a fair amount of flickering and basically dictates that we must use double-buffering when drawing all text. There is also the potential performance-decrease of drawing the text twice. Note however that this last issue can be managed quite easily by only performing this 'double drawing' when necessary - most of the time we can avoid drawing text twice (such as when there is no selection in the text, or if all the text is selected).
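The decision about whether the second text pass is needed can be made per line before any drawing takes place. The following is only a sketch of that check under my own assumptions: the selection is given as a half-open range of character offsets relative to the start of the line, and the DrawPasses enumeration and passes_needed function are hypothetical names, not part of Neatpad.

```c
#include <assert.h>

typedef enum {
    PASS_NORMAL_ONLY,     /* no selection on this line: one pass, normal colours     */
    PASS_SELECTED_ONLY,   /* whole line selected: one pass, highlight colour         */
    PASS_BOTH             /* partial selection: both passes, separated by clipping   */
} DrawPasses;

/* selStart/selEnd describe the half-open range [selStart, selEnd)
   and may extend beyond either end of the line. */
DrawPasses passes_needed(int lineLen, int selStart, int selEnd)
{
    assert(lineLen >= 0);

    /* clip the selection to the line */
    if (selStart < 0)
        selStart = 0;
    if (selEnd > lineLen)
        selEnd = lineLen;

    if (selStart >= selEnd)
        return PASS_NORMAL_ONLY;

    if (selStart == 0 && selEnd == lineLen)
        return PASS_SELECTED_ONLY;

    return PASS_BOTH;
}
```

Only the PASS_BOTH case pays for drawing the text twice, and in a typical editing session that is at most the first and last lines of a selection.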


Although the method I have described here might seem quite strange if you've not encountered it before, I can promise you that it is not that unusual. The ScriptString API (and therefore apps like Notepad) uses this exact same method, and you can be sure that any text-editing package that uses the 'inversion method' will use something very similar also. The point is, there really isn't any alternative to the method I have chosen. If your goal is high-quality text display then concessions must be made.


Coming up in Part 11!


It should be noted that I have taken an extreme example here. Most English text in a Roman typeface will not require complex rendering such as this - however because we must support all forms of text (especially the more exotic Unicode scripts) we must do things properly from this point on. In fact the inversion-method is entirely necessary for rendering with Uniscribe as we will find out shortly.


This tutorial is really just a preview of things to come. The technique of inversion-highlighting will be incorporated into Neatpad's text-display engine as we migrate to the Uniscribe API.


Part 11 - Introduction to Uniscribe

Introduction

Uniscribe is a low-level Win32 API that provides a high degree of control over the processing and display of Unicode text. The API is designed to provide a generic interface to all forms of Unicode text (complex or otherwise), and transparently handles properties such as bi-directional text and combining character sequences.

Uniscribe is a single DLL called USP10.DLL, which contains all of the Uniscribe APIs. This DLL is present on Windows 2000 and above, or any computer with Internet Explorer 5.0 (or greater) installed. Two Platform SDK files (USP10.H and USP10.LIB) are provided by Microsoft to allow an application to make use of this complex-script support. An important point about Uniscribe is that it doesn't just handle complex-scripts - it can be used to process and display all Unicode text - so can be used as a direct replacement for existing text-output routines such as DrawText and TextOut.

The Uniscribe API is divided into two categories - the low-level API itself, and a wrapper library called ScriptString which hides many of the complexities of dealing with Uniscribe directly. The purpose of this tutorial is to give a brief introduction to the world of Uniscribe before we start delving in properly.

Uniscribe in Windows

When I first started Neatpad I was unfamiliar with exactly what Uniscribe entailed, and it was only after researching Unicode that I fully appreciated the issues surrounding the display of Unicode text. Although Uniscribe occupies its own section within the MSDN documentation, other than the occasional reference it is very easy to miss unless you already know of its existence.

MSDN states that since Windows 2000, the ExtTextOut function (and others like TextOut, DrawText etc) have been extended to support complex scripts. Although this is true, it gives the impression that an application can call ExtTextOutW (the Unicode version) at any time with a buffer of UTF-16 text and it will always display correctly.

Unless Windows has been configured to do so, functions such as ExtTextOut do not automatically support complex scripts. The image above shows the "Regional and Language Options" dialog. The two settings which have been highlighted are not normally enabled by a default installation of American/English Windows.

Enabling complex-script support installs a number of extra libraries, after which ExtTextOut will use Uniscribe when necessary to display complex scripts.

BOOL ExtTextOut (
  HDC       hdc,          // handle to DC
  int       X,            // x-coordinate of reference point
  int       Y,            // y-coordinate of reference point
  UINT      fuOptions,    // text-output options (ETO_GLYPH_INDEX etc)
  RECT    * lprc,         // optional dimensions
  LPCTSTR   lpString,     // string
  UINT      cbCount,      // number of characters in string
  INT     * lpDx          // array of spacing values
);

ExtTextOut is most commonly used to display a string of text. However it can do much more than this. When the ETO_GLYPH_INDEX and ETO_PDY options are specified, ExtTextOut can be used to display a buffer of glyphs instead of characters. This feature of ExtTextOut is used when displaying a string containing complex-scripts, as the diagram below illustrates.

Text drawing in Windows 2000 and above

For any string containing complex scripts, ExtTextOut makes use of Uniscribe to display it. Uniscribe breaks the string down into groups of glyphs and then re-calls ExtTextOut, this time with the ETO_GLYPH_INDEX option, and a buffer of glyph-indices instead of the original character values. For regular Unicode text which doesn't require any special processing, ExtTextOut behaves exactly the same as it did under previous Windows versions.

You may be wondering why Uniscribe is necessary if routines such as DrawText and TextOut can for the most part render complex scripts quite successfully. For applications which just output single strings of text, Uniscribe is not necessary.

It is only when a string must be broken up (for the purposes of styling/formatting) that Uniscribe is required. It is just not possible to split a Unicode string into sections (as we have been with Neatpad up 'til now). Doing so breaks all kinds of things such as contextual shaping behaviours and bidirectional support. A modern text-editor simply must support Unicode and all the various scripts that come along with that - we have no other choice than to move to Uniscribe.

The ScriptString API

The ScriptString API is designed for applications which want to display text in a single font and colour. Notepad (and the standard Windows EDIT control) is a prime example of an application built on the ScriptString API. One of the nice features of this API is that it allows you to display a string of text, with a portion of that string optionally displayed as 'selected'. This is actually a very nice touch as it saves a tremendous amount of effort.

The ScriptStringAnalyse function is the starting point with Uniscribe. It is a pretty intimidating function to look at. However its purpose is simply to perform shaping and glyph-generation for any string of Unicode text, and it returns a SCRIPT_STRING_ANALYSIS structure when complete.

HRESULT WINAPI ScriptStringAnalyse (
  HDC                       hdc,
  void                    * pString,
  int                       cString,
  int                       cGlyphs,
  int                       iCharset,
  DWORD                     dwFlags,
  int                       iReqWidth,
  SCRIPT_CONTROL          * psControl,
  SCRIPT_STATE            * psState,
  int                     * piDx,
  SCRIPT_TABDEF           * pTabdef,
  BYTE                    * pbInClass,
  SCRIPT_STRING_ANALYSIS  * pssa
);

SCRIPT_STRING_ANALYSIS is an opaque structure - there is no documentation which details what it contains. This is not important though as this structure is simply passed to the rest of the ScriptString API without requiring any further knowledge.

HRESULT WINAPI ScriptStringOut (
  SCRIPT_STRING_ANALYSIS    ssa, 
  int                       iX, 
  int                       iY, 
  UINT                      uOptions, 
  RECT                    * prc, 
  int                       iMinSel, 
  int                       iMaxSel, 
  BOOL                      fDisabled 
);

ScriptStringOut is used to display a string of text that was previously analyzed. Note that a text-string is not specified with this call - only the SCRIPT_STRING_ANALYSIS structure is passed which contains all the necessary information to display the original string.

HRESULT WINAPI ScriptStringXtoCP (
  SCRIPT_STRING_ANALYSIS    ssa, 
  int                       iX, 
  int                     * piCh, 
  int                     * piTrailing 
);

ScriptStringXtoCP is an interesting function. It provides a mechanism for caret and mouse positioning within a string of Unicode text.

HRESULT WINAPI ScriptStringCPtoX (
  SCRIPT_STRING_ANALYSIS    ssa, 
  int                       icp, 
  BOOL                      fTrailing, 
  int                     * pX 
);

ScriptStringCPtoX is the counterpart to ScriptStringXtoCP. It performs the opposite task - converting a string-position to a display-coordinate.
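To get a feel for what these two functions compute, here is a small portable model of the same idea. It is only a sketch and makes some big assumptions: the text is plain left-to-right, each character maps to exactly one glyph, and the glyph advance widths are already known. The function names cp_to_x and x_to_cp are my own, and coordinates outside the string are simply clamped, which is not necessarily how Uniscribe reports them.

```c
#include <assert.h>

/* Returns the x-coordinate of the leading (trailing = 0) or
   trailing (trailing = 1) edge of character cp. */
int cp_to_x(const int *advances, int count, int cp, int trailing)
{
    int x = 0;

    assert(cp >= 0 && cp < count);

    for (int i = 0; i < cp; i++)
        x += advances[i];

    return trailing ? x + advances[cp] : x;
}

/* Finds the character under x-coordinate x. *trailing is set to 1
   when x falls in the right-hand half of that character. */
void x_to_cp(const int *advances, int count, int x, int *cp, int *trailing)
{
    int left = 0;

    assert(count > 0);

    for (int i = 0; i < count; i++) {
        int right = left + advances[i];

        if (x < right) {
            *cp       = i;
            *trailing = (x >= left + advances[i] / 2) ? 1 : 0;
            return;
        }

        left = right;
    }

    /* beyond the end of the string: clamp to the last trailing edge */
    *cp       = count - 1;
    *trailing = 1;
}
```

When positioning a caret from a mouse-click, the insertion point is the character index plus the trailing value, so a click on the right-hand half of a character puts the caret after it.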

HRESULT WINAPI ScriptStringFree(
  SCRIPT_STRING_ANALYSIS  * pssa  
);

When an application has finished displaying the string the ScriptStringFree function can be used to clean up. There are more ScriptString functions than what I have listed here, but with just these five an application can implement the front-end to a fully-functional text-editor with minimal effort.

The image above shows a simple application I wrote which demonstrates the ScriptString API. The source-code and demo executable can be downloaded at the top of this article.

An oddity of ScriptString is this: ScriptStringOut fails if the device-context used to render is not the same as the one used when analyzing the string with ScriptStringAnalyse!

Introducing UspLib

The main problem with the ScriptString API is its inability to display text in more than one font and colour. This makes it particularly unsuitable for our purposes with Neatpad. Our only option is to make use of the low-level Uniscribe functions directly.

UspLib is a library I have written to provide a far richer capability than ScriptString can offer. This new library provides a wrapper around the low-level Uniscribe API that we will be discussing over the next couple of tutorials. UspLib is very similar in approach to the ScriptString Uniscribe wrapper, but goes a lot further in terms of text-colouring and formatting.

USPDATA * USP_Allocate();

The first API is USP_Allocate. This function returns a pointer to a USPDATA object which must be used for subsequent UspLib operations.

BOOL USP_Analyze (
  USPDATA   * uspData,
  HDC         hdc,
  WCHAR     * wstr,
  int         wlen,
  ATTR      * attrRunList,
  UINT        flags,
  USPFONT   * uspFont
);

USP_Analyze is similar to ScriptStringAnalyse. The difference is, a string of text can be re-analyzed using an existing USPDATA object.

void USP_ApplyAttributes (
  USPDATA  * uspData,
  ATTR     * attrRunList
);

Once a string has been analyzed (i.e. itemized and shaped etc.), colour-attributes can be reapplied at any time using USP_ApplyAttributes. The font-information stored in the ATTR run-list is ignored.

void USP_ApplySelection (
  USPDATA  * uspData,
  int        selStart,
  int        selEnd
);

USP_ApplySelection performs a similar task to USP_ApplyAttributes. However this time only the selection-flags are modified in the USPDATA object.

int USP_TextOut (
  USPDATA  * uspData,
  HDC        hdc,
  int        xpos,
  int        ypos,
  RECT     * rect
);

USP_TextOut is the counterpart to ScriptStringOut. It takes as input the USPDATA object which was previously analyzed, and draws it to the specified location. Any fonts, colours and selection-highlights are applied to the text as it is drawn.

void USP_Free(USPDATA * uspData);

USP_Free should be called when the USPDATA object is no longer needed. Over the course of the next two or three tutorials I will be detailing how I have implemented UspLib, and will provide details and examples of using Uniscribe directly.

I have designed UspLib in isolation from Neatpad. My intention is that it is a completely stand-alone library, which can be used by any application to add complex-script support. It should certainly be possible to import UspLib into your projects and use it straight away, because it contains no dependencies other than the Uniscribe DLL.

Further Reading

There is very little information available about Uniscribe other than what is available in MSDN.

Uniscribe Platform SDK Reference

Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0

Globalization Step-by-Step - Complex Scripts Awareness

Windows Glyph Processing

There is also the CSSamp example program from Platform SDK, in the Samples sub-directory:
\PlatformSDK\Samples\winui\globaldev\CSSamp

Alternatives to Uniscribe

Not every editor uses Uniscribe. If open-source is your thing then there are currently two very impressive efforts available which offer a very strong alternative to Uniscribe. There is also an equivalent version of Uniscribe available for Apple's OSX called ATSUI.

International Components for Unicode (ICU) is IBM's open-source Unicode support library. It contains a lot of functionality, from character-conversions and analysis to searching and layout.

Pango is an open-source library for laying out and rendering Unicode text. It appears to sit on top of the GTK display library and can specify either Cairo or Win32 (Uniscribe) rendering back-ends. It offers a more complete solution than Uniscribe and appears to be very well designed and implemented. However Pango is UTF-8 based so this may be a consideration if the rest of your application is UTF-16.

Apple Type Services For Unicode Imaging (ATSUI) is Apple's own version of Uniscribe, although it appears to be higher-level than Microsoft's effort. A brief look at the documentation for ATSUI indicated a much easier-to-use design, and substantially better documentation than Microsoft had managed for Uniscribe.

Coming up in Part 12

This was just a short introduction to Uniscribe - hopefully you are a little more aware of what Uniscribe is capable of, and have downloaded and tested the ScriptString sample program.

Part 12 will focus on the first two Uniscribe functions: ScriptItemize and ScriptLayout. There is a lot of detail to cover with just these two APIs and it won't be until Part 13 that we actually see any text being drawn this way with Neatpad.

Lastly, I've not had much feedback in the last few months about Neatpad - did you read this tutorial and find it useful?

Attachment: scriptstring.zip (16.32 KB)

Part 12 - Uniscribe Mysteries

Introduction

The last tutorial presented a very brief overview of the Uniscribe ScriptString API. Unfortunately ScriptString is insufficient for our purposes with Neatpad because of the limitations of a single font and colour. The aim of this tutorial is therefore to investigate the "low-level" Uniscribe API. Because we have very specific requirements for Neatpad's text display, our approach will be that of a multi-font, syntax-coloured text editor.

The string of Unicode text shown below will be used as the basis for much of our discussion. The Arabic phrase in the middle has been chosen because its Unicode properties suit the context of our discussion, not because it has any special meaning.

HelloيُساوِيWorld

You will notice that two of the "glyphs" in the Arabic phrase above have been highlighted in different colours to the rest of the string. These two letters are "U+0633 ARABIC LETTER SEEN" and "U+0627 ARABIC LETTER ALEF". In isolation they display as follows:

سا
سا

The box on the left shows the two characters rendered with contextual-shaping (assuming you are using a Unicode-enabled web-browser such as Internet Explorer and have the appropriate fonts installed). This is the behaviour we are aiming for. The box on the right shows the characters rendered separately from each other. If both boxes look the same then your browser is not displaying Unicode properly.

One of the big reasons Uniscribe exists is to provide the kind of complex "shaping" behaviour illustrated above. The requirement on our part (as programmers) is that we do not split Unicode strings into individual characters because this would break the shaping behaviour we are aiming for. Therefore the major goal of this tutorial is to explain how characters can be drawn individually (in different colours) whilst still maintaining the contextual shaping.

Basic Outline

The basic set of steps for drawing text with Uniscribe are outlined below. Note that I am omitting word-wrapping (and the ScriptBreak API) for the moment. So assuming that we have a string of UTF-16 Unicode text, this is what we do:

  1. ScriptItemize - to break the string into distinct scripts or "item-runs".
  2. Merge item runs with application-defined "style" runs to produce finer-grained items.
  3. ScriptLayout - to potentially reorder the items.

Then for each item/run (in the order dictated by the ScriptLayout results)

  4. ScriptShape - to apply contextual shaping behaviour and convert the characters from each run into a series of glyphs.
  5. ScriptPlace - to calculate the width and positions of each glyph in the run.
  6. Apply colouring/highlighting to the individual glyphs.
  7. ScriptTextOut - to display the glyphs.

This outline closely follows how Microsoft recommends you use the Uniscribe API. Note however that I have included an extra step#6 (text-colouring) which is not mentioned in MSDN. The reasoning behind this difference will be explained as we progress through the tutorial. I will leave the subject of word-wrapping to a later tutorial, as this is more of a problem of line-buffer management rather than using Uniscribe.

BOOL UspAnalyze (
  USPDATA         * uspData,   
  HDC               hdc,
  WCHAR           * wstr,
  int               wlen,
  ATTR            * attrRunList,
  UINT              flags,
  SCRIPT_TABDEF   * tabDef,
  USPFONT         * uspFont 						  
);

The function prototype above is for a function called UspAnalyze. It is part of the new UspLib text-rendering engine that I have written for Neatpad. UspAnalyze is similar in many ways to ScriptStringAnalyse, but with the additional capability of allowing the caller to specify font and style information for the string.

The rest of this tutorial will begin to focus on each aspect of the Uniscribe API as outlined above and will discuss any issues related to each stage. However each stage that we look at will be a key step towards implementing the UspAnalyze function.

1. ScriptItemize

ScriptItemize is usually the first Uniscribe function to be called when displaying a string of Unicode text. Its purpose is to identify the various scripts in a string, and then split this string into items (or runs) according to the script, with one item per script.

Index   Code unit   Character
0       0048        H
1       0065        e
2       006C        l
3       006C        l
4       006F        o
5       064A        ي
6       064F        ُ
7       0633        س
8       0627        ا
9       0648        و
10      0650        ِ
11      064A        ي
12      0057        W
13      006F        o
14      0072        r
15      006C        l
16      0064        d

The table above illustrates how the UTF-16 string "HelloيُساوِيWorld" would be treated by ScriptItemize. The characters are shown in logical order - in other words, the order that they appear when stored in memory. The string has been divided into three segments. Note that these items are derived purely by their script - not by the finer-grained glyphs and grapheme clusters that are present in the string.
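As a rough illustration of what "one item per script" means, the sketch below splits a UTF-16 buffer wherever the script changes. Please treat it as a crude stand-in only: it classifies each code unit purely by whether it lies in the Arabic block (U+0600 to U+06FF), whereas the real ScriptItemize knows about every script and applies the bidirectional algorithm as well. The ModelItem type and the model_itemize function are invented for this example.

```c
#include <assert.h>

/* Hypothetical stand-in for SCRIPT_ITEM: a start offset and a script id. */
typedef struct {
    int charPos;   /* offset of the first WCHAR in the item      */
    int script;    /* 1 = Arabic block, 0 = everything else      */
} ModelItem;

static int script_of(unsigned short ch)
{
    return (ch >= 0x0600 && ch <= 0x06FF) ? 1 : 0;
}

/* Splits str[0..len) into items, one per run of identical script.
   The items buffer must have room for maxItems + 1 entries, because
   a sentinel entry marking the end of the string is always appended.
   Returns the item count, or -1 if maxItems is too small. */
int model_itemize(const unsigned short *str, int len,
                  ModelItem *items, int maxItems)
{
    int count = 0;

    assert(len >= 0 && maxItems >= 0);

    for (int i = 0; i < len; i++) {
        int s = script_of(str[i]);

        /* start a new item at the first character and at every change of script */
        if (i == 0 || s != items[count - 1].script) {
            if (count == maxItems)
                return -1;

            items[count].charPos = i;
            items[count].script  = s;
            count++;
        }
    }

    /* the "hidden" end-of-string item */
    items[count].charPos = len;
    items[count].script  = -1;

    return count;
}
```

Fed the seventeen code units from the table above, the model produces three items starting at offsets 0, 5 and 12, followed by a sentinel at offset 17.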

HRESULT WINAPI ScriptItemize(
  WCHAR          * wszText,       // pointer to unicode string
  int              wszLength,     // count of WCHARs         
  int              cMaxItems,     // length of pItems buffer
  SCRIPT_CONTROL * psControl,    
  SCRIPT_STATE   * psState, 
  SCRIPT_ITEM    * pItems,        // out - array of SCRIPT_ITEM structures
  int            * pcItems        // out - count of items
);

ScriptItemize returns an array of SCRIPT_ITEM structures, one for each "shapable" item (script) in the paragraph of text. The number of structures is returned in *pcItems. In the example above, *pcItems would hold the value "3". This SCRIPT_ITEM structure is very simple and is shown below.

struct SCRIPT_ITEM
{ 
   int              iCharPos; 
   SCRIPT_ANALYSIS  a;
};

The SCRIPT_ITEM::iCharPos variable is used to identify the starting position of each "run" of text in the string. The SCRIPT_ANALYSIS child structure holds a lot of extra information about the run including the reading-direction and the shaping-engine that should be used to convert the run into glyphs.

The image below this time illustrates how our Unicode string is represented by an array of SCRIPT_ITEM structures:

Notice that there is always a "hidden" SCRIPT_ITEM on the end of the array which represents the end-of-string. This makes it possible to calculate the length of each SCRIPT_ITEM by using the following construct:

itemLength = pItems[i+1].iCharPos - pItems[i].iCharPos;
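The same construct can be wrapped in a loop to obtain the length of every item in one go. This is a minimal sketch: ItemPos is a cut-down stand-in for SCRIPT_ITEM which keeps only the iCharPos member, and item_lengths is a hypothetical helper.

```c
#include <assert.h>

/* Stand-in for SCRIPT_ITEM: the real structure also holds a SCRIPT_ANALYSIS. */
typedef struct {
    int iCharPos;
} ItemPos;

/* items must contain itemCount entries plus the end-of-string entry,
   which is what makes reading items[i + 1] safe for the last real item. */
void item_lengths(const ItemPos *items, int itemCount, int *lengths)
{
    assert(itemCount >= 0);

    for (int i = 0; i < itemCount; i++)
        lengths[i] = items[i + 1].iCharPos - items[i].iCharPos;
}
```

For our example string the items begin at offsets 0, 5 and 12 with the end-of-string entry at 17, giving lengths of 5, 7 and 5 WCHARs.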

There are a couple of general points worth making here. Notice that the first parameter to ScriptItemize is a WCHAR*. There is no ANSI version of this function so from now on Neatpad will be a pure Unicode application. Unless we can use the Microsoft Layer for Unicode (MSLU) we will have to drop support for Win9x.

Note also that you can never know in advance how many SCRIPT_ITEMs will be returned for a string of text, so it is usually necessary to use a loop of some kind - which allocates more and more memory for the SCRIPT_ITEM buffer until the call to ScriptItemize succeeds:

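SCRIPT_CONTROL scriptControl = { 0 };
SCRIPT_STATE   scriptState   = { 0 };

SCRIPT_ITEM *itemList  = 0;
int          itemCount = 0;
int          allocLen  = 0;
HRESULT      hr;

do
{
    // grow the buffer on every pass (the step of 16 items is an arbitrary choice);
    // one extra entry is reserved for the end-of-string item
    allocLen += 16;
    itemList  = (SCRIPT_ITEM *)realloc(itemList, (allocLen + 1) * sizeof(SCRIPT_ITEM));

    if(itemList == 0)
    {
        hr = E_OUTOFMEMORY;
        break;
    }

    hr = ScriptItemize(
            wstr,
            wlen,
            allocLen,
            &scriptControl,
            &scriptState,
            itemList,
            &itemCount);

    // E_OUTOFMEMORY means the buffer was too small - anything else is a real error
    if(hr != S_OK && hr != E_OUTOFMEMORY)
        break;

} while(hr != S_OK);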

A word of warning here - make sure you always pass fully initialized SCRIPT_CONTROL and SCRIPT_STATE structures to ScriptItemize even if their contents are initialized to all "zeros". Unless both these structures are specified, the Unicode bi-directional algorithm will not be used for the purposes of itemizing the string. This can result in incorrect identification of item-run positions in some circumstances (such as LTR and RTL scripts appearing in the same string).

Interestingly, MSDN says that when the SCRIPT_CONTROL and SCRIPT_STATE are NULL the itemization is based purely on character code. When non-null, the full bidirectional algorithm is applied as stated above. For this latter case the entire paragraph must be in memory. Although I'm not going to go down this path, this does suggest a method for handling arbitrarily long lines of text that cannot reside in memory as whole paragraphs.


2. Merging Style Runs


The reason we are using Uniscribe directly instead of the ScriptString functions is because we want finer-grained control over text colouring and font selection. And we have now reached the point (after calling ScriptItemize) where Microsoft's documentation advises us to merge "application-defined" style runs with the item information returned by ScriptItemize. Here's the quote from MSDN:



"Before using Uniscribe, an application divides the paragraph into runs, that is, a string of characters with the same style. The style depends on what the application has implemented, but typically includes such attributes as font, size, and color....Merge the item information with the run information to produce runs with a single style, script and direction."



This quote is one of the most confusing, cryptic and misleading statements in the whole of the Uniscribe documentation. The problem is, there are no hints in MSDN as to how one should merge style-runs with item-runs, or even what a "style run" actually is. We will look at how to "merge runs" a little further down, but first let's understand what is meant by the term "style run".


Of course, a style-run is whatever an application wants it to be. In essence it is a range of text that has been assigned a specific set of attributes. In the case of Neatpad I have used an ATTR structure to represent colour and font - one for each character in a string of text. The string of text and the attribute-list looked something like this:


WCHAR buff[ MAXLINELEN ];
ATTR attr[ MAXLINELEN ];

However since migrating to Uniscribe and the 'inversion-highlighting' scheme, I have extended the ATTR structure somewhat, so that it is no longer a 'one ATTR per character':


struct ATTR
{
  COLORREF  fg;          // foreground text colour
  COLORREF  bg;          // background text colour

  int       len  : 16;   // length of this run (in WCHARs)
  int       font : 7;    // font-index
  int       sel  : 1;    // selection flag (yes/no)
  int       ctrl : 1;    // show as an isolated control-character
};

The foreground and background colours remain unchanged. The new structure-members are detailed below.



The problem we now face is that we have two lists of unrelated entities - a SCRIPT_ITEM list which identifies the position of scripts within the original character array, and an ATTR list which identifies the ranges of style in the original string. We need to understand what MSDN means when it instructs us to merge these two unrelated lists together:


SCRIPT_ITEM *itemList;
ATTR *attrList;

The basic process is to look at the style-runs and item-runs together and identify any position within the string where a run of one type overlaps another. For example, suppose a SCRIPT_ITEM run overlaps the boundary-position between two ATTR structures. This SCRIPT_ITEM would have to be split into two new halves - each representing a different ATTR style-run.


The way the split occurs is like this: the SCRIPT_ITEM::iCharPos variables are modified to point to new positions within the original string and an array of ITEM_RUN structures is built up which holds these new character-positions. The other contents of the SCRIPT_ITEM (i.e. the SCRIPT_ANALYSIS structure) must be duplicated between the resulting two halves. Think of it as follows: The ScriptItemize function first breaks the string into discrete units based on script. The merge process then further breaks the string into smaller units based on style, should there be any overlap between the two.


The following diagram hopefully illustrates what is meant by a "style merge":



Now here's the problem. If we break up a SCRIPT_ITEM won't this affect the contextual shaping behaviour of the Uniscribe engines? The short answer is, yes, we will break the Uniscribe shaping behaviours by breaking up a string - and no, there is no magic way to get around this problem.


You may notice in the above diagram that I have written "font only" next to the ATTR style runs. This is deliberate, because although Microsoft advises us to break up a string based on style, this is not really correct. In fact, breaking a string due to colour differences (for the purpose of selection/syntax highlights) at this stage is wrong:



We must only take fonts into account when merging style-runs and item-runs, and ignore colour-information entirely.



Hopefully I have gotten this point across adequately. After following the advice of the Microsoft docs I wasted about a week trying to figure out how to colourise a string only to realise that I was going about it the wrong way. Syntax-colouring (or any kind of text colouring for that matter) must be applied to a string after the shaping has taken place - i.e. after ScriptShape and ScriptPlace have been called, and just prior to calling ScriptTextOut. This doesn't mean that we can't store colours in our ATTR structures - it's just that we don't use this information whilst performing the 'merge'. Any ATTR structures which share the same font must therefore be coalesced into a single run by the merge process before doing any "splits".
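To show the shape of a font-only merge, here is a portable sketch of the idea. It is not the UspLib implementation: the MergeItem, MergeAttr and MergeRun types are simplified, hypothetical stand-ins (a plain integer takes the place of the SCRIPT_ANALYSIS, and the colours are left out altogether because they play no part). It assumes the attribute lengths are non-negative and together cover the whole string. Neighbouring attributes which share a font are coalesced first, and a new run is then started wherever either an item or a font boundary is reached.

```c
#include <assert.h>

typedef struct { int charPos; int script; } MergeItem;   /* script stands in for SCRIPT_ANALYSIS */
typedef struct { int len;     int font;   } MergeAttr;   /* colours deliberately omitted         */
typedef struct { int charPos; int len; int font; int script; } MergeRun;

/* items holds itemCount entries plus the end-of-string sentinel.
   Returns the number of runs written, or -1 if maxRuns is too small. */
int merge_runs(const MergeItem *items, int itemCount,
               const MergeAttr *attrs, int attrCount,
               MergeRun *out, int maxRuns)
{
    int end     = items[itemCount].charPos;
    int pos     = 0;   /* current offset within the string          */
    int i       = 0;   /* index of the item which contains pos      */
    int a       = 0;   /* index of the next unread attribute        */
    int font    = 0;   /* font of the current coalesced font-run    */
    int fontEnd = 0;   /* offset at which that font-run ends        */
    int n       = 0;   /* number of runs written so far             */

    assert(itemCount >= 0 && attrCount >= 0);

    while (pos < end) {
        /* move to the next font-run, coalescing neighbours with the same font */
        while (pos >= fontEnd) {
            if (a == attrCount)
                return n;   /* attributes ran out before the string did */

            font = attrs[a].font;
            while (a < attrCount && attrs[a].font == font)
                fontEnd += attrs[a++].len;
        }

        /* move to the item which contains pos */
        while (items[i + 1].charPos <= pos)
            i++;

        /* the run stops at whichever boundary comes first */
        int itemEnd = items[i + 1].charPos;
        int runEnd  = itemEnd < fontEnd ? itemEnd : fontEnd;

        if (n == maxRuns)
            return -1;

        out[n].charPos = pos;
        out[n].len     = runEnd - pos;
        out[n].font    = font;
        out[n].script  = items[i].script;
        n++;

        pos = runEnd;
    }

    return n;
}
```

As an example, take three items at offsets 0, 5 and 12 (end-of-string at 17) and four attributes of lengths 3, 2, 6 and 6 using fonts 0, 0, 1 and 0. The first two attributes differ only in colour so they are coalesced, and the result is four runs: offsets 0 to 4 in font 0, 5 to 10 in font 1, the single character at offset 11 in font 0 (still part of the second item), and 12 to 16 in font 0.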


OK so once we have broken up the ATTR and SCRIPT_ITEM structures what do we do with them? I have defined a new structure called ITEM_RUN which contains the necessary content from the SCRIPT_ITEM and ATTR structures:


struct ITEM_RUN
{
    SCRIPT_ANALYSIS analysis;   // from the original SCRIPT_ITEM
    int             charPos;    // character-offset within the original string
    int             len;        // length of run in WCHARs
    int             font;       // only font is required, not colours
    ...
};

ITEM_RUN basically allows us to keep "formatting" information alongside the item-runs. Once we have itemized the string, Uniscribe only cares about the SCRIPT_ANALYSIS structures for each run. The other members of the ITEM_RUN structure are for our own private use. The item-run-list is stored inside the USPDATA structure for the string, in the itemRunList field:


struct USPDATA
{
    ITEM_RUN * itemRunList;
    int        itemRunCount;
    ...
};

The algorithm to merge runs is actually quite complicated - in fact it's one of the trickier aspects with Uniscribe programming, not helped by the fact that Microsoft give absolutely no hint as to how this should be performed, other than the 7-year old CSSamp application from the 1998 article "Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0".


To solve this problem I have written a new function called BuildMergedItemRunList that builds an array of ITEM_RUN structures for a given Uniscribe string. It performs two tasks - calling ScriptItemize and then merging the results with the style-runs specified by attrList.


BOOL BuildMergedItemRunList(
    USPDATA * uspData,    // in/out - holds results of merge
    WCHAR   * wstr,
    int       wlen,
    ATTR    * attrList
    );

BuildMergedItemRunList is a private function of UspLib, and is called by UspAnalyze as one of the first steps when building a USPDATA object. Taken in isolation, the function is used something like this:


ATTR attrList[2] = 
{
    { RGB(0xff, 0x00, 0xff), RGB(0,0,0), 5, 0, 0 },   // five characters using font#0
    { RGB(0xAA, 0x22, 0xAA), RGB(0,0,0), 6, 1, 0 }    // six characters using font#1
};

BuildMergedItemRunList(uspData, L"Hello World", 11, attrList);

Understand that the big advantage of using Uniscribe is the contextual-shaping and complex-script support. Dividing a Unicode string into sections by splitting SCRIPT_ITEM structures will break the script-shaping behaviour that we seek. We must try to keep the number of split SCRIPT_ITEMs to a minimum - and splitting based on colour differences at this stage is wrong. Although Neatpad will already have built its ATTR style-lists before displaying text with Uniscribe, the colour information in these lists must only be used after shaping has taken place.
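To make the merge idea concrete, here is a minimal sketch in portable C. It is not the real BuildMergedItemRunList: it does not call ScriptItemize, and it assumes the item boundaries have already been reduced to a plain array of character offsets with a terminating entry. MERGED_RUN, FONT_SPAN and MergeItemsWithFonts are hypothetical names invented for this example.

```c
#include <assert.h>

typedef struct { int len; int font; } FONT_SPAN;    /* hypothetical font-only style run */

typedef struct
{
    int charPos;    /* offset within the original string           */
    int len;        /* length in WCHARs                            */
    int item;       /* index of the script item this run came from */
    int font;       /* font taken from the style run               */
} MERGED_RUN;

/* itemPos[] holds itemCount+1 offsets; the last one is the string length.
   Returns the number of merged runs written, or -1 if 'out' is too small. */
int MergeItemsWithFonts(const int *itemPos, int itemCount,
                        const FONT_SPAN *fonts, int fontCount,
                        MERGED_RUN *out, int maxOut)
{
    int i = 0, f = 0, pos = 0, n = 0;
    int fontEnd = (fontCount > 0) ? fonts[0].len : 0;

    assert(itemCount >= 0 && fontCount >= 0);

    while(i < itemCount && f < fontCount)
    {
        int itemEnd = itemPos[i + 1];
        int end     = (itemEnd < fontEnd) ? itemEnd : fontEnd;

        /* emit the stretch up to whichever boundary comes first */
        if(end > pos)
        {
            if(n == maxOut)
                return -1;

            out[n].charPos = pos;
            out[n].len     = end - pos;
            out[n].item    = i;
            out[n].font    = fonts[f].font;
            n++;
        }

        pos = end;

        if(pos == itemEnd)
            i++;                        /* finished with this script item */

        if(pos == fontEnd && ++f < fontCount)
            fontEnd += fonts[f].len;    /* move on to the next font span  */
    }

    return n;
}
```

In the real library each merged run would also carry a copy of the SCRIPT_ANALYSIS from the item it was cut from; the item index stands in for that here.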


Finally, if you are building a text-editor that only deals with a single font then you can completely skip this phase and save yourself a lot of work (or even use the ScriptString API if you don't want syntax colouring!)


3. ScriptLayout


The next stage with Uniscribe is to take the merged item-runs and establish the correct visual order for display. In our case we use the array of ITEM_RUN structures produced by BuildMergedItemRunList. This is an important step and is the key to the correct display of bidirectional text. Note that reordering is only necessary when a string contains right-to-left scripts, but we still need to go through the motions because we won't know until runtime what kind of scripts and languages we might be processing.


The Uniscribe ScriptLayout function is called to perform the reordering, and uses the Unicode Bidirectional Algorithm to achieve this task.


HRESULT WINAPI ScriptLayout(
    int    cRuns,
    BYTE * pbLevel,             // in
    int  * piVisualToLogical,   // out
    int  * piLogicalToVisual    // out
    );

ScriptLayout takes as input a simple array of BYTEs which represent the bidi run-embedding levels of the string - one BYTE per item-run. This bidi run-embedding value is stored in the SCRIPT_STATE::uBidiLevel variable for each ITEM_RUN. It is up to us to build this BYTE[] array before calling ScriptLayout.



We therefore have to manually extract the uBidiLevel variable from each item-run. uBidiLevel is buried deep within each SCRIPT_ANALYSIS, as a member of the SCRIPT_STATE structure. Once the BYTE[] array is built the ScriptLayout API can be called. It all seems like rather a lot of work just to return a further array of integers, but that's just the way it is. Presumably the Uniscribe developers did it this way because they assumed that you would be creating and merging your own ITEM_RUN (or similar) structures.


VOID BuildVisualMapping(
    ITEM_RUN * itemRunList,
    int        itemRunCount,
    int        visualToLogicalList[]    // out
    )
{
    int    i;
    BYTE * bidiLevel = malloc(itemRunCount * sizeof(BYTE));

    // nothing can be done if the temporary buffer could not be allocated
    if(bidiLevel == NULL)
        return;

    // Manually extract bidi-embedding-levels ready for ScriptLayout
    for(i = 0; i < itemRunCount; i++)
        bidiLevel[i] = itemRunList[i].analysis.s.uBidiLevel;

    // Build a visual-to-logical mapping order
    ScriptLayout(itemRunCount, bidiLevel, visualToLogicalList, NULL);

    // free the temporary BYTE[] buffer
    free(bidiLevel);
}

The function above shows how to obtain the visual-mapping list given an array of ITEM_RUN structures. This list is essential when displaying a string of text, or in fact doing anything which requires visual-order processing such as mouse/caret hit testing.


int xpos = 0, ypos = 0;
int visualIdx;

for(visualIdx = 0; visualIdx < itemRunCount; visualIdx++)
{
    // map the visual position back to an index into the (logical) run-list
    int        logicalIdx = visualToLogicalList[visualIdx];
    ITEM_RUN * itemRun    = &itemRunList[logicalIdx];

    ProcessRun(itemRun, xpos, ypos);
    xpos += itemRun->width;
}

This type of processing-loop is necessary because even though we may be dealing with right-to-left scripts (i.e. Arabic or Hebrew), when it comes to text-display we still draw everything from left-to-right, including the 'backwards' runs. The visual-to-logical list provides a way to map from a visual to logical index and ensures we always process the runs in the appropriate order.
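To see what kind of mapping comes out of the reordering step, the sketch below reorders runs from their embedding levels using rule L2 of the Unicode Bidirectional Algorithm (reverse every contiguous sequence at or above each level, working from the highest level down to the lowest odd level). It is a portable illustration only; LevelsToVisualOrder is a hypothetical name, and I am not claiming that this is how ScriptLayout is implemented internally.

```c
#include <assert.h>

/* Fills visualToLogical[] (count entries) from one embedding level per run. */
void LevelsToVisualOrder(const unsigned char *levels, int count, int *visualToLogical)
{
    int i, level;
    int maxLevel = 0;
    int minOdd   = 255;     /* stays above every even level when no odd level exists */

    assert(count >= 0);

    for(i = 0; i < count; i++)
    {
        visualToLogical[i] = i;

        if(levels[i] > maxLevel)
            maxLevel = levels[i];

        if((levels[i] & 1) && levels[i] < minOdd)
            minOdd = levels[i];
    }

    /* from the highest level down to the lowest odd level... */
    for(level = maxLevel; level >= minOdd; level--)
    {
        i = 0;
        while(i < count)
        {
            if(levels[visualToLogical[i]] >= level)
            {
                int start = i, end;

                /* ...find each contiguous sequence at this level or higher */
                while(i < count && levels[visualToLogical[i]] >= level)
                    i++;

                /* ...and reverse it */
                for(end = i - 1; start < end; start++, end--)
                {
                    int tmp = visualToLogical[start];
                    visualToLogical[start] = visualToLogical[end];
                    visualToLogical[end]   = tmp;
                }
            }
            else
            {
                i++;
            }
        }
    }
}
```

For the levels 0,1,1,0 (two right-to-left runs embedded in left-to-right text) the two middle runs swap places, giving the visual order 0,2,1,3. A string with no odd levels is left in its logical order.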


Coming up in Part 13


Uniscribe is a very complicated business as you can probably tell from reading this tutorial. Unfortunately this is a necessary evil, as all software written today should be fully Unicode compliant. Don't think for a minute that Uniscribe can be ignored - we need to support Unicode, and we must accept the added complications that it brings. The days of ASCII/English text display have gone for good.


So far we have covered the process of breaking up, and reordering a string of Unicode text into a series of item-runs. However we are still only half-way towards implementing the UspAnalyze function. The next tutorial will reveal how to take the item-runs we have produced and generate glyph and width information using the ScriptShape and ScriptPlace APIs.


Part 13 - More Uniscribe Mysteries

Uniscribe Mysteries continued...

We are going to pick up directly from where we left the last tutorial, in which we began to look at the Uniscribe API in detail. Remember that we are still working on the UspAnalyze function, and the sequence of events last time had led us to the point where we had broken a string of Unicode text into several item-runs. Below are the steps we made to get this far:

  1. ScriptItemize - to break the string into distinct scripts or "item-runs".
  2. Merge item runs with application-defined "style" runs to produce finer-grained items.
  3. ScriptLayout - to potentially reorder the items.

The result of this work was an array of ITEM_RUN structures (called itemRunList) and the visual-logical mapping array (called visualToLogicalList) - which tells us in what order to display the runs. Both these arrays are stored inside the USPDATA object:

struct USPDATA
{
    ...

    ITEM_RUN * itemRunList;
    int        itemRunCount;
    int      * visualToLogicalList;
    ...
};

The next task is to take each item-run in turn and get to the point where we can actually render some text (using ScriptTextOut). This will involve calling two more closely related Uniscribe functions (ScriptShape and ScriptPlace) for each run. Below are the steps we will now follow:



  1. ScriptShape - to apply contextual shaping behaviour and convert the characters from each run into a series of glyphs.

  2. ScriptPlace - to calculate the width and positions of each glyph in the run.

  3. Apply colouring/highlighting to the individual glyphs.

  4. ScriptTextOut - to display the glyphs


4. ScriptShape


Of all the Uniscribe functions, ScriptShape is probably the most important. Its purpose is to convert a run of Unicode characters into a series of glyphs ready for display. ScriptShape supersedes the functionality provided by the GetCharacterPlacement API but is quite similar in the type of data it returns.


ScriptShape is a fairly complicated function. It takes as input a single run of text (as identified by the SCRIPT_ITEM / ITEM_RUN structures), and also the SCRIPT_ANALYSIS structure associated with each item-run.


HRESULT WINAPI ScriptShape(
    HDC               hdc,
    SCRIPT_CACHE    * psc,
    const WCHAR     * pwsChars,      // in
    int               cChars,
    int               cMaxGlyphs,
    SCRIPT_ANALYSIS * analysis,      // in

    WORD            * pwOutGlyphs,   // out - array of glyphs
    WORD            * pwLogClust,    // out - glyph cluster positions
    SCRIPT_VISATTR  * psva,          // out - visual attributes
    int             * pcGlyphs       // out - count of glyphs
    );

Calling this function results in a bewildering array of information. Let's look at each parameter in turn to understand what they represent.



The most important parameter here is the pwLogClust[] array, the contents of which can be used to map between logical character positions and glyph-cluster positions. We will be looking at this array in more detail in the next tutorial.


Font Fallback


The majority of fonts do not support the full range of characters as defined by Unicode. In fact I don't know of any font which can display all Unicode scripts and languages. One of the nearest is "Arial Unicode MS" - which is available on the Microsoft Office CDs - but even this font only has around 55,000 characters available. Missing glyphs in a font usually (but not always) results in those little square boxes being displayed.


Applications usually solve this problem by utilizing specific fonts for each Unicode script type. This process is referred to as Font Fallback, and is implemented when the primary display font (say, for a text-editor) does not contain the appropriate glyphs to render all characters in a string. An internal lookup-table is searched for a 'backup font', from which the required glyphs can be substituted in favour of the missing glyphs in the primary font.


Font-fallback is not handled by the low-level Uniscribe API - only the ScriptString API has this facility. All Uniscribe-based applications are therefore required to have a built-in list of fallback fonts. For this reason I have decided not to implement Font-fallback in UspLib. It will be Neatpad's responsibility to handle font-fallback, and substitute fonts can be specified in the ATTR style-runs when analysing each line of text.


5. ScriptPlace


ScriptPlace takes the output of ScriptShape (the glyph-index-list and SCRIPT_VISATTR list) and generates glyph advance-width information. Advance-widths are simply the offset in pixels from one glyph to the next. This information is returned in an array of integers (piAdvance), which can be used to position the output coordinates when displaying text and also for mouse hit-testing.


HRESULT WINAPI ScriptPlace(
    HDC               hdc,
    SCRIPT_CACHE    * psc,
    WORD            * pwGlyphs,    // in - the results from ScriptShape
    int               cGlyphs,     // in - number of glyphs in pwGlyphs
    SCRIPT_VISATTR  * psva,        // in - from ScriptShape
    SCRIPT_ANALYSIS * analysis,    // in - from the ITEM_RUN

    int             * piAdvance,   // out - array of advance widths
    GOFFSET         * pGoffset,    // out - array of GOFFSETs
    ABC             * pABC         // out - pointer to a single ABC structure
    );

Instead of accepting a buffer of WCHAR characters as input (as did ScriptShape), ScriptPlace requires the buffer of glyph-indices that were produced by ScriptShape. The parameters of note are:



Finally, the width of the item-run is represented by the ABC structure pointed to by the pABC parameter. The total width of each run can be calculated using the following expression:


runWidth = abc.abcA + abc.abcB + abc.abcC;

Note that the same value can also be calculated by summing together all of the integers in the piAdvance array.


for(i = 0; i < uspData->itemRunCount; i++)
ShapeAndPlaceItemRun(hdc, &uspData->itemRunList[i]);

ScriptPlace is so dependent on the results of ScriptShape that the two functions are usually called together and isolated in a wrapper function. The ShapeAndPlaceItemRun function is used to this effect, and is called once for each item-run in the string.


Tab Expansion


Handling tabs is really easy with Uniscribe, even though there is no built-in support. The thing to understand is, any character in the original text-string will always be represented by at least one glyph after ScriptShape is called. This is even true for non-displayable control-characters such as carriage-returns, spaces, and of course tab characters.


To illustrate this idea, an example string "Hello" will be used, in which two TAB characters have been embedded:



The table below holds the results after calling ScriptShape and ScriptPlace on this text-string:




Array         [0]   [1]   [2]   [3]   [4]   [5]   [6]
pwGlyphs[]    43    3     72    79    3     79    82
piAdvance[]   165   0     102   64    0     64    115

Notice that the tab-characters have both been represented by a glyph-index of "3". Although this glyph-index is only valid for a specific font, it represents the 'non-displaying' glyph - that is, a glyph with no visual representation. More interesting though are the resulting widths of these 'invisible' glyphs, which are initially zero.


The normal course of action once we have got to this stage is to call ScriptTextOut, with the generated widths+glyphs shown above. This would result in the following:



The dotted-outline is purely used here to bring across the concept of each glyph being an individual entity. Also notice the two vertical bars which are supposed to represent the (currently) zero-width tab characters.



The process of tab-expansion is straightforward. All we need to do is to modify the individual width-entries for tabs inside the width-list. Once this is done all drawing and mouse hit-testing will use the modified glyph-widths, resulting in extra space being allocated where the tab characters would be.


Tab-expansion must obviously occur after ScriptShape and ScriptPlace have been called. After all item-runs have been processed in this way, UspAnalyze calls another internal function - ExpandTabs:


BOOL ExpandTabs(USPDATA *uspData, WCHAR *wstr, int wlen, SCRIPT_TABDEF *tabdef);

SCRIPT_TABDEF is a standard Uniscribe structure used by ScriptStringAnalyse. It contains information about the tab-stops in a string (size and locations). I have used this same structure for UspLib purely to be consistent.
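The width-patching idea can be shown with a small portable sketch. It is not the real ExpandTabs: it assumes a simple left-to-right run with exactly one glyph per character (as in the table above), and it uses a single fixed tab interval instead of a SCRIPT_TABDEF. ExpandTabWidths and the 400-pixel tab interval used in the usage note are made up for the example.

```c
#include <assert.h>
#include <wchar.h>

/* Overwrites the width of every tab so that the following glyph starts on
   the next multiple of tabWidth. Returns the total width of the string.
   Assumes one glyph per character, stored in logical (left-to-right) order. */
int ExpandTabWidths(const wchar_t *text, int len, int *widths, int tabWidth)
{
    int i, x = 0;

    assert(tabWidth > 0);

    for(i = 0; i < len; i++)
    {
        if(text[i] == L'\t')
            widths[i] = tabWidth - (x % tabWidth);   /* distance to the next tab-stop */

        x += widths[i];
    }

    return x;
}
```

Feeding in the advance-widths from the table above with a tab interval of 400, the first tab (at x=165) grows to 235 pixels and the second (at x=566) grows to 234 pixels, so the text after each tab starts at x=400 and x=800 respectively.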


Applying Attributes


UspLib supports variable length attribute-runs when styling a string of Unicode text, using an array of ATTR structures. Although Neatpad does not take advantage of this facility (it just sets each ATTR to "1" unit long), the possibility still exists for variable-length runs to be specified.



Whilst this is not a problem in itself, processing variable length style-runs at the same time as displaying runs of glyphs can get very complicated. To simplify this matter UspLib always flattens any user-supplied attribute-run, and keeps an internal copy inside the USPDATA object. The flattened run-list is allocated to the same length as the original Unicode string, and contains exactly one ATTR structure per original Unicode character.


UspApplyAttributes(USPDATA *uspData, ATTR *attrRunList) 

The UspApplyAttributes function (above) is used to update the style-run information belonging to a USPDATA object, and is called by UspAnalyze as part of the string-analysis process. However this function can be called at any time after a string has been analyzed. Note that only the colour-information is updated on subsequent calls to UspApplyAttributes - as reapplying font information would require the entire string to be re-analyzed.
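The flattening step described above might look something like the following sketch. STYLE_RUN is a hypothetical stand-in for ATTR (the real structure has more fields), and FlattenStyleRuns is an illustrative name rather than a UspLib function.

```c
#include <assert.h>

/* Hypothetical stand-in for the ATTR structure */
typedef struct
{
    unsigned long fg;     /* foreground colour      */
    unsigned long bg;     /* background colour      */
    int           len;    /* run length in WCHARs   */
    int           font;   /* index into font table  */
} STYLE_RUN;

/* Expands variable-length runs into exactly one entry per character.
   'flat' must have room for stringLen entries.
   Returns the number of entries written. */
int FlattenStyleRuns(const STYLE_RUN *runs, int runCount, STYLE_RUN *flat, int stringLen)
{
    int r, i, pos = 0;

    assert(stringLen >= 0);

    for(r = 0; r < runCount && pos < stringLen; r++)
    {
        for(i = 0; i < runs[r].len && pos < stringLen; i++)
        {
            flat[pos]     = runs[r];    /* copy colours + font      */
            flat[pos].len = 1;          /* every entry is 1 unit long */
            pos++;
        }
    }

    return pos;
}
```

With a flattened list, the style of the character at offset n is simply flat[n], which is what makes it practical to look up colours while walking glyph-clusters. If the runs cover fewer characters than the string, the return value shows how many entries were actually filled in.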


UspAnalyze


We have now covered enough ground to complete the implementation of UspAnalyze. All of the related code for this analysis phase is located in the UspLib.c file. The functional break-down of the analysis is shown below.



The result of all this work is a single USPDATA object, which contains all of the information necessary to display a string of Unicode text.


typedef struct _USPDATA
{
    //
    // Item-run information
    //

    int              itemRunCount;
    ITEM_RUN       * itemRunList;
    int            * visualToLogicalList;

    //
    // Logical character/cluster information (1 unit per original WCHAR)
    //

    int              stringLen;      // length of current string (in WCHARs)
    WORD           * clusterList;    // logical cluster info
    ATTR           * attrList;       // flattened attribute-list

    //
    // Glyph information for the entire paragraph
    // Each ITEM_RUN references a position within these lists:
    //

    int              glyphCount;     // count of glyphs currently stored
    WORD           * glyphList;
    int            * widthList;
    GOFFSET        * offsetList;
    SCRIPT_VISATTR * svaList;

    //
    // external, user-maintained font-table
    //

    USPFONT        * uspFontList;

} USPDATA, *PUSPDATA;

The listing above details the USPDATA structure. For the purposes of clarity I have omitted several 'house-keeping' fields which are not required for this discussion.


One of the major difficulties when dealing with Uniscribe is knowing what to do with the huge amount of information that is generated. The strategy that I have taken with UspLib is to keep all information inside the USPDATA object. The "per-run" glyph information is concatenated into several large buffers (glyphList, widthList etc). Each ITEM_RUN refers to a certain range of data within each of these large buffers, using the ITEM_RUN::glyphPos and ITEM_RUN::glyphCount fields.
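As a small illustration of how a run addresses its slice of the shared buffers, the sketch below sums the advance-widths belonging to one run. RUN_SLICE is a hypothetical cut-down version of ITEM_RUN holding only the two fields mentioned above, and RunWidth is an illustrative helper rather than a UspLib function.

```c
#include <assert.h>

/* Hypothetical cut-down ITEM_RUN: just the glyph-buffer range */
typedef struct
{
    int glyphPos;     /* index of the run's first glyph in the shared lists */
    int glyphCount;   /* number of glyphs belonging to the run              */
} RUN_SLICE;

/* Returns the run's width by summing its entries in the shared width-list */
int RunWidth(const int *widthList, const RUN_SLICE *run)
{
    int i, width = 0;

    assert(run->glyphPos >= 0 && run->glyphCount >= 0);

    for(i = 0; i < run->glyphCount; i++)
        width += widthList[run->glyphPos + i];

    return width;
}
```

The same indexing applies to the other per-glyph buffers (glyphList, offsetList, svaList), since they all hold one entry per glyph.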


There are basically two approaches with Uniscribe, which can be characterized as a trade-off between speed and memory consumption. The first strategy is to gather together all the information generated by the Uniscribe APIs into one object. This has the advantage of being quick in operation, because the 'analysis' phase (itemization, shaping etc) happens one time only. After this the glyph data is stored away and then reused each time the text is displayed.


The other approach is to conserve memory, by only allocating buffers when necessary, and repeatedly calling ScriptShape/ScriptPlace each time glyph information is required. The advantage is the lower memory use, but the disadvantage is a loss of performance. Re-shaping item-runs each time they are displayed will be quite a lot slower - and considering that a text-editor will need to redraw its display every time the mouse-selection changes, this strategy is something that I want to avoid.


For UspLib I have opted for the speed (resource-heavy) approach.


Coming up in Part 14


We still haven't drawn any text but it won't be long before we do. The next tutorial will focus on the UspTextOut function, and will demonstrate how to display styled Unicode text by taking the output from ScriptShape and ScriptPlace, and applying the attribute-runs stored in the USPDATA object.


Part 14 - Drawing styled text with Uniscribe

Introduction

The last tutorial saw the completion of the UspAnalyze function, one of the main APIs of the new UspLib text-rendering engine. We will now switch our attention to the implementation of the UspTextOut function. Our goal is to divide up the glyph-lists we created in the last tutorial, and apply colour information prior to display with ScriptTextOut. The method we will use to identify which colour belongs to each glyph is the central theme of this tutorial.

Now, there is a lot of very specific information in this tutorial related to Uniscribe and you are only going to find it interesting if you have also been trying to understand how to draw styled text. So feel free to skip to the next tutorial if you want to see UspLib in action.

The image above shows another small utility I wrote whilst working with Uniscribe. The purpose of this app is to demonstrate (and test) the UspLib library. You can download the demo, and also the UspLib sourcecode, at the top of this article.

6. Drawing styled text

At this point we could quite simply call ScriptTextOut with a whole run of glyphs and be done with it. It would display correctly and we would have succeeded in our goal to display Unicode text. However the text would only be drawn in a single font and colour, and it would have been much simpler to use the ScriptString API instead! Remember, the entire reason we are looking at Uniscribe is because we need to apply font and colour information in a very fine-grained manner.

Back in part#10 of this series I proposed a new method for rendering text in Neatpad, using three separate passes. I have implemented this rendering scheme with the UspTextOut function:

void UspTextOut(
    USPDATA * uspData,
    HDC       hdc,
    int       xpos,
    int       ypos,
    RECT    * bounds
    )
{
    //
    // 1. Draw all background colours, including selection-highlights;
    //    selected areas are added to the HDC clipping region which prevents
    //    step#2 (below) from drawing over them
    //
    PaintBackground(uspData, hdc, xpos, ypos, bounds);

    //
    // 2. Draw the text normally. Selected areas are left untouched
    //    because of the clipping-region created in step#1
    //
    SetBkMode(hdc, TRANSPARENT);
    PaintForeground(uspData, hdc, xpos, ypos, bounds, FALSE);

    //
    // 3. Redraw the text using a single text-selection-colour (i.e. white)
    //    in the same position, directly over the top of the text drawn in step#2
    //    Before we do this, the HDC clipping-region is inverted,
    //    so only selection areas are modified this time
    //
    PaintForeground(uspData, hdc, xpos, ypos, bounds, TRUE);
}

UspTextOut is quite similar to ScriptStringOut, in that it requires a string to be analyzed prior to display. It takes as input the USPDATA object that contains the information generated by UspAnalyze. Although there are three passes involved, there are only two functions that need to be implemented - PaintBackground, and PaintForeground, which is used to draw both regular (styled) and 'selected' text. We will take a look at the implementation of these functions a little further down.


Characters vs Glyphs vs Clusters


The major problem with Uniscribe is understanding how to decipher the results of the ScriptShape and ScriptPlace calls. There is just so much information returned about each run of text that it takes a fair amount of time and effort to understand it all. Hopefully by the end of this tutorial you will have a little more insight into how all of the Uniscribe functions hang together.


The key detail to understand about Uniscribe (and computer Typography in general) is the difference between characters and glyphs. Up until this point the main focus with Neatpad has been logical Unicode character sequences. However once Uniscribe has been involved the focus is very much on glyphs. The thing to understand here, is that there is no direct relationship between characters and glyphs.


For simple scripts such as English a font usually contains one glyph per Unicode character. However for more complex scripts this relationship can change. Sometimes a single Unicode character can result in more than one glyph. The opposite is also true - multiple Unicode characters can result in just a single glyph. This behaviour varies depending on which font is being used. This separation between characters and glyphs presents a problem because our attribute style-runs are all character based, and we somehow need to translate this styling information onto specific glyphs.


To make things even more complicated, the concept of glyph clusters must be understood. A cluster is basically a grouping of glyphs which must be treated as a single selectable unit. Whilst this is not a problem in itself, it does make rendering glyph sequences a little more complicated because cluster boundaries must be respected.


Understanding the Logical Cluster List


The logical-cluster list is key to establishing a relationship between characters and glyphs. This list is returned by ScriptShape in the pwLogClust[] array. It provides the mapping between logical character-positions and glyph-cluster positions. UspLib stores each run's logical-cluster information inside the clusterList[] field of the USPDATA object.


To support this idea of character-to-glyph mapping, the logical-cluster list must represent two important concepts:



In other words, the individual element values (the content) of the clusterList defines the glyph-clusters, whilst the positions of the array elements represents the clusters in logical character terms.


As an example we will use the same Arabic string "يُساوِي" we were looking at previously. This string of seven Unicode characters results in the following logical cluster information being generated by ScriptShape. Note that the logical array-index positions are listed across the top of the table.




Array                [0]      [1]      [2]      [3]      [4]      [5]      [6]
WCHAR wszText[]      U+064A   U+064F   U+0633   U+0627   U+0648   U+0650   U+064A
WORD clusterList[]   6        6        4        3        2        2        0


Whole clusters are identified by grouping together any identical numbers in the logical-cluster list. As you can see from the cluster-list in the table above, there are two 6's and two 2's (in addition to the other singular numbers), resulting in a total of five whole clusters altogether. The image below illustrates this grouping concept.



Notice that the cluster-list is always stored in logical order, whilst the glyph-list is always in visual order. This means that for right-to-left scripts (such as the Arabic string above), the cluster-list elements will decrease when reading the array. As a result of this, the first glyph that must be drawn will be at the very end of the glyph-list. Bearing this in mind, the breakdown of the clusters is as follows:



Hopefully it should be fairly obvious how the logical clusters were identified - with the number of WCHARs in each cluster calculated by the number of characters in each grouping. Calculating the number of glyphs in each cluster is less obvious. The key here is looking at the difference between the cluster values. This is how the identification of each cluster occurred:



  1. The first two 6's identify cluster#1, comprising two WCHARs (in character-positions 0 and 1). This value of 6 points to the end of the glyphList which contains the glyphs for this cluster. We know that this cluster is represented by two glyphs (#5 and #6) because of the value that follows it in the cluster-list.

  2. The next value in the cluster-list (4) tells us two things. Obviously this cluster starts at glyph #4 in the glyph-list. However this also means that there were 2 (two) glyphs in the previous cluster (6-4 = 2).

  3. The third cluster is comprised of the single glyph#3, and a single WCHAR.

  4. The fourth cluster is comprised of two WCHARs again, which are represented by glyphs #1 and #2.

  5. The fifth and final cluster is again a single WCHAR, represented by glyph#0 in the glyph-list.


As you can see the key detail here is looking at the difference between glyph-indices in order to count the number of glyphs in each cluster. Special consideration must also be taken with right-to-left scripts because of the way that glyphs are stored in reverse order. The way I handled this was to advance the x-coordinate to the end of the run, call SetTextAlign(TA_RIGHT), and then move the output-location to the left each time, resulting in the glyphs being output in logical (right-to-left) order.


The important thing to understand is that we always follow the cluster-list in logical order, even for right-to-left scripts. We rely on the ordering of the element values to locate each glyph-cluster as it should be drawn.


Another example


The previous example was of course a right-to-left script and highlighted the unique way in which these scripts are represented by Uniscribe. The example shown next is based on the example in MSDN under the ScriptShape documentation, and highlights how complex left-to-right text is represented by Uniscribe.



U+920, U+911, U+915, U+94D, U+937, U+91D, U+949


The string this time is from the Devanagari script. I have no idea what it means because I just strung a sequence of code-points together which happened to have the right "look". If anyone can supply me with a Unicode phrase of 7 characters, which results in the glyph+cluster properties shown below, then please get in touch!




Array            [0]      [1]      [2]      [3]      [4]      [5]      [6]
Unicode string   U+0920   U+0911   U+0915   U+094D   U+0937   U+091D   U+0949
clusterList[]    0        1        4        4        4        5        5


The key difference here is how the cluster-list elements increase when reading the array. For left-to-right runs, the glyphs are stored in the same order as the original Unicode characters. This is the ordering that many Western readers will find most natural.



The diagram hopefully again illustrates the relationship between logical characters and glyph-clusters, this time for a left-to-right run of text. This example is purely fictitious as I couldn't find any phrase, font & script which satisfied the required glyph+cluster properties. Again, the number of glyphs per cluster is calculated by the difference between the cluster-list elements.



The number of glyphs for the last cluster was calculated because we knew how many glyphs (8 in total) were generated for this run by ScriptShape.
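Both worked examples can be reproduced with a short piece of portable C. This is a sketch of the counting rule described above rather than the UspLib code: CLUSTER and FindClusters are hypothetical names, and the direction of the run is passed in as a plain flag (in real code it would come from the run's SCRIPT_ANALYSIS).

```c
#include <assert.h>

typedef struct
{
    int charPos;      /* first WCHAR of the cluster (logical)      */
    int charCount;    /* number of WCHARs in the cluster           */
    int glyphPos;     /* lowest glyph-index used by the cluster    */
    int glyphCount;   /* number of glyphs in the cluster           */
} CLUSTER;

/* Walks the cluster-list in logical order and records every cluster.
   'out' must have room for charCount entries. Returns the cluster count. */
int FindClusters(const unsigned short *clusterList, int charCount,
                 int glyphCount, int rtl, CLUSTER *out)
{
    int i, lasti = 0, n = 0;

    assert(charCount >= 0 && glyphCount >= 0);

    for(i = 1; i <= charCount; i++)
    {
        /* a change of value (or the end of the run) closes the cluster */
        if(i == charCount || clusterList[i] != clusterList[lasti])
        {
            int first = clusterList[lasti];

            /* value that bounds the cluster on its far side: the next
               cluster's value, or a sentinel just past the glyph-list */
            int next = (i < charCount) ? clusterList[i]
                                       : (rtl ? -1 : glyphCount);

            out[n].charPos    = lasti;
            out[n].charCount  = i - lasti;
            out[n].glyphCount = rtl ? (first - next) : (next - first);
            out[n].glyphPos   = rtl ? (next + 1)     : first;
            n++;

            lasti = i;
        }
    }

    return n;
}
```

For the Arabic cluster-list (6,6,4,3,2,2,0 with 7 glyphs, right-to-left) this finds five clusters with 2,1,1,2,1 glyphs. For the Devanagari cluster-list (0,1,4,4,4,5,5 with 8 glyphs, left-to-right) it finds four clusters with 1,3,1,3 glyphs, the last count coming from the total number of glyphs.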


Interpolation is the key


The one thing to understand about Uniscribe is the separation between characters and glyphs. So looking now at another example, what happens when we have three characters comprising two glyphs? The problem we face is, how do we distribute the colour information for each character across the glyphs, and which glyph takes which colour?



U+0635 U+0651 U+0650


For some scripts where the glyphs typically order horizontally you can almost infer the colour relationships. However when glyphs stack vertically on top of each other within a cluster, or when there is an unequal number of characters-to-glyphs, there is no easy way to associate a colour with a particular glyph.


With UspLib I solved this problem in two ways - using one method for drawing the background, and another when drawing the actual glyphs themselves (the foreground). Drawing the foreground was easy: I decided simply to paint all glyphs in a cluster as a single colour. Should there be multiple colour-attributes for the cluster, only the first is chosen and the rest are ignored. This is by far the easiest method and in reality you wouldn't expect individual glyphs to have their own colours when in a cluster.


Painting the background is rather different, because the inversion-highlighting scheme must be taken into account. The strategy I have used here is to interpolate the colours across the width of each cluster, when drawing the background. This method is hinted at by Microsoft in the following quote from MSDN, in the section "notes on ScriptXtoCP and ScriptCPtoX":



"Cluster information in the logical cluster array is used to share the width of a cluster of glyphs equally among the logical characters they represent."



It took me a while to make the logical leap that this strategy could also be used for text-rendering, however after implementing it I realised it is the exact same method used by the ScriptString API.



The process is very simple. We know how many characters make up each cluster, and also how many glyphs make up each cluster. We therefore sum the width of these glyphs to calculate the total width of the cluster. We then divide the cluster-width by the number of characters in the cluster - this number tells us how wide each colour band should be.


advanceWidth = clusterWidth / charCount;

For some scripts, dividing the clusters this way makes a lot of sense, especially for Arabic because the caret is conventionally positioned at character boundaries rather than glyph-cluster boundaries. However for most scripts this would be viewed as incorrect. Rather than having a special-case just for Arabic, I have instead written UspTextOut so that it always interpolates over glyph-clusters, should any colour-attributes happen to be this fine-grained. We will rely on the fact that ScriptCPtoX will only allow the caret (and therefore selection-highlights) to be placed in the middle of clusters when appropriate.


Lastly, using integer math for the cluster-division will result in potential rounding errors. Whilst this is not a massive problem, we need to have the exact same results that ScriptCPtoX produces when it does its own presumed division-operations (otherwise we could be out by a pixel occasionally). Presumably ScriptCPtoX uses MulDiv in its calculations because this appears to give the correct results, and is what I have used for UspLib.
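As a rough illustration of the division described above, the sketch below splits a cluster's width into per-character colour bands. It is portable C rather than Win32 code: mul_div_round is a hypothetical stand-in that mimics the round-to-nearest behaviour of MulDiv for non-negative inputs, and band_edge is an illustrative name that does not appear in UspLib. Each band edge is computed from the full cluster width rather than by repeatedly adding a truncated advanceWidth, which is one way (an assumption on my part, not necessarily what UspLib does) of making the final band end exactly on the cluster edge.

```c
#include <assert.h>

/* Stand-in for Win32 MulDiv: (a * b) / c rounded to the nearest integer.
   Assumes a, b >= 0 and c > 0; the real MulDiv also handles negatives. */
int mul_div_round(int a, int b, int c)
{
    assert(c > 0);
    return (int)(((long long)a * b + c / 2) / c);
}

/* x-offset (relative to the cluster's left edge) at which colour band
   number `band` starts, for a cluster `clusterWidth` pixels wide that
   covers `charCount` characters. band == charCount gives the right edge. */
int band_edge(int clusterWidth, int band, int charCount)
{
    return mul_div_round(clusterWidth, band, charCount);
}
```

For a 10-pixel cluster that covers three characters the edges fall at 0, 3, 7 and 10, so the bands are 3, 4 and 3 pixels wide and together cover the cluster exactly.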


Drawing the background


As mentioned above, drawing the background is a little different because of the use of interpolation. We will start by looking at the PaintBackground routine:


void PaintBackground(USPDATA * uspData, HDC hdc, int xpos, int ypos, RECT * bounds)
{
    int        i;
    ITEM_RUN * itemRun;

    // Process the item-runs in visual-order
    for(i = 0; i < uspData->itemRunCount; i++)
    {
        itemRun = GetItemRun(uspData, i);

        // paint the background of the specified item-run
        PaintItemRunBackground(uspData, itemRun, hdc, xpos, bounds);
        xpos += itemRun->width;
    }
}

As you can see this function is very simple. It merely processes the item-runs in visual-order and advances the x-coordinate by the item-width for each run. Each item-run background is rendered individually by the PaintItemRunBackground function.


void PaintItemRunBackground(USPDATA *uspData, ITEM_RUN *itemRun, HDC hdc, int xpos, int ypos)
{
    int i, lasti;

    // locate the item-run buffers
    WORD * clusterList = uspData->clusterList + itemRun->charPos;
    ATTR * attrList    = uspData->attrList    + itemRun->charPos;
    int  * widthList   = uspData->widthList   + itemRun->glyphPos;

    // note the <= so that the final cluster (ending at itemRun->len) is processed
    for(lasti = 0, i = 0; i <= itemRun->len; i++)
    {
        // search for a logical cluster boundary (or end of run)
        if(i == itemRun->len || clusterList[lasti] != clusterList[i])
        {
            << process cluster >>
        }
    }
}

The primary task is to identify the logical-cluster positions. The two loop-indices (lasti and i) represent these cluster positions in the original text-string. The number of WCHARs in each cluster is therefore (i-lasti). Because we always iterate in logical order, this is true for both LTR and RTL texts.
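To make the boundary search concrete, here is a small stand-alone sketch of the same walk over a logical cluster array. It is plain C with no Uniscribe dependency: find_cluster_starts is a hypothetical helper, and unsigned short stands in for WORD. It only records where each cluster begins, whereas the real routine goes on to paint each one.

```c
#include <assert.h>

/* Walk a logical cluster array (one entry per character) and record the
   character index at which each cluster starts. Returns the number of
   clusters found. `starts` must have room for `len` entries. */
int find_cluster_starts(const unsigned short *clusterList, int len, int *starts)
{
    int i, lasti, count = 0;

    assert(len >= 0);

    for (lasti = 0, i = 0; i <= len; i++)
    {
        /* a boundary is the end of the run, or a change in cluster value */
        if (i == len || clusterList[lasti] != clusterList[i])
        {
            if (i > lasti)
            {
                /* characters [lasti, i) form one cluster */
                starts[count++] = lasti;
            }
            lasti = i;
        }
    }
    return count;
}
```

With the LTR array { 0, 0, 1, 2, 2, 2 } this reports three clusters starting at characters 0, 2 and 3. An RTL array such as { 3, 3, 2, 1, 1, 0 } works the same way because only changes in value matter, not whether the values rise or fall.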


<< process cluster >>

int glyphIdx1, glyphIdx2;

// locate glyph-positions for the cluster
GetGlyphClusterIndices(itemRun, clusterList, i, lasti, &glyphIdx1, &glyphIdx2);

// measure width of this group of glyphs
for(runWidth = 0; glyphIdx1 <= glyphIdx2; )
    runWidth += widthList[glyphIdx1++];

// divide the cluster-width by the number of code-points that cover it
advanceWidth = MulDiv(runWidth, 1, i-lasti);

Once a cluster has been identified, GetGlyphClusterIndices is called. This function inspects the clusterList and returns the corresponding glyph-index positions for i and lasti.


The width of the glyph-cluster is computed next, by simply iterating between glyphIdx1 and glyphIdx2, before dividing the cluster-width by the number of characters (WCHARs). We now know how far to advance each time we paint a bit of background.


for(a = lasti; a <= i; a++)
{
    // look for change in attribute background
    if(a == itemRun->len ||
       attr.bg  != attrList[a].bg ||
       attr.sel != attrList[a].sel )
    {
        PaintRectBG(uspData, itemRun, hdc, xpos, &rect, &attr);
        rect.left = rect.right;
    }
}

The final task is to interpolate the colour-attributes over the cluster. We only ever paint the background if we detect a change in colour, so most of the time an item-run background is painted with just one operation. Missing from the code listing above is the small detail of correcting for rounding errors in the (integer) division - however this is not necessary for understanding the code.


I won't bother including the code for the PaintRectBG function - suffice to say it is not really very interesting, other than the fact that it calls ExcludeClipRect after drawing any selection-highlight background area.


void GetGlyphClusterIndices( USPDATA  * uspData,
                             ITEM_RUN * itemRun,
                             int        clusterIdx1,
                             int        clusterIdx2,
                             int      * glyphIdx1,
                             int      * glyphIdx2
                           )
{
    WORD *clusterList = uspData->clusterList + itemRun->charPos;

    // locate glyph-positions for the cluster
    if(itemRun->analysis.fRTL)
    {
        // RTL scripts
        *glyphIdx1 = clusterIdx1 < itemRun->len ? clusterList[clusterIdx1] + 1 : 0;
        *glyphIdx2 = clusterList[clusterIdx2];
    }
    else
    {
        // LTR scripts
        *glyphIdx1 = clusterList[clusterIdx2];
        *glyphIdx2 = clusterIdx1 < itemRun->len ? clusterList[clusterIdx1] - 1 : itemRun->glyphCount - 1;
    }
}

Above is the GetGlyphClusterIndices function. Note the two distinct cases for LTR and RTL scripts - this is required because the cluster-elements decrease when reading the cluster array (for RTL scripts), but increase for LTR scripts.
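The two cases are easier to follow with concrete arrays. The sketch below repeats the same index arithmetic in stand-alone C so that it can be run without Uniscribe. glyph_range and its parameter names are hypothetical, and the cluster arrays in the note that follows are made-up examples rather than real ScriptShape output.

```c
#include <assert.h>

/* Find the glyphs [*g1, *g2] (inclusive) that belong to the characters
   [charStart, charEnd) of one run. clusterList has `len` entries, one
   per character; glyphCount is the number of glyphs in the run. */
void glyph_range(const unsigned short *clusterList, int len, int glyphCount,
                 int rtl, int charStart, int charEnd, int *g1, int *g2)
{
    assert(0 <= charStart && charStart < charEnd && charEnd <= len);

    if (rtl)
    {
        /* cluster values decrease as we read the array */
        *g1 = charEnd < len ? clusterList[charEnd] + 1 : 0;
        *g2 = clusterList[charStart];
    }
    else
    {
        /* cluster values increase as we read the array */
        *g1 = clusterList[charStart];
        *g2 = charEnd < len ? clusterList[charEnd] - 1 : glyphCount - 1;
    }
}
```

For an LTR run with the cluster array { 0, 0, 1, 3 } and four glyphs, characters [0,2) map to glyph 0, character 2 maps to glyphs 1 to 2, and character 3 maps to glyph 3. For an RTL run with { 2, 1, 0 } and three glyphs, character 0 maps to glyph 2 and character 2 maps to glyph 0.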


Drawing the Foreground


The process for drawing the text is so similar to that of the background that I won't bother including too much code this time. We'll jump straight in to the start of the DrawForegroundItemRun function:


// right-left runs can be drawn backwards for simplicity
if(itemRun->analysis.fRTL)
{
    oldMode = SetTextAlign(hdc, TA_RIGHT);
    xpos += itemRun->width;
    runDir = -1;
}

The first thing we do is set the text-alignment to TA_RIGHT for any right-to-left string, and advance the x-coordinate to the end of the run. This will allow us to draw the text in logical order (as we walk the logical-cluster-list). This is important because apart from this one detail, it means we can maintain a single function for drawing both LTR and RTL texts.


// loop over all the logical character-positions
for(lasti = 0, i = 0; i <= itemRun->len; i++)
{
    // find a change in attribute
    if(i == itemRun->len || attrList[i].fg != attrList[lasti].fg )
    {
        // scan forward to locate end of cluster (we must always
        // handle whole-clusters because the attr[] might fall in the middle)

        for( ; i < itemRun->len; i++)
            if(clusterList[i - 1] != clusterList[i])
                break;

        // locate glyph-positions for the cluster [i,lasti]

        GetGlyphClusterIndices(itemRun, clusterList, i, lasti, &glyphIdx1, &glyphIdx2);

        << display text >>
    }
}

The next difference between foreground and background rendering is how we identify cluster-boundaries. This time we look for changes in colour first of all. Once a new colour is found we scan forward to locate the end of the cluster. This means we can paint the whole cluster in one colour and not worry about interpolation.
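The colour-first scan can be tried out in isolation with the sketch below. It is stand-alone C under a few assumptions: colours are plain unsigned long values standing in for COLORREF, colour_span_ends is a hypothetical helper, and instead of drawing anything it simply records where each single-colour span ends.

```c
#include <assert.h>

/* Split a run of `len` characters into spans of a single foreground
   colour, never breaking inside a cluster. fg[i] is the colour of
   character i and clusterList[i] its logical-cluster value. The end
   offset of each span is written to `ends` (room for `len` entries);
   the number of spans is returned. */
int colour_span_ends(const unsigned long *fg, const unsigned short *clusterList,
                     int len, int *ends)
{
    int i, lasti, count = 0;

    assert(len >= 0);

    for (lasti = 0, i = 0; i <= len; i++)
    {
        if (i == len || fg[i] != fg[lasti])
        {
            /* scan forward so the span ends on a cluster boundary */
            while (i < len && clusterList[i - 1] == clusterList[i])
                i++;

            if (i > lasti)
                ends[count++] = i;   /* span is [lasti, i), drawn in fg[lasti] */

            lasti = i;
        }
    }
    return count;
}
```

With colours { 1, 1, 2, 2, 2 } and clusters { 0, 1, 1, 2, 3 } the colour changes at character 2, which sits in the middle of a cluster. The first span is therefore extended to end at character 3 and the whole cluster is drawn in the first colour, giving two spans that end at 3 and 5.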


<< display text >>

// measure the width (in pixels) of the run
for(runWidth = 0, g = glyphIdx1; g <= glyphIdx2; g++)
    runWidth += widthList[g];

// only need the text colour as we are drawing transparently
SetTextColor(hdc, forcesel ? uspData->selFG : attrList[lasti].fg);

//
// Finally output the run of glyphs
//

hr = ScriptTextOut(
        hdc,
        &uspFont->scriptCache,
        xpos,
        ypos,
        0,
        NULL,
        &itemRun->analysis,
        NULL,
        0,
        glyphList + glyphIdx1,
        glyphIdx2 - glyphIdx1 + 1,
        widthList + glyphIdx1,
        NULL,
        offsetList + glyphIdx1
     );

// +ve/-ve depending on run direction
xpos += runWidth * runDir;
lasti = i;

Once the text-colour has been set, ScriptTextOut is called with the range of glyphs which fall within the cluster. Once again, we only output any text should there be a change in colour so usually there would only be one call to ScriptTextOut.


7. ScriptTextOut


For the sake of completeness here's the prototype for ScriptTextOut:


HRESULT WINAPI ScriptTextOut(

    HDC               hdc,
    SCRIPT_CACHE    * psc,

    int               x,
    int               y,
    UINT              fuOptions,    // ExtTextOut options
    RECT            * rect,
    SCRIPT_ANALYSIS * analysis,
    WCHAR           * pwcReserved,
    int               iReserved,

    WORD            * pwGlyphs,     // in - results of ScriptShape
    int               cGlyphs,
    int             * piAdvance,    // in - results of ScriptPlace
    int             * piJustify,
    GOFFSET         * pGoffset      // in - results of ScriptPlace
);

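That's a pretty intimidating function by anyone's standards! The parameters of note are the glyph buffer (pwGlyphs and cGlyphs, the results of ScriptShape) and the advance-width and offset buffers (piAdvance and pGoffset, the results of ScriptPlace).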



ScriptTextOut is basically a wrapper around ExtTextOut - however you will notice that there is no WCHAR* parameter to this function. This is because ScriptTextOut calls ExtTextOut with the ETO_GLYPH_INDEX option, and passes the buffer of glyphs we specified.


ScriptTextOut may perform additional processing (such as glyph-reordering) before calling into GDI, so don't be tempted to bypass ScriptTextOut by calling ExtTextOut directly.


Uniscribe Limitations


One of the drawbacks of Uniscribe is the very thing it does best - the breaking up of a string into individually shapable items. The problem is that some strings containing a lot of whitespace or punctuation result in a large number of item-runs. Whilst this is not bad in itself, it does present a problem when it comes to rendering the line of text. The sheer number of calls to ScriptTextOut has a performance penalty in comparison to calling ExtTextOut once with the same line of text.


For complex-scripts there is no alternative but to break up the string using ScriptItemize. However it would be nice if for non-complex (i.e. English) scripts we could somehow re-combine the item-runs and reduce the potential number of calls to ScriptTextOut. I haven't ventured too far down this path yet, but it is certainly possible to identify if an item-run is complex or not by inspecting the SCRIPT_ANALYSIS::eScript field.


struct SCRIPT_ANALYSIS
{
    WORD eScript : 10;
    WORD fRTL    : 1;
    ...
};

Now, the eScript field is 'opaque' which means we shouldn't make any assumptions about its value. However it can be used as an index into the "global script table", which contains information about the specific script-shaping engines installed in a system.


HRESULT WINAPI ScriptGetProperties(const SCRIPT_PROPERTIES ***ppSp, int *piNumScripts);

The ScriptGetProperties function returns a pointer to this global-script-table, and each entry in the table is a pointer to a SCRIPT_PROPERTIES structure:


struct SCRIPT_PROPERTIES
{
    DWORD langid   : 16;
    DWORD fNumeric : 1;
    DWORD fComplex : 1;
    ...
};

There are many information-fields in this structure, however the interesting one for us is the fComplex flag. Drawing all this together results in the following function, which returns a boolean indicating if an item-run is complex or not:


BOOL IsRunComplex(ITEM_RUN *itemRun)
{
    const SCRIPT_PROPERTIES ** propList;
    int propCount;
    int scriptIndex;

    // get pointer to the global script table
    ScriptGetProperties(&propList, &propCount);

    // the SCRIPT_ANALYSIS::eScript is an index to the global script table
    scriptIndex = itemRun->analysis.eScript;

    // locate the script from the script-index
    return propList[scriptIndex]->fComplex;
}

Any non-complex item-runs could theoretically be identified and then merged together into a single run, with the SCRIPT_ANALYSIS::eScript field set to SCRIPT_UNDEFINED. All this should happen before ScriptShape is called.
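Since I haven't implemented this yet, the following is only a sketch of what the merge step might look like, written in stand-alone C so that it can be tested without Uniscribe. SIMPLE_RUN and merge_simple_runs are hypothetical names, and the isComplex field is assumed to have been filled in beforehand (for example from IsRunComplex). A real version would also need to check that neighbouring runs share the same direction and font before merging, which this sketch ignores.

```c
#include <assert.h>

/* Hypothetical, cut-down item-run: just enough fields for the sketch. */
typedef struct
{
    int charPos;     /* offset of the run within the string (WCHARs) */
    int len;         /* length of the run (WCHARs)                   */
    int isComplex;   /* non-zero if the script needs complex shaping */
} SIMPLE_RUN;

/* Merge neighbouring non-complex runs in place; returns the new count. */
int merge_simple_runs(SIMPLE_RUN *runs, int count)
{
    int in, out = 0;

    assert(count >= 0);

    for (in = 0; in < count; in++)
    {
        if (out > 0 && !runs[out - 1].isComplex && !runs[in].isComplex)
        {
            /* extend the previous run instead of starting a new one */
            runs[out - 1].len += runs[in].len;
        }
        else
        {
            runs[out++] = runs[in];
        }
    }
    return out;
}
```

Given four runs of which only the third is complex, the first two collapse into one and the result is three runs - a small saving here, but potentially a large one for a line full of punctuation and whitespace.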


Coming up in Part 15


Every time I post a new tutorial I promise that there'll be another update to Neatpad, and of course it hasn't happened (again!). Uniscribe is just so damn complicated it has taken me far more time to document than I first anticipated. For now you can download the UspLib demo at the top of this tutorial, and next time we really will be seeing a new-and-improved Neatpad.


Attachment     Size
uspdemo.zip    39.63 KB

Part 15 - Integrating UspLib

Introduction

It's finally here! - a new and improved Neatpad which demonstrates the rendering capabilities of UspLib. The purpose of this tutorial is firstly to document the UspLib API, and secondly to mention a few details about how UspLib was integrated into Neatpad's code. I very much hope that the design of UspLib is good enough that it will allow others to import it into their own editors and get instant styled-text support!

The image above shows Neatpad's new Unicode text-rendering engine in action. Five different scripts are being displayed - Devanagari, Tamil, Thai, Arabic and of course Latin. Font-fallback is not currently supported in Neatpad, so to display all of these different scripts a suitable font must be selected. In the example above I used the "Arial Unicode MS" font which weighs in at a hefty 22Mb!

Now, don't get too excited about this latest version. On the surface it is no different than before - it is not until you load a Unicode file containing lots of complex scripts that you will see where all the work has gone.

The UspLib API has been documented below. Please let me know if you were successful in integrating this API into your own projects!


To use UspLib, include the single header-file usplib.h, and link against usplib.lib. The library has no dependencies other than the Uniscribe Script Processor DLL (usp10.dll), which will be present on Windows 2000 and above.

UspAllocate

USPDATA * UspAllocate();

UspAllocate initializes and returns a new USPDATA object, which must be used for all subsequent UspLib operations.

UspAnalyze

BOOL UspAnalyze (
    USPDATA        * uspData,
    HDC              hdc,
    WCHAR          * wstr,
    int              wlen,
    ATTR           * attrRunList,     // optional
    UINT             flags,
    USPFONT        * uspFont,         // optional
    SCRIPT_CONTROL * scriptControl,
    SCRIPT_STATE   * scriptState,
    SCRIPT_TABDEF  * scriptTabdef     // optional
);

UspAnalyze takes as input a single USPDATA object and analyses the specified paragraph of UTF-16 text, saving the results back in uspData.

UspAnalyze must be used to analyze an entire paragraph of text. The resulting USPDATA object can be used in subsequent calls to UspTextOut and UspSnapXToOffset.

struct ATTR
{
    COLORREF   fg;             // foreground text colour
    COLORREF   bg;             // background text colour

    int        len      : 16;  // length of this run (in WCHARs)
    int        font     : 7;   // font-index into the USPFONT table
    int        sel      : 1;   // selection flag (yes/no)
    int        ctrl     : 1;   // show as an isolated control-character
    int        eol      : 1;   // only valid for last character in line, prevents mouse selection
    int        reserved : 6;   // unused
};

All fields of the ATTR structure must be initialized before use. Any unrequired field should be set to zero. The ATTR::font field is used as an index into the USPFONT table. Any font referenced by ATTR::font must have been initialized using UspInitFont.


UspInitFont


void UspInitFont (
    USPFONT * uspFont,
    HDC       hdc,
    HFONT     hFont
);

UspInitFont must be called once for each font referenced by UspAnalyze in the attrRunList array. Several font-related resources are managed by the USPFONT object, including the Uniscribe SCRIPT_CACHE object, and the text-metrics for the font.



The USPFONT structure is defined below:


struct USPFONT
{
    HFONT        hFont;
    SCRIPT_CACHE scriptCache;
    TEXTMETRIC   tm;
    int          yoffset;      // height-adjustment when drawing font (set to zero)
};

The yoffset field is user-defined and specifies the vertical adjustment to be applied to all text using this font. UspInitFont initially sets this value to zero, however it can be modified after this call. All other structure members are managed by UspInitFont and should not be modified by the caller.


UspFreeFont


void UspFreeFont (
    USPFONT * uspFont
);

UspFreeFont must be called when the specified USPFONT resource is no longer required. The font-handle specified in the call to UspInitFont is released, as well as the SCRIPT_CACHE object held internally to the structure.


UspApplyAttributes


void UspApplyAttributes (
    USPDATA * uspData,
    ATTR    * attrRunList
);

UspApplyAttributes can be called at any time to re-apply the style-run attributes for the specified USPDATA object. Only the colour and selection information is used - all other fields of the attribute-runs (including the font) are ignored.



The attribute-run list must reference a range of text the same length as the string that was previously analyzed by UspAnalyze.
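A quick way to catch a mismatched attribute list before handing it to UspLib is to sum the run lengths and compare the total against the string length. The sketch below is stand-alone C and makes some assumptions: ATTR_RUN is a cut-down, hypothetical stand-in for the real ATTR structure (only the len field matters here), and attr_runs_cover is an illustrative helper that is not part of the UspLib API.

```c
#include <assert.h>

/* Cut-down stand-in for UspLib's ATTR: only the run length matters here. */
typedef struct
{
    unsigned long fg;    /* foreground colour (COLORREF in the real ATTR) */
    unsigned long bg;    /* background colour                             */
    int           len;   /* length of this run, in WCHARs                 */
} ATTR_RUN;

/* Returns non-zero if the runs cover exactly `textLen` characters. */
int attr_runs_cover(const ATTR_RUN *runs, int runCount, int textLen)
{
    int i, total = 0;

    assert(runCount >= 0 && textLen >= 0);

    for (i = 0; i < runCount; i++)
        total += runs[i].len;

    return total == textLen;
}
```

For example, two runs of 5 and 3 characters are a valid description of an 8-character string, but not of a 9-character one.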


UspApplySelection


void UspApplySelection (
    USPDATA * uspData,
    int       selStart,
    int       selEnd
);

UspApplySelection performs a similar task to UspApplyAttributes. However this time only the selection-flags are modified in the USPDATA object.



UspSetSelColor


void UspSetSelColor (
    USPDATA * uspData,
    COLORREF  fg,
    COLORREF  bg
);

UspSetSelColor controls the selection-highlight colour to be used when calling UspTextOut. Any character marked with the ATTR::sel attribute, or any range of text identified by UspApplySelection will be drawn using this colour. Note that by default, the Windows selection-highlight colours will be used.



UspTextOut


int UspTextOut (
    USPDATA * uspData,
    HDC       hdc,
    int       xpos,
    int       ypos,
    int       lineHeight,
    int       lineOffsetY,
    RECT    * rect
);

UspTextOut is the counterpart to ScriptStringOut. It takes as input the USPDATA object which was previously analyzed, and draws the text to the specified location. Any fonts, colours and selection-highlights are applied to the text as it is drawn.



It is recommended to "double-buffer" the output of this function as the multi-pass rendering will result in flickering. The alignment-mode, background-mode and colours of the device-context are unspecified on this function's return.


UspTextOut will change in the future to support word-wrapping.


UspSnapXToOffset


BOOL UspSnapXToOffset (
    USPDATA * uspData,
    int       xpos,
    int     * snappedX,    // out, optional
    int     * charPos,     // out
    BOOL    * fRTL         // out, optional
);

UspSnapXToOffset converts an x-coordinate to the nearest character-offset. In addition it returns the x-coordinate of the selected character.



The fRTL parameter is useful for the case when the text-caret's shape is modified to reflect the reading-direction of the run of text that corresponds to xpos.
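The idea of "snapping" is easiest to see in a deliberately simplified setting. The sketch below is stand-alone C and is not how UspLib works internally: it assumes a single left-to-right run with exactly one glyph per character, and snap_x_to_offset is a hypothetical helper. An x-coordinate left of a character's mid-point snaps to its leading edge, otherwise to its trailing edge.

```c
#include <assert.h>

/* Snap xpos to the nearest character boundary of a left-to-right run in
   which character i is widths[i] pixels wide. Returns the character
   offset (0..len) and stores the boundary's x-coordinate in *snappedX. */
int snap_x_to_offset(const int *widths, int len, int xpos, int *snappedX)
{
    int i, x = 0;

    assert(len >= 0);

    for (i = 0; i < len; i++)
    {
        /* left of this character's mid-point: snap to its leading edge */
        if (xpos < x + widths[i] / 2)
            break;
        x += widths[i];
    }

    *snappedX = x;
    return i;
}
```

With three 10-pixel characters, an x-coordinate of 4 snaps to offset 0 (x = 0), an x-coordinate of 5 snaps to offset 1 (x = 10), and anything beyond the end of the text snaps to offset 3 (x = 30).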


UspXToOffset


BOOL UspXToOffset (
    USPDATA * uspData,
    int       xpos,
    int     * charPos,     // out
    BOOL    * trailing,    // out
    BOOL    * fRTL         // out, optional
);

UspXToOffset converts an x-coordinate to a character position.



UspOffsetToX


BOOL UspOffsetToX (
    USPDATA * uspData,
    int       offset,
    BOOL      trailing,
    int     * xpos         // out
);

UspOffsetToX returns the x-coordinate for the leading or trailing edge of a character position.



UspFree


void UspFree(USPDATA * uspData);

UspFree should be called when the specified USPDATA object is no longer required.




Changes to Neatpad


It was very straight-forward to import UspLib into Neatpad's existing codebase. However there were several changes made to key aspects of the TextView library which made this possible. These changes are briefly mentioned below.



Whilst a large amount of code has been removed from the TextView, in reality these areas of functionality have been transferred to UspLib which now handles all aspects of drawing, fonts and mouse hit-testing.



UspLib was designed primarily for use with Neatpad. However this does not mean that it cannot be used for other text-editor projects, or in fact any application that requires the use of complex, styled text. Remember, UspLib is Freeware and can be used in any project!


Problems with paragraphs


With Uniscribe (and therefore UspLib), the basic unit of text is the paragraph. For text-editors such as Neatpad, an entire line can be treated as a paragraph. This concept is important however, as it imposes a restriction on how UspLib should be used. Because whole lines must be analyzed, an entire line of text must be in memory at one time. The consequence is that we must impose a "line length" limit on text files that we load. In Neatpad, any line of text beyond a certain length will be truncated.


I'm quite pleased about this limitation actually as I wasn't looking forward to handling arbitrarily long lines. These are just a few of the issues that long-lines present:



I don't have any good answers to these questions so I'm happy for the moment to have a simple restriction of something like 65Kb per line. I'd like to hear any thoughts in this area though!


Caching with GetUspData


The big issue with Uniscribe is all the memory that must be allocated in order to display just a single line of text. UspLib hides this complexity behind the USPDATA object. However the memory overhead that each USPDATA imposes is quite significant:


16 bytes per glyph.
14 bytes per wide-character.
32 bytes for each item-run.


For a typical string of UTF-16 text we are looking at an increase of many times that of the original string length. Obviously this is far too much to be creating USPDATA objects for every line of text in a file. To solve this problem a new TextView member function was written, which manages USPDATA objects from an internal cache.
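To put a number on "many times", the figures above can be plugged into a small calculation. The sketch below is stand-alone C: usp_overhead and usp_overhead_ratio are hypothetical helpers, and the line used in the note that follows (100 characters, one glyph per character, 10 item-runs) is an invented example rather than a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Per-element overheads quoted above for a USPDATA object. */
enum { BYTES_PER_GLYPH = 16, BYTES_PER_WCHAR = 14, BYTES_PER_ITEM_RUN = 32 };

/* Approximate number of bytes needed to hold the analysis of one line. */
size_t usp_overhead(size_t glyphs, size_t wchars, size_t itemRuns)
{
    return glyphs   * BYTES_PER_GLYPH
         + wchars   * BYTES_PER_WCHAR
         + itemRuns * BYTES_PER_ITEM_RUN;
}

/* How many times larger the analysis data is than the UTF-16 text itself. */
double usp_overhead_ratio(size_t glyphs, size_t wchars, size_t itemRuns)
{
    assert(wchars > 0);
    return (double)usp_overhead(glyphs, wchars, itemRuns) / (double)(wchars * 2);
}
```

For the example line this gives 1600 + 1400 + 320 = 3320 bytes of analysis data for 200 bytes of text, or 16.6 times the size of the original string.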


struct USP_CACHE
{
    USPDATA * uspData;     // the UspLib data for this line
    ULONG     lineno;      // which line this refers to
    ULONG     usage;       // usage count for caching purposes
};

class TextView
{
    ...

    // keep an internal cache of USPDATA objects
    USP_CACHE m_uspCache[USP_CACHE_SIZE];
};

Whenever a line of text is required by the TextView (for drawing or mouse hit-testing), the GetUspData function is called. The drawing and mouse-related routines no longer directly access the underlying TextDocument. All data-access is now through this single function.


USPDATA *TextView::GetUspData(HDC hdc, ULONG nLineNo)
{
    TCHAR buff[TEXTBUFSIZE];
    ATTR  attr[TEXTBUFSIZE];
    int   len;

    USPDATA * uspData = << find a cached object >>

    // if found a match (an already analyzed line) then return it here!!
    if(....) return uspData;

    // otherwise we need to style + analyze a new line
    len = m_pTextDoc->getline(nLineNo, buff, TEXTBUFSIZE, &off_chars);
    len = ApplyTextAttributes(nLineNo, off_chars, colno, buff, len, attr);

    // setup the tabs
    int            tablist[]     = { m_nTabWidthChars };
    SCRIPT_TABDEF  tabdef        = { 1, 0, tablist, 0 };
    SCRIPT_CONTROL scriptControl = { 0 };
    SCRIPT_STATE   scriptState   = { 0 };

    // generate glyphs etc
    UspAnalyze(uspData, hdcTemp, buff, len, attr, 0, m_uspFontList,
               &scriptControl, &scriptState, &tabdef);

    return uspData;
}

The sample-code above gives the general idea for how GetUspData works. The caching details are rather boring so I've omitted them here (just look in the real sources). The idea behind this method though, is that any time we want a USPDATA object, GetUspData will return one ready-analyzed. Most of the time this object will be from the cache, and only occasionally will a line need to be fetched from the TextDocument and analyzed with UspAnalyze.
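For readers who would rather not dig through the sources, here is one way such a cache lookup could work. This is a sketch of a generic least-recently-used policy and not a copy of Neatpad's code: LINE_CACHE, CACHE_SLOT, CACHE_SIZE and cache_lookup are all hypothetical names, and the usage field is treated here as a "last used" tick, which may differ from how the real usage count is maintained.

```c
#include <assert.h>

#define CACHE_SIZE 4     /* hypothetical; stands in for USP_CACHE_SIZE */

typedef struct
{
    int           valid;   /* slot holds an analysed line           */
    unsigned long lineno;  /* which line this refers to             */
    unsigned long usage;   /* value of the tick when last requested */
} CACHE_SLOT;

typedef struct
{
    CACHE_SLOT    slot[CACHE_SIZE];
    unsigned long tick;
} LINE_CACHE;

/* Return the slot index for `lineno`. *hit is set to 1 if the line was
   already cached; otherwise the least-recently-used slot is recycled
   and the caller would re-analyse the line into it. */
int cache_lookup(LINE_CACHE *cache, unsigned long lineno, int *hit)
{
    int i, victim = 0;

    assert(cache != 0 && hit != 0);
    cache->tick++;

    for (i = 0; i < CACHE_SIZE; i++)
    {
        if (cache->slot[i].valid && cache->slot[i].lineno == lineno)
        {
            cache->slot[i].usage = cache->tick;
            *hit = 1;
            return i;
        }
    }

    /* miss: prefer an empty slot, else the one used longest ago */
    for (i = 0; i < CACHE_SIZE; i++)
    {
        if (!cache->slot[i].valid) { victim = i; break; }
        if (cache->slot[i].usage < cache->slot[victim].usage) victim = i;
    }

    cache->slot[victim].valid  = 1;
    cache->slot[victim].lineno = lineno;
    cache->slot[victim].usage  = cache->tick;
    *hit = 0;
    return victim;
}
```

On a miss the caller would free the old USPDATA held in the returned slot (if any) and run UspAnalyze on the new line; on a hit the stored object can be returned straight away.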


Conclusions


The move to Uniscribe defines a turning-point in Neatpad's development. It has taken a lot of effort to get here but the future now looks a lot clearer. In many ways I wish I had started this project with Uniscribe right from the beginning - it would have saved a lot of work. However Unicode is quite complicated and I think the beginning tutorials would have suffered from this extra complexity. Besides, I think it is good to see the evolution that has occurred since the start of this project, and also the mistakes that I have made along the way.


Overall I've found working with Uniscribe to be a very rewarding experience. The API itself is rather complicated but it is very well designed. The main difficulty is coming to terms with the concept of glyph-based rendering. However I do find the MSDN documentation for Uniscribe to be rather inadequate in places. As someone who had no prior experience in displaying Unicode text, I struggled for quite some time before finally completing this phase of the project.


As a comparison, take a look at the Apple documentation for ATSUI (an equivalent API to Uniscribe but higher-level). The documentation is much clearer in my opinion - it doesn't just document the ATSUI API but gives guidelines on how it should be used on the Apple system.


Coming up in Part 16


There are still some minor "todo's" with UspLib which I haven't quite managed to finish. The issue of CRLF sequences at the end of a line of text needs addressing for bi-directional texts. Sometimes the CRLF will not be on the far-right of a line - for RTL texts it can be on the left-side, or even in the middle of the line! The other issue is properly displaying the file with full right-to-left alignment, with the scrollbar positioned on the left.


The next tutorial will look at adding keyboard support to the TextView. We will focus only on caret-movement with the keyboard, as actual text-entry must wait until the TextDocument can actually edit text. The caret-movement code will be using Uniscribe's ScriptBreak routine, which will probably result in a couple more UspLib functions being added.


Beyond this I will probably tackle syntax highlighting, and once the GUI is completely finished I will finally move onto file-editing. The end is getting a lot closer now I feel!


Attachment       Size
neatpad15.zip    153.23 KB

Part 16 - Keyboard Navigation with Uniscribe

Introduction

Keeping with the Uniscribe theme brings us to the next area of Neatpad's development that hasn't been touched on yet, which is keyboard-input. I've deliberately left this stage until now because I knew that without Uniscribe keyboard-navigation would be very difficult indeed. The problem with keyboard-handling is not how to process keyboard input (which is easy), but rather how to navigate through a Unicode file - taking into account combining sequences, surrogates, grapheme clusters etc.

Up until this point the Uniscribe API has been used extensively to provide text-rendering support. Fortunately Uniscribe can be used for more than just text output, and we will be looking in detail at the ScriptBreak API and how it can help us manage keyboard navigation.

Keyboard messages in Win32

All Windows programs receive keyboard input in the form of WM_KEYDOWN and WM_KEYUP messages. When a key is pressed, a series of WM_KEYDOWN messages are sent to an application's message-queue, and when the key is released a single WM_KEYUP message is sent. These two messages are relatively 'low level' but together still form the foundation of keyboard-input in Windows.

                    Key Pressed      Key Released
Normal Keystroke    WM_KEYDOWN       WM_KEYUP
System Keystroke    WM_SYSKEYDOWN    WM_SYSKEYUP

The table above summarises the two basic keyboard-input messages, and also their 'system' counterparts - WM_SYSKEYDOWN and WM_SYSKEYUP. These last two messages are seldom used by Windows programs and have no relevance to Neatpad's development so I won't bother describing them here.

The WM_KEYDOWN message is most commonly used by applications to detect when specific keys on the keyboard have been pressed. This is a good way to detect keys such as <control>, <shift> and arrow keys. However when it comes to text-entry, processing specific key presses is not actually the best way to go about things.

For example, when processing specific key presses, there is no easy way to determine the case of a letter. The user could have hit the 'A' key, but is this in lowercase or uppercase? All we know is the virtual-keystroke 'A' was entered, but we don't know anything about the state of the CAPSLOCK button or whether the user is holding the SHIFT key down. Obviously the actual character entered would be different depending on these factors - in this simple case it could be either 'a' or 'A'. Things get even more complicated when you move beyond English keyboards into the realm of Unicode, Input Method Editors and system locales, where multiple key-strokes can result in wildly different characters.

Fortunately there is another mechanism for handling character input in Windows - the WM_CHAR and WM_UNICHAR messages. These messages are specifically intended to represent characters rather than keystrokes. Interestingly, WM_CHAR is not automatically sent to an application when a key is pressed on the keyboard. It is not until the TranslateMessage function is called inside an application's message-loop that the WM_CHAR message is generated.

while(GetMessage(&msg, 0, 0, 0) > 0)
{
    TranslateMessage(&msg);
    DispatchMessage(&msg);
}

Above is the standard message-loop of many Win32 programs. Most programmers probably copy+paste this loop straight into their code without giving it much thought, but the TranslateMessage function in particular has a very specific purpose. It translates certain messages (such as WM_KEYDOWN) into a series of corresponding WM_CHAR messages. TranslateMessage takes into account certain things such as the state of the SHIFT or CAPSLOCK keys, and also the current locale. Note that the WM_KEYDOWN message being translated is not modified in any way - instead a new WM_CHAR message is constructed by TranslateMessage and posted back into the current thread's message-queue for subsequent processing.

                       Characters     Dead Characters
UTF-16 Character       WM_CHAR        WM_DEADCHAR
UTF-32 Character       WM_UNICHAR
Input Method Editor    WM_IME_CHAR
System Character       WM_SYSCHAR     WM_SYSDEADCHAR

The table above this time summarises the various character-input messages available to Windows programs. We won't be performing actual data-entry into Neatpad until much later on in this series so there is no point in looking at these messages now.

Keyboard Navigation with WM_KEYDOWN

The purpose of this tutorial is to cover the implementation of Keyboard Navigation - so we are only interested in physical keys such as the arrow-keys, page-up, page-down, home & end, etc. Therefore we will only need to handle the WM_KEYDOWN message at this time - actual character-input (and the WM_CHAR/WM_UNICHAR messages) will wait until later in this series, when we actually have a mechanism to modify the TextDocument.

In general, keyboard navigation in Windows text-editors is fairly consistent. The arrow keys (left, right, up, down) are used to move the text-caret in these four basic directions, and page-up, page-down, home and end are all well established in what they should achieve. In addition holding the control or shift keys should modify the behaviour of whatever navigation key is being pressed at the time - the shift key being used to alter the text-selection as the cursor is being moved.

The table below summarises the behaviours that we will be implementing in Neatpad.

Key Code    Normal Action      TextView Method    With <Control>      TextView Method
VK_LEFT     Character left     MoveCharPrev       Word left           MoveWordPrev
VK_RIGHT    Character right    MoveCharNext       Word right          MoveWordNext
VK_UP       Line up            MoveLineUp(1)      Scroll line up      Scroll
VK_DOWN     Line down          MoveLineDown(1)    Scroll line down    Scroll
VK_PRIOR    Page up            MoveLineUp(x)
VK_NEXT     Page down          MoveLineDown(x)
VK_HOME     Line start         MoveLineStart      Document start      MoveFileStart
VK_END      Line end           MoveLineEnd        Document end        MoveFileEnd

Each action will be represented by a TextView member-function that will perform the associated operation. As you can see there are actually quite a number of different actions that we must implement. This is due in part to the effect of the control-key which almost doubles the number of methods we must implement.

The WM_KEYDOWN handler in Neatpad's TextView is shown below. A switch-statement is used to process each keystroke that we are interested in:

LONG TextView::OnKeyDown(UINT nKeyCode, UINT nFlags)
{
    bool fCtrlDown = IsKeyPressed(VK_CONTROL);

    switch(nKeyCode)
    {
    case VK_LEFT:
        if(fCtrlDown) MoveWordPrev();
        else          MoveCharPrev();
        break;

    case VK_RIGHT:
        ...

    }

    << extend selection if <shift> is held down >>

    << update text-caret position >>
}

The purpose of the MoveXxxx functions is to update the m_nCursorOffset variable to reference a new position within the current file. Each MoveXxxx function adjusts m_nCursorOffset in a different way depending on what keyboard action should be processed. I won't include the entire function here because hopefully you can get the general idea from the snippet above.


Once a keypress has resulted in m_nCursorOffset being updated, the next step is to handle text-selections:


// Extend selection if <shift> is down
if(IsKeyPressed(VK_SHIFT))
{
    InvalidateRange(m_nSelectionEnd, m_nCursorOffset);
    m_nSelectionEnd = m_nCursorOffset;
}
// Otherwise clear the selection
else
{
    if(m_nSelectionStart != m_nSelectionEnd)
        InvalidateRange(m_nSelectionStart, m_nSelectionEnd);

    m_nSelectionEnd   = m_nCursorOffset;
    m_nSelectionStart = m_nCursorOffset;
}

Extending the text-selection is simply a matter of checking the state of the shift key (up/down), and then modifying the m_nSelectionStart and m_nSelectionEnd TextView variables appropriately. When the selection should be extended (shift key is down) then only the m_nSelectionEnd variable is modified. Otherwise both variables are updated to the same value, effectively 'zeroing' the selection.
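The same rule can be exercised outside of the TextView with the small sketch below. It is stand-alone C with the redrawing (InvalidateRange) left out, and the SELECTION structure and both function names are hypothetical. It also shows that the end of the selection is allowed to move to the left of the start, which is what happens when shift+left is pressed.

```c
#include <assert.h>

typedef struct
{
    unsigned long selStart;   /* anchor: where the selection began */
    unsigned long selEnd;     /* active end: follows the cursor    */
} SELECTION;

/* Apply a cursor move. With shift held only the active end moves, so the
   selection grows or shrinks; without shift the selection collapses. */
void selection_after_move(SELECTION *sel, unsigned long cursor, int shiftDown)
{
    assert(sel != 0);

    sel->selEnd = cursor;
    if (!shiftDown)
        sel->selStart = cursor;
}

/* Length of the selection; the active end may lie before the anchor. */
unsigned long selection_length(const SELECTION *sel)
{
    return sel->selEnd >= sel->selStart ? sel->selEnd - sel->selStart
                                        : sel->selStart - sel->selEnd;
}
```

Starting with the cursor at offset 10, a shifted move to 13 selects three characters to the right, a further shifted move to 7 selects three characters to the left of the anchor, and any unshifted move collapses the selection to nothing.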


The final step is to update the physical caret-position from the cursor-offset. This is an important concept because at no time during the keyboard navigation does the caret's physical on-screen position need to be taken into account. All keyboard navigation is based solely on a single logical character-offset, and it is only after the cursor-offset has been updated (due to a keypress) that the caret is repositioned:


// update text-caret location (xpos, line#) from the offset
UpdateCaretOffset(m_nCursorOffset, &m_nCaretPosX, &m_nCurrentLine);

UpdateCaretOffset was introduced in a previous tutorial to position the caret (using the UspOffsetToX and SetCaretPos APIs), so this function is simply reused for our keyboard handling.



All keyboard navigation in Uniscribe is based around logical offsets. In other words, the cursor advances through the backing store (the file) in WCHAR units. When it comes to navigating through bidirectional strings the caret still advances logically through the file. We rely on Uniscribe converting the logical cursor-offset to a physical location on screen (using the ScriptCPtoX function). This may mean that the cursor appears to move both 'left' and 'right' within the same string even if a single arrow-key is being used. Don't worry about displaying the caret in bidirectional strings - because we are using Uniscribe it handles all these details for us automatically.



Lastly note the use of the IsKeyPressed function, which is a simple wrapper around the GetKeyState API. Its purpose is to simplify the test for whether a key is pressed, returning a boolean value to indicate this.


bool IsKeyPressed(UINT nVirtKey)
{
    // the high-order bit of GetKeyState's result is set when the key is down
    return GetKeyState(nVirtKey) < 0;
}

ScriptBreak


ScriptBreak works alongside ScriptItemize to identify the logical attributes of each character in a string. ScriptBreak must be called once for each individual item-run in the string (as identified by ScriptItemize) and returns an array of SCRIPT_LOGATTR structures. Each entry in the array represents a single WCHAR in the Unicode string, and must be allocated by the caller to have the same number of elements as there are WCHARs in the run.


HRESULT WINAPI ScriptBreak (
    WCHAR           * pwcChars,
    int               cChars,
    SCRIPT_ANALYSIS * psa,
    SCRIPT_LOGATTR  * psla
);

The individual attributes for each character are held within the SCRIPT_LOGATTR structure, shown below:


struct SCRIPT_LOGATTR
{
    BYTE fSoftBreak  : 1;
    BYTE fWhiteSpace : 1;
    BYTE fCharStop   : 1;
    BYTE fWordStop   : 1;
    BYTE fInvalid    : 1;
    BYTE fReserved   : 3;
};

Although each field of the SCRIPT_LOGATTR structure has a specific purpose, this information as returned by ScriptBreak is generally useful for two purposes: word-wrapping and keyboard navigation.



Don't under-estimate just how much work ScriptBreak is doing on our behalf. The identification of character and word positions alone saves us a tremendous amount of effort. Added to this is the fact that ScriptBreak supports all of the various Unicode scripts, so for languages such as Thai (which require dictionary support to identify 'soft breaks'), all of the hard work is already done.


The task of calling ScriptBreak for each item-run is handled by the UspAnalyze function, which we looked at in previous tutorials. The SCRIPT_LOGATTR buffer is allocated and stored inside the USPDATA object's breakList member. A simple loop is then used to iterate over each item-run, and the results of ScriptBreak are stored inside the USPDATA::breakList array. The array holds the results for all item-runs, concatenated together:


<< UspLib.c - UspAnalyze(...) >>

// allocate memory for SCRIPT_LOGATTR structures
uspData->breakList = malloc(wlen * sizeof(SCRIPT_LOGATTR));

// Generate the word-break information for each item-run
for(i = 0; i < uspData->itemRunCount; i++)
{
    ITEM_RUN *itemRun = &uspData->itemRunList[i];

    ScriptBreak(
        wstr + itemRun->charPos,
        itemRun->len,
        &itemRun->analysis,
        uspData->breakList + itemRun->charPos
    );
}

Any string (or paragraph) of text analyzed with UspAnalyze will therefore automatically have its SCRIPT_LOGATTR information stored inside the USPDATA object. Because the information for each run has been concatenated into the same buffer, individual item-runs do not need to be taken into account when inspecting the logical-attributes for each character in the string.


Let's look at a quick example and see how the string "Hello يُساوِي World" would be treated by ScriptBreak. Note that there are two spaces in the string, one either side of the Arabic phrase:




SCRIPT_LOGATTR flags for each WCHAR (one row per character, in logical order):

Index  Char       SoftBreak  WhiteSpace  CharStop  WordStop  Invalid
 0     H          0          0           1         0         0
 1     E          0          0           1         0         0
 2     L          0          0           1         0         0
 3     L          0          0           1         0         0
 4     O          0          0           1         0         0
 5     (space)    0          1           1         0         0
 6     ي          1          0           1         1         0
 7     ُ           0          0           0         0         0
 8     س          0          0           1         0         0
 9     ا          0          0           1         0         0
10     و          0          0           1         0         0
11     ِ           0          0           0         0         0
12     ي          0          0           1         0         0
13     (space)    0          1           1         0         0
14     W          1          0           1         1         0
15     O          0          0           1         0         0
16     R          0          0           1         0         0
17     L          0          0           1         0         0
18     D          0          0           1         0         0


 


Tabs and Whitespace


ScriptBreak does not identify tab-characters as whitespace by default. This poses a problem because every text-editor under the sun understands that tabs are basically the same as spaces and should therefore be treated the same. The solution is to parse the Unicode string looking for tab-characters and modify the corresponding entries in the breakList:


for(i = 0; i < wlen; i++)
{
    if(wstr[i] == '\t')
        uspData->breakList[i].fWhiteSpace = TRUE;
}

The loop above can be found inside UspAnalyze and is executed after the ScriptBreak information has been obtained.


UspGetLogAttr


I have introduced a new UspLib function called UspGetLogAttr, which is similar in concept to Uniscribe's ScriptString_pLogAttr function. It returns a pointer to the breakList buffer inside each USPDATA object. The difference, however, is that a non-const SCRIPT_LOGATTR array is returned which can be modified by the caller of the function.


SCRIPT_LOGATTR * UspGetLogAttr( USPDATA * uspData )
{
    return uspData->breakList;
}

As you can see the UspGetLogAttr function is really very simple - all it does is return a pointer to the breakList buffer which is held inside each uspData object. Of course the real work in building this buffer was performed by the UspAnalyze function.


The reason this function exists is to allow the caller to modify the internal SCRIPT_LOGATTR structure inside each USPDATA object. This is important, because when it comes to syntax-highlighting I envisage that we will have to fine-tune the SCRIPT_LOGATTR buffer for each line to cater for more specific language-syntax details.


For now the UspGetLogAttr function is purely used by the keyboard navigation functions to control cursor-placement within each line of text.


Character and Word navigation


Character and Word navigation are quite closely related to each other. Both actions operate within a single line of text and both use the SCRIPT_LOGATTR information returned by ScriptBreak to position the caret at valid character/word positions:


VOID TextView::MoveCharPrev()
{
    USPCACHE        * uspCache;
    CSCRIPT_LOGATTR * logAttr;
    ULONG             lineOffset;
    int               charPos;

    // get Uniscribe data for current line
    uspCache = GetUspCache(0, m_nCurrentLine, &lineOffset);
    logAttr  = UspGetLogAttr(uspCache->uspData);

    // get character-offset relative to start of line
    charPos = m_nCursorOffset - lineOffset;

    // find the previous valid character-position
    for( --charPos; charPos >= 0; charPos--)
    {
        if(logAttr[charPos].fCharStop)
            break;
    }

    << move up to end-of-last line if necessary >>

    // update cursor position
    m_nCursorOffset = lineOffset + charPos;
}

MoveCharPrev begins by obtaining the cached UspData object for the current line, and the SCRIPT_LOGATTR structure for that line is retrieved by calling UspGetLogAttr. The SCRIPT_LOGATTR array is then parsed to detect valid character-stop positions. Because UspData objects represent individual lines, all processing is relative to the start of each line:


// find the previous valid character-position
for( --charPos; charPos >= 0; charPos--)
{
    if(logAttr[charPos].fCharStop)
        break;
}

The loop above is quite simple. All it does is continue looping until a character-stop position is found. When the loop exits the charPos variable has been modified and the text-caret can be repositioned.
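As a rough illustration, the following self-contained sketch models only the fCharStop flag using a hypothetical LogAttr struct (not the real SCRIPT_LOGATTR). PrevCharStop mirrors the loop shown above; NextCharStop is an assumed forward counterpart written for this example and is not taken from the Neatpad source.

```cpp
#include <cassert>
#include <vector>

// Stand-in for SCRIPT_LOGATTR; only the flag used here is modelled.
struct LogAttr
{
    bool fCharStop;
};

// Step back to the previous character-stop. Returns -1 if there is none
// on this line (the caller would then wrap onto the previous line).
int PrevCharStop(const std::vector<LogAttr> &attr, int charPos)
{
    for(--charPos; charPos >= 0; charPos--)
    {
        if(attr[charPos].fCharStop)
            break;
    }
    return charPos;
}

// Assumed forward counterpart: step to the next character-stop, or to
// the end of the line if no further stop exists.
int NextCharStop(const std::vector<LogAttr> &attr, int charPos)
{
    const int length = static_cast<int>(attr.size());
    assert(charPos >= 0 && charPos <= length);

    for(++charPos; charPos < length; charPos++)
    {
        if(attr[charPos].fCharStop)
            break;
    }
    return charPos < length ? charPos : length;
}
```

With the CharStop flags of the Arabic run from the earlier example (1,0,1,1,1,0,1), stepping forward from position 0 lands on position 2, because the combining mark at position 1 is not a valid caret stop.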


Word-navigation is a little more complicated. The MoveWordNext logic can be seen below:


// if already on a word-break, go to next char
if(logAttr[charPos].fWordStop)
    charPos++;

// skip whole characters until we hit a word-break/more whitespace
for( ; charPos < uspCache->length_CRLF; charPos++)
{
    if(logAttr[charPos].fWordStop || logAttr[charPos].fWhiteSpace)
        break;
}

// skip trailing whitespace
while(charPos < uspCache->length_CRLF && logAttr[charPos].fWhiteSpace)
    charPos++;

Remember that all we are doing so far is altering the logical cursor-offset for the TextView. The actual text-caret is positioned independently and does not require any form of sophisticated processing on our part. This is the great thing about Uniscribe - as a programmer we only have to deal with logical character units - all of the complicated display-related code is handled automatically for us.
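To see the word-navigation steps in action without Uniscribe, here is a small sketch under stated assumptions. FakeBreak is a hypothetical, ASCII-only stand-in for ScriptBreak (it also flags tabs as whitespace, as UspAnalyze does), LogAttr models just two of the SCRIPT_LOGATTR flags, and NextWord repeats the MoveWordNext steps with one extra bounds check added for safety.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for SCRIPT_LOGATTR with only the two flags used below.
struct LogAttr
{
    bool fWhiteSpace;
    bool fWordStop;
};

// Hypothetical ASCII-only replacement for ScriptBreak: a word-stop is
// placed on the first non-space character that follows whitespace.
std::vector<LogAttr> FakeBreak(const std::string &text)
{
    std::vector<LogAttr> attr(text.size());

    for(std::size_t i = 0; i < text.size(); i++)
    {
        const bool ws = (text[i] == ' ' || text[i] == '\t');
        attr[i].fWhiteSpace = ws;
        attr[i].fWordStop   = !ws && i > 0 && attr[i - 1].fWhiteSpace;
    }
    return attr;
}

// Same three steps as the MoveWordNext logic shown above.
int NextWord(const std::vector<LogAttr> &attr, int charPos)
{
    const int length = static_cast<int>(attr.size());

    // if already on a word-break, go to next char (bounds check added)
    if(charPos < length && attr[charPos].fWordStop)
        charPos++;

    // skip whole characters until we hit a word-break/more whitespace
    for( ; charPos < length; charPos++)
    {
        if(attr[charPos].fWordStop || attr[charPos].fWhiteSpace)
            break;
    }

    // skip trailing whitespace
    while(charPos < length && attr[charPos].fWhiteSpace)
        charPos++;

    return charPos;
}
```

For the string "Hello \tWorld foo", starting at position 0 the cursor moves to 7 (the 'W'), then to 13 (the 'f'), and finally to 16, the end of the line.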

Line Wrapping

I was hoping that this stage would not be necessary due to the linear (offset-based) coordinate system that we are using for Neatpad. However because of the way Neatpad handles CR/LF sequences, specific checks must be in place to detect when the cursor moves past the beginning/end of a line. Should this occur the cursor is moved onto the previous/next line accordingly.

The difficulty occurs because the text-caret should not be allowed to move past the CR/LF at the end of each line. In effect the CR/LF sequences are 'dead' characters that cannot be used as character-stop positions. The image below illustrates this by showing the caret at the very last position it can reach, despite there being a line-feed character at the end.

The task of wrapping to the previous/next line is deferred to the MoveLineEnd and MoveLineStart functions. You will therefore see the following code in many of the MoveXxxxPrev functions:

if(charPos < 0)
{
    charPos = 0;

    if(m_nCurrentLine > 0)
    {
        MoveLineEnd(m_nCurrentLine-1);
        return;
    }
}

...and the following code in the corresponding MoveXxxxNext functions:


if(charPos == uspCache->length_CRLF)
{
    if(m_nCurrentLine + 1 < m_nLineCount)
        MoveLineStart(m_nCurrentLine+1);

    return;
}

I was hoping to use the SCRIPT_LOGATTR arrays somehow to 'skip' CR/LF sequences rather than have specific code just for this purpose. I realise that I have probably not found the neatest way to deal with line-wrapping, but I've been working at this for so long now that I'm just going to release what I've got. If anyone can suggest a nice way to deal with line-wrapping that doesn't require any additional processing then please get in touch...


Line navigation and Anchoring


Conceptually line-based navigation is very simple - on the surface all that is required is for the cursor's line-number to be adjusted - in order for the cursor to move up/down a specified number of lines. Unfortunately the implementation is slightly more complicated than that because of the necessity to support variable-width fonts.


The problem occurs when the user moves the cursor (or text-caret) up or down a line. The user's expectation when he/she hits the up/down arrows is for the cursor to be shifted vertically to the previous/next line. For fixed-width fonts this is not an issue - the caret's y-position can be adjusted quite freely and this is all that is usually required. However variable-width fonts require that the caret's x-position be potentially modified to ensure that the caret always locates to a valid character position. In effect the caret must always 'snap' to the nearest character-stop boundary when moving up or down.



The image above illustrates this idea. As the cursor moves down through the file, you can see various numbered caret-positions being displaced horizontally around a fixed vertical line. This vertical line brings us onto the next concept to explore which is sometimes referred to as 'anchoring'.


Quite simply, anchoring is the process where the text-cursor is kept as close as possible to a specific horizontal coordinate within each line when moving up or down through a file. Imagine that the user places the text-caret at a character-position using the mouse. This location is represented by an x-coordinate within the current line and also the line-number itself. When the user moves up or down the file using the arrow keys, they expect the cursor to follow the vertical line that they chose as closely as possible. In Neatpad's TextView I refer to this process as anchoring.


The anchor position is represented by the m_nAnchorPosX variable. It is set whenever the user moves left/right along a line, or instead places the caret using the mouse. Importantly, the anchor position is not set when the up/down arrow keys are used as this would defeat the object of the exercise.


UspXToOffset(uspData, m_nAnchorPosX, &charPos, &trailing, 0);

Now when moving to the previous/next lines, the appropriate character-position can be identified by calling Uniscribe's ScriptXtoCP, which converts the anchoring-coordinate to a logical character offset. This function has been encapsulated by the UspXToOffset function within UspLib and can be seen above.


VOID TextView::MoveLineUp(int numLines)
{
    USPDATA * uspData;
    ULONG     lineOffset;
    int       charPos;
    BOOL      trailing;

    // move 'up' the specified number of lines
    m_nCurrentLine -= min(m_nCurrentLine, (unsigned)numLines);

    // get Uniscribe data for that line
    uspData = GetUspData(0, m_nCurrentLine, &lineOffset);

    // move to character position nearest the anchoring x-coordinate
    UspXToOffset(uspData, m_nAnchorPosX, &charPos, &trailing, 0);

    m_nCursorOffset = lineOffset + charPos + trailing;
}

The MoveLineUp function (above) shows how the cursor is adjusted when moving up through the file. The key detail here is the call to UspXToOffset. This UspLib function takes the caret-anchoring position and finds the closest Unicode character-offset.
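The 'snapping' behaviour can be approximated in a few lines. The sketch below is not UspXToOffset: NearestCaretStop is a hypothetical helper that assumes simple left-to-right text where every character is a valid caret stop and its advance width (in pixels) is already known. It returns the equivalent of charPos + trailing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for UspXToOffset. 'widths' holds the advance
// width of each character on the line. The result is the caret position
// (0..widths.size()) whose x-coordinate lies closest to anchorX.
int NearestCaretStop(const std::vector<int> &widths, int anchorX)
{
    int x = 0;   // x-coordinate of the caret stop before character i

    for(std::size_t i = 0; i < widths.size(); i++)
    {
        if(anchorX < x + widths[i])
        {
            // inside this character: snap to its leading or trailing
            // edge, whichever is nearer to the anchor
            const int trailing = (anchorX - x > widths[i] / 2) ? 1 : 0;
            return static_cast<int>(i) + trailing;
        }
        x += widths[i];
    }

    // anchor lies beyond the end of the line
    return static_cast<int>(widths.size());
}
```

With an anchor at x=24, a line of four 10-pixel characters yields position 2, whereas a short line of three 4-pixel characters yields position 3 (the end of the line). Because the anchor itself is never modified by up/down movement, the caret returns to its original column as soon as a long enough line is reached.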


Coming up in Part 17


Keyboard Navigation has the potential to be incredibly complicated due to Unicode. Fortunately the Uniscribe API solves all the complexity with the ScriptBreak API. This is another huge benefit of moving to Uniscribe - the amount of code that we have been saved from writing is quite significant.


Next up will be syntax-highlighting. I have decided that regular expressions are definitely the best method in this regard, however there are many issues that must first be solved before any further work is completed. Some of the forthcoming topics will probably be regular expressions, parsing techniques and finite-state machines (FSM). If you have any comments in this regard then I'd be happy to hear them!


Attachment: neatpad16.zip (163.5 KB)

Part 17 - Editing Text with Piece Chains

Introduction

OK so I lied about getting the syntax-highlighting implemented this time around. I got bogged down in "regular expression hell" and needed something else to concentrate on. So during the summer period I've instead been slowly implementing Neatpad's text-editing capability. As I hinted at in the very first installment of this tutorial series, I have decided to follow the "Piece Table" design. The aim of this tutorial will be to discuss the rationale behind this decision and also highlight some (but not all) of the implementation details. Note that I will not be discussing issues related to the editing of Unicode text (this will come in the next tutorial). Instead we will look at the design of the piece-table from a low-level perspective, independent of any Unicode concerns.

Several years ago when I started writing HexEdit there was very little information I could find regarding editor design. Google was unheard of and AltaVista was the search-engine of choice. I had a 33k dial-up modem and the single resource I stumbled over at the time was a paper written by a University professor named Charles Crowley. His paper 'Data Structures for Text Sequences' and the 'Piece Table' approach he described heavily influenced the design of HexEdit, making it one of the slickest hex-editors available. I later documented HexEdit's piece-table design (or span-table, as I called it then) in my 'Memory Management for Large File Editors' article. Today there seems to be a little more information available regarding Piece-Tables - and certainly more evidence of people using this design in their projects:

Piece-tables are not a new development by any means and have been around for several years in one form or another:

 

Obviously the piece-chain technique has been around for over 30 years now, with the first notable occurrence being the Bravo editor. The surprising thing is that this technique is still quite rare even today.

The Perfect Text Editor

Of course, there is no such thing as the perfect text editor - otherwise we would all be using the same tool and I wouldn't be writing this article-series in a (likely futile) attempt to create my vision of a perfect editor.

Even if we ignore the overwhelming differences between editors' user-interface designs, there is still a great diversity in editor implementation. What I mean by this is that there is no single 'true' design that all text editors follow. Instead many distinct methods exist - such as the buffer-gap scheme, linked-lists of lines, and the somewhat rarer piece-chain techniques - all of which have been used with varying degrees of success over the years.

One of the reasons the piece-chain method is not very common is the complexity of its implementation for text editors. It is not the actual piece-management that is difficult, but rather the maintenance of the editor's line-buffers that becomes troublesome. For this reason many editors choose not to implement piece-chains - and mature editors such as Vi and Emacs can be seen to be very successful without this design.

The fact that all of these techniques exist is a strong indication that there is no overwhelming 'best' design for text-editors. I don't know at this stage if I am making the right decision by implementing piece-tables for Neatpad. However I am determined to complete this project, and at the very least I will know one way or another, if the piece-table design is suitable for a plain-text editor.

Piece Chains with Linked Lists

Neatpad implements its piece-chain data structure using a doubly-linked list, which closely follows the original design of HexEdit. Whilst other structures such as binary-trees could be used, a linked-list is preferred for its simplicity. The original design for HexEdit maintained a Head and Tail pointer for representing the start and end of the piece-chain. Most people should be familiar with this concept; if you aren't then I would suggest reading up on this subject before going any further. Anyway, the classic doubly-linked list (which I used in HexEdit) is illustrated below:

The snippet below further illustrates how such a linked-list might be initialized:

// sequence constructor
sequence::sequence()
{
    head = NULL;
    tail = NULL;
}

The small problem with this design is the way the head and tail pointers are managed. Whenever a node is inserted at the front or back of the list, specific code is required to handle these 'special case' conditions because the head or tail pointers need updating to point to the new nodes. Link management is cumbersome and over-complicates sequence manipulations a great deal.

An alternative design is to maintain what are termed 'sentinel' nodes. In this model, two dummy nodes are introduced at the start and end of the list. Their contents are not defined (lengths will be zero), but their very presence means that every node in the list is guaranteed to have a valid neighbour. In other words, apart from the sentinels themselves, every node's next and previous link always point to a valid node. This essentially removes any 'special case' code for dealing with insertions or deletions at the start and end of the list.

Linked-list with sentinel nodes.

To further illustrate, an empty list contains just two nodes; the head and tail sentinels, which simply point to each other.

Empty list.

Supporting this linked-list design is very simple. Instead of initializing the head and tail pointers to NULL, two 'empty' nodes are created which link between each other. Any nodes that are subsequently added are inserted in-between the sentinels.

sequence::sequence()
{
    head = new span(0, 0, 0);
    tail = new span(0, 0, 0);

    head->next = tail;
    tail->prev = head;
}

It is surprising how much work this simple idea saves. Of course I can't claim credit for this method - I originally read about this trick in one of Michael Abrash's assembly-optimization books many years ago.
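To make the saving concrete, here is a minimal sketch of a sentinel-based list. The sentinel_list class and its member names are invented for this illustration and hold plain integers rather than spans; the point is that insert_after and remove contain no checks for 'first' or 'last' nodes.

```cpp
#include <cassert>
#include <cstddef>

struct node
{
    node *next;
    node *prev;
    int   value;
};

// Illustrative list with head/tail sentinels (not Neatpad's sequence class).
class sentinel_list
{
public:
    sentinel_list()
    {
        head = new node{ nullptr, nullptr, 0 };
        tail = new node{ nullptr, nullptr, 0 };
        head->next = tail;
        tail->prev = head;
    }

    ~sentinel_list()
    {
        for(node *n = head; n != nullptr; )
        {
            node *next = n->next;
            delete n;
            n = next;
        }
    }

    sentinel_list(const sentinel_list &) = delete;
    sentinel_list & operator=(const sentinel_list &) = delete;

    // Link a new node in after 'pos'. The same four assignments work at
    // the front, in the middle and at the back, because 'pos->next' is
    // always a valid node.
    node * insert_after(node *pos, int value)
    {
        node *n = new node{ pos->next, pos, value };
        pos->next->prev = n;
        pos->next       = n;
        return n;
    }

    // Unlink and free a real (non-sentinel) node.
    void remove(node *n)
    {
        n->prev->next = n->next;
        n->next->prev = n->prev;
        delete n;
    }

    node * front_sentinel() const { return head; }
    node * back_sentinel()  const { return tail; }

    std::size_t count() const
    {
        std::size_t c = 0;
        for(node *n = head->next; n != tail; n = n->next)
            c++;
        return c;
    }

private:
    node *head;
    node *tail;
};
```

Pushing to the front is insert_after(front_sentinel(), v) and pushing to the back is insert_after(back_sentinel()->prev, v); neither needs to touch the head or tail members.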


Spans and Pieces


The term piece-chain is nice and succinct, because one can straight away imagine the pieces of text being chained together within the data structure. However I use the term span in Neatpad to represent each piece of text in the chain, purely because this is how I did things in HexEdit. There is no difference between a piece and a span apart from the name - they both serve the same purpose.


// span - private to the sequence
struct span
{
    span   * next;
    span   * prev;

    size_w   offset;
    size_w   length;
    int      buffer;
};

The piece-table is formed by chaining together the span objects through their next and prev links. The linked-list is maintained by a sequence C++ class, which encapsulates the entire data structure inside a single object. The insert, replace and erase functions are included to provide the editing interface:


// define the type of strings the sequence will hold
typedef wchar_t seqchar;

class sequence
{
public:
    bool insert  (size_w index, seqchar *buffer, size_w length);
    bool replace (size_w index, seqchar *buffer, size_w length);
    bool erase   (size_w index, size_w length);

    bool undo ();
    bool redo ();

    // other members snipped

private:
    span * head;    // pointers to list sentinels
    span * tail;
};

The central idea behind spans is that they provide a level of indirection to the underlying file contents. The individual nodes never store any text - they only refer to ranges of text stored in the original file, or to any range of text in the modify-buffer added due to text insertions. You can see from the definition above that a span has no means for storing text - instead the offset, length and buffer fields identify a range (or piece) of text in one of these two buffers.


The image below illustrates a typical piece-chain organisation. The 'original file' buffer is initialized with the text "The brown fox jumped over the lazy dog". The single word "ing" has been inserted into the sequence - with this text appended to the modify buffer. The spans in the piece-chain form the sequence "The jumping dog".
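The same arrangement can be written down as data. In the sketch below the piece struct, the buffer_id enum and the render function are hypothetical names, a std::vector stands in for the linked list, and the exact span boundaries ("The ", "jump", "ing", " dog") are my assumption about how the chain is divided.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum buffer_id { original_buffer = 0, modify_buffer = 1 };

// A piece only describes where its text lives; it holds no text itself.
struct piece
{
    std::size_t offset;
    std::size_t length;
    buffer_id   buffer;
};

// Walk the chain and copy out the range that each piece refers to.
std::string render(const std::vector<piece> &chain,
                   const std::string &original,
                   const std::string &modify)
{
    std::string text;

    for(const piece &p : chain)
    {
        const std::string &src = (p.buffer == original_buffer) ? original : modify;
        text += src.substr(p.offset, p.length);
    }
    return text;
}

// The example from the text: three ranges of the original file plus one
// range of the modify buffer.
std::string jumping_dog_example()
{
    const std::string original = "The brown fox jumped over the lazy dog";
    const std::string modify   = "ing";

    const std::vector<piece> chain =
    {
        {  0, 4, original_buffer },   // "The "
        { 14, 4, original_buffer },   // "jump"
        {  0, 3, modify_buffer   },   // "ing"
        { 34, 4, original_buffer },   // " dog"
    };

    return render(chain, original, modify);
}
```

Calling jumping_dog_example() returns "The jumping dog", while both buffers remain exactly as they were.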



An important thing to note is that spans have no knowledge of their logical position in the sequence. They only know the physical location of the data they reference. This has the advantage that new spans can be inserted or deleted from the piece-chain with no effect on the other spans in the list. This is why inserting and deleting from a piece-table is so fast - assuming that you know where to insert.


The piece-table's flexibility is also its biggest drawback. Because spans don't know their logical position there is no way to directly access a specific text-offset within the document. All accesses must go through the linked-list by starting at the head of the list. Any time we want to locate a specific character-offset we must iterate through each span in turn, summing their lengths in order to keep track of the current logical position:


span * sequence::spanfromindex (size_w index, size_w *spanindex = 0)
{
    span   * sptr;
    size_w   curidx = 0;    // logical offset of the current span

    // scan the list looking for the span which holds the specified index
    for(sptr = head->next; sptr->next; sptr = sptr->next)
    {
        if(index >= curidx && index < curidx + sptr->length)
        {
            if(spanindex)
                *spanindex = curidx;

            return sptr;
        }

        curidx += sptr->length;
    }

    // insert at tail?
    if(sptr && index == curidx)
    {
        if(spanindex)
            *spanindex = curidx;

        return sptr;
    }

    return 0;
}

The spanfromindex function above shows how the piece-chain is traversed from the start in order to locate the span. The lengths of each span are summed together until the correct node is found. The simplicity of a linked-list design is also its biggest drawback - random access to a linked list is slow.


Unlimited Undo & Redo


In theory unlimited undo+redo is incredibly simple with piece-tables. In fact it is almost 'free'. This is one reason why this technique is so desirable - the big selling point for me is that the piece-chain is (in computer-science terms) a Persistent Data Structure. In essence this means that the piece-chain preserves older versions of itself even when modified. Most importantly these older versions are easily restorable whilst still maintaining the integrity of the data-structure.


It is important to remember that the underlying data in the file we are editing never changes - it is only the linked-list nodes (the spans) that change, in order to represent modifications to the file. Therefore there is no need to maintain separate data-buffers containing the document contents which have been modified. The memory-savings the piece-chain brings are significant enough to warrant the added complexity of this method.


The key to Neatpad's implementation of undo/redo is the use of span-ranges. Quite simply, a span-range is an object which represents a contiguous range of spans within the sequence. Any time the piece-chain is manipulated, a span-range object is pushed onto the undo stack which represents the range of spans affected by the edit.


struct span_range
{
    span   * first;
    span   * last;

    bool     boundary;
    size_w   sequence_length;
};

Each span-range is therefore used to represent a single modification to the sequence. The first and last fields point to the range of spans that encompass a particular edit operation. All spans in the range are linked together internally using their regular next and prev pointers. Now whilst a span-range can conceptually hold multiple spans all chained together, it can also reference a single span by pointing both first and last to the same span.


Notice how the sequence-length is stored inside each span-range. This is necessary because we don't want to re-calculate the sequence-length each time we undo or redo an action (doing so would be very slow for large piece-chains). By preserving the sequence-length prior to editing the sequence we can easily restore this value when we undo.


Span boundaries are a special form of span-range. Think of a span-boundary as the gap between two adjacent spans. A span-range is still used to represent this form of 'empty' range, with its boundary field set to 'true' to indicate what kind of range it is.



The image above illustrates both kinds of span-range. The range on the left is a 'regular' range, with boundary set to false. The range on the right is an 'empty' range, and the span-range object has its boundary field set to true.


In actual fact span-ranges are used for more than just holding history - they are also used as auxiliary helper objects when manipulating spans during insert and erase operations. They are just a convenient way to store sections of the piece-chain as they are moved in and out of the sequence.


Inserting Data


Data insertion is by far the simplest to implement. There are two basic scenarios that we need to consider - inserting in the middle of a span, and inserting at a span boundary. As we look at these operations in more detail, remember that edits to the text-sequence are just modifications to the spans in the list. Note that in all of these examples, the text is shown to be contained within each span. This is for illustrative purposes only - because we know that in reality the spans do not hold any text, rather they refer to ranges of text in the original or modify buffers. In addition the head and tail sentinel nodes are also shown - as the grey blocks at either end of the list.


Inserting in the middle of a span is the first scenario. We start with the piece-chain in the following state:



The sequence holds the text "TheQuickBrown" and for whatever reason the linked-list holds three spans. Our first example will be to insert the string "xxxx" at index "6". This position in the sequence happens to fall in the middle of the second span.



Inserting in the middle of a span requires that the span be split into two separate pieces - which represent the data before and after the insertion point. A third span is linked in-between these two new pieces which will represent the actual inserted data. As expected the inserted data is appended to the 'modify buffer' which leaves the original file-buffer untouched.



Although we have conceptually split a span in half in order to represent an 'insertion', in reality we do not do this. If you study the piece-chain you can see that the span representing the word 'Quick' has been removed from the sequence, to be replaced with three new spans. However rather than deleting this span we instead preserve it, by holding it inside a span-range object. This span-range is then pushed onto our 'undo stack' in order to represent the range of spans that were modified by the insertion.


This strategy of span preservation is key in this implementation of a persistent data structure. Notice that after the insertion has taken place, no spans in the piece-chain reference the "Quick" span anymore. However "Quick" still maintains its own links back into the main list. This is a very important detail, because when it comes to restoring the sequence (undoing this last action) we need to know whereabouts in the list the spans on the undo-stack should be re-inserted.


Shown below is the basic outline for how data is inserted for this first scenario, taken from the sequence::insert function:


// initialize a new 'undo'. It will be pushed onto the undo-stack.
span_range *oldspans = init_undo(index, length, action_insert);
span_range  newspans;

// preserve the span that we are inserting into
oldspans->append(sptr);

// new spans for before and after the insertion point
newspans.append(new span(sptr->offset, insoffset, sptr->buffer));
newspans.append(new span(modbuf_offset, length, modifybuffer_id));
newspans.append(new span(sptr->offset+insoffset, sptr->length-insoffset, sptr->buffer));

// insert the new pieces into the sequence!!
swap_spanrange(oldspans, &newspans);
sequence_length += length;

All sequence-modifications in Neatpad follow the same basic pattern:

  1. Spans to be removed from the sequence are stored inside an "oldspans" span-range.
  2. Spans to be introduced into the sequence are stored inside a "newspans" span-range.
  3. The "oldspans" are pushed onto the undo-stack.
  4. The two span-ranges are then swapped around.
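The swap-and-restore idea behind these steps can be sketched in isolation. This is a simplified illustration and not the Neatpad source: spans carry their text directly (real spans only reference buffers), swap_range and restore_range are hypothetical helpers, span-boundaries and the undo stack itself are left out, and the spans are linked by hand with a small link helper.

```cpp
#include <cassert>
#include <string>

// Illustration only: the text is stored in the span to keep things short.
struct span
{
    span        *next;
    span        *prev;
    std::string  text;
};

struct span_range
{
    span *first;
    span *last;
};

// Join two spans together: a <-> b
void link(span &a, span &b)
{
    a.next = &b;
    b.prev = &a;
}

// Splice 'newr' into the list in place of 'oldr'. The old spans are not
// modified, so they keep pointing at their former neighbours.
void swap_range(const span_range &oldr, const span_range &newr)
{
    span *before = oldr.first->prev;
    span *after  = oldr.last->next;

    newr.first->prev = before;
    newr.last->next  = after;
    before->next     = newr.first;
    after->prev      = newr.last;
}

// Undo: the preserved links tell us exactly where the range belongs,
// so no searching of the list is required.
void restore_range(const span_range &oldr)
{
    oldr.first->prev->next = oldr.first;
    oldr.last->next->prev  = oldr.last;
}

// Concatenate the text between the two sentinels.
std::string flatten(const span *head)
{
    std::string s;
    for(const span *p = head->next; p->next != nullptr; p = p->next)
        s += p->text;
    return s;
}
```

Starting from "The", "Quick" and "Brown" between two sentinels, swapping the "Quick" range for a "Qui", "xxxx", "ck" range gives "TheQuixxxxckBrown", and calling restore_range on the preserved "Quick" range brings back "TheQuickBrown". A redo would be the same call made with the other range.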

Inserting at a span boundary is the next scenario to consider. In this example the two-letter string "yy" will be inserted at sequence-position "6" again, which now falls between "Qui" and the "xxxx" span we inserted previously.

This time a single span is inserted into the list, in-between the "Qui" and "xxxx" nodes. The undo-event which represents this action is a span-range holding the spans either side of the insertion-boundary. This 'span boundary' is distinguished by the '*' symbol. Notice how these spans again maintain their links back into the linked-list.

Another important detail to notice is how the undo-stack has grown with the new span-range pushed onto it. Hopefully you can see how multi-level undo would be implemented: Each time we edit the sequence we push another span-range onto the undo-stack. Each entry in this stack represents the range of spans affected by that particular edit operation. Every time we perform an "undo" a span-range is "popped" from the undo-stack and re-inserted back into the main linked-list. The spans that it replaces are removed, but still preserve their own links back into the list. Each span-range that is removed during an undo is pushed onto a redo stack.

The code for span-boundary-insertion is much simpler because only one span is inserted this time:

// initialize a new 'undo' 
span_range *oldspans = init_undo(index, length, action_insert);
span_range  newspans;

// this is a 'boundary insertion'
oldspans->spanboundary(sptr->prev, sptr);

// single span for the inserted data
newspans.append(new span(modbuf_offset, length, modifybuffer_id));

// insert the new span into the sequence in place of the old ones
swap_spanrange(oldspans, &newspans);
sequence_length += length;

The code follows the same pattern as before: define the 'oldspans', collect together the 'newspans' and then swap them around. The only difference this time is how the 'oldspans' are defined. No spans were modified in the boundary-insertion case so the span-range represents a span-boundary by pointing to the spans either side of the boundary we inserted at.

Erasing Data

Erasing data from the sequence is rather more complicated. In fact, it is a lot more complicated. The problem is that deletions can potentially encompass several spans, and can start and stop mid-span as well. Here are the scenarios that must be catered for:

We must also consider the case where a deletion can encompass several spans. All four scenarios above must be taken into account in this case. There is also the case where a deletion is contained entirely within a single span. Again, boundary or mid-span conditions must be taken into account.

The mistake I made with HexEdit's piece-chain implementation was to handle all of these cases separately which resulted in hugely over-complicated code. What I should have done, and what I have done this time around, is have one 'general-case' that handles all scenarios:

The example above shows a deletion which encompasses several spans, and both starts and stops mid-way through a span as well.

Because the deletion affected every span in the list, the entire linked-list has been pushed onto the undo-stack in the form of a span-range, to be replaced with just two new spans that represent the data before and after the deleted range.

The code starts by checking to see if the deletion starts mid-way through a span. If it does, that span is added to the oldspans span-range, and a replacement span created which represents the data in the span just before the deletion-index. This 'split' span is added to the newspans range:

// does the deletion *start* mid-way through a span?
if(remoffset != 0)
{
    // split the span - keep the first "half"
    newspans.append(new span(sptr->offset, remoffset, sptr->buffer));

    // have we split a single span into two?
    // i.e. the deletion is completely within a single span
    if(remoffset + removelen < sptr->length)
    {
        // make a second span for the second half of the split
        newspans.append(new span(
            sptr->offset + remoffset + removelen,
            sptr->length - remoffset - removelen,
            sptr->buffer)
        );
    }

    removelen -= min(removelen, (sptr->length - remoffset));

    // archive the span we are going to delete
    oldspans.append(sptr);
    sptr = sptr->next;
}

Once this first scenario has been handled, a loop is used to process any further spans until we reach the end of the deletion-range. Every node that falls under the 'delete range' is appended to the oldspans container object. A special-case is used to handle the scenario when a delete stops mid-way through a span.


// we are now on a proper span boundary, so remove
// any further spans that the erase-range encompasses
while(removelen > 0 && sptr != tail)
{
    // does the deletion stop mid-way through this span?
    if(removelen < sptr->length)
    {
        // split the span, keeping the last "half"
        newspans.append(new span(
            sptr->offset + removelen,
            sptr->length - removelen,
            sptr->buffer)
        );
    }

    removelen -= min(removelen, sptr->length);

    // archive the span we are replacing
    oldspans.append(sptr);
    sptr = sptr->next;
}

The very last thing to do is swap out the 'oldspans' with the 'newspans' and update the sequence-length:


swap_spanrange(&oldspans, &newspans);
sequence_length -= length;
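
Taken together, the general case boils down to a single pass over the spans. The sketch below is a simplification under stated assumptions: the spans are held in a std::vector rather than Neatpad's linked list, the undo bookkeeping is left out, and Span and erase_range are hypothetical names. It returns the span list that would remain after the erase.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Span { std::size_t offset, length; int buffer; };

// Return the span list that results from erasing 'removelen' units starting
// at sequence position 'index'. Untouched spans are copied through; a span
// that is only partly covered is split and the surviving part(s) kept.
std::vector<Span> erase_range(const std::vector<Span> &spans,
                              std::size_t index, std::size_t removelen)
{
    std::vector<Span> result;
    std::size_t pos = 0;                 // sequence position of the current span

    for (const Span &s : spans) {
        const std::size_t start = pos, end = pos + s.length;
        pos = end;

        // span lies before the erase range, or nothing is left to erase
        if (removelen == 0 || end <= index) {
            result.push_back(s);
            continue;
        }

        assert(index >= start);          // the erase never begins before this span
        const std::size_t remoffset = index - start;   // non-zero only mid-span
        const std::size_t n = std::min(removelen, s.length - remoffset);

        if (remoffset != 0)              // keep the first "half"
            result.push_back({s.offset, remoffset, s.buffer});
        if (remoffset + n < s.length)    // keep the last "half"
            result.push_back({s.offset + remoffset + n,
                              s.length - remoffset - n, s.buffer});

        removelen -= n;
        index += n;                      // the next span starts on a boundary
    }
    return result;
}
```

The same two tests (does the erase start mid-span, does it stop mid-span) cover every scenario, including a deletion that sits entirely inside one span.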

All of this code can be found in the sequence::erase member function, but is a lot more complicated than what I have shown here. In particular there are special-cases for 'optimized' deletes which almost double the amount of code required. I have no intention of detailing this process any further, just look at the sourcecode download if you are interested.


Replacing Data (overwriting)


Data-replacement is the most complex operation to implement, and should be thought of as a hybrid form of erase and insert, combined into a single function. All of the complexities of these first two operations must be taken into account when implementing sequence::replace. In addition the 'optimized replace' scenario - in which multiple, consecutive replaces are coalesced into a single operation - makes replace very complicated indeed. I am not going to go into any great detail - suffice to say, the sourcecode download contains everything you need to know.


When I originally wrote HexEdit's sequence class, yet another mistake I made was to write a separate routine for replacing data in the sequence. I took the already overcomplicated 'erase' function and duplicated the code into a new 'replace' function. I then merged in the code for inserting data which made it even more complicated. This made code-maintenance very difficult because of the code-duplication problems.


I didn't make the same mistake second time around. Neatpad's sequence::replace function is simply a wrapper around sequence::erase and sequence::insert. First of all erase is called to delete the range of data that will be replaced. This temporarily shortens the sequence. Next, insert is called with the 'replace' data which is inserted into the place where the delete occurred, re-growing the sequence back to its original length. Importantly, these two actions are grouped together so that they appear as a single undoable action.


bool sequence::replace(size_w index, seqchar *buffer, size_w length)
{
    // group the erase and insert so that they undo as a single action
    group();
    erase(index, length);
    insert(index, buffer, length);
    ungroup();
    return true;
}

Writing sequence::replace this way significantly simplified the sequence's implementation. It wasn't quite as straight-forward as I make out because of the problem of 'optimized replaces' - the sequence::erase function must have a small amount of logic to cater for 'regular erases' vs 'replace erases'. But in general I found this 'shortcut' far more preferable to what I had done before. Importantly, any time I have to change the erase or insert functions (because of a bug for example) I don't have to modify the replace function in parallel - it gets all changes for free because it's just a simple wrapper function.
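
To illustrate how grouping makes the wrapper behave as one undoable action, here is a minimal sketch. It is not the sequence class: it assumes a plain std::string as the storage, and MiniBuffer with its group-id scheme is a hypothetical design invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class MiniBuffer {
public:
    std::string text;

    // Actions recorded between group() and ungroup() share one group id.
    void group()   { if (++depth_ == 1) ++group_id_; }
    void ungroup() { assert(depth_ > 0); --depth_; }

    void insert(std::size_t index, const std::string &s) {
        text.insert(index, s);
        record({true, index, s, 0});
    }
    void erase(std::size_t index, std::size_t length) {
        record({false, index, text.substr(index, length), 0});
        text.erase(index, length);
    }
    // replace = erase + insert, grouped so that one undo reverts both
    void replace(std::size_t index, const std::string &s) {
        group();
        erase(index, s.size());
        insert(index, s);
        ungroup();
    }
    // Undo the most recent action plus every action sharing its group.
    bool undo() {
        if (undo_.empty()) return false;
        const unsigned id = undo_.back().group;
        do {
            const Action a = undo_.back();
            undo_.pop_back();
            if (a.was_insert) text.erase(a.index, a.data.size());
            else              text.insert(a.index, a.data);
        } while (id != 0 && !undo_.empty() && undo_.back().group == id);
        return true;
    }

private:
    struct Action {
        bool was_insert;
        std::size_t index;
        std::string data;
        unsigned group;       // 0 means "not part of any group"
    };
    std::vector<Action> undo_;
    unsigned depth_ = 0, group_id_ = 0;

    void record(Action a) {
        a.group = depth_ ? group_id_ : 0;
        undo_.push_back(a);
    }
};
```

Because replace contains no editing logic of its own, a fix to erase or insert is picked up automatically, which is the maintenance benefit described above.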


Piece-chain demo application


Testing a piece-chain implementation is rather difficult because of all the subtle scenarios that can occur. I started off with a simple console application in which I would type "insert index text" or "erase index length". It was rather cumbersome so I took the UspLib Demo application which I wrote for a previous tutorial, and came up with the "Piece Chain Demo":



I found this form of test-harness very useful as it gives instant feedback as to the state of the piece-chain after each operation. It also shows the contents of the Original File, Modify-Buffer, and also the Undo stacks. It's not so useful now that I have integrated the piece-chain into Neatpad, but at the very least it might be interesting to anyone who is attempting to learn more about piece-chains. Give it a whirl!


Linked Lists vs Binary Trees


There is some debate as to how a piece-table should be represented. The two main choices are either a linked-list, or some form of binary tree. For the time-being Neatpad uses a doubly-linked list to manage its piece-table. This method has been proven to be effective in my HexEdit application, but it remains to be seen whether or not a linked-list is suitable for a text editor:


Linked lists are very simple to manipulate and traverse. However locating a specific character-position can be potentially very slow if there are a large number of nodes in the list, because the list must be traversed from its start any time a particular node is required. Caching can solve most of the linked-list's speed problems, because the majority of operations on the sequence will be very localized (you can only type within a fixed viewport of the document at any one time). Caching the last-accessed span can yield large improvements here.
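
As a rough sketch of the caching idea, the lookup below remembers the last span it found and starts the next search from there, walking backwards or forwards as needed. It assumes a std::vector in place of the doubly-linked list, SpanIndex and its members are hypothetical names, and a real implementation would have to invalidate the cache whenever the span list is edited.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Span { std::size_t length; };

class SpanIndex {
public:
    explicit SpanIndex(std::vector<Span> spans) : spans_(std::move(spans)) {}

    // Find the span containing sequence position 'index'. Returns the span's
    // position in the list and writes the offset within it to 'spanoffset'.
    std::size_t find(std::size_t index, std::size_t &spanoffset) {
        std::size_t i = cache_i_, start = cache_start_;

        // walk backwards while the target lies before the cached span
        while (index < start) {
            --i;
            start -= spans_[i].length;
            ++steps;
        }
        // walk forwards while the target lies beyond the current span
        while (i < spans_.size() && index >= start + spans_[i].length) {
            start += spans_[i].length;
            ++i;
            ++steps;
        }
        cache_i_ = i;
        cache_start_ = start;
        spanoffset = index - start;
        return i;
    }

    std::size_t steps = 0;    // spans stepped over, to show the cache's effect

private:
    std::vector<Span> spans_;
    std::size_t cache_i_ = 0, cache_start_ = 0;   // last span found
};
```

With 1000 spans of ten characters each, the first lookup of position 5000 steps over 500 spans, while a follow-up lookup of a neighbouring position costs at most a step or two.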


Binary trees on the other hand are the exact opposite. A balanced binary tree is great for searching but bad for inserting/modifying. Searching is always O(log N), so locating a specific character-position is always fairly fast, even for a sequence that holds many thousands of nodes. However manipulating a binary-tree is always fairly expensive compared to the linked-list. Any time a node is inserted into a binary-tree, the tree must be rebalanced, and all children of that node must be updated to reflect the change.


In both cases the complexity of the data-structure we choose (the number of spans it contains) is a function of the number of edits to the file and is not at all related to the size of the file or the number of lines it contains. This makes the piece-chain a great data structure because it scales very well, even for very large files. It is far preferable to a "linked list of lines" model, which does not scale well for large files. The piece-table is definitely the data structure of choice in my opinion.


If we were concerned purely with edits to the piece-table, I believe the linked-list is the superior choice. Should we need to use the data-structure as a way to perform other kinds of searches (such as looking up line-numbers, or mapping line-numbers to character-offsets) then this might require a move to a binary-tree implementation.


C++ templates and VS2005


For the moment I have resisted the temptation to implement the sequence class using C++ templates. This is primarily due to the poor template support in VC6.0, but I also wanted to make the initial version 'clean' enough for non-C++ programmers to understand. Currently the sequence class uses a hard-coded 'seqchar' datatype as its basic unit of text, which is currently a 16-bit WCHAR type. It is simple to re-target the sequence to any other character type by changing the typedef statement at the top of sequence.h, so for the time-being templates would be a luxury rather than a necessity.


However I have recently started using VS2005 and now have the opportunity to take advantage of a much better C++ compiler. If I do migrate to VS2005 it will make it harder to compile Neatpad under VC6.0 because the projects will be incompatible. And if I change the sequence-class to a template implementation, Neatpad won't compile under VC6.0 at all. What do people think about this proposal? If I get no comments I'll go ahead and migrate to VS2005...


Conclusion


I have deliberately tried to keep the piece-chain implementation as simple as possible. I see no need at this stage to migrate to a binary-tree implementation for the piece-chain, although future tutorials may have to address this issue if performance becomes a problem. The important thing is, it doesn't really matter if we use a list or a tree - everything is encapsulated inside the sequence C++ class so changing to a binary-tree would have no real impact on the rest of the design.


Piece-tables are actually a very simple concept and are not as complicated as they may seem from this article. The complexities arise from the way the piece-chain is optimized each time an edit occurs. Consecutive inserts, erases and replaces are coalesced in such a way that the number of spans in the list is always at a minimum. This requires that individual spans be adjusted 'in place' rather than introducing new spans into the table each time an edit occurs. This also complicates the undo/redo support. Without this optimization piece-chains are very simple, but unfortunately a 'real world' piece-chain must have these features if it is to be successful.


The next tutorial will focus on how I integrated the piece-chain design into Neatpad. There are a lot of issues to discuss, mainly due to the subtleties of editing Unicode text. I also need to implement proper 'memory management' and large-file support - this will be covered in a later tutorial as well.


Attachment: piecechain.zip (58.32 KB)

Part 18 - Unicode Text Editing

Introduction

Unicode character input presents some unique problems for text-editors - issues that did not have to be considered when the first ASCII editors were written. The main difficulties are the Unicode 'combining sequences' - where multiple code-points are combined to form a single selectable 'character cluster'. Modifications to a Unicode text-file require careful coding to ensure that character cluster-boundaries are preserved and that no invalid sequences are inadvertently introduced into the document. The Uniscribe API will again be used to aid us in this area.

Character input (of any kind) is not possible without some form of data-structure to manage and represent any alterations to the document. The last tutorial saw the implementation of a piece-table data structure which implements three basic edit operations: insert, erase, and replace. Unlimited undo and redo are also supported. The sequence class was presented which encapsulates the piece-table and these basic editing operations within a single C++ object. The purpose of this tutorial is therefore to document the modifications required by Neatpad to support the piece-table editing model.

Unicode character input

We already looked at keyboard navigation in Part 16 - Keyboard Navigation, in which we discussed caret movement within a Unicode document, and we briefly looked at the various Win32 character-input messages that a program can encounter when receiving keyboard input:

                      Characters     Dead Characters
UTF-16 Character      WM_CHAR        WM_DEADCHAR
UTF-32 Character      WM_UNICHAR
Input Method Editor   WM_IME_CHAR
System Character      WM_SYSCHAR     WM_DEADSYSCHAR

Even though the WM_CHAR message has been around since the first versions of Windows, it is still the most appropriate way for a Win32 application to receive character input. For any UNICODE application, the WM_CHAR message sends a single UTF-16 character value instead of a plain ANSI character. This is perfect for us, because Neatpad is already a UTF-16 (wide-character) application. Even complex scripts will be handled seamlessly because keyboard input for these languages is usually associated with an Input Method Editor (IME) - which will translate any 'complex' key-strokes into the appropriate stream of UTF-16 characters, without any extra work on our part. The Windows Input Method Editor will be the subject of a future tutorial.

The other messages look interesting but are not really necessary. Supposedly the WM_UNICHAR message sends UTF-32 characters rather than the 16-bit WCHARs - however I have never seen WM_UNICHAR being sent to a program, even on an XP machine. I suspect that this is a message that is sent by other applications (such as IMEs) rather than the OS itself. Likewise the WM_IME_CHAR message is only sent under special circumstances. Although we will ignore these additional input-messages, we will not be losing any functionality by simply handling WM_CHAR at this point.

The code below shows the standard method of handling character-input in a Win32 program:

LONG TextView::OnChar(UINT nChar, UINT nFlags)
{
    // do something with 'nChar'
    return 0;
}

// ...inside the window procedure's message switch:
case WM_CHAR:
    return OnChar(wParam, lParam);

We don't need to do anything special to receive Unicode input. As long as we compile with the UNICODE macro defined we will receive UTF-16 characters. Unicode values outside of the BMP (i.e. > 0xFFFF in value) will be sent as two separate messages, one for each surrogate character. However it is very unlikely that a user will manually enter two surrogate values separately - more than likely they will be using an Input Method Editor, and it will be the IME that breaks their keyboard input into UTF-16 units.
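
To make the 'two separate messages' point concrete, the sketch below shows how two surrogate values arriving one at a time map back to a single code-point. Neatpad does not need to do this, because it stores the UTF-16 units exactly as they arrive; the SurrogateJoiner class is a hypothetical helper used only to illustrate the arithmetic.

```cpp
inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate (char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Accumulates UTF-16 units as they arrive (for example, one per WM_CHAR).
class SurrogateJoiner {
public:
    // Feed one UTF-16 unit. Returns true and sets 'codepoint' when a whole
    // character is available; returns false while a low surrogate is awaited.
    bool feed(char16_t unit, char32_t &codepoint) {
        if (is_high_surrogate(unit)) {
            pending_ = unit;            // remember it and wait for the partner
            return false;
        }
        if (is_low_surrogate(unit) && pending_ != 0) {
            codepoint = 0x10000
                      + ((char32_t(pending_) - 0xD800) << 10)
                      + (char32_t(unit) - 0xDC00);
            pending_ = 0;
            return true;
        }
        pending_ = 0;
        codepoint = unit;               // BMP character (unpaired low surrogates pass through)
        return true;
    }

private:
    char16_t pending_ = 0;              // high surrogate awaiting its partner
};
```

For example, U+1F600 arrives as the pair 0xD83D, 0xDE00: the first call reports nothing, the second yields the full code-point.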

Regardless of how the text is entered, all we need to do is receive each WM_CHAR as it is sent to the TextView and process it accordingly. Usually that's where the tricky part comes in, but fortunately we now have the piece-chain sequence class which will enable us to make edits to the underlying document each time a character is received.

Integrating the Piece-table

My first attempt at integrating the piece-table into Neatpad went quite smoothly. I'm not completely certain it's the best way to do things in an editor but at least it shows that the idea can work in practice. The image below illustrates the various components of the text editor as it stands at the moment.

The basic idea has been to incorporate the piece-table in a manner that caused minimal impact on the rest of the design. To achieve this aim I layered the piece-table directly on top of the raw file content. The piece-table (or sequence class) therefore presents the underlying file in its raw form - as a sequence of BYTEs rather than WCHAR units. This makes sense for the moment, because the TextDocument can still access the file in its raw form prior to converting it to UTF-16. The only difference now is that the TextDocument can make changes to the raw file through the piece-table as well as read its content.

Not everything is as smooth as it seems however. The big problem with this design is the reliance on the line-buffer, which must be reinitialized every time a change is made to the piece-table. This is because the line-buffer indexes directly into the piece-table, and any change to the piece-table (such as an insert or delete) affects the layout of the file. Therefore any change to the piece-table will require the line-buffer to be modified appropriately. This is not really noticeable on very small files, but any file larger than a few KB suffers serious performance problems. You will notice this problem with the text-editor download this time - any time a single character is inserted there will be a noticeable delay as the line-buffer is reinitialized. Of course this is unacceptable and this issue will be addressed fully in the next tutorial.

Deleting text

Even though text-input is the primary function of a text-editor, text-deletion is an important aspect too. So before we look at how text is entered I will describe some of the issues involving text deletion.

There are three basic forms of 'delete' that a Unicode text-editor must consider. The first is the deletion of a static range of text - usually defined as the user's current selection. The key detail here is that the starting and ending positions of the deletion range are well defined, and we rely on the existing keyboard and mouse navigation routines to always place the cursor at proper cluster boundaries. The other two delete methods are Forwards delete and Backwards delete. These last two operations typically map to the standard <delete> and <backspace> keys. We will discuss these last two forms of deletion a little further down.

Text-deletion in Neatpad is made slightly more complicated because of the way Neatpad currently manages the underlying data in the document. Neatpad is basically a multi-format text-editor which supports ASCII, UTF-8 and UTF-16 formats. So even though all text coordinates are in UTF-16 units (in the TextView), the TextDocument must map these coordinates to whatever format the document is in. (This design was discussed in Part 9 - Unicode Text Processing). Deleting a range of text in the document requires Neatpad to convert the character-offsets into the appropriate physical offsets in the underlying file.

The TextDocument performs this conversion by accessing the Line-Buffer component of the editor. Although lookups through the line-buffer occur in O(log N) time, this is a relatively costly operation for larger files. This presents a problem, because our delete operation takes as input two values - the start and end of the range of text to delete - and requires two lookups through the line-buffer. It's not ideal but this is just the way the editor has evolved up to this point. Remember that the whole reason these conversions to/from UTF-16 are necessary is because I want Neatpad to support multiple file-formats. UTF-8 is the real killer here because it doesn't correspond at all to UTF-16, making the conversions an absolute necessity. It is quite likely that I will move away from this 'dual coordinate' design!
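
The kind of mapping involved can be shown with a deliberately naive sketch that scans UTF-8 text from the beginning. This is not how Neatpad does it (the lookup goes through the line-buffer first), and the function name is hypothetical; it also assumes the UTF-8 is well-formed.

```cpp
#include <cstddef>
#include <string>

// Convert an offset measured in UTF-16 units into a byte offset within
// well-formed UTF-8 text, by scanning from the start of 'utf8'.
// An offset that falls inside a surrogate pair is rounded up to the end
// of that character.
std::size_t utf16_to_utf8_offset(const std::string &utf8, std::size_t offset_chars)
{
    std::size_t bytes = 0, units = 0;

    while (bytes < utf8.size() && units < offset_chars) {
        const unsigned char lead = static_cast<unsigned char>(utf8[bytes]);

        // sequence length is determined by the lead byte
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;

        units += (len == 4) ? 2 : 1;    // 4-byte sequences need a surrogate pair
        bytes += len;
    }
    return bytes < utf8.size() ? bytes : utf8.size();
}
```

Every range-based operation needs two such conversions, one for each end of the range, which is why the cost of the lookup matters.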

In an ideal world we could load a UTF-8 file entirely into memory, converting it to UTF-16 when it is first opened so that there would be no conversion issues at runtime. This is how most editors work. It is only because of my intention to support 'large files' that we have these issues. Obviously a multi-gigabyte text-file will not fit into memory all at once and will require us to page the file into memory in smaller units, making the runtime conversions to/from UTF-16 a requirement. But I really don't like the 'dual coordinate' system that I have at the moment (it's very messy to deal with), and because of this I may decide to limit 'large-file' support (on-disk editing) to just plain ASCII and UTF-16. As always I'd appreciate any feedback people may have to offer.

Backspace vs Delete

This is the title of one of Michael Kaplan's Sorting it all out blog entries. You should take the time to read Michael's blog as there is a wealth of high-quality information regarding Internationalization. Many of the topics on his site have been tremendously helpful when it came to understanding Unicode.

Although delete and backspace are both methods of text-deletion, their operations are subtly different in practice. In fact a better way to describe their actions is as Forwards delete and Backwards delete, as these are the directions (in logical text-units) that the cursor moves in each case. The differences can be highlighted by examining a string of Unicode text containing combining sequences. As an example we will consider the word "Déja":

As you can see the text is made up of five code-points - with the combining acute accent (U+0301) shown as a separate character. In practice of course, the acute-accent is positioned above the letter 'e' and together they are rendered as a single grapheme cluster:

Forwards-delete is the first case to consider, and a basic text-editor would implement this operation by removing a single Unicode character each time the <delete> key was pressed. However if we look at the 'Déja' example we will see it is a little more complicated than that:

Imagine that the cursor is positioned at the start of the letter 'e'. If we hit <delete> then this single letter would be erased as expected. However if Neatpad just deletes a single UTF-16 character each time delete is pressed, the combining acute-accent character would not be removed. In fact it would become attached to the letter preceding the 'e' (the letter 'D'), resulting in the illegal Unicode sequence shown above.

Of course the logical thing to do in these scenarios is to delete the base character and any combining characters that might follow. The same behaviour would also be used for UTF-16 surrogate pairs so that both surrogate-characters are deleted. The Uniscribe 'logical attribute list' is used to identify these character boundary conditions.
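
The rule can be sketched as a small function that works out how many UTF-16 units a forward-delete should remove. Neatpad gets its cluster boundaries from Uniscribe; the is_combining test here is only a stand-in that recognises the Combining Diacritical Marks block (U+0300 to U+036F), and forward_delete_length is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate (char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Stand-in for Uniscribe's cluster information (assumption: only the
// Combining Diacritical Marks block is treated as combining).
inline bool is_combining(char16_t c) { return c >= 0x0300 && c <= 0x036F; }

// Number of UTF-16 units that a forward-delete at 'pos' should remove.
std::size_t forward_delete_length(const std::u16string &text, std::size_t pos)
{
    if (pos >= text.size())
        return 0;                       // nothing after the caret

    std::size_t end = pos + 1;

    // keep both halves of a surrogate pair together
    if (is_high_surrogate(text[pos]) && end < text.size() && is_low_surrogate(text[end]))
        ++end;

    // swallow any combining marks attached to the base character
    while (end < text.size() && is_combining(text[end]))
        ++end;

    return end - pos;
}
```

With the caret in front of the 'e' in the example word, the function returns 2, so the accent is removed along with its base letter.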

Back-delete is slightly different in practice. Imagine this time that the caret has been placed in-between the acute accent and the letter 'j'. Performing a back-delete using the same logic as before would remove the combining character as well as the base character. Doing so is not incorrect, but a nicer alternative is to remove just the combining character, leaving the 'e' untouched:

Deleting single Unicode characters in response to a <backspace> is preferable because only the combining-character is removed from the end of the sequence, leaving the base character intact. Deleting the entire sequence is not necessary and would only cause the user frustration if they had only mistyped this one character. Unfortunately blindly performing a single-character delete is not sufficient because we must still consider UTF-16 surrogate-pairs - which must be treated (and deleted) as an atomic unit of text. The blog post I referred to earlier discusses this scenario and highlights a bug in several Windows applications which don't handle surrogate-pairs properly in this case.

Supporting Unicode delete operations requires us to inspect the text-stream we are modifying and handle any combining sequences (including surrogates) appropriately. Fortunately because we are using Uniscribe we can again refer to the logical-attribute list which is calculated for each line - remember that this list identifies cluster boundaries and enables us to easily detect combining sequences.

Line-termination is another scenario that must be dealt with in addition to the surrogate / combining-sequence issue. It's not really a problem in itself, other than special care must be taken when carriage-return/line-feed combinations are encountered. Obviously CR/LF pairs must be treated as a single unit - but really they are handled in the exact same way as surrogate-pairs.
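
The back-delete rules described above (remove a single unit, except that surrogate pairs and CR/LF pairs are removed whole) are simple enough to sketch directly. The function name is hypothetical, and this version inspects the text itself rather than consulting Uniscribe.

```cpp
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate (char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Number of UTF-16 units that a backspace at caret position 'pos' should
// remove from the text immediately before the caret.
std::size_t backspace_length(const std::u16string &text, std::size_t pos)
{
    if (pos == 0 || pos > text.size())
        return 0;                       // nothing before the caret

    if (pos >= 2) {
        const char16_t a = text[pos - 2], b = text[pos - 1];

        // a surrogate pair is an atomic unit of text
        if (is_high_surrogate(a) && is_low_surrogate(b))
            return 2;

        // so is a CR/LF line terminator
        if (a == u'\r' && b == u'\n')
            return 2;
    }

    // otherwise remove one unit: a combining mark goes, its base stays
    return 1;
}
```

Placing the caret after the accent in the example word and pressing backspace removes one unit only, leaving the plain 'e' behind.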

Because Neatpad's piece-table is not a line-oriented data structure there are no further issues to deal with. In contrast, an editor that utilized a 'linked list of lines' model would require special logic to manipulate the list-nodes whenever lines (CR/LFs) are added/removed from the file. Neatpad doesn't have this problem as its underlying data structure is a simple flat stream of characters.

Inserting text

Neatpad receives character input via the TextView::OnChar function. This function is defined in TextViewKeyInput.cpp and is shown below:

LONG TextView::OnChar(UINT nChar, UINT nFlags)
{
    WCHAR ch = (WCHAR)nChar;

    // translate carriage-returns to CR/LF combinations
    if(nChar == '\r')
        PostMessage(m_hWnd, WM_CHAR, '\n', 1);

    // input this single character at current cursor position
    EnterText(&ch, 1);

    // 'break' piece-table optimizations whenever we input a new line
    if(nChar == '\n')
        m_pTextDoc->m_seq.breakopt();

    NotifyParent(TVN_CHANGED);
    return 0;
}

TextView::OnChar simply passes each wide-character it receives to the more general-purpose EnterText function, which is also used by the clipboard-related code:


BOOL TextView::EnterText(TCHAR *szText, ULONG nLength)
{
    ULONG selstart = min(m_nSelectionStart, m_nSelectionEnd);
    ULONG selend   = max(m_nSelectionStart, m_nSelectionEnd);

    BOOL fReplaceSelection = (selstart == selend) ? FALSE : TRUE;

    switch(m_nEditMode)
    {
    case MODE_READONLY:
        return 0;

    case MODE_INSERT:

        // remove selection if necessary
        if(fReplaceSelection)
        {
            m_pTextDoc->m_seq.group();
            m_pTextDoc->erase_text(selstart, selend-selstart);
            m_nCursorOffset = selstart;
        }

        // enter the text!
        m_pTextDoc->insert_text(m_nCursorOffset, szText, nLength);
        break;

    case MODE_OVERWRITE:
        // overwrite happens here
        break;
    }

    m_nCursorOffset += nLength;
    return TRUE;
}

It is the responsibility of EnterText to take the appropriate action depending on the state of the current edit-mode - which can be either MODE_READONLY, MODE_INSERT or MODE_OVERWRITE. Only 'text insert' is shown above - the overwrite is described later on but is much the same in terms of implementation.


ULONG TextDocument::insert_text(ULONG offset_chars, WCHAR *text, ULONG length)
{
    ULONG offset_bytes;

    offset_bytes = charoffset_to_byteoffset(offset_chars);

    return insert_raw(offset_bytes, text, length);
}

The key detail to highlight at this stage is what the TextDocument::insert_text function does with the WCHAR buffer that is passed to it. The first thing it does is to convert the supplied character-offset into a raw-byte offset. Once insert_text knows where to physically insert the text into the piece-table, the insert_raw function is called - which then converts the UTF-16 string into the appropriate text-format (or leaves it unchanged if the file is already UTF-16). With the text now in the same format as the underlying file's, it can finally be inserted into the piece-table.
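
For a UTF-8 document, the conversion step amounts to encoding the UTF-16 input before it is stored. The sketch below shows one way such a conversion could look; it is not the code inside insert_raw, the function name is hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this example.

```cpp
#include <cstddef>
#include <string>

// Encode UTF-16 text as UTF-8. Unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string &in)
{
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // combine a surrogate pair into a single code-point
            cp = 0x10000 + ((cp - 0xD800) << 10) + (char32_t(in[i + 1]) - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                // unpaired surrogate
        }

        if (cp < 0x80) {                // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {        // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {      // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                        // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note that the number of bytes inserted generally differs from the number of UTF-16 units typed, which is one reason the character-offset and byte-offset coordinate systems have to be kept apart.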


Overwriting text


Whilst all text-editors support an 'insert' mode, a large proportion also provide an 'overwrite' mode. Neatpad's piece-table implementation already provides a 'replace' function - which combines erase+insert into a single atomic operation. The prototype is shown below:


size_w sequence::replace(size_w index, seqchar *buffer, size_w length)

The piece-table's replace function works by first erasing the specified range of text (between index and index+length), and then inserting the data in buffer back into the same location. The result is a 'strict' overwrite of binary data, with the file's length unchanged at the end of the operation. This behaviour is a hang-over from HexEdit's original design.


Overwriting text in Unicode is more complicated than simple byte (or character) replacement, as I found out when I finally started editing files with Neatpad. Text-overwrite must follow the exact same rules that we already discussed with 'forward deleting'. That is, we must respect cluster boundaries and combining sequences when we are replacing text. Consider the following example, in which the letter 'o' is entered whilst in 'overwrite' mode.



Ultimately this results in a scenario where we may need to delete several UTF-16 characters for every single character we type in at the keyboard. In other words, a 'replace' operation in a Unicode text editor could very well change the length of the file that it is editing. This is in stark contrast to a binary editor which must never change the length of any file it is editing. Unfortunately that's just the way text-editors work and we must be prepared to deal with these eventualities.


Replacing past the end of line


Replacing (overwriting) past the end-of-line is another scenario that must be catered for. Once again it wasn't until I was actually using the editor that I realised this was an issue at all. Consider the following two lines of text as a simple example, with the caret positioned at the end of the first line:



Although the editor has split the text across two lines, remember that Neatpad's piece-table presents the file contents as a simple stream of text:



Using the existing forward-delete logic in 'overwrite mode', the CR+LF combination would be deleted when the next character is input. Erasing the line-terminator in this way would have the effect of joining the two lines together:



Obviously this is not the behaviour we expect for a text editor. Neatpad must therefore give CR/LF characters special treatment whilst in overwrite mode. In fact, any entered text that immediately precedes a CR/LF must actually be inserted into the document (extending the line length), regardless of whether the editor is in overwrite mode or not.


The CR/LF scenario highlights the same deficiency in the sequence::replace function that we first saw when overwriting combining-sequences. The underlying problem is that the number of bytes (or characters) we want to delete does not have to equal the number of characters we are replacing them with. We therefore need to be able to exactly control the number of bytes to delete for any given replace operation:


sequence::replace(
    size_w          index,
    const seqchar * buffer,
    size_w          length,
    size_w          erase_length
);

Adding the erase_length parameter now gives us the flexibility to handle any form of replace operation. When replacing over combining sequences this parameter would be set to the number of bytes that the combining sequence occupies in the file. For the CR/LF scenario above, this parameter would be set to zero.
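
How the caller might choose erase_length can be sketched as follows. The assumptions are significant: the function works in UTF-16 units rather than the file's bytes, it uses a simple combining-mark test in place of Uniscribe, and overwrite_erase_length is a hypothetical name rather than anything in Neatpad.

```cpp
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate (char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Stand-in for Uniscribe's cluster information (assumption: only the
// Combining Diacritical Marks block is treated as combining).
inline bool is_combining(char16_t c) { return c >= 0x0300 && c <= 0x036F; }

// Number of UTF-16 units to erase when one character is typed at 'pos'
// in overwrite mode.
std::size_t overwrite_erase_length(const std::u16string &text, std::size_t pos)
{
    // at the end of the document, or in front of a line terminator,
    // nothing is erased: the new text simply extends the line
    if (pos >= text.size() || text[pos] == u'\r' || text[pos] == u'\n')
        return 0;

    std::size_t end = pos + 1;

    // a surrogate pair is overwritten as one unit
    if (is_high_surrogate(text[pos]) && end < text.size() && is_low_surrogate(text[end]))
        ++end;

    // a base character takes its combining marks with it
    while (end < text.size() && is_combining(text[end]))
        ++end;

    return end - pos;
}
```

The value returned would then be passed as erase_length, after conversion to whatever units the underlying file uses.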


Optimized insert and replace


You may be wondering why we couldn't just have called sequence::insert rather than sequence::replace for the example above. The reason is simple: Neatpad's piece-table is designed to optimize each insert/replace operation, by coalescing consecutive edits (of the same type) into a single operation - in order to keep the number of spans in the piece-table to a minimum at all times. If we started off calling sequence::replace when entering text, but switched to calling sequence::insert when dealing with CR/LFs, this would break the optimized operation into a 'replace' and an 'insert' operation. This is not an intuitive thing for the user to understand - they would expect their overwritten text to be treated as a single operation. Rather than complicate matters I simply decided to extend the capability of the sequence::replace function.


Whilst we are on the subject of CR/LF combinations and sequence-optimizations, it is probably worth mentioning what happens when the user actually enters a CR/LF into the document (rather than deleting it). Neatpad's piece-table would naturally take the CR/LF characters and coalesce the edit-operation where possible. If the user entered several lines of text consecutively in one go (without undoing/deleting/redoing anything), all lines of text would be optimized into one operation. The consequences of this optimization would become noticeable when the user subsequently decided to undo the text-insertions - when they would find that all the lines of text they entered would be undone in one go.


It turns out that this is not actually very intuitive. Instead what is preferable is for the sequence-optimizations to be 'broken' whenever a new line is started - in other words the sequence should only optimize whole lines of text and nothing more. To support this model a new function sequence::breakopt was introduced:


void sequence::breakopt()

This function is called whenever a single carriage-return is entered into the document. The result is more natural because editors are line-oriented by nature. Breaking the sequence-optimization in this way allows each line to be undone/redone individually. This does result in more spans being introduced into the piece-table - but there will always be a trade-off between usability and performance. In my opinion it is better to improve the user's overall experience rather than forcing optimization behaviours on them that they don't understand, and don't even care about.
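The effect of breaking the optimization can be sketched with a toy undo buffer. This is only a model of the idea under simplifying assumptions: ToyUndoBuffer is a hypothetical class that coalesces adjacent inserts into a single undo record, and it does not reflect how Neatpad's span-based sequence is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of insert coalescing (assumption, not Neatpad's sequence class).
class ToyUndoBuffer
{
public:
    // Insert 'str' at 'index'. The previous undo record is extended when the
    // new text starts exactly where it ended and no break was requested.
    void insert(std::size_t index, const std::string &str)
    {
        assert(index <= m_text.size());
        m_text.insert(index, str);

        if (m_can_optimize && !m_undo.empty() &&
            m_undo.back().index + m_undo.back().length == index)
        {
            m_undo.back().length += str.size();
        }
        else
        {
            m_undo.push_back({index, str.size()});
        }
        m_can_optimize = true;
    }

    // Force the next edit to start a new undo record.
    void breakopt() { m_can_optimize = false; }

    // Remove the text added by the most recent undo record.
    bool undo()
    {
        if (m_undo.empty())
            return false;
        m_text.erase(m_undo.back().index, m_undo.back().length);
        m_undo.pop_back();
        m_can_optimize = false;
        return true;
    }

    const std::string &text() const { return m_text; }
    std::size_t undo_count() const { return m_undo.size(); }

private:
    struct Record { std::size_t index; std::size_t length; };

    std::string         m_text;
    std::vector<Record> m_undo;
    bool                m_can_optimize = false;
};
```

If the caller invokes breakopt() after inserting each line ending, typing two lines produces two undo records, so each undo removes exactly one line rather than everything typed so far.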


Undo and Redo


The ability to perform undo and redo has always been a fundamental aspect of many text editors. Some do it better than others - with the best editors usually having undo support designed in from the very start. The piece-table implementation developed for Neatpad provides unlimited undo and redo - that is, any successful modification to the text-sequence is guaranteed to be undoable. Whilst this is a very nice feature in itself, an interesting issue is raised when the piece-table is integrated into Neatpad.


The issue is that of selection-highlighting and caret placement. You may or may not have noticed, but when you perform an undo (or redo) in your favourite editor, the text-caret is automatically repositioned to the offset where the last action occurred. This means that for every action taken on the text-sequence, the piece-table must keep track of the logical offsets that were specified in those operations. When the sequence is restored due to an undo, the sequence must be able to provide details of what range of data in the sequence was modified as a result of the undo/redo.


As a quick example, load up your favourite editor and open a file. Select a range of text and delete it, then reposition the cursor somewhere else in the file. Now hit 'undo' in the editor. The range of text that you just deleted will be inserted back into the file and will be reselected as it was just before the edit took place. To achieve this goal, the sequence class stores the offset and length of each operation in the corresponding span_range object. The sequence class provides two methods which provide access to these internal details:


size_w  sequence::event_index();
size_w sequence::event_length();

The TextView will call these two methods any time an undo or redo occurs in the editor. The event_index() function provides the logical offset of the last operation, whilst event_length() provides the length of data involved. Any time an undo/redo occurs, the internal state of the sequence will change, and these two functions will return the appropriate values that represent these actions. Whilst the only reason this information is needed is so that the caret can be repositioned during undo/redos, it is still an important aspect of the user interface.
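To make the delete-then-undo example concrete, here is a small sketch of how a view might turn these two values into a selection. ToyEventSequence, Selection and selection_after_undo are hypothetical names for this illustration only; the toy supports a single level of undo for erase operations, which is far less than the real sequence class provides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Toy sequence with one level of undo for erase (assumption, not Neatpad code).
class ToyEventSequence
{
public:
    explicit ToyEventSequence(std::string text) : m_text(std::move(text)) {}

    // Delete 'length' units at 'index' and remember what was removed.
    bool erase(std::size_t index, std::size_t length)
    {
        if (index > m_text.size() || length > m_text.size() - index)
            return false;
        m_erased       = m_text.substr(index, length);
        m_event_index  = index;
        m_event_length = length;
        m_can_undo     = true;
        m_text.erase(index, length);
        return true;
    }

    // Put the erased text back; the event range now describes the restored text.
    bool undo()
    {
        if (!m_can_undo)
            return false;
        assert(m_event_index <= m_text.size());
        m_text.insert(m_event_index, m_erased);
        m_event_length = m_erased.size();
        m_can_undo     = false;
        return true;
    }

    std::size_t event_index() const  { return m_event_index; }
    std::size_t event_length() const { return m_event_length; }
    const std::string &text() const  { return m_text; }

private:
    std::string m_text;
    std::string m_erased;
    std::size_t m_event_index  = 0;
    std::size_t m_event_length = 0;
    bool        m_can_undo     = false;
};

struct Selection { std::size_t start; std::size_t end; };

// What a view might do after an undo: reselect the affected range.
Selection selection_after_undo(const ToyEventSequence &seq)
{
    return { seq.event_index(), seq.event_index() + seq.event_length() };
}
```

After deleting " World" from "Hello World" and undoing, the computed selection covers offsets 5 to 11, regardless of where the caret was moved in between.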


Caret Placement


It has been difficult to test Neatpad's keyboard navigation behaviour because up until this point it was not possible to enter any text into the editor. Well it turns out there was a bug in the way the caret was being positioned in bidirectional text. I was never quite sure I had the logic right, and using Neatpad to edit mixed English / Arabic text confirmed this.


The Uniscribe topic Displaying the Caret in Bidirectional Strings (in MSDN) describes this problem by saying that "in bidirectional text, the caret position between runs of opposing direction is ambiguous." What this means is that for a string that contains both English and Arabic text, for certain character offsets in the string the caret can appear in more than one visible position.


Let's look at the familiar "HelloيُساوِيWorld" example again, with the caret positioned at offset 5 in the string:



You will see from the example above that there are two text-caret positions. This ambiguity arises because the last letter of 'Hello' immediately precedes the first letter of 'يُساوِي'. So because the Arabic string is printed right-to-left there are two visible positions for the caret, depending on whether it is considered to follow the 'o' of 'Hello' or to precede the first letter of 'يُساوِي'.


In Unicode text one can therefore think of the caret being displayed at the leading or trailing edge of a character. Another way to think of this is that the caret position is dependent on whether the caret is advancing forwards or backwards through the text. Any text editor supporting bidirectional text must therefore make this distinction as it is positioning the caret. It is not enough to simply convert the cursor offset to a visible x-coordinate. The direction the cursor was moving must also be taken into account:


if( fAdvancing )
    ScriptCPtoX(nCursorOffset - 1, TRUE, ..., &iCaretX);
else
    ScriptCPtoX(nCursorOffset, FALSE, ..., &iCaretX);

The logic is actually quite simple: if the caret is advancing forwards, it is positioned at the trailing edge of the preceding character (the second argument to ScriptCPtoX is its fTrailing flag, here TRUE), whereas if the caret is moving backwards it is positioned on the leading edge of the current character. This new logic is represented in Neatpad with a change to the UpdateCaretOffset function, which is used internally by the TextView:


VOID TextView::UpdateCaretOffset( ULONG    nCursorOffset,
                                  BOOL     fAdvancing,
                                  int   *  outx,
                                  ULONG *  outlineno )

The fAdvancing parameter is now required in addition to the character-offset. UpdateCaretOffset applies the logic shown above to position the caret correctly for bidirectional strings. It wasn't very difficult to adjust the keyboard navigation code either - the only change necessary was to set the fAdvancing variable according to the direction the cursor is moving. Keys such as VK_LEFT and VK_BACKSPACE are considered to move backward through the text, so fAdvancing is set to FALSE. For all other keys fAdvancing is set to TRUE. For example:


case VK_LEFT:
    fAdvancing = FALSE;
    nCursorOffset --;
    break;

case VK_RIGHT:
    fAdvancing = TRUE;
    nCursorOffset ++;
    break;
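The choice of character position and edge can be isolated into a small, portable helper, sketched below. CaretQuery and caret_query are hypothetical names and not part of Neatpad; the two fields correspond to the first two arguments of ScriptCPtoX (iCP and fTrailing). The guard for offset zero is my own assumption: subtracting one from an unsigned offset of zero would wrap around, so the sketch falls back to the leading edge of the first character in that case.

```cpp
#include <cassert>
#include <cstddef>

// Which character, and which edge of it, the caret x-coordinate is taken from.
struct CaretQuery
{
    std::size_t cp;        // character position to measure
    bool        trailing;  // true = trailing edge, false = leading edge
};

// Hypothetical helper mirroring the fAdvancing logic described above.
CaretQuery caret_query(std::size_t cursor_offset, std::size_t text_length,
                       bool advancing)
{
    assert(cursor_offset <= text_length);

    // Moving forwards: stay attached to the character just passed over.
    if (advancing && cursor_offset > 0)
        return { cursor_offset - 1, true };

    // Moving backwards (or at the very start): attach to the current character.
    return { cursor_offset, false };
}
```

For the "Hello" plus Arabic example with the caret at offset 5, advancing yields position 4 with the trailing edge (the caret sits after the 'o'), while moving backwards yields position 5 with the leading edge (the caret sits against the first Arabic letter).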

Vista is here


Like it or not Vista's arrival is imminent, and whilst I haven't quite made up my mind whether I'm a fan myself, Neatpad does require a couple of modifications in order to make it work correctly under this new operating system. The issue revolves around Administrator privileges and Vista's new User Account Control prompt. The problem occurs when Neatpad's Options dialog is invoked - because a couple of the settings ("Add Neatpad to Explorer context menu" and "Replace Notepad as default editor") require write-access to the HKEY_LOCAL_MACHINE branch in the registry. Obviously this requires Administrator privileges, and Vista's solution is to require applications to conform to the new User Account Control guidelines.



The first noticeable change is the new Vista Shield Icon - which indicates to the user that elevation to Administrator via the UAC prompt will be necessary. There are three mechanisms available to application programmers which invoke the UAC prompt:



  1. Embed a UAC manifest (an XML file) into an executable's resources that indicates which elevation level is required. The choices for the level attribute of the requestedExecutionLevel element (inside the requestedPrivileges section) are 'asInvoker', 'requireAdministrator' and 'highestAvailable'. Whilst we could employ this method to 'requireAdministrator', the problem is that the UAC prompt would be displayed every time Neatpad was run. This would be very frustrating to the user, especially as the only time Administrator access is required is when the two 'system settings' need to be changed. This is a sledge-hammer approach that is only intended to support older applications and is not suitable for a simple utility such as Neatpad.

  2. The second option is to use the COM Elevation Moniker, which MSDN wraps in a sample helper function named CoCreateInstanceAsAdmin. This mechanism allows a program to instantiate a COM object at Administrator level (assuming that the current user is a member of the Administrators group of course!). The COM object lives inside a separate process and allows the calling application to make controlled calls into the object. Whilst this is a technically neat way of doing things there are two problems. Firstly, we require an external DLL to be shipped, and secondly we have to be Administrator in order to install the COM DLL in the registry in the first place. This method is best suited for larger applications that use a proper Windows installer.

  3. The final option is to request a separate process be spawned at Administrator level. The ShellExecuteEx function can be called in this case, using the "runas" verb. When running under Vista, this will cause the UAC prompt to be displayed prior to launching the new executable. The child process will run with Administrator privileges, assuming that the user authorised the elevation.


Neatpad uses option 3 but instead of launching a separate program, it simply attempts to re-spawn itself. A special command-line option ("-uac") is used to instruct the new instance not to display its GUI. In this special mode, Neatpad sets the appropriate registry-keys under HKEY_LOCAL_MACHINE and exits. The registry access succeeds because the respawned Neatpad will be running as Administrator.
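A minimal sketch of the re-spawn approach might look like the following. The function names has_uac_switch and respawn_elevated are hypothetical, and the details (ANSI API variants, a hidden window, waiting for the child to exit) are assumptions made for this example rather than a description of Neatpad's actual source. The Win32 portion is guarded so that the command-line check can be compiled anywhere.

```cpp
#include <cassert>
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>
#endif

// Returns true if any argument after the program name equals "-uac".
bool has_uac_switch(int argc, const char *const argv[])
{
    assert(argv != nullptr);
    for (int i = 1; i < argc; ++i)
    {
        if (std::strcmp(argv[i], "-uac") == 0)
            return true;
    }
    return false;
}

#ifdef _WIN32
// Re-launch the current executable with the "runas" verb and "-uac" switch,
// then wait for the elevated instance to finish.
bool respawn_elevated(HWND hwndOwner)
{
    char path[MAX_PATH];
    DWORD len = GetModuleFileNameA(NULL, path, MAX_PATH);
    if (len == 0 || len == MAX_PATH)
        return false;

    SHELLEXECUTEINFOA sei = { sizeof(sei) };
    sei.fMask        = SEE_MASK_NOCLOSEPROCESS;  // we want the process handle
    sei.hwnd         = hwndOwner;
    sei.lpVerb       = "runas";                  // triggers the UAC prompt on Vista
    sei.lpFile       = path;
    sei.lpParameters = "-uac";
    sei.nShow        = SW_HIDE;

    // Fails (for example with ERROR_CANCELLED) if the user declines elevation.
    if (!ShellExecuteExA(&sei))
        return false;

    if (sei.hProcess != NULL)
    {
        WaitForSingleObject(sei.hProcess, INFINITE);
        CloseHandle(sei.hProcess);
    }
    return true;
}
#endif
```

At startup the program would check has_uac_switch first; when it returns true, the instance writes the registry settings and exits without creating any windows.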


Conclusion


Unicode text-editing is a complicated subject, mostly due to the occurrence of combining sequences. Fortunately the Uniscribe API vastly simplifies the amount of work we might otherwise have to do - I certainly wouldn't look forward to writing an editor without this kind of language support. The topic this time was also helped a great deal by having a solid piece-table implementation which plugged in very neatly to the editor design. I strongly advise anyone writing an editor to invest the time in developing a functional back-end - which should include all of the necessary edit operations as well as undo/redo support.


There is still a long way to go before Neatpad's editing is fully complete. The line-buffer implementation needs an urgent overhaul and this will be the topic of the next tutorial. Following on from this will be memory-management for large-file editing, which should hopefully be a very simple topic because the piece-table lends itself very well to on-disk editing.



Neatpad running under Windows Vista, using the Aero Glass theme.


I wrote last time that I was thinking of migrating to Visual Studio 2005 and C++ templates. Thank you first of all to everyone who provided feedback on this topic - however at this point I decided to stick with VC6 for the time being. Firstly, C++ templates are unnecessary because the sequence class is only being used to store raw bytes. And secondly, VC2005 executables introduce a dependency on a new C-runtime DLL (MSVCR80.DLL) which is not present on Windows systems by default. So rather than complicate matters the Neatpad project will remain VC6 compatible - but it will still build cleanly under VC2005 as well.


In the meantime I've come across a few interesting editors which are worth a mention. The first is JujuEdit from Jujusoft. It's freeware but closed-source. The editor has 'very large file support' - files up to 2 GB - and is actually very impressive in its large-file handling. Next is Intype which looks to be an interesting editor still in development, and also e - the collaborative text editor for Windows, which has an interesting approach to undo/redo. Thanks to Franck Marcia for providing these last two links.


Thanks also to everyone who pointed me at Colorer - a very impressive open-source library for regular-expression-based syntax colouring. When the time comes to implement syntax-colouring in Neatpad this library will hopefully make things much easier.


Attachment       Size
neatpad18.zip    216.98 KB