Wildcards matching

Wildcards matching

Salvo Isaja
Hi there,
the new FAT driver version (currently read-only) seems to work properly
and I think I'll be able to commit it in a few days. I'm now dealing
with file name matching using wildcards, as there are several
conventions to choose from.

DOS (8.3 names) takes the ugly way: '*' is converted to as many '?' as
are needed to fill the name or the extension (anything following the
'*' in that field is dropped).
For example ab*cd.e* becomes ab??????.e?? and the comparison is done
character by character. I have no idea how this works with double byte
character sets, if a '?' is in the place of only the leading or trailing
byte (maybe Hanzac can help).
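
A minimal sketch of the expansion described above, written in Python for
brevity. The function name expand_to_fcb, the uppercasing step and the
space padding of unused positions are assumptions made for illustration,
not code taken from the FAT driver:

```python
def expand_to_fcb(pattern: str) -> str:
    """Expand a DOS 8.3 wildcard pattern into an 11-character template
    (8 characters of name followed by 3 of extension, no dot)."""
    name, _, ext = pattern.partition(".")

    def expand(field: str, width: int) -> str:
        out = []
        for ch in field:
            if len(out) == width:
                break  # field is full, ignore the rest
            if ch == "*":
                # '*' fills the remainder of the field with '?';
                # whatever follows it in this field is dropped.
                out.extend("?" * (width - len(out)))
                break
            out.append(ch.upper())  # assumption: names are upper case
        # Assumption: unused positions are padded with spaces.
        return "".join(out).ljust(width)

    return expand(name, 8) + expand(ext, 3)


print(expand_to_fcb("ab*cd.e*"))  # AB??????E??
```

Note that the "cd" after the '*' disappears, which is why the original
pattern cannot be recovered from the expanded template.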

Linux uses Posix wildcards: '*' and '?' work as you would expect, the
former for any string, possibly empty, the latter for exactly one
character. There are also other wildcards (namely, [...], see "man 7
glob"), that we could support, but I'm not sure if they are needed.
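
For comparison, the POSIX-style meaning can be tried out with Python's
standard fnmatch module, used here only as a stand-in for the C
fnmatch() function. The file names are made up, and note that this
module does not treat a leading dot specially:

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing used only for this illustration.
names = ["readme", "readme.txt", "a.c", ".profile"]


def posix_match(pattern, candidates):
    """Return the candidates matching the pattern, case sensitively."""
    return [n for n in candidates if fnmatchcase(n, pattern)]


all_files = posix_match("*", names)       # every name
with_dot = posix_match("*.*", names)      # only names containing a dot
one_char_ext = posix_match("*.?", names)  # dot followed by one character
```

Here "*.*" does not match "readme", because the dot in the pattern must
correspond to an actual dot in the name.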

Windows (hence, as far as we are concerned, DOS with long file names)
seems to use '*' and '?' with their usual meaning, but... both "*" and
"*.*" match all files. "*.*" matches even files without an extension,
hence without a dot, consistently with DOS but inconsistently with the
standard meaning... This is what I called the "implicit dot" in the old
FAT driver wildcard matching.
Quite appropriately, "*." matches all files without an extension, even
if they have no dot, but so does "*.??", even if the file has no
extension at all! In practice, any wildcards after the trailing dot are
ignored and the pattern is silently treated as ending with "*".
Moreover, IIRC, searching for "test.." matches the file "test", which
has no extension... several implicit dots!

That said, what wildcard matching convention should we use in FD32,
namely in the new version of the FAT driver?

Bye,
   Salvo
_______________________________________________
freedos-32-dev mailing list
https://lists.sourceforge.net/lists/listinfo/freedos-32-dev

Re: Wildcards matching

Martin "E.T." Misuth
If I may, I would propose the following patterns:

*          - any file, with or without extension
*.         - only files without extension
*.*        - only files with extension
test...    - the file with the exact name "test..."
test.*     - files named "test" with any extension
test.*.bmp - every file whose name begins with "test." and whose
             extension is ".bmp"

- only the last dot, if present, is considered the extension delimiter.

What do you think about it?


RE: Wildcards matching

Hanzac Chen
In reply to this post by Salvo Isaja
Hi, Salvo

Salvo Isaja wrote:
>...
>DOS (8.3 names) takes the ugly way: '*' is converted to as many '?' as to
>fill the name or the extension.
>For example ab*cd.e* becomes ab??????.e?? and comparison is at character
>level. I have no idea of how this works with double byte character sets, if
>a '?' is in the place of only the leading or trailing byte (maybe Hanzac
>can help).
I don't know ... Isn't the new FAT driver using UTF-8 character encoding?

>Linux uses Posix wildcards: '*' and '?' work as you would expect, the
>former for any string, possibly empty, the latter for exactly one
>character. There are also other wildcards (namely, [...], see "man 7
>glob"), that we could support, but I'm not sure if they are needed.
>
>Windows (hence, as far as we are concerned, DOS with long file names),
>seems to use '*' and '?' with their usual meaning, but... Both "*" and
>"*.*" match all files. "*.*" matches even files without an extension, hence
>without a dot, consistently with DOS, but unconsistently with the standard
>meaning... This is what I called the "implicit dot" in the old FAT driver
>wildcard matching.
>Very appropriate, "*." matches all files without an extension, even if they
>have not a dot, but also "*.??" does, even if the file have no extension at
>all!
I think *.?? should return all the files that have a two-character
extension, because no one wants to use this to match files without
an extension; they will tend to use "*." for that.

I do think we should keep the original meaning of these wildcards as
much as possible. If it can't support some rare and strange combination,
it's not a mistake.

Because long names can contain spaces and dots, the dot no longer
strongly indicates that the characters following it are an extension.
So having "*.*" match every file seems reasonable.

BTW: Maybe we can support posix wildcards and even regular expressions
...

I also don't know whether file matching is implemented in the file
system driver or at a higher level (e.g. the DPMI module)?

Hanzac


Re: Wildcards matching

Salvo Isaja
In reply to this post by Martin "E.T." Misuth
Hi all,
thanks for your suggestions.

On Saturday 28 May 2005 01:07, Martin "E.T." Misuth wrote:
> *     - any file, with or without extension
> *.    - only files without extension
> *.*  - only files with extension

The problem with this is that when I read "*.*" and forget for a
moment how it used to work in DOS (it was translated to "???????????"
after all), it should mean "anything, a dot, and anything", but the dot
may not be present. For example under Linux "*." will not show files
without an extension, as they don't end with a dot.
BTW, under Linux "*" does not match names starting with a dot, while
under Windows/DOS with long names it does.

> test... - file with exact name name "test..."

AFAIK under Windows "*...." matches any file without extension, as well
as the "." and ".." directory entries! That is, any file with *at most*
the specified number of dots, which makes no sense to me.

> - as extension delimiter only last dot is considered, if present.
>
> What do you think about it?

These are simple cases (the most used and useful ones, indeed), and I
have no problem with them. As we are reproducing a DOS system we have
to use a less strict behavior for "*" and "?" than POSIX, as a lot of
DOS users are used to "*.*" (but also "*.???" for that matter) meaning
any file.

On Saturday 28 May 2005 03:09, Hanzac Chen wrote:
> >DOS (8.3 names) takes the ugly way: '*' is converted to as many '?'
> > as to fill the name or the extension.
> >For example ab*cd.e* becomes ab??????.e?? and comparison is at
> > character level. I have no idea of how this works with double byte
> > character sets, if a '?' is in the place of only the leading or
> > trailing byte (maybe Hanzac can help).
>
> I don't know ... Isn't the new FAT driver using UTF-8 character
> encoding?

Yes it does, as the old one did. Long names are stored in UTF-16, while
short names in a national code page: when doing lookups I convert them
to wide characters and do character matching in Unicode.

The problem comes when we have to implement the DOS-style FINDFIRST and
FINDNEXT system calls, where you get an 11-byte template, 8 bytes for
the name, 3 for the extension, no dot between them ("???????????"
matches any file). This makes sense only for short names, and maybe only
for the FAT file system, but it must be implemented anyway since, for
example, "our" command.com replacement uses them (as in "dir"). I match
that 11-byte template against the short name as stored in FCB format in
the directory entry, hence the problem.
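
As a rough illustration of that comparison (a Python sketch, not the
driver's code; the helper names to_fcb and fcb_match are hypothetical),
matching an 11-byte template against an 11-byte FCB-format name is a
plain byte-by-byte loop in which '?' accepts any byte, including the
padding spaces:

```python
def to_fcb(short_name: str) -> bytes:
    """Pack an 8.3 name such as 'README.TXT' into 11 space-padded bytes."""
    base, _, ext = short_name.upper().partition(".")
    return (base.ljust(8)[:8] + ext.ljust(3)[:3]).encode("ascii")


def fcb_match(template: bytes, fcb_name: bytes) -> bool:
    """Compare byte by byte; a '?' in the template matches any byte."""
    if len(template) != 11 or len(fcb_name) != 11:
        raise ValueError("both arguments must be exactly 11 bytes")
    return all(t == ord("?") or t == n for t, n in zip(template, fcb_name))


matches_all = fcb_match(b"?" * 11, to_fcb("README.TXT"))
matches_short = fcb_match(b"AB??????E??", to_fcb("ABCD.EXE"))
matches_noext = fcb_match(b"????????TXT", to_fcb("README"))
```

Because '?' also accepts a padding space, a run of '?' effectively means
"at most this many characters" rather than "exactly this many".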

BTW, do you know what OEM code page you are using for the DOS shell and
short file names on FAT? You could test how the above works by creating
several files, each named with a single Chinese ideogram (I assume it
will be a multibyte character), that share the same lead byte and differ
only in the trailing byte. Then you can call INT 21h AH=4Eh/4Fh with a
mask composed of the common lead byte and a '?', and see what files are
returned (I guess all of them, as I guess DOS will do byte matching,
not character matching).
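
The guess above can be simulated without a DOS machine. This Python
sketch only models byte-level matching; it does not show what DOS
actually does. The two names are arbitrary GBK double-byte sequences
chosen to share the lead byte 0xD6:

```python
def byte_match(template: bytes, name: bytes) -> bool:
    """Byte-level matching: '?' stands for exactly one byte."""
    return len(template) == len(name) and all(
        t == ord("?") or t == n for t, n in zip(template, name)
    )


# Two hypothetical short names, each a single double-byte character
# in GBK (code page 936) with the same lead byte and different trail bytes.
names = [b"\xd6\xd0", b"\xd6\xd1"]

# Mask made of the common lead byte followed by '?'.
template = b"\xd6?"

matched = [n.decode("gbk") for n in names if byte_match(template, n)]
```

With byte matching both names are returned, although the '?' covers only
the second half of each character.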

Corollary: in a perfect world we would use Unicode for everything, and
people would live in peace; unfortunately neither happens...

> I think *.?? should return all the files have extentions which have 2
> characters. 'cause no one wants to use this to match all files
> without an extension, they will tend to use "*.".

AFAIK under Windows if you search for "*.??" you would get any file with
*at most* two characters of extension. This is very non-standard, and
comes from the old DOS convention where name and extension used to be
space padded to 8.3. There is some sample code I once read where the
author used "*.???" to match any extension.

> Because long names can have spaces and dots, so the dot, ".", doesn't
> have the strong meaning as before to indicate the following
> characters is an extension name. So "*.*" matches every file seems to
> be reasonable.

Even if the file name contains no dot at all, you mean? This is
non-standard, but we have to emulate it to some extent. Basically, we
have to choose how much to deviate from the standard in order to act
like Windows (the idea is to keep FD32 general).
Maybe we can add a command line switch for "strict" (POSIX) or
"loose" (DOS-compatible) wildcard meaning.

My proposal:
* is any string, possibly empty; ? is exactly one character.
The last dot in the wildcard pattern is used as a conventional
extension separator rather than as an actual dot in the file name: if
the name contains no dots, it is treated as if it ended with a dot.
That is to say, we split the match in two parts: name and extension.
This should let us use "*." and "*.*" as said before.
This is somewhat incompatible with DOS/Windows, but should save a lot of
headaches. What do you think?
BTW, I was talking about the "serious" name matching services, whereas
the FINDFIRST and FINDNEXT DOS services for short names need the "*"
wildcards to be expanded into "?" wildcards as said before.
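
One possible reading of this proposal, sketched in Python. The names
wild_match, split_last_dot and fd32_match are hypothetical, matching is
case sensitive here, and the handling of a pattern that contains no dot
at all is an assumption, since the proposal does not spell it out:

```python
def wild_match(pattern: str, text: str) -> bool:
    """'*' matches any string (possibly empty), '?' exactly one character."""
    p = t = 0
    star, mark = -1, 0  # position of the last '*' and the text index then
    while t < len(text):
        if p < len(pattern) and pattern[p] == "*":
            star, mark = p, t
            p += 1
        elif p < len(pattern) and pattern[p] in ("?", text[t]):
            p += 1
            t += 1
        elif star != -1:
            # Backtrack: let the last '*' absorb one more character.
            mark += 1
            p, t = star + 1, mark
        else:
            return False
    while p < len(pattern) and pattern[p] == "*":
        p += 1  # only trailing '*' may be left over
    return p == len(pattern)


def split_last_dot(s: str):
    """Split at the last dot; without a dot the extension is empty."""
    base, dot, ext = s.rpartition(".")
    return (base, ext) if dot else (s, "")


def fd32_match(pattern: str, name: str) -> bool:
    if "." not in pattern:
        # Assumption: a dotless pattern is matched against the whole name.
        return wild_match(pattern, name)
    pat_base, pat_ext = split_last_dot(pattern)
    base, ext = split_last_dot(name)
    return wild_match(pat_base, base) and wild_match(pat_ext, ext)
```

Under this reading "*.*" and "*" both match "readme", "*." matches only
names with an empty extension, and "*.??" requires exactly two
characters of extension.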

> BTW: Maybe we can support posix wildcards and even regular
> expressions ...

Please don't, they seem to me a nightmare to implement! :-)

> I also don't know if the file matching part is implemented in the
> file system or the higher level (e.g. DPMI module) ??

In a previous version of the file system stuff, the file system driver
used to expose a readdir operation, and name matching was done at a
higher level (in the fs layer, but it could have been the DPMI driver
as well).

This approach, per Luca's request, has been replaced with the current
one, where name matching is performed inside the file system driver,
for the following reasons:
1) different file system drivers may need to perform different
operations for name matching, for example case sensitive or not, or
using different character sets;
2) case insensitive matching needs case folding, and as we use (and must
use, as far as I'm concerned) Unicode, this is not trivial, and the
Unicode library must be used in the module (or even the kernel)
performing name matching.

Reason 1 is actually simple to address, as we could do everything in
Unicode (readdir returns a UTF-8 string) and use a "case sensitive"
flag in the matching function (see "man fnmatch" under Linux). But this
brings back the need for the Unicode library in that part of FD32,
whereas it is currently needed only for the FAT driver.
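
To illustrate what such a flag would involve, here is a Python sketch
(match_name is a hypothetical helper). It uses the standard library's
full Unicode case folding, which is an assumption and may well differ
from the folding rules a FAT driver would want:

```python
from fnmatch import fnmatchcase


def match_name(pattern: str, name: str, case_sensitive: bool = False) -> bool:
    """Wildcard match with an explicit case sensitivity flag."""
    if not case_sensitive:
        # Unicode case folding is more than ASCII upper/lower mapping:
        # for instance 'ß' folds to 'ss', changing the string length.
        pattern, name = pattern.casefold(), name.casefold()
    return fnmatchcase(name, pattern)


# A name as readdir might return it: UTF-8 bytes decoded to text.
entry = "Straße.txt".encode("utf-8").decode("utf-8")

loose = match_name("STRASSE.*", entry)
strict = match_name("STRASSE.*", entry, case_sensitive=True)
```

The case insensitive path is the one that drags in the Unicode tables,
which is the dependency mentioned above.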

Bye,
   Salvo

Re: Wildcards matching

Hanzac Chen
Hi, Salvo

>From: Salvo Isaja
>...
>The problem is when we have to implement the DOS-style FINDFIRST and
>FINDNEXT system calls, where you get a 11-byte template, 8 bytes for
>the name, 3 for the extension, no dot between them ("???????????"
>matches any file). This makes sense only for short names, and maybe only
>for the FAT file system, anyway it must be implemented as, for example,
>"our" command.com replacement uses them (like in "dir"). I match that
>11-byte template with the short name as stored in FCB format in the
>directory entry, hence the problem.
>
>BTW, do you know what OEM code page are you using for the DOS shell
>and short file names on FAT?
On my Windows, it uses CP 936 for Chinese GBK encoding.
On DOS, it uses CP 437 ...

>You could test how the above works by creating
>several files named with one Chinese ideogram (I assume it will
>be a multibyte character) with the same lead byte and that differ only
>for the trailing byte. Next you can call INT 21h AH=4Eh/4Fh with a mask
>composed by the common lead byte and a '?', and see what files are
>returned (I guess any of them, as I guess DOS will do byte matching,
>not character matching).
In this circumstance, I would guess the same result as you. But AFAIK,
the lead byte of East Asian characters is always greater than 0x7F, so
normally such a character is not typed ...

>Corollary: in a perfect world we would use Unicode for everything, and
>people would live in peace; unfortunately neither happens...
If you convert both the template and the file name to Unicode, is it
possible to do character-by-character matching, even when doing DOS
style FINDFIRST and FINDNEXT?

If that is possible, it will spare a lot of nerves, I think. :-)

> > I think *.?? should return all the files have extentions which have 2
> > characters. 'cause no one wants to use this to match all files
> > without an extension, they will tend to use "*.".
>
>AFAIK under Windows if you search for "*.??" you would get any file with
>*at most* two characters of extension. This is very non-standard, and
>comes from the old DOS convention where name and extension used to be
>space padded to 8.3. There is some sample code I once read where the
>author used "*.???" to match any extension.
"*.???" can't match any extension; I tested on my Windows. "*.???"
strictly selects files with a 3-character extension.

> > Because long names can have spaces and dots, so the dot, ".", doesn't
> > have the strong meaning as before to indicate the following
> > characters is an extension name. So "*.*" matches every file seems to
> > be reasonable.
>
>Even if the file name contains no dot at all, you mean? This is
>non-standard, but we have to emulate it to some extent. Basically, we
>have to chose "how much" we have to deviate from the standard to act
>like Windows (the idea is to keep FD32 general).
>Maybe we can put a command line switch for "strict" (Posix) or
>"loose" (DOS-compatible) wildcard meaning.
In a long file name world, maybe we should think of files as not having
an extension at all. :-) Therefore, I consider "*.*" just a special
matching template that means "all files".

>My proposal:
>* is any string, possibly empty, ? is exactly one character.
>The last dot in the wildcards pattern is used as a conventional
>extension separator instead of an actual dot in the file name: if the
>name contains no dots, it is like if it conventionally ends for dot.
>That is to say, we split the match in two parts: name and extension.
>This should let us use "*." and "*.*" as said before.
>This is somewhat incompatible with DOS/Windows, but should save a lot of
>headaches. What do you think?
>BTW, I was talking of the "serious" name matching services, whereas the
>FINDFIRST and FINDNEXT DOS services for short names need the "*"
>wildcards to be expanded in "?" wildcards as said before.
I'm fine with it.
BTW: I still think "*.*" should match all the files in a directory, with
or without an extension.

Hanzac


Re: Wildcards matching

Salvo Isaja
Hi Hanzac,

On Saturday 28 May 2005 13:47, Hanzac Chen wrote:
> >BTW, do you know what OEM code page are you using for the DOS shell
> >and short file names on FAT?
>
> On my Windows, it uses CP 936 for Chinese GBK encoding.
> On DOS, it uses CP 437 ...

So, the short names are CP 437? Can't you save Chinese characters in
short file names?

> >You could test how the above works by creating
> >several files named with one Chinese ideogram (I assume it will
> >be a multibyte character) with the same lead byte and that differ
> > only for the trailing byte. Next you can call INT 21h AH=4Eh/4Fh
> > with a mask composed by the common lead byte and a '?', and see
> > what files are returned (I guess any of them, as I guess DOS will
> > do byte matching, not character matching).
>
> In this circumstance, I guess the same result as you. But AFAIK,
> eastern Asia characters' lead byte is always bigger than 0x7F. So
> normally, such a char is not typed ...

Yes, the lead byte in MBCS is >7F (BTW, I type them with my Italian
keyboard òàùé ;-), anyway, the opposite could happen. At least with
the Japanese SJIS, I read that the trailing byte can be in the ASCII
range, so "?<value below 80h>" may actually map to a double byte
character.
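
A concrete case can be checked with Python's shift_jis codec. The
katakana "ソ" is used here as an example of a character whose trailing
byte falls in the ASCII range; the byte-matching helper is the same
simplified model as before, not DOS itself:

```python
# Encode one double-byte character in Shift JIS.
so = "ソ".encode("shift_jis")
lead, trail = so[0], so[1]


def byte_match(template: bytes, name: bytes) -> bool:
    """Byte-level matching: '?' stands for exactly one byte."""
    return len(template) == len(name) and all(
        t == ord("?") or t == n for t, n in zip(template, name)
    )


# A mask of '?' followed by an ASCII backslash (5Ch) matches the
# two bytes of this single character.
hit = byte_match(b"?\\", so)
```

So a mask that looks like "any character, then a plain ASCII character"
can end up selecting a name consisting of one double-byte character.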

> >Corollary: in a perfect world we would use Unicode for everything,
> > and people would live in peace; unfortunately neither happens...
>
> If you convert both the template and the filename to Unicode, is it
> possible to do a character-to-character matching? Even when doing DOS
> style FINDFIRST and FINDNEXT.
>
> If it's able, it will ease a lot of nerves I think. :-)

I'm doing it this way because I have to save the search state in a fixed
structure, imposed by the DOS API (see INT 21h AH=4Eh), and that
structure has very little space for the file name, namely 11 bytes.
Hence, there is no room for a generic template (it could overflow the 11
bytes), and I'm using the FCB format (packing name and extension) as
regular DOS does, expanding "*"s to "?"s, and once they're all silly
"?"s, you can't go back. If you have a better idea, I'll be glad to go
for it.
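
To make the space constraint tangible, here is a Python sketch of a
fixed-size search state. The layout is purely hypothetical: only the
11-byte template comes from the description above, while the other
fields, their sizes and the 21-byte total are assumptions made for the
example:

```python
import struct

# Hypothetical layout: 11-byte template, 1-byte attribute mask,
# 4-byte directory cluster, 4-byte entry index, 1 pad byte.
STATE = struct.Struct("<11sBIIx")


def save_state(template: bytes, attrs: int, dir_cluster: int, index: int) -> bytes:
    """Pack the search state; the template must fit exactly 11 bytes."""
    if len(template) != 11:
        raise ValueError("template must be exactly 11 bytes")
    return STATE.pack(template, attrs, dir_cluster, index)


def load_state(blob: bytes):
    """Unpack to (template, attrs, dir_cluster, index)."""
    return STATE.unpack(blob)


blob = save_state(b"AB??????E??", 0x10, 2, 5)
```

A long-name pattern of arbitrary length has nowhere to go in a record
like this, which is why the expanded FCB-style template is stored
instead.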

> >AFAIK under Windows if you search for "*.??" you would get any file
> > with *at most* two characters of extension. This is very
> > non-standard, and comes from the old DOS convention where name and
> > extension used to be space padded to 8.3. There is some sample code
> > I once read where the author used "*.???" to match any extension.
>
> "*.???" can't match any extension, I tested on my Windows. *.???
> strictly indicate that files with 3-character extension name.

Have you tried from the command prompt, from the Find dialog, or using
the plain system call (which AFAIK is supposed to work as from the
command prompt)?
Anyway, I'm more comfortable with your result! :-)

> In a long file name world, maybe we should think that files not
> having an extension name. :-) Therefore, I consider "*.*" is just a
> special matching template and it means all.

OK.

> >My proposal:
> >* is any string, possibly empty, ? is exactly one character.
> >The last dot in the wildcards pattern is used as a conventional
> >extension separator instead of an actual dot in the file name: if
> > the name contains no dots, it is like if it conventionally ends for
> > dot. That is to say, we split the match in two parts: name and
> > extension. This should let us use "*." and "*.*" as said before.
> >This is somewhat incompatible with DOS/Windows, but should save a
> > lot of headaches. What do you think?
> >BTW, I was talking of the "serious" name matching services, whereas
> > the FINDFIRST and FINDNEXT DOS services for short names need the
> > "*" wildcards to be expanded in "?" wildcards as said before.
>
> I'm fine with it.
> BTW: I still think "*.*" should match all the files in a directory
> with or without
> a extension name.

So do I, just because it is the "standard" (maybe not the right word
here ;-) way DOS behaved and lots of people are familiar with it.
The above proposal should cope with this.
I've implemented the above proposal and, until my girlfriend arrives at
my place, I'm debugging it, since of course it didn't work on the first
try :-)

Bye,
   Salvo

Re: Wildcards matching

Hanzac Chen
> > On my Windows, it uses CP 936 for Chinese GBK encoding.
> > On DOS, it uses CP 437 ...
>
>So, the short names are CP 437? Can't you save Chinese characters in
>short file names?

Yes, they are, and Chinese characters can be saved in short names,
directly in their original encoding. I also assume these characters
are converted to UTF-16 when they are saved in long names, because I
can't recognize them anymore. :-)