Monday, January 5. 2009
Seven Things About Me You May Not Know (And Probably Won't Care About)
Due to the efforts of Brian Moon and Michelangelo van Dam, I've been sucked into a meme started by Tony Bibbs. My initial reaction to this unfortunate event was ... (envision Stephen Colbert, hands raised...) "Noooooo!!!" But I got over it. Hey, it's the holiday season, I might as well be a good boy and fulfill the modern-day geek's equivalent of a chain letter. So, without further ado, here is my list of seven things about me you probably couldn't care less about and will skip over to see if you are on my list of tagged people. (Yeah, you know you will.)
Here are the folks I'd like to know a little more about:
And, of course, the rules in case anyone missed them:
Thursday, December 18. 2008
I Didn't Say "Screw Windows"
A number of commenters on my previous entry thought I was basically saying "Screw Windows". Lukas Smith and Bill Karwin, both of whom I respect enormously, noted that Windows is a dominant development platform for MySQL users, and that one of the reasons for MySQL's popularity is that it has run smoothly on Windows for a while now. Bill and Lukas: You are 100% correct. That said, what I wrote was this:
Forget Windows for now: Use open source, community-maintained, and standardized libraries within the kernel. Don't rewrite libc and various other quality open source libraries because of Not Invented Here syndrome or because Windows lacks these things. Focus on the standards and don't bother with platforms that don't conform to POSIX. If Microsoft wants future MySQL versions to run on its platforms, partner with Microsoft and have them do the port. While you're at it, drop support for old platforms like Netware and other crap that is obsolete.
The above is advice to the team in charge of re-architecting the database server. What it boils down to is this: focus first and foremost on using standard, community-maintained open source libraries and on creating a neatly-architected, clean and lean kernel. After that, worry about Windows ports.
So, it's not about kicking Windows to the curb. It's about priorities. The priority should be clean code. Because having clean, easy-to-understand code leads to:
I posit that the above three things would make a Windows port much easier, and cleaner. More developers able to contribute, less time trying to figure out the spaghetti, and less chance the port would break tangential pieces of code. And thus, I believe a solid Windows port of a core kernel would be easier and faster to complete once a POSIX-compliant and clean kernel is completed.
Monday, December 15. 2008
My advice to MySQL
Here is my advice to MySQL. Take it or leave it. Time will tell whether I'm full of shit.
MySQL 5.1 is out the door. Awesome. Great job to all the folks who fixed the thousands of bugs over the last 3 years. MySQL 5.1 should be faster and more stable than 5.0 because of those bug fixes, and features like partitioning are welcome additions for the small percentage of MySQL users who need that functionality. And, even if there are some bugs in partitioning (what feature doesn't have any bugs?), the partitioning feature is as good as or better than what competing products offer. Good job.
However, going forward, here is my advice to MySQL engineering: stop all work on new 6.0 features entirely. Don't scrap the features, just stop development on them now. Take one month to figure out how to restructure MySQL engineering and priorities with the following steps:
Suggested Steps
Drop the current roadmap: Continuing down the current roadmap without addressing the core problems of a Frankenstein-like core database server kernel will mean that the current roadmap features will take 2-3 times as long to develop. Stop this now. Refocus on re-architecting the kernel to a 21st century, modular design. Tell sales and marketing that you are taking these steps to ensure the long-term viability of the MySQL name and product line. Make two teams: a maintenance team which maintains server versions <= 5.1 and a team which is dedicated entirely to redesigning the MySQL server kernel into a streamlined black box.
Reduce the headcount of the MySQL engineering team if necessary to contain only those engineers who have the ability to design modular, pluggable systems.
Give up on backwards compatibility: To make the changes necessary without making the kernel even more complex than it already is, you will need to relinquish the idea that backwards compatibility is necessary. Guido van Rossum already made this decision for Python 3, recognizing the need for it. MySQL needs to do the same. SQL_MODE? Scrap it. Only do what is correct according to data integrity and SQL standards. Reduce the code complexity and code paths and you will find that features are easier to develop and fixes are easier to identify.
Forget Windows for now: Use open source, community-maintained, and standardized libraries within the kernel. Don't rewrite libc and various other quality open source libraries because of Not Invented Here syndrome or because Windows lacks these things. Focus on the standards and don't bother with platforms that don't conform to POSIX. If Microsoft wants future MySQL versions to run on its platforms, partner with Microsoft and have them do the port. While you're at it, drop support for old platforms like Netware and other crap that is obsolete.
Make all decisions open and transparent: For the non-maintenance team, make a policy that all decisions about the kernel design be made in an open forum, with the community able to participate in the discussion. Have stewards who are willing to negotiate the design decisions with the community and do everything in a transparent manner.
Focus all energy towards the APIs: Think of the server kernel as simply a provider of services. Clearly and consistently define these services (as interfaces and plugin APIs) and have the community of engineers vet the design of these APIs. Once these interfaces are clearly defined, document them on public wikis.
Clean up the abysmal messiness of the code base: Refactor, decruft, and standardize the code base to a C99 (minimal), warning-free environment that uses stdint, stdbool, proper STL templates, and other stuff that has been standard for 5+ years. Clear your heads of the premature optimization syndrome that infects the code base and makes it messy and cluttered. You will find that there are many community resources that would be happy to help in this effort.
Once done, BSD the kernel and turn it over to the open source community: Once the above is done, BSD the kernel code base and let the community support it entirely. Then, focus your energies on creating value-added features as plugins around that community-supported core kernel. Use the resources and expertise in your engineering department to develop niche addons that paying customers want. Package branded versions of the MySQL server (closed or open source) that include a number of these value-added plugins that target a specific industry, such as data-warehousing or security-conscious environments. Sell those packages as Enterprise packages with an Enterprise price point. Provide all support and services for these Enterprise-branded MySQL server packages.
How Long?
Based on what the Drizzle project has been doing, I predict that doing ALL of the above steps would take approximately 12 months to achieve a version 1.0 of a stable, modular kernel. I believe that features could be developed as plugins to that kernel in less time than if the work were not done and the features for 6.x were developed as they currently are.
Predictions
If the above steps are taken, here is what I predict would be the outcome:
Reduction in maintenance costs of the core server by 80%: Turning maintenance of the kernel over to the community will, by itself, reduce maintenance costs.
Simplifying the kernel code base will bring an even bigger reduction in maintenance costs, since one "fix" won't break other things nearly as often as "fixes" do today. This reduction in maintenance costs means that Sun can allocate more of its internal engineering resources to developing value-added plugins which are sold to customers. Because more developer resources are now dedicated to revenue-producing activities, the long-term viability of the database engineering department is ensured.
Sales and marketing efforts become easier: Currently, MySQL sales and marketing are undeniably hindered by two things:
By following the steps above, these problems are tackled. A simpler and community-supported kernel means a more stable kernel. A more stable kernel means a shorter, more incremental release cycle. The lack of differentiation is solved by MySQL now being able to focus on value-added plugins in branded MySQL packaging. These branded packages are much easier for a sales force to sell, since they represent clear, differentiated value to the customer. When sales and marketing of a product become easier, only one thing is bound to happen: a strong increase in sales.
MySQL will once again return to the Open Source community: Much has been made of the inability of community contributors to get contributions into the MySQL server in a reasonable timeframe. By opening up the design and development of the kernel to the community, MySQL would restore much of the trust it has lost in recent years. Instead of being seen as "throwing a bone" to the open source community every once in a while, Sun/MySQL engineering would be seen as an active and trusted partner in open source contributions, stewardship and development.
Conclusion
Or, I am completely full of it and the above is a waste of time.
Saturday, November 22. 2008
The Drizzle Snowman - WIN!
Stewart, Brian and I are having a little fun this morning. One of the niceties of having real UTF8 support in Drizzle is that we can now have really fun table names. Behold, the glory of Drizzle:
drizzle>> create table ☃ (a int not null);
Query OK, 0 rows affected (0.01 sec)
drizzle>> show create table ☃\G
*************************** 1. row ***************************
Table: ☃
Create Table: CREATE TABLE `☃` (
`a` int NOT NULL
) ENGINE=InnoDB
1 row in set (0.00 sec)
Yep, that's a snowman. MySQL? Well, not so much:
mysql> select @@character_set_system;
+------------------------+
| @@character_set_system |
+------------------------+
| utf8                   |
+------------------------+
1 row in set (0.00 sec)
mysql> create table ☃ (a int not null);
ERROR 1064 (42000): You have an error in your SQL syntax; \
check the manual that corresponds to your MySQL server version \
for the right syntax to use near '�� (a int not null)' at line 1
/me goes off to record snowman.test.
UPDATE: Apparently this is not an error in MySQL after all. As long as you issue SET NAMES utf8 in the client, everything works as expected.
Friday, November 21. 2008
Drizzle Cirrus Milestone - Moving Forward
Although the MySQL server does have community contributions in some of the releases, the Cirrus milestone marks something of a new day in MySQL-related development. Cirrus contains tasks which are actively being developed by external contributors. This may not sound like a huge deal, but it is. In the past, contributions have been included in the MySQL server; however, these contributions have always been included after the code had already been contributed. For instance, Jeremy Cole's SHOW PROFILES patch, although heavily modified from its original submitted form, was included in MySQL Community Server after a long period of code review and modification. However, to my knowledge, the code contributor community has never been actively involved either in ongoing feature development for a release or in the direction in which the server is developed.
Cirrus marks a new day. Not only are tasks for Cirrus assigned to external contributors, but the decision-making and strategic power of the release is in the community's hands. The only reason a community member would not have a say in the direction of the server is if they don't speak up and share an opinion. As of this morning, there are 299 members of the drizzle-discuss mailing list. All of these members have a say in what gets done in Drizzle. This makes me a happy boy.
A Note on What a "Release" Is
Before the emails start firing off about what's in the first release of Drizzle and when it will come, I'd like to note that we are not going for a "big bang" approach to releasing software. The tasks I outline below are targets for a milestone. These tasks do not mean that the first release of Drizzle will contain all of the listed items. In fact, some of them likely won't make it into the first release, and other tasks not currently listed for the milestone will "make it in". Although the community will eventually decide the release model, most (all?) 
of the developers sitting at Brian's table agree that an Ubuntu-like release model leads to more stable and consistent releases. By "Ubuntu-like", I mean that it is the release date which is important to be kept stable, and not the list of features contained in the release. People want consistency in when to expect the next release; it makes it easy to look forward to a certain date. What is less important is what is included in the release. What counts is that each release is stable and demonstrates incremental improvements at a consistent rate. I'll be blogging more about this concept shortly and will start a discussion on the mailing list regarding possible release dates and a schedule for locking down commits before that date. Whatever is feature-complete at the time of lock-down goes into the release. Nothing more. Why? Because stability is more important. With a set release cycle, the feature that "missed the deadline" will eventually make it into the code base in a shorter amount of time, in a consistent and stable manner.
Targets for Cirrus
There are a number of major areas that Cirrus is targeting:
Many tasks in the "cleanup, reuse and refactor" category have already been completed by Monty, Brian, myself, and community contributors such as Toru Maesaka, Patrick Galbraith, Eric Day, C.J. Collier, and Yoshinori Sano. These tasks are listed on the blueprints page starting with "code-cleanup". They are also not as dependent on each other as some of the other task areas. Feel free to click through on the various links to the milestone and blueprint tasks in this blog post, comment on the mailing lists, and be an active contributor. Nothing is off limits.
Thursday, November 6. 2008
On Bullsh*t Blog Posts
When you write a blog post and tag it with something you know will allow it to be aggregated into PlanetMySQL (or any other technical aggregation service), ask yourself one thing: If I were a technical person interested in MySQL, would I want to read what I just wrote? If you answered "No" to the above question, don't click the Publish button.
Friday, October 24. 2008
Drizzle Tests - Unearthing the Pompeii of MySQL
Like the bodies underneath the piles of ash in Pompeii, many of the individual tests in the MySQL test suite are frozen in time. In a way, this is understandable, for a few reasons. Developers, in general, hate writing test cases. Let's face it, it's a lot more fun to write code than write and run test cases. And, when something is more fun, we tend to devote more energy to that kind of something, and neglect other not-so-fun stuff. We also tend not to understand the value of tests. Most developers think of tests as a way of validating their work — did the code I write do what I think it should do? But tests are more than that. Among other things, tests fulfill all of the following:
So, tests are A Good Thing™. However, a poorly written test case can be deadly to the overall health of a project. Why? Because bad test cases will often pass when run in a test suite without actually testing anything, or without properly testing what they are supposed to. This gives the developers the illusion of health, which is worse than tests failing and the developers knowing the code base is broken. Oh, and don't get me started on disabling test cases... I'll leave that for a later post. For now, let's focus on what makes a good test case. Good tests are not easy to write. Here are some things that I think make a test A Good Test.
Tests should validate a single thing
I have stressed this before in various performance tuning sessions that I've given. Suppose you run a benchmark and get 1000 queries/second throughput. Afterwards, you edit your my.cnf configuration file and change two variable settings. You then re-run the same benchmark and get 1200 queries/second throughput.
Question: What does this tell you?
Answer: Absolutely nothing.
Why? Because you don't know what effect each of those two changes had on the performance of the server. The change of variable A could have resulted in a 50% performance improvement, while the change in variable B could have had a 20% negative impact on the performance (1000 x 1.5 x 0.8 = 1200). The point is, you don't know what the results of the benchmark mean. Similarly, if a test case attempts to validate more than a single thing, you can't count on the test case's results meaning anything. Here is a perfect example from the current MySQL test suite which, as I complained about last night, the Drizzle build is actually enabling and passing. Below, the bench_count_distinct.test, in its entirety.
#
# Test of count(distinct ..)
#
--disable_warnings
drop table if exists t1;
--enable_warnings
create table t1(n int not null, key(n)) delay_key_write = 1;
let $1=100;
disable_query_log;
while ($1)
{
eval insert into t1 values($1);
eval insert into t1 values($1);
dec $1;
}
enable_query_log;
select count(distinct n) from t1;
explain extended select count(distinct n) from t1;
drop table t1;
# End of 4.1 tests
What precisely does the above test case validate? Well, here is a list of answers I thought up:
Tests should be well commented
This is a no-brainer, but it's simply amazing to me how few relevant comments are in the existing MySQL test suite. Like the bench_count_distinct.test above, most tests either have a useless top comment which typically is just the name of the test, or something like the following, taken from the 1st.test file (yes, there is a test called 1st.test):
#
# Check that we haven't any strange new tables or databases
#
show databases;
show tables in mysql;
Unfortunately, the above is not a joke. What, precisely, does "any strange new tables" mean? Clarity should be king in test case comments (as in code comments). "Strange" is completely vague. Instead, the comment should specify something like:
# Check that we only have two databases: "mysql" and "test"
show databases;
# Check that the tables in the "mysql" database match the correct
# system tables for this version of MySQL
show tables from mysql;
Better still, the test should be completely scrapped in favor of a more traditional setup() type test block, which resets a test environment to a pristine condition. Here is an example of what an excellent test case comment looks like, taken from the analyze.test test case:
#
# Bug #14902 ANALYZE TABLE fails to recognize up-to-date tables
# minimal test case to get an error.
# The problem is happening when analysing table with FT index that
# contains stopwords only. The first execution of analyze table should
# mark index statistics as up to date so that next execution of this
# statement will end up with Table is up to date status.
#
create table t1 (a mediumtext, fulltext key key1(a)) charset utf8 collate utf8_general_ci engine myisam;
insert into t1 values ('hello');
analyze table t1;
analyze table t1;
drop table t1;
Although the comment is in slightly broken English, it's fairly clear what is being validated: the commands below it test whether a FULLTEXT index containing only stopwords will break ANALYZE TABLE. Nicely, the relevant MySQL bug ID is included in the comment. Man, do I wish all the tests in the MySQL test suite were more like the above.
Tests should not mix tests and assertions
My wife, Julie, and I have a saying about why our marriage works well: "low expectations". In the case of tests, I would say the key to a good test is "no expectations". What is an assertion? Essentially, it's a declared expectation. You assert that the behaviour of a specific command or event will match the expected value. A test case should be free of any sign of known expectations. Expectations belong in a results file, not in the test file. If expectations are in the test file, the developer or writer of the test file has polluted the test itself with known expectations. Instead, test files should contain only the statements or commands to reproduce a scenario. Need an example? Take a look at the following, also from analyze.test:
#
# Bug #30495: optimize table t1,t2,t3 extended errors
#
create table t1(a int);
--error 1064
analyze table t1 extended;
--error 1064
optimize table t1 extended;
drop table t1;
The test looks simple enough, and indeed it is. However, the --error 1064 is mixing the assertion with the test statements. Instead, the results file should contain the assertion for a returned error code of 1064 (Syntax error). By including the assertion in the testing block, we tightly couple the execution of the test with the expected results. Why is this bad? What if for the given SQL statement, storage engine A instead returned error code 1067? I would now need to create an entirely different test file containing the same test with different error assertions. Indeed, this situation is quite common. Some storage engines behave differently than others. 
Results of the same series of statements can be different, and yet valid for the individual engine. The proper way to deal with these situations is to record a result file for each of the different storage engines, but keep a single test file. The new test framework allows for this kind of differentiation.
Going Forward
My work this weekend and next week will be focused on updating all the tests in the existing test suite, seeing what is obsolete, correcting and cleaning tests, and better organizing test files into suites of related functionality. The organization of tests into suites has already been done on my local laptop, and I've added a command to the new Python test runner to display some suite statistics, as shown below... ok, more on testing later. I very much welcome input on the test and results file syntax on the wiki page. Also feel free to post any thoughts you have on the testing infrastructure to the Drizzle Discuss mailing list. Thanks in advance for any suggestions.
[537][jpipes@serialcoder: runner]$ python drizzle_test_runner.py --command=list-suites
--------------------------------------------------------------------------------
Suite Name                               # Tests     # Results    # Sub-suites
--------------------------------------------------------------------------------
clients                                        9            12               3
optimizer                                      9             9               0
storage                                       27            57               7
replication                                  176           181               2
types                                         18            18               0
vcol                                          19            15               1
variables                                      5             5               0
stress                                         5             5               1
binlog                                        27            27               1
charsets                                       8             9               0
information_schema                             3             3               0
sql                                          166           155               0
functions                                     23            25               0
--------------------------------------------------------------------------------
                                             495           521              28
--------------------------------------------------------------------------------
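The "one test file, one result file per storage engine" idea from this post can be sketched in a few lines of Python. To be clear, this is not the actual Drizzle test runner: the file naming convention (`<test>.result` and `<test>.<engine>.result`), the engine name `engine_a`, and the function names are all assumptions made up for illustration. The error codes are the 1064 and 1067 from the example above.

```python
import tempfile
from pathlib import Path

# Hypothetical naming convention for this sketch (not the real runner's):
#   <test>.result            shared expectation
#   <test>.<engine>.result   engine-specific override


def pick_result_file(results_dir, test_name, engine):
    """Return the engine-specific result file if present, else the shared one."""
    base = Path(results_dir)
    candidates = (
        base / f"{test_name}.{engine}.result",
        base / f"{test_name}.result",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def output_matches(actual, results_dir, test_name, engine):
    """Compare captured test output with the recorded expectation."""
    expected = pick_result_file(results_dir, test_name, engine)
    if expected is None:
        raise FileNotFoundError(f"no result file recorded for {test_name}")
    return actual == expected.read_text(encoding="utf-8")


def demo():
    """One test file, two engines, two recorded results."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "analyze.result").write_text("ERROR 1064\n", encoding="utf-8")
        Path(tmp, "analyze.engine_a.result").write_text("ERROR 1067\n", encoding="utf-8")
        return (
            output_matches("ERROR 1064\n", tmp, "analyze", "innodb"),
            output_matches("ERROR 1067\n", tmp, "analyze", "engine_a"),
            output_matches("ERROR 1064\n", tmp, "analyze", "engine_a"),
        )


print(demo())
```

The point of the design is that the test statements never change: an engine that legitimately answers differently gets its own recorded result, and the lookup falls back to the shared result for everyone else.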
Tuesday, October 7. 2008
Performance Tuning Webinar for Commercial Application Developers Tomorrow
Tomorrow, at 1pm EDT/10am PDT, I'll be giving a webinar on performance tuning MySQL for commercial application developers. The webinar is open to all participants. I'll be covering my normal Join-fu material but will try to tailor the talk to developers working on applications for the commercial market. I'm actually looking forward to the different questions this group might pose. I can't say I'll have answers to all the questions, but I'll certainly try my best!
Thursday, October 2. 2008
Character Sets, Collations and the Jörmungandr
One of the (many) ongoing discussions in the Drizzle developer community is the level of support the database server kernel should provide for non-Unicode character set encodings. When I say non-Unicode, I really mean non-UTF8, since we've stripped out all other character sets and "standardized" on 4-byte UTF8. I'll come back to why exactly I put standardized in quotes in just a bit...but to sum up, in childish terms, my thoughts after spending 4 hours tonight reading about character sets and collations, here is an exchange between Toru and me on Freenode #drizzle:
<jaypipes> tmaesaka: how do you write "I wish everyone would just speak English" in Japanese?
A Little Background
For those of you new to the world of character sets and collations, I'll briefly summarize the concepts and terms I'll talk about in this article. Incidentally, I consider myself to be in this crowd, since I've never really had to deal with anything more than a cursory knowledge of them in reference to how they work in MySQL (not the internals).
Character Sets and Encodings
A character set, or character encoding scheme, is a system for matching characters (such as "A" or "み" or "ß") with a machine-readable code for the character. This machine-readable code is simply a number, usually written in decimal for small character sets and in hexadecimal for larger ones.
The "encoding" of the character set is the protocol, or instructions, that the character set uses in order to enable the computer to understand a series of byte sequences and interpret the sequence as a specific character.
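To make the difference between a character's code and its encoded bytes concrete, here is a small Python sketch. It has nothing to do with the MySQL or Drizzle code base; the helper name `encoded_sizes` and the choice of sample characters are assumptions made purely for illustration. For each character it prints the Unicode code point and the number of bytes the character occupies under a handful of encodings, or None where the encoding cannot represent it.

```python
# Illustrative sketch only: encoded_sizes() is a hypothetical helper,
# not part of MySQL or Drizzle.
ENCODINGS = ("ascii", "shift_jis", "utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch):
    """Map each encoding name to the byte length of ch under that encoding,
    or None if the encoding has no representation for the character."""
    sizes = {}
    for name in ENCODINGS:
        try:
            sizes[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


# Print only the code point (plain ASCII output), then the per-encoding sizes.
for ch in ("A", "ß", "み", "☃"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The same character "み" comes out as 2 bytes in Shift_JIS and 3 bytes in UTF-8, while "A" is a single byte in ASCII, Shift_JIS and UTF-8 alike: one character and one code, but different byte sequences depending on the encoding.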
Other more-complex character encodings are localized for a specific language, or writing system. For instance, the Shift_JIS character encoding scheme encodes, in 1 or 2 bytes per character, the ASCII character set (with 2 exceptions), the "half-width Katakana" characters, and the JIS X 0208 set of kanji symbols. Sound complicated? It is. And it gets even more complicated the further down the rabbit-hole one goes. Which leads me to Unicode...
What the Heck is Unicode and UTF?
Many folks think that Unicode is merely another character set or encoding scheme. It's not. It's actually more than that. It's an entire system which endeavours to standardize the way that computers can read, sort, and transform characters encoded in various character sets. The Unicode standard, according to Wikipedia:
...consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order...
Got all that? So, Unicode is a set of standards for dealing with lots of varying languages and characters, and transcoding character codes from one encoding scheme to another. What, then, is UTF[8|16|32]? UTF stands for Unicode Transformation Format, and is a set of mapping methods for translating one of Unicode's 1,114,112 code points (characters or control sequences) to a sequence of bytes.
UTF8 is a variably-sized mapping method, which uses between one and four bytes to represent one of the code points. ASCII characters take up 1 byte of storage and most other Western (accented Latin) characters take 2 bytes, whilst CJK (Chinese/Japanese/Korean) characters typically consume 3 bytes of space per character. 
It is important to note that these 3 bytes are one more byte per character than encoding schemes like Shift_JIS need, which use either 1 or 2 bytes per character. Yoshinori Matsunobu published a short article today on these storage space differences.
UTF16 is a variable-width mapping scheme which uses a single 16-bit unit for code points in the Basic Multilingual Plane and a pair of 16-bit units (a surrogate pair) for code points in the other planes. UTF16 generally uses a little bit less storage space for CJK characters versus UTF8. However, when analyzing actual CJK text, which includes spaces and other ASCII characters, the storage difference seems to be negligible.
UTF32 is a fixed-length mapping method which uses 4 bytes to store each code point.
UTF8 is dominant in the web space, with all modern browsers able to understand and encode for UTF8.
OK, So What is a Collation?
So, if a character encoding scheme, such as UTF8, is used to identify a set of characters and symbols as a machine-readable sequence of bytes, then what exactly is a collation, and why is it important? Glad you asked. A collation, or collating sequence, refers to the order in which different characters in a character set should appear when sorted in a list. The alphabetic collating sequence is the one some of us, in our little English-only world, are familiar with. But in various regions of the world, the same set of characters may be ordered differently when appearing in a list of characters. And therefore, even with a character encoding scheme like UTF8, one must also specify a collation when listing textual results in a specific order. In MySQL, as well as Drizzle, the method for ordering results by a specific collation is fairly simple: one merely specifies the collation in the ORDER BY clause, like the example below shows:
mysql> CREATE TABLE utf8_tests (
-> my_text VARCHAR(100) NOT NULL
-> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO utf8_tests VALUES ('comb'),('cukor'),('csak'),('folyik'),('folyó'),('folyosó'),('fő'),('födém');
Query OK, 8 rows affected (0.00 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_general_ci;
+----------+
| my_text |
+----------+
| comb |
| csak |
| cukor |
| födém |
| fő |
| folyó |
| folyik |
| folyosó |
+----------+
8 rows in set (0.00 sec)
mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_hungarian_ci;
+----------+
| my_text |
+----------+
| comb |
| csak |
| cukor |
| fő |
| födém |
| folyó |
| folyik |
| folyosó |
+----------+
8 rows in set (0.00 sec)
You'll notice that the words "fő" and "födém" are reversed depending on the collation used in the ORDER BY clause. Any Hungarians reading this article? If there are, you'll likely have already spotted the problem with the above output. The problem is that it's wrong. "csak" should appear after "cukor", since "cs" is a digraph (two characters interpreted as one) which comes after "c" in the Hungarian alphabet. The above behaviour has been a known bug in MySQL since August 2005, over three years ago. It is something I noticed while reading up on collations and comparing what's going on in MySQL/Drizzle to what the standard expects. The ICU project has a set of HTML pages where you can type in a list of words in a language and sort by various collations, and it will show you the correct sort order. I ran into the bug above, as well as a new bug in the German collation I found today.
Where Drizzle Is Right Now
Currently, all but the UTF8 character set have been removed from Drizzle. Furthermore, the UTF8 implementation in Drizzle is full 4-byte UTF8, which differs from the 3-byte variety used in MySQL <= 5.1. There are two major benefits that this decision and subsequent removal has given Drizzle:
So, it seems that although we've stripped out a lot of complexity by moving to only UTF8 and its collations, we've inherited a system that, frankly, was never designed to handle complex collations. Instead, it is designed to be fast, not entirely accurate. So, what is a project to do? We have a number of options, all of which we've been debating over on the mailing lists:
libICU is, frankly, quite a large library, and it's not certain that its performance would be satisfactory. However, I can certainly envision taking libICU's test case suite and converting it to the Drizzle test suite format. This would certainly poke holes in our current character set handling that need to be discovered. Although Yoshinori-san's objections about UTF8 storage requirements versus localized Japanese character sets are valid, I don't think we'll re-introduce non-UTF8 character sets into the server at this point. If there is a huge uproar over this, pluggable character sets are a possibility in the future, after changes to the plugin API to enable them. Pluggable collations too... This last option is the one which interests me the most, and which I find most appealing. In fact, I compiled a small test program based on the C++ <locale> facilities which actually produces the correct collation order for the bug demonstrated above:
Compiling and running the program shows the correct sorted order for the words:

[518][jpipes@serialcoder: /home/jpipes/repos/drizzle/test-hun]$ g++ test.cc
[519][jpipes@serialcoder: /home/jpipes/repos/drizzle/test-hun]$ ./a.out
comb
cukor
csak

I'm thinking that the refactoring work that still needs to be completed around CHARSET_INFO and MY_CHARSET_HANDLER should experiment with the technique above and measure any performance regression (or improvement) that may occur. Accuracy, in my opinion, and the ability to let a library not written by Drizzle developers do the heavy lifting, is more important than a small performance increase.

The Edwin Strikes Back
Notable voices on the thread include Matz, creator of Ruby and a core influencer in its direction, and Tim Bray, of our own Sun Microsystems and XML fame. The original poster, one Michael Selig, began the thread, entitled "Character encodings - a radical suggestion", with an ostensibly simple suggestion: Remove internal support for non-ASCII encodings completely, and when reading/writing UTF-16 (and UTF-32) files automatically transcode to/from UTF-8. Unfortunately for Michael, this small suggestion was the online equivalent of stepping in a pile of elephant dung.
Until reading the above-mentioned forum thread, I really had no idea about the complexities involved in character set handling, especially in the Asian countries. If you are interested in character sets, collations, and Unicode vs. local encodings, reading through the forum thread will truly enlighten you as to the various arguments for and against UTF8. It's highly recommended reading, but be warned, it may leave you gasping for breath at some points... enjoy.

So Long, and Thanks for all the Fish

Well, as Giuseppe announced, I am leaving the MySQL Community Team after almost three years. I'll still be working at Sun, but as a staff engineer on the Drizzle project in the Sun CTO organization. We are looking for someone to pick up the reins in the North American MySQL community and assume the role of Community Relations manager. Interested? Get in touch with Giuseppe or me after reading his article about the requirements of the job.
I should add that candidates should be advised about Giuseppe. As your team lead, he may subject you to such horrors as excellent project and managerial skills, a kind and encouraging shoulder on which to vent, and a deep, heartfelt connection with open source and community issues. In addition, you can look forward to working with Kaj, Lenz, and Colin, three of the hardest-working people at MySQL, who will eventually make you feel like you just can't do enough to keep up. So, MySQL Community, thanks for all the Fish! Of course, I won't be too far away at all. Working on Drizzle, it's pretty likely you'll be hearing from me. That is, if Brian and Monty let me out of my coding rabbit-hole...

Tuesday, September 30. 2008
Yo! Get Your MySQL Conference Submissions In.

Although Colin Charles has taken over the illustrious duties of Program Chair for the MySQL Conference and Expo 2009, I'm participating in the conference as a member of the submission voting panel along with over a dozen other folks. The deadline for submitting abstracts is October 22nd. Yes, that's 23 days away. So, if you haven't submitted yet, please consider doing so. Giuseppe and others have done a good job outlining guidelines for you to follow in order to get a submission accepted. There is an online form for submitting an abstract. Get your submissions in today!

Thursday, September 25. 2008
Another Quick Feature Added to MySQL Forge
Small Feature Addition to MySQL Forge
Monday, September 22. 2008
A Contributor's Guide to Launchpad.net and Bazaar Slides

Today at the Riga Sun Database Group Developer Meeting, I'm giving a MySQL University session about using Launchpad.net and Bazaar for Contributors. Below, I've posted links to the slides.
A Contributor's Guide to Launchpad and Bazaar Open Office Impress slides
PDF slides
Topics included in the slides:
Thursday, September 18. 2008
Slides from Drunken Query Master and Join-fu Talks at ZendCon
Below are the Open Office Impress and PDF versions of the slide decks for Legend of Drunken Query Master and Join-fu for ZendCon. Enjoy.
Legend of Drunken Query Master: The Apprentice's Journey Open Office Impress slides
PDF slides
Join-fu: The Art of SQL - ZendCon 2008 Open Office Impress slides
PDF slides
Topics included in the slide decks:
