
December 17, 2009

Algorithmic (almost) content creation

This article from Wired on Demand Media and their demand-based creation and delivery of 'content' describes an important movement on the Web (and off the Web too).

The choice quote:
Instead of trying to raise the market value of online content to match the cost of producing it — perhaps an impossible proposition — the secret is to cut costs until they match the market value.


The costs to be cut are the costs of creation (manufacturing). The delivery costs are already nearly zero. Currently Demand Media is generating answers to unfulfilled questions using 'crowd sourcing' and blending media assets like video and photos and quickly written text. I wonder if someday even the text could be auto-generated.

I'm sure in the next six months we'll see a blooming of clones - 'DemandMedia for FooBar' style.

Quite a while ago I had thought about what it would take to build a content site with heavy automation on the gathering, review and approval of content. But I had not thought of optimizing that process based on audience demand. Quite clever really.

Update:
Just found this post on ReadWriteWeb from a writer who previously worked with Demand Media - required reading to see things from the viewpoint of someone actually creating Demand Media content.

Choice quote:
They [writers] appear to be overwhelmingly women, often with children, often English majors or journalism students, looking for a way to do what they love and make a little money at it.

Compare those demographics to Wikipedia: more than 80% male, more than 65% single, more than 85% without children, around 70% under the age of 30.

September 29, 2009

Tokyo Tyrant tuning parameters

We've been working with Tokyo Tyrant for some large scale key-value lookups and the performance has been very nice, but it has degraded over time. I've been poking around the various options to try to improve performance, and although the options are documented, the pages are hard to read and it's hard to figure out what's what. So I thought I'd collect them here for reference. I'll describe the results of tuning and tweaking in a future post.

The most recent authoritative references are here:

Tokyo Tyrant (actually Tokyo Cabinet – the storage engine) supports various types of storage – B+ Tree indexing, hash index, etc. This is configured by setting the filename or file extension to a particular value:
  • If the name is "*", the database will be an in-memory hash database.
  • If it is "+", the database will be an in-memory tree database.
  • If its suffix is ".tch", the database will be a hash database.
  • If its suffix is ".tcb", the database will be a B+ tree database.
  • If its suffix is ".tcf", the database will be a fixed-length database.
  • If its suffix is ".tct", the database will be a table database.
Each has its own set of options, and while different flavors of storage may accept the same option name (like bnum), the optimal value will likely differ across storage types.
Tuning parameters can trail the filename, separated by "#". Each parameter is composed of the name and the value, separated by "=". For example, "casket.tch#bnum=1000000#opts=ld" means that the name of the database file is "casket.tch", and the bucket array size is 1000000, and the options are large and deflate.
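
Just to make the syntax concrete, here's a tiny Python sketch that assembles such a name before handing it to ttserver. The parameter values are made up for illustration, not a recommendation:

    # Sketch: build a Tokyo Cabinet database name with '#name=value' tuning parameters.
    # The parameter values here are made up for illustration, not a recommendation.

    def tuned_db_name(filename, **params):
        """Append '#name=value' tuning parameters to a database file name."""
        return filename + "".join("#%s=%s" % (k, v) for k, v in params.items())

    # A hash database with a 100M-entry bucket array, 64-bit addressing and
    # 256MB of extra mapped memory:
    print(tuned_db_name("casket.tch", bnum=100000000, opts="l", xmsiz=268435456))
    # -> casket.tch#bnum=100000000#opts=l#xmsiz=268435456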

For disk-based storage, several tuning parameters specify the on-disk layout while others specify memory and caching settings. Changing the on-disk layout requires scanning and re-writing the database file, which requires exclusive access to the file - in other words, taking the database offline. This scanning and re-writing is done via tools provided with the distribution (e.g. tchmgr and tcbmgr). Changing the memory and caching settings only requires a restart of Tokyo Tyrant.

We've been working only with on-disk storage via the hash and B+ Tree database engines. For a hash database the tuning parameters for the on-disk layout are limited to the size of the bucket array and the size of an element in the bucket array (choosing 'large' gets you 64-bit addressing and addressable data greater than 2GB). When a hash database file is first created, space is allocated on disk for the full bucket array. For example, a database with a 100M bucket size and the 'large' option would start out at around 800MB. This region of the data file is accessed via memory-mapped IO. There is an additional 'extra mapped memory' setting which defaults to 64MB - I'm not sure what this is used for, but for performance more memory is always better.
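
A rough back-of-the-envelope sketch of that initial allocation (my own assumption: 8 bytes per bucket entry with the 'large' option, 4 bytes without; the real file also carries headers and record data on top):

    # Rough estimate of the initial on-disk size of a Tokyo Cabinet hash database.
    # Assumption (mine): 8 bytes per bucket entry with opts=l, 4 bytes otherwise.

    def initial_hash_db_size_bytes(bnum, large=False):
        return bnum * (8 if large else 4)

    size = initial_hash_db_size_bytes(100_000_000, large=True)
    print("about %d MB" % (size / (1024 * 1024)))   # about 762 MB, i.e. the ~800MB above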

For a B+ Tree database, there are additional tuning parameters for the structure of the B+ Tree - how many members (links to child nodes) are in an interior non-leaf node and how many members are in a leaf node. Records are not stored in the leaf nodes themselves but within 'pages'. The leaf nodes point to these pages, each page holds multiple records, and pages are accessed via an internal hash database (and since this is a B+ Tree, the records within a page are of course stored in sorted order). There is also a parameter for the bucket size of this internal hash database. One subtle detail is that the bucket size for a B+ Tree database is the number of pages, not the number of elements (records) being stored - so this would likely be a smaller number than for a hash database holding the same number of records.
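
To make the difference concrete, here's a small sketch of how I'd pick a bnum for each engine. This is my own rule-of-thumb arithmetic, using the suggested ranges and the default lmemb from the parameter lists below:

    # My own rule-of-thumb sketch: a hash database buckets records directly,
    # while a B+ Tree database buckets pages, each page holding roughly lmemb records.

    def suggested_hash_bnum(num_records, factor=2):
        # suggested range: roughly 0.5x to 4x the number of records
        return num_records * factor

    def suggested_btree_bnum(num_records, lmemb=128, factor=2):
        # suggested range: roughly 1x to 4x the number of *pages*, not records
        num_pages = num_records // lmemb + 1
        return num_pages * factor

    records = 200_000_000
    print(suggested_hash_bnum(records))    # 400000000 buckets
    print(suggested_btree_bnum(records))   # 3125002 buckets - much smaller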

I've not yet figured out how the dfunit tuning parameter works or what impact that has on a running server, but it looks interesting.


In-memory hash
  • bnum - the number of buckets
  • capnum - the capacity number of records
  • capsiz - the capacity size of memory to use. Note: records that spill over the capacity are removed in the order they were stored.



In-memory tree
  • capnum - the capacity number of records
  • capsiz - the capacity size of memory to use. Note: records that spill over the capacity are removed in the order they were stored.


Hash
  • opts - "l" for large (the database can be larger than 2GB, using a 64-bit bucket array), "d" for Deflate compression of each record, "b" for BZIP2 compression, "t" for TCBS compression
  • bnum - number of elements in the bucket array. If not more than 0, the default is used. The default is 131071 (128K). The suggested size is about 0.5 to 4 times the number of records to be stored.
  • rcnum - maximum number of records to be cached. If not more than 0, the record cache is disabled. Disabled by default.
  • xmsiz - size of the extra mapped memory. If not more than 0, the extra mapped memory is disabled. The default is 67108864 (64MB).
  • apow - record alignment as a power of 2. If negative, the default is used. The default is 4, standing for 2^4=16.
  • fpow - maximum number of elements in the free block pool, as a power of 2. If negative, the default is used. The default is 10, standing for 2^10=1024.
  • dfunit - unit step number of auto defragmentation. If not more than 0, auto defragmentation is disabled. Disabled by default.
  • mode - "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock


B+ Tree
  • opts - "l" for large, "d" for Deflate compression, "b" for BZIP2 compression, "t" for TCBS compression
  • bnum - number of elements in the bucket array. If not more than 0, the default is used. The default is 32749 (32K). The suggested size is about 1 to 4 times the number of pages to be stored.
  • nmemb - number of members in each non-leaf page. If not more than 0, the default is used. The default is 256.
  • ncnum - maximum number of non-leaf nodes to be cached. If not more than 0, the default is used. The default is 512.
  • lmemb - number of members in each leaf page. If not more than 0, the default is used. The default is 128.
  • lcnum - maximum number of leaf nodes to be cached. If not more than 0, the default is used. The default is 1024.
  • apow - record alignment as a power of 2. If negative, the default is used. The default is 8, standing for 2^8=256.
  • fpow - maximum number of elements in the free block pool, as a power of 2. If negative, the default is used. The default is 10, standing for 2^10=1024.
  • xmsiz - size of the extra mapped memory. If not more than 0, the extra mapped memory is disabled. Disabled by default.
  • dfunit - unit step number of auto defragmentation. If not more than 0, auto defragmentation is disabled. Disabled by default.
  • mode - "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock


Fixed-length
  • width - width of the value of each record. If not more than 0, the default is used. The default is 255.
  • limsiz - limit size of the database file. If not more than 0, the default is used. The default is 268435456 (256MB).
  • mode - "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock



Table
  • opts - "l" for large, "d" for Deflate compression, "b" for BZIP2 compression, "t" for TCBS compression
  • idx - specifies the column name of an index and its type, separated by ":"
  • bnum - number of elements in the bucket array. If not more than 0, the default is used. The default is 131071. The suggested size is about 0.5 to 4 times the number of records to be stored.
  • rcnum - maximum number of records to be cached. If not more than 0, the record cache is disabled. Disabled by default.
  • lcnum - maximum number of leaf nodes to be cached. If not more than 0, the default is used. The default is 4096.
  • ncnum - maximum number of non-leaf nodes to be cached. If not more than 0, the default is used. The default is 512.
  • xmsiz - size of the extra mapped memory. If not more than 0, the extra mapped memory is disabled. The default is 67108864 (64MB).
  • apow - record alignment as a power of 2. If negative, the default is used. The default is 4, standing for 2^4=16.
  • fpow - maximum number of elements in the free block pool, as a power of 2. If negative, the default is used. The default is 10, standing for 2^10=1024.
  • dfunit - unit step number of auto defragmentation. If not more than 0, auto defragmentation is disabled. Disabled by default.
  • mode - "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock

August 02, 2009

Faster MySQL B-Tree performance

In our ongoing quest for low-cost, high-performance solutions for our platform I've run across TokuDB from TokuTek. This is a MySQL storage engine that uses a different indexing technology to make updates to an index faster. Making index maintenance faster means being able to have more indices, which allows for richer queries and more interesting applications and analytics.

For one part of our system, I'm interested simply in faster inserts to the index. Having more indices for each table would help in other areas, but I haven't started working on those yet. The table in question manages a many-to-one mapping of tokens (text) to internal database row IDs (integers) - essentially a simple key/value lookup table. Currently a single machine does 150 inserts per second, which is fairly abysmal for a table with only 200M records.
This system uses MySQL with the InnoDB storage engine, which uses a B-Tree indexing approach. Since the lookups are simple key/value lookups there's really no need for a B-Tree - a hash indexing scheme would work, but unfortunately InnoDB does not support that. We may migrate to Tokyo Tyrant and Tokyo Cabinet, which support hash indexing and also seem to have better concurrency support (many non-blocking concurrent requests). The keys inserted are almost sequential, and from what I've read that should help (although the bottom-up building of a B-Tree could be optimized by detecting that the key causing a page split is greater than all the values in the page being split - I think Oracle does this).

While looking for hash based indexing for MySQL I found TokuDB, which uses the sexy phrase 'fractal tree' for their indexing. Their site has a few introductory whitepapers, but for a deeper understanding I wanted to get to the theoretical foundation of their technology.

I soon found that there are several meanings of 'fractal tree' data indexing:
  • fpb+ trees (fractal prefetching B+ trees) - embed cache-optimized trees within disk-optimized trees. Unrelated to TokuDB.
  • Fractal Trees (TM) from TokuTek - use 'cache oblivious' algorithms to improve B-tree usage.
One of their founders was kind enough to post some links on the mostly useful MySQL performance blog - here are the links (however, a few seem to be missing).
The core algorithms seem to be for 'cache oblivious' B-tree structures, based on a description from a master's thesis from 1999. Maybe not all things 20th century suck (PostScript files excluded).
Although I haven't finished reading all the papers, what I like about this approach is the way they model algorithm performance as the cost to transfer blocks of data between layers of storage, and that they consider multiple layers. The most important transfer cost is between disk and memory, but including considerations like an OS-managed disk cache is good. This models the multi-layer caching that nearly all large scale systems use. We are big fans of measuring algorithm performance and selecting the right tool for the right job.

As an aside, one of the comments on the MySQL performance blog pointed me to InfoBright, a column-oriented store, which might be useful in some of our analytics systems for ad-hoc reporting.

March 17, 2009

A clean desktop. The hard way.

My hard drive has been complaining recently and yesterday gave up the ghost. I found a few decent replacement drives online, but needed a new one right now. We had a gift certificate to Staples and what do you know - they had a reasonable 500GB SATA drive for $80. I called them last night just before 9pm - not only were they helpful late at night, but they put one behind the counter for me. Nice people there.

This morning I backed up all my files to my Linux server here at home and swapped in the new drive. I re-installed Win XP Pro and my machine was now a blank, default machine with no drivers for all the hardware plugged into it - like the big monitor, the speakers, power management, etc. I finally figured out Dell's horrible user interface for installing drivers and soon was back to a running machine with a clean desktop (first time in several years).

The only oddity was the clock was off by an hour. The old Win XP install didn't know about Congress changing when Daylight Savings Time started.

After restoring all my files, I needed to update Win XP and install all the old applications again. Not too hard, but time consuming.

Here's the order in which I restored things:

  • Dell drivers
  • Firefox browser
  • iTunes (so I could have music while restoring everything else)
  • PuTTY
  • PasswordSafe
  • Windows XP service pack 3
  • Microsoft Office 2003
  • Java 6 SDK
  • Eclipse
  • Apache 2.2
  • Tomcat 6
  • DBVisualizer
  • TortoiseSVN shell
  • GTalk

March 02, 2009

Yahoo Query Language and Open Tables

I've been looking at the Yahoo Open Data Tables and Query Language documentation. This is truly amazing stuff! It provides a service API that accesses many well known data sources (many of them Yahoo's) and transforms the data into XML or JSON. The data sources can be external URLs that provide XML, and Yahoo does the fetch, parse, extract and transform that you want. You can provide a definition for some other external data source and they will hook it into their unified fetch/query/transform API.

Some of the data sources are Flickr, local listings, geo location info, web search, image search, news search, weather and so on. One stop shopping for lots of great data.

Their console http://developer.yahoo.com/yql/console/ is a great way to see what's possible.
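
For example, here's a minimal sketch of calling the service directly instead of through the console. The endpoint URL and the flickr.photos.search table are from memory, so treat them as assumptions and double-check against the docs:

    # Minimal YQL request sketch. The endpoint and table name are from memory
    # and may have changed - check the YQL documentation before relying on them.
    import json
    import urllib.parse
    import urllib.request

    query = 'select * from flickr.photos.search where text="penguin" limit 3'
    url = "http://query.yahooapis.com/v1/public/yql?" + urllib.parse.urlencode(
        {"q": query, "format": "json"})

    with urllib.request.urlopen(url) as resp:
        print(json.dumps(json.load(resp), indent=2))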

This is what I've wanted for many years. A long time ago I wanted to build a service that would provide "XML data sources" (I even registered xmldatasource.com) for everything available on the Web - now it looks like Yahoo has actually done it.

Let's hope they keep this data access service open to all.

January 29, 2009

Scalable, reliable key-value lookup service

Here's a great summary of most of the available key-value storage services from Richard Jones of last.FM - great stuff.

From what I can tell, Scalaris is only memory-resident at the moment and doesn’t persist data to disk. This makes it entirely impractical to actually run a service like Wikipedia on Scalaris for real - but it sounds like they tackled the hard problems first, and persisting to disk should be a walk in the park after you rolled your own version of Chord and made Paxos your bitch.


I'm leaning towards Voldemort, but need to look into this more.

December 31, 2008

I Hate Computers - Log entry #762

Here are some helpful tips, if you ever find yourself using 'computers', especially the Windows species.


  • Never ever disable the "VgaSave" video adapter. Ever. This is the fallback software used by Windows to display things on the screen if no other video driver works. If this is disabled, Windows will start but your screen will show nothing but the finest shade of black. If you fail to follow this advice, be prepared to follow these instructions to re-enable the vgasave service from the Windows Recovery Console. Not that it worked for me, but hey, good luck with that.


  • If you perform a 'repair' installation while your screen shows nothing but the black darkness which has enveloped the heart of every poor Windows operating system developer, hoping for your video adapter driver to be repaired, do not (no not ever) turn the power off. Not even once, just for fun.




  • If you happen to boot up your Windows computer and it fails to start due to 'Registry is corrupted' or some such, and you choose to re-install Windows rather than becoming a Tibetan monk (who would likely have fewer problems than a Windows user with a failing disk drive, even considering the Chinese government's approach to freedom), be prepared for a long stretch of file recovery and application re-installation. I recommend Fat Tire Amber Ale from the New Belgium Brewing Company.



  • If you happen to have not followed these tips, and you have re-installed Windows, and you now have a default user account with a default green and deceptively bucolic grassy field where before you had many files and folders and possibly (if you are reading this) insipid desktop wallpaper, here is an actual, useful tip: you can change the settings for this newly created user account to use the folders and settings of your previous user account. This won't solve all your problems, but really, do you think that's even possible? Even with advice from me (and I'm composed of nearly 100% pure awesome). Here is what you do:

    • use the regedit program (and if you don't know what that is, give up now) to view HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList (a small read-only script after this list shows the same lookup)

    • Look for a sub-key with a really long name whose ProfileImagePath entry points to the fairly empty and useless 'new user account' (e.g. "%SystemDrive%\Documents and Settings\Mike.NAUTILUS"). Change the value to point to the location of the old and wondrous user account (e.g. "%SystemDrive%\Documents and Settings\Mike").

    • Go back to the ProfileList registry key, and update the "AllUsersProfile" and "DefaultUserProfile" settings - you'll probably want to continue to use the old folder that had settings for all the old applications you had previously installed and which are most likely useless, since you've re-installed Windows.

    • Log out, then log back in. Hopefully you now see your old desktop. In either case, remember - Fat Tire Amber Ale from the New Belgium Brewing Company.

    • To be honest, you could just copy the files from the old user account folder into the folder for the new user account. But I'm lazy. Just be aware that by pointing the new user account to the old user folder, there may be permission issues with accessing some of the folders or files. I haven't seen that happen but it seems a likely next failure.
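
And if you'd rather poke at the profile list from a script than from regedit, here's a small read-only sketch using Python's standard winreg module - it only prints the entries, it doesn't change anything:

    # Read-only sketch: dump the same ProfileList entries the regedit steps above use.
    import winreg

    PROFILE_LIST = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList"

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PROFILE_LIST) as key:
        num_subkeys = winreg.QueryInfoKey(key)[0]
        for i in range(num_subkeys):
            sid = winreg.EnumKey(key, i)
            with winreg.OpenKey(key, sid) as subkey:
                try:
                    path, _ = winreg.QueryValueEx(subkey, "ProfileImagePath")
                except FileNotFoundError:
                    continue                  # some entries have no profile path
                print(sid, "->", path)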



July 29, 2008

MySQL and LAST_INSERT_ID()

We've recently migrated our software servers from a dedicated server environment (BlueGecko - really good people and service) to Amazon's EC2 'virtual compute cloud' environment.

The new system has a very nice performance monitoring capability based on Ganglia that gives us visibility into the performance of service requests as well as details on more fine-grained functions that take place within each request. We can now see not only the number of requests or functions but also the average duration and the time it took for 50%, 95% or 99% of the requests to complete in a five minute interval. This percentile breakdown gives a quick feel for how 'spiky' performance is and how common outliers are.
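
For anyone wondering what that percentile breakdown boils down to, it's essentially this (a toy sketch of the idea, not our actual monitoring code):

    # Toy sketch of the percentile breakdown described above: collect request
    # durations for a five-minute window, then report average, p50, p95 and p99.

    def percentile(sorted_values, pct):
        idx = int(round(pct / 100.0 * (len(sorted_values) - 1)))
        return sorted_values[idx]

    def summarize(durations_ms):
        s = sorted(durations_ms)
        return {
            "count": len(s),
            "avg": sum(s) / float(len(s)),
            "p50": percentile(s, 50),
            "p95": percentile(s, 95),
            "p99": percentile(s, 99),
        }

    print(summarize([12, 15, 11, 14, 250, 13, 16, 12, 900, 14]))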

So far, things have gone very well but there was one function of the system that seemed to suddenly have terrible performance. I was able to quickly see which area of the code was involved, which pointed me to some SQL statements.

As our system adds new anonymous profiles, the code retrieves the unique number assigned by MySQL using the special query that MySQL provides:
SELECT LAST_INSERT_ID() AS user_id FROM oo_anon_user


However, it turns out that this SQL is actually incorrect - the extra "FROM oo_anon_user" caused the database to return one row for every record in the table (each carrying the same LAST_INSERT_ID value) back to the client, on every request that created a new record. This of course took some time.

The correct syntax to retrieve an auto-increment field from a MySQL table is
SELECT LAST_INSERT_ID() AS user_id
Remember to omit any FROM clause.
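
For reference, here's how the corrected pattern looks from code. This is a sketch using the Python MySQLdb driver; the oo_anon_user table and user_id name come from the post, while the created_at column and connection details are placeholders:

    # Sketch of the corrected pattern with the Python MySQLdb driver.
    # The created_at column and connection details below are placeholders.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="oo")
    cur = conn.cursor()

    cur.execute("INSERT INTO oo_anon_user (created_at) VALUES (NOW())")
    user_id = cur.lastrowid   # the driver captures the auto-increment id for us

    # Equivalent, done by hand on the same connection - note: no FROM clause.
    cur.execute("SELECT LAST_INSERT_ID() AS user_id")
    assert cur.fetchone()[0] == user_id

    conn.commit()
    print(user_id)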

It's much much much faster now.

April 24, 2008

CouchDB

I'd heard the CouchDB name but didn't realize that it's an HTTP-accessed, Erlang-implemented, schema-free, distributed database - and it supports JavaScript for defining views. Time to dive in!

From Apache CouchDB: The CouchDB Project:

"Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language."

April 21, 2008

What programmers like

Quote of the day, found on the High Scalability blog while looking into Amazon's SimpleDB
"Programmers like problems they can solve with more programming."

April 19, 2008

MyOpenID for Your Domain

This post from ReadWriteWeb about MyOpenID for Your Domain is interesting. I've looked into OpenID several times in the past but it didn't have enough immediate value to get me to do anything about it. Now that more sites are 'supporting' OpenID, and with the simple-to-use MyOpenID provider, maybe I'll set something up. It would be interesting to see decentralized profiles and authentication; it seems a shame that everyone is getting locked into a few social networks for hosting and managing personal profiles and shared contact info.

April 14, 2008

Of Liquidity, Competition and Platforms

This post by Bob Wyman "Of Liquidity, Competition and Platforms" reminds me why I subscribe to tech blogs - occasionally you get a deep thinker that clearly and concisely describes something that likely took them a long time to figure out.

"As many others have said, much of what people are building today is 'features' not products. As long as that is the case, raw economics is the real problem with the software business. Competition strengthens the platform builders while eviscerating the component builders... The rich get richer and the builders of innovative features must be satisfied with the 'personal rewards' of doing a job well unless they get lucky in the buy-out lottery game."


With the various cloud computing efforts well underway, and with each providing a different approach as a 'platform', it would be good for developers and entrepreneurs to think about which road to travel.

March 01, 2008

Seattle's nuclear reactor


I knew that some universities have or had nuclear reactors for research, but I didn't know they could look this beautiful.


From Crosscut:
Abby Martin hopes that, whatever its fate, the Nuclear Reactor Building finally gets public acknowledgment of its role in history, and credit for its architectural originality. That would be in keeping with its original intent. Its hot core may be gone, but it can still teach us something.

February 01, 2008

Yowza! MicroHoo in the future?

Looks like Microsoft is serious about buying advertising's future - they have offered $44B for Yahoo.
Favorite punditry (from Stowe Boyd):

"Personally, I think the Microsoft and Yahoo matchup is like two tired swimmers who bump into each other and then wind up drowning each other in their scramble to survive. But Yahoo will be the first to go under in this embrace."


This one from iMe on Twitter is good too:


"Microsoft and Yahoo! Its like a blind man trying to lead a deaf guide dog...."

January 19, 2008

Stonebraker on MapReduce

I've been too busy lately to post on all the exciting things happening in the database world - more people getting into column-oriented storage, Amazon's SimpleDB service, consumer database Web services like blist and LongJump, Sun buying MySQL. But I couldn't pass up commenting on Joe Gregorio's post on Stonebraker on MapReduce.
It seems everybody is panning Stonebraker's evaluation of MapReduce as incorrectly comparing it to a DBMS. One commenter even said (ironic comment of the year) Michael Stonebraker should learn what a DMBMS is.

The point, which I'm sure someone has pointed out, can be found by looking at the summary of their findings. Here are a few:
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Missing most of the features that are routinely included in current DBMS
- Incompatible with all of the tools DBMS users have come to depend on

These read like a quote from The Innovator's Dilemma. People enjoy the benefits of MapReduce for large scale data processing because of these points. The lack of support for these DBMS features and tools is the reason it scales like a mother fucker. That is the feature people want. And it does it 10x better and cheaper than anything else. Those other things simply don't matter to them. It's a new audience and a new market.

November 04, 2007

Nielsen on Generic Commands

I ran across this interesting bit from Jakob Nielsen's UseIt site about user interface design. It's about using the same few commands in a UI to achieve many related functions, and it reads like an interesting twin to the 'uniform interface' aspect of REST. Although a 'user interface' and a 'network interface' are very different beasts, I can't help but wonder if there's some fundamental reason for the similarity.


Summary: Applications can give users access to a richer feature set by using the same few commands to achieve many related functions.

In application design, there's a tension between power and simplicity: Users want the ability to get a lot done, but they don't want to take the time to learn lots of complicated features.

September 27, 2007

Amazon Payment System

I was looking at the Amazon Flexible Payment System service API and it's a whopping big set of docs. Unfortunately, I was sorely disappointed to see that the operations in the HTTP-based API (called a 'REST Request') all seem to be GET requests with an Action= query term. Ugh.
The documentation doesn't even mention which HTTP method to use, what content-type to submit or to expect in return, etc. I'm pretty sure they simply don't know the difference. Sad but true. At least there are docs.

I suppose they didn't have the time to make it simple.

September 25, 2007

NASA Tech Briefs

A long time ago I used to subscribe to NASA Tech Briefs - a slim monthly magazine full of hard-core engineering articles and light scientific reporting.
A few months ago a friend had an issue and I decided to see what kind of online presence they have now - and they have a good one. No Atom feed, but still lots of techno-bits!

NASA Tech Briefs

September 03, 2007

CouchDB: Thinking beyond the RDBMS

From Assaf at labnotes,
CouchDB: Thinking beyond the RDBMS

It stores documents in a flat space.

There are no schemas. But you do store (and retrieve) JSON objects. Cool kids rejoice.

And all this happens using Real REST (you know, the one with PUT, DELETE and no envelopes to hide stuff), so it doesn’t matter that CouchDB is implemented in Erlang. (In fact, Erlang is a feature)

Here’s where it gets interesting. There are no indexes. So your first option is knowing the name of the document you want to retrieve. The second is referencing it from another document. And remember, it’s JSON in/JSON out, with REST access all around, so relative URLs and you’re fine.


Hmm, Erlang. Hmm, no schemas. Hmm, no write consistency. Sounds perfect.