Saturday, July 24, 2010

Thoughts on Velocity 2010

I really enjoyed the Velocity 2010 conference. Here is a summary of the most interesting notes from day 1:
Metrics 101 by Sean Power:
Sean gave me some important insights about metrics for monitoring and application performance. I'll share my notes more or less "as is," so forgive me if they're a little raw:
Why collect metrics?
* To detect problems
* To understand the user experience
* They are needed for a scalable system based on automation
* Measurement for optimization
Create a baseline.
It's not enough to just collect analytics; tie your metrics back to the aspects of your business they affect.
Page load time:
Lots of objects create latency.
A 3rd-party dependency can cause your site to break without your knowledge, hence the importance of monitoring your external 3rd parties.
Look for an external monitoring service with detailed info on performance and on which part of the process could be impacting your website.
Synthetic monitoring isn't enough -
combine browser-based tests with scripted tests.
Real user monitoring - each of your servers can report performance to a centralized place, where you can parse it and get true metrics.
(e.g. Coradiant TrueSight, Tealeaf)
RUM pros and cons:
* Correlates with the analytics you already have
* Watches everything
* Can be used to reproduce problems
* Measures traffic as well as performance
* May require physical installation
* Can be a privacy risk
Summary: it shows you that your site is actually working.
Getting the math right:
* Forget about averages - they misrepresent the data.
* The 80th percentile is much more meaningful.
* You can use histograms to build buckets of performance values and see how many measurements fell into each.
* By using percentiles you can measure the 80-95% range, or the "knee" of the curve, to get a hint that your site can't handle increased load.
* Reporting can't be complicated; when the data is complicated, a lot gets lost.
Consider 3 histograms as an example, e.g. for a checkout page:
If you aggregate them all, everything looks bad, but per histogram you can find the useful data.
Add context such as:
"74% of transactions took less than 4s"
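To make the math concrete, here is a minimal sketch (the response times are made up) showing how an average can misrepresent the data while the 80th percentile and histogram buckets tell the real story:

```python
# Hypothetical response times in seconds -- most requests are fast,
# one outlier drags the average up.
times = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 30.0]

avg = sum(times) / len(times)

# 80th percentile: the value that 80% of samples fall at or below.
ranked = sorted(times)
p80 = ranked[int(0.8 * len(ranked)) - 1]

# Histogram buckets: count how many samples fell into each range.
buckets = {"<1s": 0, "1-4s": 0, ">4s": 0}
for t in times:
    if t < 1:
        buckets["<1s"] += 1
    elif t <= 4:
        buckets["1-4s"] += 1
    else:
        buckets[">4s"] += 1

print(f"average: {avg:.2f}s")      # 3.66s -- looks terrible
print(f"80th percentile: {p80}s")  # 1.0s -- most users are fine
print(buckets)                     # {'<1s': 7, '1-4s': 2, '>4s': 1}
```

The average suggests the page is hopeless, while the percentile and buckets show 9 out of 10 requests finishing under 4s, matching the "74% of transactions took less than 4s" style of reporting.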
Trends - you can use them to predict and correlate.
This can help with planning, but don't trust it blindly.
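As a sketch of the trend idea (the daily page-view numbers are made up), a least-squares line fit lets you extrapolate for planning purposes - though, as the notes warn, not to be trusted blindly:

```python
# Made-up daily page views; fit a least-squares line and extrapolate.
days = [0, 1, 2, 3, 4, 5, 6]
views = [100, 110, 119, 131, 140, 152, 159]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(views) / n

# Ordinary least-squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, views)) / \
        sum((x - mean_x) ** 2 for x in days)
intercept = mean_y - slope * mean_x

predicted_day_10 = slope * 10 + intercept
print(f"trend: ~{slope:.1f} extra views/day")
print(f"day 10 forecast: ~{predicted_day_10:.0f} views")
```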
Target your metrics for your audience - define it first:
How technical are they?
How will they use the data?
* To fix something
* To escalate to others
* To plan for the future
What words do they use?
Marching orders:
Collect data and establish a baseline.
Evaluate what is causing the most slowdown (DNS, etc.).
Set target thresholds.
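The marching orders above can be sketched in a few lines. This is a hypothetical illustration (the `BaselineMonitor` class and its parameters are my own invention, not from the talk): keep a rolling baseline of a metric and flag samples that breach a threshold relative to it:

```python
from collections import deque

class BaselineMonitor:
    """Hypothetical sketch: rolling baseline plus a relative threshold."""

    def __init__(self, window=100, threshold_factor=2.0):
        self.samples = deque(maxlen=window)  # rolling baseline window
        self.threshold_factor = threshold_factor

    def record(self, value):
        """Record a sample; return True if it breaches the threshold."""
        if len(self.samples) >= 10:  # need some history first
            baseline = sum(self.samples) / len(self.samples)
            breach = value > baseline * self.threshold_factor
        else:
            breach = False
        self.samples.append(value)
        return breach

monitor = BaselineMonitor()
for t in [0.5, 0.6, 0.5, 0.55, 0.6, 0.5, 0.6, 0.5, 0.55, 0.6]:
    monitor.record(t)                  # build the baseline (~0.55s)

breach_high = monitor.record(2.0)      # well above 2x baseline
breach_norm = monitor.record(0.6)      # normal sample
print(breach_high, breach_norm)        # True False
```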
Train your audience:
Get them used to the information,
all in the same format,
at the same time,
from the same place.
Some tools to check out:
Ajax: Gomez, Coradiant TrueSight
Agent based: Aternity
Coradiant Tealeaf, Beatbox, Atomica Labs, Moniforce
IBM Coremetrics
Open source:
Firebug - can be used to test load time; a great tool to test from several sources.
Planning for the future:
Mobile devices are the future.
HTML5 is awesome.
Onload time from many places to the top landing page.
Server time from one place, often to your core business process.
Server time from one place, often to each tier (web/IO/CPU).
Top N worst pages by error rate, server latency, network latency.
I also went to a Cassandra workshop. Here are my notes from that:
10 things you should answer when designing your cluster:
1. What is the avg record size?
2. What is the service level objective (latency)?
3. What do I know about my traffic patterns?
4. How fast do we need to scale this thing?
5. What percentage of my usage is read vs. write?
6. How much of my data is "hot"?
7. How big will this data set be?
8. What is my tolerance for data loss?
9. How many 9s do my {boss, users, sanity} expect?
While computing is becoming cheaper, managing your record sizes is still important - keep them efficient.
Users expect an instant response.
Broadband speed exceeds DB latency.
Adding a Cassandra node as an "instant provision" can cause a slowdown, because the new node needs to stream your data over from the existing nodes.
Read vs. write - am I picking the right tool?
Writing is 10x faster than reading,
but reading is generally what the user "feels".
Write your data as you want to read it.
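Here is a minimal sketch of "write your data as you want to read it," using plain Python dicts to stand in for Cassandra column families (the names and data are made up): duplicate each write into every view you will need to read, so each read is one cheap lookup instead of a scan:

```python
from collections import defaultdict

# Two denormalized "views" of the same events, stand-ins for column families.
events_by_user = defaultdict(list)  # read path: a user's timeline
events_by_day = defaultdict(list)   # read path: all events on a day

def record_event(user, day, event):
    # One logical write fans out to every read view we need.
    events_by_user[user].append(event)
    events_by_day[day].append(event)

record_event("alice", "2010-07-24", "checkout")
record_event("bob", "2010-07-24", "login")
record_event("alice", "2010-07-25", "refund")

# Reads are single key lookups -- no scanning or joining at read time.
print(events_by_user["alice"])      # ['checkout', 'refund']
print(events_by_day["2010-07-24"])  # ['checkout', 'login']
```

Since writes are much cheaper than reads, paying for the extra writes up front keeps the user-facing read path fast.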
Try to detect whether you have any data "hot spots": is your data evenly divided across your cluster? Is the load?
8 nodes per cluster is the recommended amount, so you can scale and perform maintenance easily with minimal impact to your production environment; it also helps distribute the data more evenly.
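As a hypothetical sketch of checking for hot spots (the keys, the MD5-based hashing, and the 8-node count stand in for whatever partitioner your cluster actually uses): hash row keys onto nodes and see whether the keys, and therefore the load, spread evenly:

```python
import hashlib
from collections import Counter

NODES = 8  # the recommended cluster size from the notes

def node_for(key):
    # Stable hash -> node index (a stand-in for a real partitioner).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NODES

keys = [f"user:{i}" for i in range(10_000)]
counts = Counter(node_for(k) for k in keys)

# A "hot spot" shows up as one node owning far more keys than the rest.
smallest, largest = min(counts.values()), max(counts.values())
print(counts)
print(f"imbalance ratio: {largest / smallest:.2f}")
```

Note that even key counts don't guarantee even load: a few very hot rows can still overload one node, so measure request rates per node too.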
Do we need to backup?
Yes. (Snapshots?)
jconsole is your friend, especially the VM Summary tab.
Backup strategy?