Paul Meagher shows how to use a database query to calculate conditional probability.

Conditional Probability and SQL


The conditional probability P(A | B) can be mapped onto database-query operations. For example, the probability of cancer given a positive test result, P(+cancer | +test), can be obtained by issuing the following SQL query and then tallying the result set:

SELECT cancer_status FROM Data WHERE test_status='+test'
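The tally step can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration: the eight patient rows are invented, and the '+cancer'/'-cancer' and '-test' labels are assumed by analogy with the '+test' value in the query.

```python
import sqlite3

# Hypothetical sample data: (cancer_status, test_status) per patient.
rows = [
    ("+cancer", "+test"),
    ("+cancer", "+test"),
    ("-cancer", "+test"),
    ("-cancer", "+test"),
    ("+cancer", "-test"),
    ("-cancer", "-test"),
    ("-cancer", "-test"),
    ("-cancer", "-test"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (cancer_status TEXT, test_status TEXT)")
conn.executemany("INSERT INTO Data VALUES (?, ?)", rows)

# The query from the text: keep only the rows where B (+test) holds.
result = conn.execute(
    "SELECT cancer_status FROM Data WHERE test_status = '+test'"
).fetchall()

# Tally the result set: how many of the +test rows are also +cancer?
positives = sum(1 for (status,) in result if status == "+cancer")
p_cancer_given_test = positives / len(result)
print(p_cancer_given_test)  # 2 of the 4 positive tests have cancer: 0.5
```

The WHERE clause does the conditioning on B, and the tally over the result set supplies the numerator and denominator of P(A | B).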

If I gather information about how several boolean-valued tests co-vary with a boolean-valued diagnosis (like that of cancer or not cancer), then I can perform slightly more complex queries to study how diagnostically useful other factors are in determining whether a patient has cancer, such as in the following:

SELECT cancer_status FROM Data WHERE genetic_status='+' AND age_status='+' AND biopsy_status='+'
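The same tally can also be pushed into the query itself, so the database returns both counts in one pass. The sketch below assumes that cancer_status uses the same '+'/'-' coding as the other columns, and the seven patient rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Data ("
    "cancer_status TEXT, genetic_status TEXT, "
    "age_status TEXT, biopsy_status TEXT)"
)
# Hypothetical patients: (cancer, genetic, age, biopsy)
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?, ?)",
    [
        ("+", "+", "+", "+"),
        ("+", "+", "+", "+"),
        ("+", "+", "+", "+"),
        ("-", "+", "+", "+"),
        ("-", "+", "-", "+"),
        ("+", "-", "+", "-"),
        ("-", "-", "-", "-"),
    ],
)

# COUNT(*) is the number of rows satisfying all three conditions;
# the SUM counts how many of those rows are also cancer-positive.
matching, with_cancer = conn.execute(
    """
    SELECT COUNT(*),
           SUM(CASE WHEN cancer_status = '+' THEN 1 ELSE 0 END)
    FROM Data
    WHERE genetic_status = '+' AND age_status = '+' AND biopsy_status = '+'
    """
).fetchone()

# Guard against an empty conditioning set (SUM would be NULL there).
p = with_cancer / matching if matching else None
print(matching, with_cancer, p)  # 4 3 0.75
```

Adding or dropping a condition in the WHERE clause shows how much each factor changes the estimate, which is one way to judge its diagnostic usefulness.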

In the case of detecting e-mail spam, I might be interested in computing P(+spam | title_word='viagra' AND title_word='free'), which could be viewed as a directive to issue the following SQL query:

SELECT spam_status FROM Emails WHERE email_title LIKE '%viagra%' AND email_title LIKE '%free%'

After counting the e-mails that are spam and have "viagra" and "free" in the title:

count_emails(spam_status='+spam' AND email_title LIKE '%viagra%' AND email_title LIKE '%free%')

and dividing by the overall number of e-mails with the words "viagra" and "free" in the title:

count_emails(email_title LIKE '%viagra%' AND email_title LIKE '%free%')

I might arrive at the conclusion that the appearance of these words in the title strongly and specifically co-varies with the message being spam (after all, 18/18 = 100 percent), and this rule might be used to automatically filter such messages.
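The two counts and their ratio can be sketched as follows. The count_emails helper here is a hypothetical stand-in for the pseudocode above, the training rows are invented so that the counts come out to 18/18, and the % wildcards are an assumption added so that LIKE matches a word anywhere in the title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emails (email_id INTEGER PRIMARY KEY, "
    "spam_status TEXT, email_title TEXT)"
)
# Hypothetical training data: 18 spam titles containing both words,
# plus a few other messages that match at most one word.
emails = [("+spam", f"Free viagra offer #{i}") for i in range(18)]
emails += [
    ("-spam", "Free tickets for Friday"),
    ("-spam", "Meeting notes"),
    ("+spam", "Cheap watches"),
]
conn.executemany(
    "INSERT INTO Emails (spam_status, email_title) VALUES (?, ?)", emails
)


def count_emails(where_clause):
    """Count rows matching a trusted, hard-coded WHERE clause.

    The clause is concatenated into the SQL, so it must never
    contain user-supplied text.
    """
    sql = "SELECT COUNT(*) FROM Emails WHERE " + where_clause
    return conn.execute(sql).fetchone()[0]


title_match = "email_title LIKE '%viagra%' AND email_title LIKE '%free%'"
numerator = count_emails("spam_status = '+spam' AND " + title_match)
denominator = count_emails(title_match)
p_spam = numerator / denominator
print(f"{numerator}/{denominator} = {p_spam:.0%}")  # 18/18 = 100%
```

SQLite's LIKE is case-insensitive for ASCII letters by default, which is why "Free" in a title matches '%free%' here; other databases may need LOWER() or a case-insensitive collation.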

In Bayes spam filtering, you first need to train the software on which e-mails are spam and which are not. One can imagine storing spam_status information with each e-mail record (for example, email_id, spam_status, email_title, and email_message) and running the previous queries and counts on this data to decide whether to forward a new e-mail into your inbox.
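As a rough sketch of that decision step, the counts can be wrapped in a function that takes the words found in a new e-mail's title. The function names, the 0.9 threshold, and the four training rows are all hypothetical, and this is a simple frequency rule over stored records rather than a complete Bayes filter.

```python
import sqlite3


def spam_probability(conn, words):
    """Estimate P(+spam | every word appears in the title) from stored e-mails."""
    if not words:
        return None
    # One parameterized LIKE condition per word, so the words are never
    # spliced into the SQL text itself.
    condition = " AND ".join("email_title LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    total = conn.execute(
        f"SELECT COUNT(*) FROM Emails WHERE {condition}", params
    ).fetchone()[0]
    if total == 0:
        return None  # no training evidence for this combination of words
    spam = conn.execute(
        f"SELECT COUNT(*) FROM Emails "
        f"WHERE spam_status = '+spam' AND {condition}",
        params,
    ).fetchone()[0]
    return spam / total


def should_forward(conn, words, threshold=0.9):
    """Forward unless the estimated spam probability reaches the threshold."""
    p = spam_probability(conn, words)
    return p is None or p < threshold


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emails (email_id INTEGER PRIMARY KEY, "
    "spam_status TEXT, email_title TEXT, email_message TEXT)"
)
# Hypothetical training records, already labeled as spam or not spam.
conn.executemany(
    "INSERT INTO Emails (spam_status, email_title, email_message) "
    "VALUES (?, ?, ?)",
    [
        ("+spam", "Free viagra now", ""),
        ("+spam", "viagra for free", ""),
        ("-spam", "Free lunch on Friday", ""),
        ("-spam", "Project update", ""),
    ],
)

print(should_forward(conn, ["viagra", "free"]))  # False: 2 of 2 matches are spam
print(should_forward(conn, ["project"]))         # True: 0 of 1 matches are spam
```

Forwarding when there is no matching training data is a design choice made for this sketch; a stricter filter could quarantine such messages instead.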




The content shown in this page was first published by IBM developerWorks and is reprinted with permission from Paul Meagher (www.datavore.com)

