I'm starting a blog to talk about a hobby project I'm undertaking for my house. The concept and basic implementation are best explained by showing a mock-up in action. Check out the video link below!
Why? OK, now that you've seen the concept video, you may wonder about the why and how. To answer the why question: whenever I go somewhere beautiful, I wish I could take that view home with me. I can take a photo, but when I get back home and look at it, I just don't get the same feeling. To make something appear real, I need:
Motion - the more frames per second that can be shown, the more realistic movement looks
Immersion - the image needs to surround you. At any point in time, we see ~110 degrees around us, which a small picture or screen fails to recreate
High resolution - The human eye can discern detail down to roughly 0.3 arc minutes per pixel. Low resolution loses realism.
Sense of depth - We take depth cues through many means. One important cue for distant objects is motion parallax, which occurs because objects appear to change position more slowly the further away they are from you.
Sound - Audio plays an important role in recreating a scene as well as providing some depth cues.
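As a rough back-of-the-envelope check on the immersion and resolution requirements above, here is a small sketch of how many pixels a display would need across a given field of view. It assumes the acuity figure is read as about 0.3 arc minutes per pixel; the function name is just for illustration.

```python
def pixels_for_field_of_view(fov_degrees, arcmin_per_pixel):
    """Pixels needed across a field of view at a given angular pixel pitch."""
    arc_minutes = fov_degrees * 60.0  # 60 arc minutes per degree
    return arc_minutes / arcmin_per_pixel


# Assumption: the eye's limit is taken as ~0.3 arc minutes per pixel.
at_eye_limit = pixels_for_field_of_view(110, 0.3)
# For comparison, a coarser 1 arc minute per pixel (an assumed relaxed target).
relaxed = pixels_for_field_of_view(110, 1.0)

print(f"{at_eye_limit:.0f} pixels across 110 degrees at 0.3 arcmin/pixel")
print(f"{relaxed:.0f} pixels across 110 degrees at 1.0 arcmin/pixel")
```

Under these assumptions the horizontal pixel count lands in the tens of thousands, which is why a single HD screen or camera falls so far short.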
Some of my favorite views:
How? Ultra-high resolution isn't a new idea. Today's newest movie projectors use resolutions up to 4096x2160, while IMAX film can supposedly resolve vertical lines at nearly twice this resolution. Red Digital Cinema sells digital cameras that can film at 4096x2160 (8 megapixels at 24fps), and has announced plans to sell a 28000x4000 camera (112 megapixels at 30 fps) later this year. My goal is to achieve somewhere around 100 megapixels at 60 fps, which is close to the resolution of the camera announced by red.com at double its frame rate. To reach this resolution, there are several challenges to solve for recording, editing, and display.
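To put the formats mentioned above side by side, this sketch compares them by raw pixel throughput (megapixels per second). The numbers come from the paragraph above; treating "100 megapixels" as exactly 100 million pixels is my simplification.

```python
def megapixels_per_second(width, height, fps):
    """Raw pixel throughput of a video format, in megapixels per second."""
    return width * height * fps / 1e6


formats = {
    "4K cinema (4096x2160 @ 24fps)": megapixels_per_second(4096, 2160, 24),
    "Announced Red (28000x4000 @ 30fps)": megapixels_per_second(28000, 4000, 30),
}

# Target: roughly 100 megapixels at 60 fps.
TARGET_MP_PER_SECOND = 100 * 60

for name, rate in formats.items():
    ratio = TARGET_MP_PER_SECOND / rate
    print(f"{name}: {rate:.0f} MP/s, target is {ratio:.1f}x that")
```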
Filming. At this time, I'm thinking the least expensive option is to buy a bunch of low-end HD video cameras, configure them so that each one films a part of the scene, and then stitch the video frames from each camera together to form a larger frame. 100 megapixels could be accomplished for about $30,000, which is considerably cheaper than an IMAX camera (up to $500k new) or a Red camera ($30k for 1/5th the resolution). Also, prices at the low end of cameras come down significantly faster than at the high end. I'm still looking at cameras, but my leading choices right now are the Aiptek Action HD GVS and the Sanyo VPC-FH1 HD.
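A quick sketch of what the multi-camera approach implies: how many 1920x1080 cameras it takes to reach 100 megapixels, and what the $30,000 figure leaves per camera. The 20% overlap between neighboring frames is an assumed value for stitching, not something I've measured.

```python
import math


def cameras_needed(target_megapixels, cam_width, cam_height, overlap_fraction):
    """Cameras required when each frame loses a fraction of its area to overlap."""
    usable_pixels = cam_width * cam_height * (1.0 - overlap_fraction)
    return math.ceil(target_megapixels * 1e6 / usable_pixels)


BUDGET_DOLLARS = 30_000  # figure quoted in the text above

for overlap in (0.0, 0.2):  # 0.2 is an assumed stitching overlap
    count = cameras_needed(100, 1920, 1080, overlap)
    per_camera = BUDGET_DOLLARS / count
    print(f"overlap {overlap:.0%}: {count} cameras, about ${per_camera:.0f} each")
```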
Aug. 30, 2009
Updating with some related links
A True Virtual Window, a thesis by Adrijan Silvester Radikovic
A PhD student describes how he implemented a similar idea. The main differences between his implementation and what I'm targeting are:
- He used a static image, I'm planning to do video at 60fps
- I'm planning higher resolutions than he used (100 megapixels versus 67 megapixels)
(100 megapixels at 60fps may be a pipe dream, but I'd like to try!)
- He used only one camera for head tracking, sampling at a fairly low frame rate; I'm planning to use 6+ sampling at 100fps
- He performed head tracking using the normal visible color spectrum; I'm planning to use infrared lights and sample images in the infrared spectrum.
- He used a pan and zoom camera for head tracking, I'm planning on using static cameras.
- He used only one video display and computer for rendering, I'm planning on using multiple displays and computers
Panoramic Video Textures
A method to create high-resolution looping video is presented using a single panned video camera.
This technique is very interesting, but is limited in the types of scenes that can be recorded and replayed.
Sept. 7, 2009
Results of some initial filming tests
Over Memorial Day weekend, I went to a few spots around San Francisco and filmed some scenery using Sony's HD Handycam (1920x1080). Since I only had one camera, I filmed different parts of a large scene at different times. This creates artifacts at the seams for moving objects, but it should give a rough feel for how the final results will look. I learned a few things using this borrowed camera:
- 30 fps looks good most of the time, but the limit is definitely noticeable for faster-moving objects. 60 would look a bit nicer.
- Interlaced video is unacceptable. You really need progressive frame captures to ensure you don't get ugly tearing artifacts.
- The more motion in the scene, the more interesting it is to watch. I found that getting close to the ocean and filming waves looks great compared to far-away shots; however, too close means depth perception won't match reality when projected onto a 2d surface.
- My quad-core Xeon doesn't have enough horsepower to decode four 1920x1080x30 MPEG-2 streams simultaneously. I need to investigate where the bottlenecks are coming from.
Anyway, I updated the video above to include some of my test footage, check it out (jump to 1:30 inside the video to see what I filmed today).
A month ago, I got the book
Practical Multi-Projector
Display Design and read it cover-to-cover with great interest. I
highly recommend the book as it was pretty easy to follow compared to the usual
thesis papers you get from PhD students. The basic idea is to
calculate geometry and luminance mappings from screen space to projector space
by projecting various patterns, taking a photo of the screen with a high
resolution camera, and then determining how the original patterns actually got
mapped. This mapping information can be expressed as triangle strips when
projecting onto a non-planar surface such as a curved wall. During rendering,
transforming triangle strips can be accomplished with very little overhead.
Luminance and color correction can also be specified at points in the triangle
mesh while the alpha component can be used to blend the edges of two displays
together to make it look seamless.
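To make the edge-blending idea concrete, here is a minimal sketch of the alpha ramp for two side-by-side projectors whose images overlap by some number of columns. The linear ramp and the gamma of 2.2 are my assumptions, not something taken from the book or from any vendor's software; the projector width and overlap are hypothetical values.

```python
def linear_weight(x, width, overlap, overlap_on_right):
    """Fraction of light (0..1) this projector contributes at column x."""
    if overlap_on_right:
        start = width - overlap
        if x < start:
            return 1.0
        return 1.0 - (x - start) / (overlap - 1)  # fade out toward the right edge
    if x >= overlap:
        return 1.0
    return x / (overlap - 1)  # fade in from the left edge


def blend_alpha(x, width, overlap, overlap_on_right, gamma=2.2):
    """Alpha to multiply into gamma-encoded pixel values at column x.

    Projected light is roughly value ** gamma, so scaling light by w
    means scaling the encoded value by w ** (1 / gamma).
    """
    return linear_weight(x, width, overlap, overlap_on_right) ** (1.0 / gamma)


WIDTH, OVERLAP = 1024, 128  # hypothetical projector width and overlap, in columns

# Column WIDTH - OVERLAP + i of the left projector lands on the same screen
# position as column i of the right projector; their light should sum to 1.
for i in (0, 32, 64, 127):
    left = linear_weight(WIDTH - OVERLAP + i, WIDTH, OVERLAP, True)
    right = linear_weight(i, WIDTH, OVERLAP, False)
    print(i, round(left, 3), round(right, 3), round(left + right, 6))
```

In a real setup the ramp would be stored per vertex in the triangle mesh rather than per column, and the measured luminance mapping would replace the fixed gamma.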
The company Scalable Display has
implemented many of the techniques outlined in Practical Multi-Projector Display
Design and using their software would be a great way to save time on the
project.
Coincidentally, I was contacted by Scalable Display and
recently met with the CEO and Founder of
Scalable Display (Andrew and Raj).
They expressed interest in doing some joint development on this project. I
got a demo of their system using 2 projectors, similar to
this YouTube demo.
They said they are working on a 100 megapixel project with the Navy, which was
pretty exciting to hear. I'd love to see that system in action.
Scalable Display's largest commercial market is military simulators.
I have 2 projectors now and am going to play around with edge blending.
For simple 2 projector edge blending, there are some easier solutions including
nVidia's PowerWall, Matrox
PJ-4OLP, and others. Many people with 2 projectors just want to
watch movies, and VLC has a plugin called "Panoramix" that will also do
projector blending.
To do some initial testing of the software, I need some high resolution footage.
My eventual plan is to create super high resolution video by stitching a lot of
individual video sequences together (just like how you create panoramic images).
However, there is no such video stitching application out there, which means
I'll have to write it myself. In order to skip that for the short term, I hired clai.tv
to come up to San Francisco and film some test 4k footage (4096x2048) using
their Red Cinema digital camera. Without their help, I'd have a hard
time working this camera. As you can see in the video below, the camera is pretty complex.
Below is a thumbnail from one frame captured by the camera,
click it to see the full resolution image capture.
Nov 28, 2009:
"Hiring" Open Source Developer for Video
Stitching project
If you have experience working on image stitching and
are interested in applying/extending this to process video, I'm willing to fund your
work. The project should take a set of movie files and perform the
following:
- Calculate feature descriptors for individual video streams and
automatically find overlaps in both space and time
- Produce a single high resolution output video (jpeg stills are sufficient)
- Support various projections and mappings
- Support image, gamma, and color blending
- Experiment with synthesizing "tween" frames in order to blend videos that are
not synchronized at the millisecond level
Ideally this work would extend an existing open source project
like Panotools.
If you are interested, please contact me at
jc@thisdomain.
Dec 13, 2009:
First attempt at head tracking using TrackIR
I was experimenting with head tracking using TrackIR from
Natural Point. This near infrared camera does a pretty good job for
the price. Its main limitation is a pretty limited capture
volume with one camera, though it appears you can use Natural Point's SDK to get
data from multiple cameras, so this could be used to increase the capture volume
you are working with. Natural Point's other software tools, which do
multi-camera calibration and multi-camera triangulation, are not available for
TrackIR; it appears they try hard to encourage non-gaming consumers to move to
the OptiTrack camera systems. Also, I don't think TrackIR supports a
sync signal, so capturing fast-moving objects will likely have higher errors (but
that's OK for my purposes).
The results of my test turned out pretty well, here is a video:
April 24, 2010:
Using a depth map to enhance 3d effect for video
footage
I found a depth map can be used to convert 2d video into 3d to
provide more realistic views as the user's head position changes.
Below is a demo using hand-drawn depth maps. I'm also investigating
automated depth map creation using feature point matching from 2 or more camera
views of the same scene. If that works, a depth-map per frame could be
generated which would allow for objects that move over larger z distances.
A multi camera view would also allow for better texture estimation for the area
behind occluded parts of the frame. Even without that it appears a
static hand drawn depth map would provide convincing results for motion that
occurs at approximately the same z distance.
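One simple way to think about the depth-map idea (not necessarily how the demo was implemented) is to treat the screen as a window: a point `depth` meters behind the window shifts sideways on the window plane by `head_offset * depth / (depth + viewer_distance)` when the viewer's head moves. The sketch below re-projects a single image row this way. All names, the nearest-pixel rounding, and the units are my assumptions.

```python
def window_shift(head_offset, viewer_distance, depth):
    """Sideways shift on the window plane of a point `depth` behind it,
    for a viewer `viewer_distance` in front who moved by `head_offset`."""
    return head_offset * depth / (depth + viewer_distance)


def shift_row(row_pixels, row_depths, head_offset, viewer_distance, pixels_per_meter):
    """Re-project one image row for a new head position."""
    out = [None] * len(row_pixels)
    # Paint far pixels first so that nearer pixels overwrite them.
    order = sorted(range(len(row_pixels)), key=lambda i: -row_depths[i])
    for i in order:
        shift_m = window_shift(head_offset, viewer_distance, row_depths[i])
        j = i + round(shift_m * pixels_per_meter)
        if 0 <= j < len(out):
            out[j] = row_pixels[i]
    return out  # None marks holes that the original view never saw


# 's' = distant sky (1000 m), 't' = nearby tree (2 m behind the window).
row = ["s", "s", "s", "t", "t", "s", "s", "s"]
depths = [1000.0] * 3 + [2.0] * 2 + [1000.0] * 3
print(shift_row(row, depths, head_offset=0.1, viewer_distance=2.0, pixels_per_meter=20))
```

The `None` entries are the areas that were hidden behind the nearer object in the original frame, which is exactly where texture from a second camera view would be needed.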
July 11, 2010:
Real-time head tracking
An important part of this project is real-time head tracking and
I've been exploring various options.
WiiMote. The first option I tried was the WiiMote system
popularized by Johnny Lee's
video and also demoed by a project with similar goals called
Winscape.
Although it is low-cost, when I
tested the WiiMote, I immediately found that it falls far short in real life. Its problems
include limited range and sample accuracy. You need to stand in a pretty
small "sweet spot" in order for it to work. Moving more than a few steps
from this spot will cause it to stop tracking altogether. Another
important issue is the accuracy of tracking: because of the limited resolution
of the camera tracking IR sources (128x96), the XYZ locations that can be
calculated are fairly "jittery". In order to keep the scene from jumping
around, you need to smooth the sample points (average them over 5-10 frames),
but this introduces a lot of latency. Further adding to the latency
problem, the WiiMote only samples at 40Hz. If you move your head or
body, the screen lags behind by half a second to a second, and this destroys the
illusion of the window being real. In YouTube videos, latency isn't
easy to observe and it's easy to control your position to stay in a specific
sweet spot. In short, the WiiMote is only good for YouTube videos. :)
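To illustrate the smoothing trade-off, here is a minimal moving-average filter and an estimate of the delay it adds by itself. The (N - 1) / 2 sample-period figure is the standard delay of an N-sample moving average; the class and function names are made up for this sketch, and the real lag I saw also includes capture, transfer, and rendering time.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` samples."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def update(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


def smoothing_delay_ms(window, sample_rate_hz):
    """Average age of the samples in the window: (N - 1) / 2 sample periods."""
    return (window - 1) / 2 * 1000.0 / sample_rate_hz


# Delay added by the filter alone at the WiiMote's 40Hz sample rate.
for window in (5, 10):
    print(f"{window} frames at 40Hz: {smoothing_delay_ms(window, 40):.1f} ms")

# A sudden head move (0 -> 3) takes a full window to show up in the output.
smoother = MovingAverage(3)
print([smoother.update(v) for v in (0, 0, 3, 3, 3)])
```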
The PS3 Move Controller would be an interesting option to investigate - it's
similar to the WiiMote, but samples at a higher rate and resolution.
Natural Point TrackIR. TrackIR5 also operates in a
similar fashion to the WiiMote, however its resolution and sample rate are way
better than the WiiMote's. TrackIR5's camera samples at 640x480 with a
frequency of 120Hz. In terms of raw data, that is 75 TIMES better than the
WiiMote, and with a price tag of $150 it's pretty affordable as well.
The downside with the TrackIR is that it also has limited range: you need to be
in a small sweet spot for it to work. Natural Point provides an SDK
to access data from the TrackIR directly and supports multiple TrackIR units on
a single computer. It would take a little work, but it's possible to
create a system that uses data from multiple TrackIR units to extend the range.
TrackIRs are nice because they are very compact and are both powered and operated over
standard USB cabling. One other downside of the TrackIR is the need to wear
IR reflectors.
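For reference, the "75 times" figure is simply sensor pixels multiplied by sample rate, using the WiiMote and TrackIR5 numbers quoted above:

```python
def samples_per_second(width, height, rate_hz):
    """Raw sensor samples per second: pixels per frame times frames per second."""
    return width * height * rate_hz


wiimote = samples_per_second(128, 96, 40)
trackir5 = samples_per_second(640, 480, 120)

print(f"TrackIR5 / WiiMote raw data ratio: {trackir5 / wiimote:.0f}x")
```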
Natural Point OptiTrack
Of the 3 solutions I've tested so far (Wii, TrackIR, and
OptiTrack), I like the OptiTrack the best. OptiTrack is the commercial-grade
version of TrackIR; it is designed up front to cover a large area and
support multiple cameras. Natural Point also provides software that
will automatically calibrate cameras and calculate 3d positions from multiple
views. The calibration works by waving a reflector wildly around the
room to collect some initial data to be analyzed; from there the position of
each camera can be calculated. Both TrackIR and OptiTrack cameras have
built-in hardware point detection, so they can provide your computer with a list
of the 2d points they see rather than a full 2d image. This reduces the amount of
data that needs to be transferred (which helps in lowering latency) and also reduces
the amount of CPU consumed on the host PC. The host PC doesn't need
to process a large set of pixels to find IR reflectors, so its workload is
pretty light. Both TrackIR and OptiTrack have a pretty small impact
on your CPU, so they could potentially be run on the same computer doing the
rendering. I found there is about a 5ms latency for obtaining
samples in the real world, which is pretty good (though not as good as TrackIR).
OptiTrack costs about $6,000 for 6 cameras and various supporting equipment and
cabling. The main things I don't like about OptiTrack are:
- You need at least 8 cameras to cover a 10x10 room well.
This leads to a lot of cabling.
- Camera positions are flexible, but to get good results they should be above
the head - making them a bit of an eyesore if you want the solution to feel like
a normal living space.
- Like all the other solutions, you need to wear at least one IR reflector that
the cameras can track.
ViCon products
I didn't test these, but they are worth mentioning.
ViCon provides IR tracking cameras similar to Natural Point's, but designed for
higher-end needs. From discussion forums on the net, it looks like
ViCon cameras are frequently used by commercial motion capture studios.
ViCon's cameras range from $3,200 to $20,000 per camera, so they are 5-30 TIMES
more expensive than those from Natural Point; however, they can achieve some
pretty impressive stats. ViCon cameras can have a 2-3ms latency per
frame and sample at up to 200Hz. Their high-end camera has a resolution of 16
megapixels (4096x4096), pretty impressive! They say you can
track objects in very large areas (football fields) with this camera.
Robotic Pan/Tilt/Zoom (PTZ) Video with face tracking
This is the area I'm currently exploring. The
concept here is to
use 2 or more cameras that can be programmatically controlled for pan, tilt, and
zoom to follow a subject and use face tracking to determine the position of a
user's head and eyes. Determining the position of the user's eyes
from two or more cameras should allow for the calculation of accurate 3d
positions. This was the solution I originally had in mind, but I put it
off because it's also the hardest to get working. The advantages of this
approach are many:
- Markerless tracking. The subject doesn't have to
wear any reflectors.
- Large coverage area with small number of cameras. The ability to
pan/tilt/zoom allows you to use resolution where you need it rather than trying
to cover the entire room all the time.
Some of the challenges include:
- To cover a room, you need to track the subject as he/she
moves by performing pan/tilt/zoom operations in the camera. There is an
ideal resolution and position you'd like to keep the subject's face at so that
future movements will stay on camera and there is enough resolution to
distinguish the face position accurately.
- The bandwidth and CPU required to perform face tracking are
pretty heavy. Face processing needs to be done on the PC, so the entire
image from the camera needs to be transferred into PC memory. For a
720p HD black & white video signal at 60fps, this means 55MB of data needs to be
transferred and processed every second. For a 3.2GHz processor, this
means you have roughly 58 cycles per pixel to perform transfer and processing
(more if you can split across cores). One key to fast face
processing is to reduce the amount of image data you need to process by keeping
the face size as small as needed and then intelligently skipping areas of the
image where the face is unlikely to be. The Sony EVI HD1 camera
supports video at 1080p @60hz, but this would result in 124MB of data
transferred and processed per second, which leaves roughly 26 cycles per pixel.
This is probably doable, but there isn't much CPU left over for 1) multiple
cameras and 2) the rendering subsystem.
- I haven't seen any PTZ cameras that support frame rates higher
than 60fps. I'm guessing latency will end up 10-15ms, but could be higher
if the capture and transfer of images takes additional time. Right
now, I'm trying to determine which PC capture cards can grab 720p
or 1080p with minimal latency.
- To accurately calculate 3d points from multiple 2d cameras,
you need to know exactly where each of the cameras is located and where it is looking.
If you use pan/tilt/zoom operations, you need to recalibrate these settings.
Doing this in real time could be tricky, but should be doable. My planned
approach is to use background feature points from previous frames to determine
precisely where the camera moved to.
- Face tracking itself is tricky. There are 3 potential
options that I plan to explore: CAMSHIFT, FaceAPI, and PittPatt.
CAMSHIFT is an algorithm provided as part of the
OpenCV computer vision library. I'm not expecting great results from it,
but it's worth giving a shot.
FaceAPI has a "free for non-commercial use license"
and charges $4k per developer plus royalties for commercial applications.
I think it's smart of FaceAPI to provide a free version for non-commercial
use, it helps get interest from the crowd that would otherwise use and
contribute to OpenCV. I have done some basic tests with FaceAPI
and found it can handle a 640x480@30fps video feed with approximately 12%
CPU load on my 3.2GHz quad-core. Further testing is needed to see
how it performs at higher resolutions and frame rates.
PittPatt charges $5k per developer per year with
royalties for commercial products; a 30-day free trial is
available. I haven't tested them yet, but have high hopes.
I'm concerned they are not as fast as FaceAPI and won't be able to handle
high resolution and frame rates. A representative I talked to on
the phone mentioned a top speed for one CPU core of 320x240 at 40fps.
Scaling up to 8 cores wouldn't be fast enough for
1080p@60.
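The bandwidth and CPU estimates in the face-tracking challenge above can be worked through in a few lines. This sketch assumes 1 byte per pixel (black & white), a 3.2GHz clock, and perfectly linear scaling across cores for the PittPatt figure, which is optimistic. Working the numbers through gives a budget of tens of cycles per pixel.

```python
def pixel_rate(width, height, fps):
    """Pixels delivered per second by a video stream."""
    return width * height * fps


def cycles_per_pixel(clock_hz, width, height, fps):
    """CPU cycles available per pixel on one core."""
    return clock_hz / pixel_rate(width, height, fps)


CLOCK_HZ = 3.2e9  # 3.2GHz processor from the text above

for name, (w, h) in {"720p": (1280, 720), "1080p": (1920, 1080)}.items():
    rate = pixel_rate(w, h, 60)
    budget = cycles_per_pixel(CLOCK_HZ, w, h, 60)
    print(f"{name}@60: {rate / 1e6:.1f} MB/s at 1 byte per pixel, "
          f"{budget:.0f} cycles per pixel")

# PittPatt figure quoted by the representative: 320x240 at 40fps on one core.
# Assumption: throughput scales linearly with core count.
pittpatt_8_cores = 8 * pixel_rate(320, 240, 40)
needed_1080p60 = pixel_rate(1920, 1080, 60)
print(f"8 cores cover {pittpatt_8_cores / needed_1080p60:.0%} of 1080p@60")
```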
A few PTZ cameras currently under consideration:
Camera | Max Rez | Fps | Video out | List Price | Width | Length | Height | Tilt (deg) | Pan (deg)
sony evi HD1 | 1080p | 59.94 | HD: HD-SDI, Analog Component (Y/Pb/Pr); SD: VBS, Y/C | $4,028 | 10.24 | 6 | 6.75 | 50 | 200
sony evi HD1 | 720p | 59.94 | DVI-I (Digital and Analog) | $3,499 | 9.8 | 5.98 | 5.31 | 50 | 200
sony BRCZ330 | 720p | 59.94 | BRBK-HD2: HD-SDI | $5,100 | 6.375 | 7.375 | 7.625 | 60 | 350
Sept 12, 2010:
Image feature point detection using SURF
One fairly simple method of tracking someone's head
without requiring the user to wear markers is:
1. Find one or more unique points in the image that might be
associated with a person's head.
2. Repeat step #1 for 2 or more cameras
3. Provided your cameras reside at a known location, you can triangulate
matching points from 2 or more images and obtain 3 dimensional positions.
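Step 3 can be sketched with the midpoint method: each camera contributes a ray from its position through the matched image point, and the 3d position is taken as the point halfway between the two rays where they pass closest to each other. This is only one of several triangulation methods, and the sketch assumes the ray directions have already been derived from the camera calibration; all names here are illustrative.

```python
def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def add(a, b):
    return [a[i] + b[i] for i in range(3)]


def scale(a, s):
    return [a[i] * s for i in range(3)]


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def triangulate_midpoint(origin1, dir1, origin2, dir2):
    """Point midway between the closest points of two camera rays."""
    w = sub(origin1, origin2)
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; cannot triangulate")
    s = (b * e - c * d) / denom  # distance parameter along ray 1
    t = (a * e - b * d) / denom  # distance parameter along ray 2
    p1 = add(origin1, scale(dir1, s))
    p2 = add(origin2, scale(dir2, t))
    return scale(add(p1, p2), 0.5)


# Two cameras 2 units apart, both seeing a point at (1, 1.5, 3).
cam1, cam2 = [0.0, 0.0, 0.0], [2.0, 0.0, 0.0]
ray1, ray2 = [1.0, 1.5, 3.0], [-1.0, 1.5, 3.0]
print(triangulate_midpoint(cam1, ray1, cam2, ray2))
```

With noisy 2d points the two rays will not actually intersect, which is why the midpoint (or a least-squares equivalent across more cameras) is used rather than an exact intersection.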
To find points inside an image seen by one camera that can be correlated with
points seen by another camera, there are two widely used algorithms - SIFT & SURF.
The SIFT (Scale-invariant
feature transform) algorithm, published in 1999, introduced a means of
detecting and describing features in an image. For the first time it was
possible to reliably detect distinct 2d points with associated 128-dimensional
vectors that represented qualities of the point. The vectors from one
image could be compared with vectors in another image to determine how similar
the points were. This soon became a fundamental tool used by
computer vision developers to perform a wide range of tasks including object
recognition, camera calibration, and structure from motion. The
SIFT algorithm had a few problems that kept it from being used more widely, the
biggest being its execution speed. On a modern processor, executing the
SIFT algorithm on a single image might take several seconds. As
well, because it produces high-dimensional vectors (128), comparing a large
number of vectors is also slow.
The SURF (Speeded Up
Robust Features) algorithm was first presented in 2006 and executes much more
quickly. Using SURF it's possible to process a 720p frame at approximately
2fps on a single-core Xeon 3.33GHz machine and 5fps on a 6-core machine.
Additionally, the SURF algorithm is patent free (unlike SIFT) and produces
vectors of only 64 dimensions while having the same or better accuracy in
identifying similar points across multiple images. 5 fps
is not exactly real-time, but luckily SURF can be accelerated on GPUs.
Using CUDA SURF (a CUDA
implementation of the OpenSURF project source), I can detect feature points for
a 720p image at a cost of approximately 12ms and compare ~1400 feature points to
find matches in about 10ms, allowing for 30fps execution speed. The
video below demonstrates this performance. Since I'd like to hit
60fps @720p (speed of camera and capture card), I need to keep the total
execution time under 16ms. My next plan is to use Delaunay
triangulation so I can quickly determine nearby neighbors for feature points and
reduce the number of point comparisons I need to do from approximately 2 million
down to 20,000.
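To show where the "2 million comparisons" comes from, here is a minimal brute-force descriptor matcher: every descriptor from one image is compared against every descriptor from the other. The nearest/second-nearest ratio test is a common filtering heuristic that I'm assuming here, not something taken from CUDA SURF, and the tiny 4-dimensional descriptors are made up for the example (SURF's are 64-dimensional).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_descriptors(desc_a, desc_b, ratio=0.7):
    """Brute-force matching: len(desc_a) * len(desc_b) comparisons.

    Keeps a match only when the best candidate is clearly closer than
    the second best. Requires at least two descriptors in desc_b.
    """
    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted((squared_distance(da, db), j) for j, db in enumerate(desc_b))
        best, second = ranked[0], ranked[1]
        # The ratio applies to distances; these are squared, so square it too.
        if best[0] < (ratio ** 2) * second[0]:
            matches.append((i, best[1]))
    return matches


# Made-up 4-dimensional descriptors from two camera views.
view_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
view_b = [[0.0, 0.9, 0.1, 0.0], [0.9, 0.0, 0.0, 0.1], [0.5, 0.5, 0.5, 0.5]]
print(match_descriptors(view_a, view_b))

# With ~1400 points per image, brute force needs this many comparisons:
print(1400 * 1400)
```

Restricting each point's candidates to its spatial neighbors, as planned with the Delaunay triangulation, would replace the inner loop over all of `desc_b` with a loop over a handful of nearby points.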
Some specs I used for this test:
- Sony EVI HD1 camera (described above) with SDI output
Camera is set to 1280x720p 24bit RGB output
- Blackmagic Decklink HD Extreme video capture card
Blackmagic's drivers have been very buggy, but with more recent updates they
seem to be working well. I couldn't find many other options for capturing
720p video. Earlier versions of Blackmagic's drivers would cause Skype to
crash on startup, and nVidia's CUDA libraries would fail to work entirely.
It took a while to isolate Blackmagic's drivers as the source of the problems.
- nVidia GF100 card
When playing a 3d game with this card, the fan noise is like a jet
engine - but the noise seems bearable when executing "CUDA SURF".
The GF100 card has 480 cores and runs through SURF detection about 10 times
faster than the same multi-core CPU based version.
- Xeon 6-core 3.33GHz. Since the GPU is doing the heavy lifting,
the CPU is only hitting 11% utilization on one core. I need to look
for more tasks for the CPU to do.